
Databricks Certified Associate Developer for Apache Spark 3.5 – Python Exam Practice Test

Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Question 1

39 of 55.

A Spark developer is developing a Spark application to monitor task performance across a cluster.

One requirement is to track the maximum processing time for tasks on each worker node and consolidate this information on the driver for further analysis.

Which technique should the developer use?

Options:

A.

Broadcast a variable to share the maximum time among workers.

B.

Configure the Spark UI to automatically collect maximum times.

C.

Use an RDD action like reduce() to compute the maximum time.

D.

Use an accumulator to record the maximum time on the driver.
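
For study purposes, a minimal sketch of the accumulator approach, assuming an existing SparkSession named spark and an RDD timings_rdd of per-task durations (both names are illustrative only):

from pyspark.accumulators import AccumulatorParam

class MaxAccumulatorParam(AccumulatorParam):
    """Custom accumulator that keeps the maximum value seen so far."""
    def zero(self, initial_value):
        return initial_value
    def addInPlace(self, v1, v2):
        return max(v1, v2)

# The accumulator lives on the driver; workers only add to it.
max_task_time = spark.sparkContext.accumulator(0.0, MaxAccumulatorParam())

def record(duration):
    max_task_time.add(duration)   # runs on the workers

timings_rdd.foreach(record)       # timings_rdd: assumed RDD of per-task durations
print(max_task_time.value)        # consolidated maximum, readable only on the driver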

Question 2

5 of 55.

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:

A.

A job contains multiple tasks, and each task contains multiple stages.

B.

A stage contains multiple jobs, and each job contains multiple tasks.

C.

A stage contains multiple tasks, and each task contains multiple jobs.

D.

A job contains multiple stages, and each stage contains multiple tasks.

Question 3

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

Options:

A.

By configuring the option checkpointLocation during readStream

B.

By configuring the option recoveryLocation during the SparkSession initialization

C.

By configuring the option recoveryLocation during writeStream

D.

By configuring the option checkpointLocation during writeStream
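
For reference, a minimal sketch of configuring a checkpoint location on the sink so the query can resume from where it left off (the DataFrame name and paths are assumed):

query = (events_df.writeStream
    .format("parquet")
    .option("path", "/data/output/events")
    .option("checkpointLocation", "/data/checkpoints/events")
    .outputMode("append")
    .start())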

Question 4

49 of 55.

In the code block below, aggDF contains aggregations on a streaming DataFrame:

aggDF.writeStream \
    .format("console") \
    .outputMode("???") \
    .start()

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

AGGREGATE

B.

COMPLETE

C.

REPLACE

D.

APPEND
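
For context, the console sink accepts the append, update, and complete output modes; complete rewrites the entire result table on every trigger, which is the behavior the question describes. A sketch, assuming aggDF is a streaming aggregation:

query = (aggDF.writeStream
    .format("console")
    .outputMode("complete")   # entire result table is written on each trigger
    .start())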

Question 5

Which command overwrites an existing JSON file when writing a DataFrame?

Options:

A.

df.write.mode("overwrite").json("path/to/file")

B.

df.write.overwrite.json("path/to/file")

C.

df.write.json("path/to/file", overwrite=True)

D.

df.write.format("json").save("path/to/file", mode="overwrite")

Question 6

Given this code:

.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "15 minutes"))
.count()

What happens to data that arrives after the watermark threshold?

Options:

A.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

B.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
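
As a study aid, a runnable-style sketch of the watermark pattern the question describes; the source DataFrame, sink, and checkpoint path are assumed names:

from pyspark.sql.functions import window

late_tolerant_counts = (events_stream
    .withWatermark("event_time", "10 minutes")      # data later than 10 minutes behind the max event_time seen is dropped
    .groupBy(window("event_time", "15 minutes"))
    .count())

query = (late_tolerant_counts.writeStream
    .outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/late_counts")
    .start())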

Question 7

A Spark engineer must select an appropriate deployment mode for the Spark jobs.

What is the benefit of using cluster mode in Apache Spark™?

Options:

A.

In cluster mode, resources are allocated from a resource manager on the cluster, enabling better performance and scalability for large jobs

B.

In cluster mode, the driver is responsible for executing all tasks locally without distributing them across the worker nodes.

C.

In cluster mode, the driver runs on the client machine, which can limit the application's ability to handle large datasets efficiently.

D.

In cluster mode, the driver program runs on one of the worker nodes, allowing the application to fully utilize the distributed resources of the cluster.

Question 8

A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

Options:

A.

df.orderBy(col("age").asc(), col("salary").asc()).show()

B.

df.sort("age", "salary", ascending=[True, True]).show()

C.

df.sort("age", "salary", ascending=[False, True]).show()

D.

df.orderBy("age", "salary", ascending=[True, False]).show()

Question 9

36 of 55.

What is the main advantage of partitioning the data when persisting tables?

Options:

A.

It compresses the data to save disk space.

B.

It automatically cleans up unused partitions to optimize storage.

C.

It ensures that data is loaded into memory all at once for faster query execution.

D.

It optimizes by reading only the relevant subset of data from fewer partitions.

Question 10

An engineer notices a significant increase in the job execution time during the execution of a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:

A.

Locate the executor logs on the Spark master node, typically under the /tmp directory.

B.

Use the command spark-submit with the --verbose flag to print the logs to the console.

C.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

D.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

Question 11

A data scientist has identified that some records in the user profile table contain null values in one or more fields, and such records should be removed from the dataset before processing. The schema includes fields like user_id, username, date_of_birth, created_ts, etc.

The schema of the user profile table looks like this:

Which block of Spark code can be used to achieve this requirement?

Options:

A.

filtered_df = users_raw_df.na.drop(thresh=0)

B.

filtered_df = users_raw_df.na.drop(how='all')

C.

filtered_df = users_raw_df.na.drop(how='any')

D.

filtered_df = users_raw_df.na.drop(how='all', thresh=None)

Question 12

23 of 55.

A data scientist is working with a massive dataset that exceeds the memory capacity of a single machine. The data scientist is considering using Apache Spark™ instead of traditional single-machine languages like standard Python scripts.

Which two advantages does Apache Spark™ offer over a normal single-machine language in this scenario? (Choose 2 answers)

Options:

A.

It can distribute data processing tasks across a cluster of machines, enabling horizontal scalability.

B.

It requires specialized hardware to run, making it unsuitable for commodity hardware clusters.

C.

It processes data solely on disk storage, reducing the need for memory resources.

D.

It eliminates the need to write any code, automatically handling all data processing.

E.

It has built-in fault tolerance, allowing it to recover seamlessly from node failures during computation.

Question 13

A data scientist is working with a Spark DataFrame called customerDF that contains customer information. The DataFrame has a column named email with customer email addresses. The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

Options:

A.

customerDF.select(
    col("email").substr(0, 5).alias("username"),
    col("email").substr(-5).alias("domain")
)

B.

customerDF.withColumn("username", split(col("email"), "@").getItem(0)) \

.withColumn("domain", split(col("email"), "@").getItem(1))

C.

customerDF.withColumn("username", substring_index(col("email"), "@", 1)) \

.withColumn("domain", substring_index(col("email"), "@", -1))

D.

customerDF.select(
    regexp_replace(col("email"), "@", "").alias("username"),
    regexp_replace(col("email"), "@", "").alias("domain")
)

Question 14

A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.

Which change should be made to solve the issue?

Options:

A.

Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges

B.

Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy

C.

Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges

D.

Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy
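
For reference, a sketch using PySpark's percentile_approx, which belongs to the same approximation family; a larger accuracy value tightens the relative error (roughly 1.0/accuracy) at the cost of more memory. The DataFrame sales_df and column price are assumed names:

from pyspark.sql import functions as F

quartiles_df = sales_df.agg(
    F.percentile_approx("price", [0.25, 0.5, 0.75], accuracy=100000).alias("price_quartiles")  # default accuracy is 10000
)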

Question 15

A Spark application is experiencing performance issues in client mode because the driver is resource-constrained.

How should this issue be resolved?

Options:

A.

Add more executor instances to the cluster

B.

Increase the driver memory on the client machine

C.

Switch the deployment mode to cluster mode

D.

Switch the deployment mode to local mode

Question 16

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:

A.

Replace .bucketBy() with .partitionBy("event_year", "event_month")

B.

Change the bucket count (42) to a lower number

C.

Add .sortBy() after .bucketBy()

D.

Replace .bucketBy() with .partitionBy("event_year") only

Question 17

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Increase the size of the dataset to create more partitions

D.

Enable dynamic resource allocation to scale resources as needed
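
For context, spark.sql.shuffle.partitions (default 200) controls how many tasks post-shuffle stages get; raising it creates more, smaller tasks so idle executors have work. A minimal sketch, where the target value is an assumption that should be sized to the cluster's total cores:

# e.g. a few times the total number of executor cores
spark.conf.set("spark.sql.shuffle.partitions", 800)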

Question 18

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

Options:

A.

The Spark engine requires manual intervention to start executing transformations.

B.

Only actions trigger the execution of the transformation pipeline.

C.

Transformations are executed immediately to build the lineage graph.

D.

The Spark engine optimizes the execution plan during the transformations, causing delays.

E.

Transformations are evaluated lazily.

Question 19

9 of 55.

Given the code fragment:

import pyspark.pandas as ps

pdf = ps.DataFrame(data)

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

Options:

A.

pdf.to_pandas()

B.

pdf.to_spark()

C.

pdf.to_dataframe()

D.

pdf.spark()

Question 20

Given this view definition:

df.createOrReplaceTempView("users_vw")

Which approach can be used to query the users_vw view after the session is terminated?

Options:

A.

Query the users_vw using Spark

B.

Persist the users_vw data as a table

C.

Recreate the users_vw and query the data using Spark

D.

Save the users_vw definition and query using Spark
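
For reference, a temporary view disappears with the session, while data persisted as a table remains queryable afterwards. A minimal sketch; the table name is assumed:

df.write.mode("overwrite").saveAsTable("users_tbl")   # survives beyond the current session
spark.sql("SELECT * FROM users_tbl").show()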

Question 21

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:

    Low number of Active Tasks

    Many tasks complete in milliseconds

    Fewer tasks than available CPUs

Which approach should be used to adjust the partitioning for optimal resource allocation?

Options:

A.

Set the number of partitions equal to the total number of CPUs in the cluster

B.

Set the number of partitions to a fixed value, such as 200

C.

Set the number of partitions equal to the number of nodes in the cluster

D.

Set the number of partitions by dividing the dataset size (1 TB) by a reasonable partition size, such as 128 MB
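
As a study aid, the arithmetic behind the size-based approach, using the numbers from the question:

dataset_bytes = 1 * 1024**4              # 1 TB
target_partition_bytes = 128 * 1024**2   # 128 MB per partition
num_partitions = dataset_bytes // target_partition_bytes   # = 8192 partitions
# 8192 partitions comfortably exceed the 10 nodes * 16 CPUs = 160 available cores,
# so every core stays busy while each task still processes a meaningful amount of data.
df = df.repartition(num_partitions)      # df is an assumed DataFrame name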

Question 22

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

Options:

A.

Optimize the data processing logic by repartitioning the DataFrame.

B.

Modify the Spark configuration to disable garbage collection

C.

Increase the memory allocated to the Spark Driver.

D.

Cache large DataFrames to persist them in memory.

Question 23

19 of 55.

A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function not available in the standard Spark functions library.

The existing UDF code is:

import hashlib
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = udf(shake_256, StringType())

The developer replaces this UDF with a Pandas UDF for better performance:

@pandas_udf(StringType())
def shake_256(raw: str) -> str:
    return hashlib.shake_256(raw.encode()).hexdigest(20)

However, the developer receives this error:

TypeError: Unsupported signature: (raw: str) -> str

What should the signature of the shake_256() function be changed to in order to fix this error?

Options:

A.

def shake_256(raw: str) -> str:

B.

def shake_256(raw: [pd.Series]) -> pd.Series:

C.

def shake_256(raw: pd.Series) -> pd.Series:

D.

def shake_256(raw: [str]) -> [str]:
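
For reference, a sketch of the Series-to-Series form that Pandas UDFs expect; the hashing logic is taken from the question's code:

import hashlib
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def shake_256(raw: pd.Series) -> pd.Series:
    # Receives a batch of values as a pandas Series and must return a Series of equal length.
    return raw.map(lambda v: hashlib.shake_256(v.encode()).hexdigest(20))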

Question 24

47 of 55.

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")
df2 = spark.read.csv("product_data.csv")
df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

Options:

A.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

B.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

C.

Shuffle join because no broadcast hints were provided.

D.

Broadcast join, as df2 is smaller than the default broadcast threshold.
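
For context, Spark automatically broadcasts a relation smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default); the hint can also be given explicitly. A minimal sketch:

from pyspark.sql.functions import broadcast

# Explicit hint; with an ~8 MB df2, Spark would typically choose a broadcast join on its own.
df_joined = df1.join(broadcast(df2), df1.product_id == df2.product_id)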

Question 25

14 of 55.

A developer created a DataFrame with columns color, fruit, and taste, and wrote the data to a Parquet directory using:

df.write.partitionBy("color", "taste").parquet("/path/to/output")

What is the result of this code?

Options:

A.

It appends new partitions to an existing Parquet file.

B.

It throws an error if there are null values in either partition column.

C.

It creates separate directories for each unique combination of color and taste.

D.

It stores all data in a single Parquet file.

Question 26

A data engineer needs to write a DataFrame df to a Parquet file, partitioned by the column country, and overwrite any existing data at the destination path.

Which code should the data engineer use to accomplish this task in Apache Spark?

Options:

A.

df.write.mode("overwrite").partitionBy("country").parquet("/data/output")

B.

df.write.mode("append").partitionBy("country").parquet("/data/output")

C.

df.write.mode("overwrite").parquet("/data/output")

D.

df.write.partitionBy("country").parquet("/data/output")

Question 27

34 of 55.

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often getting killed due to timeout errors, and there are several warnings about insufficient resources in the logs.

Which action should the engineer take to resolve the underutilization issue?

Options:

A.

Set the spark.network.timeout property to allow tasks more time to complete without being killed.

B.

Increase the executor memory allocation in the Spark configuration.

C.

Reduce the size of the data partitions to improve task scheduling.

D.

Increase the number of executor instances to handle more concurrent tasks.

Question 28

A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

Options:

A.

groupBy

B.

filter

C.

select

D.

coalesce

Question 29

A Spark DataFrame df is cached using the MEMORY_AND_DISK storage level, but the DataFrame is too large to fit entirely in memory.

What is the likely behavior when Spark runs out of memory to store the DataFrame?

Options:

A.

Spark duplicates the DataFrame in both memory and disk. If it doesn't fit in memory, the DataFrame is stored and retrieved from the disk entirely.

B.

Spark splits the DataFrame evenly between memory and disk, ensuring balanced storage utilization.

C.

Spark will store as much data as possible in memory and spill the rest to disk when memory is full, continuing processing with performance overhead.

D.

Spark stores the frequently accessed rows in memory and less frequently accessed rows on disk, utilizing both resources to offer balanced performance.
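
For reference, a minimal sketch of requesting this storage level explicitly; the DataFrame name is assumed:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)   # partitions that do not fit in memory spill to disk
df.count()                                 # an action materializes the cache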

Question 30

46 of 55.

A data engineer is implementing a streaming pipeline with watermarking to handle late-arriving records.

The engineer has written the following code:

inputStream \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "15 minutes"))

What happens to data that arrives after the watermark threshold?

Options:

A.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

B.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event time will be processed and included in the windowed aggregation.

Question 31

44 of 55.

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

Options:

A.

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

B.

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

C.

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

D.

query = df.writeStream \
    .outputMode("append") \
    .start()

Question 32

17 of 55.

A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to Spark 3.5 has improved the runtime of some scheduled Spark applications.

Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.

Which operation should AQE be implementing to automatically improve the Spark application performance?

Options:

A.

Dynamically switching join strategies

B.

Collecting persistent table statistics and storing them in the metastore for future use

C.

Improving the performance of single-stage Spark jobs

D.

Optimizing the layout of Delta files on disk
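
For context, AQE re-plans queries at runtime using shuffle statistics, for example switching a sort-merge join to a broadcast join when one side turns out to be small; it is enabled by default since Spark 3.2. A minimal configuration sketch:

spark.conf.set("spark.sql.adaptive.enabled", "true")   # on by default in Spark 3.2+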

Question 33

1 of 55.

A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.

The first attempt does read the text files, but each record contains a single line. This code is shown below:

txt_path = "/datasets/raw_txt/*"
df = spark.read.text(txt_path)                       # one row per line by default
df = df.withColumn("file_path", input_file_name())   # add full path

Which code change produces a DataFrame that meets the data scientist's requirements?

Options:

A.

Add the option wholetext to the text() function.

B.

Add the option lineSep to the text() function.

C.

Add the option wholetext=False to the text() function.

D.

Add the option lineSep=", " to the text() function.
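
For study purposes, a sketch of reading whole files as single records; the path is taken from the question:

from pyspark.sql.functions import input_file_name

df = (spark.read.option("wholetext", True).text("/datasets/raw_txt/*")
          .withColumn("file_path", input_file_name()))   # one row per file plus its full path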

Question 34

38 of 55.

A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.

The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:

    Reads directly from /data/input.json.

    Infers the schema automatically.

    Merges differing schemas.

Which code snippet should the engineer use?

Options:

A.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeSchema 'true');

B.

CREATE TABLE users
USING json
OPTIONS (path '/data/input.json');

C.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', inferSchema 'true');

D.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeAll 'true');

Question 35

10 of 55.

What is the benefit of using Pandas API on Spark for data transformations?

Options:

A.

It executes queries faster using all the available cores in the cluster as well as provides Pandas's rich set of features.

B.

It is available only with Python, thereby reducing the learning curve.

C.

It runs on a single node only, utilizing memory efficiently.

D.

It computes results immediately using eager execution.

Question 36

4 of 55.

A developer is working on a Spark application that processes a large dataset using SQL queries. Despite having a large cluster, the developer notices that the job is underutilizing the available resources. Executors remain idle for most of the time, and logs reveal that the number of tasks per stage is very low. The developer suspects that this is causing suboptimal cluster performance.

Which action should the developer take to improve cluster utilization?

Options:

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Enable dynamic resource allocation to scale resources as needed

D.

Increase the size of the dataset to create more partitions

Question 37

40 of 55.

A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.

The original code:

from pyspark.sql import functions as F

min_price = 110.50
result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))

Which code block should the developer use to refactor the code?

Options:

A.

result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))

B.

result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()

C.

result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))

D.

result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()

Question 38

A data engineer needs to write a Streaming DataFrame as Parquet files.

Given the code:

Which code fragment should be inserted to meet the requirement?

Options:

A.

.format("parquet")

.option("location", "path/to/destination/dir")

B.


.option("format", "parquet")

.option("destination", "path/to/destination/dir")

C.

.option("format", "parquet")

.option("location", "path/to/destination/dir")

D.

.format("parquet")

.option("path", "path/to/destination/dir")

Question 39

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

Options:

A.

A Cartesian join

B.

A shuffled hash join

C.

A broadcast nested loop join

D.

A sort-merge join

Question 40

The following code fragment results in an error:

Which code fragment should be used instead?


Options: