
Databricks Certified Associate Developer for Apache Spark 3.0 (Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0) Exam Practice Test

Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Question 1

Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier is renamed to feature1?

Options:

A.

itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)

B.

1.itemsDf.withColumnRenamed("attributes", "feature0")

2.itemsDf.withColumnRenamed("supplier", "feature1")

C.

itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

D.

itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")

E.

itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
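For reference, withColumnRenamed takes the existing and the new column name as strings and returns a new DataFrame, so calls can be chained. A minimal runnable sketch with made-up toy data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.createDataFrame([(1, "blue", "Sports Ltd")], ["itemId", "attributes", "supplier"])  # toy stand-in
renamed = itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
print(renamed.columns)  # ['itemId', 'feature0', 'feature1']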

Question 2

Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?

Options:

A.

DataFrame.repartition(12)

B.

DataFrame.coalesce(6).shuffle()

C.

DataFrame.coalesce(6)

D.

DataFrame.coalesce(6, shuffle=True)

E.

DataFrame.repartition(6)
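For reference, repartition always performs a full shuffle, while coalesce only merges existing partitions locally and, in the PySpark DataFrame API, takes no shuffle argument. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).repartition(12)  # start out with 12 partitions
df6 = df.repartition(6)                # reduce to 6 partitions via a full shuffle
print(df6.rdd.getNumPartitions())      # 6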

Question 3

The code block displayed below contains an error. When executed, the code block should divide DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.

Code block:

transactionsDf.coalesce(14, ("storeId", "transactionDate"))

Options:

A.

The parentheses around the column names need to be removed and .select() needs to be appended to the code block.

B.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.

(Correct)

C.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.

D.

Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.

E.

Operator coalesce needs to be replaced by repartition.
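For reference, repartition accepts a target partition count followed by the partitioning columns as separate arguments, not wrapped in a tuple. A minimal sketch with a made-up toy transactionsDf:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(25, "2020-04-26"), (3, "2020-04-13")], ["storeId", "transactionDate"])  # toy stand-in
repartitioned = transactionsDf.repartition(14, "storeId", "transactionDate")
print(repartitioned.rdd.getNumPartitions())  # 14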

Question 4

The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the answer that correctly fills the blanks in the code block to accomplish this.

1.from pyspark import StorageLevel

2.transactionsDf.__1__(StorageLevel.__2__).__3__

Options:

A.

1. cache

2. MEMORY_ONLY_2

3. count()

B.

1. persist

2. DISK_ONLY_2

3. count()

C.

1. persist

2. MEMORY_ONLY_2

3. select()

D.

1. cache

2. DISK_ONLY_2

3. count()

E.

1. persist

2. MEMORY_ONLY_2

3. count()
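For reference, persist accepts an explicit StorageLevel (cache does not), and since caching is lazy, an action such as count() is needed to materialize it. A minimal sketch with toy data:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.range(10).toDF("transactionId")  # toy stand-in
transactionsDf.persist(StorageLevel.MEMORY_ONLY_2)      # memory only, replicated to two executors
transactionsDf.count()                                  # action triggers the actual caching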

Question 5

Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?

Options:

A.

spark.mode("parquet").read("/FileStore/imports.parquet")

B.

spark.read.path("/FileStore/imports.parquet", source="parquet")

C.

spark.read().parquet("/FileStore/imports.parquet")

D.

spark.read.parquet("/FileStore/imports.parquet")

E.

spark.read().format('parquet').open("/FileStore/imports.parquet")
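For reference, spark.read is a property returning a DataFrameReader (no parentheses), which exposes a parquet method. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/FileStore/imports.parquet")  # assumes the file exists at this path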

Question 6

Which of the following statements about RDDs is incorrect?

Options:

A.

An RDD consists of a single partition.

B.

The high-level DataFrame API is built on top of the low-level RDD API.

C.

RDDs are immutable.

D.

RDD stands for Resilient Distributed Dataset.

E.

RDDs are great for precisely instructing Spark on how to do a query.

Question 7

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

Options:

A.

1. select

2. "storeId"

3. print_schema()

B.

1. limit

2. 1

3. columns

C.

1. select

2. "storeId"

3. printSchema()

D.

1. limit

2. "storeId"

3. printSchema()

E.

1. select

2. storeId

3. dtypes
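For reference, selecting a single column and calling printSchema() on the result prints that column's data type. A minimal sketch with toy data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(25, 3)], ["storeId", "predError"])  # toy stand-in
transactionsDf.select("storeId").printSchema()
# root
#  |-- storeId: long (nullable = true)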

Question 8

The code block shown below should write DataFrame transactionsDf as a parquet file to path storeDir, using brotli compression and replacing any previously existing file. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__.format("parquet").__2__(__3__).option(__4__, "brotli").__5__(storeDir)

Options:

A.

1. save

2. mode

3. "ignore"

4. "compression"

5. path

B.

1. store

2. with

3. "replacement"

4. "compression"

5. path

C.

1. write

2. mode

3. "overwrite"

4. "compression"

5. save

(Correct)

D.

1. save

2. mode

3. "replace"

4. "compression"

5. path

E.

1. write

2. mode

3. "overwrite"

4. compression

5. parquet
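For reference, the writer is obtained via write, the "overwrite" save mode replaces existing files, compression is set through option("compression", ...), and save(path) triggers the write. A minimal sketch (the path is made up, and the brotli codec must be available in the cluster):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.range(10).toDF("transactionId")  # toy stand-in
storeDir = "/tmp/transactions.parquet"                  # hypothetical path
transactionsDf.write.format("parquet").mode("overwrite").option("compression", "brotli").save(storeDir)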

Question 9

Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?

Options:

A.

spark.read.json(filePath)

B.

spark.read.path(filePath, source="json")

C.

spark.read().path(filePath)

D.

spark.read().json(filePath)

E.

spark.read.path(filePath)
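For reference, the JSON counterpart to the parquet reader is the json method on the same DataFrameReader; the path below is a made-up example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
filePath = "/tmp/imports.json"  # hypothetical path
df = spark.read.json(filePath)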

Question 10

Which of the following statements about the differences between actions and transformations is correct?

Options:

A.

Actions are evaluated lazily, while transformations are not evaluated lazily.

B.

Actions generate RDDs, while transformations do not.

C.

Actions do not send results to the driver, while transformations do.

D.

Actions can be queued for delayed execution, while transformations can only be processed immediately.

E.

Actions can trigger Adaptive Query Execution, while transformations cannot.
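For reference, transformations are evaluated lazily and only build up a query plan, while actions trigger execution and return results to the driver. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)
doubled = df.selectExpr("id * 2 AS doubled")  # transformation: nothing executes yet
print(doubled.count())                        # action: triggers execution, returns 1000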

Question 11

The code block shown below should return a copy of DataFrame transactionsDf without columns value and productId and with an additional column associateId that has the value 5. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, __3__).__4__(__5__, 'value')

Options:

A.

1. withColumn

2. 'associateId'

3. 5

4. remove

5. 'productId'

B.

1. withNewColumn

2. associateId

3. lit(5)

4. drop

5. productId

C.

1. withColumn

2. 'associateId'

3. lit(5)

4. drop

5. 'productId'

D.

1. withColumnRenamed

2. 'associateId'

3. 5

4. drop

5. 'productId'

E.

1. withColumn

2. col(associateId)

3. lit(5)

4. drop

5. col(productId)
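For reference, withColumn adds a column from a Column expression such as lit(5), and drop accepts several column names as plain strings. A minimal sketch with toy data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 0.5, 3)], ["productId", "value", "predError"])  # toy stand-in
result = transactionsDf.withColumn("associateId", lit(5)).drop("productId", "value")
print(result.columns)  # ['predError', 'associateId']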

Question 12

Which of the following is not a feature of Adaptive Query Execution?

Options:

A.

Replace a sort merge join with a broadcast join, where appropriate.

B.

Coalesce partitions to accelerate data processing.

C.

Split skewed partitions into smaller partitions to avoid differences in partition processing time.

D.

Reroute a query in case of an executor failure.

E.

Collect runtime statistics during query execution.
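For reference, Spark 3.0 exposes the documented AQE features through configuration flags. A minimal sketch using the flag names from the Spark 3.0 configuration documentation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # master switch; collects runtime statistics and replans joins
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # coalesce shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions into smaller ones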

Question 13

The code block displayed below contains an error. The code block should return a new DataFrame that only contains rows from DataFrame transactionsDf in which the value in column predError is at least 5. Find the error.

Code block:

transactionsDf.where("col(predError) >= 5")

Options:

A.

The argument to the where method should be "predError >= 5".

B.

Instead of where(), filter() should be used.

C.

The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").

D.

The argument to the where method cannot be a string.

E.

Instead of >=, the SQL operator GEQ should be used.
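For reference, where accepts either a SQL expression string that references columns by bare name or a Column condition; col(...) belongs to the Python API and is not valid inside a SQL string. A minimal sketch with toy data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(3,), (7,)], ["predError"])  # toy stand-in
transactionsDf.where("predError >= 5").show()       # SQL expression string
transactionsDf.where(col("predError") >= 5).show()  # equivalent Column condition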

Question 14

Which of the following describes tasks?

Options:

A.

A task is a command sent from the driver to the executors in response to a transformation.

B.

Tasks transform jobs into DAGs.

C.

A task is a collection of slots.

D.

A task is a collection of rows.

E.

Tasks get assigned to the executors by the driver.

Question 15

Which of the following code blocks uses a schema fileSchema to read a parquet file at location filePath into a DataFrame?

Options:

A.

spark.read.schema(fileSchema).format("parquet").load(filePath)

B.

spark.read.schema("fileSchema").format("parquet").load(filePath)

C.

spark.read().schema(fileSchema).parquet(filePath)

D.

spark.read().schema(fileSchema).format(parquet).load(filePath)

E.

spark.read.schema(fileSchema).open(filePath)
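For reference, schema expects a StructType object (not a string naming it), and a parquet file can then be loaded with format("parquet").load(...) or the parquet(...) shortcut. A minimal sketch with a made-up schema and path:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
fileSchema = StructType([StructField("value", IntegerType(), True)])  # hypothetical schema
filePath = "/tmp/imports.parquet"                                     # hypothetical path
df = spark.read.schema(fileSchema).format("parquet").load(filePath)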

Question 16

Which of the following DataFrame operators is never classified as a wide transformation?

Options:

A.

DataFrame.sort()

B.

DataFrame.aggregate()

C.

DataFrame.repartition()

D.

DataFrame.select()

E.

DataFrame.join()

Question 17

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

Options:

A.

The number of rows cannot be determined with the count() operator.

B.

Instead of filter, the select method should be used.

C.

The method used on column predError is incorrect.

D.

Instead of a list, the values need to be passed as single arguments to the in operator.

E.

Numbers 3 and 6 need to be passed as string variables.
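For reference, membership tests on a column use Column.isin, which accepts a list or the values as separate arguments; Python's in keyword does not work on Column objects. A minimal sketch with toy data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(3,), (4,), (6,)], ["predError"])  # toy stand-in
print(transactionsDf.filter(col("predError").isin([3, 6])).count())       # 2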

Question 18

Which of the following statements about garbage collection in Spark is incorrect?

Options:

A.

Garbage collection information can be accessed in the Spark UI's stage detail view.

B.

Optimizing garbage collection performance in Spark may limit caching ability.

C.

Manually persisting RDDs in Spark prevents them from being garbage collected.

D.

In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.

E.

Serialized caching is a strategy to increase the performance of garbage collection.

Question 19

Which of the following code blocks returns DataFrame transactionsDf sorted in descending order by column predError, showing missing values last?

Options:

A.

transactionsDf.sort(asc_nulls_last("predError"))

B.

transactionsDf.orderBy("predError").desc_nulls_last()

C.

transactionsDf.sort("predError", ascending=False)

D.

transactionsDf.desc_nulls_last("predError")

E.

transactionsDf.orderBy("predError").asc_nulls_last()
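For reference, Spark places nulls last by default when sorting in descending order, and pyspark.sql.functions also offers an explicit desc_nulls_last. A minimal sketch with toy data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc_nulls_last

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(3,), (None,), (6,)], ["predError"])  # toy stand-in
transactionsDf.sort("predError", ascending=False).show()  # descending; nulls last by default
transactionsDf.sort(desc_nulls_last("predError")).show()  # explicit equivalent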

Question 20

Which of the following statements about storage levels is incorrect?

Options:

A.

The cache operator on DataFrames is evaluated like a transformation.

B.

In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.

C.

Caching can be undone using the DataFrame.unpersist() operator.

D.

MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.

E.

DISK_ONLY will not use the worker node's memory.

Question 21

Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms, where column season is of data type string and column wind_speed_ms is of data type double?

Options:

A.

spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})

B.

spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

C.

1. from pyspark.sql import types as T

2. spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))

D.

spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

E.

spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
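For reference, createDataFrame infers string for Python str values and double for Python float values when given a list of tuples plus column names. A minimal runnable sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
df.printSchema()
# root
#  |-- season: string (nullable = true)
#  |-- wind_speed_ms: double (nullable = true)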

Question 22

Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?

Options:

A.

1.spark.read.schema(

2. StructType(

3. StructField("transactionId", IntegerType(), True),

4. StructField("predError", IntegerType(), True)

5. )).load(filePath)

B.

1.spark.read.schema([

2. StructField("transactionId", NumberType(), True),

3. StructField("predError", IntegerType(), True)

4. ]).load(filePath)

C.

1.spark.read.schema(

2. StructType([

3. StructField("transactionId", StringType(), True),

4. StructField("predError", IntegerType(), True)]

5. )).parquet(filePath)

D.

1.spark.read.schema(

2. StructType([

3. StructField("transactionId", IntegerType(), True),

4. StructField("predError", IntegerType(), True)]

5. )).format("parquet").load(filePath)

E.

1.spark.read.schema([

2. StructField("transactionId", IntegerType(), True),

3. StructField("predError", IntegerType(), True)

4. ]).load(filePath, format="parquet")
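For reference, the StructField list must be wrapped in a StructType, and pyspark.sql.types has no NumberType; IntegerType fits whole numbers. A minimal sketch with a made-up path:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
filePath = "/tmp/transactions.parquet"  # hypothetical path
schema = StructType([
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True),
])
df = spark.read.schema(schema).format("parquet").load(filePath)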

Question 23

The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__)

Options:

A.

1. filter

2. "transactionId", "predError", "value", "f"

B.

1. select

2. "transactionId, predError, value, f"

C.

1. select

2. ["transactionId", "predError", "value", "f"]

D.

1. where

2. col("transactionId"), col("predError"), col("value"), col("f")

E.

1. select

2. col(["transactionId", "predError", "value", "f"])
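For reference, select accepts either a Python list of column names or the names as separate arguments; a single comma-separated string would be treated as one (nonexistent) column name. A minimal sketch with toy data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 3, 0.5, 7)], ["transactionId", "predError", "value", "f"])  # toy stand-in
transactionsDf.select(["transactionId", "predError", "value", "f"]).show()
transactionsDf.select("transactionId", "predError", "value", "f").show()  # equivalent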

Question 24

Which of the following is the idea behind dynamic partition pruning in Spark?

Options:

A.

Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.

B.

Dynamic partition pruning concatenates columns of similar data types to optimize join performance.

C.

Dynamic partition pruning performs wide transformations on disk instead of in memory.

D.

Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.

E.

Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.

Question 25

The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__(__4__)

Options:

A.

1. filter

2. "storeId"==25

3. collect

4. 5

B.

1. filter

2. col("storeId")==25

3. toLocalIterator

4. 5

C.

1. select

2. storeId==25

3. head

4. 5

D.

1. filter

2. col("storeId")==25

3. take

4. 5

E.

1. filter

2. col("storeId")==25

3. collect

4. 5
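For reference, filter with a Column condition narrows the rows, and take(5) returns at most 5 rows to the driver as a Python list; collect takes no argument and returns all rows. A minimal sketch with toy data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(25,), (25,), (3,)], ["storeId"])  # toy stand-in
rows = transactionsDf.filter(col("storeId") == 25).take(5)                 # list of up to 5 Row objects
print(rows)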

Question 26

Which of the following code blocks generally causes a large amount of network traffic?

Options:

A.

DataFrame.select()

B.

DataFrame.coalesce()

C.

DataFrame.collect()

D.

DataFrame.rdd.map()

E.

DataFrame.count()

Question 27

The code block displayed below contains an error. The code block should produce a DataFrame with color as the only column and three rows with color values of red, blue, and green, respectively. Find the error.

Code block:

spark.createDataFrame([("red",), ("blue",), ("green",)], "color")

Options:

A.

The commas in the tuples with the colors should be eliminated.

B.

The colors red, blue, and green should be expressed as a simple Python list, and not a list of tuples.

C.

Instead of color, a data type should be specified.

D.

The "color" expression needs to be wrapped in brackets, so it reads ["color"].