Given a CSV file with the content:
And the following code:
from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

spark.read.schema(schema).csv(path).collect()
What is the resulting output?
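For reference, here is a minimal, self-contained sketch of the setup described above, assuming a local SparkSession. The CSV contents below are hypothetical, since the question's actual file contents are not reproduced here.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical CSV contents; the question's actual file is not shown here.
path = "/tmp/people_demo.csv"
with open(path, "w") as f:
    f.write("Alice,25\nBob,abc\n")

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

# Under the default PERMISSIVE mode, a value that cannot be cast to the
# declared type (e.g. "abc" as IntegerType) comes back as null.
rows = spark.read.schema(schema).csv(path).collect()
print(rows)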
An engineer has two DataFrames: df1 (small) and df2 (large). A broadcast join is used:
from pyspark.sql.functions import broadcast
result = df2.join(broadcast(df1), on='id', how='inner')
What is the purpose of using broadcast() in this scenario?
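A minimal runnable sketch of the scenario, with two hypothetical DataFrames standing in for df1 (small) and df2 (large):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins: df1 is the small lookup side, df2 the large fact side.
df1 = spark.createDataFrame([(1, "gold"), (2, "silver")], ["id", "tier"])
df2 = spark.createDataFrame([(1, 100.0), (1, 250.0), (2, 75.0)], ["id", "amount"])

# broadcast() hints Spark to ship the small DataFrame to every executor,
# so the join can run as a broadcast hash join without shuffling df2.
result = df2.join(broadcast(df1), on="id", how="inner")
result.show()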
A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:
Low number of Active Tasks
Many tasks complete in milliseconds
Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?
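As an illustration of the kind of adjustment being asked about (the partition counts below are rule-of-thumb assumptions, not the question's answer key):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the 1 TB dataset described above.
df = spark.range(1_000_000)

# The cluster described has 10 nodes x 16 CPUs = 160 cores. Raising the
# partition count gives at least one task per core (320 = 2x cores is an assumption).
df = df.repartition(320)
print(df.rdd.getNumPartitions())

# For shuffle stages, the corresponding setting is:
spark.conf.set("spark.sql.shuffle.partitions", "320")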
A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:
id | name       | count | timestamp
-----------------------------------------
1  | Delhi      | 20    | 2024-09-19T10:11
1  | Delhi      | 50    | 2024-09-19T10:12
2  | London     | 50    | 2024-09-19T10:15
3  | Paris      | 30    | 2024-09-19T10:18
3  | Paris      | 20    | 2024-09-19T10:20
4  | Washington | 10    | 2024-09-19T10:22
Which operation is supported with streaming_df?
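A sketch of how a streaming DataFrame with this shape might be constructed and used; the file source and path are assumptions for illustration only.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("count", IntegerType()),
    StructField("timestamp", TimestampType()),
])

# Hypothetical file source; any supported streaming source would do.
streaming_df = spark.readStream.schema(schema).json("/data/stream_input")

# Aggregations such as groupBy().agg() are supported on streaming DataFrames;
# some batch-only operations (e.g. sorting without an aggregation) are not.
agg_df = streaming_df.groupBy("name").agg(F.sum("count").alias("total"))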
Which feature of Spark Connect should be considered when designing an application that needs to interact remotely with a Spark cluster?
A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:
The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.
Which code fragment meets the requirements?
A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before processing the data further, the data scientist wants to create another DataFrame, df_user_non_pii, that contains only the non-PII columns. The PII columns in df_user are first_name, last_name, email, and birthdate.
Which code snippet can be used to meet this requirement?
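One straightforward way to express this, shown as a sketch with a hypothetical df_user:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical df_user with both PII and non-PII columns.
df_user = spark.createDataFrame(
    [(1, "Ana", "Lee", "ana@example.com", "1990-01-01", "NY")],
    ["user_id", "first_name", "last_name", "email", "birthdate", "state"],
)

# drop() returns a new DataFrame without the listed columns.
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
df_user_non_pii.show()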
An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:
from pyspark.sql.functions import broadcast
df_result = df2.join(broadcast(df1), on="id", how="inner")
What is the purpose of using broadcast() in this scenario?
A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.
Which operation is AQE implementing to improve performance?
A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.
The DataFrame has columns:
id | Name | count | timestamp
---------------------------------
1 | USA | 10
2 | India | 20
3 | England | 50
4 | India | 50
5 | France | 20
6 | India | 10
7 | USA | 30
8 | USA | 40
Which code fragment should the engineer use to sort the data in the Name and count columns?
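For reference, one way to express a descending sort on count (with Name as a secondary key); the sample rows are a stand-in for df1:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(1, "USA", 10), (3, "England", 50), (4, "India", 50), (7, "USA", 30)],
    ["id", "Name", "count"],
)

# orderBy with desc() puts the highest count first; ties fall back to Name.
df1.orderBy(F.desc("count"), F.asc("Name")).show()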
A developer wants to test Spark Connect with an existing Spark application.
What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)
A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.
How can this be achieved?
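For context, restart-and-resume behavior in Structured Streaming is normally tied to a checkpoint location configured on the sink. A sketch (the source, sink path, and checkpoint path below are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal always-available source; the real pipeline's source is unspecified.
stream = spark.readStream.format("rate").load()

query = (stream.writeStream
         .format("parquet")
         .option("path", "/tmp/out_demo")                # hypothetical sink path
         .option("checkpointLocation", "/tmp/chk_demo")  # state used to resume after failure or shutdown
         .outputMode("append")
         .start())
query.stop()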
A data scientist at a large e-commerce company needs to process and analyze 2 TB of daily customer transaction data. The company wants to implement real-time fraud detection and personalized product recommendations.
Currently, the company uses a traditional relational database system, which struggles with the increasing data volume and velocity.
Which feature of Apache Spark effectively addresses this challenge?
In the code block below, aggDF contains aggregations on a streaming DataFrame:
Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?
A developer initializes a SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Analytics Application") \
    .getOrCreate()
Which statement describes the spark SparkSession?
A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.
Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).
The current code:
from pyspark.sql import functions as F
final = df.withColumn("event_year", F.year("event_ts")) \
.withColumn("event_month", F.month("event_ts")) \
.bucketBy(42, ["event_year", "event_month"]) \
.saveAsTable("events.liveLatest")
However, consumers report poor query performance.
Which change will enable efficient querying by year and month?
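For comparison, the commonly suggested alternative is to partition the output on the derived year/month columns rather than bucketing it, so filters on year and month can prune partitions. A sketch, with a stand-in DataFrame and a hypothetical demo table name to avoid touching the real one:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the transformed DataFrame.
df = spark.createDataFrame([("2024-09-19 10:11:00",)], ["event_ts"]) \
          .withColumn("event_ts", F.to_timestamp("event_ts"))

(df.withColumn("event_year", F.year("event_ts"))
   .withColumn("event_month", F.month("event_ts"))
   .write
   .partitionBy("event_year", "event_month")   # directory-level partitioning enables partition pruning
   .mode("overwrite")
   .saveAsTable("events_liveLatest_demo"))     # hypothetical table name for this sketch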
A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data:
Which operation is supported with streaming_df?
Which components of Apache Spark’s architecture are responsible for carrying out tasks assigned to them?
A developer has been asked to debug an issue with a Spark application. The developer identified that the data being loaded from a CSV file is being read incorrectly into a DataFrame.
The CSV file has been read using the following Spark SQL statement:
CREATE TABLE locations
USING csv
OPTIONS (path '/data/locations.csv')
The first lines of output from SELECT * FROM locations look like this:
| city | lat | long |
| ALTI Sydney | -33... | ... |
Which parameter can the developer add to the OPTIONS clause in the CREATE TABLE statement to read the CSV data correctly again?
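For reference, CSV reader options can be passed straight through the OPTIONS clause; which one applies depends on how the file is malformed. The fragment below is an illustration (demo table name and both options shown are assumptions, not the answer key):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE locations_demo          -- hypothetical table name for this sketch
    USING csv
    OPTIONS (
        path '/data/locations.csv',      -- path from the question
        header 'true',                   -- treat the first line as column names
        sep ','                          -- explicit field delimiter
    )
""")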
Which feature of Spark Connect should be considered when designing an application that plans to enable remote interaction with a Spark cluster?
A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.
Which combination of Apache Spark modules should the data scientist use in this scenario?
A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.
Which change should be made to solve the issue?
A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:
fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))
The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.
Which change should be made to the code to stop these customer records from being dropped?
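For reference, keeping unmatched fact rows is a matter of the join type; a sketch with hypothetical stand-in DataFrames:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

purch_df = spark.createDataFrame([(1, 9.99), (2, 5.00), (99, 1.50)], ["customer_id", "amount"])
cust_df = spark.createDataFrame([(1, "Ana"), (2, "Bo")], ["custid", "name"])

# A left (outer) join keeps purchases whose customer_id has no match in cust_df.
fact_df = purch_df.join(cust_df, F.col("customer_id") == F.col("custid"), how="left")
fact_df.show()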
A developer created a DataFrame with columns color, fruit, and taste, and wrote the data to a Parquet directory using:
df.write.partitionBy("color", "taste").parquet("/path/to/output")
What is the result of this code?
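A sketch of what such a write produces on disk, using hypothetical sample data and output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("red", "apple", "sweet"), ("green", "apple", "tart")],
    ["color", "fruit", "taste"],
)

df.write.mode("overwrite").partitionBy("color", "taste").parquet("/tmp/output_demo")

# Resulting layout (one directory level per partition column, in the order given):
# /tmp/output_demo/color=red/taste=sweet/part-*.parquet
# /tmp/output_demo/color=green/taste=tart/part-*.parquet
# The partition columns are encoded in the directory paths, not stored in the Parquet files.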
A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:
region_id | region_name
------------------------
10        | North
12        | East
14        | West
The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values.
Which code fragment meets the requirements?
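One way such a fragment is commonly written, shown as a sketch with the table contents recreated inline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

regions = spark.createDataFrame(
    [(10, "North"), (12, "East"), (14, "West")],
    ["region_id", "region_name"],
)

# Take the 3 smallest region_id values, then build the dictionary on the driver.
rows = regions.orderBy("region_id").limit(3).collect()
region_map = {row["region_id"]: row["region_name"] for row in rows}
print(region_map)   # {10: 'North', 12: 'East', 14: 'West'}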
A Spark application suffers from too many small tasks due to excessive partitioning. How can this be fixed without a full shuffle?
How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?
What is the benefit of using Pandas API on Spark for data transformations?
A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?
A)
Use the applyInPandas API
B)
C)
D)
A data engineer is working on a num_df DataFrame and has a Python UDF defined as:
def cube_func(val):
    return val * val * val
Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?
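For reference, a Python UDF can be registered for Spark SQL with spark.udf.register; a sketch with a hypothetical num_df (the LongType return type is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

num_df = spark.createDataFrame([(1,), (2,), (3,)], ["val"])

def cube_func(val):
    return val * val * val

# Register for use from SQL / selectExpr; the return type here is an assumption.
spark.udf.register("cube_func", cube_func, LongType())

num_df.createOrReplaceTempView("num_table")
spark.sql("SELECT val, cube_func(val) AS cubed FROM num_table").show()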
Given this view definition:
df.createOrReplaceTempView("users_vw")
Which approach can be used to query the users_vw view after the session is terminated?
A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.
The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:
Reads directly from /data/input.json.
Infers the schema automatically.
Merges differing schemas.
Which code snippet should the engineer use?
What is the relationship between jobs, stages, and tasks during execution in Apache Spark?
Given a DataFrame df that has 10 partitions, after running the code:
result = df.coalesce(20)
How many partitions will the result DataFrame have?
A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.
Which operation results in a shuffle?