
Databricks Certified Associate Developer for Apache Spark 3.5 – Python: Questions and Answers

Question # 4

Given a CSV file with the content:

And the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

spark.read.schema(schema).csv(path).collect()

What is the resulting output?

A.

[Row(name='bambi'), Row(name='alladin', age=20)]

B.

[Row(name='alladin', age=20)]

C.

[Row(name='bambi', age=None), Row(name='alladin', age=20)]

D.

The code throws an error due to a schema mismatch.
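For reference, the CSV content itself is not reproduced above. The sketch below is a hedged illustration (the file contents and path are assumptions based on the answer options) of Spark's default PERMISSIVE read mode, which sets fields that cannot be cast to the declared type to null instead of failing the read.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical file contents matching the answer options:
#   bambi,hello        <- age cannot be parsed as an integer
#   alladin,20
path = "/tmp/people.csv"  # assumed path, for illustration only

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

# Default PERMISSIVE mode: the unparseable age becomes None and both rows are returned.
print(spark.read.schema(schema).csv(path).collect())
# e.g. [Row(name='bambi', age=None), Row(name='alladin', age=20)]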

Question # 5

A developer runs:

What is the result?

Options:

A.

It stores all data in a single Parquet file.

B.

It throws an error if there are null values in either partition column.

C.

It appends new partitions to an existing Parquet file.

D.

It creates separate directories for each unique combination of color and fruit.
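The code block the question refers to is not reproduced above. As a hedged sketch (the DataFrame and output path are assumptions), a partitioned Parquet write creates one directory per unique combination of the partition column values:

df.write.partitionBy("color", "fruit").parquet("/tmp/fruit_output")

# Illustrative directory layout:
# /tmp/fruit_output/color=red/fruit=apple/part-0000-....parquet
# /tmp/fruit_output/color=green/fruit=kiwi/part-0000-....parquet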

Question # 6

An engineer has two DataFrames: df1 (small) and df2 (large). A broadcast join is used:


from pyspark.sql.functions import broadcast

result = df2.join(broadcast(df1), on='id', how='inner')

What is the purpose of using broadcast() in this scenario?

Options:

A.

It filters the id values before performing the join.

B.

It increases the partition size for df1 and df2.

C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

D.

It ensures that the join happens only when the id values are identical.
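A quick way to see the effect (a hedged sketch, not part of the question) is to inspect the physical plan: broadcasting the small DataFrame replaces a shuffle-heavy sort-merge join with a broadcast hash join, so only df1 is copied to every executor.

from pyspark.sql.functions import broadcast

result = df2.join(broadcast(df1), on='id', how='inner')
result.explain()  # the plan should show BroadcastHashJoin rather than SortMergeJoin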

Question # 7

A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:

Low number of Active Tasks

Many tasks complete in milliseconds

Fewer tasks than available CPUs

Which approach should be used to adjust the partitioning for optimal resource allocation?

A.

Set the number of partitions equal to the total number of CPUs in the cluster

B.

Set the number of partitions to a fixed value, such as 200

C.

Set the number of partitions equal to the number of nodes in the cluster

D.

Set the number of partitions by dividing the dataset size (1 TB) by a reasonable partition size, such as 128 MB
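As a rough illustration of the arithmetic behind the size-based approach in option D (the 128 MB target comes from the option itself, not a universal rule):

dataset_size = 1 * 1024**4          # 1 TB in bytes
target_partition = 128 * 1024**2    # 128 MB in bytes
num_partitions = dataset_size // target_partition
print(num_partitions)               # 8192 partitions, well above the 10 nodes * 16 CPUs = 160 cores

# df = df.repartition(num_partitions)   # hypothetical application to the input DataFrame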

Question # 8

A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:

id | name | count | timestamp
1 | Delhi | 20 | 2024-09-19T10:11
1 | Delhi | 50 | 2024-09-19T10:12
2 | London | 50 | 2024-09-19T10:15
3 | Paris | 30 | 2024-09-19T10:18
3 | Paris | 20 | 2024-09-19T10:20
4 | Washington | 10 | 2024-09-19T10:22

Which operation is supported with streaming_df?

A.

streaming_df.count()

B.

streaming_df.filter("count < 30")

C.

streaming_df.select(countDistinct("name"))

D.

streaming_df.show()
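For context, filter is an incremental, row-wise transformation that Structured Streaming supports, while count(), show(), and a global countDistinct() cannot be applied directly to an unbounded stream. A hedged sketch of how the filtered stream could be written out (sink and checkpoint path are assumptions):

filtered_df = streaming_df.filter("count < 30")

query = (filtered_df.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/filtered")
         .start())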

Question # 9

Which feature of Spark Connect is considered when designing an application to enable remote interaction with the Spark cluster?

A.

It provides a way to run Spark applications remotely in any programming language

B.

It can be used to interact with any remote cluster using the REST API

C.

It allows for remote execution of Spark jobs

D.

It is primarily used for data ingestion into Spark from external sources

Question # 10

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?

A.

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
)

B.

regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
)

C.

regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
)

D.

regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
)

Question # 11

A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before further processing the data, the data scientist wants to create another DataFrame, df_user_non_pii, containing only the non-PII columns. The PII columns in df_user are first_name, last_name, email, and birthdate.

Which code snippet can be used to meet this requirement?

A.

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

B.

df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")

C.

df_user_non_pii = df_user.dropfields("first_name", "last_name", "email", "birthdate")

D.

df_user_non_pii = df_user.dropfields("first_name, last_name, email, birthdate")

Question # 12

An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:

from pyspark.sql.functions import broadcast

df_result = df2.join(broadcast(df1), on="id", how="inner")

What is the purpose of using broadcast() in this scenario?

A.

It increases the partition size for df1 and df2.

B.

It ensures that the join happens only when the id values are identical.

C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

D.

It filters the id values before performing the join.

Question # 13

A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.

Which operation is AQE implementing to improve performance?

A.

Dynamically switching join strategies

B.

Collecting persistent table statistics and storing them in the metastore for future use

C.

Improving the performance of single-stage Spark jobs

D.

Optimizing the layout of Delta files on disk
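AQE re-optimizes the query plan at stage boundaries using runtime statistics, which is what lets it switch, for example, a sort-merge join to a broadcast join mid-query. A minimal sketch of the relevant settings (enabled by default in recent Spark releases):

spark.conf.set("spark.sql.adaptive.enabled", "true")            # turn Adaptive Query Execution on
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")   # runtime handling of skewed joins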

Question # 14

A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.

The DataFrame has columns:

id | Name | count | timestamp
---------------------------------
1 | USA | 10
2 | India | 20
3 | England | 50
4 | India | 50
5 | France | 20
6 | India | 10
7 | USA | 30
8 | USA | 40

Which code fragment should the engineer use to sort the data in the Name and count columns?

A.

df1.orderBy(col("count").desc(), col("Name").asc())

B.

df1.sort("Name", "count")

C.

df1.orderBy("Name", "count")

D.

df1.orderBy(col("Name").desc(), col("count").asc())
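Note that option A assumes col has been imported; a hedged usage sketch with the ordering it produces on the data above:

from pyspark.sql.functions import col

df1.orderBy(col("count").desc(), col("Name").asc()).show()
# England (50) and India (50) first, then USA (40), USA (30), France (20), India (20), ...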

Question # 15

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

A.

Execute their pyspark shell with the option --remote "https://localhost"

B.

Execute their pyspark shell with the option --remote "sc://localhost"

C.

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D.

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E.

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code
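For reference, Spark Connect clients address the server with the sc:// URI scheme, and an unmodified PySpark shell can be pointed at a local Spark Connect server entirely from the command line. A hedged sketch, with the shell commands shown as comments:

# pyspark --remote "sc://localhost"
# SPARK_REMOTE="sc://localhost" pyspark
#
# In either case the resulting `spark` session is a Spark Connect client:
spark.range(3).show()   # executed through the Spark Connect server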

Question # 16

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

A.

By configuring the option checkpointLocation during readStream

B.

By configuring the option recoveryLocation during the SparkSession initialization

C.

By configuring the option recoveryLocation during writeStream

D.

By configuring the option checkpointLocation during writeStream
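A minimal sketch of how such a recoverable query is typically wired up, assuming the streaming_df source and hypothetical sink and checkpoint paths:

query = (streaming_df.writeStream
         .format("parquet")
         .option("path", "/data/output/events")
         .option("checkpointLocation", "/data/checkpoints/events")   # enables restart from last recorded progress
         .start())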

Question # 17

A data scientist at a large e-commerce company needs to process and analyze 2 TB of daily customer transaction data. The company wants to implement real-time fraud detection and personalized product recommendations.

Currently, the company uses a traditional relational database system, which struggles with the increasing data volume and velocity.

Which feature of Apache Spark effectively addresses this challenge?

A.

Ability to process small datasets efficiently

B.

In-memory computation and parallel processing capabilities

C.

Support for SQL queries on structured data

D.

Built-in machine learning libraries

Question # 18

In the code block below, aggDF contains aggregations on a streaming DataFrame:

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

A.

complete

B.

append

C.

replace

D.

aggregate
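The code block referenced by the question is not reproduced above. A hedged sketch of an aggregated streaming query written to the console, with the output mode set on the write side:

query = (aggDF.writeStream
         .outputMode("complete")   # the entire result table is emitted on every trigger
         .format("console")
         .start())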

Question # 19

A developer initializes a SparkSession:

spark = SparkSession.builder \
    .appName("Analytics Application") \
    .getOrCreate()

Which statement describes the spark SparkSession?

A.

The getOrCreate() method explicitly destroys any existing SparkSession and creates a new one.

B.

A SparkSession is unique for each appName, and calling getOrCreate() with the same name will return an existing SparkSession once it has been created.

C.

If a SparkSession already exists, this code will return the existing session instead of creating a new one.

D.

A new SparkSession is created every time the getOrCreate() method is invoked.

Question # 20

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

A.

Replace .bucketBy() with .partitionBy("event_year", "event_month")

B.

Change the bucket count (42) to a lower number

C.

Add .sortBy() after .bucketBy()

D.

Replace .bucketBy() with .partitionBy("event_year") only
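A hedged sketch of the partition-based rewrite described in option A (the write mode is an assumption): partitioning by year and month lets queries that filter on those columns prune entire directories instead of scanning the whole table.

from pyspark.sql import functions as F

(df.withColumn("event_year", F.year("event_ts"))
   .withColumn("event_month", F.month("event_ts"))
   .write
   .mode("overwrite")
   .partitionBy("event_year", "event_month")
   .saveAsTable("events.liveLatest"))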

Question # 21

A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data:

Which operation is supported with streaming_df?

A.

streaming_df.select(countDistinct("Name"))

B.

streaming_df.groupby("Id").count()

C.

streaming_df.orderBy("timestamp").limit(4)

D.

streaming_df.filter(col("count") < 30).show()

Question # 22

Which components of Apache Spark’s Architecture are responsible for carrying out tasks when assigned to them?

A.

Driver Nodes

B.

Executors

C.

CPU Cores

D.

Worker Nodes

Question # 23

What is the benefit of Adaptive Query Execution (AQE)?

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

Question # 24

A developer has been asked to debug an issue with a Spark application. The developer identified that the data being loaded from a CSV file is being read incorrectly into a DataFrame.

The CSV file has been read using the following Spark SQL statement:

CREATE TABLE locations
USING csv
OPTIONS (path '/data/locations.csv')

The first lines of the output of SELECT * FROM locations look like this:

| city | lat | long |
| ALTI Sydney | -33... | ... |

Which parameter can the developer add to the OPTIONS clause in the CREATE TABLE statement to read the CSV data correctly again?

A.

'header' 'true'

B.

'header' 'false'

C.

'sep' ','

D.

'sep' '|'
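For reference, CSV reader options are supplied as additional key/value pairs inside the OPTIONS clause. A hedged sketch showing both options mentioned in the answers (which value is appropriate depends on the actual file, which is not shown here), wrapped in spark.sql for consistency with the other examples:

spark.sql("""
    CREATE TABLE locations
    USING csv
    OPTIONS (path '/data/locations.csv', header 'true', sep ',')
""")   # assumes the original table definition has been dropped first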

Question # 25

Which feature of Spark Connect should be considered when designing an application that plans to enable remote interaction with a Spark cluster?

A.

It is primarily used for data ingestion into Spark from external sources.

B.

It provides a way to run Spark applications remotely in any programming language.

C.

It can be used to interact with any remote cluster using the REST API.

D.

It allows for remote execution of Spark jobs.

Question # 26

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

A.

Spark DataFrames, Structured Streaming, and GraphX

B.

Spark SQL, Pandas API on Spark, and Structured Streaming

C.

Spark Streaming, GraphX, and Pandas API on Spark

D.

Spark DataFrames, Spark SQL, and MLlib

Question # 27

A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.

Which change should be made to solve the issue?

A.

Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges

B.

Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy

C.

Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges

D.

Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy

Question # 28

A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))

The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.

Which change should be made to the code to stop these customer records from being dropped?

A.

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'left')

B.

fact_df = cust_df.join(purch_df, F.col('customer_id') == F.col('custid'))

C.

fact_df = purch_df.join(cust_df, F.col('cust_id') == F.col('customer_id'))

D.

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'right_outer')

Question # 29

A developer created a DataFrame with columns color, fruit, and taste, and wrote the data to a Parquet directory using:

df.write.partitionBy("color", "taste").parquet("/path/to/output")

What is the result of this code?

A.

It appends new partitions to an existing Parquet file.

B.

It throws an error if there are null values in either partition column.

C.

It creates separate directories for each unique combination of color and taste.

D.

It stores all data in a single Parquet file.

Question # 30

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

region_id | region_name
10 | North
12 | East
14 | West

The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values.

Which code fragment meets the requirements?

A.

regions_dict = dict(regions.take(3))

B.

regions_dict = regions.select("region_id", "region_name").take(3)

C.

regions_dict = dict(regions.select("region_id", "region_name").rdd.collect())

D.

regions_dict = dict(regions.orderBy("region_id").limit(3).rdd.map(lambda x: (x.region_id, x.region_name)).collect())
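A hedged end-to-end sketch of the order-then-limit approach from option D, assuming the regions table shown above has been loaded into a DataFrame called regions:

regions = spark.read.parquet("/path/to/regions")   # hypothetical table location

regions_dict = dict(
    regions.orderBy("region_id")
           .limit(3)
           .rdd.map(lambda x: (x.region_id, x.region_name))
           .collect()
)
# e.g. {10: 'North', 12: 'East', 14: 'West'}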

Question # 31

A Spark application suffers from too many small tasks due to excessive partitioning. How can this be fixed without a full shuffle?

Options:

A.

Use the distinct() transformation to combine similar partitions

B.

Use the coalesce() transformation with a lower number of partitions

C.

Use the sortBy() transformation to reorganize the data

D.

Use the repartition() transformation with a lower number of partitions
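For context, coalesce only merges existing partitions through a narrow dependency, so it avoids the full shuffle that repartition performs. A minimal sketch:

print(df.rdd.getNumPartitions())          # e.g. 400 small partitions
df_merged = df.coalesce(50)               # narrow dependency: no full shuffle
df_reshuffled = df.repartition(50)        # wide dependency: triggers a full shuffle
print(df_merged.rdd.getNumPartitions())   # 50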

Question # 32

How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?

Options:

A.

Configure the application to run in cluster mode instead of local mode.

B.

Increase the number of local threads based on the number of CPU cores.

C.

Use the spark.dynamicAllocation.enabled property to scale resources dynamically.

D.

Set the spark.executor.memory property to a large value.
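A hedged sketch of how local-mode parallelism is controlled through the master URL (the application name is arbitrary):

from pyspark.sql import SparkSession

# local[*] starts one worker thread per available CPU core;
# local[8] would pin the thread count explicitly.
spark = (SparkSession.builder
         .appName("local-testing")
         .master("local[*]")
         .getOrCreate())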

Question # 33

What is the benefit of using Pandas API on Spark for data transformations?

A.

It executes queries faster using all the available cores in the cluster as well as provides Pandas's rich set of features.

B.

It is available only with Python, thereby reducing the learning curve.

C.

It runs on a single node only, utilizing memory efficiently.

D.

It computes results immediately using eager execution.

Question # 34

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?


A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()
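For option A to run, mean_func must accept and return a pandas DataFrame whose columns match the declared schema. A hedged sketch of a compatible definition (the column names are assumptions taken from the schema string in the option):

import pandas as pd

def mean_func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives all rows for one user_id as a pandas DataFrame and
    # returns a single aggregated row for that group.
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "value": [pdf["value"].mean()]})

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()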

Question # 35

A data engineer is working on a num_df DataFrame and has a Python UDF defined as:

def cube_func(val):
    return val * val * val

Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

A.

spark.udf.register("cube_func", cube_func)
num_df.selectExpr("cube_func(num)").show()

B.

num_df.select(cube_func("num")).show()

C.

spark.createDataFrame(cube_func("num")).show()

D.

num_df.register("cube_func").select("num").show()
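For reference, a hedged sketch contrasting the SQL registration used in option A with the udf() wrapping that a plain DataFrame-API call would require instead (the return type is an assumption):

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# SQL-style: register once, then call the function inside SQL expressions.
spark.udf.register("cube_func", cube_func)
num_df.selectExpr("cube_func(num)").show()

# DataFrame-style alternative: wrap the Python function with udf() first.
cube_udf = udf(cube_func, LongType())
num_df.select(cube_udf("num")).show()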

Question # 36

Given this view definition:

df.createOrReplaceTempView("users_vw")

Which approach can be used to query the users_vw view after the session is terminated?

Options:

A.

Query the users_vw using Spark

B.

Persist the users_vw data as a table

C.

Recreate the users_vw and query the data using Spark

D.

Save the users_vw definition and query using Spark

Question # 37

A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.

The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:

    Reads directly from /data/input.json.

    Infers the schema automatically.

    Merges differing schemas.

Which code snippet should the engineer use?

A.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeSchema 'true');

B.

CREATE TABLE users
USING json
OPTIONS (path '/data/input.json');

C.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', inferSchema 'true');

D.

CREATE EXTERNAL TABLE users
USING json
OPTIONS (path '/data/input.json', mergeAll 'true');

Question # 38

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

A.

A job contains multiple tasks, and each task contains multiple stages.

B.

A stage contains multiple jobs, and each job contains multiple tasks.

C.

A stage contains multiple tasks, and each task contains multiple jobs.

D.

A job contains multiple stages, and each stage contains multiple tasks.

Question # 39

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

A.

10

B.

Same number as the cluster executors

C.

1

D.

20
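A quick way to verify the behavior (a hedged sketch): coalesce can only reduce the number of partitions, so asking for more than currently exist leaves the DataFrame unchanged, whereas repartition shuffles to create them.

df = spark.range(1000).repartition(10)

print(df.coalesce(20).rdd.getNumPartitions())      # 10: coalesce cannot increase the partition count
print(df.repartition(20).rdd.getNumPartitions())   # 20: repartition shuffles to add partitions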

Question # 40

A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

A.

groupBy

B.

filter

C.

select

D.

coalesce
