Given a CSV file with the content:
And the following code:
from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

spark.read.schema(schema).csv(path).collect()
What is the resulting output?
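For reference, here is a minimal, self-contained sketch of the setup described above, assuming a local SparkSession. The CSV contents below are hypothetical, since the question's actual file contents are not reproduced here.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical CSV contents; the question's actual file is not shown here.
path = "/tmp/people_demo.csv"
with open(path, "w") as f:
    f.write("Alice,25\nBob,abc\n")

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

# Under the default PERMISSIVE mode, a value that cannot be cast to the
# declared type (e.g. "abc" as IntegerType) comes back as null.
rows = spark.read.schema(schema).csv(path).collect()
print(rows)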
An engineer has two DataFrames: df1 (small) and df2 (large). A broadcast join is used:
from pyspark.sql.functions import broadcast
result = df2.join(broadcast(df1), on='id', how='inner')
What is the purpose of using broadcast() in this scenario?
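A minimal runnable sketch of the scenario, with two hypothetical DataFrames standing in for df1 (small) and df2 (large):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins: df1 is the small lookup side, df2 the large fact side.
df1 = spark.createDataFrame([(1, "gold"), (2, "silver")], ["id", "tier"])
df2 = spark.createDataFrame([(1, 100.0), (1, 250.0), (2, 75.0)], ["id", "amount"])

# broadcast() hints Spark to ship the small DataFrame to every executor,
# so the join can run as a broadcast hash join without shuffling df2.
result = df2.join(broadcast(df1), on="id", how="inner")
result.show()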
A data engineer is running a Spark job to process a dataset of 1 TB stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. Spark UI shows:
Low number of Active Tasks
Many tasks complete in milliseconds
Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?
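As an illustration of the kind of adjustment being asked about (the partition counts below are rule-of-thumb assumptions, not the question's answer key):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the 1 TB dataset described above.
df = spark.range(1_000_000)

# The cluster described has 10 nodes x 16 CPUs = 160 cores. Raising the
# partition count gives at least one task per core (320 = 2x cores is an assumption).
df = df.repartition(320)
print(df.rdd.getNumPartitions())

# For shuffle stages, the corresponding setting is:
spark.conf.set("spark.sql.shuffle.partitions", "320")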
A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:
id | name       | count | timestamp
-----------------------------------------
1  | Delhi      | 20    | 2024-09-19T10:11
1  | Delhi      | 50    | 2024-09-19T10:12
2  | London     | 50    | 2024-09-19T10:15
3  | Paris      | 30    | 2024-09-19T10:18
3  | Paris      | 20    | 2024-09-19T10:20
4  | Washington | 10    | 2024-09-19T10:22
Which operation is supported with streaming_df?
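A sketch of how a streaming DataFrame with this shape might be constructed and used; the file source and path are assumptions for illustration only.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("count", IntegerType()),
    StructField("timestamp", TimestampType()),
])

# Hypothetical file source; any supported streaming source would do.
streaming_df = spark.readStream.schema(schema).json("/data/stream_input")

# Aggregations such as groupBy().agg() are supported on streaming DataFrames;
# some batch-only operations (e.g. sorting without an aggregation) are not.
agg_df = streaming_df.groupBy("name").agg(F.sum("count").alias("total"))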
Which feature of Spark Connect should be considered when designing an application that needs to interact remotely with a Spark cluster?
A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:
The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.
Which code fragment meets the requirements?
A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user. Before processing the data further, the data scientist wants to create another DataFrame, df_user_non_pii, that contains only the non-PII columns. The PII columns in df_user are first_name, last_name, email, and birthdate.
Which code snippet can be used to meet this requirement?
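One straightforward way to express this, shown as a sketch with a hypothetical df_user:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical df_user with both PII and non-PII columns.
df_user = spark.createDataFrame(
    [(1, "Ana", "Lee", "ana@example.com", "1990-01-01", "NY")],
    ["user_id", "first_name", "last_name", "email", "birthdate", "state"],
)

# drop() returns a new DataFrame without the listed columns.
df_user_non_pii = df_user.drop("first_name", "last_name", "email", "birthdate")
df_user_non_pii.show()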
An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:
from pyspark.sql.functions import broadcast
df_result = df2.join(broadcast(df1), on="id", how="inner")
What is the purpose of using broadcast() in this scenario?
A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.
Which operation is AQE implementing to improve performance?
A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.
The DataFrame has columns:
id | Name | count | timestamp
---------------------------------
1 | USA | 10
2 | India | 20
3 | England | 50
4 | India | 50
5 | France | 20
6 | India | 10
7 | USA | 30
8 | USA | 40
Which code fragment should the engineer use to sort the data in the Name and count columns?
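For reference, one way to express a descending sort on count (with Name as a secondary key); the sample rows are a stand-in for df1:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(1, "USA", 10), (3, "England", 50), (4, "India", 50), (7, "USA", 30)],
    ["id", "Name", "count"],
)

# orderBy with desc() puts the highest count first; ties fall back to Name.
df1.orderBy(F.desc("count"), F.asc("Name")).show()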
A developer wants to test Spark Connect with an existing Spark application.
What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)
A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.
How can this be achieved?
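For context, restart-and-resume behavior in Structured Streaming is normally tied to a checkpoint location configured on the sink. A sketch (the source, sink path, and checkpoint path below are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Minimal always-available source; the real pipeline's source is unspecified.
stream = spark.readStream.format("rate").load()

query = (stream.writeStream
         .format("parquet")
         .option("path", "/tmp/out_demo")                # hypothetical sink path
         .option("checkpointLocation", "/tmp/chk_demo")  # state used to resume after failure or shutdown
         .outputMode("append")
         .start())
query.stop()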
A data scientist at a large e-commerce company needs to process and analyze 2 TB of daily customer transaction data. The company wants to implement real-time fraud detection and personalized product recommendations.
Currently, the company uses a traditional relational database system, which struggles with the increasing data volume and velocity.
Which feature of Apache Spark effectively addresses this challenge?
In the code block below, aggDF contains aggregations on a streaming DataFrame:
Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?
A developer initializes a SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Analytics Application") \
    .getOrCreate()
Which statement describes the spark SparkSession?
A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.
Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).
The current code:
from pyspark.sql import functions as F
final = df.withColumn("event_year", F.year("event_ts")) \
.withColumn("event_month", F.month("event_ts")) \
.bucketBy(42, ["event_year", "event_month"]) \
.saveAsTable("events.liveLatest")
However, consumers report poor query performance.
Which change will enable efficient querying by year and month?
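For comparison, the commonly suggested alternative is to partition the output on the derived year/month columns rather than bucketing it, so filters on year and month can prune partitions. A sketch, with a stand-in DataFrame and a hypothetical demo table name to avoid touching the real one:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the transformed DataFrame.
df = spark.createDataFrame([("2024-09-19 10:11:00",)], ["event_ts"]) \
          .withColumn("event_ts", F.to_timestamp("event_ts"))

(df.withColumn("event_year", F.year("event_ts"))
   .withColumn("event_month", F.month("event_ts"))
   .write
   .partitionBy("event_year", "event_month")   # directory-level partitioning enables partition pruning
   .mode("overwrite")
   .saveAsTable("events_liveLatest_demo"))     # hypothetical table name for this sketch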
A data engineer is working on a Streaming DataFrame streaming_df with the given streaming data:
Which operation is supported with streaming_df?
Which components of Apache Spark’s architecture are responsible for carrying out tasks assigned to them?
A developer has been asked to debug an issue with a Spark application. The developer identified that the data being loaded from a CSV file is being read incorrectly into a DataFrame.
The CSV file has been read using the following Spark SQL statement:
CREATE TABLE locations
USING csv
OPTIONS (path '/data/locations.csv')
The first lines of output from SELECT * FROM locations look like this:
| city | lat | long |
| ALTI Sydney | -33... | ... |
Which parameter can the developer add to the OPTIONS clause in the CREATE TABLE statement to read the CSV data correctly again?
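For reference, CSV reader options can be passed straight through the OPTIONS clause; which one applies depends on how the file is malformed. The fragment below is an illustration (demo table name and both options shown are assumptions, not the answer key):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE locations_demo          -- hypothetical table name for this sketch
    USING csv
    OPTIONS (
        path '/data/locations.csv',      -- path from the question
        header 'true',                   -- treat the first line as column names
        sep ','                          -- explicit field delimiter
    )
""")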
Which feature of Spark Connect should be considered when designing an application that plans to enable remote interaction with a Spark cluster?
A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.
Which combination of Apache Spark modules should the data scientist use in this scenario?
A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.
Which change should be made to solve the issue?
A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:
fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))
The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.
Which change should be made to the code to stop these customer records from being dropped?
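For reference, keeping unmatched fact rows is a matter of the join type; a sketch with hypothetical stand-in DataFrames:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

purch_df = spark.createDataFrame([(1, 9.99), (2, 5.00), (99, 1.50)], ["customer_id", "amount"])
cust_df = spark.createDataFrame([(1, "Ana"), (2, "Bo")], ["custid", "name"])

# A left (outer) join keeps purchases whose customer_id has no match in cust_df.
fact_df = purch_df.join(cust_df, F.col("customer_id") == F.col("custid"), how="left")
fact_df.show()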
A developer created a DataFrame with columns color, fruit, and taste, and wrote the data to a Parquet directory using:
df.write.partitionBy("color", "taste").parquet("/path/to/output")
What is the result of this code?
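A sketch of what such a write produces on disk, using hypothetical sample data and output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("red", "apple", "sweet"), ("green", "apple", "tart")],
    ["color", "fruit", "taste"],
)

df.write.mode("overwrite").partitionBy("color", "taste").parquet("/tmp/output_demo")

# Resulting layout (one directory level per partition column, in the order given):
# /tmp/output_demo/color=red/taste=sweet/part-*.parquet
# /tmp/output_demo/color=green/taste=tart/part-*.parquet
# The partition columns are encoded in the directory paths, not stored in the Parquet files.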
A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:
region_id | region_name
------------------------
10        | North
12        | East
14        | West
The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values.
Which code fragment meets the requirements?
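One way such a fragment is commonly written, shown as a sketch with the table contents recreated inline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

regions = spark.createDataFrame(
    [(10, "North"), (12, "East"), (14, "West")],
    ["region_id", "region_name"],
)

# Take the 3 smallest region_id values, then build the dictionary on the driver.
rows = regions.orderBy("region_id").limit(3).collect()
region_map = {row["region_id"]: row["region_name"] for row in rows}
print(region_map)   # {10: 'North', 12: 'East', 14: 'West'}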
A Spark application suffers from too many small tasks due to excessive partitioning. How can this be fixed without a full shuffle?
How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?
What is the benefit of using Pandas API on Spark for data transformations?
A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?
A)
Use the applyInPandas API
B)
C)
D)
A data engineer is working on a num_df DataFrame and has a Python UDF defined as:
def cube_func(val):
    return val * val * val
Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?
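For reference, a Python UDF can be registered for Spark SQL with spark.udf.register; a sketch with a hypothetical num_df (the LongType return type is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

num_df = spark.createDataFrame([(1,), (2,), (3,)], ["val"])

def cube_func(val):
    return val * val * val

# Register for use from SQL / selectExpr; the return type here is an assumption.
spark.udf.register("cube_func", cube_func, LongType())

num_df.createOrReplaceTempView("num_table")
spark.sql("SELECT val, cube_func(val) AS cubed FROM num_table").show()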
Given this view definition:
df.createOrReplaceTempView("users_vw")
Which approach can be used to query the users_vw view after the session is terminated?
A data engineer is working with Spark SQL and has a large JSON file stored at /data/input.json.
The file contains records with varying schemas, and the engineer wants to create an external table in Spark SQL that:
Reads directly from /data/input.json.
Infers the schema automatically.
Merges differing schemas.
Which code snippet should the engineer use?
What is the relationship between jobs, stages, and tasks during execution in Apache Spark?
Given a DataFrame df that has 10 partitions, after running the code:
result = df.coalesce(20)
How many partitions will the result DataFrame have?
A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.
Which operation results in a shuffle?