

Databricks-Certified-Professional-Data-Engineer: Databricks Certified Data Engineer Professional Exam Questions and Answers

Question # 4

What is true for Delta Lake?

A.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

B.

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.

C.

Z-ORDER can only be applied to numeric values stored in Delta Lake tables.

D.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Question # 5

A data team is implementing an append-only Delta Lake pipeline that processes both batch and streaming data. They want to ensure that schema changes in the source data are automatically incorporated without breaking the pipeline.

Which configuration should the team use when writing data to the Delta table?

A.

ignoreChanges = false

B.

mergeSchema = true

C.

overwriteSchema = true

D.

validateSchema = false
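For reference, a minimal sketch of an append-only Delta write that picks up new source columns automatically via schema merging (the table path, checkpoint path, and DataFrame names are illustrative assumptions, not part of the question):

# Batch append; mergeSchema lets columns added in the source be added to the table schema
(df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/delta/events"))

# The streaming equivalent sets the same option on the writeStream
(df_stream.writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .start("/mnt/delta/events"))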

Question # 6

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

A.

Three columns will be returned, but one column will be named "redacted" and contain only null values.

B.

Only the email and ltv columns will be returned; the email column will contain all null values.

C.

The email and ltv columns will be returned with the values in user_ltv.

D.

The email, age, and ltv columns will be returned with the values in user_ltv.

E.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
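The view definition executed in this question is not reproduced in this dump. A typical redaction view consistent with the options, shown purely as an illustrative assumption (the group name and the REDACTED literal are not confirmed by the source), might look like:

# Illustrative only -- not the actual definition from the exam question
spark.sql("""
  CREATE OR REPLACE VIEW email_ltv AS
  SELECT
    CASE WHEN is_member('marketing') THEN email ELSE 'REDACTED' END AS email,
    ltv
  FROM user_ltv
""")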

Question # 7

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

A.

The five Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Total Disk Space remains constant

D.

Network I/O never spikes

E.

Overall cluster CPU utilization is around 25%

Question # 8

Which statement describes Delta Lake optimized writes?

A.

A shuffle occurs prior to writing to try to group data together, resulting in fewer files instead of each executor writing multiple files based on directory partitions.

B.

Optimized writes use logical partitions instead of directory partitions; partition boundaries are only represented in metadata, so fewer small files are written.

C.

An asynchronous job runs after the write completes to detect if files could be further compacted; if so, an OPTIMIZE job is executed toward a default file size of 1 GB.

D.

Before a job cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.

Question # 9

How are the operational aspects of Lakeflow Declarative Pipelines different from Spark Structured Streaming?

A.

Lakeflow Declarative Pipelines manage the orchestration of multi-stage pipelines automatically, while Structured Streaming requires external orchestration for complex dependencies.

B.

Structured Streaming can process continuous data streams, while Lakeflow Declarative Pipelines cannot.

C.

Lakeflow Declarative Pipelines can write to Delta Lake format, while Structured Streaming cannot.

D.

Lakeflow Declarative Pipelines automatically handle schema evolution, while Structured Streaming always requires manual schema management.

Question # 10

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

A.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

B.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.

C.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.

D.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

E.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

Question # 11

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

A.

Use Repos to make a pull request; use the Databricks REST API to update the current branch to dev-2.3.9

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

Question # 12

A Data Engineer is building a simple data pipeline using Lakeflow Declarative Pipelines (LDP) in Databricks to ingest customer data. The raw customer data is stored in a cloud storage location in JSON format. The task is to create Lakeflow Declarative Pipelines that read the raw JSON data and write it into a Delta table for further processing.

Which code snippet will correctly ingest the raw JSON data and create a Delta table using LDP?

A.

import dlt

@dlt.table

def raw_customers():

return spark.read.format("csv").load("s3://my-bucket/raw-customers/")

B.

import dlt

@dlt.table

def raw_customers():

return spark.read.json("s3://my-bucket/raw-customers/")

C.

import dlt

@dlt.table

def raw_customers():

return spark.read.format("parquet").load("s3://my-bucket/raw-customers/")

D.

import dlt

@dlt.view

def raw_customers():

return spark.format.json("s3://my-bucket/raw-customers/")
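The options above use batch reads; incremental LDP ingestion of raw JSON is more commonly written with Auto Loader. A minimal sketch, assuming the source path from the options and a hypothetical table name:

import dlt

@dlt.table(name="raw_customers", comment="Raw customer JSON ingested incrementally")
def raw_customers():
    # cloudFiles (Auto Loader) incrementally discovers new JSON files in the source path
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://my-bucket/raw-customers/"))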

Question # 13

A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.

One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.

What approach would allow them to do this?

A.

Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.

B.

Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.

C.

Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.

D.

Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file references.
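For context, expectations can be passed to @dlt.expect_all as an ordinary Python dictionary, so one rules definition can be applied to many tables. A minimal sketch with assumed rule names and an assumed source table:

import dlt

# Shared rules; in practice these could be loaded from a Delta table or a common module
quality_rules = {
    "valid_id": "id IS NOT NULL",
    "valid_timestamp": "event_time IS NOT NULL",
}

@dlt.table
@dlt.expect_all(quality_rules)  # the same dictionary can be reused across table definitions
def bronze_events():
    return spark.readStream.table("source_events")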

Question # 14

A company has a task management system that tracks the most recent status of tasks. The system takes task events as input and processes events in near real-time using Lakeflow Declarative Pipelines. A new task event is ingested into the system when a task is created or the task status is changed. Lakeflow Declarative Pipelines provides a streaming table (tasks_status) for BI users to query.

The table represents the latest status of all tasks and includes 5 columns:

    task_id (unique for each task)

    task_name

    task_owner

    task_status

    task_event_time

The table enables three properties: deletion vectors, row tracking, and change data feed (CDF).

A data engineer is asked to create a new Lakeflow Declarative Pipeline to enrich the tasks_status table in near real-time by adding one additional column representing task_owner’s department, which can be looked up from a static dimension table (employee).

How should this enrichment be implemented?

A.

Create a new Lakeflow Declarative Pipeline: use the readStream() function to read tasks_status table; enrich with the employee table; store the result in a new streaming table.

B.

Create a new Lakeflow Declarative Pipeline: use readStream() function with option readChangeFeed to read tasks_status table CDF; enrich with the employee table; create a new streaming table as the result table and use apply_changes() function to process the changes from the enriched CDF.

C.

Create a new Lakeflow Declarative Pipeline: use the read() function to read tasks_status table; enrich with employee table; store the result in a materialized view.

D.

Create a new Lakeflow Declarative Pipeline: use the readStream() function with the option skipChangeCommits to read the tasks_status table; enrich with the employee table; store the result in a new streaming table.

Question # 15

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when an impression led to monetizable clicks.

Which solution would improve the performance?

A)

B)

C)

D)

A.

Option A

B.

Option B

C.

Option C

D.

Option D

Question # 16

Which statement describes the default execution mode for Databricks Auto Loader?

A.

New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

B.

Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.

C.

A webhook triggers a Databricks job to run anytime new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.

D.

New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.

Question # 17

A data engineer is tasked with ensuring that a Delta table in Databricks continuously retains deleted files for 15 days (instead of the default 7 days), in order to permanently comply with the organization’s data retention policy.

Which code snippet correctly sets this retention period for deleted files?

A.

spark.sql( " ALTER TABLE my_table SET TBLPROPERTIES ( ' delta.deletedFileRetentionDuration ' = ' interval 15 days ' ) " )

B.

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, " /mnt/data/my_table " )

deltaTable.deletedFileRetentionDuration = " interval 15 days "

C.

spark.sql( " VACUUM my_table RETAIN 15 HOURS " )

D.

spark.conf.set( " spark.databricks.delta.deletedFileRetentionDuration " , " 15 days " )

Question # 18

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?

A.

The job_id and number of times the job has been run are concatenated and returned.

B.

The globally unique ID of the newly triggered run.

C.

The job_id is returned in this field.

D.

The number of times the job definition has been run in this workspace.

Question # 19

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id.

Which statement describes what the number alongside this field represents?

A.

The job_id is returned in this field.

B.

The job_id and the number of times the job has been run are concatenated and returned.

C.

The number of times the job definition has been run in the workspace.

D.

The globally unique ID of the newly triggered run.

Question # 20

To identify the top users consuming compute resources, a data engineering team needs to monitor usage within their Databricks workspace for better resource utilization and cost control. The team decided to use Databricks system tables, available under the System catalog in Unity Catalog, to gain detailed visibility into workspace activity.

Which SQL query should the team run from the System catalog to achieve this?

A.

SELECT sku_name,

identity_metadata.created_by AS user_email,

COUNT(usage_quantity) AS total_dbus

FROM system.billing.usage

GROUP BY user_email, sku_name

ORDER BY total_dbus DESC

LIMIT 10

B.

SELECT identity_metadata.run_as AS user_email,

SUM(usage_quantity) AS total_dbus

FROM system.billing.usage

GROUP BY user_email

ORDER BY total_dbus DESC

LIMIT 10

C.

SELECT sku_name,

identity_metadata.created_by AS user_email,

SUM(usage_quantity * usage_unit) AS total_dbus

FROM system.billing.usage

GROUP BY user_email, sku_name

ORDER BY total_dbus DESC

LIMIT 10

D.

SELECT sku_name,

usage_metadata.run_name AS user_email,

SUM(usage_quantity) AS total_dbus

FROM system.billing.usage

GROUP BY user_email, sku_name

ORDER BY total_dbus DESC

LIMIT 10

Question # 21

A member of the data engineering team has submitted a short notebook that they wish to schedule as part of a larger data pipeline. Assume that the commands provided below produce the logically correct results when run as presented.

Which command should be removed from the notebook before scheduling it as a job?

A.

Cmd 2

B.

Cmd 3

C.

Cmd 4

D.

Cmd 5

E.

Cmd 6

Question # 22

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

A.

Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

B.

Z-order indices calculated on the table are preventing file compaction

C.

Bloom filter indices calculated on the table are preventing file compaction

D.

Databricks has autotuned to a smaller target file size based on the overall size of data in the table

E.

Databricks has autotuned to a smaller target file size based on the amount of data in each partition

Question # 23

A data engineer has configured their Databricks Asset Bundle with multiple targets in databricks.yml and deployed it to the production workspace. Now, to validate the deployment, they need to invoke a job named my_project_job specifically within the prod target context. Assuming the job is already deployed, they need to trigger its execution while ensuring the target-specific configuration is respected.

Which command will trigger the job execution?

A.

databricks execute my_project_job -e prod

B.

databricks job run my_project_job --env prod

C.

databricks run my_project_job -t prod

D.

databricks bundle run my_project_job -t prod

Question # 24

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

A.

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

B.

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

C.

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.

D.

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

E.

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Question # 25

A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team access to view the driver logs if needed.

What are the minimal cluster permissions that allow the development team to accomplish this?

A.

CAN ATTACH TO

B.

CAN MANAGE

C.

CAN VIEW

D.

CAN RESTART

Question # 26

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers, one driver of type i3.xlarge, and should use the '14.3.x-scala2.12' runtime.

Which command should the data engineer use?

A.

databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

B.

databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

C.

databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

D.

databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

Question # 27

A data engineer is designing a pipeline in Databricks that processes records from a Kafka stream where late-arriving data is common.

Which approach should the data engineer use?

A.

Implement a custom solution using Databricks Jobs to periodically reprocess all historical data.

B.

Use batch processing and overwrite the entire output table each time to ensure late data is incorporated correctly.

C.

Use an Auto CDC pipeline with batch tables to simplify late data handling.

D.

Use a watermark to specify the allowed lateness to accommodate records that arrive after their expected window, ensuring correct aggregation and state management.
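A minimal watermarking sketch (the DataFrame, column names, lateness threshold, and window size are illustrative assumptions):

from pyspark.sql.functions import window, count

# Tolerate events arriving up to 1 hour late before window state is finalized and dropped
event_counts = (kafka_events
    .withWatermark("event_time", "1 hour")
    .groupBy(window("event_time", "10 minutes"), "event_type")
    .agg(count("*").alias("events")))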

Question # 28

The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:

The table daily_store_sales contains all the information needed to update store_sales_summary . The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

A.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each Update.

B.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

C.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

D.

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

E.

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

Question # 29

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

A.

"Can Manage" privileges on the required cluster

B.

Workspace Admin privileges, cluster creation allowed, "Can Attach To" privileges on the required cluster

C.

Cluster creation allowed, "Can Attach To" privileges on the required cluster

D.

"Can Restart" privileges on the required cluster

E.

Cluster creation allowed, "Can Restart" privileges on the required cluster

Question # 30

A data engineer is running a groupBy aggregation on a massive user activity log grouped by user_id. A few users have millions of records, causing task skew and long runtimes.

Which technique will fix the skew in this aggregation?

A.

Use salting by adding a random prefix to skewed keys before aggregation, then aggregate again after removing the prefix.

B.

Increase the Spark driver memory and retry.

C.

Use reduceByKey instead of groupBy to avoid shuffles.

D.

Filter out the skewed users before the aggregation.
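A minimal two-stage salting sketch (the salt factor and column names are illustrative assumptions):

from pyspark.sql.functions import floor, rand, sum as spark_sum

SALT_BUCKETS = 16  # illustrative

# Stage 1: add a random salt so a hot user_id is spread across many tasks, then pre-aggregate
partial = (activity_df
    .withColumn("salt", floor(rand() * SALT_BUCKETS))
    .groupBy("user_id", "salt")
    .agg(spark_sum("events").alias("partial_events")))

# Stage 2: drop the salt and combine the partial aggregates per user
totals = (partial
    .groupBy("user_id")
    .agg(spark_sum("partial_events").alias("total_events")))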

Question # 31

A transactions table has been liquid clustered on the columns product_id, user_id, and event_date.

Which operation lacks support for clustering on write?

A.

spark.writeStream.format('delta').mode('append')

B.

CTAS and RTAS statements

C.

INSERT INTO operations

D.

spark.write.format('delta').mode('append')

Question # 32

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?

A.

/jobs/runs/list

B.

/jobs/runs/get-output

C.

/jobs/runs/get

D.

/jobs/get

E.

/jobs/list

Question # 33

A DLT pipeline includes the following streaming tables:

raw_iot ingests raw device measurement data from a heart rate tracking device.

bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to be able to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?

A.

Set the skipChangeCommits flag to true on bpm_stats

B.

Set the skipChangeCommits flag to true on raw_iot

C.

Set the pipelines.reset.allowed property to false on bpm_stats

D.

Set the pipelines.reset.allowed property to false on raw_iot
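For reference, the property named in these options is set on the table definition itself. A minimal sketch (the source path and table body are illustrative assumptions):

import dlt

# pipelines.reset.allowed = false prevents pipeline updates from truncating and recomputing
# this table, preserving manually deleted or updated records in raw_iot
@dlt.table(
    name="raw_iot",
    table_properties={"pipelines.reset.allowed": "false"}
)
def raw_iot():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/raw/iot/"))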

Question # 34

A data engineer needs to implement column masking for a sensitive column in a Unity Catalog-managed table. The masking logic must dynamically check if users belong to specific groups defined in a separate table (group_access) that maps groups to allowed departments.

Which approach should the engineer use to efficiently enforce this requirement?

A.

Create a UDF that hardcodes allowed groups and apply it as a column mask.

B.

Create a view without selecting the sensitive column.

C.

Apply a column mask that references the group_access mapping table in its UDF.

D.

Use a row filter to restrict access based on the user’s group.
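A minimal sketch of a Unity Catalog column mask that consults a mapping table at query time (the function name, masked table, masked column, and predicate are illustrative assumptions; only group_access comes from the question):

# Mask function that checks the caller's groups against the group_access mapping table
spark.sql("""
  CREATE OR REPLACE FUNCTION dept_mask(department STRING)
  RETURN CASE
    WHEN EXISTS (
      SELECT 1 FROM group_access g
      WHERE g.allowed_department = department
        AND is_account_group_member(g.group_name)
    ) THEN department
    ELSE '***MASKED***'
  END
""")

# Attach the function as a column mask on the sensitive column
spark.sql("ALTER TABLE customer_data ALTER COLUMN department SET MASK dept_mask")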

Question # 35

A Delta Lake table representing metadata about user-generated content has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

A.

Date

B.

Post_id

C.

User_id

D.

Post_time

Question # 36

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.

B.

Text data cannot be stored with Delta Lake.

C.

ZORDER ON review will need to be run to see performance gains.

D.

The Delta log creates a term matrix for free text fields to support selective filtering.

E.

Delta Lake statistics are only collected on the first 4 columns in a table.

Question # 37

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

A.

The five Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Network I/O never spikes

D.

Total Disk Space remains constant

E.

CPU Utilization is around 75%

Question # 38

A data team is automating a daily multi-task ETL pipeline in Databricks. The pipeline includes a notebook for ingesting raw data, a Python wheel task for data transformation, and a SQL query to update aggregates. They want to trigger the pipeline programmatically and see previous runs in the GUI. They need to ensure tasks are retried on failure and stakeholders are notified by email if any task fails.

Which two approaches will meet these requirements? (Choose 2 answers)

A.

Use the REST API endpoint /jobs/runs/submit to trigger each task individually as separate job runs and implement retries using custom logic in the orchestrator.

B.

Create a multi-task job using the UI, Databricks Asset Bundles (DABs), or the Jobs REST API (/jobs/create) with notebook, Python wheel, and SQL tasks. Configure task-level retries and email notifications in the job definition.

C.

Trigger the job programmatically using the Databricks Jobs REST API (/jobs/run-now), the CLI (databricks jobs run-now), or one of the Databricks SDKs.

D.

Create a single orchestrator notebook that calls each step with dbutils.notebook.run(), defining a job for that notebook and configuring retries and notifications at the notebook level.

E.

Use Databricks Asset Bundles (DABs) to deploy the workflow, then trigger individual tasks directly by referencing each task’s notebook or script path in the workspace.

Question # 39

The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes.

A junior engineer has written the following code to add CHECK constraints to the Delta Lake table:

A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed.

Which statement explains the cause of this failure?

A.

Because another team uses this table to support a frequently running application, two-phase locking is preventing the operation from committing.

B.

The activity details table already exists; CHECK constraints can only be added during initial table creation.

C.

The activity details table already contains records that violate the constraints; all existing data must pass CHECK constraints in order to add them to an existing table.

D.

The activity details table already contains records; CHECK constraints can only be added prior to inserting values into a table.

E.

The current table schema does not contain the field valid coordinates; schema evolution will need to be enabled before altering the table to add a constraint.

Question # 40

The data architect has decided that once data has been ingested from external sources into the

Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.

The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.

GRANT USAGE ON DATABASE prod TO eng;

GRANT SELECT ON DATABASE prod TO eng;

Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

A.

Group members have full permissions on the prod database and can also assign permissions to other users or groups.

B.

Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.

C.

Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.

D.

Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.

E.

Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.

Question # 41

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

A.

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

B.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

C.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

D.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.

E.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

Question # 42

A data engineer is masking a column containing email addresses. The goal is to produce output strings of identical length for all rows, while generating different outputs for different email values.

Which SQL function should be used to achieve this?

A.

mask(email, '?')

B.

hash(email)

C.

sha1(email)

D.

sha2(email, 0)
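For comparison, a quick check of the candidate functions, reusing the user_ltv schema from Question # 6 purely for illustration:

# sha1 returns a 40-character hex digest and sha2(..., 0) defaults to SHA-256 (64 hex characters),
# so both give identical-length, value-dependent output; hash() returns a 32-bit integer instead
spark.sql("""
  SELECT
    sha1(email)    AS email_sha1,
    sha2(email, 0) AS email_sha2,
    hash(email)    AS email_hash
  FROM user_ltv
""").show(truncate=False)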

Question # 43

A data architect is designing a Databricks solution to efficiently process data for different business requirements.

In which scenario should a data engineer use a materialized view compared to a streaming table ?

A.

Implementing a CDC (Change Data Capture) pipeline that needs to detect and respond to database changes within seconds.

B.

Ingesting data from Apache Kafka topics with sub-second processing requirements for immediate alerting.

C.

Precomputing complex aggregations and joins from multiple large tables to accelerate BI dashboard performance.

D.

Processing high-volume, continuous clickstream data from a website to monitor user behavior in real-time.

Question # 44

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

" device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT "

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

A.

withWatermark( " event_time " , " 10 minutes " )

B.

awaitArrival( " event_time " , " 10 minutes " )

C.

await( " event_time + ‘10 minutes ' " )

D.

slidingWindow( " event_time " , " 10 minutes " )

E.

delayWrite( " event_time " , " 10 minutes " )
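The code block itself is not reproduced in this dump. A sketch of the full aggregation the question describes (only df and the column names come from the stated schema; the output sink is omitted):

from pyspark.sql.functions import window, avg

result = (df
    .withWatermark("event_time", "10 minutes")    # retain state for 10 minutes of lateness
    .groupBy(window("event_time", "5 minutes"))   # non-overlapping five-minute intervals
    .agg(avg("humidity").alias("avg_humidity"),
         avg("temp").alias("avg_temp")))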

Question # 45

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E

Question # 46

A data engineer is configuring Delta Sharing for a Databricks-to-Databricks scenario to optimize read performance. The recipient needs to perform time travel queries and streaming reads on shared sales data.

Which configuration will provide the optimal performance while enabling these capabilities?

A.

Share tables WITH HISTORY, ensure tables don’t have partitioning enabled, and enable CDF before sharing.

B.

Share tables WITHOUT HISTORY and enable partitioning for better query performance.

C.

Share the entire schema WITHOUT HISTORY and rely on recipient-side caching for performance.

D.

Use the open sharing protocol instead of Databricks-to-Databricks sharing for better performance.
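For reference, history sharing is enabled when a table is added to a share; sharing history is what allows time travel and streaming reads on the recipient side. A minimal sketch with assumed share and table names:

spark.sql("ALTER SHARE sales_share ADD TABLE main.sales.transactions WITH HISTORY")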

Question # 47

A data engineer is using Auto Loader to read incoming JSON data as it arrives. They have configured Auto Loader to quarantine invalid JSON records but notice that over time, some records are being quarantined even though they are well-formed JSON.

The code snippet is:

df = (spark.readStream

.format("cloudFiles")

.option("cloudFiles.format", "json")

.option("badRecordsPath", "/tmp/somewhere/badRecordsPath")

.schema("a int, b int")

.load("/Volumes/catalog/schema/raw_data/"))

What is the cause of the missing data?

A.

At some point, the upstream data provider switched everything to multi-line JSON.

B.

The badRecordsPath location is accumulating many small files.

C.

The source data is valid JSON but does not conform to the defined schema in some way.

D.

The engineer forgot to set the option "cloudFiles.quarantineMode" = "rescue".

Question # 48

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

For Demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table named products_per_order, includes the following fields:

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?

A.

Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.

B.

Populate the dashboard by configuring a nightly batch job to save the required values as a table overwritten with each update.

C.

Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.

D.

Define a view against the products_per_order table and define the dashboard against this view.

Question # 49

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.

Which statement describes the contents of the workspace audit logs concerning these events?

A.

Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.

B.

Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.

C.

Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.

D.

Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.

E.

Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.

Question # 50

Which Python variable contains a list of directories to be searched when trying to locate required modules?

A.

importlib.resource path

B.

sys.path

C.

os.path

D.

pypi.path

E.

pylib.source

Question # 51

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show the minimum and median time to complete a task as roughly the same, but the max duration for a task to be roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

A.

Task queueing resulting from improper thread pool assignment.

B.

Spill resulting from attached volume storage being too small.

C.

Network latency due to some cluster nodes being in different regions from the source data

D.

Skew caused by more data being assigned to a subset of spark-partitions.

E.

Credential validation errors while pulling data from an external system.

Question # 52

The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing on both the code and the data resulting from code execution, and the team wants to develop and test against data as similar to production as possible.

A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have Admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.

Which statement captures best practices for this situation?

A.

Because access to production data will always be verified using passthrough credentials it is safe to mount data to any Databricks development environment.

B.

All developer, testing and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.

C.

In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.

D.

Because delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data, as such it is generally safe to mount production data anywhere.

Question # 53

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

A.

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

B.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

C.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

D.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

E.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Question # 54

An organization processes customer data from web and mobile applications. Data includes names, emails, phone numbers, and location history. Data arrives both as batch files (from SFTP daily) and streaming JSON events (from Kafka in real-time).

To comply with data privacy policies, the following requirements must be met:

    Personally Identifiable Information (PII) such as email, phone number, and IP address must be masked or anonymized before storage.

    Both batch and streaming pipelines must apply consistent PII handling.

    Masking logic must be auditable and reproducible.

    The masked data must remain usable for downstream analytics.

How should the data engineer design a compliant data pipeline on Databricks that supports both batch and streaming modes, applies data masking to PII, and maintains traceability for audits?

A.

Allow PII to be stored unmasked in Bronze for lineage tracking, then apply masking logic in Gold tables used for reporting.

B.

Load batch data with notebooks and ingest streaming data with SQL Warehouses; use Unity Catalog column masks on Silver tables to redact fields after storage.

C.

Ingest both batch and streaming data using Lakeflow Declarative Pipelines, and apply masking via Unity Catalog column masks at read time to avoid modifying the data during ingestion.

D.

Use Lakeflow Declarative Pipelines for batch and streaming ingestion, define a PII masking function, and apply it during Bronze ingestion before writing to Delta Lake.

Question # 55

Which approach demonstrates a modular and testable way to use DataFrame.transform for ETL code in PySpark?

A.

class Pipeline:

def transform(self, df):

return df.withColumn("value_upper", upper(col("value")))

pipeline = Pipeline()

assertDataFrameEqual(pipeline.transform(test_input), expected)

B.

def upper_value(df):

return df.withColumn("value_upper", upper(col("value")))

def filter_positive(df):

return df.filter(df["id"] > 0)

pipeline_df = df.transform(upper_value).transform(filter_positive)

C.

def upper_transform(df):

return df.withColumn("value_upper", upper(col("value")))

actual = test_input.transform(upper_transform)

assertDataFrameEqual(actual, expected)

D.

def transform_data(input_df):

# transformation logic here

return output_df

test_input = spark.createDataFrame([(1, "a")], ["id", "value"])

assertDataFrameEqual(transform_data(test_input), expected)

Question # 56

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

A.

configure

B.

fs

C.

jobs

D.

libraries

E.

workspace

Question # 57

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

A.

Can manage

B.

Can edit

C.

Can run

D.

Can Read

Question # 58

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.

The below query is used to create the alert:

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are triggered to be sent at most every 1 minute.

If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

A.

The total average temperature across all sensors exceeded 120 on three consecutive executions of the query

B.

The recent_sensor_recordings table was unresponsive for three consecutive runs of the query

C.

The source query failed to update properly for three consecutive minutes and then restarted

D.

The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query

E.

The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query

Question # 59

An analytics team wants to run a short-term experiment in Databricks SQL on the customer transactions Delta table (about 20 billion records) created by the data engineering team. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?

A.

Create a new table for the analytics team using a CTAS statement.

B.

Deep clone the table for the analytics team.

C.

Give the analytics team direct access to the production table.

D.

Shallow clone the table for the analytics team.
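A minimal shallow-clone sketch (catalog, schema, and table names are illustrative assumptions):

# A shallow clone copies only metadata, so it is created quickly, adds negligible storage,
# and keeps the analysts' experiment isolated from the production table and its ETL
spark.sql("""
  CREATE TABLE main.analytics.customer_transactions_experiment
  SHALLOW CLONE main.prod.customer_transactions
""")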

Question # 60

A data engineer, while designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock symbol group, must ensure the function is efficient and scalable. Which approach will solve the problem with minimum overhead while preserving data integrity?

A.

Use a scalar_iter Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.

B.

Use a scalar Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.

C.

Use applyInPandas on a Spark DataFrame so that each stock symbol group is received as a pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.

D.

Use a grouped-aggregate Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.
