

Databricks Certified Data Engineer Professional (Databricks-Certified-Professional-Data-Engineer) Exam Questions and Answers

Question # 4

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E
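
The function image and answer options are not reproduced here. For reference, a minimal sketch of a helper meeting the stated requirements, with all paths and names assumed rather than taken from the original options, is:

    # Sketch only: source_path, checkpoint_path, and table_name are hypothetical.
    def auto_load_json(spark, source_path, checkpoint_path, table_name):
        return (spark.readStream
            .format("cloudFiles")                                  # Auto Loader source
            .option("cloudFiles.format", "json")                   # incremental JSON ingestion
            .option("cloudFiles.schemaLocation", checkpoint_path)  # schema detection and tracking
            .load(source_path)
            .writeStream
            .option("checkpointLocation", checkpoint_path)
            .option("mergeSchema", "true")                         # evolve table schema on new fields
            .toTable(table_name))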

Question # 5

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

A.

to_interval("event_time", "5 minutes").alias("time")

B.

window("event_time", "5 minutes").alias("time")

C.

"event_time"

D.

window("event_time", "10 minutes").alias("time")

E.

lag("event_time", "10 minutes").alias("time")

Question # 6

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?

A.

/jobs/runs/list

B.

/jobs/runs/get-output

C.

/jobs/runs/get

D.

/jobs/get

E.

/jobs/list
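
For reference, a job's task definitions, including the notebook path of each task, appear under settings.tasks when the job itself is fetched; a sketch (workspace URL, token, and job_id are placeholders):

    import requests

    HOST = "https://<workspace-url>"      # placeholder
    TOKEN = "<personal-access-token>"     # placeholder

    # Fetch the job definition; task notebook paths live under settings.tasks
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/get",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"job_id": 123},
    )
    for task in resp.json()["settings"]["tasks"]:
        print(task.get("notebook_task", {}).get("notebook_path"))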

Question # 7

A junior data engineer on your team has implemented the following code block.

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

A.

They are merged.

B.

They are ignored.

C.

They are updated.

D.

They are inserted.

E.

They are deleted.
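
The code block image is not shown above; based on the description and the option set, a common form of this logic is an insert-only merge, sketched here with the names from the question:

    # Insert-only merge: rows whose event_id already exists in events
    # match the ON clause and, with no WHEN MATCHED clause, are ignored.
    spark.sql("""
        MERGE INTO events a
        USING new_events b
        ON a.event_id = b.event_id
        WHEN NOT MATCHED THEN INSERT *
    """)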

Question # 8

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake, even though the field was present in the Kafka source. The field was consequently also missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days, and the pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

A.

The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.

B.

Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.

C.

Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.

D.

Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.

E.

Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.

Question # 9

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers, one driver of type i3.xlarge, and should use the '14.3.x-scala2.12' runtime.

Which command should the data engineer use?

A.

databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name DataEngineer_cluster

B.

databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

C.

databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

D.

databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

Question # 10

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

A.

withWatermark("event_time", "10 minutes")

B.

awaitArrival("event_time", "10 minutes")

C.

await("event_time + '10 minutes'")

D.

slidingWindow("event_time", "10 minutes")

E.

delayWrite("event_time", "10 minutes")
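
For reference, the watermark API has the following shape; a 10-minute watermark tells the engine how long to retain aggregation state for late-arriving events before finalizing windows (a sketch, names from the question):

    # Retain roughly 10 minutes of state for late data
    df_with_watermark = df.withWatermark("event_time", "10 minutes")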

Question # 11

A Delta Lake table representing metadata about content posts from users has the following schema:

    user_id LONG

    post_text STRING

    post_id STRING

    longitude FLOAT

    latitude FLOAT

    post_time TIMESTAMP

    date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

A.

date

B.

user_id

C.

post_id

D.

post_time
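
For context, a sketch of writing a Delta table partitioned by a low-cardinality column such as date (the table name is hypothetical):

    (df.write
       .format("delta")
       .partitionBy("date")          # low-cardinality, commonly filtered column
       .saveAsTable("user_posts"))   # hypothetical table name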

Question # 12

A DLT pipeline includes the following streaming tables:

raw_iot ingests raw device measurement data from a heart rate tracking device.

bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.

How can the data engineer configure this pipeline to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?

A.

Set the skipChangeCommits flag to true on bpm_stats

B.

Set the skipChangeCommits flag to true on raw_iot

C.

Set the pipelines.reset.allowed property to false on bpm_stats

D.

Set the pipelines.reset.allowed property to false on raw_iot
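
For context, a sketch of how this table property is set in a DLT pipeline (assumes a DLT pipeline context; the source path is hypothetical):

    import dlt

    # pipelines.reset.allowed=false prevents a full refresh from clearing
    # and recomputing raw_iot, so manual deletes/updates are retained.
    @dlt.table(
        name="raw_iot",
        table_properties={"pipelines.reset.allowed": "false"},
    )
    def raw_iot():
        return (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "json")
                .load("/mnt/iot/raw"))  # hypothetical source path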

Question # 13

The downstream consumers of a Delta Lake table have been complaining about data quality issues impacting performance in their applications. Specifically, they have complained that invalid latitude and longitude values in the activity_details table have been breaking their ability to use other geolocation processes.

A junior engineer has written the following code to add CHECK constraints to the Delta Lake table:

A senior engineer has confirmed the above logic is correct and the valid ranges for latitude and longitude are provided, but the code fails when executed.

Which statement explains the cause of this failure?

A.

Because another team uses this table to support a frequently running application, two-phase locking is preventing the operation from committing.

B.

The activity_details table already exists; CHECK constraints can only be added during initial table creation.

C.

The activity_details table already contains records that violate the constraints; all existing data must pass CHECK constraints in order to add them to an existing table.

D.

The activity_details table already contains records; CHECK constraints can only be added prior to inserting values into a table.

E.

The current table schema does not contain the field valid_coordinates; schema evolution will need to be enabled before altering the table to add a constraint.
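
For reference, CHECK constraints are added to an existing Delta table with ALTER TABLE; a sketch using the table name from the question (the constraint name and exact ranges are assumed):

    # Fails if existing rows violate the constraint
    spark.sql("""
        ALTER TABLE activity_details
        ADD CONSTRAINT valid_coordinates
        CHECK (latitude BETWEEN -90 AND 90 AND longitude BETWEEN -180 AND 180)
    """)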

Question # 14

Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful about which credentials are stored there and which users have access to those secrets.

Which statement describes a limitation of Databricks Secrets?

A.

Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.

B.

Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.

C.

Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.

D.

Iterating through a stored secret and printing each character will display secret contents in plain text.

E.

The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.
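
For context, a sketch of the redaction behavior (the scope and key names are hypothetical):

    # Secrets are redacted when printed directly...
    password = dbutils.secrets.get(scope="prod", key="db_password")
    print(password)            # displays [REDACTED]

    # ...but iterating over the string prints each character in plain text
    for ch in password:
        print(ch, end=" ")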

Question # 15

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

A.

Can Manage

B.

Can Edit

C.

No permissions

D.

Can Read

E.

Can Run

Question # 16

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for these dashboards completes in 10 minutes. Assuming normal operating conditions, which configuration will meet their service-level agreement requirements at the lowest cost?

A.

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.

B.

Schedule a job to execute the pipeline once an hour on a new job cluster.

C.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.

D.

Configure a job that executes every time new data lands in a given directory.

Question # 17

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.

The below query is used to create the alert:
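
The query image is not reproduced here. A query consistent with the alert behavior described, with the aggregation grouped per sensor, might look like this sketch (column names taken from the question):

    # Sketch: per-sensor mean temperature over the most recent recordings
    alert_df = spark.sql("""
        SELECT sensor_id, mean(temperature) AS mean_temperature
        FROM recent_sensor_recordings
        GROUP BY sensor_id
    """)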

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are sent at most once per minute.

If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

A.

The total average temperature across all sensors exceeded 120 on three consecutive executions of the query

B.

The recent_sensor_recordings table was unresponsive for three consecutive runs of the query

C.

The source query failed to update properly for three consecutive minutes and then restarted

D.

The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query

E.

The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query

Question # 18

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

A.

Workspace audit logs

B.

Driver's log file

C.

Ganglia

D.

Cluster Event Log

E.

Executor's log file

Question # 19

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

df has the following schema: device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT

Code block:

df.withWatermark("event_time", "10 minutes")
  .groupBy(
      ________,
      "device_id"
  )
  .agg(
      avg("temp").alias("avg_temp"),
      avg("humidity").alias("avg_humidity")
  )
  .writeStream
  .format("delta")
  .saveAsTable("sensor_avg")

Which line of code correctly fills in the blank within the code block to complete this task?

A.

window("event_time", "5 minutes").alias("time")

B.

to_interval("event_time", "5 minutes").alias("time")

C.

"event_time"

D.

lag("event_time", "5 minutes").alias("time")

Question # 20

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.

Which statement explains what is preventing this privilege transfer?

A.

Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.

B.

The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.

C.

Other than the default "admins" group, only individual users can be granted privileges on jobs.

D.

A user can only transfer job ownership to a group if they are also a member of that group.

E.

Only workspace administrators can grant "Owner" privileges to a group.

Question # 21

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

A.

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

B.

An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.

C.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

D.

A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.

E.

A managed table will be created in the DBFS root storage container.

Question # 22

Which statement describes integration testing?

A.

Validates interactions between subsystems of your application

B.

Requires an automated testing framework

C.

Requires manual intervention

D.

Validates an application use case

E.

Validates behavior of individual elements of your application

Question # 23

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

A.

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

B.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

C.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

D.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

E.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
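
For context, capping the input partition size is the only option here that controls part-file size without a shuffle; a sketch (the paths are hypothetical):

    # Cap input partitions at 512 MB so narrow transformations preserve
    # partitioning and the written part files land near the target size.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

    df = spark.read.json("/mnt/raw/events")              # hypothetical 1 TB source
    df_clean = df.select("*")                            # narrow transformations only
    df_clean.write.mode("overwrite").parquet("/mnt/bronze/events")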

Question # 24

Each configuration below is identical in that each cluster has a total of 400 GB of RAM and 160 cores, with only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

A.

• Total VMs: 1

• 400 GB per Executor

• 160 Cores / Executor

B.

• Total VMs: 8

• 50 GB per Executor

• 20 Cores / Executor

C.

• Total VMs: 4

• 100 GB per Executor

• 40 Cores / Executor

D.

• Total VMs: 2

• 200 GB per Executor

• 80 Cores / Executor

Question # 25

The following table consists of items found in user carts within an e-commerce website.

The following MERGE statement is used to update this table using an updates view, with schema evolution enabled on this table.

How would the following update be handled?

A.

The update is moved to a separate "restored" column because it is missing a column expected in the target schema.

B.

The new restored field is added to the target schema, and dynamically read as NULL for existing unmatched records.

C.

The update throws an error because changes to existing columns in the target schema are not supported.

D.

The new nested field is added to the target schema, and files underlying existing records are updated to include NULL values for the new field.
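
For reference, schema evolution for MERGE is typically enabled with the Delta autoMerge setting; a sketch with assumed table and key names:

    # Allow MERGE to add new source columns to the target schema;
    # existing unmatched records read NULL for any newly added field.
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
    spark.sql("""
        MERGE INTO cart_items t
        USING updates u
        ON t.item_id = u.item_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)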

Question # 26

Which statement describes Delta Lake Auto Compaction?

A.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

B.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

C.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

D.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

Question # 27

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

A.

model.predict(df, columns)

B.

df.map(lambda x: model(x[columns])).select("customer_id", "predictions")

C.

df.select("customer_id", model(*columns).alias("predictions"))

D.

df.apply(model, columns).select("customer_id", "predictions")
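
For context, when a logged model is loaded as a Spark UDF, applying it to feature columns looks like the following sketch (the model URI and feature column names are assumed):

    import mlflow.pyfunc

    # Load the production model as a Spark UDF returning DOUBLE
    model = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn/Production")
    columns = ["feature_1", "feature_2", "feature_3"]   # hypothetical features
    preds = df.select("customer_id", model(*columns).alias("predictions"))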

Question # 28

Given the following error traceback:

AnalysisException: cannot resolve 'heartrateheartrateheartrate' given input columns:

[spark_catalog.database.table.device_id, spark_catalog.database.table.heartrate,

spark_catalog.database.table.mrn, spark_catalog.database.table.time]

The code snippet was:

display(df.select(3*"heartrate"))

Which statement describes the error being raised?

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.
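
For reference, the repeated name comes from Python string repetition, which is evaluated before select() ever sees its argument:

    3 * "heartrate"   # -> 'heartrateheartrateheartrate'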

Question # 29

An analytics team wants to run a short-term experiment in Databricks SQL on the customer transactions Delta table (about 20 billion records) created by the data engineering team. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?

A.

Create a new table for the analytics team using a CTAS statement.

B.

Deep clone the table for the analytics team.

C.

Give the analytics team direct access to the production table.

D.

Shallow clone the table for the analytics team.
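
For context, a shallow clone copies only table metadata rather than the underlying data files, so it is created quickly and isolates the experiment from production; a sketch with assumed names:

    # Metadata-only copy of the ~20B-record table for the analytics team
    spark.sql("""
        CREATE TABLE analytics.customer_transactions_clone
        SHALLOW CLONE prod.customer_transactions
    """)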

Question # 30

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

A.

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: Unlimited

B.

Cluster: New Job Cluster;

Retries: None;

Maximum Concurrent Runs: 1

C.

Cluster: Existing All-Purpose Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

D.

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

E.

Cluster: Existing All-Purpose Cluster;

Retries: None;

Maximum Concurrent Runs: 1

Question # 31

Which statement regarding stream-static joins and static Delta tables is correct?

A.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

B.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.

C.

The checkpoint directory will be used to track state information for the unique keys present in the join.

D.

Stream-static joins cannot use static Delta tables because of consistency issues.

E.

The checkpoint directory will be used to track updates to the static Delta table.

Question # 32

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

A.

Create a view on the marketing table that selects only the fields approved for the sales team, aliasing the names of any fields that should be standardized to the sales naming conventions.

B.

Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.

C.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.

D.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync changes committed to one table to the corresponding table.
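
For context, the view-based approach might be sketched as follows (all table, schema, and field names are assumed):

    spark.sql("""
        CREATE OR REPLACE VIEW sales.campaign_summary AS
        SELECT account_id AS customer_id,      -- renamed to sales conventions
               total_spend,
               region
        FROM marketing.campaign_aggregates     -- unapproved marketing fields omitted
    """)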

Question # 33

The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods, including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows:

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

A.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each Update.

B.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

C.

Implement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

D.

Implement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

E.

Use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

Question # 34

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

A.

Use %pip install in a notebook cell

B.

Run source env/bin/activate in a notebook setup script

C.

Install libraries from PyPi using the cluster UI

D.

Use %sh pip install in a notebook cell
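
For reference, notebook-scoped installation uses the %pip magic in a notebook cell; the package and version below are illustrative:

    %pip install requests==2.31.0   # installed for this notebook's session on all cluster nodes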

Question # 35

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?

A.

Size on Disk is > 0

B.

The number of Cached Partitions > the number of Spark Partitions

C.

The RDD Block Name includes the '' annotation, signaling a failure to cache

D.

On Heap Memory Usage is within 75% of Off Heap Memory Usage

Question # 36

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().

Which of the following statements is correct?

A.

DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.

B.

By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.

C.

The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.

D.

Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.

E.

The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.

Question # 37

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.
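
The JSON image is not reproduced here. A payload of the general shape this endpoint accepts is sketched below (all values hypothetical); note that 2.0/jobs/create only registers a job definition:

    import requests

    HOST = "https://<workspace-url>"     # placeholder
    TOKEN = "<personal-access-token>"    # placeholder

    payload = {
        "name": "Ingest new data",
        "existing_cluster_id": "1234-567890-abcde123",             # hypothetical
        "notebook_task": {"notebook_path": "/Repos/prod/ingest"},  # hypothetical
        # whether runs occur depends on any schedule in the original payload
    }
    resp = requests.post(f"{HOST}/api/2.0/jobs/create",
                         headers={"Authorization": f"Bearer {TOKEN}"},
                         json=payload)
    print(resp.json()["job_id"])   # each successful call returns a new job_id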

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

A.

Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.

B.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.

C.

Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.

D.

One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.

E.

The logic defined in the referenced notebook will be executed three times on the referenced existing all-purpose cluster.

Question # 38

The data governance team is reviewing user requests to delete records for compliance with GDPR. The following logic has been implemented to propagate delete requests from the user_lookup table to the user_aggregates table.

Assuming that user_id is a unique identifying key and that all users who have requested deletion have been removed from the user_lookup table, which statement describes whether successfully executing the above logic guarantees that the records to be deleted from the user_aggregates table are no longer accessible, and why?

A.

No: files containing deleted records may still be accessible with time travel until a VACUUM command is used to remove invalidated data files.

B.

Yes: Delta Lake ACID guarantees provide assurance that the DELETE command succeeded and permanently purged these records.

C.

No: the change data feed only tracks inserts and updates, not deleted records.

D.

No: the Delta Lake DELETE command only provides ACID guarantees when combined with the MERGE INTO command
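
For reference, time-travel access to deleted records ends only once the invalidated files are vacuumed; a sketch (the retention value is illustrative):

    # Remove data files invalidated more than 7 days ago (the default retention)
    spark.sql("VACUUM user_aggregates RETAIN 168 HOURS")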

Question # 39

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating the job run request was submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?

A.

The job_id and number of times the job has been run are concatenated and returned.

B.

The globally unique ID of the newly triggered run.

C.

The job_id is returned in this field.

D.

The number of times the job definition has been run in this workspace.

Question # 40

A Delta Lake table named customer_churn_params, with Change Data Feed (CDF) enabled, is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting it with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

A.

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

B.

Modify the overwrite logic to include a field populated by calling current_timestamp() as data are being written; use this field to identify records written on a particular date.

C.

Replace the current overwrite logic with a MERGE statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the Change Data Feed.

D.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
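
For context, once updates are applied via MERGE, the Change Data Feed can be read incrementally; a sketch (the starting timestamp is illustrative):

    # Read only rows changed since the given point in time via CDF
    changes = (spark.read
        .format("delta")
        .option("readChangeFeed", "true")
        .option("startingTimestamp", "2024-06-01 00:00:00")   # hypothetical
        .table("customer_churn_params"))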
