50 Professional Data Engineer Practice Questions: Question Bank 2025
Build your exam confidence with our curated bank of 50 practice questions for the Professional Data Engineer certification. Each question includes detailed explanations to help you understand the concepts deeply.
Why Use Our 50-Question Bank?
Strategically designed questions to maximize your exam preparation
50 Questions
A comprehensive set of practice questions covering key exam topics
All Domains Covered
Questions distributed across all exam objectives and domains
Mixed Difficulty
Easy, medium, and hard questions to test all skill levels
Detailed Explanations
Learn from comprehensive explanations for each answer
Practice Questions
50 practice questions for the Professional Data Engineer exam
A retail company stores daily CSV sales extracts in Cloud Storage. Analysts want to query the files with standard SQL without loading them into a database first, and costs should stay low. What should you do?
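For readers who want to see the mechanics behind this scenario, here is a minimal sketch of one relevant technique: defining a BigQuery external table over the CSV files so they can be queried in place with standard SQL. The project, dataset, and bucket names are illustrative, not part of the question.

```python
# Minimal sketch (illustrative names): a BigQuery external table over CSVs in
# Cloud Storage lets analysts run standard SQL without loading the files first.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://example-sales-bucket/daily/*.csv"]
external_config.autodetect = True                 # infer the schema from the files
external_config.options.skip_leading_rows = 1     # skip the CSV header row

table = bigquery.Table("example-project.sales.daily_sales_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)        # queries now scan the files in place
```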
A team wants to orchestrate a nightly pipeline that loads data to BigQuery, runs data quality checks, and then triggers a Dataform workflow. They need retries, alerting, and a visual view of dependencies. Which Google Cloud service is the best fit?
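One service commonly evaluated for this kind of requirement is Cloud Composer (managed Apache Airflow). The fragment below is a hedged sketch of how such a DAG might express retries, alerting, and a visible dependency graph; the task bodies, schedule, and alert address are placeholders, not a prescribed implementation.

```python
# Hypothetical Airflow DAG sketch: load -> quality checks -> trigger Dataform,
# with retries and failure alerting. Task implementations are placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-alerts@example.com"],
}

with DAG(
    dag_id="nightly_bq_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    load = BashOperator(task_id="load_to_bigquery", bash_command="echo load")
    checks = BashOperator(task_id="data_quality_checks", bash_command="echo checks")
    dataform = BashOperator(task_id="trigger_dataform", bash_command="echo dataform")

    load >> checks >> dataform   # dependency graph is visible in the Airflow UI
```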
You need to ensure that objects written to a Cloud Storage bucket are encrypted with customer-managed encryption keys (CMEK) and that only certain service accounts can use the key. What should you configure?
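As background for this scenario, the sketch below shows one way CMEK and key-level IAM fit together: a default Cloud KMS key on the bucket plus an IAM binding that limits which service accounts may use the key. All resource names and the service-agent address are illustrative.

```python
# Hypothetical sketch: set a default CMEK on a bucket and restrict key usage
# via IAM on the key itself. Resource names are illustrative.
from google.cloud import storage, kms

KEY_NAME = (
    "projects/example-project/locations/us/keyRings/data-keys/cryptoKeys/bucket-key"
)

# 1) Default CMEK on the bucket: new objects are encrypted with this key.
storage_client = storage.Client()
bucket = storage_client.get_bucket("example-secure-bucket")
bucket.default_kms_key_name = KEY_NAME
bucket.patch()

# 2) Grant only the required identities (e.g., the Cloud Storage service agent)
#    permission to use the key.
kms_client = kms.KeyManagementServiceClient()
policy = kms_client.get_iam_policy(request={"resource": KEY_NAME})
policy.bindings.add(
    role="roles/cloudkms.cryptoKeyEncrypterDecrypter",
    members=["serviceAccount:service-123456789@gs-project-accounts.iam.gserviceaccount.com"],
)
kms_client.set_iam_policy(request={"resource": KEY_NAME, "policy": policy})
```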
A data scientist trains models in Vertex AI and wants the serving endpoint to scale down to zero when idle to reduce cost, while still supporting HTTPS requests. Which deployment option should they choose?
A streaming pipeline ingests IoT sensor events. The pipeline must compute per-device rolling 10-minute aggregates and handle out-of-order events up to 5 minutes late. Which approach best satisfies these requirements?
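The Apache Beam (Python) fragment below sketches the windowing concepts this question exercises: sliding 10-minute windows keyed by device, with a 5-minute allowed lateness for out-of-order events. `events` is assumed to be an existing timestamped PCollection of (device_id, value) pairs.

```python
# Hypothetical Beam fragment: 10-minute rolling aggregates per device,
# accepting events that arrive up to 5 minutes late.
import apache_beam as beam
from apache_beam.transforms import window, trigger

aggregates = (
    events  # assumed: timestamped PCollection of (device_id, value) pairs
    | "RollingWindow" >> beam.WindowInto(
        window.SlidingWindows(size=600, period=60),   # 10-min windows, every minute
        allowed_lateness=300,                         # accept events up to 5 min late
        trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
    )
    | "SumPerDevice" >> beam.CombinePerKey(sum)
)
```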
A BigQuery dataset contains a table with sensitive PII columns (email, phone). Analysts should be able to query aggregates, but only a small compliance group can see raw PII. You want to minimize query rewrites and manage access centrally. What should you do?
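For context, the sketch below shows one building block often discussed with this scenario: attaching Data Catalog policy tags to PII columns so BigQuery column-level security governs who can read them, with no query rewrites or table copies. The taxonomy and policy-tag resource names are illustrative.

```python
# Hypothetical sketch: tag PII columns with a policy tag; access to those
# columns is then controlled centrally on the tag rather than per query.
from google.cloud import bigquery

client = bigquery.Client()

pii_tag = bigquery.PolicyTagList(
    names=["projects/example-project/locations/us/taxonomies/123/policyTags/456"]
)

schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("order_total", "NUMERIC"),
    bigquery.SchemaField("email", "STRING", policy_tags=pii_tag),
    bigquery.SchemaField("phone", "STRING", policy_tags=pii_tag),
]

table = bigquery.Table("example-project.sales.customers", schema=schema)
client.update_table(table, ["schema"])  # only readers granted on the tag see raw PII
```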
A batch Dataflow job intermittently fails with 'quota exceeded' errors when writing to BigQuery. The team wants to make the pipeline resilient without manual restarts and avoid duplicates in the destination table. What should they do?
A team trains a model in Vertex AI and wants to promote only models that outperform the currently deployed model on a fixed evaluation dataset. They also want a clear record of which dataset, code, and parameters produced each model. What should they implement?
A financial services company needs a disaster recovery design for a critical analytics dataset in BigQuery. Requirements: recovery from accidental deletion, protection from region-level failures, and minimal operational overhead. Which approach best meets these needs?
You are designing a feature store-like pipeline for near-real-time predictions. Features come from streaming events (Pub/Sub) and batch backfills (BigQuery). The online serving system needs low-latency key-based lookups, and you must ensure training/serving consistency (same feature definitions). What is the best architecture on Google Cloud?
A retail company stores daily sales files in Cloud Storage. A scheduled pipeline loads the files into BigQuery each morning. Some days, the same file is accidentally uploaded twice, and duplicates appear in BigQuery. The company wants an approach that prevents duplicates without manual cleanup and keeps the pipeline simple. What should you do?
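One idempotency pattern worth knowing for this scenario is a staging-table load followed by a MERGE keyed on a stable identifier, sketched below with illustrative table and column names.

```python
# Hypothetical sketch: load each file into a staging table, then MERGE into
# the target so re-loading the same file inserts nothing new.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `example-project.sales.daily_sales` AS target
USING `example-project.sales.daily_sales_staging` AS staging
ON target.order_id = staging.order_id
WHEN NOT MATCHED THEN
  INSERT (order_id, store_id, amount, sale_date)
  VALUES (staging.order_id, staging.store_id, staging.amount, staging.sale_date)
"""

client.query(merge_sql).result()  # idempotent: duplicate uploads create no duplicate rows
```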
You are serving an ML model on Vertex AI for online predictions. Security requires that predictions are made without sending traffic over the public internet. The client applications run on Compute Engine in a shared VPC. What should you implement?
A data engineering team wants to standardize how datasets and models move from development to production across multiple projects. They need an approval gate, repeatable deployments, and auditable change history. What is the best approach on Google Cloud?
A company is moving from an on-prem EDW to BigQuery. They have 5 years of historical data (rarely queried) and 6 months of recent data (queried frequently). They want to optimize performance and keep storage organized. What should you recommend?
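As background, the sketch below creates a date-partitioned, clustered BigQuery table: recent partitions stay cheap to scan, and partitions untouched for 90 days move automatically to long-term storage pricing. Table and column names are illustrative.

```python
# Hypothetical sketch: partition by sale date and cluster by store so queries
# over recent data prune old partitions; aging partitions need no extra work.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "example-project.edw.sales_history",
    schema=[
        bigquery.SchemaField("sale_date", "DATE"),
        bigquery.SchemaField("store_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="sale_date",
)
table.clustering_fields = ["store_id"]
client.create_table(table)
```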
A streaming pipeline uses Pub/Sub and Dataflow to aggregate clickstream events into per-minute metrics written to BigQuery. After a schema change, the pipeline starts failing with errors about mismatched fields when writing to BigQuery. The team wants to roll out schema changes safely with minimal downtime. What should they do?
Your organization trains models in Vertex AI and deploys them as endpoints. A compliance requirement states that you must be able to reproduce any prediction by tying it back to the exact training dataset snapshot, code version, and model artifact. What should you implement?
A Dataflow batch job reads from Cloud Storage and writes to BigQuery. It occasionally becomes significantly slower, and the team suspects uneven work distribution due to a few very large input files. They want to improve throughput without changing the business logic. What should they do?
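One common rebalancing technique relevant here is breaking fusion with a Reshuffle after the read, sketched below as a Beam (Python) fragment; `pipeline` and `process_record` are assumed to exist, and the input path is illustrative.

```python
# Hypothetical Beam fragment: redistributing records after the read prevents a
# few very large files from pinning most of the work to a handful of workers.
import apache_beam as beam

transformed = (
    pipeline
    | "ReadFiles" >> beam.io.ReadFromText("gs://example-bucket/input/*.csv")
    | "Redistribute" >> beam.Reshuffle()       # rebalance elements across workers
    | "Transform" >> beam.Map(process_record)  # business logic unchanged
)
```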
You need to validate data quality for a BigQuery dataset used by multiple downstream reports. Requirements include: detect nulls in critical columns, enforce referential integrity-like checks between dimension and fact tables, and publish a daily quality report. The solution should be managed and integrate well with BigQuery. What should you use?
A financial services company must ensure that sensitive fields (SSNs, bank account numbers) in raw ingestion logs are never accessible to analysts, but data engineers must still be able to reprocess raw data for compliance audits. They also want analysts to access de-identified data in BigQuery. What is the best architecture?
You run a feature generation pipeline that computes features from streaming events and writes them to an online store for low-latency serving and to an offline store for training. The serving model’s performance degraded, and you suspect training/serving skew due to different transformations used in batch vs streaming. What should you do to minimize skew going forward?
A data engineering team stores curated datasets in BigQuery. Analysts must be prevented from accidentally incurring very large query costs, but they still need the flexibility to explore data. What is the BEST way to enforce this across projects?
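For context, one cost control this question touches on is capping bytes billed per query; the sketch below sets the cap on a single job via the Python client (organization-wide enforcement is typically handled with custom query quotas rather than per-job settings). Names and the 100 GiB limit are illustrative.

```python
# Hypothetical sketch: a query that would scan more than the cap fails fast
# instead of running and incurring the cost.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=100 * 1024**3,  # fail any query scanning more than 100 GiB
)
query = "SELECT store_id, SUM(amount) FROM `example-project.sales.orders` GROUP BY store_id"
rows = client.query(query, job_config=job_config).result()
```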
You have a streaming pipeline writing events into BigQuery. Downstream dashboards must not show duplicate events, and duplicates can arrive minutes apart. Each event has a stable event_id. Which approach is MOST appropriate?
A team uses Vertex AI to train a model. They need to reproduce a specific model later and prove what data and parameters were used. Which feature BEST supports this requirement?
A company runs nightly Dataflow batch jobs that read from Cloud Storage and write to BigQuery. Some runs fail due to transient BigQuery quota errors. They want automatic retries without rerunning successful work and want to minimize duplicate rows. What should they do?
A retail company wants to build a feature store for both online predictions (low-latency) and offline training. They already use BigQuery for analytics and need point-in-time correct historical features for training to avoid label leakage. What is the BEST approach on Google Cloud?
A data platform ingests JSON events with evolving schemas into BigQuery. The team wants to minimize pipeline breakages when new optional fields appear, while keeping data queryable. Which design is BEST?
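The sketch below illustrates one schema-evolution lever in BigQuery load jobs: ALLOW_FIELD_ADDITION combined with schema auto-detection, so new optional JSON fields become nullable columns instead of failing the load. URIs and table names are illustrative.

```python
# Hypothetical sketch: append newline-delimited JSON and let new optional
# fields be added to the table schema rather than breaking the pipeline.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

client.load_table_from_uri(
    "gs://example-events-bucket/events/*.json",
    "example-project.analytics.raw_events",
    job_config=job_config,
).result()
```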
Your organization has multiple teams publishing datasets to BigQuery. You need consistent data governance: dataset-level access, column-level security for PII, and the ability to audit who accessed sensitive fields. What should you implement?
A real-time fraud detection service uses a model deployed on Vertex AI endpoints. After deployment, the fraud rate detected drops significantly, but offline evaluation metrics remain stable. You suspect data drift in production features. What should you do FIRST?
You are designing a cross-region disaster recovery strategy for a critical BigQuery dataset used for compliance reporting. Requirements: RPO of minutes, minimal operational overhead, and the ability to fail over if a region becomes unavailable. What is the BEST solution?
A Beam pipeline on Dataflow processes messages from Pub/Sub and writes to Bigtable. During a traffic spike, you observe increased processing lag and Bigtable hot-spotting. The row key is currently constructed as userId + timestamp. What change is MOST likely to fix the hot-spotting while preserving efficient queries by user and time?
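As background on the row-key design this question probes, the sketch below salts the user ID into a small, bounded shard prefix so writes spread across tablets while per-user, time-range reads remain efficient prefix scans. The key format and shard count are illustrative.

```python
# Hypothetical sketch: shard#user_id#timestamp keeps one user's rows contiguous
# within a shard, so a per-user, per-time-range query is still a prefix scan,
# while the shard prefix spreads hot write traffic across tablets.
import hashlib

def build_row_key(user_id: str, event_ts_millis: int, num_shards: int = 32) -> bytes:
    shard = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % num_shards
    return f"{shard:02d}#{user_id}#{event_ts_millis:013d}".encode()

print(build_row_key("user-42", 1735689600000))
```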
A retail company streams click events into Pub/Sub and uses a Dataflow streaming pipeline to write to BigQuery. During an incident, the pipeline stops for 20 minutes and then restarts. The business requires no data loss and no duplicate rows in BigQuery. What should you implement?
Your organization stores analytics datasets in BigQuery. Only a small group should see columns containing PII (email, phone), while analysts should still be able to query non-PII columns in the same tables without creating separate copies. What is the recommended approach?
A data engineer needs to run a daily ELT SQL transformation that depends on upstream table loads and should fail fast if upstream data is missing. The team wants minimal infrastructure management. Which solution best fits?
You deployed a Vertex AI endpoint for online predictions. Your SRE team wants to monitor for prediction service degradation and get alerted on elevated error rates and latency without building custom instrumentation. What should you do?
A media company stores raw video metadata as JSON in Cloud Storage. They want to run ad-hoc SQL across all files with minimal ETL and keep the data in place. Query performance should be acceptable for interactive analysis. What is the best approach?
A Dataflow batch pipeline reading from Cloud Storage occasionally fails due to corrupt input records. The business wants the pipeline to continue processing valid records, capture bad records for later review, and produce a count of rejected records per day. What should you do?
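A pattern frequently discussed for this scenario is a dead-letter side output, sketched below as a Beam (Python) fragment: valid records continue on the main output while corrupt records are tagged for later review and daily counting. `lines` is an assumed input PCollection of raw text records.

```python
# Hypothetical Beam fragment: route corrupt records to a "dead letter" output
# so valid records keep flowing and rejects can be stored and counted.
import json
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class ParseOrReject(beam.DoFn):
    def process(self, line):
        try:
            yield json.loads(line)                   # valid record -> main output
        except Exception:
            yield TaggedOutput("dead_letter", line)  # corrupt record -> side output

results = lines | beam.ParDo(ParseOrReject()).with_outputs("dead_letter", main="valid")
valid_records = results.valid
rejected = results.dead_letter  # write to GCS/BigQuery and count per day for the report
```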
Your team needs to grant a third-party vendor access to read only a specific subset of tables in a BigQuery dataset for 60 days. The vendor should not see other datasets in the project. What is the recommended access approach?
A fraud detection model is trained in Vertex AI on historical data. After deployment, the fraud rate and customer behavior shift, and the model’s precision degrades. The team wants an automated way to detect feature distribution shifts and trigger investigation. What should you implement?
You run a global streaming pipeline that ingests events into BigQuery. Downstream analysts require consistent results when joining multiple event types, and they want to query "as of" a specific time without seeing partially processed late-arriving data. Which design best meets this requirement?
A healthcare company must keep patient data encrypted with customer-managed keys and ensure that if keys are disabled, access to the data is immediately prevented across both BigQuery and Cloud Storage. They also need centralized control and auditability of key usage. What should you recommend?
A product team wants analysts to run SQL on a shared dataset without being able to query columns containing PII (e.g., email, phone). The team also wants to avoid managing separate tables for masked data. What is the recommended approach in BigQuery?
You need to orchestrate a daily pipeline that loads raw files to Cloud Storage, runs a Dataflow batch transform, then executes BigQuery SQL for aggregations. Operations wants retries, dependencies, and alerting on failures with minimal custom code. Which approach should you choose?
A streaming pipeline writes events to BigQuery. Analysts complain that some dashboards show missing data for the last few minutes, but the pipeline appears healthy. You learn the pipeline uses BigQuery streaming inserts. What should you explain as the most likely cause?
You need to design a lakehouse-style architecture. Raw data (CSV/JSON/Parquet) lands in Cloud Storage, and multiple teams want to query it with BigQuery while retaining schema evolution and ACID-like guarantees on updates. What is the best approach?
A Dataflow streaming job reads from Pub/Sub and writes to BigQuery. During a backfill, Pub/Sub throughput spikes and the pipeline starts to fall behind. You need to reduce end-to-end latency and prevent hot keys from overwhelming certain workers. What should you do?
You are deploying a Vertex AI model for online prediction. Some requests require a specialized feature extraction step written in Python that must execute at request time. You want the feature logic versioned and deployed together with the model to reduce training/serving skew. What should you use?
A company is using BigQuery as the source of truth for KPIs. They want to prevent accidental deletion of tables and also ensure that any query accessing regulated datasets is auditable. What combination best satisfies these requirements?
You are troubleshooting a Vertex AI endpoint that occasionally returns 5xx errors under load. CPU and memory are not saturated, but request latency spikes coincide with autoscaling events. What is the most appropriate mitigation?
A regulated financial firm needs to build a multi-project analytics platform. Data scientists in a shared project must query sensitive BigQuery datasets located in separate producer projects, but producers must retain full control and must not grant broad dataset access. The firm also wants to reduce the risk of data exfiltration. What is the best design?
You operate an event-driven platform where multiple microservices publish events with evolving schemas. You must ingest these events into analytics with low latency, support schema evolution without breaking consumers, and allow replay/backfill. Which architecture is most appropriate on Google Cloud?
Need more practice?
Expand your preparation with our larger question banks
Professional Data Engineer 50 Practice Questions FAQs
The Professional Data Engineer certification is a professional-level credential from Google Cloud that validates your ability to design, build, operationalize, secure, and monitor data processing systems and machine learning workloads on Google Cloud.
Our 50 Professional Data Engineer practice questions include a curated selection of exam-style questions covering key concepts from all exam domains. Each question includes detailed explanations to help you learn.
A 50-question bank is a great starting point for Professional Data Engineer preparation. For comprehensive coverage, we recommend also working through our 100- and 200-question banks as you progress.
The 50 Professional Data Engineer questions are organized by exam domain and include a mix of easy, medium, and hard questions to test your knowledge at different levels.
More Preparation Resources
Explore other ways to prepare for your certification