50 Professional Data Engineer Practice Questions: Question Bank 2025
Build your exam confidence with our curated bank of 50 practice questions for the Professional Data Engineer certification. Each question includes detailed explanations to help you understand the concepts deeply.
Why Use Our 50-Question Bank?
Strategically designed questions to maximize your exam preparation
50 Questions
A comprehensive set of practice questions covering key exam topics
All Domains Covered
Questions distributed across all exam objectives and domains
Mixed Difficulty
Easy, medium, and hard questions to test all skill levels
Detailed Explanations
Learn from comprehensive explanations for each answer
Practice Questions
50 practice questions for the Professional Data Engineer exam
A retail company stores daily CSV sales extracts in Cloud Storage. Analysts want to query the files with standard SQL without loading them into a database first, and costs should stay low. What should you do?
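For readers who want to see the mechanics behind this scenario, here is a minimal sketch of one relevant technique: defining a BigQuery external table over the CSV files so they can be queried in place with standard SQL. The project, dataset, and bucket names are illustrative, not part of the question.

```python
# Minimal sketch (illustrative names): a BigQuery external table over CSVs in
# Cloud Storage lets analysts run standard SQL without loading the files first.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://example-sales-bucket/daily/*.csv"]
external_config.autodetect = True                 # infer the schema from the files
external_config.options.skip_leading_rows = 1     # skip the CSV header row

table = bigquery.Table("example-project.sales.daily_sales_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)        # queries now scan the files in place
```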
A team wants to orchestrate a nightly pipeline that loads data to BigQuery, runs data quality checks, and then triggers a Dataform workflow. They need retries, alerting, and a visual view of dependencies. Which Google Cloud service is the best fit?
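One service commonly evaluated for this kind of requirement is Cloud Composer (managed Apache Airflow). The fragment below is a hedged sketch of how such a DAG might express retries, alerting, and a visible dependency graph; the task bodies, schedule, and alert address are placeholders, not a prescribed implementation.

```python
# Hypothetical Airflow DAG sketch: load -> quality checks -> trigger Dataform,
# with retries and failure alerting. Task implementations are placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-alerts@example.com"],
}

with DAG(
    dag_id="nightly_bq_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
    default_args=default_args,
) as dag:
    load = BashOperator(task_id="load_to_bigquery", bash_command="echo load")
    checks = BashOperator(task_id="data_quality_checks", bash_command="echo checks")
    dataform = BashOperator(task_id="trigger_dataform", bash_command="echo dataform")

    load >> checks >> dataform   # dependency graph is visible in the Airflow UI
```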
You need to ensure that objects written to a Cloud Storage bucket are encrypted with customer-managed encryption keys (CMEK) and that only certain service accounts can use the key. What should you configure?
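As background for this scenario, the sketch below shows one way CMEK and key-level IAM fit together: a default Cloud KMS key on the bucket plus an IAM binding that limits which service accounts may use the key. All resource names and the service-agent address are illustrative.

```python
# Hypothetical sketch: set a default CMEK on a bucket and restrict key usage
# via IAM on the key itself. Resource names are illustrative.
from google.cloud import storage, kms

KEY_NAME = (
    "projects/example-project/locations/us/keyRings/data-keys/cryptoKeys/bucket-key"
)

# 1) Default CMEK on the bucket: new objects are encrypted with this key.
storage_client = storage.Client()
bucket = storage_client.get_bucket("example-secure-bucket")
bucket.default_kms_key_name = KEY_NAME
bucket.patch()

# 2) Grant only the required identities (e.g., the Cloud Storage service agent)
#    permission to use the key.
kms_client = kms.KeyManagementServiceClient()
policy = kms_client.get_iam_policy(request={"resource": KEY_NAME})
policy.bindings.add(
    role="roles/cloudkms.cryptoKeyEncrypterDecrypter",
    members=["serviceAccount:service-123456789@gs-project-accounts.iam.gserviceaccount.com"],
)
kms_client.set_iam_policy(request={"resource": KEY_NAME, "policy": policy})
```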
A data scientist trains models in Vertex AI and wants the serving endpoint to scale down to zero when idle to reduce cost, while still supporting HTTPS requests. Which deployment option should they choose?
A streaming pipeline ingests IoT sensor events. The pipeline must compute per-device rolling 10-minute aggregates and handle out-of-order events up to 5 minutes late. Which approach best satisfies these requirements?
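The Apache Beam (Python) fragment below sketches the windowing concepts this question exercises: sliding 10-minute windows keyed by device, with a 5-minute allowed lateness for out-of-order events. `events` is assumed to be an existing timestamped PCollection of (device_id, value) pairs.

```python
# Hypothetical Beam fragment: 10-minute rolling aggregates per device,
# accepting events that arrive up to 5 minutes late.
import apache_beam as beam
from apache_beam.transforms import window, trigger

aggregates = (
    events  # assumed: timestamped PCollection of (device_id, value) pairs
    | "RollingWindow" >> beam.WindowInto(
        window.SlidingWindows(size=600, period=60),   # 10-min windows, every minute
        allowed_lateness=300,                         # accept events up to 5 min late
        trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
    )
    | "SumPerDevice" >> beam.CombinePerKey(sum)
)
```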
A BigQuery dataset contains a table with sensitive PII columns (email, phone). Analysts should be able to query aggregates, but only a small compliance group can see raw PII. You want to minimize query rewrites and manage access centrally. What should you do?
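For context, the sketch below shows one building block often discussed with this scenario: attaching Data Catalog policy tags to PII columns so BigQuery column-level security governs who can read them, with no query rewrites or table copies. The taxonomy and policy-tag resource names are illustrative.

```python
# Hypothetical sketch: tag PII columns with a policy tag; access to those
# columns is then controlled centrally on the tag rather than per query.
from google.cloud import bigquery

client = bigquery.Client()

pii_tag = bigquery.PolicyTagList(
    names=["projects/example-project/locations/us/taxonomies/123/policyTags/456"]
)

schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("order_total", "NUMERIC"),
    bigquery.SchemaField("email", "STRING", policy_tags=pii_tag),
    bigquery.SchemaField("phone", "STRING", policy_tags=pii_tag),
]

table = bigquery.Table("example-project.sales.customers", schema=schema)
client.update_table(table, ["schema"])  # only readers granted on the tag see raw PII
```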
A batch Dataflow job intermittently fails with 'quota exceeded' errors when writing to BigQuery. The team wants to make the pipeline resilient without manual restarts and avoid duplicates in the destination table. What should they do?
A team trains a model in Vertex AI and wants to promote only models that outperform the currently deployed model on a fixed evaluation dataset. They also want a clear record of which dataset, code, and parameters produced each model. What should they implement?
A financial services company needs a disaster recovery design for a critical analytics dataset in BigQuery. Requirements: recovery from accidental deletion, protection from region-level failures, and minimal operational overhead. Which approach best meets these needs?
You are designing a feature store-like pipeline for near-real-time predictions. Features come from streaming events (Pub/Sub) and batch backfills (BigQuery). The online serving system needs low-latency key-based lookups, and you must ensure training/serving consistency (same feature definitions). What is the best architecture on Google Cloud?
A retail company stores daily sales files in Cloud Storage. A scheduled pipeline loads the files into BigQuery each morning. Some days, the same file is accidentally uploaded twice, and duplicates appear in BigQuery. The company wants an approach that prevents duplicates without manual cleanup and keeps the pipeline simple. What should you do?
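One idempotency pattern worth knowing for this scenario is a staging-table load followed by a MERGE keyed on a stable identifier, sketched below with illustrative table and column names.

```python
# Hypothetical sketch: load each file into a staging table, then MERGE into
# the target so re-loading the same file inserts nothing new.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `example-project.sales.daily_sales` AS target
USING `example-project.sales.daily_sales_staging` AS staging
ON target.order_id = staging.order_id
WHEN NOT MATCHED THEN
  INSERT (order_id, store_id, amount, sale_date)
  VALUES (staging.order_id, staging.store_id, staging.amount, staging.sale_date)
"""

client.query(merge_sql).result()  # idempotent: duplicate uploads create no duplicate rows
```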
You are serving an ML model on Vertex AI for online predictions. Security requires that predictions are made without sending traffic over the public internet. The client applications run on Compute Engine in a shared VPC. What should you implement?
A data engineering team wants to standardize how datasets and models move from development to production across multiple projects. They need an approval gate, repeatable deployments, and auditable change history. What is the best approach on Google Cloud?
A company is moving from an on-prem EDW to BigQuery. They have 5 years of historical data (rarely queried) and 6 months of recent data (queried frequently). They want to optimize performance and keep storage organized. What should you recommend?
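As background, the sketch below creates a date-partitioned, clustered BigQuery table: recent partitions stay cheap to scan, and partitions untouched for 90 days move automatically to long-term storage pricing. Table and column names are illustrative.

```python
# Hypothetical sketch: partition by sale date and cluster by store so queries
# over recent data prune old partitions; aging partitions need no extra work.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "example-project.edw.sales_history",
    schema=[
        bigquery.SchemaField("sale_date", "DATE"),
        bigquery.SchemaField("store_id", "STRING"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="sale_date",
)
table.clustering_fields = ["store_id"]
client.create_table(table)
```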
A streaming pipeline uses Pub/Sub and Dataflow to aggregate clickstream events into per-minute metrics written to BigQuery. After a schema change, the pipeline starts failing with errors about mismatched fields when writing to BigQuery. The team wants to roll out schema changes safely with minimal downtime. What should they do?
Your organization trains models in Vertex AI and deploys them as endpoints. A compliance requirement states that you must be able to reproduce any prediction by tying it back to the exact training dataset snapshot, code version, and model artifact. What should you implement?
A Dataflow batch job reads from Cloud Storage and writes to BigQuery. It occasionally becomes significantly slower, and the team suspects uneven work distribution due to a few very large input files. They want to improve throughput without changing the business logic. What should they do?
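One common rebalancing technique relevant here is breaking fusion with a Reshuffle after the read, sketched below as a Beam (Python) fragment; `pipeline` and `process_record` are assumed to exist, and the input path is illustrative.

```python
# Hypothetical Beam fragment: redistributing records after the read prevents a
# few very large files from pinning most of the work to a handful of workers.
import apache_beam as beam

transformed = (
    pipeline
    | "ReadFiles" >> beam.io.ReadFromText("gs://example-bucket/input/*.csv")
    | "Redistribute" >> beam.Reshuffle()       # rebalance elements across workers
    | "Transform" >> beam.Map(process_record)  # business logic unchanged
)
```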
You need to validate data quality for a BigQuery dataset used by multiple downstream reports. Requirements include: detect nulls in critical columns, enforce referential integrity-like checks between dimension and fact tables, and publish a daily quality report. The solution should be managed and integrate well with BigQuery. What should you use?
A financial services company must ensure that sensitive fields (SSNs, bank account numbers) in raw ingestion logs are never accessible to analysts, but data engineers must still be able to reprocess raw data for compliance audits. They also want analysts to access de-identified data in BigQuery. What is the best architecture?
You run a feature generation pipeline that computes features from streaming events and writes them to an online store for low-latency serving and to an offline store for training. The serving model’s performance degraded, and you suspect training/serving skew due to different transformations used in batch vs streaming. What should you do to minimize skew going forward?
A data engineering team stores curated datasets in BigQuery. Analysts must be prevented from accidentally incurring very large query costs, but they still need the flexibility to explore data. What is the BEST way to enforce this across projects?
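For context, one cost control this question touches on is capping bytes billed per query; the sketch below sets the cap on a single job via the Python client (organization-wide enforcement is typically handled with custom query quotas rather than per-job settings). Names and the 100 GiB limit are illustrative.

```python
# Hypothetical sketch: a query that would scan more than the cap fails fast
# instead of running and incurring the cost.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=100 * 1024**3,  # fail any query scanning more than 100 GiB
)
query = "SELECT store_id, SUM(amount) FROM `example-project.sales.orders` GROUP BY store_id"
rows = client.query(query, job_config=job_config).result()
```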
You have a streaming pipeline writing events into BigQuery. Downstream dashboards must not show duplicate events, and duplicates can arrive minutes apart. Each event has a stable event_id. Which approach is MOST appropriate?
A team uses Vertex AI to train a model. They need to reproduce a specific model later and prove what data and parameters were used. Which feature BEST supports this requirement?
A company runs nightly Dataflow batch jobs that read from Cloud Storage and write to BigQuery. Some runs fail due to transient BigQuery quota errors. They want automatic retries without rerunning successful work and want to minimize duplicate rows. What should they do?
A retail company wants to build a feature store for both online predictions (low-latency) and offline training. They already use BigQuery for analytics and need point-in-time correct historical features for training to avoid label leakage. What is the BEST approach on Google Cloud?
A data platform ingests JSON events with evolving schemas into BigQuery. The team wants to minimize pipeline breakages when new optional fields appear, while keeping data queryable. Which design is BEST?
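The sketch below illustrates one schema-evolution lever in BigQuery load jobs: ALLOW_FIELD_ADDITION combined with schema auto-detection, so new optional JSON fields become nullable columns instead of failing the load. URIs and table names are illustrative.

```python
# Hypothetical sketch: append newline-delimited JSON and let new optional
# fields be added to the table schema rather than breaking the pipeline.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

client.load_table_from_uri(
    "gs://example-events-bucket/events/*.json",
    "example-project.analytics.raw_events",
    job_config=job_config,
).result()
```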
Your organization has multiple teams publishing datasets to BigQuery. You need consistent data governance: dataset-level access, column-level security for PII, and the ability to audit who accessed sensitive fields. What should you implement?
A real-time fraud detection service uses a model deployed on Vertex AI endpoints. After deployment, the fraud rate detected drops significantly, but offline evaluation metrics remain stable. You suspect data drift in production features. What should you do FIRST?
You are designing a cross-region disaster recovery strategy for a critical BigQuery dataset used for compliance reporting. Requirements: RPO of minutes, minimal operational overhead, and the ability to fail over if a region becomes unavailable. What is the BEST solution?
A Beam pipeline on Dataflow processes messages from Pub/Sub and writes to Bigtable. During a traffic spike, you observe increased processing lag and Bigtable hot-spotting. The row key is currently constructed as userId + timestamp. What change is MOST likely to fix the hot-spotting while preserving efficient queries by user and time?
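As background on the row-key design this question probes, the sketch below salts the user ID into a small, bounded shard prefix so writes spread across tablets while per-user, time-range reads remain efficient prefix scans. The key format and shard count are illustrative.

```python
# Hypothetical sketch: shard#user_id#timestamp keeps one user's rows contiguous
# within a shard, so a per-user, per-time-range query is still a prefix scan,
# while the shard prefix spreads hot write traffic across tablets.
import hashlib

def build_row_key(user_id: str, event_ts_millis: int, num_shards: int = 32) -> bytes:
    shard = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % num_shards
    return f"{shard:02d}#{user_id}#{event_ts_millis:013d}".encode()

print(build_row_key("user-42", 1735689600000))
```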
A retail company streams click events into Pub/Sub and uses a Dataflow streaming pipeline to write to BigQuery. During an incident, the pipeline stops for 20 minutes and then restarts. The business requires no data loss and no duplicate rows in BigQuery. What should you implement?
Your organization stores analytics datasets in BigQuery. Only a small group should see columns containing PII (email, phone), while analysts should still be able to query non-PII columns in the same tables without creating separate copies. What is the recommended approach?
A data engineer needs to run a daily ELT SQL transformation that depends on upstream table loads and should fail fast if upstream data is missing. The team wants minimal infrastructure management. Which solution best fits?
You deployed a Vertex AI endpoint for online predictions. Your SRE team wants to monitor for prediction service degradation and get alerted on elevated error rates and latency without building custom instrumentation. What should you do?
A media company stores raw video metadata as JSON in Cloud Storage. They want to run ad-hoc SQL across all files with minimal ETL and keep the data in place. Query performance should be acceptable for interactive analysis. What is the best approach?
A Dataflow batch pipeline reading from Cloud Storage occasionally fails due to corrupt input records. The business wants the pipeline to continue processing valid records, capture bad records for later review, and produce a count of rejected records per day. What should you do?
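A pattern frequently discussed for this scenario is a dead-letter side output, sketched below as a Beam (Python) fragment: valid records continue on the main output while corrupt records are tagged for later review and daily counting. `lines` is an assumed input PCollection of raw text records.

```python
# Hypothetical Beam fragment: route corrupt records to a "dead letter" output
# so valid records keep flowing and rejects can be stored and counted.
import json
import apache_beam as beam
from apache_beam.pvalue import TaggedOutput

class ParseOrReject(beam.DoFn):
    def process(self, line):
        try:
            yield json.loads(line)                   # valid record -> main output
        except Exception:
            yield TaggedOutput("dead_letter", line)  # corrupt record -> side output

results = lines | beam.ParDo(ParseOrReject()).with_outputs("dead_letter", main="valid")
valid_records = results.valid
rejected = results.dead_letter  # write to GCS/BigQuery and count per day for the report
```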
Your team needs to grant a third-party vendor access to read only a specific subset of tables in a BigQuery dataset for 60 days. The vendor should not see other datasets in the project. What is the recommended access approach?
A fraud detection model is trained in Vertex AI on historical data. After deployment, the fraud rate and customer behavior shift, and the model’s precision degrades. The team wants an automated way to detect feature distribution shifts and trigger investigation. What should you implement?
You run a global streaming pipeline that ingests events into BigQuery. Downstream analysts require consistent results when joining multiple event types, and they want to query "as of" a specific time without seeing partially processed late-arriving data. Which design best meets this requirement?
A healthcare company must keep patient data encrypted with customer-managed keys and ensure that if keys are disabled, access to the data is immediately prevented across both BigQuery and Cloud Storage. They also need centralized control and auditability of key usage. What should you recommend?
A product team wants analysts to run SQL on a shared dataset without being able to query columns containing PII (e.g., email, phone). The team also wants to avoid managing separate tables for masked data. What is the recommended approach in BigQuery?
You need to orchestrate a daily pipeline that loads raw files to Cloud Storage, runs a Dataflow batch transform, then executes BigQuery SQL for aggregations. Operations wants retries, dependencies, and alerting on failures with minimal custom code. Which approach should you choose?
A streaming pipeline writes events to BigQuery. Analysts complain that some dashboards show missing data for the last few minutes, but the pipeline appears healthy. You learn the pipeline uses BigQuery streaming inserts. What should you explain as the most likely cause?
You need to design a lakehouse-style architecture. Raw data (CSV/JSON/Parquet) lands in Cloud Storage, and multiple teams want to query it with BigQuery while retaining schema evolution and ACID-like guarantees on updates. What is the best approach?
A Dataflow streaming job reads from Pub/Sub and writes to BigQuery. During a backfill, Pub/Sub throughput spikes and the pipeline starts to fall behind. You need to reduce end-to-end latency and prevent hot keys from overwhelming certain workers. What should you do?
You are deploying a Vertex AI model for online prediction. Some requests require a specialized feature extraction step written in Python that must execute at request time. You want the feature logic versioned and deployed together with the model to reduce training/serving skew. What should you use?
A company is using BigQuery as the source of truth for KPIs. They want to prevent accidental deletion of tables and also ensure that any query accessing regulated datasets is auditable. What combination best satisfies these requirements?
You are troubleshooting a Vertex AI endpoint that occasionally returns 5xx errors under load. CPU and memory are not saturated, but request latency spikes coincide with autoscaling events. What is the most appropriate mitigation?
A regulated financial firm needs to build a multi-project analytics platform. Data scientists in a shared project must query sensitive BigQuery datasets located in separate producer projects, but producers must retain full control and must not grant broad dataset access. The firm also wants to reduce the risk of data exfiltration. What is the best design?
You operate an event-driven platform where multiple microservices publish events with evolving schemas. You must ingest these events into analytics with low latency, support schema evolution without breaking consumers, and allow replay/backfill. Which architecture is most appropriate on Google Cloud?
Need more practice?
Expand your preparation with our larger question banks
Professional Data Engineer 50 Practice Questions FAQs
The Professional Data Engineer certification is a professional-level credential from Google Cloud that validates your ability to design, build, operationalize, secure, and monitor data processing systems and machine learning workloads on Google Cloud.
Our 50 Professional Data Engineer practice questions include a curated selection of exam-style questions covering key concepts from all exam domains. Each question includes detailed explanations to help you learn.
A 50-question bank is a great starting point for Professional Data Engineer preparation. For comprehensive coverage, we recommend also working through our 100- and 200-question banks as you progress.
The 50 Professional Data Engineer questions are organized by exam domain and include a mix of easy, medium, and hard questions to test your knowledge at different levels.
More Preparation Resources
Explore other ways to prepare for your certification