Data Engineer Interview Questions
Prepare for your data engineer interview with 10 expert-curated questions and sample answers covering pipelines, SQL, data modeling, and behavioral topics.
Behavioral Questions
Tell me about a time a pipeline you owned failed in production. What happened?
behavioral | intermediate
Sample Answer
An upstream API silently changed a date format, and our parser wrote nulls for a day before an analyst noticed. I led the fix: backfilled from raw data within four hours — possible only because we landed immutable raw copies — then added schema contract tests and a row-level null anomaly alert. The lasting change was cultural: we now treat every external source as adversarial.
Tip: Choose a real failure with a systemic fix. 'It never failed' is the only wrong answer.
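The two safeguards named in the sample answer can be sketched in a few lines. This is a minimal illustration, not the team's actual implementation: the date format, the column name `event_date`, and the alert tolerance are all assumptions made up for the example.

```python
from datetime import datetime

# Assumed contract with the upstream API (hypothetical for this sketch).
EXPECTED_DATE_FORMAT = "%Y-%m-%d"


def check_date_contract(sample_values):
    """Return the values that do not parse with the agreed date format."""
    violations = []
    for value in sample_values:
        try:
            datetime.strptime(value, EXPECTED_DATE_FORMAT)
        except (TypeError, ValueError):
            # Fail loudly here instead of letting the parser write a null.
            violations.append(value)
    return violations


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def should_alert(rows, column, baseline_rate, tolerance=0.05):
    """Alert when the null rate exceeds the usual rate by more than the tolerance."""
    return null_rate(rows, column) > baseline_rate + tolerance


loaded = [{"event_date": "2024-01-05"}, {"event_date": None},
          {"event_date": None}, {"event_date": "2024-01-06"}]
print(check_date_contract(["2024-01-05", "05/01/2024"]))
print(should_alert(loaded, "event_date", baseline_rate=0.01))
```

The contract check runs against a sample of the source before parsing, while the null-rate alert runs after the load, so a format change is caught even if it slips past the first check.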
Where do you see data engineering going in the next few years, and how are you preparing?
behavioral | beginner
Sample Answer
Two shifts: AI workloads making unstructured data pipelines and vector stores first-class citizens, and declarative tooling absorbing more pipeline plumbing. I'm preparing by building RAG ingestion pipelines in side projects and going deeper on data contracts and quality — the judgment layers tools won't replace.
Tip: Show curiosity with specifics — name a technology you've actually experimented with.
Technical Questions
Walk me through a data pipeline you built end-to-end. What were the hardest design decisions?
technical | intermediate
Sample Answer
I built a CDC pipeline ingesting 50M daily change events from PostgreSQL into Snowflake using Debezium and Kafka. The hardest decisions were how to get exactly-once results on top of at-least-once delivery, which we handled with idempotent merges on stable keys, and whether to land raw data immutably before transformation, which later saved us during a three-day backfill when a business rule changed.
Tip: Pick one pipeline and go deep. Interviewers probe trade-offs (latency vs. cost, batch vs. streaming), not tool names.
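The idempotent merge mentioned in the sample answer is easier to defend in an interview with a small example. The sketch below is one way to do it, assuming each change event carries a primary key `id` and a monotonically increasing log position `lsn`; both field names are hypothetical, and a real warehouse would express the same logic as a MERGE statement.

```python
def merge_events(table, events):
    """Upsert change events into `table` (a dict keyed by primary key).

    An event is applied only if its log position is newer than the row
    already stored, so duplicates and replays leave the table unchanged.
    """
    for event in events:
        current = table.get(event["id"])
        if current is not None and current["lsn"] >= event["lsn"]:
            continue  # duplicate or stale event: safe to skip
        table[event["id"]] = event
    return table


batch = [
    {"id": 1, "lsn": 10, "status": "new"},
    {"id": 1, "lsn": 12, "status": "paid"},
    {"id": 2, "lsn": 11, "status": "new"},
]

once = merge_events({}, batch)
replayed = merge_events(merge_events({}, batch), batch)
print(once == replayed)
```

Because replaying a batch produces the same table, the consumer can tolerate at-least-once delivery from Kafka and still end up with each change applied exactly once.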
How do you ensure data quality in your pipelines?
technical | intermediate
Sample Answer
I layer checks: schema validation at ingestion; dbt tests for nulls, uniqueness, and referential integrity in transformation; and anomaly detection on row counts and key metrics post-load. Failures page the team through Airflow alerts before stakeholders see bad dashboards. In my last role this caught a vendor silently dropping a column within 20 minutes.
Tip: Mention detection AND prevention AND alerting — most candidates only describe one layer.
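The three layers in the sample answer can be illustrated in plain Python. This is a sketch under stated assumptions: the schema, column names, and z-score threshold are invented for the example, and in practice the middle layer would usually be dbt tests rather than hand-written functions.

```python
from statistics import mean, stdev

# Hypothetical expected schema for an orders feed.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "customer_id": int}


def schema_violations(row, schema=EXPECTED_SCHEMA):
    """Layer 1 (ingestion): report missing columns and wrong types."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif row[column] is not None and not isinstance(row[column], expected_type):
            problems.append(f"wrong type for {column}: {type(row[column]).__name__}")
    return problems


def key_violations(rows, key):
    """Layer 2 (transformation): not-null and uniqueness checks on a key."""
    seen, problems = set(), []
    for row in rows:
        value = row.get(key)
        if value is None:
            problems.append("null key")
        elif value in seen:
            problems.append(f"duplicate key: {value}")
        else:
            seen.add(value)
    return problems


def row_count_is_anomalous(today, history, max_z=3.0):
    """Layer 3 (post-load): flag a row count far from the recent average."""
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = stdev(history)
    if sigma == 0:
        return today != history[0]
    return abs(today - mean(history)) / sigma > max_z


print(schema_violations({"order_id": 1, "amount": 9.5}))
print(row_count_is_anomalous(100, [1000, 1010, 990, 1005]))
```

A dropped vendor column, as in the anecdote, would surface in the first layer as a `missing column` violation before any transformation runs.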
Explain the difference between a data warehouse, data lake, and lakehouse. When would you choose each?
technical | beginner
Sample Answer
A warehouse stores structured, modeled data optimized for SQL analytics; a lake stores raw files cheaply in any format; a lakehouse layers warehouse features — ACID transactions, schema enforcement — on lake storage via formats like Delta or Iceberg. I'd choose a warehouse for BI-heavy teams, a lake for ML and raw retention, and a lakehouse when one platform must serve both without duplicating data.
Tip: End with the decision criteria — the 'when' matters more than the definitions.
How would you design a pipeline that must deliver data within 15 minutes of an event occurring?
technical | advanced
Sample Answer
Fifteen minutes allows micro-batch rather than true streaming, which is cheaper and simpler. I'd use Kafka for ingestion, Spark Structured Streaming on 5-minute triggers or Snowpipe for continuous loading, and incremental dbt models downstream. I'd set a freshness SLO with end-to-end monitoring, and design for replay: the question isn't whether the pipeline falls behind but how it catches up.
Tip: Clarify the SLA first — answering '15 minutes' with a Flink architecture signals over-engineering.
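A quick latency budget shows why a 5-minute micro-batch can fit a 15-minute target, and a freshness check shows what the monitoring would measure. The processing and model-run durations below are assumed numbers for illustration only; the 5-minute trigger and 15-minute target come from the answer above.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLO = timedelta(minutes=15)


def worst_case_latency(trigger_interval, batch_processing, downstream_run):
    """Worst case: an event lands just after a trigger fires, waits a full
    interval, then goes through batch processing and the downstream models."""
    return trigger_interval + batch_processing + downstream_run


def breaches_slo(latest_event_time, now, slo=FRESHNESS_SLO):
    """True when the newest event visible downstream is older than the SLO."""
    return now - latest_event_time > slo


budget = worst_case_latency(
    trigger_interval=timedelta(minutes=5),
    batch_processing=timedelta(minutes=2),  # assumed
    downstream_run=timedelta(minutes=4),    # assumed
)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(budget, budget <= FRESHNESS_SLO)
print(breaches_slo(now - timedelta(minutes=20), now))
```

Under these assumptions the worst case is 11 minutes, which leaves 4 minutes of headroom for catching up after a delay.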
How do you decide between normalizing and denormalizing data in a warehouse?
technical | intermediate
Sample Answer
I model normalized in staging for correctness and maintainability, then denormalize into wide marts where query patterns demand it — BI tools and analysts benefit from fewer joins, and storage is cheap relative to analyst time. The deciding factors are query frequency, join cost at scale, and how often the dimensional attributes change.
Tip: Frame it as layers serving different consumers, not a single global choice.
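The layering described above can be shown with an in-memory SQLite database: normalized staging tables, then a wide mart that pre-joins them. Table and column names are hypothetical, and a real project would typically build the mart as a dbt model rather than inline SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized staging: each fact lives in exactly one place.
CREATE TABLE stg_customers (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE stg_orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO stg_customers VALUES (1, 'EMEA'), (2, 'APAC');
INSERT INTO stg_orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);

-- Denormalized mart: the join is paid once, at build time.
CREATE TABLE mart_orders_wide AS
SELECT o.order_id, o.amount, c.customer_id, c.region
FROM stg_orders AS o
JOIN stg_customers AS c ON o.customer_id = c.customer_id;
""")

# Analysts query the wide table without writing any joins.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM mart_orders_wide GROUP BY region ORDER BY region"
).fetchall()
print(revenue_by_region)
```

The trade-off is visible in the mart: if a customer's region changes, every one of their rows in `mart_orders_wide` has to be rebuilt, which is why attribute change frequency is one of the deciding factors.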
What strategies do you use to control cloud data platform costs?
technical | intermediate
Sample Answer
Visibility first: per-query and per-pipeline cost attribution so owners see their spend. Then the big levers — warehouse auto-suspend, right-sized clusters, partition pruning, incremental models instead of full refreshes, and lifecycle policies on raw storage. At my last company this combination cut our Snowflake bill 38% in one quarter without removing a single dataset.
Tip: Cost questions are increasingly common — bring one concrete number from your own experience.
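Per-pipeline cost attribution, the "visibility first" step in the sample answer, amounts to grouping query spend by an owner tag. The sketch below assumes a query log where each entry has a `pipeline` tag and a `credits` figure; those field names and the credit price are hypothetical, not a real platform's schema.

```python
from collections import defaultdict


def cost_by_pipeline(query_log, credit_price):
    """Sum spend per pipeline tag, most expensive first.

    Queries without a tag are grouped under 'untagged' so that
    unattributed spend stays visible instead of disappearing.
    """
    totals = defaultdict(float)
    for query in query_log:
        totals[query.get("pipeline") or "untagged"] += query["credits"] * credit_price
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


log = [
    {"pipeline": "orders", "credits": 2.0},
    {"pipeline": "orders", "credits": 1.0},
    {"pipeline": None, "credits": 0.5},
]
print(cost_by_pipeline(log, credit_price=3.0))
```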
Situational Questions
A dashboard shows numbers that don't match the source system. How do you debug it?
situational | intermediate
Sample Answer
I work backward through the lineage: first confirm the discrepancy and its magnitude, then compare row counts and aggregates at each stage — dashboard query, mart table, staging, raw landing, source extract. The divergence point tells you the layer at fault. Most often it's timezone boundaries, late-arriving data, or a filter discrepancy in the semantic layer; I document the root cause and add a reconciliation test so it can't silently recur.
Tip: Show a systematic bisection method. Naming the usual suspects (timezones, late data, dedup logic) proves experience.
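The backward walk through the lineage can be expressed as a small function. This sketch assumes you have already computed the same aggregate at every stage and that there is a single divergence point; the stage names follow the answer above and the totals are invented.

```python
def find_faulty_stage(stages_downstream_first, source_total, tolerance=0.01):
    """Walk from the dashboard back toward the source.

    Returns the most upstream stage that still disagrees with the source,
    which is where the discrepancy was introduced, or None if every stage
    matches.
    """
    suspect = None
    for name, total in stages_downstream_first:
        if abs(total - source_total) <= tolerance:
            break  # this stage agrees, so the fault sits just downstream of it
        suspect = name
    return suspect


stages = [
    ("dashboard query", 980.0),
    ("mart table", 980.0),
    ("staging", 1000.0),
    ("raw landing", 1000.0),
]
print(find_faulty_stage(stages, source_total=1000.0))
```

Here staging still matches the source and the mart does not, so the investigation narrows to the staging-to-mart transformation.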
How do you handle a stakeholder who insists their numbers are wrong, but your pipeline is correct?
situational | beginner
Sample Answer
I treat it as a definitions problem until proven otherwise. I sit with them, reproduce their expected number, and trace both calculations — nine times out of ten the difference is filters, time windows, or metric definitions rather than pipeline bugs. Then I codify the agreed definition in the semantic layer so the ambiguity can't return.
Tip: Empathy plus rigor. Dismissing stakeholders or blindly 'fixing' correct pipelines are both red flags.
Preparation Tips
Practice SQL hard — window functions, CTEs, and query optimization questions open most data engineering interviews.
Prepare one pipeline story you can draw on a whiteboard: sources, transport, transformation, storage, consumers, and failure modes.
Know your numbers: data volumes, latencies, costs, and reliability percentages from past work.
Review the company's likely stack from the job posting and be ready to compare it to yours honestly.
Be ready for a take-home or live exercise involving a messy CSV — practice profiling and cleaning data quickly.
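For the SQL practice in the first tip above, a common warm-up combines a CTE with a window function to keep the latest row per key. The example below runs against in-memory SQLite and assumes a SQLite build with window function support (3.25 or later); the table and its contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, status TEXT, updated_at TEXT);
INSERT INTO events VALUES
    (1, 'trial', '2024-01-01'),
    (1, 'paid',  '2024-02-01'),
    (2, 'trial', '2024-01-15');
""")

# The CTE ranks each user's rows from newest to oldest;
# the outer query keeps only the newest row per user.
latest = conn.execute("""
WITH ranked AS (
    SELECT user_id, status,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
    FROM events
)
SELECT user_id, status FROM ranked WHERE rn = 1 ORDER BY user_id
""").fetchall()
print(latest)
```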