Machine Learning Engineer Interview Questions
Prepare for your ML engineer interview with 10 expert-curated questions and sample answers covering MLOps, model serving, LLMs, and system design.
Behavioral Questions
Tell me about a time you had to significantly reduce model serving costs or latency.
behavioral, intermediate
Sample Answer
Our recommendation model's GPU serving bill was unsustainable. I profiled and found 70% of compute went to candidates that rarely won. I introduced a two-stage design — a cheap candidate filter followed by the full ranker on the top 100 — then quantized the ranker to INT8. Latency dropped 64%, cost fell by half, and offline metrics moved less than half a percent.
Tip: Distillation, quantization, caching, and multi-stage ranking are the levers — have a story using at least one.
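If the interviewer asks you to whiteboard the two-stage design from the sample answer, a minimal sketch might look like the following. The function name, the two scoring callables, and the default cutoffs are assumptions for illustration (`keep=100` mirrors the "top 100" above); the real filter and ranker would be models, not plain functions.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def two_stage_rank(
    candidates: Iterable[T],
    cheap_score: Callable[[T], float],
    full_score: Callable[[T], float],
    keep: int = 100,
    top_n: int = 10,
) -> List[T]:
    """Filter with a cheap scorer, then rank the survivors with the full model.

    Hypothetical helper: both scorers are stand-ins for real models.
    """
    # Stage 1: the cheap scorer sees every candidate and keeps the best `keep`.
    shortlist = heapq.nlargest(keep, candidates, key=cheap_score)
    # Stage 2: the expensive ranker runs on at most `keep` items.
    return heapq.nlargest(top_n, shortlist, key=full_score)
```

The cost saving comes from the second stage: however many candidates arrive, the expensive scorer is invoked at most `keep` times per request.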
Describe a disagreement with a data scientist about productionizing their model.
behavioral, intermediate
Sample Answer
A data scientist wanted to ship a notebook model with 200 features, several from sources with no production SLA. Rather than vetoing, I quantified it: we measured feature importance, found 20 features captured 95% of performance, and shipped that version with reliable sources in two weeks. The full model became a later iteration once the data contracts existed. Framing it as sequencing, not rejection, kept the partnership healthy.
Tip: Show respect for the modeling work while holding the production bar — collaboration stories beat being right.
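One way to back a claim like "20 features captured 95%" is a cumulative-importance cut. The sketch below is an assumption about how that analysis could be done, not the method from the story: the function name and the importance scores are hypothetical, and importance share is only a proxy, so you would still retrain on the reduced set and compare the actual metric.

```python
from typing import Dict, List


def smallest_feature_set(importances: Dict[str, float], target: float = 0.95) -> List[str]:
    """Return the fewest top-ranked features whose importance share reaches `target`."""
    total = sum(importances.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive value")
    selected: List[str] = []
    running = 0.0
    # Walk features from most to least important, stopping once the share is met.
    for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        running += value
        if running / total >= target:
            break
    return selected
```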
Technical Questions
Design the serving architecture for a model that must handle 10,000 requests per second at low latency.
technical, advanced
Sample Answer
I'd start with the latency budget and split it: feature fetching, inference, network. For 10K RPS I'd serve a distilled or quantized model behind a load balancer with horizontal autoscaling, precompute heavy features into a low-latency store like Redis, batch requests dynamically where the budget allows, and cache responses for repeated inputs. Then load test against p99 — not average — latency before launch.
Tip: Ask about the latency SLO before designing. Jumping to architecture without requirements is the common fail.
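It helps to put rough numbers behind this answer. The sketch below shows three back-of-envelope helpers: Little's law for concurrency, a replica estimate, and a nearest-rank percentile for load-test results. Everything numeric in the usage note (50 ms latency, 400 RPS per replica, 70% target utilization) is an assumed figure for illustration, not a benchmark.

```python
import math
from typing import List


def in_flight_requests(rps: float, latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return rps * latency_s


def replicas_needed(rps: float, per_replica_rps: float, utilization: float = 0.7) -> int:
    """Replicas required so each one runs at or below `utilization` of its capacity."""
    return math.ceil(rps / (per_replica_rps * utilization))


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile, enough for summarizing a load test."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With the assumed figures, `in_flight_requests(10_000, 0.05)` gives about 500 concurrent requests and `replicas_needed(10_000, 400)` gives 36 replicas. The percentile helper shows why the answer insists on p99: for 98 requests at 10 ms and 2 at 500 ms, the mean is under 20 ms while p99 is 500 ms.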
What is training-serving skew and how do you prevent it?
technical, intermediate
Sample Answer
It's when features differ between training and inference — different code paths, different data freshness, or leakage of future information into training. I prevent it with a feature store that serves the same definitions to both paths, point-in-time-correct training joins, and skew monitoring that compares live feature distributions against training distributions.
Tip: Feature stores and point-in-time correctness are the keywords interviewers listen for here.
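If asked what "point-in-time correct" means concretely, a small sketch can help. The function below is a hypothetical stand-in for what a feature store does during a training join: for each labeled event it returns the latest feature value recorded at or before the event time, never a later one. The name `feature_as_of` and the in-memory list are assumptions for illustration.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def feature_as_of(
    history: List[Tuple[datetime, float]], event_time: datetime
) -> Optional[float]:
    """Latest feature value recorded at or before `event_time`.

    `history` must be sorted by timestamp. Values written after the event are
    never returned, which keeps future information out of training rows.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first timestamp strictly after the event.
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx else None
```

Whether a value stamped exactly at the event time should count is a design choice; this sketch includes it, and a stricter pipeline might exclude it.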
How do you monitor a model in production?
technical, intermediate
Sample Answer
Three layers: system health (latency, errors, throughput), input health (feature drift, null spikes, distribution shifts), and output health (prediction distribution, and true performance once labels arrive). Since labels often lag, I use proxy metrics — like downstream conversion — for early warning, with alerts wired to thresholds the team agreed on before launch.
Tip: The label-lag problem is the depth test — address how you monitor before ground truth exists.
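For the input-health layer, one common drift score is the population stability index (PSI). The sketch below is one possible implementation, assuming equal-width bins taken from the training sample and a small floor (`eps`) to avoid dividing by zero; the function name and defaults are illustrative choices, not a standard API.

```python
import math
from bisect import bisect_right
from typing import List


def psi(expected: List[float], actual: List[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("training sample has no spread")
    # Interior bin edges come from the training range; live values outside
    # that range fall into the first or last bin.
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def shares(values: List[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the log ratio stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A score of zero means the binned distributions match. Alert thresholds (values around 0.1 to 0.25 are often quoted as rules of thumb) should be agreed with the team before launch rather than taken as fixed.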
How would you build a RAG system for a company knowledge base, and where do they typically fail?
technical, advanced
Sample Answer
Pipeline: chunk documents semantically, embed and index them in a vector store, retrieve with hybrid search — dense plus keyword — rerank, then generate with citations. They typically fail at retrieval, not generation: bad chunking, stale indexes, and queries that need joins across documents. I'd invest in retrieval evaluation with a golden question set before touching prompt engineering.
Tip: 'RAG failures are retrieval failures' is the experienced take. Mention evaluation — most candidates skip it.
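Two pieces of this answer are easy to sketch: fusing dense and keyword results, and scoring retrieval against a golden question set. The example below uses reciprocal rank fusion (RRF) as one way to combine the two rankings; the sample answer does not say which fusion method to use, so treat that choice, the function names, and the constant `k=60` as assumptions.

```python
from typing import Dict, List, Set


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the known-relevant documents found in the top k results."""
    if not relevant:
        raise ValueError("golden question has no relevant documents")
    return len(set(retrieved[:k]) & relevant) / len(relevant)
```

Averaging `recall_at_k` over the golden set gives a single retrieval number to track before any prompt work starts.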
How do you evaluate an LLM-powered feature before launch?
technical, advanced
Sample Answer
I build an eval set from real expected inputs with graded references, then layer automated checks: exact-match or rubric scoring where possible, LLM-as-judge with calibration against human ratings for open-ended outputs, plus red-team suites for safety and injection. Gate releases on eval regression, just like unit tests — vibes-based prompt iteration doesn't scale past the demo.
Tip: Treating evals as CI for prompts is the answer that signals production LLM experience.
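"Gate releases on eval regression" can be as simple as comparing a candidate's scores with a stored baseline. The sketch below assumes every metric is higher-is-better and uses a hypothetical tolerance of 0.02; the metric names in the usage note are made up for illustration.

```python
from typing import Dict, List


def regressions(
    baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.02
) -> List[str]:
    """Metrics where the candidate is missing or worse than baseline by more than `tolerance`."""
    failed = []
    for metric, base in baseline.items():
        new = candidate.get(metric)
        # A metric that was not measured counts as a failure, not a pass.
        if new is None or new < base - tolerance:
            failed.append(metric)
    return sorted(failed)
```

In a CI job, a non-empty result would fail the build. For example, with a baseline of `{"exact_match": 0.80, "judge_score": 0.90}` and a candidate of `{"exact_match": 0.81, "judge_score": 0.85}`, the function returns `["judge_score"]`.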
How do you decide between fine-tuning a model and using prompting or RAG?
technical, intermediate
Sample Answer
Prompting first — fastest iteration, no infrastructure. Add RAG when the task needs knowledge the model lacks, especially fresh or proprietary data. Fine-tune when the task needs consistent style or format, domain-specific behavior prompting can't reach, or when shrinking to a cheaper model. They compose: RAG for knowledge, fine-tuning for behavior.
Tip: The 'RAG for knowledge, fine-tuning for behavior' framing is concise and correct — use it.
Situational Questions
Your model's offline metrics improved but the A/B test shows no lift. What happened?
situational, advanced
Sample Answer
Common causes: the offline metric doesn't capture the business outcome, the improvement is concentrated in segments too small to move the aggregate, a serving bug means the new model isn't actually live, or the system around the model (UI, downstream logic) bottlenecks the gains. I'd verify deployment first; it is surprisingly common for the 'new model' arm of a test to be serving the old model.
Tip: Checking the deployment-bug hypothesis first shows hard-won production scar tissue.
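The "segments too small to move the aggregate" cause is easy to show with arithmetic. The sketch below computes a traffic-weighted lift under a simplifying assumption that every segment has the same baseline rate; the function name and the example shares are hypothetical.

```python
import math
from typing import List, Tuple


def aggregate_lift(segments: List[Tuple[float, float]]) -> float:
    """Traffic-weighted relative lift.

    Each tuple is (traffic_share, relative_lift). Assumes equal baseline rates
    across segments, which real traffic rarely satisfies exactly.
    """
    total_share = sum(share for share, _ in segments)
    if not math.isclose(total_share, 1.0):
        raise ValueError("traffic shares must sum to 1")
    return sum(share * lift for share, lift in segments)
```

A 10% lift confined to 5% of traffic, `aggregate_lift([(0.05, 0.10), (0.95, 0.0)])`, comes out to roughly 0.5% overall, which many A/B tests are not powered to detect.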
What's in your ML system design toolkit when starting a brand-new project?
situational, beginner
Sample Answer
Questions before tools: what decision does this automate, what's the cost of a wrong prediction, what data exists today, and what's the simplest baseline — often rules — that creates value? Then I design the evaluation harness before the model, because you can't improve what you can't measure. The model itself is usually the easiest part to swap later.
Tip: Baseline-first and evaluation-first thinking signals seniority, whatever level the role is.
Preparation Tips
Practice one full ML system design — recommendations or fraud detection — covering data, features, training, serving, and monitoring.
Be fluent in the 2026 LLM stack: RAG architecture, fine-tuning trade-offs, evals, and cost/latency optimization.
Prepare cost and latency numbers from your past serving work; they anchor your credibility.
Review coding fundamentals — many MLE loops include a standard software engineering round.
Have a clear answer for how you collaborate with data scientists and what the role split should be.