Drop in ingestion, processing, storage, and serving components. Pick the tools. Wire the flows. The validator scores your design against the SLA and the cost target the prompt asked for, then tells you what's missing in plain English.
Covers the design problems that come up in real DE system design rounds: ETL versus ELT, batch versus streaming with exactly-once requirements, partitioning and replay, schema evolution, and the cost-versus-latency trade-offs that decide which tool you reach for.
Drop in ingestion, processing, storage, and serving boxes. Wire them. Pick the tool for each. The validator compares your design against the reference and against the SLA the prompt asked for. Multiple correct designs are accepted because there isn't one right pipeline for any real problem.
If you wire Kafka into a one-source CSV-arrives-daily problem, the validator calls it out. If you pick a daily batch when the prompt asks for sub-minute freshness, it flags the SLA miss. The grade isn't whether you used the famous tool; it's whether the pick survives the constraint.
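The two checks described above can be pictured as a small rule function. This is only a sketch, not the site's validator: the `Design` fields, the `check_design` name, and the worst-case freshness figures per processing model are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical worst-case freshness per processing model, in seconds.
# These numbers are illustrative assumptions, not measured values.
WORST_CASE_FRESHNESS_S = {
    "daily_batch": 24 * 60 * 60,
    "hourly_batch": 60 * 60,
    "micro_batch": 60,
    "streaming": 5,
}


@dataclass
class Design:
    processing_model: str      # key into WORST_CASE_FRESHNESS_S
    uses_message_broker: bool  # e.g. Kafka sits in the ingestion path
    source_count: int          # number of upstream sources
    arrival: str               # "daily_file" or "continuous"


def check_design(design: Design, required_freshness_s: int) -> list[str]:
    """Return plain-English findings; an empty list means nothing was flagged."""
    findings = []
    freshness = WORST_CASE_FRESHNESS_S[design.processing_model]
    if freshness > required_freshness_s:
        findings.append(
            f"SLA miss: {design.processing_model} can serve data up to "
            f"{freshness}s old, but the prompt asks for {required_freshness_s}s."
        )
    if (design.uses_message_broker and design.source_count == 1
            and design.arrival == "daily_file"):
        findings.append(
            "Over-engineered: a message broker adds little for one source "
            "that delivers a file once a day."
        )
    return findings


# A daily CSV routed through a broker, graded against sub-minute freshness.
for finding in check_design(Design("daily_batch", True, 1, "daily_file"), 60):
    print(finding)
```

The point of the sketch is that both findings come from the constraint in the prompt, not from a list of approved tools.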
Most problems can be solved with either ETL or ELT. The validator looks at what your downstream needs, the cost of raw retention, and the compute model of your warehouse, then tells you whether your pick lines up with the trade-offs. Defending the choice is the part interviewers actually grade.
Every design gets a cost-efficiency number. Real-time streaming when daily batch satisfies the SLA shows up red. An over-provisioned cluster shows up red. Hot-tier storage when the access pattern is cold shows up red. The number is rough but the direction is right.
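To make the three red flags concrete, here is a rough sketch of how such a number could be derived. It is not how the site computes its score: the `CostInputs` fields, the thresholds, and the equal weighting of flags are assumptions chosen only to keep the example small.

```python
from dataclasses import dataclass

DAY_S = 24 * 60 * 60
# Hypothetical thresholds, chosen only to make the example concrete.
OVERPROVISION_FACTOR = 2
COLD_ACCESS_READS_PER_MONTH = 1


@dataclass
class CostInputs:
    processing_model: str      # "streaming" or "daily_batch"
    required_freshness_s: int  # freshness the prompt asks for
    provisioned_workers: int   # cluster size on the canvas
    peak_workers_needed: int   # estimated workers needed at peak load
    storage_tier: str          # "hot" or "cold"
    reads_per_month: int       # how often the stored data is read


def cost_flags(c: CostInputs) -> list[str]:
    """Return the red flags that apply to this design."""
    flags = []
    if c.processing_model == "streaming" and c.required_freshness_s >= DAY_S:
        flags.append("streaming where daily batch meets the SLA")
    if c.provisioned_workers > OVERPROVISION_FACTOR * c.peak_workers_needed:
        flags.append("over-provisioned cluster")
    if c.storage_tier == "hot" and c.reads_per_month <= COLD_ACCESS_READS_PER_MONTH:
        flags.append("hot-tier storage for a cold access pattern")
    return flags


def cost_efficiency(c: CostInputs) -> float:
    """Rough score in [0, 1]: each of the three flags removes an equal share."""
    return round(1 - len(cost_flags(c)) / 3, 2)


wasteful = CostInputs("streaming", DAY_S, 40, 8, "hot", 0)
print(cost_flags(wasteful), cost_efficiency(wasteful))
```

A real score would weigh the flags by actual spend. The equal weighting here only preserves the direction, which is all the text above claims for the number.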
Early problems are a single source to a single sink. Later problems add late-arriving data, schema drift on the source, exactly-once requirements, replay, and multi-region failover. The complexity comes from the realism, not from the number of boxes.
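Two of those later requirements, replay and exactly-once results, are commonly handled together by making the sink idempotent: each event carries a unique ID, and writing the same event twice leaves the sink unchanged. A minimal in-memory sketch, where `IdempotentSink` and the event shape are invented for illustration:

```python
class IdempotentSink:
    """In-memory stand-in for a table keyed by event ID."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def write_batch(self, events: list[dict]) -> int:
        """Upsert events by 'event_id'; return how many rows were new."""
        new_rows = 0
        for event in events:
            if event["event_id"] not in self.rows:
                new_rows += 1
            # Rewriting an existing ID with the same payload changes nothing,
            # so replaying a batch is safe.
            self.rows[event["event_id"]] = event
        return new_rows


sink = IdempotentSink()
batch = [
    {"event_id": "a1", "amount": 10},
    {"event_id": "a2", "amount": 25},
]
first = sink.write_batch(batch)
replayed = sink.write_batch(batch)  # simulate a replay after a failure
total = sum(row["amount"] for row in sink.rows.values())
print(first, replayed, total)  # prints: 2 0 35
```

Delivery is still at-least-once in this sketch. Only the effect on the sink is exactly-once, which is the distinction a design review tends to probe.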
When you submit, you see what's wrong by category: missing fault tolerance, wrong processing model, under-specified semantics, an obvious cheaper alternative. Closer to a real design review than to a quiz.
A clear scenario, a stated SLA, and no timer. Drag the components in, wire them, submit. The validator returns a cost score, an SLA verdict, and a list of what it expected to see that isn't on your canvas. Best for learning a new pattern.
A vague scenario, a timer, and an AI interviewer that pushes on every tool pick. Halfway through, it might add a requirement (the upstream now goes down for an hour a day, you need replay across regions) that forces you to rework the design on the fly.
This is the pipeline architecture practice index on DataDriven, a free site focused on data engineering interview preparation. The design canvas covers the patterns interviewers actually ask about in system design rounds: batch and streaming ingestion, ETL versus ELT, fault tolerance and replay, schema evolution, partitioning and incremental loading, and the cost-versus-latency trade-offs that determine when Kafka, Spark, or a simpler scheduled job is the right answer.
Start with a scenario you'd actually be asked in a senior loop. Clickstream into a warehouse, change data capture from a production database, multi-region replay.
Continue your prep
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 920 companies, collected from real candidates.