Data Pipeline Architecture

Pipeline Architecture Practice on a Real Canvas

Drop in ingestion, processing, storage, and serving components. Pick the tools. Wire the flows. The validator scores your design against the SLA and the cost target the prompt asked for, then tells you what's missing in plain English.

The problems cover what comes up in real data engineering system design rounds: ETL versus ELT, batch versus streaming with exactly-once requirements, partitioning and replay, schema evolution, and the cost-versus-latency trade-offs that decide which tool you reach for.

How the design canvas works

A canvas where you actually draw the pipeline

Drop in ingestion, processing, storage, and serving boxes. Wire them. Pick the tool for each. The validator compares your design against the reference and against the SLA the prompt asked for. Multiple correct designs are accepted because there isn't one right pipeline for any real problem.

Tool choice gets evaluated, not just checked for presence

If you wire Kafka into a single-source problem where one CSV arrives daily, the validator calls it out. If you pick a daily batch when the prompt asks for sub-minute freshness, it flags the SLA miss. The grade isn't whether you used the famous tool; it's whether the pick survives the constraint.
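The SLA side of that check can be sketched in a few lines. This is an illustration only, not the validator's actual logic: the `Stage` class, the `freshness_verdict` function, and every delay figure below are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    worst_case_delay_s: float  # longest a record can wait at this stage (assumed figure)


def freshness_verdict(stages, sla_s):
    """Sum per-stage worst-case delays and compare the total to the freshness SLA."""
    total = sum(stage.worst_case_delay_s for stage in stages)
    slowest = max(stages, key=lambda stage: stage.worst_case_delay_s)
    return {
        "meets_sla": total <= sla_s,
        "worst_case_s": total,
        "slowest_stage": slowest.name,
    }


# Illustrative delays only; real numbers depend on the tools and their configuration.
daily_batch = [
    Stage("file drop", 0),
    Stage("daily batch job", 24 * 3600),
    Stage("warehouse load", 600),
]
streaming = [
    Stage("message broker", 1),
    Stage("stream processor", 5),
    Stage("serving store", 2),
]

# Check both designs against a sub-minute freshness SLA.
for label, design in [("daily batch", daily_batch), ("streaming", streaming)]:
    print(label, freshness_verdict(design, sla_s=60))
```

Against a 60-second SLA the daily batch design fails and the slowest stage is named; against a two-day SLA the same design passes, which is the case where adding a broker would be the overkill.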

ETL vs ELT as a real design decision

Most problems can be solved either way. The validator looks at what your downstream consumers need, the cost of raw retention, and the compute model of your warehouse, then tells you whether your pick lines up with the trade-offs. Defending the choice is the part interviewers actually grade.
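To make the difference concrete, here is a minimal sketch that runs the same cleanup both ways. It assumes an in-memory SQLite database as a stand-in for the warehouse; the table names, the sample rows, and the cleaning rule are all hypothetical.

```python
import sqlite3

# Hypothetical source rows: (user_id, day, amount as text). One amount is malformed.
raw_events = [
    ("u1", "2024-01-01", "12.50"),
    ("u2", "2024-01-01", "bad"),
    ("u1", "2024-01-02", "7.25"),
]


def parse_amount(text):
    """Return the amount as a float, or None when the text is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def run_etl(conn, rows):
    """ETL: transform in application code, then load only the cleaned rows."""
    conn.execute("CREATE TABLE orders_clean (user_id TEXT, day TEXT, amount REAL)")
    cleaned = [
        (user_id, day, parse_amount(amount))
        for user_id, day, amount in rows
        if parse_amount(amount) is not None
    ]
    conn.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", cleaned)


def run_elt(conn, rows):
    """ELT: land raw rows untouched, then transform with the warehouse's own SQL."""
    conn.execute("CREATE TABLE orders_raw (user_id TEXT, day TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT user_id, day, CAST(amount AS REAL) AS amount
        FROM orders_raw
        WHERE amount GLOB '[0-9]*'
        """
    )


etl_db = sqlite3.connect(":memory:")
run_etl(etl_db, raw_events)

elt_db = sqlite3.connect(":memory:")
run_elt(elt_db, raw_events)

etl_total = etl_db.execute("SELECT SUM(amount) FROM orders_clean").fetchone()[0]
elt_total = elt_db.execute("SELECT SUM(amount) FROM orders_clean").fetchone()[0]
raw_kept = elt_db.execute("SELECT COUNT(*) FROM orders_raw").fetchone()[0]
print(etl_total, elt_total, raw_kept)
```

Both paths produce the same clean table. The difference is what survives: the ELT database still holds the malformed row in `orders_raw`, so the cleaning rule can be changed and re-run later, at the price of storing the raw data.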

Cost as a first-class score

Every design gets a cost-efficiency number. Real-time streaming when daily batch satisfies the SLA shows up red. Over-provisioned cluster sizing shows up red. Hot-tier storage when the access pattern is cold shows up red. The number is rough but the direction is right.
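A rough version of that comparison is plain arithmetic. The sketch below is not the site's scoring formula: the node count, the hourly rate, the 30-day month, and the `tolerance` threshold are placeholder assumptions chosen only to show the direction of the result.

```python
DAYS_PER_MONTH = 30  # rounding assumption for a back-of-envelope estimate


def monthly_compute_cost(nodes, hourly_rate_usd, hours_per_day):
    """Estimate monthly compute spend for a cluster that runs a fixed number of hours per day."""
    return nodes * hourly_rate_usd * hours_per_day * DAYS_PER_MONTH


def cost_flag(chosen_cost, cheapest_cost_meeting_sla, tolerance=1.5):
    """Flag a design red when it costs far more than the cheapest design that meets the SLA."""
    return "red" if chosen_cost > tolerance * cheapest_cost_meeting_sla else "green"


# Placeholder prices: the same 3-node cluster, always on versus one hour per day.
always_on_streaming = monthly_compute_cost(nodes=3, hourly_rate_usd=0.50, hours_per_day=24)
daily_batch = monthly_compute_cost(nodes=3, hourly_rate_usd=0.50, hours_per_day=1)

print(always_on_streaming, daily_batch, cost_flag(always_on_streaming, daily_batch))
```

With these placeholder numbers the always-on cluster costs 24 times the daily job, so if the daily job meets the SLA the streaming design is flagged. If the SLA rules out batch, the streaming design becomes the cheapest valid option and the flag goes away.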

Difficulty that grows with the failure modes

Early problems are a single source to a single sink. Later problems add late-arriving data, schema drift on the source, exactly-once requirements, replay, and multi-region failover. The complexity comes from the realism, not from the number of boxes.
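Replay and exactly-once requirements interact in a way that a small simulation makes visible. The sketch below assumes at-least-once delivery with a crash before the offset commit, and compares an append-only sink with a sink keyed on event ID. The class names, the event shape, and the crash model are hypothetical; no real broker is involved.

```python
class AppendOnlySink:
    """Appends every write, so a replayed event is stored twice."""

    def __init__(self):
        self.rows = []

    def write(self, event):
        self.rows.append(event["amount"])

    def total(self):
        return sum(self.rows)


class IdempotentSink:
    """Upserts by event_id, so a replayed event overwrites itself."""

    def __init__(self):
        self.rows = {}

    def write(self, event):
        self.rows[event["event_id"]] = event["amount"]

    def total(self):
        return sum(self.rows.values())


def deliver_with_replay(sink, events, crash_after):
    """Process events, crash before committing one offset, then replay from the last commit."""
    committed = 0
    for index, event in enumerate(events):
        sink.write(event)
        if index + 1 == crash_after:
            break  # simulated crash: the write happened, the offset commit did not
        committed = index + 1
    # Restart: everything after the last committed offset is delivered again.
    for event in events[committed:]:
        sink.write(event)


events = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 5},
    {"event_id": "e3", "amount": 7},
]

append_sink = AppendOnlySink()
idempotent_sink = IdempotentSink()
deliver_with_replay(append_sink, events, crash_after=2)
deliver_with_replay(idempotent_sink, events, crash_after=2)
print(append_sink.total(), idempotent_sink.total())
```

The append-only sink double-counts the event that was written but not committed, while the keyed sink ends with the correct total. Making the sink idempotent is one common way to get effectively-once results on top of at-least-once delivery; transactional writes are another, not shown here.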

Feedback that names what's missing

When you submit, you see what's wrong by category: missing fault tolerance, wrong processing model, under-specified semantics, an obvious cheaper alternative. Closer to a real design review than to a quiz.

Data Pipeline Architecture Topics

ETL vs ELT

Medium
Very High (1,500/mo searches) · Core

Data Pipeline Architecture

Medium-Hard
High (600/mo searches) · Core

Batch Processing vs Stream Processing

Medium-Hard
High (300/mo searches) · Core

Apache Spark and PySpark

Hard
Very High (3,800 + 1,100/mo searches) · Multiple

Apache Kafka

Hard
High (1,000/mo searches) · Multiple

Reliability and Fault Tolerance

Hard
~50% of rounds · Multiple

Incremental Loading and CDC

Medium-Hard
~45% of rounds · Multiple

Storage Architecture

Medium-Hard
~60% of rounds · Multiple

Two modes, used for different parts of prep

Problem mode

A clear scenario, a stated SLA, and no timer. Drag the components in, wire them, submit. The validator returns a cost score, an SLA verdict, and a list of what it expected to see that isn't on your canvas. Best for learning a new pattern.

Interview mode

A vague scenario, a timer, and an AI interviewer that pushes on every tool pick. Halfway through, it might add a requirement (the upstream now goes down for an hour a day, or you need replay across regions) that forces you to rework the design on the fly.

Data Pipeline Architecture FAQ

What is data pipeline architecture?
The design of the system that gets data from where it's produced to where it's consumed, including the tool selection, the processing model, the storage layout, and the failure handling. In interviews, it's the data-engineering version of the system design round: you're given a scenario like ten million events a day with a fifteen-minute SLA, and you're expected to talk through Kafka or not Kafka, Flink or Spark, S3 or warehouse-direct, and what happens when the source goes down for two hours.
What is the difference between ETL and ELT?
ETL transforms data in flight, before it lands in the warehouse. ELT lands the raw data first and transforms it in place using the warehouse's own compute. ETL made sense when warehouse storage and compute were both expensive, so it paid to shape and trim data before loading it. ELT became the default once Snowflake and BigQuery made warehouse compute elastic and pay-per-use, and cheap cloud storage made keeping raw data affordable. The right answer in 2026 is almost always ELT, but knowing why the trade-off shifted is the part interviewers grade on.
What PySpark questions should I prep for?
DataFrame versus RDD, transformations versus actions, lazy evaluation, when to broadcast a small dimension table, how to detect and handle skew, why a wide transformation triggers a shuffle, partition tuning, and the difference between persist and cache (cache is persist with the default storage level). If the company runs Spark in production (Databricks, Netflix, Uber, most large lakehouse shops), expect at least one question that goes deeper than DataFrame syntax.
What Kafka questions should I expect?
Topics and partitions, consumer groups and offset management, the at-most-once / at-least-once / exactly-once trade-off, schema registry and how Avro or Protobuf handle compatibility, and Kafka Connect for the sink side. The most common follow-up: when is Kafka overkill, and what would you reach for instead. The honest answer for most internal pipelines: SQS plus a Lambda or a scheduled S3 read.
Is this practice free?
Yes. There's no subscription and there isn't going to be one.

About this page

This is the pipeline architecture practice index on DataDriven, a free site focused on data engineering interview preparation. The design canvas covers the patterns interviewers actually ask about in system design rounds: batch and streaming ingestion, ETL versus ELT, fault tolerance and replay, schema evolution, partitioning and incremental loading, and the cost-versus-latency trade-offs that determine when Kafka, Spark, or a simpler scheduled job is the right answer.

Open the pipeline canvas

Start with a scenario you'd actually be asked in a senior loop: clickstream into a warehouse, change data capture from a production database, or multi-region replay.

Start a design problem

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 920 companies, collected from real candidates.

Interview Rounds

By Company

By Role

By Technology

Decisions

Question Formats