Data Engineer Portfolio 2026: Projects That Get Interviews

Your ETL GitHub projects are failing you in 2026 screens. Here's exactly what AI-native DE portfolios need: RAG pipelines, LLM eval harnesses, and vector DB projects.

DataDriven Field Notes

Updated May 27, 2026 · 11 min read · By DataDriven Editorial

What this post actually says

01 A GitHub profile of batch ETL pipelines and dbt models is now pattern-matched by AI-native screeners as “needs retraining,” not “strong fundamentals.”
02 The single highest-signal portfolio project in 2026 is a RAG pipeline with RAGAS metrics and documented failure narratives.
03 Vector database expertise is table stakes; pgvector is usually the right answer, and being able to name the constraint that would push you to Qdrant is what clears screens.
04 LLM eval harnesses are the skill hiring teams over-index on most in 2026. Demand has grown 340% since 2024, and the project type tests judgment under ambiguity, which coding rounds can’t.
05 Mid-career DEs with 5+ years of Spark plus one published RAG project are rare and expensive: $300+/hr for architects who can bridge legacy and AI infrastructure.

The portfolio that gets interviews in 2026

A hiring panel at an AI-native startup recently reviewed twelve data engineer portfolios. Nine had immaculate dbt lineage graphs, Airflow DAGs with retry logic, and Spark jobs processing “millions of records.” All nine failed the first technical screen. The three who advanced looked less conventional on paper. Two had under three years of experience. One had never touched Airflow. All three had built a RAG pipeline with documented eval metrics.

The rules for a data engineer portfolio in 2026 have changed while most of us were busy maintaining production pipelines. AI-related roles now account for 20% of all US tech job ads, up from 11% in 2022. RAG is the dominant architecture for enterprise AI: 70% of engineering teams are shipping retrieval-augmented generation systems. ETL projects on a GitHub profile aren’t just neutral anymore. They are actively working against the candidate.

What follows is exactly what to build, how fast, and where to put it.

Why an ETL-only portfolio is now a liability

A GitHub profile full of batch ETL pipelines and dbt models is now pattern-matched by screeners as “needs retraining for LLM-adjacent systems.” Not “strong fundamentals.” Not “solid candidate.” Needs retraining.

The numbers behind that signal are blunt. 80% of new databases on Databricks are now built by AI agents, up from 30% the year before. The pipeline plumbing data engineers spent years perfecting is now agent-written. Hiring managers scoring portfolios explicitly evaluate “AI infrastructure that minimizes hallucinations caused by poor-quality data” as a differentiator. Generic tutorial projects (Titanic datasets, CSV ETL scripts) actively disqualify candidates. Fewer than 1 in 10 junior candidates include portfolios at all, but generic ones signal “tutorial recycler” rather than problem-solver.

Spark and dbt aren’t dead. Leading with them is dead. The job title stays the same, but the skills in demand are moving closer to AI infrastructure. A data engineer resume that reads like a 2023 best-practices checklist competes for roles paying $130K–$150K while RAG engineers at the same level are closing $175K–$290K. The path forward isn’t starting over; it’s one or two new projects that reframe the existing work.

The RAG pipeline that clears first screens

A retrieval-augmented generation pipeline is the single most important portfolio project a data engineer can build in 2026, not because RAG is trendy, but because it is what interviewers are literally asking about. “Describe the last eval harness you built. What did you measure? How did you handle subjectivity? What surprised you?” That’s from an actual LLM engineering interview framework. Candidates with only ETL experience resort to generic answers. Candidates who’ve built RAG systems answer with failure narratives.

A minimum viable RAG project needs four components:

A real dataset collected by the author (not a Kaggle download). Policy documents, internal wikis, technical documentation. Something with enough variety to break naive chunking.
Hybrid retrieval: BM25 + dense vector search with reciprocal rank fusion. Dense-only search is flagged as incomplete understanding of the tradeoff landscape.
Evaluation with RAGAS metrics: Faithfulness ≥ 0.9, Answer Relevancy ≥ 0.85, Context Precision ≥ 0.8. Production targets, not aspirational numbers.
Documented failures: which chunking strategy broke on this data, which embedding model the author switched away from and why.
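The hybrid retrieval step can be sketched without any search engine at all. The snippet below is a minimal illustration of reciprocal rank fusion, assuming the BM25 and dense retrievers each return a ranked list of document ids. The function name, the sample ids, and the constant `k=60` (a commonly used default, not something this post specifies) are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each document scores 1 / (k + rank) per list it appears in; k=60 is an
    assumed, commonly used smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical retriever outputs for one query
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document only one retriever found still survives lower in the list. That behavior is the tradeoff a README should be able to explain.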

Hierarchical chunking delivers 3–5x better F1 scores on structured documents versus flat chunking. That’s the kind of finding a candidate discovers by building, not by reading tutorials. A basic evaluation pipeline looks like this:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Your RAG pipeline returns these for each test question
eval_dataset = Dataset.from_dict({
    "question": test_questions,
    "answer": rag_answers,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers
})

results = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision]
)

# Production thresholds: faithfulness >= 0.9, relevancy >= 0.85
print(results)

The code isn’t the point; any tutorial supplies the code. The signal is the README documenting what happened when faithfulness dropped below 0.9 on a specific category of questions, and what the author changed. Engineers who shipped AI features in production always have stories about being wrong. A 2026 portfolio needs to show that, too. Candidates whose Python fundamentals are shaky should sharpen them first, but the project is what gets them past the screen.
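Finding the category where faithfulness dropped takes a per-category breakdown rather than a single aggregate score. The sketch below assumes per-question scores have already been exported as plain dictionaries with a `category` field. The field names, sample rows, and the `failing_categories` helper are hypothetical; the thresholds are the ones listed earlier in this section.

```python
from collections import defaultdict
from statistics import mean

# Thresholds from the component list above
THRESHOLDS = {
    "faithfulness": 0.9,
    "answer_relevancy": 0.85,
    "context_precision": 0.8,
}


def failing_categories(rows, thresholds=THRESHOLDS):
    """Return {category: {metric: mean_score}} for metrics below threshold."""
    by_category = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for metric in thresholds:
            by_category[row["category"]][metric].append(row[metric])

    failures = {}
    for category, metrics in by_category.items():
        low = {
            metric: round(mean(scores), 3)
            for metric, scores in metrics.items()
            if mean(scores) < thresholds[metric]
        }
        if low:
            failures[category] = low
    return failures


# Hypothetical per-question scores, for illustration only
rows = [
    {"category": "billing", "faithfulness": 0.96, "answer_relevancy": 0.90, "context_precision": 0.85},
    {"category": "billing", "faithfulness": 0.94, "answer_relevancy": 0.92, "context_precision": 0.87},
    {"category": "refund_policy", "faithfulness": 0.70, "answer_relevancy": 0.90, "context_precision": 0.82},
    {"category": "refund_policy", "faithfulness": 0.80, "answer_relevancy": 0.88, "context_precision": 0.86},
]

failing = failing_categories(rows)
print(failing)
```

In this made-up sample the overall picture looks acceptable, but one category fails faithfulness on its own. That is the kind of finding the README should narrate.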

Vector DB side projects: what interviewers check

Vector database expertise is the most in-demand AI engineering skill in 2026 hiring. Not optional. Not “nice to have.” Table stakes. Interviewers aren’t asking “which vector DB is best?” They’re asking “why did you pick this one, and what would make you switch?”

The surprising right answer for most data engineering portfolio projects is pgvector. For a new production RAG project where data already lives in Postgres (users, tenants, permissions), pgvector wins because joins, transactions, and row-level security come free. At roughly $15/month infrastructure for 10M vectors, it’s also the cheapest option versus Qdrant ($45/month) or Pinecone ($70/month). Choosing pgvector isn’t the budget move; it’s the correct architectural move for specific constraints. Interviewers listen for whether the candidate can name the constraint that drove the choice.

When a project outgrows pgvector, the tradeoffs get interesting. Qdrant achieves 22ms p95 latency versus Pinecone’s 45ms. Qdrant’s filtered search maintains recall during HNSW traversal; Pinecone applies filters post-retrieval. That difference matters for high-selectivity queries, and it’s exactly the kind of thing that shows up in system design interviews.
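Why filter placement matters for high-selectivity queries is easiest to see with a toy example. The sketch below uses exact brute-force search over five hypothetical vectors, not HNSW and not any vendor's implementation. It only contrasts the two orderings: filter after taking the top-k, versus filter before it.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def post_filter_search(docs, query, tenant_id, k):
    """Take the global top-k first, then drop rows from other tenants."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(d["vec"], query), reverse=True)
    return [d["id"] for d in ranked[:k] if d["tenant"] == tenant_id]


def pre_filter_search(docs, query, tenant_id, k):
    """Restrict to the tenant first, then take the top-k among those rows."""
    allowed = [d for d in docs if d["tenant"] == tenant_id]
    ranked = sorted(allowed, key=lambda d: cosine_similarity(d["vec"], query), reverse=True)
    return [d["id"] for d in ranked[:k]]


# Hypothetical corpus: tenant "b" owns only the vectors far from the query
docs = [
    {"id": 1, "tenant": "a", "vec": [1.0, 0.0]},
    {"id": 2, "tenant": "a", "vec": [0.9, 0.1]},
    {"id": 3, "tenant": "b", "vec": [0.6, 0.8]},
    {"id": 4, "tenant": "a", "vec": [0.8, 0.2]},
    {"id": 5, "tenant": "b", "vec": [0.0, 1.0]},
]
query = [1.0, 0.0]

post_ids = post_filter_search(docs, query, tenant_id="b", k=2)
pre_ids = pre_filter_search(docs, query, tenant_id="b", k=2)
print(post_ids, pre_ids)
```

When the filter is selective, the post-filtered result can come back short or empty even though matching rows exist. That is the recall problem a system design answer should name.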

A minimal embedding pipeline that demonstrates real understanding looks like this:

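-- pgvector: enable the extension once per database before using the vector type
CREATE EXTENSION IF NOT EXISTS vector;

-- pgvector: create the embedding table with metadata for filtered search
-- (assumes a documents table already exists)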
CREATE TABLE doc_embeddings (
    id SERIAL PRIMARY KEY,
    doc_id INTEGER REFERENCES documents(id),
    chunk_text TEXT NOT NULL,
    chunk_index INTEGER,
    embedding vector(1536),  -- OpenAI ada-002 dimensions
    tenant_id INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT now()
);

-- HNSW index for approximate nearest neighbor search
CREATE INDEX ON doc_embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Filtered similarity search: tenant isolation + semantic retrieval
SELECT chunk_text, 1 - (embedding <=> $1::vector) AS similarity
FROM doc_embeddings
WHERE tenant_id = $2
ORDER BY embedding <=> $1::vector
LIMIT 10;

The portfolio signal isn’t “I can write SQL.” It’s that the author chose pgvector, explained that tenant isolation via row-level security was the driving constraint, and documented the latency characteristics that would push the system to Qdrant at scale. Reducing top-K from 20 to 5 cuts reranker cost 4x. That kind of reasoning under constraints clears screens where chatbot demos without benchmarks fail.
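Benchmarks of this kind need very little code. The sketch below assumes two ranked id lists per query: one from the HNSW index and one from an exact scan of the same table (for example, the same query run with the index disabled). The ids and helper names are hypothetical, and the cost helper assumes reranker cost scales linearly with the number of candidates scored.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)


def rerank_cost_ratio(old_top_k, new_top_k):
    """Relative reranker cost, assuming cost is linear in candidates scored."""
    return old_top_k / new_top_k


# Hypothetical rankings for a single query
exact = [11, 42, 7, 93, 5]    # exact nearest neighbours
approx = [11, 7, 42, 18, 5]   # approximate (HNSW) result

recall = recall_at_k(approx, exact, k=5)
savings = rerank_cost_ratio(20, 5)
print(recall, savings)
```

Averaging `recall_at_k` over a query set at several `ef_search` settings, alongside measured latency, gives the table a README can point to when explaining where pgvector stops being enough.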

The LLM eval harness: highest-signal portfolio project

LLM evaluation demand is up 340% since 2024. For a single highest-leverage portfolio project, an eval harness is the one to build.

Eval harnesses test the exact skill companies can’t screen for with coding problems: judgment under ambiguity. Going from a vague product spec (“is this response good?”) to a measurable, reproducible eval pipeline is the gap traditional DE portfolios don’t fill. A dbt lineage graph demonstrates data plumbing. An eval harness demonstrates systems thinking.

The frameworks are mature. EleutherAI’s lm-evaluation-harness is the de facto standard. Hugging Face’s LightEval has 1,000+ stars and underpins the HF Leaderboards. For RAG-specific evaluation, RAGAS, RAGChecker, and TruLens are production-standard.

What interviewers want to see in an eval project:

Multi-level evaluation: model layer (baseline fluency / recall), task layer (domain alignment), system layer (user satisfaction + latency + cost economics). A high-accuracy model that’s too slow or expensive isn’t production-viable.
Multiple evaluation strategies: LLM-as-judge, reference-based, and statistical approaches. Each has failure modes. Document which one lied to you.
A production monitoring loop: not “I evaluated once,” but “here’s how I’d catch regression.”
Metric failure modes: the moment the author discovered that an aggregate accuracy metric masked catastrophic failure on a specific input pattern, or that a trusted metric didn’t correlate with actual user satisfaction.
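The monitoring loop from the list above can start as a comparison between a stored baseline and the latest eval run. This is a minimal sketch: the metric names, scores, and the 0.02 tolerance are assumptions for illustration, and a real harness would also need to handle metrics missing from a run.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: drop} for metrics that fell by more than `tolerance`."""
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            # Metric absent from this run; skipped here for simplicity
            continue
        drop = base_score - new_score
        if drop > tolerance:
            regressions[metric] = round(drop, 3)
    return regressions


# Hypothetical scores from the last accepted run and from today's run
baseline = {"faithfulness": 0.93, "answer_relevancy": 0.88, "context_precision": 0.84}
current = {"faithfulness": 0.86, "answer_relevancy": 0.875, "context_precision": 0.85}

regressions = find_regressions(baseline, current)
print(regressions)
```

Running a check like this on a schedule, or on every prompt and index change, is what turns a one-off evaluation into a regression guard.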

Mid-level LLM engineer base pay sits at $145K–$200K, senior hits $200K–$320K. Those numbers aren’t for people who followed a tutorial. They’re for people who can explain why their eval pipeline caught something unit tests couldn’t.

“Engineers who’ve built eval harnesses always have a story about a metric that was green while the system was broken. That story is worth more than three years of production ETL experience on a resume.”

DataDriven editorial, 2026

Feature store project: the Databricks hiring signal

Databricks interview loops span 3–6 weeks, with 5–6 interviews in the virtual onsite. Feature store design rounds specifically test dual-store architecture: Delta Lake offline plus Redis or DynamoDB online, with point-in-time correctness. Without a feature store project on a portfolio, screeners assume only batch ETL experience, which no longer clears first screens at AI-native orgs.

The project doesn’t need to be production-deployed. A well-designed GitHub repo with clear trade-off documentation passes screening faster than a deployed ETL pipeline without those annotations. What it needs:

Dual-store architecture: Delta Lake (or Parquet) for offline training features, Redis for online serving at millisecond latency. ElastiCache for Redis achieves microsecond performance; that’s the online store benchmark.
Point-in-time correctness: training sets constructed with correct temporal logic so future data isn’t leaked into training accidentally. Junior candidates stumble here because they don’t recognize that data visibility windows differ between training and serving.
Ops budget acknowledgment: Feast requires roughly 0.3 FTE to maintain in production. Mentioning Feast without acknowledging that overhead reads as “followed a tutorial.”
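Point-in-time correctness comes down to one rule: a training row may only see the latest feature value recorded at or before its own timestamp. The sketch below illustrates that lookup in plain Python, assuming each feature's history is a list of `(timestamp, value)` pairs sorted by timestamp. The integer timestamps and sample values are hypothetical.

```python
from bisect import bisect_right


def point_in_time_value(feature_history, as_of):
    """Latest feature value with timestamp <= as_of, or None if none exists."""
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history: (event timestamp, value), sorted by timestamp
history = [(100, 3), (200, 5), (300, 9)]

# Hypothetical label timestamps for three training rows
label_times = [50, 200, 250]
training_values = [point_in_time_value(history, ts) for ts in label_times]
print(training_values)
```

A naive join that always takes the newest value would hand all three rows the value 9, which is the leak described above. The row at timestamp 50 correctly gets no value at all.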

The contrarian take that actually demonstrates seniority: sometimes a feature store isn’t needed. Small teams running fewer than 10 models over-invest in feature stores when simpler ML ops pipelines suffice. Interviewers want architects who know when to say no. A strong project documents why a feature store was warranted, not just how it was built.

Feast for sub-minute latency requirements is a red flag. Feast is batch-materialization-focused. Sub-minute refresh requires Tecton or Hopsworks. Knowing that distinction, and explaining it in a README, is the difference between “built something” and “understands the production landscape.”

Framing Spark and dbt without looking desperate

dbt demand surged 9+ percentage points in job requirements across 2025–2026. SQL demand jumped 18 points. Apache Spark still appears in 39% of data engineering postings. Those skills aren’t dead. They’re just not enough on their own.

The winning framing is “I built this dbt transformation to support this RAG evaluation dataset.” Legacy tools enable AI infrastructure; that’s the story. Modern Data Platform Engineers with dbt + Snowflake command a 20–40% salary premium over generalists. Legacy ETL combined with a modern stack makes a candidate critical for enterprise modernization. Legacy ETL alone makes a candidate a commodity.

The unfair advantage nobody’s exploiting: early-career DEs flooding the market with pure-RAG projects clear screens but fail system design questions because they can’t articulate dbt incremental strategies or Spark bottlenecks. A mid-career DE with 5+ years of Spark plus one published RAG project is rare and expensive: $300+/hr for architects who can teach junior teams both paradigms.

Strong candidates don’t have more projects; they have clearer thinking. Surface tradeoffs (batch vs. streaming, dbt vs. raw Spark) and justify them. Position legacy skills as deliberate architectural choices, not resume padding. Portfolio order matters: lead with AI, reference legacy stack as foundation.

-- dbt model that feeds a RAG evaluation pipeline
-- This is the framing: legacy tool serving AI infrastructure

-- models/staging/stg_support_tickets.sql
-- Chunks support tickets for RAG evaluation dataset
-- Postgres syntax. E'\n\n' is an escape string, so it matches a real blank line;
-- a plain '\n\n' literal would match the literal characters backslash-n-backslash-n.
SELECT
    ticket_id,
    created_at,
    category,
    -- Semantic chunking prep: split long resolutions into paragraphs
    SPLIT_PART(resolution_text, E'\n\n', chunk.index) AS chunk_text,
    chunk.index AS chunk_position,
    LENGTH(SPLIT_PART(resolution_text, E'\n\n', chunk.index)) AS chunk_length
FROM {{ source('support', 'tickets') }}
-- One output row per paragraph: generate positions 1..N for each ticket
CROSS JOIN LATERAL GENERATE_SERIES(
    1,
    ARRAY_LENGTH(STRING_TO_ARRAY(resolution_text, E'\n\n'), 1)
) AS chunk(index)
WHERE resolution_text IS NOT NULL
  -- Drop fragments too short to be useful retrieval chunks
  AND LENGTH(SPLIT_PART(resolution_text, E'\n\n', chunk.index)) > 50

The model isn’t impressive on its own. Paired with a RAG pipeline that consumes its output, it tells a story: the author understands the full stack from transformation through retrieval through evaluation.

Where to publish: GitHub alone won't cut it

Hugging Face Spaces hosts 500K+ applications as of January 2026. The platform contains 2M+ models and 500K+ datasets. Google BigQuery offers SQL-native managed inference for Hugging Face models. Hugging Face is no longer a niche platform; it’s where AI-native hiring teams expect to see candidate work.

The multi-platform requirement is real: GitHub for code and version history, Hugging Face Spaces or Streamlit for a live demo, and a blog post explaining the architectural decisions. A project without a diagram doesn’t exist to a reviewer. A diagram is the fastest way to communicate system thinking.

A quieter finding from 2025–2026 hiring data: Medium and Towards Data Science articles about RAG systems and eval frameworks outrank GitHub in recruiter search patterns. Portfolio visibility now depends on content distribution (blog + code + demo), not repository stars. Most hiring managers don’t read code deeply; hiring is pattern recognition, not code review. Strong candidates have clearer thinking visible across multiple surfaces, not more repositories.

A data engineer GitHub portfolio is one piece. Pairing every repo with a deployed demo and a written explanation of trade-offs triples surface area for the same amount of work.

A 30-day minimum-viable AI portfolio sprint

The job-search window for displaced DEs is tight. Senior engineers with specialized AI-data experience place in a median 17 days via specialized recruiters. Mid-level engineers face 60–90 day searches. With 144,355 tech workers affected in Q1 2026 alone (982 per day), speed matters. A 30-day sprint produces a portfolio that tells a coherent story.

Week 1: RAG pipeline. Pick a real document corpus. Implement hybrid retrieval (BM25 + dense). Get RAGAS metrics running. Deploy a Streamlit demo. Push to GitHub and Hugging Face Spaces.

Week 2: Eval harness. Build evaluation on top of the Week 1 RAG. Add LLM-as-judge and reference-based scoring. Document every metric failure found. Publish the blog post explaining what surprised the author.

Week 3: Vector DB deep dive. Migrate the Week 1 pipeline to pgvector. Benchmark latency and recall. Document why the author would or wouldn’t move to Qdrant at scale. Add the architecture diagram.

Week 4: Feature store + integration. Build a minimal feature store (Feast + Redis) that feeds features into the RAG pipeline’s reranker. Connect any existing dbt or Spark work as the offline feature source. Update the interview prep to walk through the full system.

Four weeks. Four projects that reference each other. One coherent portfolio that tells a story of an engineer who understands data from warehouse through vector retrieval through production evaluation. Recruiters spend less than 10 seconds on a resume but engage 80% more with GitHub projects featuring runnable code or live demos. A 30-day sprint produces both.

The portfolios that get callbacks

Three waves of “data engineering is getting automated away” have come and gone, and the field is still here. Still employed. Still debugging the same categories of problems: schema drift, late-arriving data, upstream teams breaking contracts without telling anyone. Those are eternal. What changes is how candidates prove they can solve the new problems too.

RAG grew from 0% to 4% of job postings year-over-year. LLM engineering grew from 3% to 12%. Those numbers are small until you realize that’s where the salary premium is concentrated. One end-to-end RAG project with documented eval metrics repositions a candidate from “pipeline plumber” to “AI-native DE.” That’s the math on who’s getting callbacks and who isn’t.

Build the projects. Publish them everywhere. Lead with AI; let the dbt models support the story instead of carrying it.

Common misconceptions vs hiring-manager reality

Myth: More projects = stronger portfolio.
Reality: Hiring is pattern recognition, not code review. One coherent end-to-end project with documented tradeoffs beats five tutorial recreations. Recruiters spend under 10 seconds on a resume.

Myth: Pinecone or Qdrant is the right vector DB choice.
Reality: pgvector is usually correct when data already lives in Postgres. Joins, transactions, and row-level security come free at ~$15/month for 10M vectors. Naming the constraint that would push you to Qdrant is what clears screens.

Myth: Building an LLM eval harness requires a research background.
Reality: Eval harnesses test judgment under ambiguity, not academic depth. Frameworks like RAGAS, LightEval, and lm-evaluation-harness are production-mature. The signal is documenting metric failures, not novelty.

Myth: Spark and dbt are dead skills.
Reality: dbt demand surged 9+ points and Spark still appears in 39% of postings. Legacy ETL paired with one AI-native project is rare and expensive ($300+/hr for architects who bridge both).

