ETL Pipeline: How to Design One That Lasts
An ETL pipeline extracts data from your source systems, transforms it to match your target schema, and loads it into a data warehouse or analytics platform. That's the textbook definition — and it's accurate, as far as it goes.
In practice, it's the infrastructure layer that determines whether your analytics are trustworthy or not. Every dashboard your team looks at, every report that lands in a board meeting, every KPI someone tracks — it all flows through the pipeline. If the pipeline is brittle, your data is brittle. If the transformation logic is undocumented, your analytics are undocumented. And if it fails silently, your business is making decisions on stale or broken data without knowing it.
What the textbook definition doesn't tell you is how to design one that doesn't fall apart in two years. That's what this guide is for.
Key Takeaways
- An ETL pipeline extracts, transforms, and loads data — but design decisions in each phase determine whether it scales or becomes a liability.
- ETL and ELT aren't interchangeable: ETL transforms before loading (better for regulated industries), ELT transforms after loading (better for cloud-native stacks).
- Most ETL pipelines fail because of poor data quality handling, undocumented transformation logic, and zero observability — not because of bad tooling.
- The costliest ETL mistakes happen during pipeline design, not execution — treating pipelines as one-time builds rather than maintained systems compounds debt over time.
- If your pipeline was designed three years ago and nobody's touched the architecture since, you almost certainly have a maintenance crisis waiting to happen.
What an ETL Pipeline Actually Is (and Why the Definition Matters)
An ETL pipeline is a structured data workflow that does three things: Extract data from one or more source systems, Transform it into a format suited for analysis, and Load it into a destination system — typically a data warehouse, data lakehouse, or analytics platform.
That's the textbook definition. Here's the one that matters in practice: an ETL pipeline is the infrastructure layer that determines whether your organisation's analytics are trustworthy or not.
Every dashboard you look at, every report that lands in a board meeting, every KPI your team tracks — it all flows through the pipeline. If the pipeline is brittle, your data is brittle. If the transformation logic is undocumented, your analytics are undocumented. And if the pipeline fails silently, your business makes decisions on stale or broken data without knowing it.
To understand how ETL works at a basic level: the extract phase pulls raw data from sources (databases, APIs, flat files, SaaS platforms). The transform phase applies business logic — cleaning, deduplication, aggregation, schema mapping. The load phase writes the transformed data into the target system, structured and ready for query.
Simple in theory. Much harder in practice once you're dealing with 12 source systems, three teams with different data conventions, and a warehouse schema that was designed for a business that looked very different four years ago.
ETL vs ELT: The Architecture Decision Most Teams Get Wrong
Here's a question worth asking before you build anything: should you actually be building an ETL pipeline, or an ELT one?
The difference isn't just letter order. It reflects a fundamental choice about where transformation happens.
ETL (Extract, Transform, Load) transforms data before it enters the destination system. Transformation logic lives in an intermediate layer — a staging environment or a dedicated ETL tool. The data that hits your warehouse is already clean, structured, and shaped to your schema.
ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the destination system (typically a cloud warehouse like Snowflake, BigQuery, or Redshift). The ELT process in data warehouse environments uses the warehouse's native compute power for transformation — which is often cheaper and faster than spinning up a separate processing layer.
So which one's right for you? It depends on four things:
-
Your warehouse's compute model. Cloud warehouses with elastic compute (Snowflake, BigQuery) handle ELT well. Legacy on-premise warehouses often don't.
-
Your data sensitivity requirements. ETL lets you mask or drop sensitive fields before data ever enters the warehouse. ELT means raw data (potentially including PII) lands first. In healthcare and finance, this matters.
-
Your transformation complexity. Complex, multi-step business logic is often easier to manage and version-control in an ETL layer than in warehouse SQL.
-
Your team's skillset. ETL tooling (Informatica, Talend, custom Python) vs. ELT tooling (dbt, Dataform) requires different competencies.
Most mid-market enterprises in healthcare and manufacturing are better served by ETL architectures — at least initially — because of data governance requirements and the reality that their warehouses weren't built for ELT compute patterns. Finance teams with modern cloud stacks often benefit from moving to ELT.
The mistake is defaulting to one without examining your actual constraints.
The Five Design Principles That Separate Maintainable Pipelines from Legacy Nightmares
Most ETL pipeline failures aren't runtime failures. They're design failures — decisions made at the start that create compounding problems as the business changes. Here are the five principles that prevent it.
-
Design for schema change, not current schema. Your source systems will change. New fields appear. Tables get renamed. Data types shift. Pipelines designed for a fixed schema break when the schema moves. Design extraction logic that handles upstream changes gracefully — fail loudly instead of silently, and make it easy to update mappings without rebuilding the pipeline.
-
Build observability in from day one. Observability isn't monitoring. Monitoring tells you if the job finished. Observability tells you if the job finished correctly — rows processed, records dropped, transformation errors, data freshness lag. If you can't answer "did the right data arrive in the right shape at the right time?", your pipeline is a black box. Black boxes become legacy problems.
-
Document transformation logic as code, not tribal knowledge. Transformation rules that exist in someone's head — or in a commented-out SQL script with no version history — are a single point of failure. Treat transformation logic as code: version-controlled, documented, reviewable. Tools like dbt make this significantly easier for ELT patterns. For ETL patterns, the same discipline applies to scripting layers.
-
Separate ingestion from transformation from orchestration. Pipelines that bundle all three into a single monolithic script are hard to debug, harder to scale, and nearly impossible to hand off. Keep concerns separate: one layer handles extraction and staging, one handles transformation, one handles scheduling and dependency management (Airflow and similar orchestration tools exist precisely for this reason).
-
Plan for failure, not just for success. What happens when a source API is down? What happens when a transformation produces null values that should never be null? What happens when the load step fails halfway through? Pipelines without graceful failure handling and retry logic become incidents. Design your error states as carefully as your happy path.
What ETL Pipeline Architecture Actually Looks Like at Scale
A scalable ETL pipeline architecture isn't just a technical diagram. It's a set of decisions about where data lives, how it moves, and who's responsible for what.
At the architectural level, a well-designed ETL pipeline in a mid-market enterprise typically looks like this:
Source layer — the origin systems (ERP, CRM, HRIS, EMR, flat file drops, SaaS APIs). The extraction layer connects to each source, handles authentication, and pulls data on a defined schedule or trigger.
Staging layer — a raw landing zone where extracted data sits before transformation. This is your audit trail. If a transformation goes wrong, you can reprocess from staging without re-extracting from source.
Transformation layer — where business logic is applied. Deduplication, data type normalisation, business rule application, schema mapping. This is where the pipeline definition becomes real: the rules that turn raw operational data into analytics-ready facts and dimensions.
Target layer — the data warehouse or lakehouse where transformed data lands, structured for query. Often organised into raw, staging, and presentation layers (bronze/silver/gold in lakehouse terminology).
Orchestration layer — the scheduler and dependency manager. Airflow, Prefect, dbt Cloud, or even well-structured cron jobs at smaller scale. This layer ensures jobs run in the right order, retries on failure, and alerts when things go wrong.
The Most Common ETL Pipeline Failures (and How to Design Around Them)
Let's be direct about what breaks ETL pipelines. Most failures are predictable and preventable.
-
Silent data quality failures
The pipeline runs. The load completes. But the source system sent malformed records, and the transformation didn't catch them. The result: clean-looking data in your warehouse that's factually wrong. Prevention: build data quality validation into every transformation step (row counts, null checks, referential integrity, range checks). If you're already focused on data quality management as a discipline, this is where those commitments either hold or fall apart.
-
Schema drift
A source system updates and adds a new required field. Your extraction query doesn't include it. The pipeline starts dropping records. Nobody notices for two weeks. Prevention: schema versioning, automated schema change detection, and alerting.
-
Undocumented business logic
Someone built a transformation rule three years ago that everyone's forgotten. Now a new analyst changes the query and business KPIs shift by 12%. Nobody can explain why. Prevention: every transformation rule documented, versioned, and reviewable. This is the ETL methodology discipline most teams skip and every team regrets.
-
No lineage
You can't trace a number in a dashboard back to the source record that produced it. When a business user asks "where does this come from?", the honest answer is "we're not sure." Prevention: data lineage tooling — whether native in your warehouse, in your ETL tool, or in a dedicated catalog — is not optional for production analytics.
-
Over-dependence on ETL scripting without abstraction
Raw ETL scripting with no abstraction layer makes pipelines fragile and expensive to maintain. Every schema change requires manual rework. Prevention: use framework abstractions (dbt for SQL transformations, orchestration tools for scheduling) and resist the temptation to hand-craft every pipeline from scratch.
Tooling Decisions: What to Consider Before You Buy or Build
There's no shortage of scalable ETL pipeline tools. The question isn't which tool is best in the abstract — it's which tool fits your architecture, team skills, and data scale.
Managed ETL tools (Informatica, Talend, Azure Data Factory, AWS Glue) offer pre-built connectors, visual pipeline design, and managed infrastructure. Lower engineering overhead to start, but vendor lock-in is real and customisation often requires dropping into code anyway.
Code-first frameworks (Apache Spark, pandas ETL pipeline scripts, PySpark) give full control and are cost-effective at scale, but require engineering investment and strong DevOps practices around deployment and monitoring.
Transformation-layer tools (dbt, Dataform) don't handle extraction or loading but excel at the transformation layer — bringing software engineering practices (version control, testing, documentation) to SQL-based transformations. Increasingly the standard for ELT patterns.
Orchestration tools (Apache Airflow, Prefect, Dagster) manage scheduling, dependencies, retries, and alerting. These are not ETL tools — they're the layer that ensures your ETL or ELT pipeline runs reliably.
The honest answer for most mid-market data teams: you probably need one managed connector layer, a transformation framework, and an orchestration tool. Trying to collapse all three into a single platform is where vendor lock-in and rigidity creep in.
If you're starting from scratch and want the right foundation before committing to a tool stack, understanding how your broader data pipeline architecture should be sequenced — covering platform selection, team structure, and governance — is the right starting point.
The Real Takeaway
An ETL pipeline isn't a project with a completion date. It's infrastructure that will outlive the tools it was built with, the team members who built it, and — if you're not careful — the business requirements it was designed to serve.
The pipelines that age well share a few things in common: they're observable, they're documented, they're designed around the assumption that everything upstream will eventually change, and they treat failure as something to plan for rather than something to explain after the fact.
The pipelines that become legacy nightmares? They were built to work today, not to survive tomorrow.
Classic Informatics has been building and evolving data pipelines for over 23 years — across 1,000+ clients in healthcare, manufacturing, and finance, in 30+ countries. We've seen what makes pipelines age well and what turns them into the thing the next data leader inherits and spends 18 months untangling. If you're designing a pipeline now or rearchitecting one that's starting to show its age, we're here to help you get the foundations right.
FAQS
Frequently Asked Questions
An ETL pipeline is a data workflow that extracts data from source systems, transforms it to match a target schema or business logic, and loads it into a destination such as a data warehouse or analytics platform. It's the infrastructure layer that determines whether analytics data is reliable, fresh, and trustworthy.
