Data Pipelines — 01

Oct 22, 2020

Literally, a pipeline is a linear sequence of specialized modules. A data pipeline is simply a series of steps that moves data from one system to another.
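
As a rough sketch of this idea in Python, a pipeline can be modelled as an ordered list of functions, each receiving the output of the previous one. The step names (`read_source`, `drop_empty`, `write_target`) and the in-memory lists standing in for the two systems are hypothetical, chosen only for illustration.

```python
from typing import Callable, List

Record = dict
Step = Callable[[List[Record]], List[Record]]

# Hypothetical in-memory stand-ins for a source system and a target system.
SOURCE: List[Record] = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": ""},
    {"id": 3, "name": "c"},
]
TARGET: List[Record] = []


def read_source(_: List[Record]) -> List[Record]:
    """First step: ignore the (empty) input and pull records from the source."""
    return list(SOURCE)


def drop_empty(records: List[Record]) -> List[Record]:
    """Middle step: keep only records that have a non-empty name."""
    return [r for r in records if r["name"]]


def write_target(records: List[Record]) -> List[Record]:
    """Last step: append the records to the target system."""
    TARGET.extend(records)
    return records


def run_pipeline(steps: List[Step]) -> List[Record]:
    """Run the steps in order, feeding each step the previous step's output."""
    data: List[Record] = []
    for step in steps:
        data = step(data)
    return data


result = run_pipeline([read_source, drop_empty, write_target])
```

Because each step only depends on the output of the one before it, steps can be added, removed, or reordered without touching the others.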

Why Data Pipelines?

To ensure a consistent flow of data from one location to another.
To carry out useful analysis: analysis cannot begin until data from all the relevant systems is available.
To combine data from multiple systems in ways that make sense for analysis.
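
To make the last point concrete, here is a small sketch of combining data from two systems. The "CRM" and "billing" systems, their field names, and the sample records are all made up for illustration; the only assumption is that both systems share a `customer_id` key.

```python
from collections import defaultdict

# Hypothetical extracts from two separate systems that share a customer_id.
crm_customers = [
    {"customer_id": 1, "name": "Asha"},
    {"customer_id": 2, "name": "Ben"},
]
billing_orders = [
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 25.0},
]


def total_spend_by_customer(customers, orders):
    """Join the two data sets on customer_id and sum each customer's orders."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
    # Customers with no orders get a total of 0.0 from the defaultdict.
    return [
        {"name": c["name"], "total_spend": totals[c["customer_id"]]}
        for c in customers
    ]


combined = total_spend_by_customer(crm_customers, billing_orders)
```

Neither system alone can answer "how much has each named customer spent?"; the answer only exists once the two data sets are brought together.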

Data Pipelines and ETLs

Although the terms "data pipeline" and "ETL" are often used interchangeably these days, they are not the same. Traditionally, ETLs were designed for processing data in batches: extracting data from one system, transforming it, and loading it into another. Data pipelines also transfer data from one system to another. So what's the difference?

Difference between Data Pipelines and ETLs

I could gather only these two differences so far:

Data pipelines may or may not involve data transformation.
Data pipelines could involve one or more ETLs.

Thus, ETLs are effectively a subset of data pipelines.
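
The two differences can be sketched in a few lines of Python. This is only an illustration under simple assumptions: plain lists stand in for the systems involved, and the function names (`copy_pipeline`, `etl`, `multi_etl_pipeline`) are hypothetical.

```python
def copy_pipeline(source, target):
    """A data pipeline with no transformation: records are moved as-is."""
    target.extend(source)


def etl(source, transform, target):
    """One ETL job: extract from source, transform each record, load into target."""
    extracted = list(source)                          # extract
    transformed = [transform(r) for r in extracted]   # transform
    target.extend(transformed)                        # load


def multi_etl_pipeline(raw, staging, warehouse):
    """A data pipeline made of two ETL jobs chained together."""
    etl(raw, str.strip, staging)        # first ETL: trim whitespace
    etl(staging, str.upper, warehouse)  # second ETL: normalise case


raw = ["  alice ", "bob  "]
archive, staging, warehouse = [], [], []

copy_pipeline(raw, archive)                  # pipeline, but not an ETL
multi_etl_pipeline(raw, staging, warehouse)  # pipeline containing two ETLs
```

Both `copy_pipeline` and `multi_etl_pipeline` move data between systems, so both count as data pipelines, but only the second involves ETL.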

Which Use Cases require Data Pipelines?

The business requires data from multiple sources.
There is a requirement for real-time analysis.
Multiple data silos exist.

Various Types of Data Pipeline Solutions

Data pipeline solutions come in various types, such as batch processing, real-time processing, and cloud-native solutions. One could also opt for open-source tools to save on cost.
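
As a rough illustration of how batch and real-time processing differ, the sketch below applies the same per-record logic in both styles. The `process` function and the sample events are hypothetical, and a plain Python iterator stands in for a live event feed.

```python
from typing import Iterable, Iterator, List


def process(event: dict) -> dict:
    """Per-record logic shared by both styles (hypothetical example)."""
    return {**event, "processed": True}


def batch_process(events: List[dict]) -> List[dict]:
    """Batch: wait until all events are collected, then process them together."""
    return [process(e) for e in events]


def stream_process(events: Iterable[dict]) -> Iterator[dict]:
    """Real-time style: hand back each result as soon as its event arrives."""
    for e in events:
        yield process(e)


events = [{"id": 1}, {"id": 2}]

batch_result = batch_process(events)      # all results available at once
stream = stream_process(iter(events))     # nothing processed yet
first = next(stream)                      # only the first event is processed
```

The processing logic is identical; what changes is when the results become available, which is why the choice usually comes down to how fresh the analysis needs to be.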

I will discuss these solutions and techniques in subsequent write-ups.