Data Pipelines — 02
--
In my last writeup, I explained data pipelines to some extent, but that was the perspective from one end of the pipeline. A key part of a data pipeline project is also understanding the data activation needs, or domain needs, so to say. Historically, a common pattern was to create a data mart or data warehouse and run multidimensional OLAP queries against it. Various products support such a design and architecture, for example Microsoft's SSAS; MicroStrategy used to be another popular choice.
Nowadays, data no longer flows merely from a single source into a data warehouse or a data lake; it typically flows in both directions. IoT devices may even have their own data pipelines built in, which provide input and feedback to the larger system. This multidirectional flow is causing data volumes to grow exponentially. We also need to consider requirements like scalability, performance, and design for our future pipelines.
The data engineering process involves combining different data stores and manipulation tools, so it spans many data technologies, and a large part of the job is choosing the right ones for a given task.
Data ingestion tools — Cloud Dataflow, Kafka, Apache Airflow and AWS Glue are a few choices a data engineer has. These tools are aimed at large-scale data ingestion and low-latency processing through parallel execution of pipeline jobs. Event streaming platforms like Kafka make the entire process more resilient by offering durability and the abstraction of a distributed commit log. Other popular ETL and data solutions are the Stitch platform for rapidly moving data, and Blendo, a tool for syncing data from various sources to a data warehouse.
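To make the ingestion step concrete, here is a minimal sketch of publishing events to a Kafka topic with the kafka-python client. The broker address, topic name and event payload are assumptions for illustration, not part of any specific pipeline.

```python
# Minimal Kafka ingestion sketch (assumes the kafka-python package and a
# broker reachable at localhost:9092; the "clickstream" topic is illustrative).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # JSON-encode each event
    acks="all",  # wait until the write is replicated, for durability
)

event = {"user_id": 42, "action": "page_view", "ts": "2024-01-01T00:00:00Z"}
producer.send("clickstream", value=event)  # asynchronous send to the topic
producer.flush()                           # block until buffered events are delivered
```

Downstream consumers (or a stream processor) can then read from the same topic at their own pace, which is what the durable commit-log abstraction buys you.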
Warehouse solutions — Widely used data warehouse tools include Teradata Data Warehouse, SAP Data Warehouse and Oracle Exadata, with Amazon Redshift and Google BigQuery as cloud solutions.
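As a sketch of what querying a cloud warehouse looks like in practice, the snippet below runs an aggregate query against BigQuery with the google-cloud-bigquery client. The project, dataset, table and column names are placeholders, and it assumes credentials are already configured in the environment.

```python
# Querying a cloud warehouse (assumes the google-cloud-bigquery package and
# application-default credentials; dataset and table names are placeholders).
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

query = """
    SELECT country, COUNT(*) AS orders
    FROM `my_project.sales.orders`
    GROUP BY country
    ORDER BY orders DESC
    LIMIT 10
"""

for row in client.query(query).result():  # submit the job and wait for rows
    print(row.country, row.orders)
```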
Big data tools — Big data technologies can also be leveraged: the Hadoop Distributed File System (HDFS) for storage, search engines such as Elasticsearch, ETL and data platforms, and the Apache Spark analytics engine for large-scale data processing.
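For large-scale processing, a typical Spark batch job reads raw files from the lake, applies a transformation, and writes a curated result back out. The PySpark sketch below illustrates that pattern; the bucket paths and column names are placeholders.

```python
# Batch transformation with Apache Spark (PySpark); paths and column names
# are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_aggregation").getOrCreate()

# Read raw events landed in the data lake as Parquet files.
events = spark.read.parquet("s3a://my-bucket/raw/events/")

# Aggregate events per user per day.
daily = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("user_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)

# Write the curated result back, partitioned by date.
daily.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://my-bucket/curated/daily_events/"
)

spark.stop()
```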
Further, depending on your requirements (batch or real-time, open source or proprietary, on-premises or cloud), keep the following points in mind while choosing any set of tools —
- The set of tools you choose should allow you to quickly build a pipeline and set up your infrastructure in the minimum amount of time (see the orchestration sketch after this list).
- Minimum maintenance overhead of the pipeline.
- It should connect to multiple data sources.
- It should be able to transfer and load data without errors.
- Quick resolution from customer support if you run into any issue.
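As a rough illustration of how quickly a pipeline can be stood up with one of the orchestration tools mentioned earlier, here is a minimal Apache Airflow DAG (assuming Airflow 2.x). The DAG id, schedule and task bodies are placeholders; a real pipeline would replace them with actual extract and load logic.

```python
# Minimal Airflow 2.x DAG sketch: one extract task feeding one load task.
# The DAG id, schedule and task logic are placeholders for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Pull data from a source system (placeholder).
    print("extracting...")


def load():
    # Load the extracted data into the warehouse (placeholder).
    print("loading...")


with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # run extract before load
```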
This may still work in some broader or general scenarios, but in the modern PaaS age, agility, speed of data activation and business centricity matter a lot. There are different business-centric solutions. For instance, in the banking and finance industry there could be a need to retain, engage and grow your business or product lines. Similarly, for a marketing persona, customer acquisition and retention could be top priorities. DMP, CDP or Audience Manager type solutions could be far more relevant there than a plain data lake. A data-driven architecture is a foundational step to support such a strategy. If we zoom into the application layer of the data pipeline for such a use case, there is a specialized set of solutions to consider depending upon the business needs. Here, a CIM, a CDP or a pure-play Customer 360 could be leveraged.