Build ETL Pipeline With PySpark on AWS EC2 (2) — Build Relational Data Model

Chuan Zhang
3 min read · Nov 19, 2020

In the last post, I demonstrated how to set up a PySpark environment on an EC2 instance. In this post, I briefly introduce what a data pipeline is, and demonstrate with examples how to build a relational data model from unstructured/semi-structured data.

Data Pipeline

Generally speaking, a data pipeline is “a set of data processing elements connected in series, where the output of one element is the input of the next one”.

In terms of architecture, there are two types of data pipelines: batch data pipelines and stream data pipelines. The diagram below illustrates a typical batch data pipeline, which usually consists of an ETL pipeline and a data warehouse.

The two pipeline types are not mutually exclusive; in many cases, both are included in the same data system. The figure below shows Redfin's data pipeline on AWS. It uses the Lambda Architecture, which combines a traditional batch pipeline with a fast real-time stream pipeline.

Along the batch pipeline, Redfin uploads its data to S3, using S3 as its staging area. Its ETL pipeline extracts datasets from S3, transforms them using EMR services, and loads them into its Redshift data warehouse. Finally, the processed data in the warehouse is used to generate reports, support business intelligence, and so on.
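As a rough illustration, the sketch below shows what one leg of such a batch pipeline could look like in PySpark: reading a raw dataset from S3, applying a simple cleaning step, and writing the result to Redshift over JDBC. The bucket, table, endpoint, and column names are all hypothetical, and a production pipeline on EMR would more likely use the dedicated Redshift connector (which stages data through S3 with COPY) rather than plain JDBC.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-to-redshift-sketch").getOrCreate()

# Extract: read raw CSV files from the S3 staging area
# (hypothetical bucket/path; requires the hadoop-aws S3A connector to be configured)
listings = spark.read.csv("s3a://example-staging-bucket/listings/",
                          header=True, inferSchema=True)

# Transform: a trivial cleaning step -- drop rows missing a price, normalize city names
cleaned = (listings
           .filter(F.col("price").isNotNull())
           .withColumn("city", F.lower(F.trim(F.col("city")))))

# Load: write the cleaned data into a Redshift table over JDBC
# (hypothetical endpoint and credentials; the Redshift JDBC driver must be on the classpath)
(cleaned.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "public.listings_clean")
    .option("user", "etl_user")
    .option("password", "********")
    .mode("append")
    .save())
```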

Stream data pipelines are out of the scope of our discussion. In the rest of this and subsequent posts, I will focus mainly on batch data pipelines; unless a type is explicitly specified, all pipelines described are batch pipelines.

In general, the source data of a pipeline can come from different systems and in different formats, such as operational/ERP/CRM systems (SQL or NoSQL databases), file storage systems, and so on. A data pipeline starts with a series of ETL operations, which extract and clean the data from the different sources and load them into the staging area. Next, another series of ETL operations extracts, reorganizes (maps/deduplicates/models), and loads the data from the staging area into the databases of the data warehouse. Among the different operations along a data pipeline, building the data model is a very important step. In the next section, I demonstrate how to build a relational data model with an example.
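To make the staging-to-warehouse step concrete, here is a minimal PySpark sketch of the kind of reorganization described above: deduplicating staged records and mapping them onto a modeled dimension table with a surrogate key. The paths, table, and column names are made up for illustration and are not part of the original example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("staging-to-warehouse-sketch").getOrCreate()

# Read records that the first ETL stage landed in the staging area (hypothetical path)
staged = spark.read.parquet("s3a://example-staging-bucket/customers/")

# Deduplicate on the business key, keeping only the most recent record per customer
w = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
deduped = (staged
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))

# Map to the warehouse model: generate a surrogate key and keep the modeled columns
dim_customer = (deduped
                .withColumn("customer_key", F.monotonically_increasing_id())
                .select("customer_key", "customer_id", "name", "city", "updated_at"))

# Load the modeled dimension into the warehouse area (here simply another S3 prefix)
dim_customer.write.mode("overwrite").parquet("s3a://example-warehouse-bucket/dim_customer/")
```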

Building Relational Data Model

In this section, we take a JSON file downloaded from the internet as the source data; it mimics information about movies scraped from Wikipedia.

(under construction …)
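While the worked example itself is still under construction, below is a minimal sketch of the kind of normalization it would involve: reading a semi-structured JSON file with PySpark and splitting it into flat relational tables. The file name movies.json and its fields (title, year, cast, genres) are assumptions for illustration, not the author's actual dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("movies-relational-model").getOrCreate()

# Read the semi-structured source file (assumed to be a single JSON array of records,
# hence multiLine=True; each record has a title, a year, a cast list, and a genres list)
movies_raw = spark.read.json("movies.json", multiLine=True)

# Assign a surrogate key once, so that all derived tables share the same movie_id
movies_keyed = movies_raw.withColumn("movie_id", F.monotonically_increasing_id())

# "movies" table: one row per movie with its scalar attributes
movies = movies_keyed.select("movie_id", "title", "year")

# Bridge tables: explode the array columns into one row per movie/genre
# and one row per movie/cast member -- a simple first-normal-form decomposition
movie_genres = movies_keyed.select("movie_id", F.explode("genres").alias("genre"))
movie_cast = movies_keyed.select("movie_id", F.explode("cast").alias("actor"))

movies.show(5)
movie_genres.show(5)
movie_cast.show(5)
```

These flat tables can then be written out (to Parquet, Redshift, or any other target) as the relational model that the rest of the pipeline consumes.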
