Modern Data Warehouse

A data warehouse (DW) is a core component of business intelligence and a central data repository for different sources. This poses many challenges, as the schema definition for those sources may be completely different from one another. In this presentation, we discuss how to create a data warehouse model that can support data from disparate data sources. We look at how to build a dimensional model that enables us to import data with different shapes into the data warehouse. We then create processes to transform and load the data using Apache Spark. Finally, we use Power BI to visualize the data from our data warehouse.
Learning Objectives:
Challenges when using disparate data sources (different models)
- The number of fields in the payload can differ. Some systems can send more data readings/facts than others.
- The fields can have different names even though the data represents the same reading.
- Each model can change independently of the other systems, as illustrated by the sample payloads below.
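To make this concrete, here is a minimal sketch of two hypothetical device payloads. The field names, counts, and shapes are illustrative assumptions for this post, not taken from the sample project:

```python
# Hypothetical telemetry payloads from two different source systems.
# System A uses short field names and sends two readings.
payload_a = {
    "device_id": "dev-100",
    "date": "2023-05-01T10:00:00Z",
    "region": "north",
    "temp": 72.5,          # temperature reading
    "hum": 41.0,           # humidity reading
}

# System B names the same readings differently and sends an extra one.
payload_b = {
    "serial_number": "SN-200",
    "timestamp": "2023-05-01T10:00:00Z",
    "location": "south",
    "temperature": 69.8,   # same reading as "temp" above
    "humidity": 38.5,      # same reading as "hum" above
    "pressure": 29.9,      # a reading only this system sends
}
```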
Define a common model to support the different model shapes
- We need to identify the common fields across all the data models that can identify a unique entry, for example the date, a serial number or device id, and a regional location.
- Identify the dimensions from the actual entries across all these models. The dimensions are used to build other tables that can provide structured information about the readings.
- We need to identify the measurements or readings and support their different names.
- For performance, we would like to be able to read the entries without the overhead of processing all the measurements.
- A star schema design can be used to associate the dimension tables with the fact tables, as sketched below.
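As a rough illustration, the star schema tables could be declared as Spark schemas like this. The table and column names are assumptions for this sketch; the sample project may use different ones:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType,
    DoubleType, TimestampType,
)

# Dimension: catalogs each device (hypothetical columns).
dim_device_schema = StructType([
    StructField("device_key", LongType(), False),    # hashed surrogate key
    StructField("device_id", StringType(), False),
])

# Dimension: catalogs the measurement names found in the sources.
dim_measure_schema = StructType([
    StructField("measure_key", LongType(), False),
    StructField("measure_name", StringType(), False),
])

# Fact: one row per unique entry, linked to the dimensions, so entries
# can be read without processing all the measurements.
fact_entry_schema = StructType([
    StructField("entry_key", LongType(), False),     # hashed unique id
    StructField("device_key", LongType(), False),    # FK to the device dimension
    StructField("entry_date", TimestampType(), False),
])

# Fact: one row per measurement, transposed from the source columns.
fact_measurement_schema = StructType([
    StructField("entry_key", LongType(), False),     # FK to the fact entry
    StructField("measure_key", LongType(), False),   # FK to the measure dimension
    StructField("value", DoubleType(), True),
])
```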
Strategy for processing the different models into the common model with Apache Spark
- Load each data source into a data frame and map its fields into a data set with well-defined data types.
- Parse the dimension fields into a different data set. Create a unique numeric hash key that can be used as a foreign key on the relational data model.
- Map all the measurements from their data source names to the corresponding fields on the common model.
- Create a dimension for the field names to catalog the measurement fields, thus avoiding having to scan all the data to get the unique names.
- Transpose all the measurements from columns to rows. This enables us to support a different number of columns as well as field changes on the data source. A minimal sketch of these steps follows this list.
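The following PySpark sketch walks through this strategy using the hypothetical payload fields from earlier. The file path, column names, and the choice of xxhash64 (available in Spark 3.x) for the hash key are assumptions, not taken from the sample project:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dw-transform").getOrCreate()

# Load a source into a data frame and cast the fields to
# well-defined data types (the path is hypothetical).
df = (spark.read.option("header", True).csv("/data/source_a.csv")
      .withColumn("temp", F.col("temp").cast("double"))
      .withColumn("hum", F.col("hum").cast("double")))

# Parse the dimension fields into a separate data set and derive a
# numeric hash key to serve as the foreign key on the relational model.
dim_device = (df.select("device_id").distinct()
              .withColumn("device_key", F.xxhash64("device_id")))

# Map the source-specific measurement names to the common model names.
df = (df.withColumnRenamed("temp", "temperature")
        .withColumnRenamed("hum", "humidity"))

# Transpose the measurement columns into rows so that sources with a
# different number of columns, or renamed columns, share one shape.
measures = df.selectExpr(
    "device_id", "date",
    "stack(2, 'temperature', temperature, 'humidity', humidity) "
    "as (measure_name, value)",
)
```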
Data Pipeline for Loading the Data into the Data Warehouse with Apache Spark
- First, load the new unique dimension records from the Spark tables into the dimension tables in the data warehouse.
- Load the fact entries with all the dimension relationships. Create a unique hash id for each entry.
- Load the fact measurements with the entry hash id relationship, which maps each reading to a unique entry. A sketch of this load step follows.
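Here is a minimal sketch of the load step using Spark's JDBC writer, assuming the data frames from the previous sketch and a hypothetical SQL Server target; the connection settings and table names are placeholders, and real pipelines usually add merge/upsert logic to avoid duplicate keys:

```python
# Hypothetical connection settings for the target data warehouse.
jdbc_url = "jdbc:sqlserver://dw-host:1433;databaseName=dw"
props = {
    "user": "etl_user",
    "password": "secret",  # placeholder; read from a secret store in practice
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# 1. Load the new dimension records first so the foreign keys resolve.
dim_device.write.jdbc(jdbc_url, "dbo.dim_device", mode="append",
                      properties=props)

# 2. Load the fact entries with a unique hash id and the dimension keys.
fact_entry = (measures.select("device_id", "date").distinct()
              .withColumn("entry_key", F.xxhash64("device_id", "date"))
              .withColumn("device_key", F.xxhash64("device_id")))
fact_entry.write.jdbc(jdbc_url, "dbo.fact_entry", mode="append",
                      properties=props)

# 3. Load the measurements keyed by the same entry hash id, which
#    maps each reading back to its unique entry.
fact_measurement = (measures
    .withColumn("entry_key", F.xxhash64("device_id", "date"))
    .select("entry_key", "measure_name", "value"))
fact_measurement.write.jdbc(jdbc_url, "dbo.fact_measurement", mode="append",
                            properties=props)
```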
The sample project with the Spark and SQL code as well as demo data can be found at this location:
https://github.com/ozkary/modern-data-warehouse-spark/
By following this high-level plan, we should be able to create a data warehouse that supports disparate data sources with different models using modern data processing tooling.