Overview
In this technical presentation, we will delve into the fundamental concepts of Data Engineering, focusing on two pivotal components of modern data architecture - Data Lakes and Data Warehouses. We will explore their roles, differences, and how they collectively empower organizations to harness the true potential of their data.
- Follow this GitHub repo during the presentation: (Star the project to follow and get updates)
- Data Engineering Series: YouTube Video
Video Agenda
Introduction to Data Engineering:
Brief overview of the data engineering landscape and its critical role in modern data-driven organizations.
Operational Data
Understanding Data Lakes:
Explanation of what a data lake is and its purpose in storing vast amounts of raw and unstructured data.
Exploring Data Warehouses:
Definition of data warehouses and their role in storing structured, processed, and business-ready data.
Comparing Data Lakes and Data Warehouses:
Comparative analysis of data lakes and data warehouses, highlighting their strengths and weaknesses.
Discussing when to use each based on specific use cases and business needs.
Integration and Data Pipelines:
Insight into the seamless integration of data lakes and data warehouses within a data engineering pipeline.
Code walkthrough showcasing data movement and transformation between these two crucial components.
Real-world Use Cases:
Presentation of real-world use cases where effective use of data lakes and data warehouses led to actionable insights and business success.
Hands-on demonstration using Python, Jupyter Notebook and SQL to solidify the concepts discussed, providing attendees with practical insights and skills.
Q&A and Hands-on Session:
An interactive Q&A session to address any queries.
Conclusion:
This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data lakes and data warehouses. By the end of this presentation, participants will grasp how to effectively utilize these tools, enabling them to design efficient data solutions and drive informed business decisions.
Presentation
Data Engineering Overview
A Data Engineering Process involves executing the steps needed to understand the problem and to define the scope, design, and architecture of a solution. The result enables ongoing big data analysis using analytical and visualization tools.
Topics
- Data Lake and Data Warehouse
- Discovery and Data Analysis
- Design and Infrastructure Planning
- Data Lake - Pipeline and Orchestration
- Data Warehouse - Design and Implementation
- Analysis and Visualization
Follow this project: Give a star
Operational Data
Operational data is often generated by applications and stored in transactional relational databases such as SQL Server and Oracle, or in NoSQL (JSON document) databases such as MongoDB and Firebase. This is the data created when an application saves a user transaction, such as contact information, a purchase, or another activity performed within the application. A minimal code sketch of this pattern follows the lists below.
Features:
- Application support and transactions
- Relational (SQL) or document-based (NoSQL) data structures
- Small, targeted queries for individual case analysis
Not Best For:
- Reporting systems
- Large analytical queries
- Centralized big data systems
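
To make the idea concrete, here is a minimal sketch of how an application might record a purchase as a small transaction. Python's built-in sqlite3 module stands in for a production database such as SQL Server or Oracle, and the table and column names are hypothetical.

```python
# Minimal sketch of operational data: one small transaction per user action.
# sqlite3 stands in for SQL Server/Oracle; the schema below is hypothetical.
import sqlite3

conn = sqlite3.connect("operational.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS purchases (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_email TEXT NOT NULL,
        item TEXT NOT NULL,
        amount REAL NOT NULL,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)

# The application saves a single purchase inside a transaction.
with conn:  # commits on success, rolls back on error
    conn.execute(
        "INSERT INTO purchases (customer_email, item, amount) VALUES (?, ?, ?)",
        ("user@example.com", "notebook", 12.50),
    )

# A typical operational query: a small lookup for one case, not a large scan.
row = conn.execute(
    "SELECT item, amount FROM purchases WHERE customer_email = ?",
    ("user@example.com",),
).fetchone()
print(row)
conn.close()
```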
Data Lake - Analytical Data Staging
A Data Lake is an optimized storage system for Big Data scenarios. Its primary function is to store the data in its raw format without any transformation. Analytical data is the transactional data that has been extracted from a source system via a data pipeline as part of the data staging process (a small staging sketch follows the feature list).
Features:
- Store the data in its raw format without any transformation
- This can include structured data like CSV files, semi-structured data like JSON and XML documents, or column-based data like Parquet files
- Low cost for massive storage capacity
- Not designed for querying or data analysis
- Its files are typically accessed as external tables by other systems
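
As a rough illustration, the following sketch stages a raw file in a date-partitioned data lake folder without transforming it. A local folder stands in for cloud object storage (Azure Blob Storage, S3, or GCS), and the file names and path layout are hypothetical.

```python
# Minimal sketch: land a raw file in the data lake exactly as received.
# A local folder stands in for cloud object storage; paths are hypothetical.
import shutil
from datetime import datetime, timezone
from pathlib import Path

def stage_raw_file(source_path: str, lake_root: str = "datalake/raw") -> Path:
    """Copy a source file as-is into a date-partitioned staging path."""
    partition = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    target_dir = Path(lake_root) / partition
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(source_path).name
    shutil.copy2(source_path, target)  # no parsing, no transformation
    return target

if __name__ == "__main__":
    print(stage_raw_file("sample_transactions.csv"))
```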
Data Warehouse - Analytical Data
A Data Warehouse is a centralized storage system that stores integrated data from multiple sources. The system is designed to host and serve Big Data scenarios at a lower operational cost than transactional databases, but at a higher cost than a Data Lake. This system hosts the analytical data that has been processed and is ready for analytical purposes (a small query sketch follows the feature list).
Data Warehouse Features:
- Stores historical data in relational tables with an optimized schema, which enables the data analysis process
- Provides SQL support to query the data
- It can integrate external resources like CSV and Parquet files that are stored on Data Lakes as external tables
- The system is designed to host and serve Big Data scenarios. It is not meant to be used as a transactional system
- Storage is more expensive than a Data Lake
- Offloads archived data to Data Lakes
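
To show the idea of querying data lake files with SQL, here is a small sketch that treats Parquet files in a lake folder as an external source. DuckDB is used here only as a convenient local stand-in for a cloud warehouse engine, and the file paths and column names are hypothetical.

```python
# Minimal sketch: query data lake files as an external source using SQL.
# DuckDB stands in for a cloud warehouse; paths and columns are hypothetical.
import duckdb

sql = """
    SELECT station, SUM(amount) AS total_amount, COUNT(*) AS trips
    FROM read_parquet('datalake/raw/2023/*/*.parquet')
    GROUP BY station
    ORDER BY total_amount DESC
    LIMIT 10
"""
result = duckdb.sql(sql).df()  # returns a Pandas DataFrame
print(result)
```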
Discovery - Data Analysis
During the discovery phase of a Data Engineering Process, we look to identify and clearly document a problem statement, which helps us understand what we are trying to solve. We also define our analytical approach and make observations about the data, its structure, and its source. This leads us into defining the requirements for the project, so we can establish the scope, design, and architecture of the solution (a small exploration sketch follows the list below).
- Download sample data files
- Run experiments to make observations
- Write Python scripts using VS Code or Jupyter Notebooks
- Transform the data with Pandas
- Make charts with Plotly
- Document the requirements
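
A minimal exploration script, assuming a local CSV sample with hypothetical column contents, might look like this:

```python
# Minimal discovery sketch: load a sample file and make first observations.
# The file name and columns are hypothetical.
import pandas as pd

df = pd.read_csv("sample_transactions.csv")

print(df.shape)         # how many rows and columns
print(df.dtypes)        # data types help identify dimensions vs. measures
print(df.head())        # peek at the raw records
print(df.isna().sum())  # missing values to address during transformation
print(df.describe())    # basic distribution of the numeric measures
```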
Design and Planning
The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful system. It involves defining the system architecture, designing data pipelines, implementing source control practices, ensuring continuous integration and deployment (CI/CD), and leveraging tools like Docker and Terraform for infrastructure automation.
- Use GitHub for code repo and for CI/CD actions
- Use Terraform, an Infrastructure as Code (IaC) tool, to manage cloud resources across multiple cloud providers
- Use Docker containers to run the code and manage its dependencies
Data Lake - Pipeline and Orchestration
A data pipeline is essentially a workflow of tasks that can be executed in Docker containers. The execution, scheduling, management, and monitoring of the pipeline is referred to as orchestration. To support the operations of the pipeline and its orchestration, we need to provision a VM and a data lake, and monitor the cloud resources.
- This can be code-centric, leveraging languages like Python, as in the sketch after this list
- Or a low-code approach, utilizing tools such as Azure Data Factory, which provides a turn-key solution
- Monitor services enable us to track telemetry data
- Docker Hub and GitHub can be used for the CI/CD process
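
A code-centric pipeline can be as simple as a small set of Python tasks executed in order, each of which could run inside a Docker container. The sketch below is a rough outline with a hypothetical URL and paths; a production pipeline would add retries, scheduling, and monitoring.

```python
# Minimal code-centric pipeline sketch: extract a file and land it in the lake.
# The URL and paths are hypothetical; scheduling and monitoring are left out.
import logging
from pathlib import Path

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def extract(url: str) -> pd.DataFrame:
    log.info("extracting %s", url)
    return pd.read_csv(url)

def load_raw(df: pd.DataFrame, lake_path: str) -> Path:
    target = Path(lake_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(target, index=False)  # keep the raw shape of the data
    log.info("landed %s rows at %s", len(df), target)
    return target

def run() -> None:
    df = extract("https://example.com/data/trips_2023_01.csv")
    load_raw(df, "datalake/raw/2023/01/trips_2023_01.csv")

if __name__ == "__main__":
    run()
```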
Data Warehouse - Design and Implementation
In the design phase, we lay the groundwork by defining the database system, schema model, and technology stack required to support the data warehouse's implementation and operations. In the implementation phase, we focus on converting conceptual data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis. The goal is a repeatable and extendable process.
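
As a rough sketch, the transformation step below derives a dimension and a fact table from a raw staged file using Pandas. The column names and surrogate-key approach are hypothetical; in practice, the results would be loaded into warehouse tables by the scheduled batch process.

```python
# Minimal sketch: derive a dimension and a fact table from raw staged data.
# Column names and keys are hypothetical.
import pandas as pd

raw = pd.read_csv("datalake/raw/2023/01/trips_2023_01.csv")

# Dimension: unique station attributes with a surrogate key.
dim_station = (
    raw[["station_name", "borough"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
dim_station["station_key"] = dim_station.index + 1

# Fact: measures keyed by the dimension's surrogate key.
fact_trips = raw.merge(dim_station, on=["station_name", "borough"])[
    ["station_key", "trip_date", "amount"]
]

print(dim_station.head())
print(fact_trips.head())
```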
Data Warehouse - Data Analysis
Data analysis is the practice of exploring data and understanding its meaning. It involves activities that help us achieve a specific goal, such as identifying data dimensions and measures, and analyzing the data to identify outliers, trends, and distributions.
- We can accomplish these activities by writing Python and Pandas or SQL code in Visual Studio Code or Jupyter Notebooks.
- In addition, we can use libraries such as Plotly to generate visuals that help us further analyze the data and create prototypes, as in the sketch below.
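
For example, a quick outlier check on a numeric measure might use the interquartile range; the data file and column name below are hypothetical.

```python
# Minimal sketch: flag outliers in a measure using the interquartile range (IQR).
# The data file and column name are hypothetical.
import pandas as pd

df = pd.read_csv("fact_trips.csv")

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["amount"] < lower) | (df["amount"] > upper)]
print(f"{len(outliers)} potential outliers outside [{lower:.2f}, {upper:.2f}]")
```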
Data Analysis and Visualization
Data visualization is a powerful tool that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance.
- Dashboards, in particular, bring together various visual components like charts, graphs, and scorecards into a unified interface that can help us tell a story
- Use tools like Power BI, Looker, or Tableau to model the data and create enterprise-level visualizations; the sketch below uses Plotly for a quick prototype
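
As a quick prototype before moving to an enterprise tool, a single dashboard component can be sketched with Plotly. The pre-aggregated data file and column names are hypothetical.

```python
# Minimal sketch: one dashboard component as a Plotly bar chart.
# The data file and column names are hypothetical.
import pandas as pd
import plotly.express as px

summary = pd.read_csv("station_summary.csv")  # hypothetical pre-aggregated data

fig = px.bar(
    summary,
    x="station",
    y="total_amount",
    title="Top Stations by Total Amount",
)
fig.show()
```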
Conclusion
Both data lakes and data warehouses are essential components of a data engineering project. The primary function of a data lake is to store large amounts of operational data in its raw format, serving as a staging area for analytical processes. In contrast, a data warehouse acts as a centralized repository for information, enabling engineers to transform, process, and store extensive data. This allows the analytical team to utilize coding languages like Python and tools such as Jupyter Notebooks, as well as low-code platforms like Looker Studio and Power BI, to create enterprise-quality dashboards for the organization.
We've covered a lot today, but this is just the beginning!
If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.
Thanks for reading.
Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com