10/31/24

A Hands-On Exploration into the Discovery Phase - Data Engineering Process Fundamentals

Overview

The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.

In this session, we will delve into the essential building blocks of data engineering, placing a spotlight on the discovery process. From framing the problem statement to navigating the intricacies of exploratory data analysis (EDA) using Python, VSCode, Jupyter Notebooks, and GitHub, you'll gain a solid understanding of the fundamental aspects that drive effective data engineering projects.

A Hands-On Exploration into the Discovery Phase - Data Engineering Process Fundamentals

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

Jupyter Notebook

👉 https://github.com/ozkary/data-engineering-mta-turnstile/blob/main/Step1-Discovery/mta_discovery.ipynb

  • Data engineering Series:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

Jupyter Notebook Preview

# Standard library imports
from time import time
from pathlib import Path
import requests
from io import StringIO
# Load pandas support for data analysis tasks, dataframe (two-dimensional data structure with rows and columns) management
import pandas as pd    
import numpy as np 

# URL of the file you want to download. Note: It should be a Saturday date
url = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_241026.txt'

# Download the file in memory
response = requests.get(url)
response.raise_for_status()  # Check if the request was successful

# Create a DataFrame from the downloaded content
data = StringIO(response.text)
df = pd.read_csv(data)

# Display the DataFrame first 10 rows
df.head(10)

# use info to get the column names, data types and null counts
df.info()

# remove spaces from the column names and type cast the numeric columns
df.columns = [column.strip() for column in df.columns]
print(df.columns)
df["ENTRIES"] = df["ENTRIES"].astype(int)
df["EXITS"] = df["EXITS"].astype(int)

# Define the set of special characters you want to check for
special_characters_set = set('@#$%/')


def has_special_characters(col, special_characters):
    # Check if any character in the column name is in the specified special character set
    return any(char in special_characters for char in col)

def rename_columns(df, special_characters_set):
    # Create a mapping of old column names to new column names
    mapping = {col: ''.join(char for char in col if char.isalnum() or char not in special_characters_set) for col in df.columns}

    print(mapping)
    # Rename columns using the mapping
    df_renamed = df.rename(columns=mapping)

    return df_renamed


# Identify columns with special characters using list comprehension syntax
columns_with_special_characters = [col for col in df.columns if has_special_characters(col, special_characters_set)]

# Print the result
print("Columns with special characters:", columns_with_special_characters)

# Identify columns with special characters and rename them
df = rename_columns(df, special_characters_set)

# Display the data frame again. There should be no column names with special characters
print(df.info())

YouTube Video

Video Agenda

  1. Introduction:

    • Unveiling the importance of the discovery process in data engineering.

    • Setting the stage with a real-world problem statement that will guide our exploration.

  2. Setting the Stage:

    • Downloading and comprehending sample data to kickstart our discovery journey.

    • Configuring the development environment with VSCode and Jupyter Notebooks.

  3. Exploratory Data Analysis (EDA):

    • Delving deep into EDA techniques with a focus on the discovery phase.

    • Demonstrating practical approaches using Python to uncover insights within the data.

  4. Code-Centric Approach:

    • Advocating the significance of a code-centric approach during the discovery process.

    • Showcasing how a code-centric mindset enhances collaboration, repeatability, and efficiency.

  5. Version Control with GitHub:

    • Integrating GitHub seamlessly into our workflow for version control and collaboration.

    • Managing changes effectively to ensure a streamlined data engineering discovery process.

  6. Real-World Application:

    • Applying insights gained from EDA to address the initial problem statement.

    • Discussing practical solutions and strategies derived from the discovery process.

Key Takeaways:

  • Mastery of the foundational aspects of data engineering.

  • Hands-on experience with EDA techniques, emphasizing the discovery phase.

  • Appreciation for the value of a code-centric approach in the data engineering discovery process.

Some of the technologies that we will be covering:

  • Python
  • Data Analysis and Visualization
  • Jupyter Notebook
  • Visual Studio Code

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Topics

  • Importance of the Discovery Process
  • Setting the Stage - Technologies
  • Exploratory Data Analysis (EDA)
  • Code-Centric Approach
  • Version Control
  • Real-World Use Case

Follow this project: Give it a star

👉 Data Engineering Process Fundamentals

Importance of the Discovery Process

The discovery process involves identifying the problem, analyzing data sources, defining project requirements, establishing the project scope, and designing an effective architecture to address the identified challenges.

  • Clearly document the problem statement to understand the challenges the project aims to address.
  • Make observations about the data, its structure, and sources during the discovery process.
  • Define project requirements based on the observations, enabling the team to understand the scope and goals.
  • Clearly outline the scope of the project, ensuring a focused and well-defined set of objectives.
  • Use insights from the discovery phase to inform the design of the solution, including data architecture.
  • Develop a robust project architecture that aligns with the defined requirements and scope.

Data Engineering Process Fundamentals - Discovery Process

Setting the Stage - Technologies

To set the stage, we need to identify and select the tools that can facilitate the analysis and documentation of the data. Here are key technologies that play a crucial role in this stage:

  • Python: A versatile programming language with rich libraries for data manipulation, analysis, and scripting.

Use Cases: Data download, cleaning, exploration, and scripting for automation.

  • Jupyter Notebooks: An interactive tool for creating and sharing documents containing live code, visualizations, and narrative text.

Use Cases: Exploratory data analysis, documentation, and code collaboration.

  • Visual Studio Code: A lightweight, extensible code editor with powerful features for source code editing and debugging.

Use Cases: Writing and debugging code, integrating with version control systems like GitHub.

  • SQL (Structured Query Language): A domain-specific language for managing and manipulating relational databases.

Use Cases: Querying databases, data extraction, and transformation.

Data Engineering Process Fundamentals - Discovery Tools

Exploratory Data Analysis (EDA)

EDA is our go-to method for downloading, analyzing, understanding and documenting the intricacies of the datasets. It's like peeling back the layers of information to reveal the stories hidden within the data. Here's what EDA is all about:

  • EDA is the process of analyzing data to identify patterns, relationships, and anomalies, guiding the project's direction.

  • Python and Jupyter Notebook collaboratively empower us to download, describe, and transform data through live queries.

  • Insights gained from EDA set the foundation for informed decision-making in subsequent data engineering steps.

  • Code written in a Jupyter Notebook can be exported and used as the starting point for components of the data pipeline and transformation services (a minimal example follows below).
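
As a minimal illustration, the sketch below profiles the turnstile DataFrame loaded in the notebook preview above; note that ENTRIES and EXITS are cumulative counters in the source file, so these totals are only a rough profiling signal.

# Profile the numeric measures and check for missing values
print(df[["ENTRIES", "EXITS"]].describe())
print(df.isnull().sum())

# Aggregate the measures by station to spot high-traffic locations
station_totals = (
    df.groupby("STATION")[["ENTRIES", "EXITS"]]
      .sum()
      .sort_values("ENTRIES", ascending=False)
)
print(station_totals.head(10))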

Data Engineering Process Fundamentals - Discovery Pie Chart

Code-Centric Approach

A code-centric approach, using programming languages and tools in EDA, helps us understand the coding methodology for building data structures, defining schemas, and establishing relationships. This robust understanding seamlessly guides project implementation.

  • Code delves deep into data intricacies, revealing integration and transformation challenges often unclear with visual tools.

  • Using code taps into the Pandas and NumPy libraries, empowering robust manipulation of data frames, definition of loading schemas, and handling of transformation needs.

  • Code-centricity enables sophisticated analyses, covering aggregation, distribution, and in-depth examinations of the data (see the sketch after this list).

  • While visual tools have their merits, a code-centric approach excels in hands-on, detailed data exploration, uncovering subtle nuances and potential challenges.
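
As a brief sketch of the kind of analysis code unlocks, the snippet below uses Pandas and NumPy to examine the distribution of the measures and derive an illustrative NET_FLOW column; it assumes the cleaned df from the notebook preview above.

import numpy as np

# Examine the distribution of the ENTRIES counter
percentiles = np.percentile(df["ENTRIES"], [25, 50, 75, 95])
print("ENTRIES percentiles (25/50/75/95):", percentiles)

# Review how records distribute across the DESC categories
print(df["DESC"].value_counts())

# Derive a hypothetical measure and summarize it by station
df["NET_FLOW"] = df["ENTRIES"] - df["EXITS"]
print(df.groupby("STATION")["NET_FLOW"].describe().head())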

Data Engineering Process Fundamentals - Discovery Pie Chart

Version Control

Using a tool like GitHub is essential for effective version control and collaboration in our discovery process. GitHub enables us to track our exploratory code and Jupyter Notebooks, fostering collaboration, documentation, and comprehensive project management. Here's how GitHub enhances our process:

  • Centralized Tracking: GitHub centralizes tracking and managing our exploratory code and Jupyter Notebooks, ensuring a transparent and organized record of our data exploration.

  • Sharing: Easily share code and Notebooks with team members on GitHub, fostering seamless collaboration and knowledge sharing.

  • Documentation: GitHub supports Markdown, enabling comprehensive documentation of processes, findings, and insights within the same repository.

  • Project Management: GitHub acts as a project management hub, facilitating CI/CD pipeline integration for smooth and automated delivery of data engineering projects.

Data Engineering Process Fundamentals - Discovery Problem Statement

Summary: The Power of Discovery

By mastering the discovery phase, you lay a strong foundation for successful data engineering projects. A thorough understanding of your data is essential for extracting meaningful insights.

  • Understanding Your Data: The discovery phase is crucial for understanding your data's characteristics, quality, and potential.
  • Exploratory Data Analysis (EDA): Use techniques to uncover patterns, trends, and anomalies.
  • Data Profiling: Assess data quality, identify missing values, and understand data distributions.
  • Data Cleaning: Address data inconsistencies and errors to ensure data accuracy.
  • Domain Knowledge: Leverage domain expertise to guide data exploration and interpretation.
  • Setting the Stage: Choose the right language and tools for efficient data exploration and analysis.

The data engineering discovery process involves defining the problem statement, gathering requirements, and determining the scope of work. It also includes a data analysis exercise utilizing Python and Jupyter Notebooks or other tools to extract valuable insights from the data. These steps collectively lay the foundation for successful data engineering endeavors.

Thanks for reading.

Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com

9/25/24

Live Dashboards: Boosting App Performance with Real-Time Integration

Overview

Dive into the future of web applications. We're moving beyond traditional API polling and embracing real-time integration. Imagine your client app maintaining a persistent connection with the server, enabling bidirectional communication and live data streaming. We'll also tackle scalability challenges and integrate Redis as our in-memory data solution.

Live Dashboards: Boosting App Performance with Real-Time Integration

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/Realtime-Apps-with-Nodejs-Angular-Socketio-Redis

YouTube Video

Video Agenda

This presentation explores strategies for building highly responsive and interactive live dashboards. We'll delve into the challenges of traditional API polling and demonstrate how to leverage Node.js, Angular, Socket.IO, and Redis to achieve real-time updates and a seamless user experience.

  • Introduction:

    • Understanding telemetry data and the importance of monitoring it
    • Challenges of traditional API polling for real-time data.
    • Design patterns to enhance an app with minimal changes
  • Traditional Solution Architecture

    • SQL Database Integration.
    • Restful API
    • Angular and Node.js Integration
  • Real-Time Integration with Web Sockets

    • Database Optimization Challenges
    • Introduction to Web Sockets for bidirectional communication.
    • Implementing Web Sockets in a Web application.
    • Handling data synchronization and consistency.
  • Distributed Caching with Redis:

    • Benefits of in-memory caching for improving performance and scalability.
    • Integrating Redis into your Node.js application.
    • Caching strategies for distributed systems.
  • Case Study: Building a Live Telemetry Dashboard

    • Step-by-step demonstration of the implementation.
    • Performance comparison with and without optimization techniques.
    • User experience benefits of real-time updates.
  • Benefits and Considerations

    • Improved dashboard performance and responsiveness.
    • Reduced server load and costs.
    • Scalability considerations.
    • Best practices for implementing real-time updates.

Why Attend:

Gain a deep understanding of real-time data integration for your Web application.

Presentation

Telemetry Data Story

Devices send telemetry data via API integration with SQL Server. There are inherent performance problems with a disk-based database. We progressively enhance the system with minimal changes by adding real-time integration and an in-memory cache system.

Live Dashboards: Real-time dashboard

Database Integration

Solution Architecture

  • Disk-based Storage
  • Web apps and APIs query the database to get the data
  • Applications perform both high-volume reads and writes
  • Web components and charts poll the back-end database for reads

Let’s Start our Journey

  • Review our API integration and talk about concerns
  • Do not refactor everything
  • Enhance to real-time integration with sockets
  • Add Redis as the distributed cache
  • Add the service broker strategy to sync the data sources
  • Centralize the real-time integration with Redis

Live Dashboards: Direct API Integration

RESTful API Integration

Applied Technologies

  • REST API Written with Node.js
  • TypeORM Library Repository
  • Angular Client Application with Plotly.js Charts
  • Disk-based storage – SQL Server
  • API Telemetry (GET, POST) route

Use Case

  • IoT devices report telemetry information via API
  • The dashboard reads only the most recent data via API calls, which query the storage service
  • Polling the database to get new records (a minimal polling sketch follows below)
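
To make the polling pattern concrete, here is a minimal Python sketch; the presentation's client is Angular/Node.js, and the endpoint name and interval are assumptions for illustration.

import time
import requests

API_URL = "http://localhost:3000/api/telemetry"  # hypothetical endpoint

# Poll the API on a fixed interval; each call queries the disk-based database
while True:
    response = requests.get(API_URL, params={"limit": 50})
    response.raise_for_status()
    records = response.json()
    print(f"Fetched {len(records)} telemetry records")
    time.sleep(30)  # polling interval in seconds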

Project Repo (Star the project and follow) https://github.com/ozkary/Realtime-Apps-with-Nodejs-Angular-Socketio-Redis

Live Dashboards: Repository Integration

Database Optimization and Challenges

Slow Queries on disk-based storage

  • Effort on index optimization
  • Database Partition strategies
  • Double-digit millisecond average latency (the physics of disk-based storage)

Simplify data access strategies

  • Relational data is not optimal for read-heavy systems (costly joins)
  • Structure needs to be de-normalized
  • Views are often created to shape the data and limit date ranges

Database Contention

  • Read isolation levels (nolock)
  • Reads competing with inserts

Cost to Scale

  • Vertical and horizontal scaling of resources
  • Database read-replicas to separate reads and writes
  • Replication workloads/tasks
  • Data lakes and data warehouse

Live Dashboards: SQL Query

Real-Time Integration

What is Socket.io, Web Sockets?

  • Enables real-time bidirectional communication.
  • Push data to clients as events take place on the server
  • Data streaming
  • The connection starts as HTTP and is then promoted to Web Sockets

Additional Technologies: Socket.IO (SignalR for .NET) for both client and server components

Use Case

  • IoT devices report telemetry information via sockets. All subscribed clients get the information as an event which updates the dashboard

Demo

  • Update both the server and client to support Web Sockets (a minimal server sketch follows below)
  • Use device demo tool to connect and automate the telemetry data to the server
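
The demo server is built with Node.js and Socket.IO; as a rough sketch of the same idea in Python, the snippet below uses the python-socketio package to accept telemetry events and push them to every connected dashboard client (event names and port are assumptions).

import eventlet
import socketio

# Socket.IO server: clients connect once and receive pushed events
sio = socketio.Server(cors_allowed_origins="*")
app = socketio.WSGIApp(sio)

@sio.event
def connect(sid, environ):
    print("dashboard client connected:", sid)

@sio.event
def telemetry(sid, data):
    # A device reported telemetry; broadcast it to all connected clients
    sio.emit("telemetry", data)

if __name__ == "__main__":
    eventlet.wsgi.server(eventlet.listen(("", 5000)), app)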

Live Dashboards: Web Socket Integration

Distributed Cache Strategy

Why Use a Cache?

  • Data is stored in-memory
  • Sub-millisecond average speed
  • Cache-Aside Pattern
    • Read from the cache first (cache hit); fall back to the database on a cache miss
    • Update cache on cache miss
  • Write-Through
    • Write to cache and database
    • Maintain both systems updated
  • Improves app performance
  • Reduces load on the database (a cache-aside sketch follows below)
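
A minimal sketch of the cache-aside and write-through patterns using the redis-py client is shown below; the key naming and the database helper functions are assumptions for illustration.

import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def read_telemetry(device_id, query_db):
    # Cache-aside: read from the cache first, fall back to the database on a miss
    key = f"telemetry:{device_id}"  # hypothetical key naming
    cached = cache.get(key)
    if cached:  # cache hit
        return json.loads(cached)
    record = query_db(device_id)  # cache miss: query the database
    cache.set(key, json.dumps(record), ex=60)  # update the cache with a TTL
    return record

def write_telemetry(record, insert_db):
    # Write-through: keep both the database and the cache updated
    insert_db(record)
    cache.set(f"telemetry:{record['deviceId']}", json.dumps(record), ex=60)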

Application Changes

  • Changes are only done on the server
  • No changes on client-side

Live Dashboards: Cache Architecture

Redis and Socket.io Integration

What is Redis?

  • Key-value store; keys can contain strings (JSON), hashes, lists, sets, and sorted sets
  • Redis supports a set of atomic operations on these data types (each operation completes fully or not at all)
  • Other features include transactions, publish/subscribe, and a limited time to live (TTL)
  • You can use Redis from most of today's programming languages using client libraries

Use Case

  • As application load and data frequency increase, we need to use a cache for performance. We also need to centralize the events so that all the socket servers behind a load balancer can notify the clients. Both storage and cache are updated

Demo

  • Start redis-cli on Ubuntu and show some inserts, reads, and sync events (a Python equivalent follows this list).
    • sudo service redis-server restart
    • redis-cli -c -p 6379 -h localhost
    • zadd table:data 100 "{data:'100'}"
    • zrangebyscore table:data 100 200
    • subscribe telemetry:data
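
The same operations can be issued from application code; a short redis-py equivalent of the CLI demo is sketched below (key and channel names mirror the demo).

import redis

r = redis.Redis(host="localhost", port=6379)

# Insert and read a scored entry, mirroring the zadd/zrangebyscore demo
r.zadd("table:data", {"{data:'100'}": 100})
print(r.zrangebyscore("table:data", 100, 200))

# Publish a sync event so every socket server behind the load balancer is notified
r.publish("telemetry:data", "{data:'100'}")

# A consumer subscribes to the same channel to receive the event
pubsub = r.pubsub()
pubsub.subscribe("telemetry:data")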

Live Dashboards: Load Balanced Architecture

Summary: Boosting Your App Performance

When your application starts to slow down due to heavy read and writes on your database, consider moving the read operations to a cache solution and broadcasting the data to your application via a real-time integration using Web Sockets. This approach can significantly enhance performance and user experience.

Key Benefits

  • Improved Performance: Offloading reads to a cache system like Redis reduces load on the database.
  • Real-Time Updates: Using Web Sockets ensures that your application receives updates in real-time, with no need for manual refreshes.
  • Scalability: By reducing the database load, your application can handle more concurrent users.
  • Efficient Resource Utilization: Leveraging caching and real-time technologies optimizes the use of server resources, leading to savings and better performance.

Live Dashboards: Load Balanced Architecture

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia

Thanks for reading.

Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com

8/21/24

Medallion Architecture: A Blueprint for Data Insights and Governance - Data Engineering Process Fundamentals

Overview

Gain understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Data Engineering Process Fundamentals - Medallion Architecture

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

  • Introduction to Medallion Architecture

    • Defining Medallion Architecture
    • Core Principles
    • Benefits of Medallion Architecture
  • The Raw Zone

    • Understanding the purpose of the Raw Zone
    • Best practices for data ingestion and storage
  • The Bronze Zone

    • Data transformation and cleansing
    • Creating a foundation for analysis
  • The Silver Zone

    • Data optimization and summarization
    • Preparing data for consumption
  • The Gold Zone

    • Curated data for insights and action
    • Enabling self-service analytics
  • Empowering Insights

    • Data-driven decision-making
    • Accelerated Insights
  • Data Governance

    • Importance of data governance in Medallion Architecture
    • Implementing data ownership and stewardship
    • Ensuring data quality and security

Why Attend:

Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.

  • Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
  • Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
  • Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
  • Scalability: The layered approach can accommodate growing data volumes and complexity.
  • Cost Efficiency: Optimized data storage and processing can reduce costs.

Data Engineering Process Fundamentals - Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources (a minimal ingestion sketch follows below).

  • Key Characteristics:
    • Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
    • Data is ingested as-is, without any cleaning or transformation
    • High volume and velocity
    • Data retention policies are crucial
  • Benefits:
    • Preserves original data for potential future analysis
    • Enables data reprocessing
    • Supports data lineage and auditability

Data Engineering Process Fundamentals - Medallion Architecture Raw Zone Diagram
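
As a minimal sketch of raw-zone ingestion, the snippet below downloads a weekly MTA turnstile file and lands it unmodified in a date-partitioned path; the local lake path is an assumption for illustration.

from pathlib import Path
import requests

# Download the weekly turnstile file and land it as-is in the raw zone
url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_241026.txt"
raw_zone = Path("data-lake/raw/mta/turnstile/2024/10/26")  # hypothetical lake path
raw_zone.mkdir(parents=True, exist_ok=True)

response = requests.get(url)
response.raise_for_status()

# No cleaning or transformation: the raw zone preserves the original file
(raw_zone / "turnstile_241026.txt").write_bytes(response.content)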

Use case Background

The Metropolitan Transportation Authority (MTA) subway system in New York has stations around the city. All the stations are equipped with turnstiles or gates which track each person as they enter (departure) or exit (arrival) the station.

  • The MTA subway system has stations around the city.
  • All the stations are equipped with turnstiles or gates which track each person as they enter or leave the station.
  • CSV files provide information about the number of commuters per station at different time slots.

Data Engineering Process Fundamentals - Data streaming MTA Gates

Problem Statement

In the city of New York, commuters use the Metropolitan Transportation Authority (MTA) subway system for transportation. There are millions of people that use this system every day; therefore, businesses around the subway stations would like to be able to use Geofencing advertisement to target those commuters or possible consumers and attract them to their business locations at peak hours of the day.

  • Geofencing is a location-based technology service in which a mobile device's electronic signal is tracked as it enters or leaves a virtual boundary (geo-fence) around a geographical location.
  • Businesses around those locations would like to use this technology to increase their sales by pushing ads to potential customers at specific times.

ozkary-data-engineering-mta-geo-fence

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone (a minimal transformation sketch follows below).

  • Key Characteristics:
    • Data is cleansed and standardized
    • Basic transformations are applied (e.g., data type conversions, null handling)
    • Data is structured into tables or views
    • Data quality checks are implemented
    • Data retention policies may be shorter than the Raw Zone
  • Benefits:
    • Improves data quality and consistency
    • Provides a foundation for further analysis
    • Enables data exploration and discovery

Data Engineering Process Fundamentals - Medallion Architecture Bronze Zone Diagram
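
A minimal sketch of a bronze-step transformation over the raw turnstile file is shown below: it applies the same cleansing used in the discovery notebook (trimmed column names, type casting, null handling) and writes a structured Parquet output; the paths are assumptions, and to_parquet requires the pyarrow package.

from pathlib import Path
import pandas as pd

raw_file = Path("data-lake/raw/mta/turnstile/2024/10/26/turnstile_241026.txt")
bronze_zone = Path("data-lake/bronze/mta/turnstile")  # hypothetical path
bronze_zone.mkdir(parents=True, exist_ok=True)

# Basic cleansing and standardization: trim column names, cast measures, drop nulls
df = pd.read_csv(raw_file)
df.columns = [column.strip() for column in df.columns]
df["ENTRIES"] = df["ENTRIES"].astype(int)
df["EXITS"] = df["EXITS"].astype(int)
df = df.dropna()

# Persist as a structured, columnar file for the next zone
df.to_parquet(bronze_zone / "turnstile_241026.parquet", index=False)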

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

  • Key Characteristics:
    • Data is cleansed, standardized, and enriched
    • Data is structured for analytical purposes (e.g., normalized, de-normalized)
    • Data is optimized for query performance (e.g., partitioning, indexing)
    • Data is aggregated and summarized for specific use cases
  • Benefits:
    • Improved query performance
    • Supports self-service analytics
    • Enables advanced analytics and machine learning
    • Reduces query costs

Data Engineering Process Fundamentals - Medallion Architecture Silver Zone Diagram

The Gold Zone: Your Data's Final Destination

  • Definition: The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
  • Key Characteristics:
    • Data is highly refined, aggregated, and optimized for specific use cases
    • Data is often materialized for performance
    • Data is subject to rigorous quality checks and validation
    • Data is secured and governed
  • Benefits:
    • Enables rapid insights and decision-making
    • Supports self-service analytics and reporting
    • Provides a foundation for advanced analytics and machine learning
    • Reduces query latency

Data Engineering Process Fundamentals - Medallion Architecture Gold Zone Diagram

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

  • Key Characteristics:
    • Data is accessible and easily consumable
    • Supports various analytical tools and platforms (BI, ML, data science)
    • Enables self-service analytics
    • Drives business decisions and actions
  • Examples of Consumption Tools:
    • Business Intelligence (BI) tools (Looker, Tableau, Power BI)
    • Data science platforms (Python, R, SQL)
    • Machine learning platforms (TensorFlow, PyTorch)
    • Advanced analytics tools

Data Engineering Process Fundamentals - Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. Data Governance is essential for ensuring data quality, consistency, and security.

Key components of data governance include:

  • Data Lineage: Tracking data's journey from source to consumption.
  • Data Ownership: Defining who is responsible for data accuracy and usage.
  • Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
  • Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.

Data Engineering Process Fundamentals - Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.

Data Engineering Process Fundamentals - Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Data Governance: Metadata

Assigns the owner, steward, and responsibilities for the data.

Data Engineering Process Fundamentals - Medallion Architecture Governance Metadata

Summary: Leverage Medallion Architecture for Success

  • Key Benefits:
    • Improved data quality
    • Enhanced governance
    • Accelerated insights
    • Scalability
    • Cost efficiency

Data Engineering Process Fundamentals - Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia

Thanks for reading.

Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com

7/24/24

Building Real-Time Data Pipelines: A Practical Guide - Data Engineering Process Fundamentals

Overview

In modern data engineering solutions, handling streaming data is very important. Businesses often need real-time insights to promptly monitor and respond to operational changes and performance trends. A data streaming pipeline facilitates the integration of real-time data into data warehouses and visualization dashboards.

Data Engineering Process Fundamentals - Building Real-Time Data Pipelines: A Practical Guide

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

  1. What is Data Streaming?

    • Understanding the concept of continuous data flow.

    • Real-time vs. batch processing.

    • Benefits and use cases of data streaming.

  2. Data Streaming Channels

    • APIs (Application Programming Interfaces)

    • Events (system-generated signals)

    • Webhooks (HTTP callbacks triggered by events)

  3. Data Streaming Components

    • Message Broker (Apache Kafka)

    • Producers and consumers

    • Topics for data categorization

    • Stream Processing Engine (Apache Spark Structured Streaming)

  4. Solution Design and Architecture

    • Real-time data source integration

    • Leveraging Kafka for reliable message delivery

    • Spark Structured Streaming for real-time processing

    • Writing processed data to the data lake

  5. Q&A Session

    • Get your questions answered by the presenters.

Why Join This Session?

  • Stay Ahead of the Curve: Gain a comprehensive understanding of data streaming, a crucial aspect of modern data engineering.
  • Unlock Real-Time Insights: Learn how to leverage data streaming for immediate processing and analysis, enabling faster decision-making.
  • Learn Kafka and Spark: Explore the power of Apache Kafka as a message broker and Apache Spark Structured Streaming for real-time data processing.
  • Build a Robust Data Lake: Discover how to integrate real-time data into your data lake for a unified data repository.

Presentation

Introduction - What is Data Streaming?

Data streaming enables us to build data integration in real-time. Unlike traditional batch processing, where data is collected and processed periodically, streaming data arrives continuously, and it is processed on-the-fly.

  • Understanding the concept of continuous data flow
    • Real-time, uninterrupted transfer of data from various channels.
    • Allows for immediate processing and analysis of data as it is generated.
  • Real-time vs. batch processing
    • Data is collected and processed in chunks at scheduled times
    • Processing can take hours or even days depending on the source
  • Benefits and use cases of data streaming
    • React instantly to events
    • Predict trends with real-time updates
    • Update dashboards with up-to-the-minute (or second) data

Data Engineering Process Fundamentals - What is data streaming

Data Streaming Channels

Data streams can arrive from various channels, often hosted on HTTP endpoints. The specific channel technology depends on the provider. Generally, the integration involves either a push or a pull connection.

  • Events (Push Model): These can be delivered using a subscription model like Pub/Sub, where your system subscribes to relevant topics and receives data "pushed" to it whenever events occur. Examples include user clicks, sensor readings, or train arrivals.

  • Webhooks (Push-Based on Events): These are HTTP callbacks triggered by specific events on external platforms. You set up endpoints that listen for these notifications to capture the data stream (a minimal receiver is sketched after this list).

  • APIs (Pull Model): Application Programming Interfaces are used to actively fetch data from external services, like social media platforms. Scheduled calls are made to the API at specific intervals to retrieve the data.
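
As a small illustration of the webhook (push) channel, the sketch below exposes an HTTP callback endpoint with Flask that an external platform could call when an event occurs; the route name and port are assumptions.

from flask import Flask, request

app = Flask(__name__)

# Webhook: the external platform pushes an HTTP POST to this endpoint on each event
@app.route("/webhooks/telemetry", methods=["POST"])
def telemetry_webhook():
    event = request.get_json()
    print("event received:", event)  # hand the event off to the streaming pipeline here
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)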

Data Engineering Process Fundamentals - Data streaming channels

Data Streaming System

Powering real-time data pipelines, Apache Kafka efficiently ingests data streams, while Apache Spark analyzes and transforms it, enabling large-scale insights.

Apache Kafka:

Apache Kafka: The heart of the data stream. It's a high-performance platform that acts as a message broker, reliably ingesting data (events) from various sources like applications, sensors, and webhooks. These events are published to categorized channels (topics) within Kafka for further processing.

Spark Structured Streaming:

Built on Spark, it processes Kafka data streams in real-time. Unlike simple ingestion, it allows for transformations, filtering, and aggregations on the fly, enabling real-time analysis of streaming data.

Data Engineering Process Fundamentals - Data streaming Systems

Data Streaming Components

Apache Kafka acts as the central message broker, facilitating real-time data flow. Producers, like applications or sensors, publish data (events) to categorized channels (topics) within Kafka. Spark then subscribes as a consumer, continuously ingesting and processing these data streams in real-time (a minimal producer and consumer sketch follows the list below).

  • Message Broker (Kafka): Routes real-time data streams.
  • Producers & Consumers: Producers send data to topics, Consumers receive and process it.
  • Topics (Categories): Organize data streams by category.
  • Stream Processing Engine (Spark Structured Streaming):
    • Reads data from Kafka.
    • Extracts information.
    • Transforms & summarizes data (aggregations).
    • Writes to a data lake.
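
A minimal sketch of a producer and a consumer using the kafka-python client is shown below; the broker address and topic name are assumptions for illustration.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a turnstile event to the topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("mta-turnstile", {"STATION": "Test-Station", "ENTRIES": 140, "EXITS": 153})
producer.flush()

# Consumer: subscribe to the same topic and read events as they arrive
consumer = KafkaConsumer(
    "mta-turnstile",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print("event:", message.value)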

Data Engineering Process Fundamentals - Data streaming Components

Use case Background

The Metropolitan Transportation Authority (MTA) subway system in New York has stations around the city. All the stations are equipped with turnstiles or gates which track each person as they enter (departure) or exit (arrival) the station.

  • The MTA subway system has stations around the city.
  • All the stations are equipped with turnstiles or gates which track each person as they enter or leave the station.
  • CSV files provide information about the number of commuters per station at different time slots.

Data Engineering Process Fundamentals - Data streaming MTA Gates

Data Specifications

Since we already have a data transformation layer that incrementally updates the data warehouse, our real-time integration will focus on leveraging this existing pipeline. We'll achieve this by aggregating data from the stream and writing the results directly to the data lake.

  • Group by these categorical fields: "AC", "UNIT", "SCP", "STATION", "LINENAME", "DIVISION", "DATE", "DESC"
  • Aggregate these measures: "ENTRIES", "EXITS"
  • Sample result: "A001,R001,02-00-00,Test-Station,456NQR,BMT,09-23-23,REGULAR,16:54:00,140,153"

# PySpark type imports required by the schema definition
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema for the incoming data
turnstiles_schema = StructType([
    StructField("AC", StringType()),
    StructField("UNIT", StringType()),
    StructField("SCP", StringType()),
    StructField("STATION", StringType()),
    StructField("LINENAME", StringType()),
    StructField("DIVISION", StringType()),
    StructField("DATE", StringType()),
    StructField("TIME", StringType()),
    StructField("DESC", StringType()),
    StructField("ENTRIES", IntegerType()),
    StructField("EXITS", IntegerType()),
    StructField("ID", StringType()),
    StructField("TIMESTAMP", StringType())
])
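
Building on the schema above, a minimal sketch of a Spark Structured Streaming consumer is shown below: it reads the Kafka topic, parses the JSON payload, aggregates the measures by the categorical fields, and appends the results to the data lake; the broker, topic, and paths are assumptions, and the job needs the spark-sql-kafka package on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, sum as spark_sum

spark = SparkSession.builder.appName("mta-turnstile-stream").getOrCreate()

# Read the stream from the Kafka topic
raw_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "mta-turnstile")
    .load()
)

# Parse the JSON payload using the schema defined above
events = raw_stream.select(
    from_json(col("value").cast("string"), turnstiles_schema).alias("data")
).select("data.*")

# Aggregate the measures by the categorical fields
aggregated = events.groupBy(
    "AC", "UNIT", "SCP", "STATION", "LINENAME", "DIVISION", "DATE", "DESC"
).agg(spark_sum("ENTRIES").alias("ENTRIES"), spark_sum("EXITS").alias("EXITS"))

# Append each micro-batch of aggregated results to the data lake
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append").parquet("data-lake/bronze/turnstile")  # hypothetical path

query = (
    aggregated.writeStream.outputMode("update")
    .foreachBatch(write_batch)
    .option("checkpointLocation", "data-lake/checkpoints/turnstile")
    .start()
)
query.awaitTermination()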

Solution Architecture for Real-time Data Integration

Data streams are captured by the Kafka producer and sent to Kafka topics. The Spark-based stream consumer retrieves and processes the data in real-time, aggregating it for storage in the data lake.

Components:

  • Real-Time Data Source: Continuously emits data streams (events or messages).
  • Message Broker Layer:
    • Kafka Broker Instance: Acts as a central hub, efficiently collecting and organizing data into topics.
    • Kafka Producer (Python): Bridges the gap between the source and Kafka.
  • Stream Processing Layer:
    • Spark Instance: Processes and transforms data in real-time using Apache Spark.
    • Stream Consumer (Python): Consumes messages from Kafka and acts as both a Kafka consumer and Spark application:
      • Retrieves data as soon as it arrives.
      • Processes and aggregates data.
      • Saves results to a data lake.
  • Data Storage: Stores the transformed data for visualization tools (Looker, Power BI) to access.
  • Docker Containers: Containers are used for deployments

Data Engineering Process Fundamentals - Data streaming MTA Gates

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.

Data Engineering Process Fundamentals - Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Impact on Data Visualization

  • Our architecture efficiently processes real-time data by leveraging our existing data transformation layer.
  • This optimized flow enables significantly faster data visualization.
  • The dashboard refresh frequency can be increased to load the new data sooner.

For real-time updates directly on the dashboard, a socket-based integration would be necessary.

Data Engineering Process Fundamentals - Data transformation lineage

Key Takeaways: Real-Time Integration

Data streaming solutions are an absolute necessity, enabling the rapid processing and analysis of vast amounts of real-time data. Technologies like Kafka and Spark play a pivotal role in empowering organizations to harness real-time insights from their data streams.

  • Real-time Power: Kafka handles various data streams, feeding them to data topics.
  • Spark Processing Power: Spark reads from these topics, analyzes messages in real-time, and aggregates the data to our specifications.
  • Existing Pipeline Integration: Leverages existing pipelines to write data to data lakes for transformation.
  • Faster Insights: Delivers near real-time information for quicker data analysis and visualization.

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia

Thanks for reading.

Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com

6/5/24

May the Tech Force Be With You: Unlock Your Career Journey in Technology

Overview

Curious about the possibilities and where your passion fits in the ever-evolving world of technology? Join us as we decode your unique technical journey! This presentation is designed to equip you with the knowledge and confidence to navigate your path in the exciting world of technology.

Careers in Technology - Unlock Your Journey in Technology

YouTube Video

Video Agenda

  • What's Next?:

    • Understanding the Technical Landscape.
    • Continuous Learning.
    • Exploring Industry Trends and Job Market.
  • Explore Your Passion: Diverse Areas of Specialization:

    • Showcase different areas of CS specialization (e.g., web development, data science, artificial intelligence, cybersecurity).
  • Building Blocks of Tech: Programming Languages:

    • Showcase and explain some popular programming languages used in different areas.
  • Beyond Coding: Programming vs. Non-Programming Roles:

    • Debunk the myth that all CS careers involve coding.
    • Introduce non-programming roles in tech.
  • Code-Centric vs. Low-Code/No-Code Development:

    • Explain the concept of code-centric and low-code/no-code development approaches.
    • Discuss the advantages and disadvantages of each approach.
  • The Future is Bright:

    • Discuss emerging technologies like AI, cloud computing, and automation, and their impact on the future of CS careers.
    • Emphasize the importance of continuous learning and adaptability in this ever-changing landscape.

Why Attend?

  • In-demand skills: Discover the technical and soft skills sought after by employers in today's tech industry.
  • Matching your passion with a career: Explore diverse areas of specialization and identify the one that aligns with your interests and strengths.
  • Career paths beyond coding: Uncover a range of opportunities in tech, whether you're a coding whiz or have a different area of expertise.
  • Future-proofing your career: Gain knowledge of emerging technologies and how they'll shape the future of computer science.

By attending, you'll leave equipped with the knowledge and confidence to make informed decisions about your future in the ever-evolving world of technology.

Presentation

What's Next for Your Tech Career?

Feeling overwhelmed by the possibilities after graduation? You're not alone! Learning never ends; here are some technical foundation (hard skills) areas to consider as you embark on a tech career.

  • Understanding the Technical Landscape
    • Stay Informed: Keep up with the latest trends and advancements in technology
    • Broaden Your Horizons: Look beyond your core area of study. Explore other fields
  • Continuous Learning and Skill Development
    • Adapt and Evolve: The tech industry is constantly changing
    • Technical Skills: Focus on in-demand skills such as Cloud Computing, Cybersecurity, and Data Science

Careers in Technology - Technical Foundation with GitHub

Technical skills are crucial, but success in the tech industry also hinges on strong soft skills. These skills are essential for success in today's collaborative tech environment:

Networking and Professional Growth:

  • Build Your Tech Network: Connect and collaborate with online and offline tech communities.
  • Invest in Your Soft Skills: Enhance your communication, teamwork, and problem-solving skills.
  • Find Your Tech Mentor: Seek guidance and support from experienced professionals.

Careers in Technology - Technical Careers Networking

The tech industry is bursting with opportunities. To navigate this exciting landscape and land your dream job, consider these key areas to craft your career roadmap and take a chance:

  • Work style Preferences:

    • Remote vs. Relocation: Do you thrive in a remote work environment, or are you open to relocating for exciting opportunities?
    • Big Companies vs. Startups: Compare the established structure and resources of large companies with the fast-paced, dynamic culture of startups.
  • Explore an Industry Specialization:

    • Healthcare: Revolutionize patient care by contributing to advancements in medical technology and data analysis.
    • Manufacturing: Fuel innovation by optimizing production processes and integrating automation through industrial tech.

Careers in Technology - Industry Specializations

Diverse Areas of Specialization

Do you like creating websites? Web development might be your calling. Do you dream of building mobile apps? Mobile development could be your fit. Are you intrigued by the power of data and its ability to unlock valuable insights? Data science might be your ideal path.

  • Web Development: Build user interfaces and functionalities for websites and web applications.
  • Mobile Development: Create applications specifically designed for smartphones and tablets.
  • Data Engineering: Build complex data pipelines and data storage solutions.
  • Data Analyst: Process data, discover insights, create visualizations
  • Data Science: Analyze large datasets to extract valuable insights and inform decision-making.
  • Artificial Intelligence: Develop intelligent systems that can learn and make decisions.
  • Cloud Engineering: Design, build, and manage applications and data in the cloud.
  • Cybersecurity: Protect computer systems and networks from digital threats
  • Game Development: Create video games and AR experiences

Careers in Technology - Specialized Domains

Building Blocks of Tech: Programming Languages

The world of software development hinges on a powerful tool - programming languages. These languages, with their unique syntax and functionalities, have advantages for certain platforms like the web, data, and mobile.

  • Versatile Languages:

    • JavaScript (JS): The king of web development, also used for building interactive interfaces and mobile apps (React Native).
    • Python: A beginner-friendly language, popular for data science, machine learning, web development (Django), and automation.
    • Java: An industry standard, widely used for enterprise applications, web development (Spring), and mobile development (Android); a high-level language.
    • C#: A powerful language favored for game development (Unity), web development (ASP.NET), and enterprise applications.
    • SQL: A powerful language essential for interacting with relational databases, widely used in web development, data analysis, and business intelligence.
  • Specialized Languages:

    • PHP: Primarily used for server-side scripting and web development (WordPress).
    • C++: A high-performance language for system programming, game development, and scientific computing; closer to low-level programming.
  • Mobile-Centric Languages:

    • Swift: The go-to language for native iOS app development.
    • Objective-C: The predecessor to Swift, still used in some legacy iOS apps.
  • JavaScript Extensions:

    • TypeScript: A superset of JavaScript, adding optional static typing for larger web applications.

Careers in Technology - Programming Languages

Beyond Coding: Programming vs. Non-Programming Roles

Programming roles involve writing code to create apps and systems. Non-programming tech roles, like project managers, QA, UX designers, and technical writers, use their skills to guide the development process, design user experiences, and document technical information.

  • Programming Roles: Developers, software engineers, data engineers
  • Non-Programming Roles: Project managers, systems analysts, user experience (UX) designers, QA, DevOps, technical writers.

The industry continues to define new specialized roles.

Careers in Technology -  Programming vs Non-Programming Roles

Empowering Everyone: Code-Centric vs. Low-Code/No-Code Development

Do you enjoy diving into the code itself using tools like Visual Studio Code? Or perhaps you prefer a more visual approach, leveraging designer tools and writing code snippets when needed?

  • Code-Centric Development:

    • Traditional approach where developers write code from scratch using programming languages like Python, C#, or C++.
    • Offers maximum flexibility and control over the application's functionality and performance.
    • Requires strong programming skills and a deep understanding of software development principles.

Careers in Technology - Code vs No-Code VSCode

  • Low-Code/No-Code Development:
    • User-friendly platforms that enable rapid application development with minimal coding or no coding required.
    • Utilize drag-and-drop interfaces, pre-built components, and templates to streamline the development process.
    • Ideal for building simple applications, automating workflows, or creating prototypes.

Careers in Technology - Code vs No-Code Visualization with Looker

Evolving with Technology

The landscape of software development is constantly transforming, with new technologies like AI, low-code/no-code platforms, automation, and cloud engineering emerging. Keep evolving!

  • AI as a Co-Pilot: AI won't replace programmers; it will become a powerful collaborator. Imagine AI tools that:

    • Generate code snippets based on your requirements.
    • Refactor and debug code for efficiency and security.
    • Automate repetitive tasks, freeing you for more creative problem-solving.
  • Low-Code/No-Code Democratization: These platforms will empower citizen developers to build basic applications, streamlining workflows. Programmers will focus on complex functionalities and integrating these solutions.

  • Automation Revolution: Repetitive coding tasks will be automated, allowing programmers to focus on higher-level logic, system design, and innovation.

  • Cloud Engineering Boom: The rise of cloud platforms will create a demand for skilled cloud engineers who can design, build, and manage scalable applications in the cloud.

Careers in Technology - Evolving with technology copilots

Final Thoughts: Your Future in Tech Awaits

The tech world is yours to explore! Keep learning, join a community, choose your path in tech and industry, and build your roadmap. Find a balance between your professional pursuits and personal well-being.

Thanks for reading.

Send questions or comments on Twitter @ozkary 👍 Originally published by ozkary.com