1/21/26

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment

Overview

In the modern data landscape, the wall between "where data lives" and "how we get insights" is crumbling. This session focuses on the Cognitive Data Lakehouse, a paradigm shift that lets developers treat a fragmented data lake as a unified, high-performance warehouse.

We will explore how to move beyond brittle ETL pipelines using Zero-ETL architecture in the cloud. The core of our discussion will center on using integrated AI capabilities and semantic modeling to solve the "Metadata Mess" inherent in global manufacturing feeds without moving a single byte of data. From raw telemetry in object storage to semantic intelligence via large language models, we’ll show you the real-world application of AI in modern data engineering.

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment

Explore these curated resources to level up your engineering skills. If you find them helpful, a ⭐️ is much appreciated!

🏗️ Data Engineering

Focus: Real-world ETL & MTA Turnstile Data

🤖 Artificial Intelligence

Focus: LLM Patterns and Agentic Workflows

📉 Machine Learning

Focus: MLOps and Productionizing Models


💡 Contribute: Found a bug or have a suggestion? Open an issue and be part of the open-source project!

YouTube Video

Video Agenda

Phase 1: Foundations & The Zero-ETL Strategy

We kick off with the infrastructure layer. We'll discuss the design of cross-region telemetry tables and how modern cloud engines allow us to query raw files in object storage with the performance of a native table. We’ll establish why zero data movement is the goal for modern scalability.

Phase 2: Confronting the Metadata Mess

Schema drift and inconsistent naming across global regions are the enemies of unified analytics. We will look at why traditional manual mapping fails and how we can use AI inference to bridge these gaps and standardize naming conventions automatically.

Phase 3: AI-Driven Unification & Semantic Modeling

The "Cognitive" part of the Lakehouse. We’ll dive into the technical implementation of registering AI models directly within your data warehouse environment. You'll see how to create an abstraction layer that uses AI to normalize data on the fly, creating a robust semantic model.

Phase 4: Scaling to a Global Feed

Finally, we’ll demonstrate the DevOps workflow for integrating a new international factory feed into a global telemetry view. We'll show how to maintain a "Single Source of Intelligence" that BI tools and analysts can consume without needing to know the complexities of the underlying lake.

💡 Why Attend?

  • Master Modern Architecture: Learn the "Abstraction Layer" design pattern that is replacing traditional, slow ETL/ELT processes.
  • Hands-on AI for Data Ops: See exactly how to use AI and semantic modeling within SQL-based workflows to automate data cleaning and schema mapping.
  • Scale Without Pain: Discover how to manage global data sources (multi-region, multi-format) through a single governing layer.
  • Developer Networking: Connect with other data architects, engineering leaders, and professionals solving similar scale and complexity challenges.

Target Audience: Data Engineers, Analytics Architects, Cloud Developers, and anyone interested in the intersection of Big Data and Generative AI.

Presentation

Phase 1: The Zero-ETL Strategy

INFRASTRUCTURE: DATA STAYS LOCAL

Architecting for Scale

  • Storage Decoupling: Raw files remain in the Data Lake, eliminating replication overhead.
  • Virtual Access: Data Warehouse external tables allow immediate querying of CSV, Parquet, and JSON.
  • Minimal Latency: No waiting for ingest pipelines; analysis starts upon file arrival (see the query sketch below).
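
As a minimal sketch of this virtual access pattern (using the us_telemetry external table defined later in this session), a query over raw lake files looks identical to a query over a native table:

-- Query raw CSV files in GCS through the external table, with no ingestion step
SELECT factory, COUNT(*) AS reading_count
FROM `smart_factory.us_telemetry`
GROUP BY factory;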

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment -  Medallion Architecture Design Diagram

UNMATCHED STORAGE EFFICIENCY

Zero Data Replication

  • Traditional ETL requires moving data across multiple tiers. Our architecture ensures a single source of truth with zero data movement between GCS and BigQuery compute.
  • This is similar to the Bronze Zone in a Medallion Architecture.

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment -  Medallion Architecture Design Diagram

Phase 2: The Metadata Mess

CHALLENGES OF UNIFICATION

Schema Friction

  • Feeds arrive with inconsistent headers (e.g., 'Device Number' vs 'deviceNo'). Manual aliasing is fragile and slow.

Entity Drift

  • Names and IDs vary across systems, preventing standard joins from matching records effectively.

Type Mismatches

  • Varying data types for the same concept (Integer vs String) break standard SQL aggregation views; see the sketch below.
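
A minimal sketch of the fix, assuming the mismatched column is bay_id (column names are illustrative): cast both feeds to a common type before the union so aggregation views stop breaking.

-- Align mismatched types before unifying two regional feeds
SELECT device_number, SAFE_CAST(bay_id AS INT64) AS bay_id, factory
FROM `smart_factory.us_telemetry`
UNION ALL
SELECT device_number, SAFE_CAST(bay_id AS INT64) AS bay_id, factory
FROM `smart_factory.mx_telemetry`;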

Phase 3: The AI Solution

BIGQUERY STUDIO: THE AI INTERFACE

Remote AI Registration

  • Register Gemini Pro directly inside BigQuery to enable cognitive functions within your SQL workspace.
CREATE MODEL `gemini_remote`
REMOTE WITH CONNECTION `bq_connection`
OPTIONS(endpoint = 'gemini-1.5-pro');

Automated Inference

  • AI "reads" information schemas to infer mapping logic, moving you from Code Author to Logic Approver.
SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Compare Source A and B schemas. Write a SQL view to unify them." AS prompt)
);

AI-ASSISTED SCHEMA DISCOVERY

Prompting for Base Tables

  • Using AI to generate the DDL for external tables by pointing to compressed feeds in the lake (USA & MEX factories).
SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Create External Tables as smart_factory.us_telemetry with path 'gs://factory-dl/us/dev-540/telemetry-*.csv.gz' '. Include option CSV, GZIP compression and skip 1 row. Infer and add the schema using lower case" AS prompt));

SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Create External Tables as smart_factory.mx_telemetry with path 'gs://factory-dl/mx/dev-940/telemetry-*.csv.gz' '. Include option CSV, GZIP compression and skip 1 row. Use schema device_number STRING, bay_id INT64, factory STRING, created STRING" AS prompt));

Generated BigLake DDL

-- USA Factory Feed
CREATE OR REPLACE EXTERNAL TABLE `smart_factory.us_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/us/dev-540/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'
);

-- MEX Factory Feed
CREATE OR REPLACE EXTERNAL TABLE `smart_factory.mx_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/mx/dev-940/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'
);

AI-ABSTRACTION: THE VIEW LAYER

Generating the Interface

  • AI creates a clean abstraction view for each external table, decoupling raw storage from the analytics model.
    -- AI Instruction
    "Create a view named smart_factory.vw_us_telemetry selecting all columns from the us_telemetry table. Safe cast the created column as datetime."
    

Abstraction Layer DDL

-- Semantic Abstraction Layer
CREATE OR REPLACE VIEW `smart_factory.vw_us_telemetry` AS
SELECT 
  device_number,
  bay_id,
  factory,
  SAFE_CAST(created as DATETIME) AS created
FROM `smart_factory.us_telemetry`;

COGNITIVE UNIFICATION

The Multi-Region Model

  • The unified view now consumes from the abstraction layer, ensuring that changes to raw storage don't break the downstream views.
-- AI Instruction
"Create a view with name smart_factory.vw_telemetry that creates a union of all the fields from the views vw_[region]_telemetry. The regions include us and mx. List out all the field names. Never use * for field names"

Unified Global View

-- Semantic Abstraction Layer
CREATE OR REPLACE VIEW `smart_factory.vw_telemetry` AS
SELECT 
  device_number,
  bay_id,
  factory,
  created
FROM `smart_factory.vw_us_telemetry`
UNION ALL
SELECT 
  device_number,
  bay_id,
  factory,
  created
FROM `smart_factory.vw_mx_telemetry`

SCALING TO CHINA FACTORY

Evolving the Model

  • Adding the new China feed by generating the External Table definition via AI.
CREATE OR REPLACE EXTERNAL TABLE `smart_factory.cn_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/cn/dev-900/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'
);

Human-in-the-Loop DevOps

  • Use AI to update the unified view with the new data feed; the DevOps team reviews and applies the change, since changes to a production view require approval. A sketch of the updated view follows below.
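
A minimal sketch of what the reviewed change might look like, assuming a vw_cn_telemetry abstraction view is created following the same pattern as the other regions:

-- Proposed update for DevOps review: extend the unified view with the China region
CREATE OR REPLACE VIEW `smart_factory.vw_telemetry` AS
SELECT device_number, bay_id, factory, created FROM `smart_factory.vw_us_telemetry`
UNION ALL
SELECT device_number, bay_id, factory, created FROM `smart_factory.vw_mx_telemetry`
UNION ALL
SELECT device_number, bay_id, factory, created FROM `smart_factory.vw_cn_telemetry`;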

Manufacturing SPC & Root Cause Analysis

  • This query calculates a rolling mean and standard deviation over a 21-reading window per machine (within the last hour of telemetry) to detect anomalies and flag “Out of Control” conditions.
WITH TelemetryStats AS (
  SELECT
    machine_id,
    timestamp,
    sensor_reading,
    -- Calculate rolling stats for the "Control Chart"
    AVG(sensor_reading) OVER(PARTITION BY machine_id ORDER BY timestamp ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) as rolling_avg,
    STDDEV(sensor_reading) OVER(PARTITION BY machine_id ORDER BY timestamp ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) as rolling_stddev
  FROM `production_data.mx_telemetry_stream`
  WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
),
Anomalies AS (
  SELECT *,
    -- Define "Out of Control" (Reading > 3 Sigma from mean)
    ABS(sensor_reading - rolling_avg) > (3 * rolling_stddev) AS is_out_of_control
  FROM TelemetryStats
)
SELECT * FROM Anomalies WHERE is_out_of_control = TRUE;

Control Chart Visualization

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment - Control Charts

ADVANTAGE COMPARISON MATRIX

Manual Data Engineering vs. AI-Augmented Zero-ETL:

  • Unification Speed: Days/Weeks per source vs. Minutes via Generative AI
  • Schema Drift: Manual script rewrites vs. Adaptive AI view discovery
  • Infrastructure Cost: High (data redundancy) vs. Minimal (in-place on GCS)

Strategic Intelligence ROI:

ROI(ai) = Insights Velocity / (Movement Cost + Labor Hours)

FINAL THOUGHTS: STRATEGIC SUMMARY

Legacy Challenges

  • Brittle ETL: Manual pipelines break with every schema change.
  • Cost Inefficiency: Redundant storage for processed data.
  • Semantic Silos: Hard-coded aliases for disparate naming conventions.
  • Slow Time-to-Insight: Weeks spent on manual schema alignment.

AI-Assisted Solutions

  • Zero-ETL Arch: Cost-effective storage with Data Lake virtual access.
  • Automated Inference: Vertex AI handles the "heavy lifting" of mapping.
  • Adaptive DevOps: Scalable model evolution (USA → MEX → China).
  • Unified Intelligence: One virtual source of truth for global analytics.

Moving from Data Reporting to Active Semantic Intelligence.

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia


📅 Upcoming Sessions

Our upcoming series expands beyond data engineering to bridge the gap between AI, Machine Learning, and modern cloud architecture. Using our Data, AI, and ML GitHub blueprints, we provide the code-first patterns needed to build everything from Zero-ETL pipelines to scalable LLM-powered systems. Join us to explore how these integrated disciplines work together to turn raw data into production-ready intelligence.


🌟 Let's Connect & Build Together

If you enjoyed these resources, let's stay in touch! I share deep-dives into AI/ML patterns and host community events here:

  • GDG Broward: Join our local dev community for meetups and workshops.
  • LinkedIn: Let's connect professionally! I share insights on engineering.
  • GitHub: Follow my open-source journey and star the repos you find useful.
  • YouTube: Watch step-by-step tutorials on the projects listed above.
  • BlueSky / X / Twitter: Daily tech updates and quick engineering tips.

👉 Originally published at ozkary.com

12/10/25

From Raw Data to Governance: Refining Data with the Medallion Architecture Dec 2025

Overview

Build upon your existing data engineering expertise and discover how Medallion Architecture can transform your data strategy. This session provides a hands-on approach to implementing Medallion principles, empowering you to create a robust, scalable, and governed data platform.

We'll explore how to align data engineering processes with Medallion Architecture, identifying opportunities for optimization and improvement. By understanding the core principles and practical implementation steps, you'll learn how to optimize data pipelines, enhance data quality, and unlock valuable insights through a structured, layered approach to drive business success.

From Raw Data to Governance: Refining Data with the Medallion Architecture

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video - Dec 2025

Video Agenda

  • Introduction to Medallion Architecture

    • Defining Medallion Architecture
    • Core Principles
    • Benefits of Medallion Architecture
  • The Raw Zone

    • Understanding the purpose of the Raw Zone
    • Best practices for data ingestion and storage
  • The Bronze Zone

    • Data transformation and cleansing
    • Creating a foundation for analysis
  • The Silver Zone

    • Data optimization and summarization
    • Preparing data for consumption
  • The Gold Zone

    • Curated data for insights and action
    • Enabling self-service analytics
  • Empowering Insights

    • Data-driven decision-making
    • Accelerated Insights
  • Data Governance

    • Importance of data governance in Medallion Architecture
    • Implementing data ownership and stewardship
    • Ensuring data quality and security

Why Attend:

Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.

  • Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
  • Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
  • Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
  • Scalability: The layered approach can accommodate growing data volumes and complexity.
  • Cost Efficiency: Optimized data storage and processing can reduce costs.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.

  • Key Characteristics:
    • Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
    • Data is ingested as-is, without any cleaning or transformation
    • High volume and velocity
    • Data retention policies are crucial
  • Benefits:
    • Preserves original data for potential future analysis
    • Enables data reprocessing
    • Supports data lineage and auditability

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Raw Zone Diagram

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.

  • Key Characteristics:
    • Data is cleansed and standardized
    • Basic transformations are applied (e.g., data type conversions, null handling)
    • Data is structured into tables or views
    • Data quality checks are implemented
    • Data retention policies may be shorter than the Raw Zone
  • Benefits:
    • Improves data quality and consistency
    • Provides a foundation for further analysis
    • Enables data exploration and discovery
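
As an illustrative sketch (dataset, table, and column names are assumptions, not from the session), a Bronze-layer view might apply the cleansing steps listed above over a raw external table:

-- Hypothetical Bronze-layer view: standardize types, trim text, and handle nulls
CREATE OR REPLACE VIEW `medallion.bronze_telemetry` AS
SELECT
  TRIM(device_number) AS device_number,
  SAFE_CAST(bay_id AS INT64) AS bay_id,
  COALESCE(factory, 'UNKNOWN') AS factory,
  SAFE_CAST(created AS DATETIME) AS created
FROM `medallion.raw_telemetry`
WHERE device_number IS NOT NULL;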

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Bronze Zone Diagram

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

  • Key Characteristics:
    • Data is cleansed, standardized, and enriched
    • Data is structured for analytical purposes (e.g., normalized, de-normalized)
    • Data is optimized for query performance (e.g., partitioning, indexing)
    • Data is aggregated and summarized for specific use cases
  • Benefits:
    • Improved query performance
    • Supports self-service analytics
    • Enables advanced analytics and machine learning
    • Reduces query costs
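
A hedged sketch of a Silver-layer build, again with assumed names: data is aggregated and partitioned so downstream queries scan less data.

-- Hypothetical Silver-layer table: summarized and partitioned for query performance
CREATE OR REPLACE TABLE `medallion.silver_telemetry_daily`
PARTITION BY reading_date AS
SELECT
  DATE(created) AS reading_date,
  factory,
  COUNT(*) AS reading_count,
  COUNT(DISTINCT device_number) AS active_devices
FROM `medallion.bronze_telemetry`
GROUP BY reading_date, factory;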

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Silver Zone Diagram

The Gold Zone: Your Data's Final Destination

  • Definition: The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
  • Key Characteristics:
    • Data is highly refined, aggregated, and optimized for specific use cases
    • Data is often materialized for performance
    • Data is subject to rigorous quality checks and validation
    • Data is secured and governed
  • Benefits:
    • Enables rapid insights and decision-making
    • Supports self-service analytics and reporting
    • Provides a foundation for advanced analytics and machine learning
    • Reduces query latency
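
One way to materialize a curated Gold dataset, sketched with assumed names (a scheduled table build or a BI extract would work just as well):

-- Hypothetical Gold-layer materialized view for BI consumption
CREATE MATERIALIZED VIEW `medallion.gold_factory_kpis` AS
SELECT
  factory,
  reading_date,
  SUM(reading_count) AS total_readings
FROM `medallion.silver_telemetry_daily`
GROUP BY factory, reading_date;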

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Gold Zone Diagram

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

  • Key Characteristics:
    • Data is accessible and easily consumable
    • Supports various analytical tools and platforms (BI, ML, data science)
    • Enables self-service analytics
    • Drives business decisions and actions
  • Examples of Consumption Tools:
    • Business Intelligence (BI) tools (Looker, Tableau, Power BI)
    • Data science platforms (Python, R, SQL)
    • Machine learning platforms (TensorFlow, PyTorch)
    • Advanced analytics tools

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. Data Governance is essential for ensuring data quality, consistency, and security.

Key components of data governance include:

  • Data Lineage: Tracking data's journey from source to consumption.
  • Data Ownership: Defining who is responsible for data accuracy and usage.
  • Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
  • Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. It involves several key steps: data extraction, cleaning, loading, data type casting, applying naming conventions, and implementing incremental loads that continuously insert new information since the last update via batch processes.
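
A minimal sketch of an incremental load, assuming a Bronze source and a Silver target that share a created timestamp (names are illustrative): only rows newer than the last processed timestamp are inserted by the batch process.

-- Hypothetical incremental load: insert only rows that arrived since the last run
MERGE `medallion.silver_telemetry` AS target
USING (
  SELECT device_number, bay_id, factory, created
  FROM `medallion.bronze_telemetry`
  WHERE created > (SELECT MAX(created) FROM `medallion.silver_telemetry`)
) AS source
ON target.device_number = source.device_number AND target.created = source.created
WHEN NOT MATCHED THEN
  INSERT (device_number, bay_id, factory, created)
  VALUES (source.device_number, source.bay_id, source.factory, source.created);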

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Data Governance: Metadata

Metadata assigns the owner, steward, and responsibilities for the data.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Governance Metadata

Summary: Leverage Medallion Architecture for Success

  • Key Benefits:
    • Improved data quality
    • Enhanced governance
    • Accelerated insights
    • Scalability
    • Cost Efficiency.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, Data Engineering Process Fundamentals, which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository Introduction to Data Engineering Process Fundamentals.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com

11/19/25

From Raw Data to Governance: Refining Data with the Medallion Architecture Nov 2025

Overview

Build upon your existing data engineering expertise and discover how Medallion Architecture can transform your data strategy. This session provides a hands-on approach to implementing Medallion principles, empowering you to create a robust, scalable, and governed data platform.

We'll explore how to align data engineering processes with Medallion Architecture, identifying opportunities for optimization and improvement. By understanding the core principles and practical implementation steps, you'll learn how to optimize data pipelines, enhance data quality, and unlock valuable insights through a structured, layered approach to drive business success.

From Raw Data to Governance: Refining Data with the Medallion Architecture

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

  • Introduction to Medallion Architecture

    • Defining Medallion Architecture
    • Core Principles
    • Benefits of Medallion Architecture
  • The Raw Zone

    • Understanding the purpose of the Raw Zone
    • Best practices for data ingestion and storage
  • The Bronze Zone

    • Data transformation and cleansing
    • Creating a foundation for analysis
  • The Silver Zone

    • Data optimization and summarization
    • Preparing data for consumption
  • The Gold Zone

    • Curated data for insights and action
    • Enabling self-service analytics
  • Empowering Insights

    • Data-driven decision-making
    • Accelerated Insights
  • Data Governance

    • Importance of data governance in Medallion Architecture
    • Implementing data ownership and stewardship
    • Ensuring data quality and security

Why Attend:

Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.

  • Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
  • Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
  • Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
  • Scalability: The layered approach can accommodate growing data volumes and complexity.
  • Cost Efficiency: Optimized data storage and processing can reduce costs.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.

  • Key Characteristics:
    • Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
    • Data is ingested as-is, without any cleaning or transformation
    • High volume and velocity
    • Data retention policies are crucial
  • Benefits:
    • Preserves original data for potential future analysis
    • Enables data reprocessing
    • Supports data lineage and auditability

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Raw Zone Diagram

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.

  • Key Characteristics:
    • Data is cleansed and standardized
    • Basic transformations are applied (e.g., data type conversions, null handling)
    • Data is structured into tables or views
    • Data quality checks are implemented
    • Data retention policies may be shorter than the Raw Zone
  • Benefits:
    • Improves data quality and consistency
    • Provides a foundation for further analysis
    • Enables data exploration and discovery

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Bronze Zone Diagram

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

  • Key Characteristics:
    • Data is cleansed, standardized, and enriched
    • Data is structured for analytical purposes (e.g., normalized, de-normalized)
    • Data is optimized for query performance (e.g., partitioning, indexing)
    • Data is aggregated and summarized for specific use cases
  • Benefits:
    • Improved query performance
    • Supports self-service analytics
    • Enables advanced analytics and machine learning
    • Reduces query costs

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Silver Zone Diagram

The Gold Zone: Your Data's Final Destination

  • Definition: The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
  • Key Characteristics:
    • Data is highly refined, aggregated, and optimized for specific use cases
    • Data is often materialized for performance
    • Data is subject to rigorous quality checks and validation
    • Data is secured and governed
  • Benefits:
    • Enables rapid insights and decision-making
    • Supports self-service analytics and reporting
    • Provides a foundation for advanced analytics and machine learning
    • Reduces query latency

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Gold Zone Diagram

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

  • Key Characteristics:
    • Data is accessible and easily consumable
    • Supports various analytical tools and platforms (BI, ML, data science)
    • Enables self-service analytics
    • Drives business decisions and actions
  • Examples of Consumption Tools:
    • Business Intelligence (BI) tools (Looker, Tableau, Power BI)
    • Data science platforms (Python, R, SQL)
    • Machine learning platforms (TensorFlow, PyTorch)
    • Advanced analytics tools

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. Data Governance is essential for ensuring data quality, consistency, and security.

Key components of data governance include:

  • Data Lineage: Tracking data's journey from source to consumption.
  • Data Ownership: Defining who is responsible for data accuracy and usage.
  • Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
  • Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
  • Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. It involves several key steps: data extraction, cleaning, loading, data type casting, applying naming conventions, and implementing incremental loads that continuously insert new information since the last update via batch processes.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Data Governance: Metadata

Metadata assigns the owner, steward, and responsibilities for the data.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Governance Metadata

Summary: Leverage Medallion Architecture for Success

  • Key Benefits:
    • Improved data quality
    • Enhanced governance
    • Accelerated insights
    • Scalability
    • Cost Efficiency.

From Raw Data to Governance: Refining Data with the Medallion Architecture -  Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

Data Engineering Process Fundamentals - Book by Oscar Garcia

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, Data Engineering Process Fundamentals, which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository Introduction to Data Engineering Process Fundamentals.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com

10/29/25

From Raw Data to Analytics: The Modern Data Layer Architecture



Overview

This presentation is part of the Data Engineering Process Fundamentals series, focusing on the essential architectural components—the Data Lake and the Data Warehouse—and defining their respective roles in a modern analytics ecosystem.

From Raw Data to Analytics: The Modern Data Layer Architecture

  • Follow this GitHub repo during the presentation: (Star the project to follow and get updates)

👉 GitHub Repo

  • Data engineering Series:

👉 Blog Series

YouTube Video

Video Agenda

Agenda:

  1. Introduction to Data Engineering: Brief overview of the data engineering landscape and its critical role in modern data-driven organizations.

  2. Operational Data

  3. Understanding Data Lakes: Explanation of what a data lake is and its purpose in storing vast amounts of raw and unstructured data.

  4. Exploring Data Warehouses: Definition of data warehouses and their role in storing structured, processed, and business-ready data.

  5. Comparing Data Lakes and Data Warehouses: Comparative analysis of data lakes and data warehouses, highlighting their strengths and weaknesses, and discussing when to use each based on specific use cases and business needs.

  6. Integration and Data Pipelines: Insight into the seamless integration of data lakes and data warehouses within a data engineering pipeline, with a code walkthrough showcasing data movement and transformation between these two crucial components.

  7. Real-world Use Cases: Presentation of real-world use cases where effective use of data lakes and data warehouses led to actionable insights and business success, with a hands-on demonstration using Python, Jupyter Notebook, and SQL to solidify the concepts discussed, providing attendees with practical insights and skills.

  8. Q&A and Hands-on Session: An interactive Q&A session to address any queries.

Conclusion:

This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data lakes and data warehouses. By the end of this presentation, participants will grasp how to effectively utilize these tools, enabling them to design efficient data solutions and drive informed business decisions.

This presentation will be accompanied by live code demonstrations and interactive discussions, ensuring attendees gain practical knowledge and valuable insights into the dynamic world of data engineering.

Supporting Materials Reminder

Subsequent Sessions: Join us for future sessions in our Data Engineering Process Fundamentals series, where we will build a data pipeline and delve deeper into topics like orchestration and governance.

Resources: This presentation is based on the book, Data Engineering Process Fundamentals, and all supporting code and examples are available on our popular GitHub repository.

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Topics

  • Data Lake and Data Warehouse
  • Discovery and Data Analysis
  • Design and Infrastructure Planning
  • Data Lake - Pipeline and Orchestration
  • Data Warehouse - Design and Implementation
  • Analysis and Visualization

Follow this project: Give a star

👉 Data Engineering Process Fundamentals

Operational Data

Operational data is often generated by applications and stored in transactional relational databases like SQL Server and Oracle, or in NoSQL (JSON) databases like MongoDB and Firebase. This is the data created when an application saves a user transaction, such as contact information, a purchase, or other activity within the application.

Features:

  • Application support and transactions
  • Relational data structure and SQL or document structure NoSQL
  • Small queries for case analysis

Not Best For:

  • Reporting system
  • Large queries
  • Centralized Big Data system

Data Engineering Process Fundamentals - Operational Data

Data Lake - Analytical Data Staging

A Data Lake is an optimized storage system for Big Data scenarios. The primary function is to store the data in its raw format without any transformation. Analytical data is the transaction data that has been extracted from a source system via a data pipeline as part of the staging data process.

Features:

  • Store the data in its raw format without any transformation
  • This can include structured data like CSV files, semi-structured data like JSON and XML documents, or columnar data like Parquet files
  • Low cost for massive storage power
  • Not designed for querying or data analysis
  • It is used as external tables by most systems

Data Engineering Process Fundamentals - Analytical Data staging

Data Warehouse - Analytical Data

A Data Warehouse is a centralized storage system that stores integrated data from multiple sources. The system is designed to host and serve Big Data scenarios with lower operational cost than transactional databases, but higher cost than a Data Lake. This system hosts the analytical data that has been processed and is ready for analytical purposes.

Data Warehouse Features:

  • Stores historical data in relational tables with an optimized schema, which enables the data analysis process
  • Provides SQL support to query the data
  • It can integrate external resources, like CSV and Parquet files stored in Data Lakes, as external tables (see the sketch below)
  • The system is designed to host and serve Big Data scenarios. It is not meant to be used as a transactional system
  • Storage is more expensive
  • Offloads archived data to Data Lakes
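
A hedged sketch of the external-table integration mentioned above; the bucket path and table name are placeholders (BigQuery syntax shown, since that is the warehouse used elsewhere in this series):

-- Hypothetical external table over Parquet files staged in the data lake
CREATE OR REPLACE EXTERNAL TABLE `analytics.ext_turnstile`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://your-data-lake/turnstile/*.parquet']
);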

Data Engineering Process Fundamentals - Analytical Data Store

Discovery - Data Analysis

During the discovery phase of a Data Engineering Process, we look to identify and clearly document a problem statement, which helps us understand what we are trying to solve. We also take an analytical approach to make observations about the data, its structure, and its source. This leads us into defining the requirements for the project, so we can set the scope, design, and architecture of the solution.

  • Download sample data files
  • Run experiments to make observations
  • Write Python scripts using VS Code or Jupyter Notebooks
  • Transform the data with Pandas
  • Make charts with Plotly
  • Document the requirements

Data Engineering Process Fundamentals - Data Analysis and discovery

Design and Planning

The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful system. It involves defining the system architecture, designing data pipelines, implementing source control practices, ensuring continuous integration and deployment (CI/CD), and leveraging tools like Docker and Terraform for infrastructure automation.

  • Use GitHub for code repo and for CI/CD actions
  • Use Terraform, an Infrastructure as Code (IaC) tool that enables us to manage cloud resources across multiple cloud providers
  • Use Docker containers to run the code and manage its dependencies

Data Engineering Process Fundamentals - Design and Planning

Data Lake - Pipeline and Orchestration

A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake, and monitor cloud resources.

  • This can be code-centric, leveraging languages like Python
  • Or a low-code approach, utilizing tools such as Azure Data Factory, which provides a turn-key solution
  • Monitor services enable us to track telemetry data
  • Docker Hub, GitHub can be used for the CI/CD process

Data Engineering Process Fundamentals - Data Lake - Data Pipeline and Orchestration

Data Warehouse - Design and Implementation

In the design phase, we lay the groundwork by defining the database system, schema model, and technology stack required to support the data warehouse’s implementation and operations. In the implementation phase, we focus on converting conceptual data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis, creating a repeatable and extendable process.
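
As a minimal sketch of the dimension and fact structures described above (table and column names are assumptions based on the MTA turnstile example used throughout this series):

-- Hypothetical star-schema structures for the warehouse implementation
CREATE TABLE IF NOT EXISTS `analytics.dim_station` (
  station_id STRING,
  station_name STRING,
  borough STRING
);

CREATE TABLE IF NOT EXISTS `analytics.fact_turnstile` (
  station_id STRING,
  created_dt TIMESTAMP,
  entries INT64,
  exits INT64
)
PARTITION BY DATE(created_dt);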

Data Engineering Process Fundamentals - Data Warehouse Design and Implementation

Data Warehouse - Data Analysis

Data analysis is the practice of exploring data and understanding its meaning. It involves activities that can help us achieve a specific goal, such as identifying data dimensions and measures, as well as data analysis to identify outliers, trends, and distributions.

  • We can accomplish these activities by writing code using Python and Pandas, SQL, Visual Studio Code or Jupyter Notebooks.
  • What's more, we can use libraries, such as Plotly, to generate some visuals to further analyze data and create prototypes.
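
Staying with SQL for a moment (the session itself leans on Python and Pandas for this step), a hedged sketch of this kind of exploration, with assumed table and column names:

-- Hypothetical profiling query: distribution and spread of a measure per dimension
SELECT
  station_id,
  APPROX_QUANTILES(entries, 4) AS quartiles,   -- [min, q1, median, q3, max]
  AVG(entries) AS avg_entries,
  STDDEV(entries) AS stddev_entries
FROM `analytics.fact_turnstile`
GROUP BY station_id;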

Data Engineering Process Fundamentals - Data Analysis

Data Analysis and Visualization

Data visualization is a powerful tool that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance.

  • Dashboards, in particular, bring together various visual components like charts, graphs, and scorecards into a unified interface that can help us tell a story
  • Use tools like PowerBI, Looker, Tableau to model the data and create enterprise level visualizations

Data Engineering Process Fundamentals - Data Visualization

Conclusion

Both data lakes and data warehouses are essential components of a data engineering project. The primary function of a data lake is to store large amounts of operational data in its raw format, serving as a staging area for analytical processes. In contrast, a data warehouse acts as a centralized repository for information, enabling engineers to transform, process, and store extensive data. This allows the analytical team to utilize coding languages like Python and tools such as Jupyter Notebooks, as well as low-code platforms like Looker Studio and Power BI, to create enterprise-quality dashboards for the organization.

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, Data Engineering Process Fundamentals, which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository Introduction to Data Engineering Process Fundamentals.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com

9/29/25

From Blueprint to Build - The Design and Planning Phase in Data Engineering

Overview

The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful and scalable solution. This phase ensures that the architecture is strategically aligned with business objectives, optimizes resource utilization, and mitigates potential risks.

Data Engineering Process Fundamentals

  • Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

  • Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

In this session, we embark on the next chapter of our data journey, delving into the critical Design and Planning Phase. As we transition from discovery to design, we'll unravel the intricacies of:

System Design and Architecture:

  • Understanding the foundational principles that shape a robust and scalable data system.

Data Pipeline and Orchestration:

  • Uncovering the essentials of designing an efficient data pipeline and orchestrating seamless data flows.

Source Control and Deployment:

  • Navigating the best practices for source control, versioning, and deployment strategies.

CI/CD in Data Engineering:

  • Implementing Continuous Integration and Continuous Deployment (CI/CD) practices for agility and reliability.

Docker Container and Docker Hub:

  • Harnessing the power of Docker containers and Docker Hub for containerized deployments.

Cloud Infrastructure with IaC:

  • Exploring technologies for building out cloud infrastructure using Infrastructure as Code (IaC), ensuring efficiency and consistency.

Why Join:

  • Gain insights into designing scalable and efficient data systems.

  • Learn best practices for cloud infrastructure and IaC.

  • Discover the importance of data pipeline orchestration and source control.

  • Explore the world of CI/CD in the context of data engineering.

  • Unlock the potential of Docker containers for your data workflows.

Some of the technologies that we will be covering:

  • Cloud Infrastructure
  • Data Pipelines
  • GitHub and Actions
  • VSCode
  • Docker and Docker Hub
  • Terraform

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Topics

  • Importance of Design and Planning
  • System Design and Architecture
  • Data Pipeline and Orchestration
  • Source Control and CI/CD
  • Docker Containers
  • Cloud Infrastructure with IaC

Follow this project: Give a star

👉 Data Engineering Process Fundamentals

Importance of Design and Planning

The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful and scalable solution. This phase ensures that the architecture is strategically aligned with business objectives, optimizes resource utilization, and mitigates potential risks.

Foundational Areas

  • Designing the data pipeline and technology specifications like flows, coding language, data governance and tools
  • Define the system architecture like cloud services for scalability, data platform
  • Source control and deployment automation with CI/CD
  • Using Docker containers for environment isolation to avoid deployment issues
  • Infrastructure automation with Terraform or cloud CLI tools
  • System monitoring, notification, and recovery

Data Engineering Process Fundamentals - Design and Planning

System Design and Architecture

In a system design, we need to clearly define the different technologies that should be used for each area of the solution. It includes the high-level system architecture, which defines the different components and their integration.

  • The design outlines the technical solution, including system architecture, data integration, flow orchestration, storage platforms, and data processing tools. It focuses on defining technologies for each component to ensure a cohesive and efficient solution.

  • A system architecture is a critical high-level design encompassing various components such as data sources, ingestion resources, workflow orchestration, storage, transformation services, continuous ingestion, validation mechanisms, and analytics tools.

Data Engineering Process Fundamentals - System Architecture

Data Pipeline and Orchestration

A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake, and monitor cloud resources.

  • This can be code-centric, leveraging languages like Python, SQL
  • Or a low-code approach, utilizing tools such as Azure Data Factory, which provides a turn-key solution
  • Monitor services enable us to track telemetry data to support operational requirements
  • Docker Hub and GitHub can be used for the CI/CD process and to deploy our code-centric solutions
  • Scheduling, recovering from failures and dashboards are essentials for orchestration
  • Low-code solutions, like Data Factory, can also be used

Data Engineering Process Fundamentals - Data Pipeline

Source Control - CI/CD

Implementing source control practices alongside Continuous Integration and Continuous Delivery (CI/CD) pipelines is vital for facilitating agile development. This ensures efficient collaboration, change tracking, and seamless code deployment, crucial for addressing ongoing feature changes, bug fixes, and new environment deployments.

  • Systems like Git facilitate effective code and configuration file management, enabling collaboration and change tracking.
  • Platforms such as GitHub enhance collaboration by providing a remote repository for sharing code.
  • CI involves integrating code changes into a central repository, followed by automated build and test processes to validate changes and provide feedback.
  • CD automates the deployment of code builds to various environments, such as staging and production, streamlining the release process and ensuring consistency across environments.

Data Engineering Process Fundamentals - GitHub CI/CD

Docker Container and Docker Hub

Docker proves invaluable for our data pipelines by providing self-contained environments with all necessary dependencies. With Docker Hub, we can effortlessly distribute pipeline images, facilitating swift and reliable provisioning of new environments.

  • Docker containers streamline the deployment process by encapsulating application and dependency configurations, reducing runtime errors.
  • Containerizing data pipelines ensures reliability and portability by packaging all necessary components within a single container image.
  • Docker Hub serves as a centralized container registry, enabling seamless image storage and distribution for streamlined environment provisioning and scalability.

Data Engineering Process Fundamentals - Docker

Cloud Infrastructure with IaC

Infrastructure automation is crucial for maintaining consistency, scalability, and reliability across environments. By defining infrastructure as code (IaC), organizations can efficiently provision and modify cloud resources, mitigating manual errors.

  • Define infrastructure configurations as code, ensuring consistency across environments.
  • Easily scale resources up or down to meet changing demands with code-defined infrastructure.
  • Reduce manual errors and ensure reproducibility by automating resource provisioning and management.
  • Track infrastructure changes under version control, enabling collaboration and ensuring auditability.
  • Track infrastructure state, allowing for precise updates and minimizing drift between desired and actual configurations.

Data Engineering Process Fundamentals - Terraform

Summary

The design and planning phase of a data engineering project sets the stage for success. From designing the system architecture and data pipelines to implementing source control, CI/CD, Docker, and infrastructure automation with Terraform, every aspect contributes to efficient and reliable deployment. Infrastructure automation, in particular, plays a critical role by simplifying provisioning of cloud resources, ensuring consistency, and enabling scalability, ultimately leading to a robust and manageable data engineering system.

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, Data Engineering Process Fundamentals, which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository Introduction to Data Engineering Process Fundamentals.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com