2/25/26

AI Driven App Architecture - Smart Development Life Cycle Governance

Oscar Garcia @ozkary 2/25/2026 AI , ai-agents , ai-governance , typescript No comments

Overview

As development teams scale, maintaining architectural consistency becomes the biggest bottleneck. Documents are ignored, and linters only catch syntax errors, not design patterns.

In this session, we will demonstrate how to transform AI from a passive coding assistant into an active Architectural Enforcer. By embedding your "unwritten rules" directly into the repository configuration, you create a developer experience where the AI enforces your patterns in real-time.

We will explore how this shifts the workflow: new developers are guided by the AI from day one, preventing architectural leakage before a pull request is ever opened.

AI Driven App Architecture - Smart Development Life Cycle Governance

🚀 Featured Open Source Projects

Explore these curated resources to level up your engineering skills. If you find them helpful, a ⭐️ is much appreciated!

🏗️ Data Engineering

Focus: Real-world ETL & MTA Turnstile Data

🤖 Artificial Intelligence

Focus: LLM Patterns and Agentic Workflows

📉 Machine Learning

Focus: MLOps and Productionizing Models

💡 Contribute: Found a bug or have a suggestion? Open an issue! and be part of the open source project.

YouTube Video

👍 Subscribe to the channel to get notify on new events!

Video Agenda

The Problem: Architectural Drift

Why strict rules (Controller-View, Pascal/camelCase) degrade over time and how AI can fix it.

The Intelligence Engine

Breakdown of the core components: Global Rules, Contextual Guardrails, Agent Tools, and Directory Structure.

Configuration: Global Governance

Setting up global "system prompts" for the repository to enforce tech stack and naming conventions.

Configuration: Contextual Guardrails

Creating "firewalls" for specific folders (e.g., preventing logic in views, preventing API calls in Controllers).

Configuration: The Tooling

Building custom Slash Commands (/new-module) to automate "Vertical Slice" scaffolding.

Configuration: The Auditor Agent

Implementing a specialized "Gatekeeper" persona that scans imports to ensure strict layer separation.

Agent Mapping

A conceptual framework comparing repository configuration to autonomous agent architecture.

💡 Why Attend?

Stop writing boilerplate: Learn to automate complex folder structures with one command.
Reduce PR Reviews: Shift governance "left" by having the AI catch architectural errors instantly.
Interactive Demo: See the .github configuration in action on a real codebase.
Takeaway Code: Leave with the copy-paste markdown templates to implement this in your own repo tomorrow.

Target Audience

Tech Leads & Architects who need to enforce standards across scaling teams.
Developers who are tired of correcting the same patterns in code reviews.
DevOps Engineers interested in "Governance as Code."
Leadership teams that are trying to raise standards and productivity in their organizations.

Presentation

SETTING THE STAGE

The Context

We enforce a strict pattern using the ViCSA architecture
PascalCase for UI Components.
camelCase for Logic & Services.
Separation of Concerns (SoC) is non-negotiable.

The Problem

Architectural Drift: Patterns degrade over time.
Passive Docs: Wiki pages are ignored.
Linter Limits: Linters catch syntax, not architecture.
Solution: Active Governance via AI.

THE INTELLIGENCE ENGINE

Core AI Policies

Centralized Config: Rules live in the repo, not the user's IDE.
Global Rules: Applied to every interaction (System Prompt).
Contextual Rules: Triggered only when specific files are opened.
Agent Tools: Custom commands to scaffold new components, controllers or services.

AI Driven App Architecture - Smart Development Life Cycle Governance - Project Structure

CONFIGURATION: GLOBAL GOVERNANCE

Global Instructions

File: .github/copilot-instructions.md

This acts as the System Prompt for the entire repository. It is silently added to every interaction.

Tech Stack: TS, Tailwind, Hooks.
Naming: Pascal vs camelCase.
Flow: View → Controller → Service -> API.

AI Driven App Architecture - Smart Development Life Cycle Governance - Global Governance

DEV EXPERIENCE: THE SILENT ENFORCER

Without Config

A developer asks:

How do I create a new service?

AI suggests a generic Class-based service.
Suggests creating a utils.js file.
Ignores project folder structure.

With Config

A developer asks: How do I create a new service?"

AI reads the Governance.
Response: Create src/services/userAuth/index.ts using a functional export, as per project standards.

CONFIGURATION: CONTEXTUAL GUARDRAILS

View Layer Rules

File: .github/instructions/controller-layer.md

Trigger: Opening any **/*.tsx file.

"You are a View."
"No Logic allowed."
"No direct API calls."

Controller Layer Rules

File: .github/instructions/view-layer.md

Trigger: Opening any **/controller.ts file.

"You are a Controller."
"Use Services, NOT Fetch."
"Manage State here."

DEV EXPERIENCE: REAL-TIME INTERVENTION

The Scenario

A developer tries to write fetch() inside a UI Component (index.tsx).
They ask Copilot: "Write a fetch call here for me."

The Intervention

Ghost Text: Copilot refuses to autocomplete the network call.

Chat Reply:

I cannot. This is a View file. Please move this logic to the sibling Controller (index.ts) and import it.

CONFIGURATION: THE TOOLING

Prompt Library

File: .github/prompts/new-module.md

These act as Agent Tools or "Slash Commands".

Goal: Automate the "Vertical Slice".
Benefit: Complex scaffolding logic is stored in the repo, not in the developer's head.
Usage: /new-module

# Prompt Library (The Scaffolder)
File: `.github/prompts/new-component.md`
Goal: Automate the creation of a standalone UI Component with optional Service/API layers.

# Create New Component
I need to generate a new component following our **Folder-as-Namespace** pattern.
**Command:** `/new-component:{{componentName}} {{args}}`

Please generate the code blocks for the layers requested in the arguments (service, api). 
*Note: Logic folders must be camelCase. UI folders must be PascalCase.*

---

### Component Layer (Required)
**Folder:** `src/components/{{componentName (PascalCase)}}/`
- **File:** `controller.ts` (Controller): Logic and State only.
- **File:** `index.tsx` (View): Pure UI. Imports Controller.
---


### Service Layer (Optional)
*Condition: Generate only if 'service' is present in {{args}}.*

**File:** `src/services/{{componentName (camelCase)}}/index.ts`
- **Role:** Business logic and data transformation.
- **Code:** Import the API (if requested). Export a service object or functional exports.

---

### API Layer (Optional)
*Condition: Generate only if 'api' is present in {{args}}.*

**File:** `src/apis/{{componentName (camelCase)}}/index.ts`
- **Role:** Define specific endpoints.
- **Code:** Import `coreClient` from `src/apis/index.ts`. Export async functions with typed responses.

---

### Style Guidelines
- **Typing:** Use TypeScript interfaces for all Props and Data models.
- **Separation:** Logic stays in `controller.ts`, JSX stays in `index.tsx`.
- **Naming:** Components use PascalCase; Services/APIs use camelCase.

DEV EXPERIENCE: THE SCAFFOLDING

The Command

Starting a new feature called "Sales Dashboard".

Action:

/new-module featureName:Sales Dashboard

The Execution

Analyzes the request.
Applies PascalCase to Containers/Components folders.
Applies camelCase to api/service folders.
Generates the Controller-View pair instantly.

THE RESULT: GENERATED ARCHITECTURE

The Results

Layers generated instantly.
Correct naming conventions applied.
Zero manual boilerplate.

AI Driven App Architecture - Smart Development Life Cycle Governance - Project Structure

CONFIGURATION: THE AUDITOR AGENT

Specialized Persona

File: .github/agents/arch-auditor.md

This creates a named Agent that acts as a Gatekeeper. It doesn't write features; it verifies them.

Role: Architecture Enforcer.
Task: Scans imports to ensure strict layer separation.
Rule: "Views never talk to APIs."

# Custom AI Agent (The Reviewer)
Agent ID: `@vicsa-auditor`

Context: A bot that ensures the chain of command is respected using the ViCSA architecture (View Controller Service API)

## Primary Objective
name: Architecture Auditor
description: Verifies strict separation of Controller, Service, and View layers.
tools: [code-search]

---
## Role
You ensure the integrity of the data flow: View -> Controller -> Service -> API.

## Audit Logic
When asked to "Audit this feature":

1. **Check the View (.tsx):** - FAIL if it imports `src/services`.
   - FAIL if it imports `src/apis`.
   - PASS only if it imports `./index`.

2. **Check the Controller (.ts):**
   - FAIL if it uses `fetch` or `axios`.
   - PASS only if it delegates to `src/services`.

3. **Check the Service:**
   - FAIL if it defines its own URL logic.
   - PASS only if it imports `src/apis/index.ts`.

DEV EXPERIENCE: THE CODE REVIEW

The Interaction

Before raising a pull request, the developer invokes the auditor.

Prompt:

@vicsa-auditor check this component for violations.

Response:

✅ PASS: SalesDashboard/index.tsx imports only from its sibling controller. No direct API calls found.

AI Driven App Architecture - Smart Development Life Cycle Governance - Review Process

THE AUTONOMY ADVANTAGE

AI enforces the ViCSA architecture through continuous observation and autonomous execution.

Perception: Continuously observes the active workspace, file paths (e.g., src/components/), and context to understand the developer's structural intent.
Reasoning: Evaluates the perceived context against the repository's .github Guardrails, determining if a View is bypassing a Controller or violating Separation of Concerns, SoC.
Action: Executes autonomous scaffolding, enforces strict ViCSA governance, provides recommended fixes feedback.

SUMMARY & AGENT MAPPING

Embedding governance directly into the repository transforms the development lifecycle. It replaces passive wiki pages with active, real-time enforcement, ensuring that every AI suggestion aligns with architectural standards. This eliminates "drift", accelerates onboarding, and turns Copilot into a domain-expert partner.

Agent Component	GitHub Implementation
System Prompt	Global Instructions (copilot-instructions.md)
Context / RAG	Modular Instructions (instructions/*.md)
Tools / Functions	Prompt Library (prompts/*.md)
Human Prompt	Chat Window
Persona	Agent Personas (i.e. agents/arch-auditor.md)

RAG: Retrieval augmented generation

🌟 Let's Connect & Build Together

Thanks for reading! 😊 If you enjoyed these resources, let's stay in touch! I share deep-dives into AI/ML patterns and host community events here:

GDG Broward: Join our local dev community for meetups and workshops.
Global AI Events: Join Global AI Events.
LinkedIn: Let's connect professionally! I share insights on engineering.
GitHub: Follow my open-source journey and star the repos you find useful.
YouTube: Watch step-by-step tutorials on the projects listed above.
BlueSky / X / Twitter: Daily tech updates and quick engineering tips.

1/21/26

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment

Oscar Garcia @ozkary 1/21/2026 AI , cloud-engineering , data , data lake , data warehouse No comments

Overview

In the modern data landscape, the wall between "where data lives" and "how we get insights" is crumbling. This session focuses on the Cognitive Data Lakehouse. A paradigm shift that allows developers to treat a fragmented data lake as a unified, high-performance warehouse.

We will explore how to move beyond brittle ETL pipelines using Zero-ETL architecture in the cloud. The core of our discussion will center on using integrated AI capabilities and semantic modeling to solve the "Metadata Mess" inherent in global manufacturing feeds without moving a single byte of data. From raw telemetry in object storage to semantic intelligence via large language models, we’ll show you the real-world application of AI in modern data engineering.

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment

🚀 Featured Open Source Projects

Explore these curated resources to level up your engineering skills. If you find them helpful, a ⭐️ is much appreciated!

🏗️ Data Engineering

Focus: Real-world ETL & MTA Turnstile Data

🤖 Artificial Intelligence

Focus: LLM Patterns and Agentic Workflows

📉 Machine Learning

Focus: MLOps and Productionizing Models

💡 Contribute: Found a bug or have a suggestion? Open an issue! and be part of the open source project.

YouTube Video

Video Agenda

Phase 1: Foundations & The Zero-ETL Strategy

We kick off with the infrastructure layer. We'll discuss the design of cross-region telemetry tables and how modern cloud engines allow us to query raw files in object storage with the performance of a native table. We’ll establish why "0x data movement" is the goal for modern scalability.

Phase 2: Confronting the Metadata Mess

Schema drift and inconsistent naming across global regions are the enemies of unified analytics. We will look at why traditional manual mapping fails and how we can use AI inference to bridge these gaps and standardize naming conventions automatically.

Phase 3: AI-Driven Unification & Semantic Modeling

The "Cognitive" part of the Lakehouse. We’ll dive into the technical implementation of registering AI models directly within your data warehouse environment. You'll see how to create an abstraction layer that uses AI to normalize data on the fly, creating a robust semantic model.

Phase 4: Scaling to a Global Feed

Finally, we’ll demonstrate the DevOps workflow for integrating a new international factory feed into a global telemetry view. We'll show how to maintain a "Single Source of Intelligence" that BI tools and analysts can consume without needing to know the complexities of the underlying lake.

💡 Why Attend?

Master Modern Architecture: Learn the "Abstraction Layer" design pattern that is replacing traditional, slow ETL/ELT processes.
Hands-on AI for Data Ops: See exactly how to use AI and semantic modeling within SQL-based workflows to automate data cleaning and schema mapping.
Scale Without Pain: Discover how to manage global data sources (multi-region, multi-format) through a single governing layer.
Developer Networking: Connect with other data architects, engineering leaders, and professionals solving similar scale and complexity challenges.

Target Audience: Data Engineers, Analytics Architects, Cloud Developers, and anyone interested in the intersection of Big Data and Generative AI.

Presentation

Phase 1: The Zero-ETL Strategy

INFRASTRUCTURE: DATA STAYS LOCAL

Architecting for Scale

Storage Decoupling: Raw files remain in the Data Lake, eliminating replication overhead.
Virtual Access: Data Warehouse external tables allow immediate querying of CSV, Parquet, and JSON.
Minimal Latency: No waiting for ingest pipelines; analysis starts upon file arrival.

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment - Medallion Architecture Design Diagram

UNMATCHED STORAGE EFFICIENCY

Zero Data Replication

Traditional ETL requires moving data across multiple tiers. Our architecture ensures a single source of truth with zero data movement between GCS and BigQuery compute.
This is similar to the Bronze Zone in a Medallion Architecture.

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment - Medallion Architecture Design Diagram

Phase 2: The Metadata Mess

CHALLENGES OF UNIFICATION

Schema Friction

Feeds arrive with inconsistent headers (e.g., 'Device Number' vs 'deviceNo'). Manual aliasing is fragile and slow.

Entity Drift

Names and IDs vary across systems, preventing standard joins from matching records effectively.

Type Mismatches

Varying data types for the same concept (Integer vs String) crash standard SQL aggregation views.

Phase 3: The AI Solution

BIGQUERY STUDIO: THE AI INTERFACE

Remote AI Registration

CREATE MODEL `gemini_remote`
REMOTE WITH CONNECTION `bq_connection`
OPTIONS(endpoint = 'gemini-1.5-pro');

Automated Inference

AI "reads" information schemas to infer mapping logic, moving you from Code Author to Logic Approver.

SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Compare Source A and B schemas. Write a SQL view to unify them." AS prompt)
);

AI-ASSISTED SCHEMA DISCOVERY

Prompting for Base Tables

Using AI to generate the DDL for external tables by pointing to compressed feeds in the lake (USA & MEX factories).

SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Create External Tables as smart_factory.us_telemetry with path 'gs://factory-dl/us/dev-540/telemetry-*.csv.gz' '. Include option CSV, GZIP compression and skip 1 row. Infer and add the schema using lower case" AS prompt));

SELECT ml_generate_text_result
FROM ML.GENERATE_TEXT(
  MODEL `gemini_remote`,
  (SELECT "Create External Tables as smart_factory.mx_telemetry with path 'gs://factory-dl/mx/dev-940/telemetry-*.csv.gz' '. Include option CSV, GZIP compression and skip 1 row. Use schema device_number STRING, bay_id INT64, factory STRING, created STRING" AS prompt));

Generated BigLake DDL

-- USA Factory Feed
CREATE OR REPLACE EXTERNAL TABLE `smart_factory.us_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/us/dev-540/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'
);

-- MEX Factory Feed
CREATE OR REPLACE EXTERNAL TABLE `smart_factory.mx_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/mx/dev-940/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'
);

AI-ABSTRACTION: THE VIEW LAYER

Generating the Interface

AI creates a clean abstraction view for each external table, decoupling raw storage from the analytics model.

-- AI Instruction
"Create a view named 
smart_factory.vw_us_telemetry 
selecting all columns from the
usa_telemetry table. Safe cast the created column as datetime."

Abstraction Layer DDL

-- Semantic Abstraction Layer
CREATE OR REPLACE VIEW `smart_factory.vw_us_telemetry` AS
SELECT 
  device_number,
  bay_id,
  factory,
  SAFE_CAST(created as DATETIME) AS created
FROM `smart_factory.us_telemetry`;

COGNITIVE UNIFICATION

The Multi-Region Model

The unified view now consumes from the abstraction layer, ensuring that changes to raw storage don't break the views down stream.

-- AI Instruction
"Create a view with name
smart_factory.vw_telemetry that creates a union of all the fields from the views vw_[region]_telemetry. The regions include us and mx. List out all the field names. Never use * for field names"

Unified Global View

-- Semantic Abstraction Layer
CREATE OR REPLACE VIEW `smart_factory.vw_telemetry` AS
SELECT 
  device_number,
  bay_id,
  factory,
  created
FROM `smart_factory.vw_us_telemetry`
UNION ALL
SELECT 
  device_number,
  bay_id,
  factory,
  created
FROM `smart_factory.vw_mx_telemetry`

SCALING TO CHINA FACTORY

Evolving the Model

Adding the new China feed by generating the External Table definition via AI.

CREATE OR REPLACE EXTERNAL TABLE `smart_factory.cn_telemetry` (
  device_number STRING,
  bay_id INT64,
  factory STRING,
  created STRING
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://factory-dl/cn/dev-900/telemetry*.csv.gz'],
  skip_leading_rows = 1,
  compression = 'GZIP'

Human-in-the-Loop DevOps

Use AI to update the unified view with the new data feed. Review and apply the changes by the DevOps team, as changes to a production view require approval.

Manufacturing SPC & Root Cause Analysis

This query calculates a rolling mean and standard deviation over the last 10 minutes of telemetry to detect anomalies, “Out of Control” conditions.

WITH TelemetryStats AS (
  SELECT
    machine_id,
    timestamp,
    sensor_reading,
    -- Calculate rolling stats for the "Control Chart"
    AVG(sensor_reading) OVER(PARTITION BY machine_id ORDER BY timestamp ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) as rolling_avg,
    STDDEV(sensor_reading) OVER(PARTITION BY machine_id ORDER BY timestamp ROWS BETWEEN 20 PRECEDING AND CURRENT ROW) as rolling_stddev
  FROM `production_data.mx_telemetry_stream`
  WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
),
Anomalies AS (
  SELECT *,
    -- Define "Out of Control" (Reading > 3 Sigma from mean)
    ABS(sensor_reading - rolling_avg) > (3 * rolling_stddev) AS is_out_of_control
  FROM TelemetryStats
)
SELECT * FROM Anomalies WHERE is_out_of_control = TRUE;

Control Chart Visualization

The Cognitive Data Lakehouse: AI-Driven Unification and Semantic Modeling in a Zero-ETL Environment - Control Charts

ADVANTAGE COMPARISON MATRIX

Metric	Manual Data Engineering	AI-Augmented Zero-ETL
Unification Speed	Days/Weeks per Source	Minutes via Generative AI
Schema Drift	Manual Script Rewrites	Adaptive AI View Discovery
Infrastructure Cost	High (Data Redundancy)	Minimal (In-place on GCS)

Strategic Intelligence ROI:

ROI(ai) = Insights Velocity / (Movement Cost + Labor Hours)

FINAL THOUGHTS: STRATEGIC SUMMARY

Legacy Challenges

Brittle ETL: Manual pipelines break with every schema change.
Cost Inefficiency: Redundant storage for processed data.
Semantic Silos: Hard-coded aliases for disparate naming conventions.
Slow Time-to-Insight: Weeks spent on manual schema alignment.

AI-Assisted Solutions

Zero-ETL Arch: Cost-effective storage with Data Lake virtual access.
Automated Inference: Vertex AI handles the "heavy lifting" of mapping.
Adaptive DevOps: Scalable model evolution (USA → MEX → China).
Unified Intelligence: One virtual source of truth for global analytics.

Moving from Data Reporting to Active Semantic Intelligence.

We've covered a lot today, but this is just the beginning!

If you're interested in learning more about building cloud data pipelines, I encourage you to check out my book, 'Data Engineering Process Fundamentals,' part of the Data Engineering Process Fundamentals series. It provides in-depth explanations, code samples, and practical exercises to help in your learning.

📅 Upcoming Sessions

Our upcoming series expands beyond data engineering to bridge the gap between AI, Machine Learning, and modern cloud architecture. Using our Data, AI, and ML GitHub blueprints, we provide the code-first patterns needed to build everything from Zero-ETL pipelines to scalable LLM-powered systems. Join us to explore how these integrated disciplines work together to turn raw data into production-ready intelligence.

🌟 Let's Connect & Build Together

If you enjoyed these resources, let's stay in touch! I share deep-dives into AI/ML patterns and host community events here:

GDG Broward: Join our local dev community for meetups and workshops.
LinkedIn: Let's connect professionally! I share insights on engineering.
GitHub: Follow my open-source journey and star the repos you find useful.
YouTube: Watch step-by-step tutorials on the projects listed above.
BlueSky / X / Twitter: Daily tech updates and quick engineering tips.

👉 Originally published at ozkary.com

12/10/25

From Raw Data to Governance: Refining Data with the Medallion Architecture Dec 2025

Oscar Garcia @ozkary 12/10/2025 data , data analysis , data engineering , data lake , data warehouse No comments

Overview

Build upon your existing data engineering expertise and discover how Medallion Architecture can transform your data strategy. This session provides a hands-on approach to implementing Medallion principles, empowering you to create a robust, scalable, and governed data platform.

We'll explore how to align data engineering processes with Medallion Architecture, identifying opportunities for optimization and improvement. By understanding the core principles and practical implementation steps, you'll learn how to optimize data pipelines, enhance data quality, and unlock valuable insights through a structured, layered approach to drive business success.

From Raw Data to Governance: Refining Data with the Medallion Architecture

Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video - Dec 2025

Video Agenda

Introduction to Medallion Architecture
- Defining Medallion Architecture
- Core Principles
- Benefits of Medallion Architecture
The Raw Zone
- Understanding the purpose of the Raw Zone
- Best practices for data ingestion and storage
The Bronze Zone
- Data transformation and cleansing
- Creating a foundation for analysis
The Silver Zone
- Data optimization and summarization
- Preparing data for consumption
The Gold Zone
- Curated data for insights and action
- Enabling self-service analytics
Empowering Insights
- Data-driven decision-making
- Accelerated Insights
Data Governance
- Importance of data governance in Medallion Architecture
- Implementing data ownership and stewardship
- Ensuring data quality and security

Why Attend:

Gain a deep understanding of Medallion Architecture and its application in modern data engineering. Learn how to optimize data pipelines, improve data quality, and unlock valuable insights. Discover practical steps to implement Medallion principles in your organization and drive data-driven decision-making.

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.

Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
Scalability: The layered approach can accommodate growing data volumes and complexity.
Cost Efficiency: Optimized data storage and processing can reduce costs.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.

Key Characteristics:
- Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
- Data is ingested as-is, without any cleaning or transformation
- High volume and velocity
- Data retention policies are crucial
Benefits:
- Preserves original data for potential future analysis
- Enables data reprocessing
- Supports data lineage and auditability

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Raw Zone Diagram

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.

Key Characteristics:
- Data is cleansed and standardized
- Basic transformations are applied (e.g., data type conversions, null handling)
- Data is structured into tables or views
- Data quality checks are implemented
- Data retention policies may be shorter than the Raw Zone
Benefits:
- Improves data quality and consistency
- Provides a foundation for further analysis
- Enables data exploration and discovery

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Bronze Zone Diagram

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

Key Characteristics:
- Data is cleansed, standardized, and enriched
- Data is structured for analytical purposes (e.g., normalized, de-normalized)
- Data is optimized for query performance (e.g., partitioning, indexing)
- Data is aggregated and summarized for specific use cases
Benefits:
- Improved query performance
- Supports self-service analytics
- Enables advanced analytics and machine learning
- Reduces query costs

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Silver Zone Diagram

The Gold Zone: Your Data's Final Destination

Definition: The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
Key Characteristics:
- Data is highly refined, aggregated, and optimized for specific use cases
- Data is often materialized for performance
- Data is subject to rigorous quality checks and validation
- Data is secured and governed
Benefits:
- Enables rapid insights and decision-making
- Supports self-service analytics and reporting
- Provides a foundation for advanced analytics and machine learning
- Reduces query latency

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Gold Zone Diagram

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

Key Characteristics:
- Data is accessible and easily consumable
- Supports various analytical tools and platforms (BI, ML, data science)
- Enables self-service analytics
- Drives business decisions and actions
Examples of Consumption Tools:
- Business Intelligence (BI) tools (Looker, Tableau, Power BI)
- Data science platforms (Python, R, SQL)
- Machine learning platforms (TensorFlow, PyTorch)
- Advanced analytics tools

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Data governance is the framework that defines how data is managed within an organization, while data management is the operational execution of those policies. Data Governance is essential for ensuring data quality, consistency, and security.

Key components of data governance include:

Data Lineage: Tracking data's journey from source to consumption.
Data Ownership: Defining who is responsible for data accuracy and usage.
Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

The data transformation phase is a critical stage in a data warehouse project. This phase involves several key steps, including data extraction, cleaning, loading, data type casting, use of naming conventions, and implementing incremental loads to continuously insert the new information since the last update via batch processes.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Data Governance : Metadata

Assigns the owner, steward and responsibilities of the data.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Governance Metadata

Summary: Leverage Medallion Architecture for Success

Key Benefits:
- Improved data quality
- Enhanced governance
- Accelerated insights
- Scalability
- Cost Efficiency.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

Upcoming Talks:

Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.

This presentation is based on the book, Data Engineering Process Fundamentals, which provides a more comprehensive guide to the topics we'll cover. You can find all the sample code and datasets used in this presentation on our popular GitHub repository Introduction to Data Engineering Process Fundamentals.

Thanks for reading! 😊 If you enjoyed this post and would like to stay updated with our latest content, don’t forget to follow us. Join our community and be the first to know about new articles, exclusive insights, and more!

👍 Originally published by ozkary.com

11/19/25

From Raw Data to Governance: Refining Data with the Medallion Architecture Nov 2025

Oscar Garcia @ozkary 11/19/2025 data , data analysis , data engineering , data lake , data warehouse , python No comments

Overview

From Raw Data to Governance: Refining Data with the Medallion Architecture

Follow this GitHub repo during the presentation: (Give it a star)

👉 https://github.com/ozkary/data-engineering-mta-turnstile

Read more information on my blog at:

👉 https://www.ozkary.com/2023/03/data-engineering-process-fundamentals.html

YouTube Video

Video Agenda

Introduction to Medallion Architecture
- Defining Medallion Architecture
- Core Principles
- Benefits of Medallion Architecture
The Raw Zone
- Understanding the purpose of the Raw Zone
- Best practices for data ingestion and storage
The Bronze Zone
- Data transformation and cleansing
- Creating a foundation for analysis
The Silver Zone
- Data optimization and summarization
- Preparing data for consumption
The Gold Zone
- Curated data for insights and action
- Enabling self-service analytics
Empowering Insights
- Data-driven decision-making
- Accelerated Insights
Data Governance
- Importance of data governance in Medallion Architecture
- Implementing data ownership and stewardship
- Ensuring data quality and security

Why Attend:

Presentation

Introducing Medallion Architecture

Medallion architecture is a data management approach that organizes data into distinct layers based on its quality and processing level.

Improved Data Quality: By separating data into different zones, you can focus on data quality at each stage.
Enhanced Data Governance: Clear data ownership and lineage improve data trustworthiness.
Accelerated Insights: Optimized data in the Silver and Gold zones enables faster query performance.
Scalability: The layered approach can accommodate growing data volumes and complexity.
Cost Efficiency: Optimized data storage and processing can reduce costs.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Design Diagram

The Raw Zone: Foundation of Your Data Lake

The Raw Zone is the initial landing place for raw, unprocessed data. It serves as a historical archive of your data sources.

Key Characteristics:
- Unstructured or semi-structured format (e.g., CSV, JSON, Parquet)
- Data is ingested as-is, without any cleaning or transformation
- High volume and velocity
- Data retention policies are crucial
Benefits:
- Preserves original data for potential future analysis
- Enables data reprocessing
- Supports data lineage and auditability

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Raw Zone Diagram

The Bronze Zone: Transforming Raw Data

The Bronze Zone is where raw data undergoes initial cleaning, structuring, and transformation. It serves as a staging area for data before moving to the Silver Zone.

Key Characteristics:
- Data is cleansed and standardized
- Basic transformations are applied (e.g., data type conversions, null handling)
- Data is structured into tables or views
- Data quality checks are implemented
- Data retention policies may be shorter than the Raw Zone
Benefits:
- Improves data quality and consistency
- Provides a foundation for further analysis
- Enables data exploration and discovery

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Bronze Zone Diagram

The Silver Zone: A Foundation for Insights

The Silver Zone houses data that has been further refined, aggregated, and optimized for specific use cases. It serves as a bridge between the raw data and the final curated datasets.

Key Characteristics:
- Data is cleansed, standardized, and enriched
- Data is structured for analytical purposes (e.g., normalized, de-normalized)
- Data is optimized for query performance (e.g., partitioning, indexing)
- Data is aggregated and summarized for specific use cases
Benefits:
- Improved query performance
- Supports self-service analytics
- Enables advanced analytics and machine learning
- Reduces query costs

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Silver Zone Diagram

The Gold Zone: Your Data's Final Destination

Definition: The Gold Zone contains the final, curated datasets ready for consumption by business users and applications. It is the pinnacle of data transformation and optimization.
Key Characteristics:
- Data is highly refined, aggregated, and optimized for specific use cases
- Data is often materialized for performance
- Data is subject to rigorous quality checks and validation
- Data is secured and governed
Benefits:
- Enables rapid insights and decision-making
- Supports self-service analytics and reporting
- Provides a foundation for advanced analytics and machine learning
- Reduces query latency

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Gold Zone Diagram

The Gold Zone: Empowering Insights and Actions

The Gold Zone is the final destination for data, providing a foundation for insights, analysis, and action. It houses curated, optimized datasets ready for consumption.

Key Characteristics:
- Data is accessible and easily consumable
- Supports various analytical tools and platforms (BI, ML, data science)
- Enables self-service analytics
- Drives business decisions and actions
Examples of Consumption Tools:
- Business Intelligence (BI) tools (Looker, Tableau, Power BI)
- Data science platforms (Python, R, SQL)
- Machine learning platforms (TensorFlow, PyTorch)
- Advanced analytics tools

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Analysis Diagram

Data Governance: The Cornerstone of Data Management

Key components of data governance include:

Data Lineage: Tracking data's journey from source to consumption.
Data Ownership: Defining who is responsible for data accuracy and usage.
Data Stewardship: Managing data on a day-to-day basis, ensuring quality and compliance.
Data Security: Protecting data from unauthorized access, use, disclosure, disruption, modification, or destruction.
Compliance: Adhering to industry regulations (e.g., GDPR, CCPA, HIPAA) and internal policies.

By establishing clear roles, responsibilities, and data lineage, organizations can build trust in their data, improve decision-making, and mitigate risks.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Data Governance

Data Transformation and Incremental Strategy

From Raw Data to Governance: Refining Data with the Medallion Architecture - Data transformation lineage

Data Lineage: Tracks the flow of data from its origin to its destination, including all the intermediate processes and transformations that it undergoes.

Data Governance : Metadata

Assigns the owner, steward and responsibilities of the data.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Governance Metadata

Summary: Leverage Medallion Architecture for Success

Key Benefits:
- Improved data quality
- Enhanced governance
- Accelerated insights
- Scalability
- Cost Efficiency.

From Raw Data to Governance: Refining Data with the Medallion Architecture - Medallion Architecture Diagram

We've covered a lot today, but this is just the beginning!

Upcoming Talks:

👍 Originally published by ozkary.com

10/29/25

From Raw Data to Analytics: The Modern Data Layer Architecture

Oscar Garcia @ozkary 10/29/2025 cloud-engineering , code , data analysis , data lake , data warehouse , python , vscode No comments

Overview

This presentation is part of the Data Engineering Process Fundamentals series, focusing on the essential architectural components—the Data Lake and the Data Warehouse—and defining their respective roles in a modern analytics ecosystem.

From Raw Data to Analytics: The Modern Data Layer Architecture

Follow this GitHub repo during the presentation: (Star the project to follow and get updates)

👉 GitHub Repo

Data engineering Series:

👉 Blog Series

YouTube Video

Video Agenda

Agenda:

Introduction to Data Engineering:
Brief overview of the data engineering landscape and its critical role in modern data-driven organizations.
Operational Data
Understanding Data Lakes:
Explanation of what a data lake is and its purpose in storing vast amounts of raw and unstructured data.
Exploring Data Warehouses:
Definition of data warehouses and their role in storing structured, processed, and business-ready data.
Comparing Data Lakes and Data Warehouses:
Comparative analysis of data lakes and data warehouses, highlighting their strengths and weaknesses.
Discussing when to use each based on specific use cases and business needs.
Integration and Data Pipelines:
Insight into the seamless integration of data lakes and data warehouses within a data engineering pipeline.
Code walkthrough showcasing data movement and transformation between these two crucial components.
Real-world Use Cases:
Presentation of real-world use cases where effective use of data lakes and data warehouses led to actionable insights and business success.
Hands-on demonstration using Python, Jupyter Notebook and SQL to solidify the concepts discussed, providing attendees with practical insights and skills.
Q&A and Hands-on Session:
An interactive Q&A session to address any queries.

Conclusion:

This session aims to equip attendees with a strong foundation in data engineering, focusing on the pivotal role of data lakes and data warehouses. By the end of this presentation, participants will grasp how to effectively utilize these tools, enabling them to design efficient data solutions and drive informed business decisions.

This presentation will be accompanied by live code demonstrations and interactive discussions, ensuring attendees gain practical knowledge and valuable insights into the dynamic world of data engineering.

Supporting Materials Reminder

Subsequent Sessions: Join us for future sessions in our Data Engineering Process Fundamentals series, where we will build a data pipeline and delve deeper into topics like orchestration and governance.

Resources: This presentation is based on the book, Data Engineering Process Fundamentals, and all supporting code and examples are available on our popular GitHub repository.

Presentation

Data Engineering Overview

A Data Engineering Process involves executing steps to understand the problem, scope, design, and architecture for creating a solution. This enables ongoing big data analysis using analytical and visualization tools.

Topics

Data Lake and Data Warehouse
Discovery and Data Analysis
Design and Infrastructure Planning
Data Lake - Pipeline and Orchestration
Data Warehouse - Design and Implementation
Analysis and Visualization

Follow this project: Give a star

👉 Data Engineering Process Fundamentals

Operational Data

Operational data is often generated by applications, and it is stored in transactional relational databases like SQL Server, Oracle and NoSQL (JSON) databases like MongoDB, Firebase. This is the data that is created after an application saves a user transaction like contact information, a purchase or other activities that are available from the application.

Features:

Application support and transactions
Relational data structure and SQL or document structure NoSQL
Small queries for case analysis

Not Best For:

Reporting system
Large queries
Centralized Big Data system

Data Engineering Process Fundamentals - Operational Data

Data Lake - Analytical Data Staging

A Data Lake is an optimized storage system for Big Data scenarios. The primary function is to store the data in its raw format without any transformation. Analytical data is the transaction data that has been extracted from a source system via a data pipeline as part of the staging data process.

Features:

Store the data in its raw format without any transformation
This can include structure data like CSV files, unstructured data like JSON and XML documents, or column-base data like parquet files
Low Cost for massive storage power
Not Designed for querying or data analysis
It is used as external tables by most systems

Data Engineering Process Fundamentals - Analytical Data staging

Data Warehouse - Analytical Data

A Data Warehouse is a centralized storage system that stores integrated data from multiple sources. The system is designed to host and serve Big Data scenarios with lower operational cost than transaction databases, but higher costs than a Data Lake. This system host the Analytical Data that has been processed and is ready for analytical purposes.

Data Warehouse Features:

Stores historical data in relational tables with an optimized schema, which enables the data analysis process
Provides SQL support to query the data
It can integrate external resources like CSV and parquet files that are stored on Data Lakes as external tables
The system is designed to host and serve Big Data scenarios. It is not meant to be used as a transactional system
Storage is more expensive
Offloads archived data to Data Lakes

Data Engineering Process Fundamentals - Analytical Data Store

Discovery - Data Analysis

During the discovery phase of a Data Engineering Process, we look to identify and clearly document a problem statement, which helps us have an understanding of what we are trying to solve. We also look at our analytical approach to make observations about at the data, its structure and source. This leads us into defining the requirements for the project, so we can define the scope, design and architecture of the solution.

Download sample data files
Run experiments to make observations
Write Python scripts using VS Code or Jupyter Notebooks
Transform the data with Pandas
Make charts with Plotly
Document the requirements

Data Engineering Process Fundamentals - Data Analysis and discovery

Design and Planning

The design and planning phase of a data engineering project is crucial for laying out the foundation of a successful system. It involves defining the system architecture, designing data pipelines, implementing source control practices, ensuring continuous integration and deployment (CI/CD), and leveraging tools like Docker and Terraform for infrastructure automation.

Use GitHub for code repo and for CI/CD actions
Use Terraform is an Infrastructure as Code (IaC) tool that enables us to manage cloud resources across multiple cloud providers
Use Docker containers to run the code and manage its dependencies

Data Engineering Process Fundamentals - Design and Planning

Data Lake - Pipeline and Orchestration

A data pipeline is basically a workflow of tasks that can be executed in Docker containers. The execution, scheduling, managing and monitoring of the pipeline is referred to as orchestration. In order to support the operations of the pipeline and its orchestration, we need to provision a VM and data lake, and monitor cloud resources.

This can be code-centric, leveraging languages like Python
Or a low-code approach, utilizing tools such as Azure Data Factory, which provides a turn-key solution
Monitor services enable us to track telemetry data
Docker Hub, GitHub can be used for the CI/CD process

Data Engineering Process Fundamentals - Data Lake - Data Pipeline and Orchestration

Data Warehouse - Design and Implementation

In the design phase, we lay the groundwork by defining the database system, schema model, and technology stack required to support the data warehouse’s implementation and operations. In the implementation phase, we focus on converting conceptual data models into a functional system. By creating concrete structures like dimension and fact tables and performing data transformation tasks, including data cleansing, integration, and scheduled batch loading, we ensure that raw data is processed and unified for analysis. Create a repeatable and extendable process.

Data Engineering Process Fundamentals - Data Warehouse Design and Implementation

Data Warehouse - Data Analysis

Data analysis is the practice of exploring data and understanding its meaning. It involves activities that can help us achieve a specific goal, such as identifying data dimensions and measures, as well as data analysis to identify outliers, trends, and distributions.

We can accomplish these activities by writing code using Python and Pandas, SQL, Visual Studio Code or Jupyter Notebooks.
What's more, we can use libraries, such as Plotly, to generate some visuals to further analyze data and create prototypes.

Data Engineering Process Fundamentals - Data Analysis

Data Analysis and Visualization

Data visualization is a powerful tool that takes the insights derived from data analysis and presents them in a visual format. While tables with numbers on a report provide raw information, visualizations allow us to grasp complex relationships and trends at a glance.

Dashboards, in particular, bring together various visual components like charts, graphs, and scorecards into a unified interface that can help us tell a story
Use tools like PowerBI, Looker, Tableau to model the data and create enterprise level visualizations

Data Engineering Process Fundamentals - Data Visualization

Conclusion

Both data lakes and data warehouses are essential components of a data engineering project. The primary function of a data lake is to store large amounts of operational data in its raw format, serving as a staging area for analytical processes. In contrast, a data warehouse acts as a centralized repository for information, enabling engineers to transform, process, and store extensive data. This allows the analytical team to utilize coding languages like Python and tools such as Jupyter Notebooks, as well as low-code platforms like Looker Studio and Power BI, to create enterprise-quality dashboards for the organization.

Upcoming Talks:

👍 Originally published by ozkary.com

I am Oscar Garcia, OzkaryTM. I author this site, speak at conferences and events, contribute to OSS, mentor people. I use this blog to post ideas and experiences about software development, with the goal to both learn from and help the technology communities around the world.

2/25/26

Overview

🚀 Featured Open Source Projects

🏗️ Data Engineering

🤖 Artificial Intelligence

📉 Machine Learning

YouTube Video

Video Agenda

Presentation

SETTING THE STAGE

THE INTELLIGENCE ENGINE

CONFIGURATION: GLOBAL GOVERNANCE

DEV EXPERIENCE: THE SILENT ENFORCER

CONFIGURATION: CONTEXTUAL GUARDRAILS

DEV EXPERIENCE: REAL-TIME INTERVENTION

CONFIGURATION: THE TOOLING

DEV EXPERIENCE: THE SCAFFOLDING

THE RESULT: GENERATED ARCHITECTURE

CONFIGURATION: THE AUDITOR AGENT

DEV EXPERIENCE: THE CODE REVIEW

THE AUTONOMY ADVANTAGE

SUMMARY & AGENT MAPPING

🌟 Let's Connect & Build Together

1/21/26

Overview

🚀 Featured Open Source Projects

🏗️ Data Engineering

🤖 Artificial Intelligence

📉 Machine Learning

YouTube Video

Video Agenda

Presentation

Phase 1: The Zero-ETL Strategy

INFRASTRUCTURE: DATA STAYS LOCAL

UNMATCHED STORAGE EFFICIENCY

Phase 2: The Metadata Mess

CHALLENGES OF UNIFICATION

Phase 3: The AI Solution

BIGQUERY STUDIO: THE AI INTERFACE

AI-ASSISTED SCHEMA DISCOVERY

AI-ABSTRACTION: THE VIEW LAYER

COGNITIVE UNIFICATION

SCALING TO CHINA FACTORY

Manufacturing SPC & Root Cause Analysis

Control Chart Visualization

ADVANTAGE COMPARISON MATRIX

FINAL THOUGHTS: STRATEGIC SUMMARY

We've covered a lot today, but this is just the beginning!

📅 Upcoming Sessions

🌟 Let's Connect & Build Together

12/10/25

Overview

YouTube Video - Dec 2025

Video Agenda

Presentation

Introducing Medallion Architecture

The Raw Zone: Foundation of Your Data Lake

The Bronze Zone: Transforming Raw Data

The Silver Zone: A Foundation for Insights

The Gold Zone: Your Data's Final Destination

The Gold Zone: Empowering Insights and Actions

Data Governance: The Cornerstone of Data Management

Data Transformation and Incremental Strategy

Data Governance : Metadata

Summary: Leverage Medallion Architecture for Success

We've covered a lot today, but this is just the beginning!

11/19/25

Overview

YouTube Video

Video Agenda

Presentation

Introducing Medallion Architecture

The Raw Zone: Foundation of Your Data Lake

The Bronze Zone: Transforming Raw Data

The Silver Zone: A Foundation for Insights

The Gold Zone: Your Data's Final Destination

The Gold Zone: Empowering Insights and Actions

Data Governance: The Cornerstone of Data Management

Data Transformation and Incremental Strategy

I am Oscar Garcia, Ozkary^TM. I author this site, speak at conferences and events, contribute to OSS, mentor people. I use this blog to post ideas and experiences about software development, with the goal to both learn from and help the technology communities around the world.