Modern Data Architecture with Sentinel data lake

Nov 7, 2025

When discussing Microsoft Sentinel data lake, the narrative centers on immediate value: cheaper ingestion, long-term storage, and historical correlation. These benefits are real, but they obscure something far more interesting.

Sentinel data lake with KQL Jobs and Notebooks transforms how SIEM data flows. It can replace the traditional ETL (Extract, Transform, Load) pipeline model with modern ELT (or hybrid) data pipelines. Engineers no longer have to predict what logs and fields will matter two years ahead - the data architecture adapts as needs evolve.

The security operations world lags behind other disciplines in maturity. Most teams still deploy SIEMs and detections manually, without automation, versioning, or testing. SOAR solutions lack standardization and reusability. And most SIEM platforms do not allow advanced data architectures like data lakehouses. The common practices exist elsewhere, but they reach the security operations field late. Sentinel data lake brings Sentinel SIEM into a world in which proper data architecture can be applied.

This is a high-level post that explores common data architecture practices from big data platforms and how they could be used in Sentinel data lake, as the SIEM world seemingly moves toward these modern approaches.

Why Traditional SIEMs Remain Locked in ETL Architecture

ETL is the traditional SIEM pipeline approach: data is gathered (Extract), transformed (parsed, normalized, filtered), then loaded into the SIEM. But ETL isn’t a choice - it’s a consequence of SIEM architectural constraints.

The fundamental constraints:

  • Single-tier architecture: Traditional SIEMs have one storage tier with nowhere to store raw and overlapping data cheaply. You must decide upfront what to keep, making ingestion a financial decision, not a technical one.

  • Detection engines require pre-parsed data: Some SIEM detection engines only operate on already-extracted fields. Post-ingestion parsing and retransformation either aren’t supported by the engine or the SIEM, or they are inefficient and cause timeout failures.

Result: SIEMs become optimization puzzles where “What can we afford to see?” replaces “What do we need to see?” This blindness hurts in a fast-evolving field - nobody can predict what data becomes essential in two years.

Sentinel data lake solves this with multi-tiered architecture. Detection-quality data flows into Sentinel SIEM while raw, untransformed data lives cheaply in the data lake as the source of truth. Ingest everything, transform deliberately, detect strategically - your architecture evolves with threats instead of locking you into today’s expectations.

Sentinel data lake: Pipeline capabilities

Sentinel SIEM is fundamentally ETL-based - transformation happens before ingestion due to high analytics-tier ingestion costs. Query-time parsing is supported, but it doesn’t change the stored data; it is only a view.
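For illustration, a minimal sketch of query-time parsing against the built-in Syslog table - the regex and the assumption that SyslogMessage carries a src= token are purely illustrative:

```kql
// Query-time parsing: fields are extracted while the query runs,
// but the records stored in the table stay unchanged.
Syslog
| where TimeGenerated > ago(1d)
| extend SrcIp = extract(@"src=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", 1, SyslogMessage)
| project TimeGenerated, Computer, SrcIp, SyslogMessage
```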

Sentinel data lake’s cheaper storage and advanced pipeline capabilities enable store-first, transform-later (ELT) approaches - fundamentally different from traditional SIEM ingestion models.

  1. Cheaper data lake storage
    • Ingesting data directly into data lake is approximately 95% cheaper than analytics tier with Pay-as-you-Go pricing.
    • This solves the “What can we afford to keep?” problem. You keep all logs and all fields without financial pressure to filter, eliminating the need to predict what you’ll need two years from now.
    • Raw data ingestion into cheap storage eliminates parsing-driven data loss. Bugs in ETL parsers (incorrect DCR transforms, DCR bugs) can permanently destroy data before it’s stored. Raw data load without DCR transformation does not have this problem.
  2. KQL Jobs and Notebooks: Batch processing
    • KQL Jobs execute KQL queries on data lake data, enabling flexible transformation: filtering, parsing, normalizing (see the sketch after this list).
    • Notebooks execute Python code to process data lake data at scale, handling complex transformations beyond pure KQL. While Microsoft promotes them as a threat hunting tool for the data lake, Notebooks serve as core ELT pipeline components for data parsing, enrichment, and transformation workflows - a practice refined in big data platforms like Microsoft Fabric for years.
    • Both can run scheduled or on-demand, enabling you to process new data continuously (scheduled) and retroactively reprocess older data when detection strategies or threat models evolve.
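As a concrete illustration, here is a sketch of what a KQL job’s query might look like, assuming a hypothetical raw table FirewallRaw_CL with a RawData string column (names and regexes are illustrative, not a specific connector’s schema). A scheduled run processes the newest window; widening the time filter lets the same query retroactively reprocess older data:

```kql
// Parse and normalize raw firewall events from the lake.
FirewallRaw_CL
| where TimeGenerated > ago(1h)          // scheduled run: newest window only
| extend SrcIp = extract(@"SRC=(\S+)", 1, RawData),
         DstIp = extract(@"DST=(\S+)", 1, RawData)
| where isnotempty(SrcIp)                // Bronze still holds the raw originals
| project TimeGenerated, Computer, SrcIp, DstIp
```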

These capabilities allow you to treat raw data as an immutable source of truth and introduce additional tables for active use.

This enables multi-tiered data storage architectures like the Medallion model, which defines three distinct storage tiers: Bronze, Silver, and Gold.

  1. Bronze stores raw, immutable data and acts as the long-term source of truth.
  2. Silver holds cleaned, deduplicated, and normalized datasets that the pipeline produces from Bronze.
  3. Gold holds curated, aggregated, enriched tables formatted for specific teams and use cases.

[Figure: Medallion model with Sentinel SIEM and data lake - a typical multi-tiered data storage implementation]

Sentinel DCRs’ advanced capabilities often eliminate the need for complex pipelines, enabling simpler approaches - like pushing raw data to Bronze while simultaneously ingesting normalized, filtered data directly from the DCR to the Silver/Gold tiers (sketched below the figure). But understanding the advanced capabilities and preparing for future needs remains important.

[Figure: A simplified setup with direct DCR ingestion]
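A rough sketch of the transformKql statements such a DCR could use, with one data flow per destination (the column names are illustrative assumptions):

```kql
// Data flow 1 -> data lake tier (Bronze): pass records through unchanged.
source

// Data flow 2 -> analytics tier (Silver/Gold): filter and trim fields.
source
| where Severity in ("High", "Critical")
| project TimeGenerated, Computer, Severity, Message
```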

Multi-tiered Storage Options with Sentinel data lake

Unlike in traditional big data platforms, established SIEM best practices around tier separation don’t exist. In data platforms, Gold contains aggregated business datasets. But in security, the most aggregated outputs are detections and incidents, which don’t warrant a separate tier.

Traditional SIEMs also serve primarily the SOC. The Gold tier’s promise of “different views for different audiences” has limited application when a single team needs comprehensive visibility across all data.

In the next sections, we’ll explore several tier separation strategies you should consider when architecting Sentinel data lake as an advanced, multi-tiered security platform.

Keep in mind, there’s no one-size-fits-all solution. Some organizations might skip Silver-Gold separation entirely. Others could use three tiers for completely different purposes.

1. Scenario: Dual-Tiered Cost-Optimization Setup

You don’t necessarily need all three tiers. In traditional SOC environments, a dual-tiered setup frequently makes more sense, and for organizations starting fresh, beginning with two tiers is often the better choice.

Bronze Tier: Immutable Source of Truth

All logs are ingested into the Sentinel data lake in their native format - nothing is lost to parsing errors or cost-driven discards. Data Collection Rules can still redact sensitive information without risky modifications. Bronze stores complete, unmodified data as the forensic foundation and compliance archive.
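For example, a minimal redaction inside a DCR transformation might look like the following - the field name is an illustrative assumption:

```kql
// DCR transformKql: blank out a sensitive field before the record
// lands in Bronze; everything else passes through unchanged.
source
| extend ClientIP = "REDACTED"
```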

Bronze is the same in every setup, so it won’t be mentioned again.

Silver Tier/Gold Tier: Operations and Detection

A single unified tier handles both detections and analyst queries. Data from Bronze flows through selective filtering and aggregation (via KQL Jobs or Notebooks) before reingestion into the analytics tier for real-time detection and SOC analysis. Parse based on detection requirements: parsed for cost savings if detections accept that format, unparsed if they don’t.
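A sketch of such a filtering job, reusing the hypothetical FirewallRaw_CL table from earlier (event markers and field names are assumptions):

```kql
// Reduce Bronze data before reingestion into the analytics tier:
// keep only detection-relevant events and drop the raw payload
// when detections tolerate the parsed format.
FirewallRaw_CL
| where TimeGenerated > ago(1h)
| where RawData has_any ("DENY", "DROP")
| extend Action = extract(@"ACTION=(\S+)", 1, RawData)
| project TimeGenerated, Computer, Action
```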

This parsed, filtered data is typically enough for SOC investigation. In rare cases, teams access Bronze logs for additional context. Other teams, like Threat Hunting or Forensics, might access the Bronze tier more frequently.

This is a similar usage model to what Microsoft suggests, but with KQL Job- and Notebook-based pipelines as an addition.

Summary
| Medallion Tier | Table Tier | Primary Users | Secondary Users | Retention | Cost Elements |
| --- | --- | --- | --- | --- | --- |
| Bronze | Data lake | transformation pipelines, traditional machine learning | forensic investigators, threat hunters, auditors | Maximum required | Full ingestion; Maximum retention (all data); Data usage |
| Silver/Gold | Analytics | detections, security teams (SOC, DFIR, Threat Hunters) | - | Maximum required | Filtered ingestion; Maximum retention (filtered data) |

Data retention pricing in Sentinel data lake assumes 6x compression, making long-term storage roughly 6x cheaper than before the data lake was available. Even if you store data twice across the two tiers, you still achieve 3x+ cost savings versus pre-data lake architectures: two copies at one-sixth the old rate cost about a third of one copy at the old rate. For most organizations, data retention is a negligible cost element.

Setting both tiers to ‘Maximum required’ retention is often better from a management perspective with minimal cost impact. You can still configure different retention policies for individual tables within each tier if needed.

This is a good setup if:

  • You prefer a more traditional setup that can be easily extended in the future
  • DFIR teams prefer working on raw, unparsed logs - good choice if DFIR engagements are infrequent
  • Your aggregation/summarization serves other teams effectively - heavily summarized data may not fit all team workflows

2. Scenario: Full Three-Tiered Setup with Dual Data Lake Tiers

This three-tier approach separates concerns: Silver stores parsed, normalized data without filtering or aggregation in the data lake, balancing cost and efficiency, while Gold contains filtered, aggregated data in the analytics tier for real-time operations.

Silver-tier ingestion into the data lake is cheap enough that aggressive filtering isn’t necessary, yet storing parsed, deduplicated data in dedicated tables (like data model-based structures) makes subsequent queries more efficient - directly reducing query costs for further parsing or manual investigation.

Silver Tier: Optimized Supporting Tables

Fully parsed, deduplicated data without filtering or aggregation, optimized for efficient further processing and occasional queries. When looking for data that isn’t in the Gold tier, teams don’t have to rely on the inefficient Bronze tier; they can rely on this low-cost, query-efficient tier. Because it is not filtered, it can satisfy most needs.

Organize logs into dedicated Silver tables by data model or log source for efficiency, but ensure you can generate Gold tier data from these specialized Silver tables for detection rules to process (a sketch follows below).
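A sketch of a Bronze-to-Silver job along these lines, assuming a hypothetical DnsRaw_CL table whose RawData column carries JSON (all names are illustrative):

```kql
// Normalize raw DNS events into a data-model-style Silver table and
// crudely deduplicate within the processed window.
DnsRaw_CL
| where TimeGenerated > ago(1h)
| extend Payload = parse_json(RawData)
| extend QueryName = tostring(Payload.query),
         ClientIp  = tostring(Payload.client_ip)
| summarize arg_min(TimeGenerated, *) by QueryName, ClientIp  // keep first occurrence
| project TimeGenerated, ClientIp, QueryName
```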

Since Bronze retains all data long-term (Maximum required), Silver can apply shorter retention policies if retention costs matter. Typically unnecessary given the low cost of retention, but an option if needed.

Gold Tier: Operations and Detection

Heavily aggregated, filtered data optimized for fast real-time querying, detections, and investigation. Only detection- and SOC-related logs belong here.

If Silver uses shorter retention, keep Gold at Maximum required. Otherwise, apply retention selectively: ‘Maximum required’ for critical data only, such as alerts and incidents.
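A sketch of a Silver-to-Gold aggregation job, assuming a hypothetical SigninParsed_CL Silver table (names are illustrative):

```kql
// Aggregate parsed sign-in events into an hourly summary sized for
// real-time detections in the analytics tier.
SigninParsed_CL
| where Result == "failure"
| summarize FailedLogons = count(), DistinctSources = dcount(ClientIp)
    by User, bin(TimeGenerated, 1h)
```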

[Figure: A possible implementation of this scenario]

Summary

This hybrid approach is effective when heavy filtering means SOC teams occasionally query the Silver tier for investigation detail, while threat hunters access older Silver data cost-efficiently without querying raw Bronze datasets.

You can forward less data to the analytics tier (Gold), since Silver provides efficient data lake lookups. So unlike in the previous scenario, manually querying the Bronze tier is avoidable here. This reduces both analytics ingestion and Data query/Advanced Data Insights charges.

However, it increases data lake ingestion and retention costs, because we store similar data both in Bronze (raw) and Silver (parsed).

| Medallion Tier | Table Tier | Primary Users | Secondary Users | Retention | Cost Elements |
| --- | --- | --- | --- | --- | --- |
| Bronze | Data lake | transformation pipelines, traditional machine learning | auditors | Maximum required | Full ingestion; Maximum retention (all data); Data usage |
| Silver | Data lake | transformation pipelines, LLM/AI usage | security teams (SOC, DFIR, Threat Hunters) | Maximum required | Parsed ingestion; Maximum retention (parsed data); Data usage |
| Gold | Analytics | human teams | - | Maximum required, OR: critical data Maximum required, other data 90 days | Filtered ingestion; Maximum or per-table retention (filtered data) |

Best setup

Each scenario differs subtly; pick the one matching your team’s workflow: lookback windows, data needs, and historical data access frequency. You don’t need complex pipelines immediately, but having team members who understand data architecture helps as Sentinel evolves.

Sentinel data lake is still improving, with new features coming from Microsoft. Limitations you face today may disappear in the coming months, so revisit your design choices as the platform matures.

Sentinel Data Collection Rules handle filtering and transformations at ingestion, often eliminating the need for complex pipelines. For most teams, ingesting all raw data into the Bronze tier while sending filtered data directly from the DCR to your Silver/Gold tables is good enough. Use DCRs when:

  1. Logs need immediate availability without pipeline delays
  2. Filtering and transformations are straightforward (DCRs have limited capabilities versus KQL Jobs)
  3. Data aggregation isn’t required (DCRs process events individually, no summarization)

For heavy parsing, complex logic, or summarization, KQL Jobs and Notebooks are necessary, but for many teams, DCR filtering is sufficient and faster.

Limitations of KQL Jobs and Notebooks

Both tools have practical drawbacks:

  • Processing delay: Added latency versus DCR-based ingestion filtering
  • Data lake table creation: They don’t natively output to data lake tables - the workaround requires creating Analytics tables first, then converting them to the data lake type (contrary to what Microsoft’s documentation says)
  • Default table restrictions: They can’t write directly to default tables, requiring detection modifications (see the sketch after this list)
  • Stability issues: Recent bugs have affected both tools, so test thoroughly before production deployment.
  • Data safety: Custom tables created by KQL Jobs and Notebooks lack protections against removal - deletion is permanent, unlike standard tables with brief recovery windows
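For the default-table restriction, the adjustment can be as small as pointing the detection at the custom table - the following assumes an illustrative SyslogFiltered_CL table that keeps the Syslog schema:

```kql
// Detection adjusted to read from both the default table and the
// custom table a KQL job populates.
union isfuzzy=true Syslog, SyslogFiltered_CL
| where ProcessName == "sshd" and SyslogMessage has "Failed password"
```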

Start simple with DCRs and the Bronze tier. Use your Bronze tier as the source of truth. As your team matures and understands your data patterns, gradually introduce pipelines and additional tiers. Sentinel data lake’s flexibility means your architecture can evolve without costly rework.