In today’s big data landscape, establishing proper data architecture is essential before you begin collecting information. As data generation continues to accelerate, making informed decisions about what to store, where to store it, and in what format becomes increasingly critical. In the age of AI, having accessible, well-structured data is even more important—and since you can’t predict what the future holds, it’s crucial to future-proof your data collection strategies.
The challenge isn’t just about volume anymore; it’s about creating a foundation that can adapt to emerging AI technologies, evolving analytical needs, and unforeseen use cases. A robust data model serves as the blueprint for this foundation, ensuring that your data remains usable, searchable, and valuable regardless of how technology advances.
The previous article discussed the key benefits and challenges of data models. In this one, I’ll focus on recommendations for planning your data model strategies, along with some challenges and solutions involving AI.
Streamlining SIEM Data Model Deployment: Essential Supporting Components Beyond Raw Data and Parsing
When deploying a SIEM data model, organizations usually focus on the raw data format, parsers, and schema. However, adding supporting code elements can greatly improve efficiency and troubleshooting, and can enable AI-driven tooling on top of the data.
In the explanation below, I'll use the Fortinet FortiGate ASIM data model as an example.
1. Log Source Identifier Code
A log source identifier is a specialized code component that serves as a validation mechanism to determine whether a specific log source has been successfully onboarded to the SIEM platform. This can range from a simple boolean indicator to a comprehensive counter that tracks log ingestion metrics.
This is not only useful for onboarding confirmation and automated health monitoring scenarios; it can also help you find unidentified, 'unsanctioned' logs in the environment. If you are not aware of a log, you cannot successfully use it, monitor it, or optimize its cost/value.
For example, in Sentinel the Syslog table can be a collection of various log sources, which can be hard to tell apart. If you don't know which log source an event belongs to, you are potentially losing out on its value.
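As a hedged sketch, the identifier described above can be anything from a boolean onboarding flag to an ingestion counter. The Fortinet condition below mirrors the ASIM parser quoted in this article; the helper names and sample events are my own illustrative assumptions.

```python
from collections import Counter

def identify_logsource(event):
    """Tag an event with its log source; '' means unidentified."""
    # Condition taken from the Fortinet ASIM parser example
    if (event.get("DeviceVendor") == "Fortinet"
            and event.get("DeviceProduct", "").startswith("FortiGate")):
        return "Fortinet"
    return event.get("logsource", "")

def ingestion_metrics(events):
    """Counter-style identifier: how many events each source produced."""
    return Counter(identify_logsource(e) or "unidentified" for e in events)

metrics = ingestion_metrics([
    {"DeviceVendor": "Fortinet", "DeviceProduct": "FortiGate-100F"},
    {"DeviceVendor": "Cisco", "DeviceProduct": "ASA"},
])
is_onboarded = metrics["Fortinet"] > 0  # the simple boolean indicator
```

The counter doubles as a health metric: a source whose count drops to zero has silently stopped logging, while a non-empty "unidentified" bucket points at unsanctioned logs.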
Recommendation:
- If possible, tag your traffic during ingestion to ensure your logs are easy to find and identify. With some log sources this is trivial, e.g. the log source uses its own table, or a log source ID is already parsed in the logs.
Based on the Fortinet ASIM parser, for Fortinet this would be:
source | extend logsource = iif(DeviceVendor == "Fortinet" and DeviceProduct startswith "FortiGate", 'Fortinet', column_ifexists('logsource',''))
- If you cannot do this at ingestion time, then ensure that your query-time traffic identifier code is 'tagging'-based and not 'filter'-based. If you use a where filter instead of adding a tag with extend, your code will be harder to reuse later to find all the unidentified logs: it is difficult to find the events that do not match any of the listed identifiers if every identifier drops the events it does not match.
- Any identifier that requires a pre-defined list of values (e.g. hostnames, computers, process names) that are not common but specific to an environment can be difficult to manage if you are an MSSP. In this case, try to tag your traffic earlier in the ingestion pipeline (e.g. tag the traffic on the Syslog collector based on the unique port the source sends the traffic to instead of the general 514).
This can and should be done both on the raw data and on the parsed data. For Fortinet, on the raw data this would be DeviceVendor == "Fortinet" and DeviceProduct startswith "FortiGate"; on the ASIM-parsed data this would be union _ASim_AuditEvent().
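To make the tagging-vs-filtering point above concrete, here is a minimal Python sketch. The rules list and function names are assumptions; only the Fortinet match condition comes from the article. Because tagging drops nothing, the unsanctioned logs stay one filter away:

```python
# Assumed rule list: (tag, predicate) pairs, one per known log source
RULES = [
    ("Fortinet", lambda e: e.get("DeviceVendor") == "Fortinet"
                 and e.get("DeviceProduct", "").startswith("FortiGate")),
]

def tag_events(events):
    """Tagging-based: every event survives, identified or not."""
    for e in events:
        e["logsource"] = next((name for name, match in RULES if match(e)), "")
    return events

def unidentified(events):
    """Filter-based code would have silently dropped these events."""
    return [e for e in events if not e["logsource"]]

events = tag_events([
    {"DeviceVendor": "Fortinet", "DeviceProduct": "FortiGate-60E"},
    {"DeviceVendor": "SomeVendor", "DeviceProduct": "Unknown"},
])
```

With a filter-based design, the second event would simply vanish from every query; with tagging, it shows up in `unidentified()` and can be investigated or cost-optimized.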
2. Schema Identifier
The schema identifier starts from the log source identifier code and adds further categorization. It's a tagging system that not only identifies the log source but also categorizes and tags all logs belonging to a specific schema, without requiring full parsing operations.
For FortiGate it would look like this:
CommonSecurityLog
| where not(disabled)
| where DeviceVendor == "Fortinet" and DeviceProduct startswith "FortiGate"
| where (column_ifexists("DeviceEventCategory","") has "traffic" or AdditionalExtensions has "cat=traffic")
| where DeviceAction != "dns" and Activity !has "dns"
Recommendation: This is an extended version of the log source identifier code, so the recommendation is essentially the same: tag at ingestion time if you can, and use tagging in your code instead of filtering.
When all the logs in your SIEM are properly tagged by a schema identifier - e.g. the ASIM NetworkSession schema in the case of these Fortinet logs:
- You can easily tell which of your logs can be and will be used by your rules and various data model-based queries.
- You can easily tell if one of the log types you expected is not working, e.g. you enabled Fortinet logs but you cannot see any ASIM NetworkSession events.
This can and should be done both on the raw data and on the parsed data. For Fortinet, on the raw data this would be the query above; on the ASIM-parsed data this would be union _ASim_Network….
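The health check from the bullets above becomes a simple set comparison once events carry schema tags. In this hedged sketch, the schema names and toy tagging logic are illustrative, loosely following the FortiGate traffic conditions:

```python
def tag_schema(event):
    """Assign a schema tag without full parsing (illustrative logic)."""
    if event.get("logsource") == "Fortinet":
        if ("traffic" in event.get("DeviceEventCategory", "")
                and event.get("DeviceAction") != "dns"):
            return "ASimNetworkSession"
    return ""

def missing_schemas(events, expected):
    """E.g. Fortinet logs enabled, but no NetworkSession events arrive."""
    return expected - {tag_schema(e) for e in events}

events = [{"logsource": "Fortinet", "DeviceEventCategory": "traffic",
           "DeviceAction": "accept"}]
gaps = missing_schemas(events, {"ASimNetworkSession", "ASimAuditEvent"})
```

Any schema left in `gaps` is one you expected from the enabled log sources but never actually received, which is exactly the detection gap the bullets describe.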
3. Parser code
This is the part of the code that parses all the relevant fields, so the source schema is translated into the final data model schema.
This can be done at ingestion time - a frequent approach for SIEMs, because it can increase query speed. It can also be done at query time - a frequent approach with data lake solutions that utilize a schema-on-read approach; in that case the parser is implemented by the data processor entity.
For Fortinet, this is the whole query available here…
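As a hedged sketch of the query-time (schema-on-read) variant, the same parsing can be mimicked in plain Python. The key names come from the Fortinet ASIM parser discussed in this article; the helper functions and the field selection are assumptions:

```python
def parse_additional_extensions(raw):
    """Split 'key=value;key=value' pairs, like parse-kv with ';' and '='."""
    pairs = (p.split("=", 1) for p in raw.split(";") if "=" in p)
    return {k.strip(): v for k, v in pairs}

def to_data_model(event):
    """Translate a raw event into (a slice of) the data model schema."""
    kv = parse_additional_extensions(event.get("AdditionalExtensions", ""))
    return {
        # equivalent of project-rename + coalesce(EventStartTime, TimeGenerated)
        "EventStartTime": kv.get("FortinetFortiGatestart")
                          or event.get("TimeGenerated"),
        "SrcIntfRole": kv.get("FortinetFortiGatesrcintfrole"),
    }

event = {
    "AdditionalExtensions": "FortinetFortiGatestart=2024-05-01T10:00:00Z;"
                            "FortinetFortiGatesrcintfrole=lan",
    "TimeGenerated": "2024-05-01T10:00:05Z",
}
parsed = to_data_model(event)
```

This is the piece a data processor entity would own in a schema-on-read setup: the raw file stays untouched, and the translation runs only when someone reads the data.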
4. Field correlation table
While this is highly redundant data, it can be a good idea to store all the original input fields that are used, all the kept output fields, and the connections between them in a table or JSON format. All this information can be deduced from the parser, but having it in this format makes it much easier for traditional automations and human operators to process and understand the requirements.
For example, the following part of the Fortinet parser code:
| parse-kv AdditionalExtensions as (
FortinetFortiGatestart:datetime,
FortinetFortiGatesrcintfrole:string,
) with (pair_delimiter=';', kv_delimiter='=')
| project-rename EventStartTime = FortinetFortiGatestart
| extend EventStartTime = coalesce(EventStartTime, TimeGenerated)
could be stored in a table for easy processing:
Original field | New field | Details |
---|---|---|
AdditionalExtensions | FortinetFortiGatestart | extract(@"FortinetFortiGatestart=([^;]+)", 1, AdditionalExtensions) |
AdditionalExtensions | FortinetFortiGatesrcintfrole | extract(@"FortinetFortiGatesrcintfrole=([^;]+)", 1, AdditionalExtensions) |
FortinetFortiGatestart | EventStartTime | project-rename EventStartTime = FortinetFortiGatestart |
TimeGenerated | EventStartTime | coalesce(EventStartTime, TimeGenerated) |
Having the info in this format can help a lot when trying to find out which fields to keep, which ones are being utilized - at least by the parser - and how a change in the raw log can affect parsing.
Usually having just the parser is enough to deduce most of this information, but having the other pieces as well can help with further processing and use of the data.
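The same correlation data can be kept as JSON so automations (or an AI tool) can answer "which raw fields does the parser utilize?" without re-reading the KQL. In this hedged sketch the entries mirror the mapping table from this section; the helper name is an assumption:

```python
import json

FIELD_MAP = [
    {"original": "AdditionalExtensions", "new": "FortinetFortiGatestart",
     "details": "parse-kv"},
    {"original": "FortinetFortiGatestart", "new": "EventStartTime",
     "details": "project-rename"},
    {"original": "TimeGenerated", "new": "EventStartTime",
     "details": "coalesce fallback"},
]

def raw_inputs(field_map):
    """Fields read from the raw log, not produced by an earlier parse step."""
    produced = {m["new"] for m in field_map}
    return sorted({m["original"] for m in field_map} - produced)

as_json = json.dumps(FIELD_MAP, indent=2)  # portable form for other tools
```

A query like `raw_inputs()` immediately tells you which raw fields must survive any ingestion-time trimming, which is exactly the "which field to keep" question raised above.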
Stored data format
The format you store your data in defines how you can utilize it later on. For example, SIEMs typically contain lots of data in a parsed format, because it enhances query speed and decreases the chance of query timeouts. Meanwhile, more and more data lake solutions and processing tools suggest a schema-on-read approach, which can decrease ingestion delay. This also allows all readers to utilize the data however they want, in any format they prefer.
The different storage formats can make querying the data more painful, especially if parsers are not ready in the code. For example:
- You pushed important security data into your SIEM in a parsed format, only keeping data model relevant fields.
- You push firewall logs into a Storage Account blob for long-term storage in their native format.
- You push all your Windows logs into a Data Lake in Parquet format, with the log content still in its raw form.
If your tool is capable of handling the defined data model, the SIEM data won't pose any problem. But the Storage Account blobs and Data Lake logs can still be challenging.
So, what challenges can this introduce for our tool?
SIEM
Potentially no challenges. You have to teach your tool which data model you use; the query language is likely already known to it, so it can generate queries for you.
Knowledge needed:
- Data model’s final schema
Storage Account
SIEMs usually store logs as simple row-oriented files, which Storage Blobs can also support. Logs are represented as CSV entries (with a header row) or JSON entries in JSONL files, often organized by timestamp for easy access. This provides a straightforward, metadata-light way to store SIEM logs.
Even if the format is similar, having only the final schema is not enough because the logs are stored in raw, unparsed form. They require additional processing to be transformed into a structured, usable format for analysis.
So, what will you need to handle a data store like this:
- The bit of code that defines how to pick up the correct log file and determine its file format (CSV, JSONL, etc.) - though this logic is typically quite static.
- Parser code to process raw log entries
- Data model schema to define field structure and values
What you don’t need:
- The original data schema
AI can generate code by combining your parser and data model, creating processing scripts - e.g. Python code - that mimic your SIEM's query logic for these files. For best results, your tool should use both the final data model schema and the parser code.
Please be aware that you might need to implement the part of the code that defines which logfile to pick up.
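The file-pickup and format-detection glue described above can be sketched as follows. The 'vendor_YYYY-MM-DD.ext' naming scheme and all helper names are assumptions for illustration; only the CSV-with-header and JSONL formats come from this section:

```python
import csv
import io
import json

def pick_logfiles(filenames, day):
    """Typically static logic: match the files for the requested day
    (assumes an illustrative 'vendor_YYYY-MM-DD.ext' naming scheme)."""
    return [f for f in filenames if day in f]

def read_rows(name, content):
    """Return dict rows from a CSV (with header row) or JSONL payload."""
    if name.endswith(".jsonl"):
        return [json.loads(line) for line in content.splitlines() if line]
    if name.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(content)))
    raise ValueError(f"unsupported format: {name}")

files = pick_logfiles(
    ["fw_2024-05-01.jsonl", "fw_2024-05-02.jsonl"], "2024-05-02")
rows = read_rows("fw.csv", "DeviceVendor,DeviceProduct\nFortinet,FortiGate")
```

Once `read_rows` yields dicts, the same parser code used elsewhere can turn them into data model records, so this glue is the only blob-specific part you have to write.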
Data Lake Parquet format
Many big data solutions utilize columnar storage formats instead of traditional row-based formats. Unlike the typical “one log per line” approach, columnar formats group all values for each field together, optimizing data for analytics and enabling faster field-specific queries.
This setup enables better compression, faster column reads, and efficient aggregation and filtering - key for security analytics. It also lowers data query costs when tools charge based on the amount of data read.
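A pure-Python illustration (no Parquet library involved) shows why the columnar layout helps: reading one field touches a single contiguous list instead of every row. The sample events are made up:

```python
rows = [
    {"TimeGenerated": "t1", "Severity": 7, "SourceIP": "1.2.3.4"},
    {"TimeGenerated": "t2", "Severity": 3, "SourceIP": "5.6.7.8"},
]

def to_columnar(rows):
    """Group all values of each field together, Parquet-style."""
    return {field: [r[field] for r in rows] for field in rows[0]}

columns = to_columnar(rows)
# A field-specific aggregation now reads only the 'Severity' column:
max_severity = max(columns["Severity"])
```

In real Parquet files each column is additionally compressed and accompanied by statistics (min/max per chunk), which is what makes the storage-level filtering discussed next possible.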
Effective querying requires understanding your data lake’s storage configuration. Depending on partitioning, indexing, and file organization, some filters work at the storage level while others don’t. Adding ineffective filters can negate the benefits of effective ones, potentially eliminating performance gains from formats like Parquet.
Without proper knowledge of which fields filter at the data level versus code level, AI tools miss significant optimization opportunities.
Similar to Storage Blob implementations, you need:
- Final data model schema: Defines what information is stored and its location
- Parser code: Instructs tools how to generate final output from source data
- Original schema with Parquet metadata: Identifies which fields filter at storage level versus post-read processing
Here is code that reads CommonSecurityLog, applying storage-level filters for TimeGenerated and Severity, then post-read filtering for SourceIP. When data is optimized for the first two fields, only a small subset of data has to be read:
from datetime import datetime, timedelta
from pyspark.sql.functions import col

# 'spark' is the active SparkSession (e.g. in a Databricks/Synapse notebook)
df = spark.table("CommonSecurityLog")

# Filter TimeGenerated > ago(3d) and Severity > 5 (supports predicate pushdown)
three_days_ago = datetime.now() - timedelta(days=3)
df_filtered = (
    df.filter(col("TimeGenerated") > three_days_ago)
      .filter(col("Severity") > 5)
      .select("TimeGenerated", "DeviceVendor", "SourceIP", "DestinationIP", "Severity")
)

# Collect data to driver as a pandas DataFrame for further filtering on SourceIP
pandas_df = df_filtered.toPandas()

# Filter SourceIP == '1.2.3.4' in pandas
final_df = pandas_df[pandas_df["SourceIP"] == "1.2.3.4"]
In this case it is obvious that our tool has to be aware of some metadata to be able to tell that only TimeGenerated and Severity should be filtered via predicate pushdown.
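That metadata can be as small as the set of fields known to filter at the storage level (backed by partitioning or column statistics). In this hedged sketch the set itself is an assumption matching the example above, and the splitter is what a query-generating tool would consult before emitting code:

```python
# Assumed metadata: fields that filter at the storage level for this table
PUSHDOWN_FIELDS = {"TimeGenerated", "Severity"}

def split_filters(filters):
    """Split requested filters into storage-level and post-read groups."""
    pushdown = {f: v for f, v in filters.items() if f in PUSHDOWN_FIELDS}
    post_read = {f: v for f, v in filters.items() if f not in PUSHDOWN_FIELDS}
    return pushdown, post_read

pushdown, post_read = split_filters({
    "TimeGenerated": "> ago(3d)", "Severity": "> 5", "SourceIP": "== 1.2.3.4",
})
```

Everything in `pushdown` belongs in the Spark filter chain before the read; everything in `post_read` is applied afterwards, exactly as the SourceIP filter was handled in pandas above.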