BigQuery Vector Search for Log Analysis: A Security Researcher's Perspective

PUBLISHED:

August 20, 2024

BY:

Ganga Sumanth

Ideal for

Cloud Engineer

Developer

Security Engineer

Powered by Vector search and LLMs

Introduction

Let's face it: in today's cybersecurity landscape, we're drowning in data. Every server, firewall, and application is constantly spewing out logs, and buried somewhere in that digital deluge are the breadcrumbs that could lead us to the next big threat. It's like trying to find a needle in a haystack – if that haystack were growing exponentially and the needle kept changing shape.

Traditional log analysis? It's starting to feel like bringing a butter knife to a gunfight. We've all been there, eyes glazing over as we scroll through endless lines of text, knowing that somewhere in there is the clue we need, but it's just out of reach.

Enter BigQuery with its shiny new vector search capabilities. Now, I'm not usually one to get excited about database features, but this… this could be a game-changer. Imagine being able to sift through that mountain of logs not just by keywords, but by context and meaning. It's like suddenly having a metal detector in that haystack.

So, buckle up. We're about to dive into how this technology could reshape the way we approach threat detection and investigation. It's not going to solve all our problems – let's not kid ourselves – but it might just give us the edge we've been looking for in this never-ending game of tag.



Understanding BigQuery Vector Search

BigQuery, Google Cloud's fully managed, serverless data warehouse, has been a powerful tool for analyzing large datasets for many years. However, the recent addition of vector search capabilities has significantly enhanced its utility for complex analytical tasks, including log analysis in cybersecurity contexts. Vector search in BigQuery, introduced in 2023, allows for similarity searches on high-dimensional vector data. This feature enables more nuanced and efficient analysis of complex data types, including text, images, and other unstructured data that can be represented as vectors. Here's how BigQuery Vector Search works:

Vector Embedding - Data points (such as log entries) are converted into high-dimensional vectors using machine learning models. These vectors capture semantic meaning and contextual relationships within the data.
Indexing - BigQuery creates and maintains an index of these vectors, optimizing for fast similarity searches.
Similarity Search - When querying, you can find the most similar vectors to a given input vector using distance metrics like cosine similarity or Euclidean distance.
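
To make the three steps above concrete, here is a minimal Python sketch of the similarity-search part. The three-dimensional vectors and log names are made up for illustration (real embedding models emit hundreds of dimensions), and the brute-force ranking stands in for what an index would accelerate; it is not how BigQuery implements search internally.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, vectors, k, distance=cosine_distance):
    """Return the ids of the k stored vectors closest to the query."""
    ranked = sorted(vectors.items(), key=lambda item: distance(query, item[1]))
    return [log_id for log_id, _ in ranked[:k]]


# Hypothetical toy "embeddings" keyed by a log label.
LOG_VECTORS = {
    "login_ok": [0.9, 0.1, 0.0],
    "login_failed": [0.8, 0.3, 0.1],
    "sink_deleted": [0.0, 0.2, 0.9],
}

# A query vector that sits near the two login events.
nearest = top_k([0.85, 0.2, 0.05], LOG_VECTORS, k=2)
print(nearest)
```

The two login events come back first because their vectors point in nearly the same direction as the query, while the sink deletion sits far away in the space.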


Understanding Vector Search in Log Analysis

As we dive deeper into BigQuery's vector search capabilities, it's important to understand how it differs from traditional text search and why it's so powerful for log analysis.

Traditional text search is like looking for a specific word in a book. It's great when you know exactly what you're looking for - a particular IP address, an error message, or a user's email. It's your go-to tool for those "needle-in-a-haystack" scenarios, like when you need to pull up all logs related to a specific user for GDPR compliance.

Vector search, on the other hand, is more like having a conversation with your data. It doesn't just match words; it understands context and meaning. Here's how it works: we convert each log entry into a high-dimensional numerical vector - think of it as a unique fingerprint that captures the essence of that log. These vectors are created using advanced machine learning models, like the ones provided by Vertex AI.

Now, why is this so exciting for us security folks? Because it allows us to ask questions we couldn't before. Instead of just asking "Show me all logs with this exact error message," we can now ask, "Is this pattern of activity normal?" or "Does this configuration change look suspicious?"

Let me give you a real-world example. Imagine you're monitoring for potential security breaches. With traditional text search, you might look for known malicious IP addresses or specific error messages. But what about sophisticated attacks that don't trigger these obvious red flags? This is where vector search shines. It can identify patterns of behavior that are subtly different from the norm, potentially catching that clever attacker who's flying just under the radar.

There's a trade-off, though. Once a vector index is in place, vector search in BigQuery uses something called Approximate Nearest Neighbor (ANN) search: instead of comparing your query against every stored vector, it checks only the most promising part of the index. That makes it blazingly fast and cost-effective, but it might not return every true nearest match. It's a bit like having a highly trained detective who can quickly spot suspicious behavior but might occasionally miss a minor detail. For most security applications, this trade-off is well worth it, given the speed and insights we gain.
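
Here is a toy Python sketch of why an approximate search can miss a match. It partitions vectors into buckets around two hand-picked centroids, in the spirit of an inverted-file index, and searches only the bucket nearest to the query. The centroids, vectors, and function names are all hypothetical, and this is not a description of BigQuery's internals.

```python
import math


def build_index(vectors, centroids):
    """Assign every vector id to the bucket of its nearest centroid."""
    buckets = {name: [] for name in centroids}
    for log_id, vec in vectors.items():
        nearest = min(centroids, key=lambda c: math.dist(vec, centroids[c]))
        buckets[nearest].append(log_id)
    return buckets


def exact_search(query, vectors):
    """Brute force: compare the query against every stored vector."""
    return min(vectors, key=lambda log_id: math.dist(query, vectors[log_id]))


def approximate_search(query, vectors, centroids, buckets):
    """Scan only the bucket whose centroid is closest to the query.

    Assumes that bucket is non-empty, which holds for the toy data below.
    """
    nearest = min(centroids, key=lambda c: math.dist(query, centroids[c]))
    candidates = buckets[nearest]
    best = min(candidates, key=lambda log_id: math.dist(query, vectors[log_id]))
    return best, len(candidates)


# Hypothetical 2-D data chosen so the true nearest neighbour sits just
# across the bucket boundary from the query.
CENTROIDS = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
VECTORS = {
    "a1": (1.0, 0.0),
    "a2": (4.9, 0.0),
    "b1": (9.0, 0.0),
    "b2": (5.6, 0.0),
}
QUERY = (5.2, 0.0)

BUCKETS = build_index(VECTORS, CENTROIDS)
exact_hit = exact_search(QUERY, VECTORS)
approx_hit, scanned = approximate_search(QUERY, VECTORS, CENTROIDS, BUCKETS)
print(exact_hit, approx_hit, scanned)
```

The approximate search looks at half as many vectors and returns a close match, but the exact search shows that a slightly closer one was sitting in the other bucket. That is the recall you give up for speed.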

BigQuery lets us set up both full-text and vector indexes, so we're not choosing one or the other - we're adding a powerful new tool to our arsenal. We can still do those precise, compliance-related searches when we need to, but now we can also ask those broader, more nuanced questions that often lead to the most interesting discoveries.


Enhancing Threat Analysis

Faster Detection of Anomalies

Vector representations of log data can help identify anomalies that might be missed by traditional keyword-based searches. By analyzing the relationships between different log entries in vector space, unusual patterns can be more easily detected.

Example: Consider a scenario where an attacker uses slightly modified commands to avoid detection. We can use SQL and vector operations in BigQuery to identify such anomalies:

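-- `project.dataset.embedding_model` is a placeholder for a remote embedding model you have created.
WITH log_vectors AS (
  SELECT
    log_id,
    ml_generate_embedding_result AS vector
  FROM ML.GENERATE_EMBEDDING(
    MODEL `project.dataset.embedding_model`,
    -- The text to embed must be exposed as a column named `content`.
    (SELECT log_id, log_text AS content FROM `project.dataset.log_table`),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT
  l1.log_id AS log_id_1,
  l2.log_id AS log_id_2,
  ML.DISTANCE(l1.vector, l2.vector, 'EUCLIDEAN') AS distance
FROM log_vectors l1
CROSS JOIN log_vectors l2
-- Compare each pair once and skip self-comparisons.
WHERE l1.log_id < l2.log_id
  -- Keep pairs that are close together but not identical.
  AND ML.DISTANCE(l1.vector, l2.vector, 'EUCLIDEAN') > 0.0
  AND ML.DISTANCE(l1.vector, l2.vector, 'EUCLIDEAN') < 0.5
ORDER BY distance ASC
LIMIT 100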

This query generates vector embeddings for log entries, then compares them to find highly similar but not identical logs, which could indicate modified attack patterns.
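
If it helps to see the same filter outside SQL, here is a small Python sketch of the "close but not identical" band. The two-dimensional vectors and command names are invented for illustration, and the 0.5 threshold is simply the one used in the query; a real threshold depends on your embedding model.

```python
import math
from itertools import combinations


def near_duplicates(vectors, max_distance=0.5):
    """Return (id_1, id_2, distance) for pairs with 0 < distance < max_distance."""
    pairs = []
    for (id_1, vec_1), (id_2, vec_2) in combinations(vectors.items(), 2):
        distance = math.dist(vec_1, vec_2)
        if 0.0 < distance < max_distance:
            pairs.append((id_1, id_2, distance))
    return sorted(pairs, key=lambda pair: pair[2])


# Hypothetical embeddings: one command, an exact repeat, a tweaked variant,
# and an unrelated command.
COMMAND_VECTORS = {
    "cmd_a": (1.0, 0.0),
    "cmd_a_variant": (1.0, 0.3),
    "cmd_a_copy": (1.0, 0.0),
    "cmd_b": (0.0, 2.0),
}

suspicious_pairs = near_duplicates(COMMAND_VECTORS)
for id_1, id_2, distance in suspicious_pairs:
    print(id_1, id_2, round(distance, 3))
```

The exact repeat is dropped (distance 0) and the unrelated command is too far away, so only the pairs involving the tweaked variant survive.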


Improved Correlation of Events

Vector search can help correlate disparate events that might be related to a single security incident. By analyzing the vectors, it can identify connections between log entries across different systems and time periods.

Example: To correlate events across different log sources, we might use a query like this:

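-- `project.dataset.embedding_model` is a placeholder for a remote embedding model you have created.
WITH log_vectors AS (
  SELECT
    log_id,
    source_system,
    timestamp,
    ml_generate_embedding_result AS vector
  FROM ML.GENERATE_EMBEDDING(
    MODEL `project.dataset.embedding_model`,
    (
      -- The text to embed must be exposed as a column named `content`.
      SELECT log_id, source_system, timestamp, log_text AS content
      FROM `project.dataset.log_table`
    ),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT
  l1.log_id AS event1_id,
  l2.log_id AS event2_id,
  l1.source_system AS system1,
  l2.source_system AS system2,
  l1.timestamp AS time1,
  l2.timestamp AS time2,
  ML.DISTANCE(l1.vector, l2.vector, 'EUCLIDEAN') AS distance
FROM log_vectors l1
JOIN log_vectors l2
  -- Only pair events from different systems that occurred within an hour of each other.
  ON l1.source_system != l2.source_system
  AND ABS(TIMESTAMP_DIFF(l1.timestamp, l2.timestamp, MINUTE)) < 60
WHERE ML.DISTANCE(l1.vector, l2.vector, 'EUCLIDEAN') < 1.5
ORDER BY distance ASC
LIMIT 100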

This query finds similar log entries from different systems within a 60-minute window, potentially revealing related events in an attack sequence.
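
The join logic can also be sketched in plain Python. The events, vectors, and field values below are hypothetical, and unlike the SQL join this version reports each pair once rather than in both orders. The time comparison is also exact rather than counted in whole minutes, so treat it as a rough equivalent.

```python
import math
from datetime import datetime, timedelta
from itertools import combinations


def correlate(events, max_gap=timedelta(minutes=60), max_distance=1.5):
    """Pair up similar events from different systems that occur close in time."""
    matches = []
    for e1, e2 in combinations(events, 2):
        if e1["source_system"] == e2["source_system"]:
            continue  # same system, not a cross-system correlation
        if abs(e1["timestamp"] - e2["timestamp"]) >= max_gap:
            continue  # too far apart in time
        distance = math.dist(e1["vector"], e2["vector"])
        if distance < max_distance:
            matches.append((e1["log_id"], e2["log_id"], distance))
    return sorted(matches, key=lambda match: match[2])


# Hypothetical events with made-up 2-D embeddings.
EVENTS = [
    {"log_id": "fw-1", "source_system": "firewall",
     "timestamp": datetime(2024, 8, 20, 10, 0), "vector": (0.0, 1.0)},
    {"log_id": "app-7", "source_system": "app",
     "timestamp": datetime(2024, 8, 20, 10, 20), "vector": (0.5, 1.0)},
    {"log_id": "app-9", "source_system": "app",
     "timestamp": datetime(2024, 8, 20, 13, 0), "vector": (0.0, 1.0)},
    {"log_id": "fw-2", "source_system": "firewall",
     "timestamp": datetime(2024, 8, 20, 10, 5), "vector": (5.0, 5.0)},
]

correlated = correlate(EVENTS)
print(correlated)
```

Only the firewall event and the application event twenty minutes later are linked: the third event is identical in content but three hours away, and the fourth is close in time but unrelated in meaning.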


Practical Applications

Use Case: Identifying Phishing Attempts

Vector search can be effective in analyzing email content to detect phishing attempts. By converting email text into vectors, we can compare them against known phishing patterns.

Example:

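-- `project.dataset.embedding_model` is a placeholder for a remote embedding model you have created.
WITH email_vectors AS (
  SELECT
    email_id,
    sender,
    subject,
    ml_generate_embedding_result AS content_vector
  FROM ML.GENERATE_EMBEDDING(
    MODEL `project.dataset.embedding_model`,
    (
      -- Embed subject and body together, exposed as the `content` column.
      SELECT email_id, sender, subject, CONCAT(subject, ' ', body) AS content
      FROM `project.dataset.email_logs`
    ),
    STRUCT(TRUE AS flatten_json_output)
  )
),
known_phishing_vectors AS (
  SELECT ml_generate_embedding_result AS phishing_vector
  FROM ML.GENERATE_EMBEDDING(
    MODEL `project.dataset.embedding_model`,
    (SELECT phishing_content AS content FROM `project.dataset.known_phishing_patterns`),
    STRUCT(TRUE AS flatten_json_output)
  )
)
SELECT
  e.email_id,
  e.sender,
  e.subject,
  -- Distance to the closest known phishing pattern.
  MIN(ML.DISTANCE(e.content_vector, p.phishing_vector, 'EUCLIDEAN')) AS min_distance
FROM email_vectors e
CROSS JOIN known_phishing_vectors p
GROUP BY e.email_id, e.sender, e.subject
HAVING min_distance < 1.0
ORDER BY min_distance ASC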

This query compares incoming emails against known phishing patterns, flagging those with high similarity for further investigation.


Enhancing Threat Analysis with LLM Integration

While vector search significantly improves our ability to detect anomalies and correlate events, integrating Large Language Models (LLMs) can supercharge our analysis and significantly speed up the initial triaging process.


Triage Anomalies Using Vector Search with LLM Reasoning (RAG)

Here's an example of how we can use BigQuery's ML.GENERATE_TEXT function to incorporate LLM reasoning into our vector search process:

SELECT prompt, ml_generate_text_llm_result AS generated
FROM ML.GENERATE_TEXT(
  MODEL `[MY_PROJECT].[MY_DATASET].gemini_model`,
  (
    -- Build one prompt per suspicious action from its similar past actions.
    SELECT CONCAT(
      "You are a cloud administrator and a log forensics expert. ",
      "Determine if the following suspected administrative action is a high or low risk given the following prior history of valid administrative actions. ",
      "Include your reasons in a bullet list.", "\n",
      FORMAT("New suspected administrative action:\n%s\n", suspicious_action),
      "Previous administrative actions:\n",
      STRING_AGG(FORMAT("- %s", past_similar_action), '.\n')
    ) AS prompt,
    FROM (
      -- Insert vector search query here to find similar past actions
    )
    GROUP BY
      suspicious_action
  ),
  STRUCT(
    600 AS max_output_tokens, -- increase tokens to account for explanation
    0.1 AS temperature, -- more deterministic response
    TRUE AS flatten_json_output
  )
);

We're trying to do the following:

Perform a vector search to find similar past actions.
Construct a prompt for the LLM, including the suspicious action and similar past actions.
Use the Gemini model to analyze the risk level of the suspicious action.
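
The prompt-construction step is easy to prototype outside BigQuery before wiring it into a query. Here is a minimal Python sketch that mirrors the CONCAT and STRING_AGG logic; the function name and sample actions are hypothetical, and it only builds the string without calling any model.

```python
def build_triage_prompt(suspicious_action, past_similar_actions):
    """Assemble the triage prompt from one suspicious action and its history.

    Past actions are joined with '.\\n', matching the STRING_AGG separator
    used in the SQL version.
    """
    history = ".\n".join(f"- {action}" for action in past_similar_actions)
    return (
        "You are a cloud administrator and a log forensics expert. "
        "Determine if the following suspected administrative action is a high "
        "or low risk given the following prior history of valid administrative "
        "actions. "
        "Include your reasons in a bullet list.\n"
        f"New suspected administrative action:\n{suspicious_action}\n"
        "Previous administrative actions:\n"
        f"{history}"
    )


# Hypothetical inputs; in practice these come from the vector search step.
prompt = build_triage_prompt(
    "DeleteSink on sink-x by user-a",
    ["DeleteSink on sink-x by user-a via Terraform", "CreateSink on sink-x by user-a"],
)
print(prompt)
```

Printing the prompt for a handful of real cases is a cheap way to check that the history you feed the model is actually relevant before you pay for any generation calls.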

An example output from this LLM-enhanced analysis might look like this:

Low risk


* The principal (someone@google.com) has performed the same operation (google.logging.v2.ConfigServiceV2.DeleteSink) on the same logging sink (<redacted>) in the same project (<redacted>) multiple times in the past using Terraform.

* The IP address (<redacted>) used in the suspected administrative action is not significantly different from the IP addresses (<redacted> and <redacted>) used in previous administrative actions.

* The frequency of the suspected administrative action (1 time) is consistent with the frequency of previous administrative actions (1 time, 3 times, 1 time, and 1 time).

This automated triage can help security teams quickly prioritize which actions require immediate attention, significantly improving response times to potential threats.


Scale, Cost, and Performance Considerations

When implementing vector search and LLM integration for log analysis, it's crucial to consider the scale of your data and the associated costs and performance implications.

Vector Search Efficiency

Vector search, like full-text search, uses a specialized index to enhance performance. This allows for efficient searching of embeddings, resulting in faster response times and reduced data scanning. However, this efficiency may come at the cost of slightly less precise results compared to exhaustive searches.

Data Volume Reduction

Let's consider a practical example to illustrate the benefits of vector search in terms of data volume:

Assume a project with 50GB of audit logs covering the last 12 months, with an average payload of 2KB per log entry. This translates to approximately 25 million log records. By applying data reduction techniques and vector embeddings, we can significantly reduce this volume:

25M log entries @ 2KB each → 1.25M (aggregated) logs @ 200 bytes each
50GB raw logs → ~240MB log embeddings

This reduction means that your vector index would be around 240MB, allowing for fast (1-2 seconds) and cost-effective semantic search over a 12-month lookback period.
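
The arithmetic behind these figures is worth checking for your own log volumes. Here is a short Python sketch using the numbers quoted above, assuming decimal units (1 GB = 10^9 bytes); the variable names are mine, and the result of 250 MB (about 238 MiB) lines up with the ~240MB figure.

```python
# Figures taken from the example above; decimal units are an assumption.
RAW_LOG_BYTES = 50 * 10**9          # 50GB of audit logs
AVG_ENTRY_BYTES = 2 * 10**3         # 2KB average payload per log entry
AGGREGATED_ENTRIES = 1_250_000      # log entries left after aggregation
AGGREGATED_ENTRY_BYTES = 200        # bytes per aggregated entry

raw_entries = RAW_LOG_BYTES // AVG_ENTRY_BYTES
aggregation_ratio = raw_entries // AGGREGATED_ENTRIES
embedding_bytes = AGGREGATED_ENTRIES * AGGREGATED_ENTRY_BYTES
reduction_factor = RAW_LOG_BYTES / embedding_bytes

print(f"raw log entries:    {raw_entries:,}")
print(f"aggregation ratio:  {aggregation_ratio}:1")
print(f"embedding volume:   {embedding_bytes / 10**6:.0f} MB "
      f"({embedding_bytes / 2**20:.0f} MiB)")
print(f"overall reduction:  {reduction_factor:.0f}x")
```

Swapping in your own raw volume, payload size, and aggregation ratio gives a quick first estimate of how large the searchable data set would be.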

Scaling to Larger Datasets

Now, let's consider a scenario where 50GB of logs are ingested daily. For a one-year retention period, this would result in:

17TB of raw log data
Only 81.72GB of log embeddings

And therein lies the crux of this exercise. We suddenly have log volumes that are no longer prohibitively large. Under 100 GB of embeddings is roughly 200 times smaller than 17TB of raw logs, and it is far easier to convince management to foot the bill for that, especially given today's regulatory mandates.


Conclusion

As I reflect on the potential of BigQuery's vector search capabilities combined with LLM integration, I can't help but feel a mix of excitement and caution. This technology represents a significant leap forward in how we approach log analysis for cybersecurity, but it's not without its challenges.

On one hand, the ability to efficiently detect anomalies and correlate events across massive volumes of log data is genuinely game-changing. I've seen security teams struggle for years with the sheer volume of data they need to sift through, often missing critical connections simply because they're buried in the noise. Vector search could be the key to cutting through that noise.

The integration of LLMs for automated triage is particularly intriguing. Having spent countless hours manually reviewing potential security incidents, the thought of having an AI assistant to provide an initial assessment is both thrilling and a little unnerving. Will it catch things we might miss? Probably. Might it also lead us down the wrong path occasionally? Almost certainly.

What really strikes me is the potential for democratizing advanced security analysis. The dramatic reduction in data volume we can achieve with these techniques means that even smaller organizations might be able to retain and analyze long-term log data in ways that were previously only feasible for tech giants with enormous budgets.

But let's not get carried away. These are powerful tools, not magic bullets. They still require skilled professionals to implement effectively and interpret the results. There's also the question of how attackers might adapt their techniques once they understand how we're using AI to detect them.

As we move forward with these technologies, I think it's crucial that we:

Remain critical and always verify the outputs of our AI-assisted analyses.
Continuously educate ourselves and our teams on both the capabilities and limitations of these tools.
Stay vigilant about the ethical implications of using AI in security contexts, especially when it comes to privacy concerns.

In the end, I believe the combination of vector search and LLMs will become an indispensable part of the security analyst's toolkit. But like any tool, its true value will lie in how skillfully we wield it. As we embrace these advancements, let's do so thoughtfully, always keeping in mind that our ultimate goal is not just to leverage cool tech, but to genuinely improve our ability to protect our systems and data.

The future of log analysis in cybersecurity is bright, but it's up to us to navigate it wisely. I, for one, am looking forward to the journey.


Ganga Sumanth

Blog Author

Ganga Sumanth is an Associate Security Engineer at we45. His natural curiosity finds him diving into various rabbit holes which he then turns into playgrounds and challenges at AppSecEngineer. A passionate speaker and a ready teacher, he takes to various platforms to speak about security vulnerabilities and hardening practices. As an active member of communities like Null and OWASP, he aspires to learn and grow in a giving environment. These days he can be found tinkering with the likes of Go and Rust and their applicability in cloud applications. When not researching the latest security exploits and patches, he's probably raving about some niche add-on to his ever-growing collection of hobbies.

Hobbies: Long distance cycling, hobby electronics, gaming, badminton, football, high altitude trekking

SM Links: He is a Hermit, loves his privacy