Research shows: Security agents are only as good as the context they’re provided with

TL;DR

  • AI models already know how to reason about security. What limits them is whether they can see how your systems connect to each other.
  • Cross-platform identity risks won’t surface from any tool that evaluates systems in isolation. This visibility gap is structural, not a model capability problem.
  • Adding a cross-vendor relationship map raised answer correctness by 34% and cut exploratory queries by roughly 70%. Both improved together, without any tradeoff.
  • Relational context matters more than model choice. The finding held across every frontier model we tested.
  • When evaluating AI for security, the right question isn’t which model scores highest. It’s what map the agent is working from.

“Context is everything” has officially become the new “data is the new oil.” You’ve seen the conference decks and read the LinkedIn posts. Everyone in AI agrees: give your model more context and get better answers. Good advice, genuinely.

But there’s a version of that advice that’s doing a lot of hand-waving over a specific, uncomfortable question: what happens when an AI agent gets the context wrong inside your security stack?

Foundation models are generic by design. They’ve never seen your Okta tenant, your AWS naming conventions, or how your HR directory connects to anything. Context is the bridge that closes that gap. And when that bridge is incomplete, you’re not dealing with a slightly worse answer. You’re dealing with a confident-sounding verdict that sends a remediation team after the wrong identities, misses cross-system connections, and closes a ticket that should have stayed open.

Our research team wanted to measure that gap scientifically. They built a benchmark, based on evidence from fifty real cross-system identity tasks, eight live enterprise platforms, and four frontier models. We published the full methodology and results. Here we’ll focus on the part that matters for your stack.

The identity problem that lives between your platforms

Most security platforms test their AI on one system at a time. That’s not a criticism of the vendors, it’s a structural limitation of how these tools were built.

Take a concrete example from the benchmark. Recognizing that [email protected] in your identity provider corresponds to arn:aws:iam::123:role/JDoe-Dev in AWS requires inferring a connection that exists in neither system’s data. No platform references the other. An AI agent working without that explicit relational map has to guess, and guessing across disconnected schemas produces the kind of errors that look like answers.

Our research team named this the Correlation Gap: the challenge of linking the same identity across platforms that share no common schema and no foreign keys between them. An offboarded employee whose HR record marks them inactive but whose AWS SSO access is still live won’t surface from one platform alone.

In that scenario, an agent trying to close that gap doesn’t stop and say it can’t. Instead, it starts probing: it queries tables that may not exist, infers connections that aren’t documented, and builds a picture of your environment through trial and error. That process is slow, expensive in tokens, and wrong often enough to matter.

While identity is where so many damaging exposures start, the problem isn’t unique to it. Any AI agent reasoning across disconnected systems faces the same structural gap.

This benchmark was designed specifically to test it.

Cross-platform ISPM: what we actually tested

The industry has shipped benchmarks for AI in cybersecurity over the last few years, from CTF-style challenges to SOC investigations and vulnerability discovery. But zoom in on identity security and almost none of them resemble how access actually works in a real company: messy org charts, legacy directories, multiple clouds, and a mix of human and machine identities spread across systems that were never designed to talk to each other. We wanted numbers that held up against that reality.

So our research team ran the benchmark on a live, production-grade environment connecting eight platforms across HR, identity, cloud, code, and collaboration, as described below. Each of the fifty tasks required pulling data from more than one of them, mirroring investigations a security team would actually run.

On top of that sits an organizational knowledge layer that ingests internal policies, pentest reports, Jira history, or past approval decisions, so the platform reasons about your environment, not an industry average. Most “AI-powered” security products probably skipped this part. Without a real data foundation, AI in security is a parlor trick that breaks the first time someone asks a real question.

We tested four frontier models across five levels of context, from raw environment access with zero supporting metadata all the way up to full schemas, example queries, and a cross-vendor relationship map. We graded the whole reasoning chain, not just the final answer, scored by a two-judge consensus panel to keep any single grader’s bias out of the results.

The finding that holds across every model we tested

The models already knew how to think about security. With zero context injected, every model scored between 0.94 and 0.99 on reasoning quality. They asked the right questions and went looking for the right kinds of data. Their security instincts were solid.

What they couldn’t do was prove it. Without an explicit map of how systems connect, the same models queried tables that didn’t exist, made wrong assumptions about which identity matched which, and missed affected accounts even after correctly flagging that a risk was present.

They knew the “what”, but they couldn’t reliably deliver the “who” and the “where.”

Then we added context. Schema definitions, example query patterns, and Sola’s Security Graph, a cross-vendor relationship map that explicitly connects identities across platforms. As a result,. accuracy climbed across the board.

Moving from no context to full context lifted answer correctness by 34% relatively, and the pattern held no matter which model was running underneath. Complete failures dropped just as sharply: Claude Sonnet 4.6, for example, went from failing 38% of tasks down to 8%.

Even in the best configuration, Claude Opus 4.8 reached the right high-level verdict 94% of the time but produced the complete, evidence-backed answer only 78% of the time. For remediation, the difference matters: knowing a privilege escalation path exists doesn’t tell you which identities it touches.

Of the context layers we tested, one drove the largest single accuracy jump: the Security Graph.

Sola’s Security Graph made the difference

Sola’s Security Graph is a cross-relation map our team built that spells out, explicitly, how an entity in one platform connects to an entity in another. The join paths and relationships that exist in no single system’s data, written down in one place an AI agent can actually read.

Adding it produced the single largest accuracy jump of any context layer in the benchmark. The model stopped guessing at how Okta relates to AWS and started following a map that already knew. The gains showed up in efficiency too, not just accuracy. Moving from no context to full context cut the average number of exploratory queries per task by roughly 70% across every model.

Without a relational map, an agent probes dead ends, querying tables that might not exist and working backward from wrong assumptions. With the Security Graph in place, it goes straight to the right join path. Fewer queries means lower token cost, faster answers, and far less noise for your team to wade through.

The takeaway holds across all four models: the quality of the relational context around an agent matters more than which model runs underneath it.

What this means if you’re evaluating AI for security

“Which model is best” is the wrong first question. The data says model choice is a secondary decision next to the relational context the agent can actually draw on.

So here’s what’s worth asking a vendor instead:

  • How does the agent resolve the same identity across platforms? If it’s inferring connections on the fly, you’ll get the verdict-evidence gap from the benchmark: confident conclusions with incomplete proof.
  • What structured context does the agent receive? Schemas help. A real cross-platform relationship map helps far more, and it’s the piece most tools don’t have.
  • What happens when relational data is missing? A trustworthy agent surfaces the gap. A risky one fills it with a guess that reads like an answer.

None of these questions are about model benchmarks or parameter counts, but about whether the system around the model understands your environment. That’s the variable you can actually control.

The map your agent needs already exists

The benchmark is public: the full question suite, methodology, and results are available for any team that wants to run their own evaluation or pressure-test a vendor’s claims.

If you’re ready to see how Sola’s context layer handles your actual environment, across your platforms and your identity data, Lumina Signals is where to start.

See what your environment looks like with the full map.

Frequently asked questions

Why does 78% answer correctness matter if the model gets the verdict right 94% of the time?
The verdict tells you a risk exists. The full answer tells you which identities are exposed, which access paths are involved, and what you need to close. For a team actually running remediation, 78% means roughly one in five investigations is missing evidence. That’s not a scoring footnote. It’s the difference between a complete investigation and an incomplete one that looks finished.
The benchmark ran on Sola’s own infrastructure. Doesn’t that bias the results?
The Security Graph is a Sola-proprietary artifact, and its role in the results is part of what the benchmark measures. The point isn’t that every team already has it. It’s that when a cross-platform relationship map is available, accuracy and efficiency improve significantly across every frontier model tested, regardless of which model runs underneath. The finding about relational context matter more than model choice holds independent of the vendor.
The benchmark covers identity security. Does any of this apply to broader security operations?
The benchmark was scoped to ISPM visibility: misconfiguration detection, identity enumeration, and cross-system access tracing. It doesn’t cover remediation, behavioral analytics, or broader SOC workflows. That said, the core finding, that AI agents reasoning across disconnected systems need explicit relational context to produce complete evidence, applies wherever data lives across platforms that don’t share a common schema. Identity is the most acute version of that problem, not the only one.
The benchmark tested specific model versions. What happens as models improve?
Faster, more capable models will close some of the gaps measured here. But the verdict-evidence gap in this benchmark wasn’t primarily a reasoning problem. Models scored near-perfect on reasoning quality even with zero context. The failure was at the evidence layer: without a relational map, even the strongest models couldn’t reliably enumerate affected identities. That’s a data availability problem, not a model capability one. Better models don’t solve it. Better context does.
If the Security Graph is the key variable, how does a team get one?
Building a cross-vendor relationship map from scratch requires maintaining explicit join logic across every platform in your environment, keeping it current as systems change, and making it readable by an AI agent at query time. Most teams don’t have the infrastructure to do that at scale. It’s the core problem Sola’s context layer was built to solve. If you want to see how it works against your actual environment, Lumina Signals is the starting point.
About the author
Guy Flechter

CEO & Co-Founder, Sola Security

A self-proclaimed technophobe (we know, very believable), with over 20 years of security grit: from leading teams at AppsFlyer and LivePerson to co-founding Cider Security (acquired by Palo Alto Networks in 2022) and Sola. On a mission to redefine the cybersecurity industry.

Read more
What are you waiting for?

Get started for free, like, right now.