Detection Engineering and SOC Scalability Challenges (Part 2)

Anton Chuvakin
Published in Anton on Security
Sep 21, 2023

This blog series was written jointly with Amine Besson, Principal Cyber Engineer, Behemoth CyberDefence and one more anonymous collaborator.

This post is our second installment in the “Threats into Detections — The DNA of Detection Engineering” series, where we explore the challenges of detection engineering in more detail — and where threat intelligence fits in (and where some hope appears… but you need to wait for Part 3 for that!).

Contrary to what some may think, detection and response (D&R) success is more about the processes and people than about the SIEM. As one of the authors used to say during his tenure at Gartner, “SOC is first a team, then a process and finally a technology stack” (and he just repeated this at mWISE 2023). And here is another: “A great team with an average SIEM will run circles around the average team with a great SIEM.”

SIEMs, or whatever equivalent term you may prefer (A security data lake perhaps? But please no XDR… we are civilized people here), are essentially large scale telemetry analysis engines, running detection content over data stores and streams of data. The signals they produce are often voluminous without on-site tuning and context, and won’t bring value in isolation and without the necessary process stack.

It is the complex cyber defenders’ knowledge injected at every step of the rule creation and alert (and then incident) response process that is the real value-add of a SOC capability. Note that some of the rules/content may be created by the tool vendor while the rest is created by the customer.

So, yes, process is very important here, yet under the shiny new name of TDIR (Threat Detection and Incident Response) lies an essentially creaky process stack riddled with inefficiencies and toil:

  • Inconsistent internal documentation — and this is putting it generously: enough SOC teams run on tribal knowledge that even an internal wiki would be a huge improvement for them.
  • Staggered and chaotic project management — SOC project management that is hard to understand and improve, with an irregular release/delivery process and traceability often lost in the operational noise.
  • No blueprint to do things consistently — before we talk automation, let’s talk consistency. And consistency is hard with an ad hoc process that is reinvented every time…
  • No automation to do things consistently and quickly — once the process is clear, how do we automate it? The answer often is “well, we don’t”; in any case, see the item just above…
  • Long onboarding of new log sources — while the 1990s are over, the organizations where a SOC needs to push paper forms deep into some beastly IT bureaucracy to enable a new log source have not vanished yet.
  • Low awareness of removed or failed log sources — a SOC that does not notice a log source going quiet risks missing critical security events and, worse, quietly failing detections (a minimal monitoring sketch follows this list).
  • Large inertia to develop new detection content, low agility — if you turn an annual process into a quarterly one, but what you need is a daily response, have you actually improved things?
  • Inscrutable and unmaintainable detection content — if the detection was not developed in a structured and meaningful way, then both alert triage and further refinement of the detection code will… ahem… suffer (this wins the Understatement of the Year award).
  • Technical bias, starting from available data rather than threats — this is sadly very common at less mature SOCs. “What data do we collect?” tends to precede “what do we actually want to do?”, despite the “output-driven SIEM” concept having been invented before 2012 (to be honest, I stole the idea from a Vigilant consultant back in 2012).
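
To make the log-source item concrete, here is a minimal sketch of the kind of freshness check a SOC could run on a schedule. It assumes a per-source “silence budget” and some way to ask the SIEM for the last event timestamp per source; the source names, budgets and timestamps below are made-up stand-ins, not anyone’s real configuration.

```python
# Minimal sketch: flag log sources that have been silent longer than expected.
# The last_event_seen data would normally come from a SIEM search or an
# ingestion-metrics API; here it is a hard-coded stand-in.
from datetime import datetime, timedelta, timezone

# Assumed maximum acceptable silence per log source (tuned per source)
expected_silence = {
    "windows_security": timedelta(minutes=15),
    "edr_telemetry": timedelta(minutes=5),
    "vpn_gateway": timedelta(hours=1),
}

now = datetime.now(timezone.utc)

# Stand-in for "last event received" timestamps pulled from the SIEM
last_event_seen = {
    "windows_security": now - timedelta(minutes=3),
    "edr_telemetry": now - timedelta(hours=2),   # quietly failed source
    "vpn_gateway": now - timedelta(minutes=40),
}

def stale_sources(last_seen, budgets, now):
    """Return sources whose silence exceeds their allowed window."""
    stale = []
    for source, budget in budgets.items():
        seen = last_seen.get(source)
        if seen is None or now - seen > budget:
            stale.append(source)
    return stale

for source in stale_sources(last_event_seen, expected_silence, now):
    # In a real pipeline this would open a ticket or page the SOC
    print(f"ALERT: log source '{source}' is silent beyond its freshness budget")
```

Even something this small turns “quietly failed detections” into a visible, trackable event instead of a surprise during an incident.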

While IT around your SOC may live in the “future” world of SRE, DevOps, GitOps and large scale automation, releasing new detections to the live environment is, surprisingly, often heavy on humans, full of toil and friction.

Not only is it often lacking sophistication (copy pasting from a sheet into a GUI), but it is also not tracked or versioned in many cases — which makes ongoing improvement challenging at best.

Some teams have made good progress toward automation by using detection-as-code, but adoption is still minimal. And apart from a handful of truly leading teams, it is often limited to deploying vendor-provided rules or code from public repositories (ahem, “detection as code written by strangers on the internet”, if you like…). As a result, it then poses the real challenge of reconciling internal and external rule tracking.
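
As an illustration of one small step such a pipeline could contain, here is a minimal sketch of a CI validation check for a detection-as-code repository. It assumes rules are stored as YAML files (Sigma-style) under a rules/ directory; the required field names and the layout are illustrative choices, not a standard.

```python
# Minimal sketch of a CI step for detection-as-code: every rule in the
# repository must carry the metadata that triage and later refinement
# depend on. Field names and repo layout are illustrative.
import sys
from pathlib import Path

import yaml  # PyYAML


REQUIRED_FIELDS = {"id", "title", "description", "detection", "falsepositives"}


def validate_rule(path: Path) -> list[str]:
    """Return a list of problems found in one rule file."""
    rule = yaml.safe_load(path.read_text())
    if not isinstance(rule, dict):
        return [f"{path}: not a YAML mapping"]
    missing = REQUIRED_FIELDS - rule.keys()
    return [f"{path}: missing fields {sorted(missing)}"] if missing else []


def main(rules_dir: str = "rules") -> int:
    problems = []
    for rule_file in sorted(Path(rules_dir).glob("**/*.yml")):
        problems.extend(validate_rule(rule_file))
    for problem in problems:
        print(problem)
    return 1 if problems else 0  # non-zero exit fails the CI job


if __name__ == "__main__":
    sys.exit(main())
```

The point is not this particular check; it is that versioned rules plus even trivial automated gates already beat copy-pasting from a sheet into a GUI.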

An astute reader will also point out that the very process of “machining” raw threat signals into polished production detections is very artisanal in most cases; but don’t despair, we will address this in the next parts of this series. It will be fun!

Apart from that, much of the process of creating new detections has two key problems:

  • Often it starts from available data, and not from relevant threats.
  • Prioritization is still very much a gut-feeling affair based on assumptions, individual perspective and analysis bias.

Instead, there should be a rolling evaluation of relevant and incoming threats, crossed with current capabilities. In other words, measuring detection coverage (how well we detect, in our environment, against the overall known threat landscape) allows us to build a rolling backlog of threats to detect, identify logging/telemetry gaps, and pick the key improvement points that steer detection content development. This turns an arts-and-crafts detection project into an industrial detection pipeline.
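
As a rough illustration of what “crossing threats with capabilities” can look like in its simplest form, here is a minimal sketch that compares a set of currently relevant ATT&CK techniques with the techniques mapped to deployed rules, and emits a coverage figure plus a backlog. The technique IDs and the rule-to-technique mapping are made up for the example.

```python
# Minimal sketch of rolling coverage measurement: relevant techniques
# (from threat intel) crossed with techniques covered by deployed rules.
relevant_techniques = {
    "T1059.001",  # PowerShell
    "T1566.001",  # Spearphishing attachment
    "T1021.001",  # RDP lateral movement
    "T1003.001",  # LSASS memory access
}

# Hypothetical mapping of deployed rules to the techniques they cover
deployed_rules = {
    "rule_ps_encoded_cmd": {"T1059.001"},
    "rule_office_child_proc": {"T1566.001"},
}

covered = set().union(*deployed_rules.values()) if deployed_rules else set()
coverage = len(relevant_techniques & covered) / len(relevant_techniques)
backlog = sorted(relevant_techniques - covered)

print(f"Coverage against current threat landscape: {coverage:.0%}")
print(f"Detection backlog (build rules or fix telemetry next): {backlog}")
```

Real coverage tracking is of course messier (rule quality, telemetry depth, environment specifics), but even this crude cross keeps the backlog driven by threats rather than by whatever data happens to be lying around.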

💸 How about relying on vendors?

What about avoiding all of the above trickery and relying on a wise third party for all your detection content? Well, not only does external detection content quality vary drastically from provider to provider, but such dependency can also occasionally become counter-productive.

In theory, the MSSPs / MDRs with a pristine ecosystem to research, build and release should have solid detections for most clients’ conventional threats, but the good ones are few and far between. They often cannot spend their development time creating custom detections (the “economies of scale” argument that is often used to justify an MSSP actually prevents that). Instead, some build broad, generic detection logic, and then spend their — and their customers’! — time tuning out False Positives on the live environments of their clients.

From the end-client perspective, there is neither a guarantee nor a complete understanding that the process is going measurably well (especially, and I insist here, in regard to False Negatives) and that it actually leads to an increase in client detection coverage. In this respect, some would say that MSSPs / MDRs compete, when it comes to detections and detection coverage, in a market of lemons.

For internal or co-managed SOCs (where a small internal team works with an MSSP), relying excessively on externally sourced rules also has the occasional side effect of lowering the actual understanding of both the threat and the detection implementation, further encouraging a downward spiral.

When working the other way around, handing over in-house detections to the provider for alert response, there is often slowness and protest, as it interferes with their processes and they are (legitimately) concerned that their analysts won’t understand how to process the incidents, since they lack on-site context and tribal knowledge. This plays a real part in an industry where a 70% False Positive rate is rather common: over-relying on response capacity to tune out noisy rules (or, a beautiful SOAR at the output of an ugly SIEM, now a classic!) rather than having a defined development lifecycle where lowering FPs is a priority.
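
To show one small starting point for a lifecycle where lowering FPs is a priority, here is a minimal sketch that computes per-rule False Positive rates from closed-alert dispositions and orders the tuning backlog by them. The alert data is a made-up stand-in for an export from a case management or SOAR tool.

```python
# Minimal sketch: make per-rule False Positive rates visible so that tuning
# is driven by the development lifecycle, not absorbed by responder capacity.
from collections import Counter, defaultdict

# (rule_name, disposition) pairs, e.g. exported from closed alerts
closed_alerts = [
    ("rule_ps_encoded_cmd", "true_positive"),
    ("rule_ps_encoded_cmd", "false_positive"),
    ("rule_office_child_proc", "false_positive"),
    ("rule_office_child_proc", "false_positive"),
    ("rule_office_child_proc", "false_positive"),
]

totals = defaultdict(Counter)
for rule, disposition in closed_alerts:
    totals[rule][disposition] += 1

def fp_rate(counts: Counter) -> float:
    """Fraction of closed alerts for a rule that were false positives."""
    return counts["false_positive"] / sum(counts.values())

# Rules sorted by FP rate become the tuning backlog
for rule, counts in sorted(totals.items(), key=lambda item: fp_rate(item[1]), reverse=True):
    print(f"{rule}: {fp_rate(counts):.0%} false positives over {sum(counts.values())} alerts")
```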

With all this being said, integrating vendor-made content (either ready-made rules that come with SIEM tools or outsourced rule implementation) into your detection creation process is perfectly viable. However, this is true only as long as you have a strong grasp of the end-to-end process and understand the technical objectives very well. Only then will the external rules fit into your environment without adding burden…

UPDATE: the story continues in “Build for Detection Engineering, and Alerting Will Improve (Part 3)”
