More SRE Lessons for SOC: Release Engineering Ideas

Anton Chuvakin · Published in Anton on Security · 6 min read · Sep 1, 2022

As we discussed in our blogs “Achieving Autonomic Security Operations: Reducing toil,” “Achieving Autonomic Security Operations: Automation as a Force Multiplier,” and “Achieving Autonomic Security Operations: Why metrics matter (but not how you think),” your Security Operations Center (SOC) can learn a lot from what IT operations discovered during the Site Reliability Engineering (SRE) revolution.

Let’s dive into another fascinating area of SRE wisdom — Release Engineering, covered in Ch 8 of the SRE book (and related workbooks too). BTW, did you know that the SRE book is — by a wide margin — the most recommended resource by guests on our podcast?

First, we need to address the elephant in the room, ahem, in the SOC. Namely, people who hear about “release engineering” and say: my SOC neither releases nor engineers. So, how can people who think their SOC does not “release” anything because “we just use alerts spawned by rules from somebody else” learn from the SREs here? Well, I am pretty sure 1990s system admins believed “they didn’t develop,” but look at all this IaC stuff and the SRE movement itself. In fact, back in 2016 we were creating a modern SOC model at Gartner, and “detection engineering” was very much part of it. So, we address the elephant by providing a compassionate wave as it departs … slowly but surely.

In fact, one SRE we talked to explained release engineering to us like this: “I want to get it out of my head and into the computer. Safely.” So, in this sense, you are almost certainly releasing things in your SOC.

Still, some organizations use a very narrow definition of a SOC, where the SOC label only applies to the “alert shoveling team” and not to the alert/rule/content development team. In this case, what we talk about here applies to the combination of these two groups.

Now, perhaps a true 2002 (!) SOC (not necessarily “clown-grade”; this label is reserved for people who want to hunt before they detect or even log…) may rely solely on vendor-provided detection content. What about all the people who say they just enable the detections/rules rather than engineer them? There is no release process as such, right? Perhaps, but should there be? Do you have 10 detections that you enabled, or 10K of them? How do you decide when to enable and disable a rule? How do you approach rule quality? My point is that even here you have a trivial “release engineering” process for detections.

Our journey into SRE wisdom starts with this question: are your detections “built in a reproducible, automated way so that [their] releases are repeatable and aren’t “unique snowflakes””? [reminder, all italic quotes are from Ch 8 of the SRE book] This is what the SRE thinking on release engineering enables you to do. This advice is sorely needed at many SOCs I’ve seen, where detection work is panic-driven (“OMG, a new threat, how/where do we detect it?! Let’s go search SOC Prime real quick!”) and no repeatable process is practiced. In other words, they do security, but not security engineering. And if you recall, “you can’t Ops your way to 10X SOC, but you can Dev there” (well, I made it up myself, but it does sound cool, no?)
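To make this less abstract, here is a minimal sketch of what “not a snowflake” could look like in practice, assuming detections live as YAML files in a repository: a small Python check, run on every change, that refuses to build rules missing the metadata you need to release them repeatably. The directory layout, field names, and the PyYAML dependency are all assumptions for the example, not a prescription from the SRE book.

```python
# A hypothetical "detections as code" sanity check: every rule file in the
# repo must carry the metadata needed to build and release it repeatably.
# Assumes rules are stored as YAML (PyYAML required) under detections/ ;
# the layout and field names are illustrative, not a standard.
from pathlib import Path
import sys
import yaml

REQUIRED_FIELDS = {"id", "title", "status", "severity", "logsource", "detection", "owner"}

def validate_rule(path: Path) -> list[str]:
    """Return a list of problems found in one rule file (empty list = OK)."""
    try:
        rule = yaml.safe_load(path.read_text())
    except yaml.YAMLError as exc:
        return [f"{path}: not valid YAML ({exc})"]
    if not isinstance(rule, dict):
        return [f"{path}: expected a mapping at the top level"]
    missing = REQUIRED_FIELDS - rule.keys()
    return [f"{path}: missing required fields {sorted(missing)}"] if missing else []

def main() -> int:
    problems = []
    for path in sorted(Path("detections").glob("**/*.yml")):
        problems.extend(validate_rule(path))
    for problem in problems:
        print(problem)
    return 1 if problems else 0  # non-zero exit fails the CI build

if __name__ == "__main__":
    sys.exit(main())
```

Wire something like this into CI, and a “unique snowflake” rule at least fails loudly instead of quietly rotting.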

As we touched on in our metrics post: do you know how long it takes to research a new threat and deploy a reliable (!) detection for it? SREs recommend that you “have tools that report on a host of metrics, such as how much time it takes for a code change to be deployed into production (in other words, release velocity) and statistics on what features are being used in build configuration files.” SOC versions of such metrics help you decide where to deploy your automation and toil-busting efforts.
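As a toy illustration of the SOC version of “release velocity,” here is a sketch that computes the median lag from the moment a rule change is merged to the moment it goes live. The event data is hypothetical; in real life you would pull merge times from your repo and deployment times from your SIEM or EDR.

```python
# A toy "detection release velocity" metric: median time from the moment a
# rule change was merged to the moment it went live in production.
# The records below are hypothetical placeholders.
from datetime import datetime
from statistics import median

merged = {    # rule_id -> time the change was merged
    "rule-0042": datetime(2022, 8, 1, 10, 0),
    "rule-0043": datetime(2022, 8, 3, 9, 30),
}
deployed = {  # rule_id -> time the rule went live in production
    "rule-0042": datetime(2022, 8, 2, 16, 0),
    "rule-0043": datetime(2022, 8, 10, 11, 0),
}

lags_hours = [
    (deployed[rule] - merged[rule]).total_seconds() / 3600
    for rule in merged if rule in deployed
]
print(f"median merge-to-production lag: {median(lags_hours):.1f} hours")
```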

Generally, this topic of automation comes up a lot in release engineering (it is kinda a big deal, as you have guessed!). However, would you trust a machine to create detections for you? In SRE land, “many projects are automatically built and released using a combination of our automated build system and our deployment tools.” Where can we use this in and around our SOC? Can we have a full-auto detection creation pipeline? Admittedly, there are examples where the entire detection pipeline from threat research to detection content deployment is run by the machines, the humans are very often in the chain, at least until your SOC maturity goes up a lot. Still, this is something I would start thinking about as your SOC develops.
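One way to picture the “humans in the chain” compromise, purely as a sketch with invented stage names, is a pipeline where build, test, and staging deployment run automatically, while promotion to production waits for an explicit human approval until your maturity lets you remove that gate.

```python
# A sketch of a "mostly automated" detection pipeline with a human in the
# chain: build, tests, and staging run automatically; promotion to
# production requires an explicit approval. Stage names are invented.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DetectionRelease:
    rule_id: str
    stages_done: list = field(default_factory=list)
    approved_by: Optional[str] = None

    def run_automatic_stages(self) -> None:
        for stage in ("build", "unit_tests", "deploy_to_staging"):
            # in real life each stage would call your CI / SIEM tooling
            self.stages_done.append(stage)

    def promote_to_production(self) -> None:
        if self.approved_by is None:
            raise PermissionError("human approval required before production")
        self.stages_done.append("deploy_to_production")

release = DetectionRelease("rule-0042")
release.run_automatic_stages()
release.approved_by = "detection-engineer-on-duty"  # the human in the chain
release.promote_to_production()
print(release.stages_done)
```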

Automated testing is also hugely useful for detections, and the SRE release advice covers this well: “A continuous test system runs unit tests against the code in the mainline each time a change is submitted, allowing us to detect build and test failures quickly.” Naturally, for many tricky detections this is easier said than done. While nobody wants detections that are 90% false positives deployed, the route to automating the testing (such as via Breach and Attack Simulation tools) may not always be obvious. Another lesson for the SOC here is that while some detection rules are updated daily, others are perhaps touched annually. This means automated testing brings different value to each kind, but you will only know this when you collect the metrics from your environment.
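For the simplest detections, “unit tests against the code in the mainline” can be taken almost literally. The sketch below treats a rule as a predicate over a parsed log event and pins it down with one known-bad and one benign sample; the rule, field names, and samples are all invented for illustration.

```python
# A minimal sketch of "unit tests for a detection": the rule is a predicate
# over a parsed log event, and the tests pin down one known-bad and one
# benign sample. Everything here is illustrative.
import unittest

def detects_encoded_powershell(event: dict) -> bool:
    """Hypothetical rule: PowerShell launched with an encoded command."""
    cmd = event.get("command_line", "").lower()
    return "powershell" in event.get("process_name", "").lower() and "-enc" in cmd

class TestEncodedPowershellRule(unittest.TestCase):
    def test_fires_on_known_bad(self):
        event = {"process_name": "powershell.exe",
                 "command_line": "powershell.exe -enc SQBFAFgA"}
        self.assertTrue(detects_encoded_powershell(event))

    def test_quiet_on_benign(self):
        event = {"process_name": "powershell.exe",
                 "command_line": "powershell.exe -File inventory.ps1"}
        self.assertFalse(detects_encoded_powershell(event))

if __name__ == "__main__":
    unittest.main()
```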

Next, “During the release process, we re-run the unit tests using the release branch and create an audit trail showing that all the tests passed.” In your SOC, this means creating records showing that the testing was done; naturally, such documenting should also happen automatically. While this sounds like a formality, it ends up being useful for future detection tuning and development, and in case of an incident it may prove a degree of diligence on your part. Ever tried to make sense of a sneaky detection use case your predecessor cooked up, complete with multi-line regexes, and loops, and such?
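An audit trail can be as humble as an append-only file written by the test job. Here is one possible shape for such a record; the schema is made up for the example.

```python
# One way (of many) to keep an automatic audit trail: after the test run,
# append a small JSON record of what was tested, when, and against which
# rule version. The schema is invented for illustration.
import json
from datetime import datetime, timezone

def write_audit_record(rule_id: str, rule_version: str,
                       tests_passed: int, tests_failed: int,
                       path: str = "audit_log.jsonl") -> None:
    record = {
        "rule_id": rule_id,
        "rule_version": rule_version,
        "tested_at": datetime.now(timezone.utc).isoformat(),
        "tests_passed": tests_passed,
        "tests_failed": tests_failed,
        "released": tests_failed == 0,
    }
    with open(path, "a") as fh:  # append-only, one JSON object per line
        fh.write(json.dumps(record) + "\n")

write_audit_record("rule-0042", "1.3.0", tests_passed=12, tests_failed=0)
```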

SREs also say that “our goal is to fit the deployment process to the risk profile of a given service.” In your SOC, there are “risky” rules that wake people up at 3AM vs. fairly safe rules that feed the hunting team’s clue funnel. Test and deploy differently based on the “risk” and usage of your detections.
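A sketch of what “fit the deployment process to the risk profile” might look like for detections: each rule declares its response tier, and the tier decides how much review and canarying it gets before full rollout. The tiers and their requirements below are invented, not a standard.

```python
# A sketch of risk-profile-based deployment: map each detection's response
# tier to different testing and rollout requirements. Tiers and values are
# invented for illustration.
from dataclasses import dataclass

@dataclass
class DeployPolicy:
    needs_peer_review: bool
    needs_canary: bool
    canary_hours: int

POLICIES = {
    "pages_oncall":  DeployPolicy(needs_peer_review=True,  needs_canary=True,  canary_hours=72),
    "creates_ticket": DeployPolicy(needs_peer_review=True,  needs_canary=True,  canary_hours=24),
    "hunting_lead":  DeployPolicy(needs_peer_review=False, needs_canary=False, canary_hours=0),
}

def policy_for(rule: dict) -> DeployPolicy:
    """Pick the deployment policy from the rule's declared response tier."""
    return POLICIES[rule.get("response_tier", "hunting_lead")]

print(policy_for({"id": "rule-0042", "response_tier": "pages_oncall"}))
```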

Now, what is the SOC equivalent of the “canary deployment”? This is going to be very cool! “A typical canary deployment involves starting a few jobs in our production environment after the completion of system tests.” They further explain the concept like this: “initially expose just some of your production traffic to the new release using a canary. Canarying allows the deployment pipeline to detect defects as quickly as possible with as little impact to your service as possible.”

So, is it simply deployment in phases? First drop a rule to a few EDR agents, then to the others? Not exactly; those SRE ideas are sneaky! The canary secret is not merely phased deployment, but enhanced telemetry and rapid lesson learning, followed by rapid change and rollback (if needed). After all, if you never learn that the canary died (in a coal mine), there is no value in the approach. Similarly, if you do learn that said canary died, but it takes too long for you to change the approach, there is again no value in it.

So now we’re going to send the new detection rule to some sacrificial sensors, then watch those sensors and compare them to the non-sacrificial ones to check whether they are happy. Did anything crash that we didn’t expect to crash? Did anything trigger an alert off every network packet? Next, we can rapidly adjust so that more problems are prevented. This applies even more to SOAR playbooks, by the way. BTW, the SRE book actually defines three conditions and says that “canarying for a given service requires specific capabilities.” Cool verb, guys!
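To make the “did the canary die?” question concrete, here is a toy check that compares alerts per sensor on the canary group (running the new rule) against the rest of the fleet (not yet running it) and recommends a rollback if the canaries are dramatically noisier. The threshold and the numbers are made up.

```python
# A toy canary check for a new detection rule: compare alerts-per-sensor on
# the canary group against the rest of the fleet and recommend a rollback if
# the canaries are wildly noisier. Threshold and numbers are made up.
from statistics import mean

def canary_verdict(canary_alerts: list, fleet_alerts: list,
                   max_ratio: float = 5.0) -> str:
    """Return 'promote' or 'rollback' based on relative alert volume."""
    canary_rate = mean(canary_alerts)
    fleet_rate = max(mean(fleet_alerts), 0.1)  # avoid division by zero
    if canary_rate > max_ratio * fleet_rate:
        return "rollback"                      # the canary died: way too noisy
    return "promote"                           # looks safe to widen the rollout

# alerts fired per sensor over the canary window (hypothetical numbers)
print(canary_verdict(canary_alerts=[40, 55, 38], fleet_alerts=[2, 1, 3, 0, 2]))
print(canary_verdict(canary_alerts=[3, 2, 4],    fleet_alerts=[2, 1, 3, 0, 2]))
```

The code is trivial on purpose; the value is in actually having the enhanced telemetry, the comparison, and a rollback path that is fast enough to matter.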

Now, “How should you handle versioning of your packages? Should you use a continuous build and deploy model, or perform periodic builds?” This is a question many SOC (or, rather, aspiring detection engineering) teams had better ask themselves. Pick one method and apply it to your detection content and SOAR playbooks as well. In a year, you’ll be happy you did!
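If you do pick a method, even the versioning convention can be boringly simple. Below is one possible convention for a detection content package, loosely borrowed from semantic versioning; it is an assumption for illustration, not something the SRE book prescribes.

```python
# One possible versioning convention for a detection content package,
# loosely borrowing semantic versioning: new rules bump the minor version,
# logic fixes bump the patch, breaking schema/logsource changes bump the
# major. The convention itself is an assumption.
def bump_version(current: str, change_type: str) -> str:
    major, minor, patch = (int(x) for x in current.split("."))
    if change_type == "breaking":   # e.g. renamed fields that SOAR playbooks rely on
        return f"{major + 1}.0.0"
    if change_type == "new_rule":
        return f"{major}.{minor + 1}.0"
    if change_type == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change_type}")

print(bump_version("1.4.2", "new_rule"))   # -> 1.5.0
print(bump_version("1.4.2", "breaking"))   # -> 2.0.0
```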

So, admittedly, this post ended up being less juicy than the previous three, but I suspect there is value in it for some SOCs. At the very least, if, as a result of reading this, somebody realizes that their SOC does release stuff, this may help…

Misc fun detection engineering resources:

Related blogs:
