How to SLO Your SOC Right? More SRE Wisdom for Your SOC!

Anton Chuvakin · Published in Anton on Security · 7 min read · Mar 16, 2022

As we discussed in “Achieving Autonomic Security Operations: Reducing toil” (or its earlier version “Kill SOC Toil, Do SOC Eng”) and “Stealing More SRE Ideas for Your SOC”, your Security Operations Center (SOC) can learn a lot from what IT operations learned during the SRE revolution. In this post of the series, we plan to extract the lessons for your SOC centered on another SRE principle: Service Level Objectives (SLOs).

In brief, this is about metrics. SOC metrics have long fascinated me, and this is a chance to learn from a new domain that is generally ahead of security in its systems thinking.

Before we go there, what’s an SLO? “An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.” OK, what’s an SLI? Well, “An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.” (all quotes are from the SRE book here)

So? We measure something (SLI) and we set the target value (SLO). Now, what about people who have only heard of SLAs? Well, an SLA is an agreement about the above: “an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.”
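A minimal sketch of how the three relate, in Python. Everything specific here — the metric (alert triage time), the 30-minute threshold, and the 95% target — is a hypothetical choice for illustration, not a recommendation:

```python
# Hypothetical SLI/SLO sketch. The metric, threshold, and target are
# illustrative assumptions, not prescribed values.

SLO_TARGET = 0.95    # SLO: 95% of alerts triaged within the threshold
THRESHOLD_MIN = 30   # ... and "within the threshold" means 30 minutes

def sli_triage_minutes(alerts):
    """SLI: a measured quantity -- minutes from alert creation to first
    analyst touch. `alerts` is a list of (created_epoch, touched_epoch) pairs.
    """
    return [(touched - created) / 60 for created, touched in alerts]

def slo_met(samples_min):
    """SLO: a target value for the SLI -- here, the fraction of alerts
    triaged within the threshold."""
    within = sum(1 for m in samples_min if m <= THRESHOLD_MIN)
    return within / len(samples_min) >= SLO_TARGET

# An SLA would be a contract wrapping slo_met() with consequences for a miss.
```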

I am not going to spout clichés like “what gets measured gets done” here, but metrics and SLIs/SLOs will to a large extent determine the fate of your SOC. My favorite, if a bit dated, example is: SOCs (including at some MSSPs) that obsessively focus on “time to address the alert” (that they naively consider to be the same as MTTD, BTW … WTF) end up radically reducing their security effectiveness while making things go “whoosh” fast. If you equate MTTD with “time to address the alert” and then push the analyst to shorten this time, you will not have a good time … while the attacker will.

So, yes, SREs also start the SLO discussion with the reminder that “choosing appropriate metrics helps to drive the right action.”

Now, a naïve view of metrics would be that “whatever sounds bad” (problems per second, incidents per employee, etc.) needs to be minimized while “whatever sounds good” (successes, reliability, uptime, etc.) needs to be maximized … ad infinitum. But hey… here is a new insight: sometimes good metrics have an optimum level, and yes, even reliability (and maybe even security). Read the SLO chapter in the book for a full example, but they describe a service whose reliability was too high. How is that bad? “Its high reliability provided a false sense of security because the services could not function appropriately when the service was unavailable, however rarely that occurred. […] SRE makes sure that global service meets, but does not significantly exceed, its service level objective.”

There is a fun SOC lesson here: some security metrics have an optimum value. The above-mentioned time to detect, I bet, has an optimum for your organization at least, if not a global optimum (similar to my patch sound barrier). Another example: the number of phishing incidents — which screams to be pushed to 0, right? — may have an optimum too: if nobody phishes you, this is probably because they already have credentialed access to many of your systems. So in your SOC, think of SLI optimums, and don’t automatically assume 0 or infinity for metrics.

The SRE book reminds us that “good metrics” may need to be balanced with other metrics, rather than blindly pushed up. “User-facing serving systems generally care about availability, latency, and throughput. […] Storage systems often emphasize latency, availability, and durability. […] Big data systems, such as data processing pipelines, tend to care about throughput and end-to-end latency.” In a SOC, this may mean that you can detect fast, review all context, perform deep threat research — but the balance may differ for various threats and situations. So, think combinations of metrics, not mere numbers.

Another lesson from SREs of value to your SOC: “Whether or not a particular service has an SLA, it’s valuable to define SLIs and SLOs and use them to manage the service.” Indeed, I agree that SLIs and SLOs matter more for your SOC than any agreements, i.e., SLAs. Metrics and targets before handshakes!

Now, here is a gem: “Most metrics are better thought of as distributions rather than averages.” For you, my sole statistically skilled reader, this is obvious. For others: what do you make of an average alert response of 20 minutes? Is this “all alerts are addressed in 18–22 minutes” or “all alerts are addressed in 5 minutes, while 1 alert is addressed in 6 hours”?

In fact, “The higher the variance in response times, the more the typical user experience is affected by long-tail behavior.” This is definitely something I’ve seen in SOCs — that one outlier event is probably the one that matters most. To this, SRE advice is “Using percentiles for indicators allows you to consider the shape of the distribution.”
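To see why percentiles beat averages, here is a small self-contained example with made-up response times that echo the scenario above: seventeen 5-minute alerts plus one 6-hour outlier.

```python
import math
import statistics

# Hypothetical response times (minutes): 17 fast alerts, one long-tail outlier.
response_min = [5] * 17 + [360]

def percentile(data, p):
    """Nearest-rank percentile: the value at sorted index ceil(p/100 * n)."""
    s = sorted(data)
    return s[math.ceil(p / 100 * len(s)) - 1]

avg = statistics.mean(response_min)  # ~24.7 min: the comforting "average"
p50 = percentile(response_min, 50)   # 5 min: the typical experience
p95 = percentile(response_min, 95)   # 360 min: the outlier that matters most
```

The average suggests everything takes about 25 minutes; the p50/p95 pair shows that most alerts are handled in 5 minutes while the tail sits at 6 hours — exactly the shape an average hides.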

The other epically useful concept from SRE is of course “the error budget.” This foundational concept may not be clear to my security peers, so here is the SRE advice verbatim: “allow an error budget — a rate at which the SLOs can be missed — and track that on a daily or weekly basis. (An error budget is just an SLO for meeting other SLOs!)”

The SOC value here is not immediately obvious, but here it is: this is security, and the game is ultimately not about the metrics; it is about the threat actor. I’d rather miss the SLO but catch the threat in my environment. I’d rather spend more time than comply with rigid time metrics. Ultimately, the defenders win when the attacker loses, not when the defenders “comply with an SLA.” The error budget concept is your friend here.
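One way to make the error budget concrete, as a sketch; the 2% weekly budget and the function name are illustrative assumptions, not a recommendation:

```python
# Hypothetical error budget: the fraction of alerts allowed to miss their
# triage SLO per week. The 2% figure is illustrative, not prescribed.
ERROR_BUDGET = 0.02

def budget_remaining(total_alerts, slo_misses):
    """Fraction of the weekly error budget still unspent (0.0 when blown)."""
    allowed = total_alerts * ERROR_BUDGET
    return max(0.0, (allowed - slo_misses) / allowed)
```

While the budget is unspent, analysts can deliberately trade speed for depth on the alerts that deserve it — which is exactly the flexibility a rigid time SLA forbids.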

Further, the SRE thinking goes like this: “It’s both unrealistic and undesirable to insist that SLOs will be met 100% of the time: doing so can reduce the rate of innovation and deployment.“ More broadly, and as we say in our recent paper with Deloitte on SOC (“Future Of The SOC: Process Consistency and Creativity: a Delicate Balance”), “this adherence to process and lack of ability for the SOC to think critically and creativity provides potential attackers with another opportunity to successfully exploit a vulnerability within the environment, no matter how well planned the supporting processes are.“ (read more here)

Now, here is another brain teaser from our SRE brethren: “Don’t pick a target based on current performance.” Huh? This is very common in my experience, so is it really that bad? Let’s see: an analyst handles 30 alerts a day (SLI), their manager wants to improve by 15%, so they set the SLO to 35 alerts a day. All good? Well, wait a second. How many alerts are there? Leaving aside the question of whether this is the right SLI for your SOC (spoiler: it is not), what if you have 5000 alerts, and you drop 4970 of them on the floor? When you “improve,” you will drop merely 4965 on the floor. Is this a good SLO? No: you need to hire, automate, filter, tune or change other things in your SOC, not set better SLO targets.
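The arithmetic behind this example, using the numbers from the text (the helper name is mine):

```python
# Numbers from the example above: 5000 daily alerts, one analyst.
DAILY_ALERTS = 5000

def dropped_on_floor(handled_per_day):
    """Alerts left untriaged under a per-analyst throughput SLO."""
    return DAILY_ALERTS - handled_per_day

before = dropped_on_floor(30)  # 4970 alerts dropped
after = dropped_on_floor(35)   # 4965 dropped: a 0.1% coverage "improvement"
```

The “improved” SLO moves coverage from 0.6% to 0.7% of alerts, which is why the fix lies elsewhere: hiring, automation, filtering, tuning.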

To this, our SRE peers say: “As a result, we’ve sometimes found that working from desired objectives backward to specific indicators works better than choosing indicators and then coming up with targets.“ AND “Start by thinking about (or finding out!) what your users care about, not what you can measure.” In the SOC, this probably means start with threat models and use cases, not the current alert pipeline performance.

Now, here is a cryptic one: how many metrics do I need in my SOC? SREs wax philosophical here: “Choose just enough SLOs to provide good coverage of your system’s attributes.” In my experience, I’ve not seen people succeed with more than 10, and I’ve not seen people describe and optimize SOC performance with fewer than 3. In other words, I don’t know. However, SREs offer a neat test: “if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.”

SLOs will come to define your SOC, so define them the way you want your SOC to be: “It’s better to start with a loose target that you tighten than to choose an overly strict target that has to be relaxed when you discover it’s unattainable. SLOs can — and should — be a major driver in prioritizing work for SREs and product developers, because they reflect what users care about.” This of course applies verbatim in a SOC.

Finally, make SLOs for your SOC public (well, within the company), just as SREs advise: “Publishing SLOs sets expectations for system behavior.” The benefit is that nobody can blame you for non-performance if you perform to those agreed-upon SLOs (which may become SLAs).

In closing, if you are tired of reading my ramblings, here is a SOC metrics resource list I solicited from the community:

(for other advice and specific metrics, mine this thread)


