What is your Cloud SIEM Migration Approach?

Anton Chuvakin
Published in Anton on Security
Oct 18, 2022


This blog is written jointly with Konrads Klints.

TL;DR:

  • Migration from one SIEM to another raises the question of what to do with all the data in the old SIEM. A traditional approach was to let the old SIEM hardware languish until its data was no longer required.
  • When migrating from a cloud-based SIEM “A” to another cloud-based SIEM “B”, you have to contend with the fact that data is not easily transferable across SIEMs and/or that there will be significant data retention costs with the old SIEM (if you keep it running, even without any new log collection).
  • A proposed solution is to export data from SIEM “A” to a temporary data lake, and then use data analytics methods and serverless query engines such as Google Dataproc or AWS Athena for select use cases.
  • This requires only moderate cloud and data analytics expertise, which can be readily sourced if not available in-house.
  • Big cost savings come from keeping the data compressed in an inexpensive storage service such as Google Cloud Storage Nearline or similar.
  • The overall functionality is reduced compared to a full-blown SIEM, but is still acceptable for common older-data use cases: keyword searches for IR, IOC lookups during threat hunts, or compliance data retrievals.

Problem statement

Deploying and using a Security Information and Event Management (SIEM) tool tends to occasionally generate negative emotions in people (no way, right!?). This sometimes leads to teams deciding to replace one product with another. As SIEM is simultaneously a security technology and a data management (and analytics) technology, one of the migration challenges is dealing with log data that has been accumulated, sometimes over a year or more.

Back in Anton’s analyst days, his advice to clients was to NOT even try to migrate the log data. For example, this is what he said in a 2019 blog post:

“There is no migration of collected log data, in most cases. It is just not worth the effort. Prepare to keep the old SIEM running for 9–12 months (some choose to do so unsupported, but YMMV) as a “legacy data store.” Now, you can try to export/convert/import, but this is messy, labor intensive, annoying and, frankly, can just be avoided by keeping the old data in your old SIEM or in some independent log repository (this reminds us why separate SIEM and CLM has been a good idea for many)”

We’d say this view originated in the age of software/appliance SIEM (say, 1998–2010 or so) and has been accepted as conventional wisdom for SIEM migrations by most people. Organizations will keep an old database of SIEM data or a set of old SIEM appliances running, not collecting data, just in case they need to see the data or in case some pesky auditor shows up and asks for it. Some older SIEM tools also allowed offline data archival, so one essentially needed a copy of the software and a bunch of tapes (yes, tapes!) that could, with some luck, be loaded back if needed.

Naturally, their support contract would have expired by then, and in some cases keeping the product running would be considered a license violation, but it was done regardless (after all, they were not using the product). Some would downgrade the license to the cheapest possible option and schedule decommissioning for a year later.

However, suppose you have decided to migrate from one SaaS or cloud-based SIEM platform to another (in this post we use the terms interchangeably, though sometimes people do point out the differences). With a cloud-based SIEM you typically pay for data retention, thus potentially doubling your SIEM costs during the retention period: you pay for the new SIEM and you pay for the old SIEM (example).

For large deployments, this could be financially prohibitive. Also, this just sounds wrong, but what are the choices? Buy another, temporary SIEM? Use cheap (eh…not for petabytes) intermediate storage? Use open source (and then who pays for hardware or cloud storage)?

Or, perhaps we can pay somebody really clever to enable export / convert / import of the data? First, this requires your old SIEM to have a bulk export capability. Second, SIEM tools don’t reliably support ingestion of “raw SIEM dumps” from another tool. Such a raw SIEM dump usually looks like lines of text without metadata. SIEM engineers would have to write custom parsers for an unknown number of log types, with a high probability of parsing errors. Even if the old SIEM can export raw logs, the ingestion process can be really tricky, for a long list of reasons too painful to enumerate here…

How can we inexpensively store security log files for long periods of time, while maintaining a reasonable, SIEM-like functionality?

Objectives

Now, what do we really want?

  • Have searchable log storage for 1 year or longer (the exact number is driven by both IR needs and compliance), so we can retire the old SIEM immediately.
  • Be able to support use cases such as simple hunt searches, incident response investigations and compliance data queries: search/retrieve logs by a few common keywords such as IP address, domain name or username, essentially a substring match.
  • Must be able to select a date range, for performance reasons (don’t search all logs if you only need logs from a specific date range, say April to May 2021).
  • Focus on a few essential fields, such as a hostname, which would usually be present in the first 100 or so bytes of any line in a SIEM bulk log dump and be somewhat consistent across log types.

Couldn’t we just OpenSearch/ELK it?

One approach would be to somehow export the data from the old SIEM and then shove it all into an Elasticsearch/OpenSearch cluster with minimal pre-processing. At first glance, there are several advantages to this: ELK is a well-understood solution that security engineering teams can reasonably cope with. A small, 3–4 node cluster with ample storage would seemingly fit the bill.

The cost breakdown is as follows (a back-of-envelope total follows the list):

  • Storage: the “compression” ratio of raw vs. on-disk data in Elastic is, best case, about 50%, so a 10TB raw dump would require at least 5TB of hot storage attached to a VM, which would cost about $2,000 per month or $24,000 per year. More likely, the storage requirements will be higher.
  • Compute: 4 instances of e2-highmem-4 at about $131 per month per instance, or roughly $6,300 per year (with a ~40% committed-use discount this comes down to less than $4,000 per year).
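
To make the arithmetic explicit, here is a back-of-envelope calculation using the assumptions above (10TB raw dump, best-case 2:1 on-disk ratio, and the illustrative unit prices from the list; re-check against current cloud pricing):

```python
# Back-of-envelope yearly cost of the "just ELK it" option.
# All unit prices are illustrative assumptions from the text, not current list prices.

raw_dump_tb = 10                      # size of the raw SIEM export
elastic_disk_ratio = 0.5              # best-case on-disk size vs raw in Elasticsearch
storage_usd_per_tb_month = 400        # ~5TB of attached hot storage for ~$2,000/month
nodes = 4                             # small e2-highmem-4 cluster
node_usd_per_month = 131

storage_yearly = raw_dump_tb * elastic_disk_ratio * storage_usd_per_tb_month * 12
compute_yearly = nodes * node_usd_per_month * 12

print(f"Storage: ${storage_yearly:,.0f}/yr")     # ~$24,000
print(f"Compute: ${compute_yearly:,.0f}/yr")     # ~$6,300
print(f"Total:   ${storage_yearly + compute_yearly:,.0f}/yr, before engineering time")
```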

We don’t expect the performance to be amazing (because ELK), but for the occasional search this should be just fine. The above example may set you back mid-five digits (say, up to $50K) plus the upfront cost of engineering time, which will vary depending on how many fields you want to extract (this could be as low as one). That’s not terrible overall. However, if you have petabytes of data, it really is that. Terrible!

The main advantage of this approach is that it feels familiar, there’s the search box and the time slider.

Using Google Cloud Platform — A Cheaper, Serverless Approach

It’s 2022 and we live in a world with flexible computing capabilities. Surely we could engineer something that eats less money while offering a lot of flexibility and some analyst creature comforts — an experience that is better than grep?

Google Dataproc Serverless with a Google Cloud Storage backend seemingly meets most of these needs. It has a low running cost, and Spark notebooks allow analysts to make use of SQL-like queries.

Experience shows that unless the goal is to extract all fields from the logs (not what we are aiming at here), a regex’y solution for the desired types of events is usually good enough. This requires moderate data analytics and cloud expertise, perhaps someone with ~3–5 years of experience in the field.
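
To make this concrete, here is a minimal sketch of what such a query could look like in PySpark on Dataproc Serverless, reading gzipped log dumps straight from Cloud Storage. The bucket path, the field regexes and the IOC value are hypothetical placeholders; a real dump would need its own patterns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("old-siem-archive-search").getOrCreate()

# Spark reads .gz text files transparently; each line is one raw log record.
# The bucket and path layout below are made up for illustration.
logs = spark.read.text("gs://example-old-siem-dump/2021/*/*.log.gz")

# Pull out a couple of "essential" fields with regexes (patterns are illustrative
# and will differ per log source), keeping the raw line for substring matching.
parsed = logs.select(
    F.regexp_extract("value", r"(\d{4}-\d{2}-\d{2})", 1).alias("day"),
    F.regexp_extract("value", r"host=(\S+)", 1).alias("hostname"),
    F.col("value").alias("raw"),
)

# Typical "old data" use case: an IOC keyword search bounded by a date range.
hits = parsed.where(
    (F.col("day") >= "2021-04-01")
    & (F.col("day") < "2021-06-01")
    & F.col("raw").contains("198.51.100.23")   # example IOC: an IP address
)

hits.show(20, truncate=False)
```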

Economics

  • Google Cloud Platform Hot Storage is $0.2 per GB/month or $2500 per TB/year. Nearline storage is 50% cheaper.
  • Conservative log compression ratio of 10:1 using gzip. So a 10TB raw dump fits in 1TB of actual storage, which would cost only ~$2,500/year (a quick cost sketch follows this list).
  • With Dataproc Serverless, you only pay for compute capacity when you actually use it.
  • If you keep the data in hot storage and access it from GCP services in the same region (without moving it out), data transfer is free.
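
Putting these numbers together for the same hypothetical 10TB raw dump (again, the unit prices are the rough figures used above, not a quote):

```python
# Rough yearly storage cost for parking the old SIEM dump in Cloud Storage.
raw_logs_tb = 10
gzip_ratio = 10                     # conservative 10:1 compression
hot_usd_per_tb_year = 2500          # the "hot" storage figure used above
nearline_factor = 0.5               # Nearline is roughly 50% cheaper

stored_tb = raw_logs_tb / gzip_ratio
print(f"Hot storage:      ${stored_tb * hot_usd_per_tb_year:,.0f}/yr")                    # ~$2,500
print(f"Nearline storage: ${stored_tb * hot_usd_per_tb_year * nearline_factor:,.0f}/yr")  # ~$1,250
```

Compare that with the roughly $30K per year of the ELK option above.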

An AWS Athena approach

Similarly, AWS Athena is a serverless bulk data query engine. The user is charged for S3 storage and for the amount of data each query scans; nothing is charged while the data sits idle. Under the hood, Athena is a wrapper around the Presto engine. This makes it very attractive for data that isn’t used all the time, which is exactly our case here. We followed a naive approach: no pre-processing of the data, querying it with runtime extraction via regular expressions, and simply uploading the data to S3. From the user’s perspective, as long as the wait time isn’t too long, there is little incentive to optimize query performance.
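
As an illustration, here is roughly what such a query can look like when submitted from Python with boto3. The database, table, S3 locations and the regex are hypothetical; the assumption is that a simple external table with a single string column per raw log line has already been created over the S3 dump:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Runtime extraction with regexes and plain substring matching over raw lines.
# Table/column names and the patterns below are placeholders.
query = r"""
SELECT regexp_extract(line, 'host=(\S+)', 1) AS hostname,
       line
FROM old_siem_dump
WHERE line LIKE '%198.51.100.23%'    -- IOC substring match
  AND line LIKE '%2021-05-%'         -- crude date narrowing over raw text
LIMIT 100
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "siem_archive"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Started query:", response["QueryExecutionId"])
```

Results land in the configured S3 output location and can be fetched with get_query_results or read directly from S3.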

Economics

  • You are billed for the data your queries scan in S3 (plus the S3 storage itself), which for occasional queries means there is little pressure to optimize them.
  • No separate charge for compute. Cool!

We encountered several problems along the way:

  • Athena doesn’t understand ISO 8601-formatted timestamps out of the box, nor does it understand timezones. This means the data either needs to be preprocessed to bring it to a common, agreed timezone such as UTC, or the queries must take it into account. Not a major obstacle, but it highlights that some data analytics skills are required for the whole export-SIEM-to-serverless approach.
  • One way to reduce per-query cost is to store the data in a partitioned fashion in S3, using a hierarchical folder structure: performance improves and cost drops because Athena no longer has to scan the entire data set. However, the data has to be arranged into that hierarchical layout beforehand, which may require some pre-processing (see the layout sketch below).
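
For example, a hierarchical layout along these lines (bucket and paths are purely illustrative) lets Athena prune partitions by date instead of scanning the whole dump; a tiny helper like this could compute the destination key while the data is being rearranged:

```python
from datetime import datetime

# Target layout (hypothetical):
#   s3://example-siem-archive/logs/year=2021/month=05/day=17/part-0001.log.gz

def partition_key(ts: datetime, filename: str) -> str:
    """Build a Hive-style partitioned S3 key from an event timestamp."""
    return f"logs/year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}"

print(partition_key(datetime(2021, 5, 17), "part-0001.log.gz"))
# -> logs/year=2021/month=05/day=17/part-0001.log.gz
```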

Overall, this approach also works.

Conclusions

Migration from one SIEM to another raises the question of what to do with data in the old SIEM. A traditional approach was to let the old SIEM and its hardware languish until the data was no longer required.

A solution to this would be to export data from SIEM “A” as a log dump and use data analytics methods (a data lake plus serverless query engines such as Google Dataproc Serverless or AWS Athena) for select use cases. This requires only moderate cloud and data analytics expertise, which can be readily sourced if not available in-house.

The overall functionality is reduced compared to a full-blown SIEM, but is still acceptable for typical older-data use cases: keyword searches for IR, IOC lookups during threat hunts, or compliance data retrievals.

Finally, one can use the same approach when migrating from an on-prem SIEM, provided the data can be uploaded to the cloud cost-effectively.

P.S. All cost estimates are essentially somewhat educated guesses; do your own math on your own data, please.
