Motivations

When we start playing with several machines, Sysmon and WECs (Windows Event Collectors), we quickly see the amount of logs growing. Knowing the EVTX file format quite well, we realized that no compression is built in. In order to optimize the storage of the collected events, we decided to run basic compression tests, which we describe here.

It is worth noting that we are not going to discuss storing the logs in any kind of database in order to query the events afterwards. Doing so has pros and cons. The main advantage is probably that we can retrieve the information available in the log files more quickly. The downside is that this approach has a cost in engineering, storage and maintenance. In my opinion, the best approach would be to index the files (by time, for instance) in a database and keep flat storage alongside to look for interesting information when needed. But that is not the topic of this post!

Interlude

Maybe (or not) you are asking yourself why EVTX files cannot be bigger than 4GB. To satisfy your curiosity, we are going to briefly explain the mystery behind this.

Hereafter is the structure describing the EVTX header.

// Go style structure
type FileHeader struct {
    Magic           [8]byte
    FirstChunkNum   uint64
    LastChunkNum    uint64
    NextRecordID    uint64
    HeaderSpace     uint32
    MinVersion      uint16
    MajVersion      uint16
    ChunkDataOffset uint16
    ChunkCount      uint16
    Unknown         [76]byte
    Flags           uint32
    CheckSum        uint32
}

As you can see, the ChunkCount member is a 16-bit unsigned integer, so a file can contain at most 65535 chunks. The second thing is that a chunk has a fixed size of 64kB. A simple calculation gives the answer: 65535 chunks × 64kB is just under 4GB.
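
The calculation can be spelled out with a minimal Go sketch. The constant and function names (chunkSize, maxChunks, maxChunkBytes) are illustrative and do not come from any EVTX library, and the sketch ignores the space taken by the file header itself.

```go
package main

import "fmt"

const (
	// chunkSize is the fixed size of an EVTX chunk (64kB, i.e. 65536 bytes).
	chunkSize = 64 * 1024
	// maxChunks is the largest value a 16-bit unsigned ChunkCount can hold.
	maxChunks = 1<<16 - 1
)

// maxChunkBytes returns the maximum amount of chunk data, in bytes, that
// can be counted with a 16-bit chunk counter.
func maxChunkBytes() uint64 {
	return uint64(maxChunks) * uint64(chunkSize)
}

func main() {
	total := maxChunkBytes()
	// Prints: 4294901760 bytes (~4.00 GiB)
	fmt.Printf("%d bytes (~%.2f GiB)\n", total, float64(total)/(1<<30))
}
```

The result is exactly one chunk (64kB) short of 4GiB, which is where the 4GB limit comes from.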

Experiments

  1. Keep the EVTX files as they are and compress them with ZIP
  2. Convert the EVTX files into JSON (one event per line) and compress the resulting file

Experimental Settings

  • MacBook Pro 2011
  • Intel(R) Core(TM) i5-2415M CPU @ 2.30GHz
  • 16GB of RAM
  • Mid-range SSD
  • 104GB of EVTX files collected from a WEC
  • Raw logs compression achieved with 7z
  • JSON post-processing done with evtxdump and compressed with GZIP: evtxdump -u input.evtx | gzip > output.json.gzip

Compression

We have summarised the results of our compression tests in the following table. As expected, compressing the raw Windows Event Logs is faster, since no parsing is involved before compression.

However, the good surprise is that EVTX files post-processed to JSON are highly compressible: we managed to compress 104GB into 1.69GB.

Table 1: Compression Experiment Results
Test | Compressed Size | Compression Speed | Compression Ratio
1 (raw compressed) | 4.12GB | 19.6MB/s | 25
2 (JSON compressed) | 1.69GB | 4.5MB/s (2002 eps) | 61.5
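
The ratios in Table 1 are simply the original size divided by the compressed size. The short Go sketch below redoes that arithmetic and also estimates the total compression time from the measured speeds. The function names are illustrative, and the duration estimate assumes 1GB = 1000MB, which the measurements above do not specify.

```go
package main

import "fmt"

// ratio returns the compression ratio: original size over compressed size.
func ratio(originalGB, compressedGB float64) float64 {
	return originalGB / compressedGB
}

// hours estimates how long it takes to compress sizeGB at speedMBps.
// Assumption: 1GB = 1000MB.
func hours(sizeGB, speedMBps float64) float64 {
	return sizeGB * 1000 / speedMBps / 3600
}

func main() {
	// Sizes and speeds are the ones reported in Table 1.
	fmt.Printf("raw:  ratio %.1f, about %.1f h\n", ratio(104, 4.12), hours(104, 19.6))
	fmt.Printf("JSON: ratio %.1f, about %.1f h\n", ratio(104, 1.69), hours(104, 4.5))
}
```

Under that assumption, the whole data set takes roughly an hour and a half to compress raw, against more than six hours through the JSON pipeline.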

Search

We detail here the search performance we can expect depending on the storage format chosen for the events. Speeds are given in events per second (eps).

To do so we used the following tools:

  • libevtx: parsing library for Windows Event Logs which includes command line utilities
  • gene: an engine to search within EVTX based on rules
  • grep: I think we do not need to introduce this one

We decided to do a very simple search consisting of finding all the Sysmon events containing the string svchost.exe. It is rather complicated to define complex searches with grep since it is unaware of the semantics of its input. Therefore, this benchmark can be seen as a best-case search scenario.

Table 2: Search Experiment Results
Test | Search Speed with evtxexport and grep | Search Speed with grep | Search Speed with gene
1 (raw compressed) | N/A (bug in evtxexport) | N/A (would lead to unusable results) | 2127 eps
2 (JSON compressed) | N/A | 225446 eps | 18080 eps

The best search performance is achieved with the compressed JSON storage format. The slow search speed on compressed raw EVTX can be explained by the fact that we moved the parsing step from the compression phase to the search phase. In the second storage scenario, however, we take advantage of how lightweight JSON is to parse. The fact that compressed JSON files can be quickly decompressed also makes this format suitable for human reading.

We also note that grep, in spite of its speed, is not the best way to search the files. Depending on the search you want to perform, it can become quite inaccurate because grep is unaware of the semantics of the events. So if you are dealing with a high volume of events, it is probably a bad idea to grep your logs, since it can produce many false positives. A preferred way would be to use our homemade tool gene.
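
To illustrate the difference between a grep-like match and a field-aware one, here is a minimal Go sketch that scans gzip-compressed, line-delimited JSON. It is not how gene works. The JSON layout (Event.EventData.Image), the sample events and the names countMatches and gzipLines are assumptions made up for the example, so adapt them to what your own evtxdump output looks like.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// event mirrors only the part of an event we need. The field layout is an
// assumption made for this sketch.
type event struct {
	Event struct {
		EventData struct {
			Image string
		}
	}
}

// countMatches reads gzip-compressed JSON lines from r and returns how many
// lines contain needle anywhere (grep-like) and how many have an Image
// field ending with needle (field-aware).
func countMatches(r io.Reader, needle string) (substr, field int, err error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return 0, 0, err
	}
	defer zr.Close()

	sc := bufio.NewScanner(zr)
	sc.Buffer(make([]byte, 0, 64*1024), 1<<20) // allow long event lines
	for sc.Scan() {
		line := sc.Bytes()
		if bytes.Contains(bytes.ToLower(line), []byte(needle)) {
			substr++
		}
		var e event
		if json.Unmarshal(line, &e) != nil {
			continue // skip malformed lines
		}
		if strings.HasSuffix(strings.ToLower(e.Event.EventData.Image), needle) {
			field++
		}
	}
	return substr, field, sc.Err()
}

// gzipLines compresses the given lines into an in-memory gzip stream.
func gzipLines(lines ...string) *bytes.Buffer {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	for _, l := range lines {
		fmt.Fprintln(zw, l)
	}
	zw.Close()
	return &buf
}

func main() {
	// Two made-up events: only the first one is actually run by svchost.exe.
	data := gzipLines(
		`{"Event":{"EventData":{"Image":"C:\\Windows\\System32\\svchost.exe"}}}`,
		`{"Event":{"EventData":{"Image":"C:\\Windows\\System32\\cmd.exe","CommandLine":"findstr svchost.exe log.txt"}}}`,
	)
	substr, field, err := countMatches(data, "svchost.exe")
	if err != nil {
		panic(err)
	}
	fmt.Println(substr, field) // prints: 2 1
}
```

The substring count reports both events, while the field-aware count keeps only the one whose process image really is svchost.exe. The second event is the kind of false positive a plain grep produces.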

Conclusion

To conclude, the solution we would recommend for storing your Windows Event Logs in flat files is to first pre-process them with evtxdump and then compress them. We chose GZIP as the compression algorithm in our tests; however, we let readers try other algorithms and choose their favourite.
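
As a starting point for such comparisons, the Go sketch below measures the compressed size of a buffer at several gzip levels using the standard compress/gzip package. It only varies the gzip level; other algorithms would need their own libraries. The input is placeholder data rather than real events, and gzipSize is an illustrative name.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"strings"
)

// gzipSize returns the size in bytes of data once compressed with gzip at
// the given level (for instance gzip.BestSpeed or gzip.BestCompression).
func gzipSize(data []byte, level int) (int, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return 0, err
	}
	if _, err := zw.Write(data); err != nil {
		return 0, err
	}
	// Close flushes the remaining compressed bytes and the gzip footer.
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	// Placeholder input: repeated JSON-like lines, not real event data.
	line := `{"EventID":1,"Image":"C:\\Windows\\System32\\svchost.exe"}` + "\n"
	data := []byte(strings.Repeat(line, 10000))

	levels := []int{gzip.BestSpeed, gzip.DefaultCompression, gzip.BestCompression}
	for _, level := range levels {
		n, err := gzipSize(data, level)
		if err != nil {
			panic(err)
		}
		fmt.Printf("level %2d: %d -> %d bytes\n", level, len(data), n)
	}
}
```

To get meaningful numbers, replace the placeholder buffer with a sample of your own JSON events and time each run as well, since the trade-off between ratio and speed is what matters here.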

Table 3: Pros & Cons matrix of chosen approach
Pros | Cons
Best compression ratio | Slow Compression Time (4.5MB/s)
Best Search Time |
Grepable |
Human Readable format |

It is worth noting that the compression time could become acceptable if the compression step were run on a more recent and powerful machine. Indeed, the limiting factor in our case is the computing power (CPU) of the machine used for the experiment.