An efficient way to store trillions of access log lines
http://ajdiaz.me/doc/2016/10171-efficient-way-to-store-trillions-of-accesslog.txt
Version: 2016-10-17

When handling a lot of access log records (this approach should work for
any other kind of structured log too), you usually spend a lot of disk
space saving the logs. But if you analyze the content of those logs, you
will probably notice that there are common patterns in them.

In fact, you have a limited number of available paths for your server
requests. For example, if you expose three different paths, let's say
/one, /two and /three, then even if you receive trillions of requests
per second, every one of them maps to one of those paths.

So, our first approach could be to save the logs by path. Saving logs by
path, we reduce the number of lines to be stored: we just record that
path /one was hit n times, and so on for the other paths.

Furthermore, we can separate other components too. For example, path
/one usually receives GET requests and no POST, so we can create a
directory structure like the following:

    <path>/<method>/.../other

where "other" is a file that contains only the most variable fields
(IP address, response time, etc.).

In that model, the "other" file could be a database which contains
n-tuples with a counter field, like this:

    (<ip address>, <response time>) = <hits>

For example:

    (127.0.0.1, 123.2) = 12

which means that there are 12 hits for the pair composed of that
concrete IP address and that response time.

Of course, using a highly variable field like time will generate a long
list of one-hit entries, but you usually do not care about the exact
response time, only about its rough scale: it is irrelevant whether a
request took 123.2 ms or 121.3 ms; the relevant information is that it
is between 100 ms and 150 ms. Discretizing these values should greatly
reduce the final size of the hits database.

Also, using the creation time of the file and the current time of the
system, you can derive the number of hits per second without saving any
other information.
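
For illustration, here is a minimal sketch in Python of the layout
described above. It is only a sketch under stated assumptions: the names
(aggregate, discretize, the "logdb" root directory, the 50 ms bucket
width) are invented for the example, and a real implementation would
merge the counters into a proper key-value database instead of appending
lines to the "other" file.

    #!/usr/bin/env python3
    # Sketch: aggregate access log records under <path>/<method>/ and
    # store (ip, discretized response time) counters in an "other" file.
    # All names and the on-disk format are assumptions of this example.
    import os
    from collections import Counter

    BUCKET_MS = 50  # assumed bucket width: 123.2 and 121.3 both land in 100-150

    def discretize(response_ms, bucket=BUCKET_MS):
        # Lower bound of the time bucket the value falls into.
        return int(response_ms // bucket) * bucket

    def aggregate(records, root="logdb"):
        # records: iterable of (path, method, ip, response_ms) tuples.
        counters = {}  # (path, method) -> Counter of (ip, bucket)
        for path, method, ip, response_ms in records:
            key = (path.strip("/"), method)
            counters.setdefault(key, Counter())[(ip, discretize(response_ms))] += 1
        for (path, method), counter in counters.items():
            directory = os.path.join(root, path, method)
            os.makedirs(directory, exist_ok=True)
            with open(os.path.join(directory, "other"), "a") as fd:
                for (ip, bucket), hits in counter.items():
                    fd.write("(%s, %d) = %d\n" % (ip, bucket, hits))

    if __name__ == "__main__":
        aggregate([
            ("/one", "GET", "127.0.0.1", 123.2),
            ("/one", "GET", "127.0.0.1", 121.3),  # same bucket, same counter
            ("/two", "POST", "10.0.0.7", 312.0),
        ])

Running this once produces logdb/one/GET/other containing the single
line "(127.0.0.1, 100) = 2", which is the whole point: two raw log lines
collapse into one counted tuple.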
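
In the same spirit, a rough sketch of the last point: deriving the hit
rate from the age of the file and the stored counters, without saving
any timestamp per hit. hits_per_second is again a name invented for the
example, and os.path.getctime is only an approximation of the creation
time (on Linux it actually reports the inode change time).

    import os
    import time

    def hits_per_second(other_file):
        # Sum every counter in an "other" file and divide by the file age.
        with open(other_file) as fd:
            total = sum(int(line.rsplit("=", 1)[1]) for line in fd if "=" in line)
        age = time.time() - os.path.getctime(other_file)
        return total / age if age > 0 else float(total)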