An efficient way to store trillions of access log lines
http://ajdiaz.me/doc/2016/10171-efficient-way-to-store-trillions-of-accesslog.txt
Version: 2016-10-17

When handling a lot of access log records (this approach should work for
any other kind of structured log too), you usually spend a lot of disk
space saving the logs. But if you analyze the content of those logs, you
will probably notice that there are common patterns in them.

In fact, you have a limited number of available paths for your server
requests. For example, if you expose three different paths, let's say
/one, /two and /three, then even if you receive trillions of requests
per second, every one of them maps to one of those paths.

So, our first approach could be to save the logs by path. Saving logs by
path, we reduce the number of lines to be stored: we just record that
path /one was hit n times, and so on for the other paths.

Furthermore, we can separate other components too. For example, path
/one usually receives GET requests and no POST, so we can create a
directory structure like the following:

    <path>/<method>/.../other

where "other" is a file that contains only the most variable fields
(IP address, response time, etc.).

In that model, the "other" file could be a database which contains
n-tuples with a counter field, like this:

    (<ip address>, <response time>) = <hits>

For example:

    (127.0.0.1, 123.2) = 12

which means that there are 12 hits for the pair composed of that
concrete IP address and that response time.

Of course, using a highly variable field like time will generate a long
list of one-hit entries, but you usually do not care about the exact
response time, only about its rough scale: it is irrelevant whether a
request took 123.2 ms or 121.3 ms; the relevant information is that it
is between 100 ms and 150 ms. Discretizing these values should greatly
reduce the final size of the hits database.

Also, using the creation time of the file and the current time of the
system, you can derive the number of hits per second without saving any
other information.
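
For illustration, here is a minimal sketch in Python of the layout
described above. It is only a sketch under stated assumptions: the names
(aggregate, discretize, the "logdb" root directory, the 50 ms bucket
width) are invented for the example, and a real implementation would
merge the counters into a proper key-value database instead of appending
lines to the "other" file.

    #!/usr/bin/env python3
    # Sketch: aggregate access log records under <path>/<method>/ and
    # store (ip, discretized response time) counters in an "other" file.
    # All names and the on-disk format are assumptions of this example.
    import os
    from collections import Counter

    BUCKET_MS = 50  # assumed bucket width: 123.2 and 121.3 both land in 100-150

    def discretize(response_ms, bucket=BUCKET_MS):
        # Lower bound of the time bucket the value falls into.
        return int(response_ms // bucket) * bucket

    def aggregate(records, root="logdb"):
        # records: iterable of (path, method, ip, response_ms) tuples.
        counters = {}  # (path, method) -> Counter of (ip, bucket)
        for path, method, ip, response_ms in records:
            key = (path.strip("/"), method)
            counters.setdefault(key, Counter())[(ip, discretize(response_ms))] += 1
        for (path, method), counter in counters.items():
            directory = os.path.join(root, path, method)
            os.makedirs(directory, exist_ok=True)
            with open(os.path.join(directory, "other"), "a") as fd:
                for (ip, bucket), hits in counter.items():
                    fd.write("(%s, %d) = %d\n" % (ip, bucket, hits))

    if __name__ == "__main__":
        aggregate([
            ("/one", "GET", "127.0.0.1", 123.2),
            ("/one", "GET", "127.0.0.1", 121.3),  # same bucket, same counter
            ("/two", "POST", "10.0.0.7", 312.0),
        ])

Running this once produces logdb/one/GET/other containing the single
line "(127.0.0.1, 100) = 2", which is the whole point: two raw log lines
collapse into one counted tuple.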
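
In the same spirit, a rough sketch of the last point: deriving the hit
rate from the age of the file and the stored counters, without saving
any timestamp per hit. hits_per_second is again a name invented for the
example, and os.path.getctime is only an approximation of the creation
time (on Linux it actually reports the inode change time).

    import os
    import time

    def hits_per_second(other_file):
        # Sum every counter in an "other" file and divide by the file age.
        with open(other_file) as fd:
            total = sum(int(line.rsplit("=", 1)[1]) for line in fd if "=" in line)
        age = time.time() - os.path.getctime(other_file)
        return total / age if age > 0 else float(total)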