2009-04-29 09:42 Logster is out! I'm scared!


Finally. Logster is out for everyone to download and use. Logster is an easy tool for quickly visualizing where the traffic to your web server comes from. All you need is your access logs in the Common (or Combined) Log Format, which e.g. Apache produces by default. I've also heard that a certain someone created a quick script to convert their firewall logs into something vaguely resembling Common Log Format, and that it worked. Logster isn't picky.

We've also launched our web shop where you can buy full licenses for Logster. With a license, Logster won't nag you every 60 seconds and interrupt your totally rad visualization X-perience.

And that's not all! In fact, Logster has been available to everyone for a week already. We just haven't been very vocal about it. That's because we thought it would be interesting to actually see how the word gets out. Therefore we have devised The Great Logster Social Experiment. The idea is to phase our advertising efforts. Instead of going all out straight away, we started small by informing only our immediate social circle. Now we're at week two, so it's time to hit the blogosphere. Then, week after week, we'll bring out bigger and bigger guns by blogging, contacting the media, etc. Along the way we intend to visualize the web traffic to Logster with Logster and put the results up for everyone to see.

Jani made a visualization from pre-kickoff and kickoff, just to get a baseline:

So, download Logster, spread the word, and nobody gets hurt! ;)

WARNING: Technical rambling


Logster in itself is pretty neat, but the coolest feature (in my opinion) is easy to miss. At least it was the most fun thing to implement: random access to (seekable) gzipped files. To save disk space, Apache often compresses access logs with gzip, and this posed an interesting problem.

With Logster you can quite freely seek through your access logs (and time itself!) with a slider. This is easy with normal uncompressed access logs, as one can always start reading from the offset one just seeked to and get out more or less coherent data. Not so with gzipped files. If you want to uncompress a piece of a gzipped file, the uncompressor often needs to know how the data preceding that point uncompresses. And to have that knowledge, it may need to know something about the data preceding even that. And so on. And the data may be bit-packed. My writing skills fail me, so just remember that to our eyes a gzipped file looks like an impenetrable jumble of bits, where to make anything out of a piece of data you usually have to know everything that came before it.

The easiest thing, for us but not for the user, would have been to say "no bonus, Logster supports only uncompressed stuff". That would have been a drag. The next obvious option would have been to just automatically uncompress the whole file into memory or a temporary directory. This would again make things easy and fast for us, as seeking through uncompressed data is a solved problem. But it could also potentially eat up a lot of memory or disk space. At least I usually tend to be a bit short of both. Python (yeah, we use Python) also has a library for gzipped data which has something that emulates seeking within such data. But that is done by starting from the beginning of the data and uncompressing until the seek point has been reached. Slow.
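For reference, here's roughly what that naive approach looks like with the standard library's gzip module (the file name and offset are just placeholders):

    # seek() on a gzip module file object works, but under the hood it just
    # decompresses and throws away everything between the current position and
    # the target (rewinding to the start of the file first when seeking
    # backwards), so jumping around a large log gets slow.
    import gzip

    with gzip.open("access.log.1.gz", "rb") as f:
        f.seek(50 * 1024 * 1024)   # jump 50 MB (uncompressed) into the log
        print(f.readline())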

Turns out that zlib, the library for handling gzipped data, has a mechanism for taking a snapshot of the state of the uncompressor. Hmm. Would it be feasible to just run through the data once, uncompressing as we go and keeping only one snapshot per uncompressed megabyte or so? Then when we want to seek to a certain place in the file, we can just look up the closest snapshot preceding that point, push that snapshotted state back into the machinery, and uncompress until the desired data offset is reached. But that can't be fast, can it?

Turns out that if we cache the last accessed uncompressed megabyte, seeking and reading can indeed be quite fast. Sometimes even faster than roaming around plain uncompressed data, as gzipped files are smaller and disk caches do their tricks. The initial step of going through and indexing the data is also bearable; often faster than uncompressing the whole thing to disk, in fact.
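To make the idea concrete, here is a rough sketch in Python. This is not Logster's actual code: the class name SeekableGzip, the constants CHUNK and SNAPSHOT_EVERY, and the read_at interface are all made up for illustration, and it assumes a single-member gzip file. The snapshot mechanism is zlib's decompressobj().copy(), which duplicates the full inflate state so decompression can be resumed from that exact point later.

    import zlib

    CHUNK = 64 * 1024              # compressed bytes fed to zlib per step
    SNAPSHOT_EVERY = 1024 * 1024   # roughly one snapshot per uncompressed megabyte


    class SeekableGzip:
        """Random access into a gzipped file via decompressor snapshots."""

        def __init__(self, path):
            self.path = path
            self.index = []           # (uncompressed offset, compressed offset, snapshot)
            self.cache_offset = None  # uncompressed offset of the last decoded block
            self.cache_data = b""     # the last decoded block itself
            self._build_index()

        def _build_index(self):
            # One pass over the file: decompress everything once and stash a copy
            # of the decompressor state about every SNAPSHOT_EVERY output bytes.
            d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = gzip framing
            out_pos = 0
            next_snap = 0
            with open(self.path, "rb") as f:
                while True:
                    in_pos = f.tell()
                    chunk = f.read(CHUNK)
                    if not chunk:
                        break
                    if out_pos >= next_snap:
                        # copy() duplicates the whole inflate state, so we can later
                        # resume decompression from exactly this point in the file.
                        self.index.append((out_pos, in_pos, d.copy()))
                        next_snap = out_pos + SNAPSHOT_EVERY
                    out_pos += len(d.decompress(chunk))
            self.size = out_pos

        def read_at(self, offset, length):
            """Return `length` uncompressed bytes starting at uncompressed `offset`."""
            # Serve from the cached block when possible; repeated reads around the
            # same spot (a user dragging a slider) then never touch zlib at all.
            if (self.cache_offset is not None
                    and self.cache_offset <= offset
                    and offset + length <= self.cache_offset + len(self.cache_data)):
                start = offset - self.cache_offset
                return self.cache_data[start:start + length]

            # Closest snapshot at or before the requested offset.
            snap_out, snap_in, snap = max(
                (entry for entry in self.index if entry[0] <= offset),
                key=lambda entry: entry[0])
            d = snap.copy()            # never mutate the stored snapshot
            data = b""
            with open(self.path, "rb") as f:
                f.seek(snap_in)
                # Decompress forward from the snapshot until the request is covered.
                while snap_out + len(data) < offset + length:
                    chunk = f.read(CHUNK)
                    if not chunk:
                        break
                    data += d.decompress(chunk)
            self.cache_offset = snap_out
            self.cache_data = data
            start = offset - snap_out
            return data[start:start + length]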

It took a day or two to dig up the necessary info and to implement a basic Python module that emulates random access files closely enough. Then it Just Worked(TM) with our existing log parsing code. It was a good moment. Coder's nirvana.
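Hypothetically, using such a wrapper from parsing code could look something like the following. The names come from the sketch above, not from Logster itself, and the file name and offsets are placeholders:

    # Grab 64 KB from roughly the middle of the log and parse whole lines from
    # it, dropping the (probably partial) first and last line of the block.
    log = SeekableGzip("access.log.1.gz")
    blob = log.read_at(log.size // 2, 64 * 1024)
    for line in blob.split(b"\n")[1:-1][:5]:
        print(line.decode("latin-1"))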

This does have some drawbacks, of course. Snapshotting consumes some memory, but not that much, and it can be controlled by decreasing the snapshotting frequency. And there's the wait one has to endure while the initial snapshots are being created. But that's life.

-- jvi 2009-04-29 08:03:34

