2002-12-13 : I decided it was finally time to write something that could analyze the log files from my web server. I'm very curious to find out what parts of my web site people are the most interested in.

I started by just parsing all the fields and dumping them to the screen. Then I added more record keeping as I thought of interesting things I wanted to know. For example, I divided hits into four categories: myself, attackers, web crawlers, and actual visitors. With so much information, I decided the best thing to do would be to produce a web page with the results. Here is an example of what it does right now (2002-12-14).
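To make the parse-then-classify step concrete, here is a rough sketch in Python. It is not the analyzeLog source. It assumes the server writes Apache "combined" format lines, which this page does not actually state, and the names LOG_PATTERN, MY_IPS, CRAWLER_HINTS, ATTACK_HINTS, parse_line and classify are all made up for illustration. The matching rules are simple guesses, not the real ones.

```python
import re

# Assumption: Apache "combined" log format. Adjust the pattern for other formats.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical rule tables; real ones would need tuning against real logs.
MY_IPS = {"192.0.2.10"}
CRAWLER_HINTS = ("bot", "crawler", "spider", "slurp")
ATTACK_HINTS = ("cmd.exe", "root.exe", "default.ida", "..")


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def classify(hit):
    """Put one parsed hit into one of the four categories."""
    if hit["ip"] in MY_IPS:
        return "myself"
    request = hit["request"].lower()
    if any(hint in request for hint in ATTACK_HINTS):
        return "attacker"
    if any(hint in hit["agent"].lower() for hint in CRAWLER_HINTS):
        return "crawler"
    return "visitor"
```

The order of the checks matters: my own hits are filtered out first, and anything that matches no rule falls through to "visitor", so that category will be overcounted until the rules improve.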

There are lots of things I want to add. The analysis of a single day is moderately interesting. I'm more interested in trends over time. So, I'll have to extract summary info from each log and then combine all of those into a separate page that shows trends over time.
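One possible shape for that, sketched in Python: reduce each day's log to a small summary, then merge the summaries into one series per category. The function names summarize and combine, and the idea of keying summaries by a date string, are assumptions for illustration only.

```python
from collections import Counter


def summarize(categories):
    """Reduce one day's list of hit categories to per-category counts."""
    return Counter(categories)


def combine(daily):
    """Merge per-day summaries into trends.

    daily maps a date string (YYYY-MM-DD, so it sorts correctly) to a Counter.
    Returns a dict mapping each category to a list of (date, count) pairs
    in date order, using 0 for days where the category did not appear.
    """
    names = sorted({name for counts in daily.values() for name in counts})
    days = sorted(daily)
    return {name: [(day, daily[day][name]) for day in days] for name in names}
```

Keeping only the summaries around means the trend page can be rebuilt without re-reading every old log.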

I was surprised how hard it turns out to be to identify a visitor, or even a web crawler. After all, many hits can represent a single visit, and I'm more interested in visits. I want to massage the data more until I can give a better estimate of visits. I can't just track IP addresses. People who use AOL seem to go through a web cache, so related hits come from different IPs. I want to try to tie these together into one visit.
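A rough Python sketch of one way to estimate visits. It is a guess at an approach, not something analyzeLog does today. It assumes two things that are not established here: that hits from the same user agent and the same first two IP octets belong to the same person, and that a gap of more than 30 minutes starts a new visit. VISIT_GAP, visitor_key and count_visits are hypothetical names.

```python
from datetime import datetime, timedelta

VISIT_GAP = timedelta(minutes=30)  # assumed cutoff between two visits


def visitor_key(ip, agent):
    """Hypothetical heuristic: same agent + same first two octets = same visitor.

    This is meant to tie together hits that arrive through a pool of caches,
    at the cost of sometimes merging unrelated people.
    """
    prefix = ".".join(ip.split(".")[:2])
    return (prefix, agent)


def count_visits(hits):
    """Count visits in hits, an iterable of (timestamp, ip, agent) sorted by time."""
    last_seen = {}
    visits = 0
    for when, ip, agent in hits:
        key = visitor_key(ip, agent)
        previous = last_seen.get(key)
        # A new visit starts on the first hit, or after a long enough silence.
        if previous is None or when - previous > VISIT_GAP:
            visits += 1
        last_seen[key] = when
    return visits
```

The prefix trick is crude. It will undercount when two different people share a network and a browser version, so the result is an estimate at best.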

The referer field often contains a search string. It would be interesting to show that information in the report. Sometimes it's pretty funny!
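Pulling the search string out of a referer could look something like this in Python. The list of query parameter names in QUERY_PARAMS is an assumption; different search engines use different names, and this list is certainly not complete. search_string is a made-up helper name.

```python
from urllib.parse import urlparse, parse_qs

# Assumed parameter names that search engines use for the query text.
QUERY_PARAMS = ("q", "p", "query")


def search_string(referer):
    """Return the search text found in a referer URL, or None if there is none."""
    if not referer or referer == "-":
        return None
    params = parse_qs(urlparse(referer).query)
    for name in QUERY_PARAMS:
        if name in params:
            # parse_qs already decodes '+' and %xx escapes.
            return params[name][0]
    return None
```

For example, a referer of `http://www.google.com/search?q=log+file+analyzer&hl=en` gives back `log file analyzer`.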

I also want to be able to do some automated analysis to detect attacks. For example, I'd like to get e-mailed when someone starts accessing weird URLs.
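As a starting point, "weird URLs" could be approximated as "many different URLs that do not exist". The Python sketch below flags clients that way. The rule, the threshold value and the name suspicious_clients are all assumptions, and the e-mail step is left out on purpose; the returned list is what would go into the message.

```python
from collections import defaultdict

THRESHOLD = 3  # assumed: distinct missing URLs before a client looks suspicious


def suspicious_clients(hits, threshold=THRESHOLD):
    """Return the sorted IPs that requested at least `threshold` distinct 404 URLs.

    hits is an iterable of (ip, url, status) with status as an int.
    """
    missing = defaultdict(set)
    for ip, url, status in hits:
        if status == 404:
            missing[ip].add(url)
    return sorted(ip for ip, urls in missing.items() if len(urls) >= threshold)
```

Counting distinct URLs rather than raw hits keeps one broken link that gets requested over and over from setting off the alarm.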

Lots to do. :)

2003-09-14 : added some more details to the output, notably the list of search strings that led people to my site.

analyzeLog [options]
    -?                this help
    -log <fileName>   log file to analyze
    -out <fileName>   optional file to write results to

You can get the source via anonymous CVS at

  • cvs -d co 2002/analyzeLog
