Thursday, March 27, 2008

Webalizer DNS problems (fails to resolve any domains)

I had the (dis)pleasure of reinstalling Webalizer on a server the other night to run some stats. I ended up wasting a bunch of my time on a silly problem, and thought I should post my solution since I googled in vain for an answer.

Basically, the Coles Notes version: I had run a grep for a particular domain name across all of the apache2 log files (several gigs) sitting in the log folder, in hopes of creating a log for just that domain. Unfortunately, I forgot that when grep is given multiple files (using filename wildcards on the command line), it prefixes every matching line with the name of the file it was found in. I didn't notice this happening, since I had redirected the output to a new file. I later tried running Webalizer on this file, only to find that DNS wasn't working properly: it wasn't resolving any of the visitors' IPs.
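For anyone who wants to see the filename prefix for themselves, here is a minimal sketch. The two log files and their contents are made up for illustration (they are not my real logs), and it assumes a grep that supports the -h option, which GNU and BSD grep both do:

```shell
#!/bin/sh
# Build two tiny stand-in log files (made-up data, vhost_combined-style lines).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'example.com 192.0.2.1 - - [31/Jan/2006:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "-"' > access_log-20060131
printf '%s\n' 'example.com 192.0.2.2 - - [01/Feb/2006:10:00:00 +0000] "GET /a HTTP/1.1" 200 128 "-" "-"' > access_log-20060201

# With more than one input file, grep prefixes each match with "filename:".
grep 'example.com' access_log-* > with_names.log

# -h suppresses that prefix, so the lines stay in plain log format.
grep -h 'example.com' access_log-* > without_names.log

# Compare the first line of each result.
head -n 1 with_names.log
head -n 1 without_names.log
```

Had I used -h in the first place, the rest of this post probably wouldn't exist.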

After googling, it became obvious that DNS issues are a common problem for Webalizer. Unfortunately for me, my erroneous log file was the actual cause, and I wasted *a lot* of time trying to rebuild Webalizer from source with the suggested --enable-dns option. I also tried a src rpm from a newer distribution (this was an openSUSE 10.0 based system; I tried the 10.2 or 10.3 src), but it had different requirements that didn't seem solvable on openSUSE 10.0.

Eventually I looked more closely at the log file I was trying to process (I had looked at it several times before without noticing the error). I also determined that Webalizer doesn't like the vhost at the beginning of each line (don't quote me on this.. I may be dreaming.. it was late...), so I went on to find a way to remove that as well.

Tools/methods to fix the problems:

To remove the filename "access_log-20060131:" (and its date variations) from in front of the vhost name on each line (note: this pattern only works if all the original log filenames are the same number of characters long; otherwise it will miss some and would need to be modified a bit):

sed -e 's/^access_log-2006....://' testlog

This removes the filename and the colon (:) from in front of the vhost name on each line. Note that sed prints the result to standard output, so redirect it to a file if you want to keep it.
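If the filenames are not all the same length, a slightly looser pattern should cope. This is only a sketch on made-up sample lines (the second filename, with its shorter date, is hypothetical), anchored to the start of the line and accepting any run of digits before the colon:

```shell
#!/bin/sh
# Two made-up lines whose filename prefixes differ in length.
sample=$(mktemp)
printf '%s\n' \
  'access_log-20060131:example.com 192.0.2.1 - - [31/Jan/2006:10:00:00 +0000] "GET / HTTP/1.1" 200 512' \
  'access_log-200602:example.com 192.0.2.2 - - [01/Feb/2006:10:00:00 +0000] "GET /a HTTP/1.1" 200 128' \
  > "$sample"

# ^ anchors the match to the start of the line;
# [0-9]* accepts any number of digits between "access_log-" and the colon.
cleaned=$(sed -e 's/^access_log-[0-9]*://' "$sample")

printf '%s\n' "$cleaned"
rm -f "$sample"
```

Anchoring with ^ also means a stray "access_log-" somewhere in a URL or referrer further along the line is left alone.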

To remove the vhost names for Webalizer (once again, I'm not positive you need to do this, but I wanted to make sure I only had log entries that originated from the particular vhost; my grep may have caught some requests to other vhosts that were merely referred by it, and this step eliminates that problem):

split-logfile2 < testlog

split-logfile2 splits a vhost_combined format apache2 logfile into separate files based on the vhost names contained in the logfile, and also removes the vhost name from the start of each line. Since each output file is named after its vhost, you can still tell which domain each log is for.
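To make the idea concrete, here is a rough awk approximation of that behaviour. It is not the real script, just a sketch on made-up log lines, and the "<vhost>.log" output naming is my assumption about how the files get named:

```shell
#!/bin/sh
# Work in a scratch directory with a small made-up vhost_combined-style log.
splitdir=$(mktemp -d)
cd "$splitdir" || exit 1
printf '%s\n' \
  'example.com 192.0.2.1 - - [31/Jan/2006:10:00:00 +0000] "GET / HTTP/1.1" 200 512' \
  'example.org 192.0.2.2 - - [31/Jan/2006:10:00:05 +0000] "GET /a HTTP/1.1" 200 128' \
  'example.com 192.0.2.3 - - [31/Jan/2006:10:00:09 +0000] "GET /b HTTP/1.1" 404 64' \
  > testlog

awk '{
  vhost = tolower($1)        # first field is the vhost name
  sub(/^[^ ]+ +/, "")        # drop the vhost and the space after it
  print > (vhost ".log")     # append the rest to a per-vhost file
}' testlog

# Each output file now holds only the lines for that vhost, without the vhost column.
cat example.com.log
```

Because lines are routed by the first field only, a request to another vhost that merely mentions the domain in its referrer ends up in that other vhost's file.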

I hadn't known about split-logfile2 before all of this, and now that I do, it will definitely come in handy down the road.

After all this, Webalizer was more than happy to finally parse the logs, and even reverse DNS all the IPs, providing much better results in terms of visitor information, etc.

On a side note, in my several hours of pissing around, I ran across several programs / projects / packages that have forked off of either Webalizer or other common log analyzers (seems there are many), and I will try to put up a post with what I found, if only for my own reference later. Webalizer results really do seem kind of basic after all is said and done, so finding another analyzer suite (I've used AWStats before as well) may be in the cards if I want to garner more information. Another alternative is Google Analytics of course, but that involves having it in the pages being served, so it only collects data as the visits happen. It's always nice to be able to process log files after the fact using tools such as Webalizer and AWStats.
