Anonlog: the logfile anonymizer


Introduction

Anonlog is a program to "anonymize" web server logfiles. This means that sensitive details are encoded so that you can send your logfiles to someone else without them being able to see confidential data.

This documentation describes version 0.91beta of the program. See the anonlog home page for the latest version.

Anonlog is a program from the author of analog.

Licence

Anonlog is copyright (c) Stephen R. E. Turner 2000, and is licensed under version 2 of the GNU General Public License. This licence allows you to modify and redistribute the program under certain conditions - principally that the modified or redistributed program is licensed in the same way as the original.

Since the program is free software, it is distributed without any warranty, even the implied warranties of merchantability or fitness for a particular purpose.

See the file Licence.txt for the full licence conditions. (If this file is missing, see http://www.gnu.org/copyleft/gpl.html or get a copy from the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.)

Details

Anonlog anonymizes the following items from the original logfile: filenames (but preserves the extension), visitors' hostnames, referrers, usernames and virtual hostnames. (Some of these items, especially the last two, may not be present in every logfile.)

The following items are left unchanged: date and time of each request, HTTP status code, file size, processing time and browser name.

Search arguments on files and referrers are deleted, and replaced with an indication that they were present.
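
As an illustration of the filename handling described above, here is a minimal Python sketch (not anonlog's own Perl code). The function name anonymize_path, the translate_word callback and the ?ARGS marker are all assumptions made for this example; the exact marker that anonlog writes in place of search arguments is not specified in this documentation.

```python
def anonymize_path(path, translate_word):
    """Translate each component of a request path, keeping file extensions.

    translate_word is a hypothetical callback that maps one original
    name to its anonymized replacement.
    """
    base, question_mark, _query = path.partition("?")
    translated = []
    for part in base.split("/"):
        if part == "":
            # Empty components come from leading or trailing slashes.
            translated.append(part)
            continue
        stem, dot, extension = part.rpartition(".")
        if dot and stem:
            # Translate the name but preserve the extension.
            translated.append(translate_word(stem) + "." + extension)
        else:
            # No extension (or a dotfile): translate the whole component.
            translated.append(translate_word(part))
    result = "/".join(translated)
    if question_mark:
        # The search arguments are dropped; a placeholder (an assumption
        # here) records that they were present.
        result += "?ARGS"
    return result


# Upper-casing stands in for a real translation, purely for demonstration.
print(anonymize_path("/docs/index.html?q=secret", str.upper))
```

In this sketch the extension survives and the search arguments never reach the output.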

Anonlog can read logfiles in several different commonly-used formats. The anonymized logfile is written to a new file.

The translation uses real words where possible. Furthermore, items are translated "hierarchically": for example, if maths.cam.ac.uk became lemon.bee.to.de, then statslab.cam.ac.uk might become greatest.bee.to.de. (Whether the new names should be the same length as the old ones is configurable; see matchlength under Configuration.)
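
The hierarchical idea can be sketched in Python as follows. This is an assumption about the general approach, not a transcription of anonlog's Perl code; the word list, the make_translator function and the numbered fallback labels are hypothetical.

```python
import random

# Hypothetical word list; anonlog draws its words from the dictionary file.
WORDS = ["lemon", "bee", "to", "de", "greatest", "pride", "truth", "fortune"]


def make_translator(words, rng):
    """Return a function mapping hostnames to anonymized hostnames."""
    mapping = {}  # original suffix, e.g. ("cam", "ac", "uk") -> new leftmost label
    used = {}     # original parent suffix -> labels already handed out beneath it

    def fresh_label(taken):
        free = [w for w in words if w not in taken]
        if free:
            return rng.choice(free)
        # Word list exhausted: fall back to a numbered placeholder.
        n = len(taken)
        while "w%d" % n in taken:
            n += 1
        return "w%d" % n

    def translate(hostname):
        labels = hostname.lower().split(".")
        new_labels = []
        # Work from the top-level domain inwards, so hosts that share a
        # suffix also share the translation of that suffix.
        for i in range(len(labels) - 1, -1, -1):
            suffix = tuple(labels[i:])
            if suffix not in mapping:
                taken = used.setdefault(tuple(labels[i + 1:]), set())
                mapping[suffix] = fresh_label(taken)
                taken.add(mapping[suffix])
            new_labels.insert(0, mapping[suffix])
        return ".".join(new_labels)

    return translate


# An unseeded generator gives a different translation on every run.
translate = make_translator(WORDS, random.Random())
first = translate("maths.cam.ac.uk")
second = translate("statslab.cam.ac.uk")
print(first, second)  # the two results share their last three labels
```

Because each suffix is translated once and then reused, every host under cam.ac.uk ends with the same three anonymized labels, while the leftmost labels stay distinct.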

The key to the translation can be written to a file if you want. Note that running the program on the same file twice will not produce the same results. (This is a deliberate security feature.)

If you run analog on both the original and the anonymized logfiles, the results should be almost exactly analogous, with minor differences due to different parsing routines. The exceptions are analog's Search Word Report and Search Query Report, which will be lost, and the Organisation Report, which will be wrong.

Running anonlog

Anonlog is written in Perl. To run it you need at least version 5.004 of Perl.

On Unix or Linux

Change settings in the configuration file anonlog.cfg (see below). Then type perl anonlog.pl to run the program.

(If you don't have a new enough version of Perl, download it free from http://www.perl.org/ .)

On Windows

If you don't have Perl already, download it free of charge from http://www.activestate.com/Products/ActivePerl/ . Then change settings in the configuration file anonlog.cfg (see below) and run anonlog.pl.

Configuration

The configuration file is called anonlog.cfg. In this file, you can control the behaviour of anonlog.

In the configuration file, anything following a # is a comment. Other lines follow the format "variable = value".

Here is the full list of variables which you can set. You will want to set at least the first three.

(There is no reason to declare the same variable more than once, but if you do, only the last occurrence will take effect.)

logfile
The logfile to be anonymized. Unix users might like to set logfile=- for stdin.
newlog
Where to write the translated logfile. Unix users might like to set newlog= for stdout.
servernames
Names by which your server is known (a comma-separated list). These are treated specially in the referrer field. For referrers from these servers, the hostname is left un-anonymized, and the filename is translated as a local filename.
logformat
Anonlog can parse logfiles in several commonly-used formats. Normally it can detect the format of your logfile, but if it has trouble you can force the format with this variable. Legal values are common, combined, extended, ms-extended (a buggy version of extended in IIS) and iis (IIS native format).
dictionary
A file from which to select words for use in the translated logfile. One is supplied with the program (it's all the words from Jane Austen's Pride and Prejudice, in case you're wondering), but any text file will do.
translations
Where to write the key to the translations. Leave blank if you don't want it to do this.
unchfiles
Filenames to leave alone (a comma-separated list). It is convenient to leave index.html (or equivalent) alone so that /dir/index.html is still the same as /dir/ after the anonymization.
matchlength
Whether the new names should be the same length as the old ones (1 for yes, 0 for no). The default is 0 for maximum security. Setting this to 1 tends to lead to shorter and more readable output.
case_sensitive
Whether filenames on your server are case-sensitive or not (1 for yes, 0 for no). This is normally 1 on a Unix machine, 0 on a Windows machine.
usercase_sensitive
The same for usernames.
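
The file format described above (comments after #, "variable = value" lines, last occurrence wins) can be illustrated with a short Python sketch. This is not anonlog's own parser, which is written in Perl; the parse_config function and every value in the sample are hypothetical, and silently skipping lines without an = sign is an assumption of the sketch.

```python
# Hypothetical configuration text; all values are made up for illustration.
SAMPLE = """\
# Settings for a test run
logfile = access.log
newlog = anon.log          # anything after the # is ignored
servernames = www.example.com, example.com
matchlength = 0
matchlength = 1            # declared twice: this one takes effect
"""


def parse_config(text):
    """Parse 'variable = value' lines into a dict of strings."""
    settings = {}
    for raw_line in text.splitlines():
        # Drop the comment, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue  # blank, comment-only or malformed line (skipped here)
        name, value = line.split("=", 1)
        # A later assignment simply overwrites an earlier one.
        settings[name.strip()] = value.strip()
    return settings


config = parse_config(SAMPLE)
servers = [s.strip() for s in config["servernames"].split(",")]
print(config["matchlength"], servers)
```

The comma-separated variables (servernames, unchfiles) are split in a second step, as shown for servernames.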

Notes on security

The default configuration file is set for maximum security. You can trade a little security for more functionality by setting translations and changing matchlength to 1.

Notes on speed

The program needs to make sure that no two original names are given the same anonymized name. This is a memory- and processor-intensive task, so the program does not run very fast - I'm processing about 10,000 logfile lines per minute on a 266 MHz chip with 96 MB of RAM.
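
To see where the cost comes from, consider the following minimal Python sketch. It is an assumption about the general technique, not anonlog's actual implementation; the unique_namer function and its numbered-suffix fallback are hypothetical. Every original name and every name already handed out must be remembered for the whole run, and each new name may need several attempts before an unused one is found.

```python
import random


def unique_namer(words, rng):
    """Return a function that gives each original name a distinct new name."""
    forward = {}  # original name -> anonymized name
    used = set()  # anonymized names handed out so far

    def name_for(original):
        if original in forward:
            return forward[original]
        candidate = rng.choice(words)
        attempt = 0
        # Retry until the candidate has not been used before. The growing
        # numeric suffix guarantees that the loop eventually terminates.
        while candidate in used:
            attempt += 1
            candidate = rng.choice(words) + str(attempt)
        forward[original] = candidate
        used.add(candidate)
        return candidate

    return name_for


namer = unique_namer(["pride", "prejudice"], random.Random())
names = [namer("/file%d.html" % i) for i in range(50)]
print(len(set(names)))  # 50 distinct names from a two-word list
```

Both tables grow with the number of distinct names in the logfile, which is consistent with the memory use described above.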

If you want to increase the speed, you can try unsetting dictionary (although this will make the output substantially less readable). If you keep the dictionary, leaving matchlength at 0 may help.

What's new?

Version 0.91beta (23-Jun-2000)
First public version. Only trivial changes.
Version 0.9beta (30-May-2000)
First version released to sponsor.

Feedback

I welcome feedback on this program. Contact me at analog-author@lists.isite.net.

Acknowledgements

Many thanks to an anonymous sponsor for funding the development of this program.
Stephen Turner
E-mail: analog-author@lists.isite.net
23-Jun-2000