This document describes how Spamstats works internally and the
reasoning behind its design.  If you are looking for usage help, run
spamstats with the -help switch.

Reports of unclear passages and suggested improvements to this document
should be sent to Vincent Deffontaines
<vincent@gryzor.REMOVETHIS.ANDTHIS.com>

Spamstats uses three distinct tables (in Perl-speak: hashes) to
cross-link information from the mailer and SpamAssassin: mailer_table,
spamd_pid and spamd_table.  The first table is mailer_table.

####################

Whenever an email enters the system through SMTP (this does not apply
to locally generated messages), an entry is added to that table.  In
the case of postfix, if you have a line like:

Mar 11 00:06:45 mybox postfix/cleanup[16313]: 4DC5213BA1:
message-id=<000101c2eplentyofnumbers@remotemailer.com>

the following entry is added to mailer_table:
        mailer_table{'4DC5213BA1'} = '000101c2eplentyofnumbers@remotemailer.com'
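
As an illustration only, the step above can be sketched in Python (this
is not Spamstats' own Perl code).  The regular expression, the function
name record_cleanup and the assumption that the log entry arrives as one
unwrapped line are all hypothetical:

```python
import re

# Assumed shape of a postfix/cleanup line: "<queue id>: message-id=<...>".
CLEANUP_RE = re.compile(
    r"postfix/cleanup\[\d+\]: ([0-9A-F]+): message-id=<([^>]*)>"
)

mailer_table = {}


def record_cleanup(line, table):
    """Store queue id -> message-id if the line is a cleanup line."""
    match = CLEANUP_RE.search(line)
    if match:
        table[match.group(1)] = match.group(2)
    return match is not None


line = (
    "Mar 11 00:06:45 mybox postfix/cleanup[16313]: 4DC5213BA1: "
    "message-id=<000101c2eplentyofnumbers@remotemailer.com>"
)
record_cleanup(line, mailer_table)
```

After this call, mailer_table maps the postfix queue id to the
message-id, which is the key used by the spamd-oriented tables.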


Spamd-oriented tables :

####################
Whenever a spamd input line is encountered, the spamd_pid table is filled in:

If the spamd line is:
Mar 11 00:06:39 mybox spamd[2364]: processing message \
<000101c2eplentyofnumbers@remotemailer.com> for spamd:1003, expecting 3022 bytes.

we will have:
        spamd_pid{'2364'} = '000101c2eplentyofnumbers@remotemailer.com'
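
A minimal Python sketch of this step (again, not the actual Perl code).
The pattern and the name record_processing are assumptions, and the log
entry is taken as a single unwrapped line:

```python
import re

# Assumed shape of a spamd input line: "spamd[<pid>]: processing message <id>".
PROCESSING_RE = re.compile(r"spamd\[(\d+)\]: processing message <([^>]*)>")

spamd_pid = {}


def record_processing(line, pid_table):
    """Remember which message-id a given spamd process is working on."""
    match = PROCESSING_RE.search(line)
    if match:
        pid_table[match.group(1)] = match.group(2)
    return match is not None


line = (
    "Mar 11 00:06:39 mybox spamd[2364]: processing message "
    "<000101c2eplentyofnumbers@remotemailer.com> for spamd:1003, "
    "expecting 3022 bytes."
)
record_processing(line, spamd_pid)
```

The pid is kept as a string because it is only ever used as a lookup key.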


####################
Whenever a spamd verdict line (clean or spam) is found, the spamd_table is filled in:

If the line is:
Mar 11 00:06:44 mybox spamd[2364]: clean message (1.1/6.0) for spamd:1003 \
in 5.0 seconds, 3022 bytes.
we have:
        spamd_table{spamd_pid{'2364'}} = 'clean'

If the message had been identified as spam instead:
Mar 11 00:06:54 mybox spamd[2364]: identified spam (16.1/6.0) for spamd:1003 \
in 5.3 seconds, 3022 bytes.
we would have:
        spamd_table{spamd_pid{'2364'}} = 'spam'

At this point, in either case, we delete the spamd_pid entry for '2364',
since that spamd process has finished with the message and the entry is
of no further use.
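
The verdict step, including the removal of the spamd_pid entry, might
look like this in Python.  The pattern and the name record_verdict are
hypothetical, and ignoring a verdict whose pid is unknown is an
assumption of this sketch, not something stated above:

```python
import re

# Assumed shape of a verdict line: "clean message (" or "identified spam (".
VERDICT_RE = re.compile(r"spamd\[(\d+)\]: (clean message|identified spam) \(")


def record_verdict(line, pid_table, verdict_table):
    """Store message-id -> 'clean'/'spam' and drop the finished pid entry."""
    match = VERDICT_RE.search(line)
    if not match:
        return False
    # pop() both fetches the message-id and deletes the spamd_pid entry.
    message_id = pid_table.pop(match.group(1), None)
    if message_id is None:
        return False  # verdict for a pid we never saw: ignored in this sketch
    is_clean = match.group(2) == "clean message"
    verdict_table[message_id] = "clean" if is_clean else "spam"
    return True


spamd_pid = {"2364": "000101c2eplentyofnumbers@remotemailer.com"}
spamd_table = {}
line = (
    "Mar 11 00:06:54 mybox spamd[2364]: identified spam (16.1/6.0) "
    "for spamd:1003 in 5.3 seconds, 3022 bytes."
)
record_verdict(line, spamd_pid, spamd_table)
```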

#####################

Whenever an email is delivered, the information in mailer_table and
spamd_table gets cross-linked:

Mar 11 00:06:55 mybox postfix/pipe[1119]: 4DC5213BA1: to=<myuser@mydomain.com>,
relay=filter, delay=7, status=sent (mybox.mydomain.com)
We look up spamd_table{mailer_table{'4DC5213BA1'}} to get the verdict, and
extract the recipient ('myuser@mydomain.com') from the delivery line.

This information is used, for instance, to list the most-spammed recipients.
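
The cross-link can be sketched in Python as two chained lookups.  The
pattern, the name lookup_delivery, the 'spam' verdict pre-loaded into
spamd_table and the use of a Counter for per-recipient totals are all
assumptions made for illustration:

```python
import re
from collections import Counter

# Assumed shape of a delivery line: "<queue id>: to=<rcpt>, ... status=sent".
DELIVERY_RE = re.compile(
    r"postfix/\w+\[\d+\]: ([0-9A-F]+): to=<([^>]*)>,.*status=sent"
)


def lookup_delivery(line, mailer_table, spamd_table):
    """Return (recipient, verdict) for a delivery line, or None if unknown."""
    match = DELIVERY_RE.search(line)
    if not match:
        return None
    queue_id, recipient = match.groups()
    message_id = mailer_table.get(queue_id)   # queue id -> message-id
    verdict = spamd_table.get(message_id)     # message-id -> verdict
    if verdict is None:
        return None
    return recipient, verdict


mailer_table = {"4DC5213BA1": "000101c2eplentyofnumbers@remotemailer.com"}
spamd_table = {"000101c2eplentyofnumbers@remotemailer.com": "spam"}
line = (
    "Mar 11 00:06:55 mybox postfix/pipe[1119]: 4DC5213BA1: "
    "to=<myuser@mydomain.com>, relay=filter, delay=7, status=sent "
    "(mybox.mydomain.com)"
)

spam_per_recipient = Counter()
result = lookup_delivery(line, mailer_table, spamd_table)
if result is not None and result[1] == "spam":
    spam_per_recipient[result[0]] += 1
```

spam_per_recipient.most_common() would then give the recipients ordered
by the number of spams they received.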

From spamstats 0.4b onwards, two different behaviours are possible at this point:

The default behaviour in 0.4b is to process every delivered email, even
if a single email has several recipients.

If you pass the -agglo-recipients option (and in spamstats 0.4 and
earlier), the spamd_table and mailer_table entries are cleared after the
first recipient has been processed.

Which one you want depends on how you look at spam:
 - You can count spam as a "bandwidth annoyance", in which case counting one
spam per effective mailer id is what you want.
 - You can count spam as an individual user annoyance, in which case a spam with
2 recipients must be counted twice; this is the default starting from version 0.4b.
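
The difference between the two modes can be sketched in Python as
follows.  The function name count_spam and the agglo_recipients
parameter are hypothetical, and deliveries are given as already-parsed
(queue id, recipient) pairs.  When the tables are cleaned up in default
mode is not covered here, so this sketch simply leaves them alone:

```python
def count_spam(deliveries, mailer_table, spamd_table, agglo_recipients=False):
    """Count spam deliveries, once per recipient or once per mailer id."""
    # Work on copies so the caller's tables are left untouched.
    mailer = dict(mailer_table)
    verdicts = dict(spamd_table)
    spam = 0
    for queue_id, _recipient in deliveries:
        message_id = mailer.get(queue_id)
        verdict = verdicts.get(message_id)
        if verdict is None:
            continue  # unknown, or already cleared in agglo mode
        if verdict == "spam":
            spam += 1
        if agglo_recipients:
            # Clear both entries after the first recipient has been processed.
            del mailer[queue_id]
            del verdicts[message_id]
    return spam


mailer_table = {"4DC5213BA1": "000101c2eplentyofnumbers@remotemailer.com"}
spamd_table = {"000101c2eplentyofnumbers@remotemailer.com": "spam"}
deliveries = [
    ("4DC5213BA1", "myuser@mydomain.com"),
    ("4DC5213BA1", "otheruser@mydomain.com"),
]

per_recipient = count_spam(deliveries, mailer_table, spamd_table)
per_mailer_id = count_spam(deliveries, mailer_table, spamd_table,
                           agglo_recipients=True)
```

With one spam sent to two recipients, the first call counts it twice and
the second counts it once.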
