May 18, 2012

interpreting sa-learn output

sa-learn is the tool that trains SpamAssassin’s Bayesian classifier. Read all about it here. Use it to teach your engine to catch spam better.

Too many people ask, “Is there a corpus database where I can download ALL the spam and train my database?”

The answer is: NO. The only way to train your spam engine properly to catch the spam that is coming to YOUR site is to train it with the spam that is not caught by YOUR engine. Get it?

To see what the sa-learn command is doing, you can view the contents of the Bayesian database:

#sa-learn --dump

0.000          0          3          0  non-token data: bayes db version
0.000          0       6756          0  non-token data: nspam
0.000          0     356965          0  non-token data: nham
0.000          0     164268          0  non-token data: ntokens
0.000          0 1148413672          0  non-token data: oldest atime
0.000          0 1154993621          0  non-token data: newest atime
0.000          0 1154992178          0  non-token data: last journal sync atime
0.000          0 1154956630          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0      18535          0  non-token data: last expire reduction count
0.084         31      17930 1154992169  c0614089c0
0.071        167     115771 1154992733  0623e506fc
0.066          5       3766 1154982387  d01f245fcc
0.424         43       3086 1154966772  3bd6a7ead4
0.054          3       2756 1154978213  672bc0c09a
0.134         11       3746 1154976906  90775ea219
0.017          1       3111 1154984595  15cc1b9b67
0.000          0       9970 1154981964  06c4f30daa
0.433        148      10232 1154969474  0e6066addf
0.081         21      12538 1154990693  e19b6b377a
0.000          0       5844 1154989749  efcac76195
0.178          4        977 1154950009  2b0f558e29
0.027          1       1895 1154976608  c6804c3c36
0.000          0       1711 1154986938  69411708b9
0.094          6       3071 1154969808  62239653c8
0.111          4       1688 1154980006  0fbc565738
0.000          0       1106 1154989801  490950b5b5
0.066          3       2260 1154986546  e33c66040d
0.000          0        581 1154955185  3b8bc3e0c5

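There is some logic to the above gibberish. The format of each token line is:

spam_probability, #_in_spam, #_in_ham, timestamp, token

The timestamp is the token's last access time (atime) in Unix epoch seconds. The leading "non-token data" lines are database-wide bookkeeping rather than tokens, and their value sits in the third column: nspam and nham are the number of spam and ham messages learned so far, and ntokens is the number of tokens stored.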
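If you want to slice and dice the dump yourself, the five-column layout is easy to pick apart. Here is a minimal Python sketch, assuming the whitespace-separated format shown above; the names `DumpEntry`, `parse_dump_line`, `is_non_token` and `atime_utc` are my own inventions for illustration, not anything that ships with SpamAssassin.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class DumpEntry(NamedTuple):
    """One line of `sa-learn --dump` output (hypothetical helper type)."""
    probability: float
    spam_count: int
    ham_count: int
    atime: int
    token: str


def parse_dump_line(line: str) -> DumpEntry:
    # Split into at most five fields so that a last column containing
    # spaces (e.g. "non-token data: nspam") stays in one piece.
    prob, n_spam, n_ham, atime, token = line.split(None, 4)
    return DumpEntry(float(prob), int(n_spam), int(n_ham), int(atime), token.strip())


def is_non_token(entry: DumpEntry) -> bool:
    # The bookkeeping lines at the top of the dump are labelled this way.
    return entry.token.startswith("non-token data:")


def atime_utc(entry: DumpEntry) -> datetime:
    # Assumes the timestamp column is Unix epoch seconds.
    return datetime.fromtimestamp(entry.atime, tz=timezone.utc)


if __name__ == "__main__":
    sample = [
        "0.000          0       6756          0  non-token data: nspam",
        "0.084         31      17930 1154992169  c0614089c0",
    ]
    for raw in sample:
        entry = parse_dump_line(raw)
        kind = "header" if is_non_token(entry) else "token"
        print(kind, entry)
```

Feed it the lines from `sa-learn --dump` one at a time and you can sort tokens by spam count, filter by age, and so on.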

The tokens are the patterns extracted from the messages you trained on. SA used to store them as plain words, but I believe that since 3.0 only hashes are stored. If you are running a pre-3.0 version of SA, first of all, UPGRADE! Getting past that, the sa-learn output would look like the following:

.000 250 0 1108594451 N:HContent-Transfer-Encoding:NBit
1.000 256 0 1108268857 HTo:U*ken_peacock
1.000 271 0 1108594445 N:sk:NNNNNNc
1.000 277 0 1108594445 medical
1.000 284 0 1108589059 H*r:license
1.000 290 0 1108588985 H*r:sk:shaun_b
1.000 294 0 1108589059 H*r:vK.4.04.00
1.000 294 0 1108589059 N:H*r:vK.N.NN.NN
1.000 335 0 1108590786 N:H*r:NNN.NN.NN
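One more note on the first column of the hashed dump further up. The sample numbers are consistent with a simple ratio: the token's share of all learned spam, divided by the sum of its spam share and its ham share. The Python sketch below checks that against a few of the sample lines, using the nspam (6756) and nham (356965) values from the header. Treat it as an observation about this particular dump, not a statement of how SpamAssassin scores mail; the real classifier may apply further smoothing, and `token_spam_probability` is a made-up name.

```python
def token_spam_probability(spam_hits: int, ham_hits: int, nspam: int, nham: int) -> float:
    """Rough per-token spam probability from the dump columns (assumed formula)."""
    spam_ratio = spam_hits / nspam   # fraction of learned spam containing the token
    ham_ratio = ham_hits / nham      # fraction of learned ham containing the token
    if spam_ratio + ham_ratio == 0:
        raise ValueError("token was never seen in spam or ham")
    return spam_ratio / (spam_ratio + ham_ratio)


if __name__ == "__main__":
    NSPAM, NHAM = 6756, 356965  # from the "non-token data" header lines
    # (spam count, ham count, probability printed by sa-learn)
    samples = [(31, 17930, 0.084), (167, 115771, 0.071), (43, 3086, 0.424)]
    for spam_hits, ham_hits, reported in samples:
        estimate = token_spam_probability(spam_hits, ham_hits, NSPAM, NHAM)
        print(f"reported {reported:.3f}  estimated {estimate:.3f}")
```

This also explains why a token seen 167 times in spam can still score low: it shows up in an even larger share of the ham.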
