sa-learn is the tool that trains SpamAssassin's Bayesian classifier. The sa-learn man page covers it in detail. Use it to train your engine to catch spam better.
Too many people ask, "Is there a corpus database where I can download ALL the spam and train my database?"
The answer is NO. The only way to train your spam engine to catch the spam that is coming to YOUR site is to train it with the spam that YOUR engine did not catch. Get it?
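In practice that means feeding the messages your filter missed back to sa-learn. Here is a minimal sketch in Python. The --spam and --ham options are sa-learn's own; the function names and the example path are made up for illustration, and the sketch assumes sa-learn is on your PATH.

```python
import subprocess

def sa_learn_command(kind, path):
    """Build the sa-learn argument list for training on `path`.

    kind is "spam" for messages the filter missed, or "ham" for
    legitimate mail that was wrongly flagged.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, path]

def train(kind, path):
    """Run sa-learn on `path`; raises CalledProcessError if it fails."""
    subprocess.run(sa_learn_command(kind, path), check=True)

# Example (hypothetical path holding spam that slipped past the filter):
# train("spam", "/var/spool/missed-spam")
```

Keeping the command construction separate from the call makes it easy to log or inspect exactly what will be run before touching the database.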
To see what sa-learn has learned so far, dump the contents of the Bayesian database:
# sa-learn --dump
0.000 0 3 0 non-token data: bayes db version
0.000 0 6756 0 non-token data: nspam
0.000 0 356965 0 non-token data: nham
0.000 0 164268 0 non-token data: ntokens
0.000 0 1148413672 0 non-token data: oldest atime
0.000 0 1154993621 0 non-token data: newest atime
0.000 0 1154992178 0 non-token data: last journal sync atime
0.000 0 1154956630 0 non-token data: last expiry atime
0.000 0 345600 0 non-token data: last expire atime delta
0.000 0 18535 0 non-token data: last expire reduction count
0.084 31 17930 1154992169 c0614089c0
0.071 167 115771 1154992733 0623e506fc
0.066 5 3766 1154982387 d01f245fcc
0.424 43 3086 1154966772 3bd6a7ead4
0.054 3 2756 1154978213 672bc0c09a
0.134 11 3746 1154976906 90775ea219
0.017 1 3111 1154984595 15cc1b9b67
0.000 0 9970 1154981964 06c4f30daa
0.433 148 10232 1154969474 0e6066addf
0.081 21 12538 1154990693 e19b6b377a
0.000 0 5844 1154989749 efcac76195
0.178 4 977 1154950009 2b0f558e29
0.027 1 1895 1154976608 c6804c3c36
0.000 0 1711 1154986938 69411708b9
0.094 6 3071 1154969808 62239653c8
0.111 4 1688 1154980006 0fbc565738
0.000 0 1106 1154989801 490950b5b5
0.066 3 2260 1154986546 e33c66040d
0.000 0 581 1154955185 3b8bc3e0c5
There is some logic to the above gibberish. The format for the output is:
spam_probability, #_in_spam, #_in_ham, timestamp, token
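If you want to work with the dump rather than just stare at it, the five columns are easy to pull apart. The following is a rough Python sketch, not an official parser. The names DumpEntry, parse_dump_line, summarize and last_seen are mine. It assumes the timestamp is Unix epoch seconds, and that the "non-token data" lines carry their value in the third column, which is what the sample output suggests.

```python
from collections import namedtuple
from datetime import datetime, timezone

DumpEntry = namedtuple("DumpEntry", "probability in_spam in_ham atime token")

def parse_dump_line(line):
    """Split one line of sa-learn --dump output into its five fields."""
    # The token comes last and may contain spaces
    # (e.g. "non-token data: nspam"), so split at most four times.
    prob, in_spam, in_ham, atime, token = line.strip().split(None, 4)
    return DumpEntry(float(prob), int(in_spam), int(in_ham), int(atime), token)

def summarize(lines):
    """Return (counters, tokens): database-wide values and token entries."""
    counters, tokens = {}, []
    prefix = "non-token data: "
    for line in lines:
        if not line.strip():
            continue
        entry = parse_dump_line(line)
        if entry.token.startswith(prefix):
            # Assumption from the sample: the value sits in the third column.
            counters[entry.token[len(prefix):]] = entry.in_ham
        else:
            tokens.append(entry)
    return counters, tokens

def last_seen(entry):
    """Convert a token's timestamp (assumed epoch seconds) to a UTC datetime."""
    return datetime.fromtimestamp(entry.atime, tz=timezone.utc)

sample = [
    "0.000 0 6756 0 non-token data: nspam",
    "0.000 0 356965 0 non-token data: nham",
    "0.433 148 10232 1154969474 0e6066addf",
]
counters, tokens = summarize(sample)
print(counters["nspam"], counters["nham"])      # 6756 356965
print(tokens[0].token, tokens[0].probability)   # 0e6066addf 0.433
```

To run it against a live database, pipe the output of sa-learn --dump into a script that passes sys.stdin to summarize.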
The tokens are the patterns found in the messages SA has been trained on. SA used to store them as plain words, but I believe that since 3.0 only hashes are stored. If you are running a pre-3.0 version of SA, first of all, UPGRADE! That aside, the sa-learn output on such a version would look like the following:
1.000 250 0 1108594451 N:HContent-Transfer-Encoding:NBit
1.000 256 0 1108268857 HTo:U*ken_peacock
1.000 271 0 1108594445 N:sk:NNNNNNc
1.000 277 0 1108594445 medical
1.000 284 0 1108589059 H*r:license
1.000 290 0 1108588985 H*r:sk:shaun_b
1.000 294 0 1108589059 H*r:vK.4.04.00
1.000 294 0 1108589059 N:H*r:vK.N.NN.NN
1.000 335 0 1108590786 N:H*r:NNN.NN.NN
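To tell at a glance which kind of database a dump came from, you can look at the shape of the token column. This is a rough heuristic in Python. It assumes hashed tokens always print as ten lowercase hex digits, as they do in the sample earlier in this post; that is my observation from the output, not something taken from the SpamAssassin documentation. The name looks_hashed is made up.

```python
import re

# Ten lowercase hex digits, the shape seen in the hashed sample output.
HASHED_TOKEN = re.compile(r"[0-9a-f]{10}")

def looks_hashed(token):
    """Heuristic: True if the token has the shape of a hashed entry."""
    return HASHED_TOKEN.fullmatch(token) is not None

for token in ["0e6066addf", "medical", "H*r:license"]:
    print(token, looks_hashed(token))
```

A plain word could match the pattern by coincidence, so check several tokens before drawing a conclusion.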
