Every day spammers invent new techniques to bypass spam filters. Modern spam filters cope well with ordinary text mail using regex rules and Bayesian classifiers, but they are useless when spammers send messages with an attached image and random, non-spam text in the message body. But a solution to this problem is already available! It is the FuzzyOcr plugin for SpamAssassin.
This plugin checks for specific keywords in image/gif, image/jpeg or image/png attachments, using gocr (an optical character recognition program). It can be used to detect spam that puts all the real spam content in an attached image, while the mail itself contains only random text and random HTML, without any URLs or identifiable information. It also does approximate matching on words, so errors in recognition or attempts to obfuscate the text inside the image will not cause the detection to fail. It can be easily extended, because all words reside in a simple plain text file.
Setting it up on Ubuntu takes a few simple steps:
- apt-get install gocr netpbm imagemagick libstring-approx-perl
- mkdir /tmp/fuzzyocr
- cd /tmp/fuzzyocr
- apt-get source libungif-bin
- wget http://users.own-hero.net/~decoder/fuzzyocr/{giftext-segfault.patch,fuzzyocr-latest.tar.gz}
- patch libungif4-4.1.4/util/giftext.c ./giftext-segfault.patch
- cd libungif4-4.1.4
- dpkg-buildpackage -rfakeroot -us -uc
- cd ..
- dpkg -i libungif4g_4.1.4-1_i386.deb libungif-bin_4.1.4-1_i386.deb
- tar xzf fuzzyocr-latest.tar.gz
- mkdir -p /usr/local/lib/site_perl
- cp FuzzyOcr-2.3b/FuzzyOcr.pm /usr/local/lib/site_perl
- cp FuzzyOcr-2.3b/FuzzyOcr.cf /etc/spamassassin
- cp FuzzyOcr-2.3b/FuzzyOcr.words.sample /etc/spamassassin/FuzzyOcr.words
Steps 4-10 are required only because a segfault was discovered in the giftext utility from the libungif-bin package. Now all you have to do is enable the FuzzyOcr plugin in SpamAssassin and tweak your word list.
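Before enabling the plugin, it can help to confirm that the helper programs from the packages above are actually on the PATH. The following is only a minimal sketch: the function name missing_tools is made up for illustration, and the list of binaries is an assumption about what the installed packages provide (gocr, giftext from libungif-bin, the netpbm converters, and convert from ImageMagick), not a list taken from the FuzzyOcr documentation.

```shell
#!/bin/sh
# Print every command name from the argument list that is not found on PATH.
# (missing_tools is a hypothetical helper, not part of FuzzyOcr.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed set of binaries provided by the packages installed in step 1
# and by the rebuilt libungif-bin package.
missing=$(missing_tools gocr giftext giftopnm jpegtopnm pngtopnm convert)

if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "Missing tools:" $missing
else
    echo "All helper programs found."
fi
```

If anything is reported as missing, re-check the corresponding apt-get or dpkg step before going further.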
Edit /etc/spamassassin/FuzzyOcr.cf and:
- Remove the line loadplugin FuzzyOcr FuzzyOcr.pm
- Set focr_pre314 to 1.
- Set focr_logfile to /var/log/FuzzyOcr.log
Add the following line to /etc/spamassassin/v312.pre:
- loadplugin FuzzyOcr /usr/local/lib/site_perl/FuzzyOcr.pm
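The configuration edits above can be double-checked with a small script. This is a sketch under assumptions: has_active_line and the CONF_DIR override are hypothetical names introduced here, and the check assumes that each option sits on its own line as "name value" and that comment lines start with #.

```shell
#!/bin/sh
# Succeed if FILE has an uncommented line containing TEXT.
# Runs of spaces and tabs are collapsed first, so "focr_pre314    1" still matches.
# (has_active_line is a hypothetical helper, not part of SpamAssassin.)
has_active_line() {
    # $1 = file to inspect, $2 = literal text to look for
    awk -v want="$2" '
        /^[ \t]*#/ { next }                  # skip comment lines
        { gsub(/[ \t]+/, " ") }              # collapse whitespace runs
        index($0, want) { found = 1; exit }  # literal substring match
        END { exit !found }
    ' "$1" 2>/dev/null
}

# CONF_DIR is an assumed override so the check can be tried on a copy.
CONF_DIR=${CONF_DIR:-/etc/spamassassin}

if has_active_line "$CONF_DIR/FuzzyOcr.cf" "loadplugin FuzzyOcr"; then
    echo "FuzzyOcr.cf still contains an active loadplugin line"
fi
has_active_line "$CONF_DIR/FuzzyOcr.cf" "focr_pre314 1" \
    || echo "focr_pre314 is not set to 1"
has_active_line "$CONF_DIR/FuzzyOcr.cf" "focr_logfile /var/log/FuzzyOcr.log" \
    || echo "focr_logfile is not set to /var/log/FuzzyOcr.log"
has_active_line "$CONF_DIR/v312.pre" \
    "loadplugin FuzzyOcr /usr/local/lib/site_perl/FuzzyOcr.pm" \
    || echo "v312.pre does not load FuzzyOcr.pm"
```

No output means all four checks passed. Running spamassassin --lint afterwards is a good way to catch syntax errors in the configuration as a whole.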
To test the FuzzyOcr plugin, you can use the image spam sample messages in FuzzyOcr-2.3b/samples:
[denis@sun:test]$ spamassassin -t FuzzyOcr-2.3b/samples/animated-gif.eml
...
...
Content analysis details: (24.4 points, 5.0 required)
pts rule name description
---- ---------------------- --------------------------------------------------
0.8 EXTRA_MPART_TYPE Header has extraneous Content-type:...type= entry
0.7 DATE_IN_PAST_06_12 Date: is 6 to 12 hours before Received: date
2.8 TVD_FW_GRAPHIC_ID1 BODY: TVD_FW_GRAPHIC_ID1
0.0 HTML_MESSAGE BODY: HTML included in message
20 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"alert" in 4 lines
"charts" in 1 lines
"symbol" in 1 lines
"alert" in 4 lines
"stock" in 2 lines
"company" in 3 lines
"trade" in 1 lines
"meridia" in 1 lines
"growth" in 1 lines
(18 word occurrences found)
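To run the same test over several samples at once, the FUZZY_OCR score can be pulled out of the report with awk. This is a rough sketch: fuzzy_ocr_score is a hypothetical helper name, it assumes the report layout shown above (score in the first column, rule name in the second), and it assumes the other sample files also end in .eml.

```shell
#!/bin/sh
# Read a spamassassin -t report on stdin and print the score of the
# FUZZY_OCR rule. Exits non-zero if the rule did not fire.
# (fuzzy_ocr_score is a hypothetical helper, not part of FuzzyOcr.)
fuzzy_ocr_score() {
    awk '$2 == "FUZZY_OCR" { print $1; found = 1; exit }
         END { exit !found }'
}

# Assumed sample location and extension, taken from the example above.
for sample in FuzzyOcr-2.3b/samples/*.eml; do
    [ -e "$sample" ] || continue
    score=$(spamassassin -t "$sample" 2>/dev/null | fuzzy_ocr_score) \
        || score="no hit"
    echo "$sample: $score"
done
```

A sample that prints "no hit" is a hint that either the OCR helpers are not working or the word list needs tuning for that kind of image.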

Cool, thanks to this I managed to fix my fuzzyocr setup :-)
Thanks. However, I am using the new branch. v2.3 is deprecated.