Friday, September 2, 2005

Bayesian filters and infrequent senders

Over 90% of the email I get now is spam. Like many, I've turned to Bayesian filtering to help catch and shunt it aside, but lately I've been thinking about infrequent public correspondents. To wit: Paypal, eBay, and so forth.

At this moment I have 11 messages in my inbox purporting to be from Paypal.  They are all, every one, phishing attempts. By my way of thinking, that's spam.

If I classify them as spam, though, what happens? With no Ham (that is, not spam) from Paypal, how does the learning filter figure out that only a certain class of apparent Paypal messages is Spam? Say, the ones containing a 'Click Here' link to a non-Paypal IP address? It seems to me that the filter might just as easily decide that 'From: Paypal' is the culprit. Or emails with <img> links to Paypal logos. When it comes right down to it, the vast majority of the content of this email is quite reasonable, and Paypal could very well send an email that looks a whole lot like this. Which is, of course, the point of a phishing attempt. But my filter doesn't necessarily know this.
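
To make the worry concrete, here is a toy sketch in Python. It is not how my filter (or any particular filter) actually works: the tokens, the tiny corpus, and the names `train` and `token_spamminess` are all made up for illustration, and the scoring is a simple smoothed per-token ratio assuming spam and ham are equally likely up front.

```python
from collections import Counter

def train(messages):
    """Count, per token, how many spam and how many ham messages contain it."""
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for tokens, is_spam in messages:
        if is_spam:
            n_spam += 1
            spam_counts.update(set(tokens))
        else:
            n_ham += 1
            ham_counts.update(set(tokens))
    return spam_counts, ham_counts, n_spam, n_ham

def token_spamminess(token, model):
    """Estimate P(spam | token), with add-one smoothing and an even prior."""
    spam_counts, ham_counts, n_spam, n_ham = model
    p_given_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_given_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_given_spam / (p_given_spam + p_given_ham)

# Hypothetical training data: three phishes "from Paypal", three real mails.
# There is no genuine Paypal mail in the ham pile, which is exactly my situation.
corpus = [
    (["from:paypal", "click", "here", "account"], True),
    (["from:paypal", "verify", "account", "click"], True),
    (["from:paypal", "logo", "account", "suspended"], True),
    (["from:alice", "lunch", "account"], False),
    (["from:bob", "meeting", "here"], False),
    (["from:alice", "click", "photos"], False),
]

model = train(corpus)
for token in ("from:paypal", "account", "click"):
    print(f"{token}: {token_spamminess(token, model):.2f}")
```

In this toy corpus the sender token ends up as the single strongest spam indicator, ahead of 'click', simply because the filter has never seen it in Ham.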

Next month, or next year, when Paypal really sends me an email, will I see it?

What's interesting to me about all this is: what should, can or will Paypal do about it? Skipping the true root cause, that jerks and criminals are sending these phish attempts, as unfixable, the second cause is that I have no Ham from Paypal. The friendly folks at Paypal might reasonably think to send some real messages then, in essence to keep our Bayesian filters 'honest'.  That, of course, requires me to read them, and to ensure the filters know they're not spam or phish attempts. AKA, manual effort on my part.

Another solution might be for me to stop classifying phishes as spam ... but then I'll want another solution to catch phishes.

Or the folks writing trainable (Bayesian or otherwise) filters might figure out a way for me to tell the learning algorithm which bits of any message to disregard.
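
I don't know of a filter that offers this, so what follows is only a sketch of what I have in mind. Everything in it is hypothetical: the `ignore` parameter, the function names, the tokens, and the toy corpus. The scoring is a plain naive Bayes log-odds sum with add-one smoothing and an even prior.

```python
import math
from collections import Counter

def train(messages):
    """Count, per token, how many spam and how many ham messages contain it."""
    spam, ham = Counter(), Counter()
    n_spam = n_ham = 0
    for tokens, is_spam in messages:
        if is_spam:
            n_spam += 1
            spam.update(set(tokens))
        else:
            n_ham += 1
            ham.update(set(tokens))
    return spam, ham, n_spam, n_ham

def spam_probability(tokens, model, ignore=frozenset()):
    """Combine per-token evidence, skipping tokens the user said to disregard."""
    spam, ham, n_spam, n_ham = model
    log_odds = 0.0  # even prior: spam and ham assumed equally likely
    for token in set(tokens) - set(ignore):
        p_given_spam = (spam[token] + 1) / (n_spam + 2)
        p_given_ham = (ham[token] + 1) / (n_ham + 2)
        log_odds += math.log(p_given_spam / p_given_ham)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical training data: every Paypal-looking message so far was a phish.
corpus = [
    (["from:paypal", "logo", "verify", "account", "click"], True),
    (["from:paypal", "logo", "account", "suspended", "click"], True),
    (["from:paypal", "logo", "confirm", "password"], True),
    (["from:alice", "lunch", "friday"], False),
    (["from:bob", "receipt", "payment"], False),
    (["from:alice", "photos", "attached"], False),
]
model = train(corpus)

# A (made-up) genuine Paypal message arriving later.
genuine = ["from:paypal", "logo", "receipt", "payment"]

as_is = spam_probability(genuine, model)
with_ignore = spam_probability(genuine, model, ignore={"from:paypal", "logo"})
print(f"scored as-is: {as_is:.2f}")
print(f"sender and logo disregarded: {with_ignore:.2f}")
```

With the sender and logo tokens disregarded, the genuine message is judged on the rest of its content and scores well below the as-is result. The catch is the one I keep coming back to: someone has to decide what goes in the ignore list, and that someone is me.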

None of these are great solutions - all require more manual intervention than I appreciate.
