SpamFlow FAQ

When we present SpamFlow and related research, many of the same questions are raised. To address these concerns, we provide this list of frequently asked questions.

Q: You're disenfranchising distant servers!
In our dataset, RTT is indeed one of the strongest indicators of a remote spam source. In many cases, however, this bias may be desirable: we envision SpamFlow being customized on a per-user, per-network basis in the same way current content filters are tailored. A North American user who rarely receives email from China may in fact wish to bias against that email, particularly if, for example, content analysis also flags the message as spam. Thus, SpamFlow need not be the sole determinant of mail validity. Further, RTT is but one feature. In our dataset, the minimum congestion window also figures strongly in the discriminant function. A filter trained for a user who typically receives valid email from remote sources will rely on properties of the TCP flow other than RTT for differentiation.
Q: Can SpamFlow be more conservative in using RTT?
Yes. Approximately 5% of the spam flows we examine have an initial RTT greater than a full second, longer than even the expected latency of a satellite link or trans-oceanic crossing. Even a highly conservative filter can therefore use RTT to eliminate these flows with extremely large RTTs.
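
As a rough illustration of such a conservative rule, the following Python sketch flags only flows whose initial RTT exceeds one second. The function name, the threshold constant, and the sample RTT values are hypothetical and are not taken from the SpamFlow implementation.

```python
# Hypothetical cutoff: one full second, matching the observation above.
RTT_THRESHOLD_S = 1.0


def is_extreme_rtt(initial_rtt_s, threshold_s=RTT_THRESHOLD_S):
    """Return True only when a flow's initial RTT is implausibly large."""
    return initial_rtt_s > threshold_s


# Made-up initial RTTs (in seconds) for three flows.
sample_rtts = [0.08, 0.60, 1.70]
flagged = [rtt for rtt in sample_rtts if is_extreme_rtt(rtt)]
print(flagged)
```

A rule like this leaves ordinary long-distance mail untouched, including mail sent over satellite or trans-oceanic paths, and acts only on the extreme tail.
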
Q: Doesn't SpamFlow privilege well-connected senders?
Only insofar as SpamFlow detects poorly connected servers attempting to send large volumes of mail. Personal, home, or small-business servers do not have the same volume requirements as spammers and thus are unlikely to induce the TCP congestion effects we observe. In practice, SpamFlow rests on a value judgment that makes it practical and reasonable. Specifically, users who wish to ensure that their emails are delivered typically invest in suitable infrastructure, contract with an outside provider, or use their service provider's email systems. Companies do not source large amounts of crucial email from hosts attached by consumer-grade connections. The vast majority of home users use their provider's email infrastructure or popular web-based services. Thus, SpamFlow only discriminates against sources that are both poorly connected and injecting large volumes of mail.
Q: What about email lists?
In contrast to spam, which must be sent continually, email list traffic can be scheduled so that it does not cause local congestion. For instance, even a 64 kbps link (very slow by current standards) can support hundreds of serial recipients every five minutes for 10 KB messages.
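
The arithmetic behind this estimate can be checked with a short Python sketch. It assumes decimal units (1 kbps = 1000 bits per second, 1 KB = 1000 bytes) and ignores SMTP and TCP overhead, so it gives an upper bound; the function and constant names are illustrative only.

```python
LINK_RATE_BPS = 64_000      # 64 kbps link
MESSAGE_BYTES = 10 * 1000   # 10 KB message, assuming decimal kilobytes
WINDOW_SECONDS = 5 * 60     # five-minute window


def messages_per_window(link_bps, message_bytes, window_s):
    """Upper bound on serial deliveries, ignoring protocol overhead."""
    bytes_per_window = link_bps * window_s // 8  # bits to bytes
    return bytes_per_window // message_bytes


capacity = messages_per_window(LINK_RATE_BPS, MESSAGE_BYTES, WINDOW_SECONDS)
print(capacity)  # 240
```

Protocol overhead lowers the real figure somewhat, but it stays in the hundreds.
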
Q: Are there other transport features you can use?
Yes; this is the subject of active research.
Q: Your sample size, particularly for the ham messages, is too small.
Our CEAS paper is geared toward presenting the method and providing a first-order analysis of its viability. Our current research is examining the viability of SpamFlow in practice at large scale.
Q: Isn't your false positive rate too high?
Our current results exhibit a higher than desired false positive rate, largely due to the disproportionate number of spam messages in our training set. In future research, we plan to obtain a much larger quantity of legitimate mail in order to balance the training set and better train the classifier. However, the existing system is still highly usable as a component, or vote, in an overall system that also uses other mail evaluation techniques such as content filters.
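
One way to picture SpamFlow as a single vote among several is a weighted average of per-filter spam scores. This is only a minimal sketch: the filter names, weights, scores, and the 0.5 decision threshold are hypothetical, and the source does not prescribe any particular combination rule.

```python
def combined_spam_score(votes, weights):
    """Weighted average of per-filter spam scores, each in [0, 1]."""
    total_weight = sum(weights[name] for name in votes)
    weighted_sum = sum(weights[name] * score for name, score in votes.items())
    return weighted_sum / total_weight


# Hypothetical weights: the content filter is trusted twice as much.
weights = {"spamflow": 1.0, "content": 2.0}

# Made-up scores: the transport-level vote says spam, the content filter disagrees.
votes = {"spamflow": 0.9, "content": 0.2}

score = combined_spam_score(votes, weights)
is_spam = score >= 0.5  # hypothetical decision threshold
print(round(score, 3), is_spam)
```

In this made-up case the content filter outweighs a likely false positive from the transport-level vote, so the message is delivered.
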
Q: How do you anticipate spammers will react to SpamFlow?
We believe SpamFlow addresses spam at a different layer of abstraction than existing solutions, one at which spammers cannot easily defend themselves. To reduce congestion or other tell-tale signs within their traffic stream, spammers must reduce their sending rate, distribute their sources more widely, or obtain better-connected hosts. All three potential solutions are expensive in real economic terms that matter to spammers. Even if spammers schedule their traffic to ensure that their flows do not self-interfere and cause resource contention, the resulting reduction in spam volume is beneficial. Therefore, our hope is that SpamFlow proves difficult to subvert.
