SpamFlow Frequently Asked Questions
When we present SpamFlow and related research, many of the same
questions come up. To address these concerns, we
provide this frequently asked questions subsection.
- Q: You're disenfranchising distant servers!
In our dataset, RTT is indeed one of the strongest indicators of a
remote spam source. In many cases, however, a bias against distant
servers may be desirable; we
envision SpamFlow being customized on a per-user, per-network basis in
the same way current content filters are tailored. A North American
user who rarely receives email from China may in fact wish to bias
against that email, particularly if, for example, content analysis
also flags the message as spam. Thus, SpamFlow need not be the sole
determinant of mail validity.
Further, RTT is but one feature. In our dataset, the minimum
congestion window also figures strongly into the discriminant
function. For a user who typically receives valid email from remote
sources, the filter can rely on properties of the TCP flow other than
RTT to tell spam from legitimate mail.
- Q: Can SpamFlow be more conservative in using RTT?
Note that approximately 5% of the spam flows we examine have an
initial RTT greater than a full second, longer than even the
expected latency from a satellite link or trans-oceanic crossing.
Even a highly conservative filter can still leverage RTT to
eliminate these spam flows with extremely large RTTs.
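As a rough illustration of such a conservative rule, the sketch below flags only flows whose initial RTT exceeds one second, the figure mentioned above. The function name, the default cutoff parameter, and the sample RTT values are hypothetical and chosen for illustration; this is not the SpamFlow classifier itself.

```python
def exceeds_rtt_cutoff(initial_rtt_seconds, cutoff_seconds=1.0):
    """Return True when a flow's initial RTT is above the conservative cutoff."""
    return initial_rtt_seconds > cutoff_seconds


# Hypothetical initial RTT measurements in seconds, for illustration only.
sample_rtts = [0.08, 0.35, 0.70, 1.40, 2.10]

# Only the flows slower than the cutoff are flagged; all others are left
# for other features (or other filters) to judge.
flagged = [rtt for rtt in sample_rtts if exceeds_rtt_cutoff(rtt)]
print(flagged)
```

A rule like this would presumably act as one conservative input among several rather than as a verdict on its own.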
- Q: Doesn't SpamFlow privilege well-connected senders?
Only insofar as SpamFlow will detect poorly connected servers
attempting to send large volumes of mail. Personal, home, or small
business servers do not have the same volume requirements as spammers
and thus are unlikely to induce the same TCP congestion effects we
observe.
In reality, there is a value judgment that makes SpamFlow
practical and reasonable. Specifically, users who wish to ensure
that their emails are delivered typically invest in suitable
infrastructure, contract with an outside provider or use their
service provider's email systems. Companies are not sourcing
large amounts of crucial email from hosts attached by consumer-grade
connections. The vast majority of home users utilize their
provider's email infrastructure or employ popular web-based
services. Thus, SpamFlow only discriminates against sources
that are both poorly connected and injecting
large volumes of mail.
- Q: What about email lists?
In contrast to spam, which must be sent continually, email
list traffic can be scheduled so that it does not cause local congestion.
For instance, even a 64 kbps link (very slow in current terms) can
deliver 10 KB messages to hundreds of recipients, one after another,
every five minutes.
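The arithmetic behind this estimate can be checked with a short sketch. It assumes 1 kbps = 1000 bits per second and 1 KB = 1000 bytes, sends one message per recipient, and ignores SMTP, TCP, and IP overhead, so the result is an upper bound. The function name is hypothetical.

```python
def messages_per_window(link_bps, message_bytes, window_seconds):
    """Upper bound on messages sent back to back, ignoring protocol overhead."""
    bits_available = link_bps * window_seconds
    bits_per_message = message_bytes * 8
    return bits_available // bits_per_message


# 64 kbps link, 10 KB messages, five-minute window.
count = messages_per_window(64_000, 10 * 1000, 5 * 60)
print(count)  # 240
```

Real throughput would be somewhat lower once protocol overhead is counted, but the estimate stays in the hundreds.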
- Q: Are there other transport features you can use?
Yes; this is the subject of active research.
- Q: Your sample size, particularly for the ham messages, is too small.
Our CEAS paper is geared toward presenting the method and
providing a first-order analysis of its viability. Our current
research is examining the viability of SpamFlow in
practice at large scale.
- Q: Isn't your false positive rate too high?
Our current results exhibit a higher than desired
false positive rate, largely due to the disproportionate number
of spam messages in our training set. In future research, we plan
to obtain a much larger quantity of legitimate mail in order
to balance the training set and better train the classifier.
However, the existing system is still highly usable as a
component, or vote, in an overall system that also uses other
mail evaluation techniques such as content filters.
- Q: How do you anticipate spammers will react to SpamFlow?
We believe SpamFlow addresses spam at a different layer of
abstraction than existing solutions, one at which spammers cannot
easily adapt. To reduce congestion or other telltale signs
within their traffic stream, spammers must either reduce their
sending rate, distribute their sources more widely, or obtain
better-connected hosts. All three potential solutions are
expensive in real economic terms that matter to spammers. Even
if spammers schedule their traffic to ensure that their flows do
not self-interfere and cause resource contention, the resulting
reduction in spam volume is beneficial. Therefore, our hope is that
SpamFlow proves to be difficult to subvert.