[Pmwiki-users] Wiki-Spammers 1 - captchas

Mon Jan 10 14:58:07 CST 2005

On Mon, Jan 10, 2005 at 05:11:18PM +0100, Anselm R. Garbe wrote:
> On Mon, 10 Jan 2005 15:41:57 +0100, Thomas -Balu- Walter
> <list+pmwiki-users at b-a-l-u.de> wrote:
> > Another idea: having the visitors to enter a code that is randomly
> > generated inside a graphic if they add content with more than x external
> > links?
> 
> I vote for doing this (as an option) on _every_ edit regardless of the
> content. That would make the life of spammers much more complicated
> and safe me a lot of time, because http://wmi.modprobe.de gets spammed
> also every day by bob (but his script/whatever only spams known
> pages).

[This is the first of two messages about wikispam; this message deals
with captcha anti-bot systems, while the next will talk about wikispammer
motives.  --Pm]

I've looked a bit more at captcha systems (i.e., systems that require
someone to enter a random code inserted in a graphic), and my impression
at the moment is that we may soon discover that spammers can easily
circumvent those as well.  This isn't to say they aren't useful, or
that they can't/won't be implemented in PmWiki -- it's just that I think
they'll end up with a limited shelf-life.  Here's why...

A captcha test is a program that can generate and grade tests that
(1) most humans can pass, and (2) current computer programs can't pass 
(see [1]).  A common captcha test in use today is optical character 
recognition (OCR), which has traditionally been thought of as something that 
humans can easily do but is difficult to do computationally (i.e., it's a
"hard AI problem").  The theory is that since computer programs still
have trouble with OCR, it's an effective way to filter bots from humans.
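
To make the "generate and grade" idea concrete, here is a minimal sketch
in Python.  The names (ALPHABET, generate_challenge, grade) are made up
for illustration and are not part of PmWiki or any captcha library.  It
leaves out the step that actually matters for security, which is
rendering the code as a distorted image.

```python
import secrets
import string

# Hypothetical alphabet and code length, chosen only for illustration.
ALPHABET = string.ascii_uppercase + string.digits


def generate_challenge(length=5):
    """Return a random code; a real captcha would render it as a distorted image."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def grade(expected, response):
    """Pass only if the response matches the code, ignoring case and outer whitespace."""
    return response.strip().upper() == expected.upper()


if __name__ == "__main__":
    code = generate_challenge()
    print(code, grade(code, code.lower()))
```

The server keeps the expected code (for example in the session) and only
ever shows the visitor the image, so the whole test rests on how hard the
image is for a program to read.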

Unfortunately, we're discovering that OCR in this context may not be all
that hard, as illustrated by Mori and Malik in "Breaking a 
Visual CAPTCHA" [2].  It's only a matter of time before spammers have
these tools at their disposal.

More fundamentally, I think the basic premise behind image captchas fails to 
recognize that OCR and spammers have different goals.  Being able to
pick out characters on a page isn't all that difficult to do -- what is
difficult is doing OCR *with a high degree of accuracy* over large
amounts of (usually scanned) text.  Thus, while someone using OCR for
capturing English text would consider 70% accuracy to be completely 
inadequate, a spammer who can get 70% accuracy guessing the codes to 
post pages would see no problem with it.  
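
A rough back-of-the-envelope calculation shows why.  The sketch below
assumes each attempt gets a fresh captcha and succeeds independently with
probability p; both function names are made up for illustration.

```python
def success_within(p, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts


def expected_attempts(p):
    """Average number of tries needed for one success (geometric distribution)."""
    return 1 / p


if __name__ == "__main__":
    p = 0.7  # the 70% figure used above
    for n in (1, 2, 3):
        print(n, round(success_within(p, n), 3))
    print(round(expected_attempts(p), 2))
```

Under those assumptions a bot with p = 0.7 gets through within three
tries about 97% of the time, and needs about 1.43 attempts per
successful post on average.  For a script, that is a negligible cost.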

In order for OCR-based captcha to work in the long run, the field
will have to:
  1.  Develop ways of obfuscating text such that humans can continue to
      recognize the text while OCR programs get a much smaller success rate
      than we've had in the past (especially for short sequences of
      characters).  *This* is a difficult problem -- a leading contender
      at the moment is BaffleText from Xerox PARC [3], but even here
      one can see that it's easy to obscure words to the point that
      humans can't pass the test.
  2.  Incorporate auto-blocking components with captcha that automatically 
      block posts from a source IP address after a certain number of failed
      attempts.  But we have to be careful not to block humans who have
      just received a sequence of undecipherable captchas.
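
As a sketch of what item 2 might look like, the Python below counts
failed attempts per source IP inside a sliding time window.  The class
name, the threshold of 5 failures, and the 10-minute window are all
assumptions picked for illustration, not values from PmWiki or any
existing tool.  Letting old failures expire is one possible way to avoid
permanently locking out a human who hit a run of unreadable captchas.

```python
import time
from collections import deque


class FailureTracker:
    """Track recent captcha failures per IP (hypothetical helper)."""

    def __init__(self, max_failures=5, window_seconds=600):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = {}  # ip -> deque of failure timestamps

    def _prune(self, ip, now):
        """Drop failures older than the window; forget IPs with none left."""
        q = self._failures.get(ip)
        if q is None:
            return
        while q and now - q[0] > self.window_seconds:
            q.popleft()
        if not q:
            del self._failures[ip]

    def record_failure(self, ip, now=None):
        now = time.time() if now is None else now
        self._prune(ip, now)
        self._failures.setdefault(ip, deque()).append(now)

    def record_success(self, ip):
        """A correct answer clears the IP's failure history."""
        self._failures.pop(ip, None)

    def is_blocked(self, ip, now=None):
        now = time.time() if now is None else now
        self._prune(ip, now)
        return len(self._failures.get(ip, ())) >= self.max_failures
```

An edit handler would call is_blocked() before accepting a post and
record_failure() or record_success() after grading the captcha.  Note
that this does nothing against a spammer who spreads attempts across
many IP addresses.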

Finally, I don't know that anyone has a good solution to the problem
of spammers that simply enlist humans to unwittingly solve captchas on 
their behalf (e.g., in the process of gaining access to another site
such as one containing pornography).  This alone seems to me to be a 
highly effective mechanism for defeating captchas...

Anyway, this isn't to say that it won't be useful to make captcha
features available in PmWiki -- it's just that at the moment I don't 
think they're going to be of much help in the long run (i.e., a year
from now).

More next message...

Pm

References:

1.  http://www.captcha.net/
2.  http://www.cs.berkeley.edu/~mori/gimpy/gimpy.html
3.  http://www.parc.xerox.com/research/istl/projects/captcha/