kal Forum Administrator
Joined: 06 Mar 2006 Posts: 17850 Location: Ottawa, Canada
TV/Projector: JVC DLA-NZ7
|
Link Posted: Wed Apr 07, 2010 8:11 pm Post subject: |
|
|
I've switched our CAPTCHA to a better one called "reCAPTCHA". It's more secure yet seems easier to read, has a refresh button in case the words are hard to make out, and offers an audio playback option. See: http://recaptcha.net/
It's the same one that AVS uses. reCAPTCHA is owned by Google!
Best part is that it's helping digitize old books. The words are scanned out of old books that OCR software couldn't read, and by having people type them in, those words get transcribed! (Pretty cool if you ask me).
Quote: |
Teaching computers to read: Google acquires reCAPTCHA
9/16/2009 09:20:00 AM
The image above is a CAPTCHA — you can read it, but computers have a harder time interpreting the letters. We tried to make it hard for computers to recognize because we wanted to give humans the scoop first, but we're happy to announce to everybody now that Google has acquired reCAPTCHA, a company that provides CAPTCHAs to help protect more than 100,000 websites from spam and fraud.
Since computers have trouble reading squiggly words like these, CAPTCHAs are designed to allow humans in but prevent malicious programs from scalping tickets or obtain millions of email accounts for spamming. But there’s a twist — the words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books. Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text.
In this way, reCAPTCHA’s unique technology improves the process that converts scanned images into plain text, known as Optical Character Recognition (OCR). This technology also powers large scale text scanning projects like Google Books and Google News Archive Search. Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users. So we'll be applying the technology within Google not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process.
That's why we're excited to welcome the reCAPTCHA team to Google, and we're committed to delivering the same high level of performance that websites using reCAPTCHA have come to expect. Improving the availability and accessibility of all the information on the Internet is really important to us, so we're looking forward to advancing this technology with the reCAPTCHA team.
Posted by Luis von Ahn, co-founder of reCAPTCHA, and Will Cathcart, Google Product Manager
|
Kal
_________________
Support our site by using our affiliate links. We thank you!
My basement/HT/bar/brewery build 2.0
|
kschmit2
Joined: 09 Mar 2006 Posts: 1141 Location: Heidelberg, Germany
|
Link Posted: Thu Apr 08, 2010 7:56 am Post subject: |
|
|
How exactly does that help them recognize computer-illegible text?
If a computer cannot decipher the word, then it cannot be used as a captcha, because the computer won't know if what you typed is actually what can be seen on screen.
Captchas only work when the computer already knows the answer.
|
huggy
Joined: 02 Aug 2008 Posts: 927 Location: Melbourne,Australia
|
Link Posted: Thu Apr 08, 2010 8:17 am Post subject: |
|
|
Kal
The 5 post thingy is great. May I suggest that, to avoid new users clogging up threads with "post count" posts, you start a new "sticky" thread just for that purpose?
This is how it's done in our local DTV forum and works well.
Dave
|
AnalogRocks Forum Moderator
Joined: 08 Mar 2006 Posts: 26690 Location: Toronto, Ontario, Canada
TV/Projector: Sony 1252Q, AMPRO 4000G
|
Link Posted: Thu Apr 08, 2010 1:34 pm Post subject: |
|
|
huggy wrote: | Kal
The 5 post thingy is great. May I suggest that, to avoid new users clogging up threads with "post count" posts, you start a new "sticky" thread just for that purpose?
This is how it's done in our local DTV forum and works well.
Dave |
Can you elaborate on this?
_________________ Tech support for nothing
CRT.
HD done right!
|
kal Forum Administrator
Joined: 06 Mar 2006 Posts: 17850 Location: Ottawa, Canada
TV/Projector: JVC DLA-NZ7
|
Link Posted: Thu Apr 08, 2010 2:10 pm Post subject: |
|
|
kschmit2 wrote: | How exactly does that help them recognize computer-illegible text?
If a computer cannot decipher the word, then it cannot be used as a captcha, because the computer won't know if what you typed is actually what can be seen on screen.
Captchas only work when the computer already knows the answer. |
Good point. I don't know! It does seem like a catch-22, doesn't it? Something here obviously works (it is Google after all...). We're just missing the piece of the puzzle that explains how...
OK, here's the explanation:
Quote: | Our apparatus, called “reCAPTCHA,” is used by more than 40,000 Web sites (6) and demonstrates that old print material can be transcribed, word by word, by having people solve CAPTCHAs throughout the World Wide Web. Whereas standard CAPTCHAs display images of random characters rendered by a computer, reCAPTCHA displays words taken from scanned texts. The solutions entered by humans are used to improve the digitization process. To increase efficiency and security, only the words that automated OCR programs cannot recognize are sent to humans. However, to meet the goal of a CAPTCHA (differentiating between humans and computers), the system needs to be able to verify the user’s answer. To do this, reCAPTCHA gives the user two words, the one for which the answer is not known and a second “control” word for which the answer is known. If users correctly type the control word, the system assumes they are human and gains confidence that they also typed the other word correctly (Fig. 1). We describe the exact process below.
We start with an image of a scanned page. Two different OCR programs analyze the image; their respective outputs are then aligned with each other by standard string matching algorithms (7) and compared to each other and to an English dictionary. Any word that is deciphered differently by both OCR programs or that is not in the English dictionary is marked as “suspicious.” These are typically the words that the OCR programs failed to decipher correctly. According to our analysis, about 96% of these suspicious words are recognized incorrectly by at least one of the OCR programs; conversely, 99.74% of the words not marked as suspicious are deciphered correctly by both programs. Each suspicious word is then placed in an image along with another word for which the answer is already known, the two words are distorted further to ensure that automated programs cannot decipher them, and the resulting image is used as a CAPTCHA. Users are asked to type both words correctly before being allowed through. We refer to the word whose answer is already known as the “control word” and to the new word as the “unknown word.” Each reCAPTCHA challenge, then, has an unknown word and a control word, presented in random order. To lower the probability of automated programs randomly guessing the correct answer, the control words are normalized in frequency; for example, the more common word “today” and the less common word “abridged” have the same probability of being served. The vocabulary of control words contains more than 100,000 items, so a program that randomly guesses a word would only succeed 1/100,000 of the time (8). Additionally, only words that both OCR programs failed to recognize are used as control words. Thus, any program that can recognize these words with nonnegligible probability would represent an improvement over state of the art OCR programs.
To account for human error in the digitization process, reCAPTCHA sends every suspicious word to multiple users, each time with a different random distortion. At first, it is displayed as an unknown word. If a user enters the correct answer to the associated control word, the user’s other answer is recorded as a plausible guess for the unknown word. If the first three human guesses match each other, but differ from both of the OCRs’ guesses, then (and only then) the word becomes a control word in other challenges. In case of discrepancies among human answers, reCAPTCHA sends the word to more humans as an “unknown word” and picks the answer with the highest number of “votes,” where each human answer counts as one vote and each OCR guess counts as one half of a vote (recall that these words all have been previously processed by OCR). In practice, these weights seem to yield the best results, though our accuracy is not very sensitive to them (as long as more weight is given to human guesses than OCR guesses). A guess must obtain at least 2.5 votes before it is chosen as the correct spelling of the word for the digitization process. Hence, if the first two human guesses match each other and one of the OCRs, they are considered a correct answer; if the first three guesses match each other but do not match either of the OCRs, they are considered a correct answer, and the word becomes a control word. To account for words that are unreadable, reCAPTCHA has a button that allows users to request a new pair of words. When six users reject a word before any correct spelling is chosen, the word is discarded as unreadable. After all suspicious words in a text have been deciphered, we apply a post-processing step because human users make a variety of predictable mistakes (see supporting online text). From analysis of our data, 67.87% of the words required only two human responses to be considered correct, 17.86% required three, 7.10% required four, 3.11% required five, and only 4.06% required six or more (this includes words discarded as unreadable). |
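To make the "suspicious word" step from the quote concrete, here is a rough Python sketch. It is only an illustration of the rule as quoted, not reCAPTCHA's actual code: the function name `mark_suspicious`, the tiny dictionary, and the sample OCR outputs are all made up, and it assumes the two OCR outputs have already been aligned word for word (the paper does that with string matching, which is skipped here).

```python
def mark_suspicious(ocr_a, ocr_b, dictionary):
    """Return the positions of words that should be sent to humans.

    ocr_a, ocr_b: word lists from two OCR programs, assumed to be
                  already aligned so that position i refers to the same
                  word on the page (hypothetical input format).
    dictionary:   set of known lowercase English words.
    """
    if len(ocr_a) != len(ocr_b):
        raise ValueError("OCR outputs must already be aligned word for word")

    suspicious = []
    for i, (a, b) in enumerate(zip(ocr_a, ocr_b)):
        # Rule 1: the two OCR programs read the word differently.
        disagree = a != b
        # Rule 2: the reading is not a dictionary word.
        not_in_dictionary = (
            a.lower() not in dictionary or b.lower() not in dictionary
        )
        if disagree or not_in_dictionary:
            suspicious.append(i)
    return suspicious


# Made-up example data.
dictionary = {"the", "quick", "brown", "fox"}
ocr_a = ["the", "qu1ck", "brown", "f0x"]
ocr_b = ["the", "quick", "brown", "f0x"]

# Position 1: the programs disagree.
# Position 3: they agree, but "f0x" is not in the dictionary.
print(mark_suspicious(ocr_a, ocr_b, dictionary))
```

Only the flagged positions would be cut out of the page image and shown to people as "unknown words"; everything else is trusted as read by the OCR.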
Read the whole article here: http://recaptcha.net/reCAPTCHA_Science.pdf
It's quite ingenious because not only is it a more secure CAPTCHA, it can also be used for good. This is probably why the PhDs at Google took note and bought the company.
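For anyone who wants to see the voting rule spelled out, here is a small Python sketch of it as I read the excerpt: one vote per human answer, half a vote per OCR guess, and 2.5 votes needed to accept a spelling. The names (`resolve_word`, `passed_control`) and the sample words are my own assumptions, not anything from reCAPTCHA itself, and the "six users reject the word" rule is left out.

```python
from collections import Counter

HUMAN_VOTE = 1.0   # each human answer counts as one vote
OCR_VOTE = 0.5     # each OCR guess counts as half a vote
THRESHOLD = 2.5    # votes needed before a spelling is accepted


def resolve_word(ocr_guesses, human_answers):
    """Tally votes for one unknown word.

    ocr_guesses:   the readings produced by the OCR programs.
    human_answers: (passed_control, guess) pairs in arrival order, where
                   passed_control says whether that user typed the
                   control word correctly (hypothetical input format).

    Returns (spelling, becomes_control_word); spelling is None while no
    guess has reached the threshold.
    """
    votes = Counter()
    for guess in ocr_guesses:
        votes[guess] += OCR_VOTE

    counted = []  # human guesses that were actually counted
    for passed_control, guess in human_answers:
        if not passed_control:
            continue  # failed the control word, so the guess is ignored
        counted.append(guess)
        votes[guess] += HUMAN_VOTE

        best, score = votes.most_common(1)[0]
        if score >= THRESHOLD:
            # The word is promoted to a control word only when the first
            # three human guesses agree and differ from every OCR guess.
            becomes_control = (
                len(counted) == 3
                and len(set(counted)) == 1
                and best not in ocr_guesses
            )
            return best, becomes_control
    return None, False


# Two humans agree with one OCR program: 2 + 0.5 = 2.5 votes.
print(resolve_word(["abridqed", "abridged"],
                   [(True, "abridged"), (True, "abridged")]))

# Three humans agree with each other but with neither OCR program.
print(resolve_word(["rnorning", "moming"],
                   [(True, "morning")] * 3))
```

The first call accepts the spelling without promoting it, and the second accepts it and marks it as a future control word, which matches the two cases the excerpt describes.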
Kal
|
AnalogRocks Forum Moderator
Joined: 08 Mar 2006 Posts: 26690 Location: Toronto, Ontario, Canada
TV/Projector: Sony 1252Q, AMPRO 4000G
|
Link Posted: Thu Apr 08, 2010 2:22 pm Post subject: |
|
|
Those annoying spammers still sign up though. I've nuked 3 posts since last night. Pricks!
|
kschmit2
Joined: 09 Mar 2006 Posts: 1141 Location: Heidelberg, Germany
|
Link Posted: Thu Apr 08, 2010 6:07 pm Post subject: |
|
|
thx Kal, interesting read.
I should have come up with that
|
garyfritz
Joined: 08 Apr 2006 Posts: 12024 Location: Fort Collins, CO
|
Link Posted: Thu Apr 08, 2010 7:02 pm Post subject: |
|
|
That is so cool!! What a brilliant use of resources.
|
ecrabb Forum Moderator
Joined: 13 Mar 2006 Posts: 15909 Location: Utah
TV/Projector: JVC RS40, Epson 5010
|
Link Posted: Thu Apr 08, 2010 7:19 pm Post subject: |
|
|
Absolutely amazing. Brilliant is right.
SC
|
huggy
Joined: 02 Aug 2008 Posts: 927 Location: Melbourne,Australia
|
Link Posted: Thu Apr 08, 2010 8:17 pm Post subject: |
|
|
AnalogRocks wrote: | huggy wrote: | Kal
The 5 post thingy is great. May I suggest that, to avoid new users clogging up threads with "post count" posts, you start a new "sticky" thread just for that purpose?
This is how it's done in our local DTV forum and works well.
Dave |
Can you elaborate on this? |
One thread with the sole purpose of letting new users get their post count up to 5; they can fill it up with whatever.
This is what I mean:
http://www.dtvforum.info/index.php?showtopic=44129&hl=red+text+thread
Dave