
As we showed in Table 4, reCaptcha is flexible and a solution is considered correct even if it contains a wrong
image along with the correct ones. As such, we configured the system to select 3 images for the solution so as to
fall within the “relaxed” limits. The Pass Challenge bars in Figure 4 represent the outcome of the attacks.
We started with a baseline measurement for the “vanilla” version of each module which selects images based
on the overlapping tags; for GRIS, this also entails using the hint provided in the challenge and the best guess
returned by the reverse image search. In general, the success rate for GRIS is limited by the number of candidate
images for which we can obtain a best guess description. For the other modules, the baseline attack selects the
3 images that have the most common (overlapping) tags with the sample image. When using the hint, best guess
and page titles, the Alchemy module passed 49.9% of the challenges, while Clarifai passed 58%. Caffe is also very
effective, solving 45.9% of the challenges. The hint has a significant effect in most cases, increasing the accuracy
by 1.5-15.5% depending on the annotation system.
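To make the overlap-based selection concrete, the following is a minimal sketch that ranks candidate images by the
number of tags they share with the sample image; the function and variable names are ours for illustration, and tags
are assumed to be plain strings as returned by an annotation service. Setting k=3 matches the “relaxed” limits, while
k=2 corresponds to the exact-solution setting discussed below.

```python
def select_by_tag_overlap(sample_tags, candidates, k=3):
    """Rank candidate images by how many tags they share with the
    sample image and return the top k (k=3 for the relaxed setting,
    k=2 when an exact solution is required).

    sample_tags: set of tag strings for the sample image
    candidates:  dict mapping an image id to its set of tag strings
    """
    ranked = sorted(candidates.items(),
                    key=lambda item: len(sample_tags & item[1]),
                    reverse=True)
    return [image_id for image_id, _ in ranked[:k]]

# Illustrative tags, as an annotation service might return them:
sample = {"wine", "bottle", "drink"}
images = {
    "img1": {"wine", "glass", "drink"},  # 2 overlapping tags
    "img2": {"car", "road"},             # 0
    "img3": {"red", "wine", "bottle"},   # 2
    "img4": {"dog", "pet", "animal"},    # 0
}
print(select_by_tag_overlap(sample, images, k=3))
```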
We explored how the attack’s accuracy is impacted by supplying the image annotation module with higher
resolution versions of the images. We were able to automatically obtain a higher resolution version of 2,909 images
from the 700 challenges. Out of those, 371 corresponded to the sample image. The high resolution images
increased the attacks’ success, with Alchemy and Clarifai passing 53.4% and 61.2% of the challenges respectively.
TDL is less accurate, achieving 45%, while Caffe increases to 49.1%.
We also measured the number of challenges our system would pass if there were no flexibility. Since in most
cases the solution consists of 2 images, we tuned the system to select 2 images for each challenge. The Exact
Solution bars in Figure 4 present the results, and we can see that all the image annotation services were quite
effective in identifying the correct images. Clarifai is the most effective as it selected the exact set of images in
40.2% of the challenges, while Alchemy reached 31.5% and Caffe 28.3%.
Tag classifier
To quantify the effectiveness of our tag classifier as part of our captcha-breaking system, we followed a 10-fold
cross-validation approach for training and testing our classifier on the dataset of 700 labelled image captchas. In our first
experiment, we skipped the other image selection steps, and relied solely on the classifier for selecting the images.
For each image, the classifier received as input the hint and the set of tags, and returned a “similarity” score; we
selected the 3 images with the highest score. Our attack provided an exact match solution for 26.28% (σ= 7.09),
and passed 44.71% (σ= 6.39) of the challenges. In the second experiment, we incorporated our classifier into our
system, and used the classifier-based selection as a replacement for the overlap-based selection of images from
the undecided set. When using the classifier, our attack’s average accuracy for Clarifai reached 66.57% (σ= 7.53),
resulting in an improvement of about 5.3%. The classifier is more effective than the overlap approach, as it identifies
specific subsets of tags that are associated with each hint, instead of the more simplistic metric of the number of
common tags. Furthermore, the use of the classifier has a negligible impact on the attack’s performance, as it
increases the duration by only ∼0.025 seconds.
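A minimal sketch of this selection step is shown below. It scores each image’s tags against the hint using pretrained
skip-gram word embeddings via gensim as a stand-in for our trained classifier; the vector file name is hypothetical.

```python
from gensim.models import KeyedVectors

# Hypothetical path to pretrained skip-gram word vectors; a stand-in
# for the classifier trained on the labelled captcha dataset.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def similarity_score(hint, tags):
    """Average embedding similarity between the challenge hint and an
    image's tags; words missing from the vocabulary are skipped."""
    if hint not in vectors:
        return 0.0
    scores = [vectors.similarity(hint, tag) for tag in tags if tag in vectors]
    return sum(scores) / len(scores) if scores else 0.0

def select_by_classifier(hint, candidates, k=3):
    """Pick the k undecided images whose tags score highest against the hint."""
    ranked = sorted(candidates.items(),
                    key=lambda item: similarity_score(hint, item[1]),
                    reverse=True)
    return [image_id for image_id, _ in ranked[:k]]
```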
Live attack
To obtain an exact measurement of our attack’s accuracy, we ran our automated captcha-breaker against reCaptcha.
We employed the Clarifai service, as it achieved the best results among the annotation services.
Labelled dataset. We created a labelled dataset to exploit the image repetition. We manually labelled 3,000
images collected from challenges, and assigned each image a tag describing its content. We selected the appropriate
tags from our hint list. We used pHash for the comparison, as it is very efficient, and allows our system to
compare all the images from a challenge to our dataset in 3.3 seconds.
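The lookup itself can be sketched as follows, using the imagehash library for pHash; the distance threshold and file
names are illustrative rather than the exact values from our implementation.

```python
import imagehash
from PIL import Image

# Illustrative labelled dataset: the pHash of each manually labelled
# image mapped to its tag; in practice the hashes are precomputed once
# over all 3,000 labelled images.
labelled = {
    imagehash.phash(Image.open(path)): tag
    for path, tag in [("wine_01.jpg", "wine"), ("tree_04.jpg", "tree")]
}

def lookup_tag(image_path, max_distance=4):
    """Return the stored tag if some labelled image is perceptually
    close (small Hamming distance between pHashes), else None."""
    h = imagehash.phash(Image.open(image_path))
    best = min(labelled, key=lambda known: h - known)
    return labelled[best] if h - best <= max_distance else None
```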
We ran our captcha-breaking system against 2,235 captchas, and obtained a 70.78% accuracy. The higher
accuracy compared to the simulated experiments is, at least partially, attributed to the image repetition; the history
module located 1,515 sample images and 385 candidate images in our labelled dataset.
Average run time. Our attack is very efficient, with an average duration of 19.2 seconds per challenge. The
most time-consuming phase is running GRIS, as it searches Google for all the images and processes the results,
including the extraction of links that point to higher-resolution versions of the images.
Offline mode. We also evaluated our attack in an offline mode, where we did not use any online annotation
services or Google’s reverse image search; we relied solely on the local libraries, our labelled dataset, and our
skip-gram classifier. We experimented with two local annotation libraries, NeuralTalk and Caffe; a sketch of the
local annotation step follows below. When using Caffe and our classifier, our system solved
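As an illustration of the local annotation step, the sketch below follows Caffe’s stock Python classification example;
the model files named here are Caffe’s standard reference model with hypothetical paths, not necessarily the
configuration used in our experiments.

```python
import numpy as np
import caffe

caffe.set_mode_cpu()

# Hypothetical paths to Caffe's stock reference model and ImageNet
# labels, standing in for whatever local model is deployed.
net = caffe.Classifier(
    "deploy.prototxt", "bvlc_reference_caffenet.caffemodel",
    mean=np.load("ilsvrc_2012_mean.npy").mean(1).mean(1),
    channel_swap=(2, 1, 0),  # Caffe's reference models expect BGR
    raw_scale=255,           # images load in [0,1]; model trained on [0,255]
    image_dims=(256, 256),
)
labels = [line.strip().split(" ", 1)[1] for line in open("synset_words.txt")]

def local_tags(image_path, top_n=5):
    """Return the top-n labels for an image, which can then be fed to
    the tag classifier in place of an online annotation service."""
    probs = net.predict([caffe.io.load_image(image_path)])[0]
    return [labels[i] for i in probs.argsort()[::-1][:top_n]]
```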