This is a repository copy of Using Generative Adversarial Networks to Break and Protect

Text Captchas.

White Rose Research Online URL for this paper:

http://eprints.whiterose.ac.uk/156512/

Version: Accepted Version

Article:

Ye, G, Tang, Z, Fang, D et al. (6 more authors) (2020) Using Generative Adversarial

Networks to Break and Protect Text Captchas. ACM Transactions on Privacy and Security,

23 (2). 7. ISSN 2471-2566

https://doi.org/10.1145/3378446

article published in ACM Transactions on Privacy and Security. Uploaded in accordance

with the publisher's self-archiving policy.


Using Generative Adversarial Networks to Break and

Protect Text Captchas

GUIXIN YE, ZHANYONG TANG, DINGYI FANG, Northwest University, China

ZHANXING ZHU, YANSONG FENG, Peking University, China

PENGFEI XU, XIAOJIANG CHEN, Northwest University, China

JUNGONG HAN, University of Warwick, United Kingdom

ZHENG WANG, University of Leeds, United Kingdom

Text-based CAPTCHAs remain a popular scheme for distinguishing between a legitimate human user and an automated program. This article presents a novel generic text captcha solver based on the generative adversarial network. As a departure from prior text captcha solvers that require a labor-intensive and time-consuming process to construct, our scheme needs significantly fewer real captchas but yields better performance in solving captchas. Our approach works by first learning a synthesizer to automatically generate synthetic captchas to construct a base solver. It then improves and fine-tunes the base solver using a small number of labeled real captchas. As a result, our attack requires only a small set of manually labeled captchas, which reduces the cost of launching an attack on a captcha scheme. We evaluate our scheme by applying it to 33 captcha schemes, of which 11 are currently used by 32 of the top-50 popular websites. Experimental results demonstrate that our scheme significantly outperforms four prior captcha solvers and can solve captcha schemes where others fail. As a countermeasure, we propose to add imperceptible perturbations onto a captcha image. We demonstrate that our countermeasure can greatly reduce the success rate of the attack.

CCS Concepts: • Security and privacy → Graphical / visual passwords; Authentication.

Additional Key Words and Phrases: Text captchas, Generative adversarial networks, Transfer learning, Security,

Authentication

ACM Reference Format:

Guixin Ye, Zhanyong Tang, Dingyi Fang, Zhanxing Zhu, Yansong Feng, Pengfei Xu, Xiaojiang Chen, Jungong

Han, and Zheng Wang. 2020. Using Generative Adversarial Networks to Break and Protect Text Captchas. ACM

Transactions on Privacy and Security 1, 1, Article 1 (January 2020), 30 pages. https://doi.org/10.1145/3378446

Extension of Conference Paper: a preliminary version of this article entitled "Yet Another Text Captcha Solver: A Generative

Adversarial Network Based Approach" by G. Ye et al. appeared in ACM Conference on Computer and Communications

Security, 2018 [74]. The work was partly supported by the National Natural Science Foundation of China (NSFC) through

grant agreements 61972314, 61672427, and 61872294; in part by the International Cooperation Project of Shaanxi Province

(2019KW-009) and the Ant Financial through the Ant Financial Science Funds for Security Research. Corresponding authors:

Zhanyong Tang and Zheng Wang.

Authors’ addresses: Guixin Ye, Zhanyong Tang, Dingyi Fang, Northwest University, China, gxye@stumail.nwu.edu.cn,

{zytang,dyf,xjchen,pfxu}@nwu.edu.cn; Zhanxing Zhu, Yansong Feng, Peking University, China; Pengfei Xu, Xiaojiang Chen,

Northwest University, China; Jungong Han, University of Warwick, United Kingdom, jungong.han@warwick.ac.uk; Zheng

Wang, University of Leeds, United Kingdom, z.wang5@leeds.ac.uk.

1:2 Ye, G. et al

1 INTRODUCTION

The completely automated public Turing test, or CAPTCHA for short, is often used to distinguish legitimate users from malicious bots [ ]. Captchas exist in various forms, including texts [ ], images [ ], audio [ ], video [ ], and games [ ]. Among these, the text captcha is a popular scheme and is still used by the majority of the top-50 popular websites ranked by alexa.com, including Microsoft, Google, eBay and many others.

Breaking a particular captcha scheme is rarely news today. This is a heavily studied area, and many scheme-specific captcha solvers have been proposed in the past. The seminal work presented by Mori and Malik, dating back to 2003, was among the first attempts to automatically solve text captchas [ ]. However, most of the prior attacks are specifically tuned for a few specific captcha schemes, and adapting them for a new scheme would require significant human intervention for tuning the model and collecting training data, i.e., manually labeled captcha images. Just like cryptography, text captchas are evolving and becoming more robust, and many of their advanced features make prior attacks no longer applicable [18].

By employing analytical models and algorithms, some of the more recent works have improved the generalization ability of a captcha solver [ ]. The idea behind such schemes is that a captcha solver can be tuned to target a new scheme by changing and adapting the model parameters and algorithm thresholds. These schemes, however, are only effective in solving text captchas with simple security features. Their success often relies on a good character segmentation method [ ], but the recent development of text captchas has made character segmentation more challenging by introducing advanced security features like more complex backgrounds as well as distorted and overlapping characters.

In this article, we present a new approach for building text captcha solvers. Compared to prior attacks, our approach requires significantly fewer manually labeled captchas but delivers better performance for solving a wider range of schemes. Our work is inspired and enabled by the recently proposed generative adversarial network (GAN) [ ] and its breakthrough effectiveness in image translation tasks [ ]. To construct a solver for a given captcha scheme, we first automatically learn a GAN-based captcha synthesizer using a small set of labeled real captcha images. Next, we use the learned synthesizer to automatically generate a large number of training samples without human involvement, from which we learn a base solver. We then apply transfer learning [ ] to fine-tune and improve the base solver. As a significant departure from prior attacks, our approach greatly reduces the cost and human effort in creating and tuning a captcha solver as well as the underpinning analytical models and algorithms. Our approach is generally applicable because the process for building a solver is mostly automatic and is not coupled to a specific scheme. We show that our approach can result in a highly effective solver for a large set of currently used text captcha schemes, making our attack a severe threat to text captchas.

We evaluate the proposed scheme through extensive experiments. We apply our approach to 33 text captcha schemes, 11 of which were being used by 32 of the top-50 popular websites ranked by alexa.com as of April 2019. We compare our approach to four prior captcha solvers [ ]. Experimental results show that our approach needs as few as 500 labeled captcha images, as opposed to millions [ ], to learn a successful solver. Although our approach uses significantly fewer real captchas, it gives a higher success rate. Our approach can successfully crack all tested schemes, judged by the commonly used standard [ ], and it can solve a captcha in less than 50 milliseconds using a modest desktop GPU.

1 To aid readability, we will use the acronym in lowercase thereafter.

2 In this article, the term breaking captchas refers to automatically solving the captcha challenge using a computer program, i.e., recognizing the characters of a text captcha image.


Fig. 1. Security features of current text-based captchas used in this work, illustrated on six example captchas (a) to (f). Labels 1, 2 and 3 show the anti-segmentation features (1. noisy background; 2. character overlapping; 3. occluding line) and labels 4, 5 and 6 show the anti-recognition features (4. character rotating, distortion or waving; 5. different font sizes and colors; 6. solid and hollow fonts).

As a countermeasure, we turn again to the GAN framework. We show that by inserting some imperceptible perturbations or noise into the captcha images, one can significantly decrease the effectiveness of our attack. This provides a new way to protect the popular text captcha schemes against machine-learning based attacks before a better alternative is adopted.

To sum up, this article makes the following technical contributions. It is the first to:

• employ the generative adversarial paradigm to build a successful solver for text captchas based on a small number of real captcha images;

• apply transfer learning to fine-tune a captcha solver that is learned from synthetic data;

• show how a generic learning-based approach can be applied to target a rich set of captcha schemes, which not only requires less human effort to construct but also leads to better performance over prior attacks;

• propose a new countermeasure for text captchas based on the generative adversarial paradigm.

2 BACKGROUND

In this section, we present the threat model, introduce the preliminaries of text captchas and the

GAN architecture.

2.1 Threat Model

Like many prior works, our attack employs supervised learning techniques to build a captcha solver. The quality of a machine-learned model depends on the volume and quality of the training data. In this work, we assume the adversary has access to a small number of manually labeled captcha images for the target scheme. We refer to these captchas as real captchas because they are generated by the target scheme. The real captchas can be labeled either by the attacker or by paid crowdsourcing workers. Specifically, our attack needs as few as 500 real captchas to build a successful captcha solver. Prior attacks based on machine learning models often require thousands or sometimes millions of examples to learn a good captcha solver [ ]. For example, the work presented in [ ] requires millions of captcha images to learn an effective CNN model to solve reCAPTCHA. Compared to these prior attacks, our approach incurs significantly less overhead and cost for collecting and labeling the data.

We also assume the adversary has sufficient computing power to run machine learning algorithms. In this article, we show that the learning can be performed on a typical GPU cloud server, and the learned solver can run efficiently on a modest desktop GPU.

2.2 Security Features of Text Captchas

A text captcha image often consists of distorted characters, a noisy background or occluding lines, which are referred to as security features. Without loss of generality, to make our experiments manageable, we restrict our scope to six widely used security features employed by the current text captcha schemes. They are used by the top-50 popular websites ranked by alexa.com at the time this work was conducted.

Fig. 2. Overview of our approach. We first use a small set of real captchas and the security features of the target scheme to learn a captcha generation model. The captcha generation model is then applied to automatically generate synthetic captchas (with and without confusing background patterns) to learn a pre-processing model to remove security features, e.g., noisy backgrounds and occluding lines, from the input captcha image. At the same time, the synthetic captchas (without security features) are used to train a base solver. Finally, we use a few clean real captchas (that have been processed by the preprocessing model) to fine-tune the base solver to build the final solver.

Figure 1 illustrates some of the security features targeted in this work. These include anti-segmentation and anti-recognition features. The anti-segmentation features, labeled as 1, 2 and 3 in Figure 1, aim to increase the difficulty of character segmentation. An anti-recognition feature, on the other hand, makes it difficult for a computer program to recognize the characters. This is achieved by using a variety of font styles and distorted characters, as depicted in Figure 1 with labels 4, 5 and 6. Later in Table 1, we summarize how these security features are used in different captcha schemes.

2.3 Generative Adversarial Networks

Our work is the first to apply the recently proposed GAN architecture [ ] to learn a captcha solver. A classical GAN consists of two modules. The first is a generative network for generating synthetic data, and the second is a discriminator network to filter out the synthetic examples from the real ones. To train the generative and discriminator networks, we use backpropagation [ ], a well-established training method for neural networks. During each training iteration, the generator aims to produce better synthetic samples while the discriminator would become more skilled at flagging synthetic samples. GANs have demonstrated promising results in natural language processing [42, 76] and image generation [35, 77] tasks.

3 OVERVIEW OF OUR APPROACH

Figure 2 depicts the four steps of building a captcha solver using our approach. Each of the steps

is described as follows.

1 Training data synthesis. To reduce the effort of collecting and labeling real captchas and at the same time provide sufficient training data to build an effective captcha solver, we seek ways to generate synthetic training data. We do so by learning a captcha synthesizer for a target captcha scheme (Figure 2). Our captcha synthesizer is a neural network trained under the generative adversarial paradigm. Our GAN consists of two components. The first is a captcha generation model that tries to produce captchas which are as similar as possible to the target captchas. The second is a discriminator that tries to tell the synthetic captchas apart from the real ones. This generation-discrimination process terminates when the discriminator fails to identify a large portion of the synthetic captchas. After training, we use the learned captcha generation model to automatically produce a large number of captchas together with their characters. This is described in Section 4.1.

Fig. 3. The training process of our GAN-based text captcha synthesizer.

2 Preprocessing. To assist the captcha solver, we build a preprocessing model (Figure 2) to remove as many captcha security features as possible. The preprocessing model also tries to standardize the font style by, e.g., filling hollow characters and standardizing spaces or gaps between characters. We leverage a specific GAN called Pix2Pix [ ] to build the preprocessing model. The preprocessing model is trained using solely synthetic captcha samples. Each training sample contains two captcha images: one has security features, and the other does not. We learn a preprocessing model for each captcha scheme, and the training process is fully automatic. We describe this process in more detail in Section 4.2.

3 Training the base solver. In this step, we use the preprocessed synthetic captcha images together with their corresponding character labels to learn a base solver (see Figure 2). Our base solver is a standard Convolutional Neural Network (CNN). The trained solver takes in a pre-processed captcha image and outputs the corresponding label. This is detailed in Section 4.3.

4 Fine-tuning the base solver. In this final step, we apply transfer learning to further improve the base solver (see Figure 2). Specifically, we use the set of manually labeled real captchas that we used to train the captcha synthesizer to update the weights at some network layers of the base solver. This is described in more detail in Section 4.3.

4 IMPLEMENTATION DETAILS

This section provides details on how to build a captcha synthesizer to generate synthetic training data (Section 4.1), and how to learn a preprocessing model (Section 4.2) and a captcha solver (Section 4.3) using synthetic captcha images.

4.1 Training Data Synthesis

Prior work shows that learning an effective CNN-based solver for text captchas would require as many as 2.3 million labeled training samples [ ]. Collecting and labeling such a large volume of captchas would require intensive human effort and incur significant cost. Our approach overcomes this issue by using synthetic training data. To this end, we first learn a captcha synthesizer and use the synthesizer to populate the training data with a large number of synthetic captchas which are similar to the target captchas. This allows the training dataset to cover the problem space far more finely than what could be achieved by exclusively using manually labeled real captchas.

Fig. 4(b), synthetic parameters for the Baidu scheme (security feature: on/off; value range):

Noisy background(s): [10, img.width]
Occluding lines: On; {Line, Sin, Quadratic, Bezier}
Char. overlapping: On; 3, 10]
Character set: On; [A
Font style(s): On; Solid
Font color(s): On; RGB (65, 103, 141)
Distortion: On; {[0.1, 0.2], [0.2, 0.3]}
Rotation: On; 30, 30]
Waving: Off

Fig. 4. Example synthetic captchas for the Baidu scheme. Panels: (a) real Baidu captchas with different security features; (b) synthetic parameters; (c) generated synthetic captchas with security features; (d) generated synthetic captchas without security features. Our captcha synthesizer is trained using a set of real captchas (a). The parameter setting (b) defines the security feature space. The trained captcha synthesizer is used to produce synthetic captchas with (c) and without (d) the security features (i.e., noisy backgrounds and occluding lines in this example) included.

Fig. 5. Overview of our captcha generation model. Our generator model includes a captcha image synthesizer and a GAN-captcha-generator. The image synthesizer takes in a word of characters (random captcha words) and the security feature setting (parameter settings) to produce an initial captcha image. The GAN-captcha-generator then modifies the initial captcha image at the pixel level, aiming to make the resulting captchas similar to the ones of the target scheme. Once the training process is completed, the captcha generator model can be used to automatically generate captcha images based on any given word of characters.

As we have briefly described in Section 3, our GAN-based captcha synthesizer consists of a captcha generation model and a discriminator. Figure 3 illustrates the process of training a captcha synthesizer. The training process is largely automatic except that it needs 500 manually labeled real captcha images of the target scheme and a set of user-defined security features. The security feature definition is given by setting a set of pre-defined parameters. As an example, Figure 4 lists all pre-defined parameters of the Baidu captcha scheme. For this example, the waving feature is turned off as it is not used by the Baidu scheme. Note that these parameters can be easily extended and adjusted to target other captcha schemes.

Captcha generation model. Figure 5 shows that our captcha generation model is comprised of a captcha image synthesizer and a GAN captcha generator. The image synthesizer automatically generates captcha images for a given parameter setting and a sequence of characters (i.e., a word), while the GAN captcha generator modifies the synthetic captcha at the pixel level. The image synthesizer takes in a security feature configuration and tries to find a set of parameter values so that the synthetic captchas are as similar as possible to the ones from the target captcha scheme. We use the grid search method presented in [ ] to find the optimal parameters for a given captcha scheme. Like the image synthesizer, the GAN captcha generator learns how to modify the generated images at the pixel level so that the resulting captcha contains security features that are similar to the real ones of the target scheme. The similarity is measured by the ratio of synthetic captchas that cannot be distinguished from the real ones by the discriminator. In other words, the more synthetic captchas that can "fool" the discriminator, the higher the quality of the generated synthetic captchas. We also use the similarity score to update the parameter values of the captcha image synthesizer during the grid search process. Specifically, if the similarity score is above 0.65, the parameter values will be reduced according to a given attenuation coefficient, or vice versa. Note that once the captcha generation model is learned, it can automatically generate a synthetic captcha image based on any given characters.
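The similarity-driven update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the multiplicative form of the update, and the attenuation value of 0.9 are assumptions; only the 0.65 threshold and the definition of the similarity score come from the text.

```python
def similarity_score(flagged_as_synthetic):
    """Ratio of synthetic captchas the discriminator did NOT flag as synthetic.

    `flagged_as_synthetic` holds one boolean per synthetic captcha:
    True means the discriminator recognized it as synthetic.
    """
    if not flagged_as_synthetic:
        raise ValueError("need at least one synthetic captcha")
    fooled = sum(1 for flagged in flagged_as_synthetic if not flagged)
    return fooled / len(flagged_as_synthetic)


def adjust_parameter(value, score, threshold=0.65, attenuation=0.9):
    """Hypothetical update rule for one synthesizer parameter.

    Above the threshold the value is reduced by the attenuation
    coefficient; otherwise it is increased by the same factor.
    The multiplicative form and attenuation=0.9 are assumptions.
    """
    if score > threshold:
        return value * attenuation
    return value / attenuation


# Three of four synthetic captchas fool the discriminator.
score = similarity_score([True, False, False, False])
new_value = adjust_parameter(10.0, score)
```

In a full grid search this rule would be applied to each tunable parameter of the image synthesizer between batches.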

Captcha discriminator. Our discriminator model is also a CNN, defined in [ ]. The last layer of the CNN gives the probability of an input captcha being a synthetic one. We use batches of captcha images to train the discriminator, where each mini-batch consists of randomly sampled synthetic captchas, $x_i$, and real captchas, $y_j$, and the target labels are 1 for every $x_i$ and 0 for every $y_j$. The discriminator network updates its parameters by minimizing the following loss function:

$$L_D = -\sum_{i} \log D(x_i) - \sum_{j} \log\left(1 - D(y_j)\right) \tag{1}$$

where $D(\cdot)$ is the probability of the input being a synthetic captcha, and $1 - D(\cdot)$ is that of a real one. In this work, we use the Jensen-Shannon divergence [ ] to evaluate the difference of the distribution between the synthetic and real captcha images when training the discriminator. We have also considered the Wasserstein distance [ ] during our initial experiment but found that the Jensen-Shannon divergence works better in our problem setting. Specifically, we found that the Jensen-Shannon divergence metric can be used to better distinguish between real and synthetic captchas that are visually similar. This capability helps us to better optimize the generation parameters to improve the performance of the captcha generation model.
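As a small numerical illustration of the discriminator loss in Eq. 1, the sketch below evaluates it on plain lists of discriminator outputs. The function name and the `eps` clamp (added only to avoid taking the logarithm of zero) are assumptions and are not part of the paper.

```python
import math


def discriminator_loss(d_synthetic, d_real, eps=1e-12):
    """Eq. 1: -sum_i log D(x_i) - sum_j log(1 - D(y_j)).

    d_synthetic: D(x_i) for each synthetic captcha x_i (target label 1).
    d_real:      D(y_j) for each real captcha y_j (target label 0).
    D(.) is the predicted probability that the input is synthetic.
    """
    loss = 0.0
    for p in d_synthetic:
        loss -= math.log(max(p, eps))
    for p in d_real:
        loss -= math.log(max(1.0 - p, eps))
    return loss


# An undecided discriminator (0.5 everywhere) pays log(2) per sample.
undecided = discriminator_loss([0.5], [0.5])
# A confident, correct discriminator has a lower loss.
confident = discriminator_loss([0.9], [0.1])
```

The loss shrinks toward zero as the discriminator assigns probabilities near 1 to synthetic inputs and near 0 to real ones.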

Training. We use minibatch stochastic gradient descent (SGD) and the Adam solver [ ] with a learning rate of 0.0002 to train our captcha synthesizer. The objective of our captcha synthesizer can be expressed as:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x, y \sim p_{data}(x, y)}\left[\log D(x, y)\right] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}\left[\log\left(1 - D(x, G(x, z))\right)\right] \tag{2}$$

where $x$ and $y$ are a synthetic and a real captcha respectively, and $z$ is the noise.

Our overall training objective follows the general GAN approach [ ], using the L1 norm with the regularization term $\lambda$ set to 0.0001. The training objective is defined as:

$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G) \tag{3}$$

where the generator, $G$, tries to minimize the difference between the generated captchas and the real ones, while the discriminator, $D$, seeks to maximize it.

Here, the L1 loss function is defined as:

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x, y \sim p_{data}(x, y),\, z \sim p_z(z)}\left[\lVert y - G(x, z) \rVert_1\right] \tag{4}$$

During training, when updating the parameters of the synthesizer, we fix the parameters of the discriminator; and when updating the discriminator, we fix the parameters of the synthesizer. Training terminates when the discriminator can correctly identify fewer than 5% of the synthetic captchas. Once the synthesizer is trained, it can be used to quickly generate synthetic captchas. In our case, it takes less than one hour to generate a million captcha images.
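To make the generator side of the objective in Eqs. 3 and 4 concrete, the sketch below estimates it on a small batch using plain Python lists. It is an illustrative approximation under stated assumptions: images are flattened lists of pixel values, expectations are replaced by batch means, and the function names are hypothetical. Only the value 0.0001 for the regularization term comes from the text.

```python
import math


def l1_distance(target, generated):
    """||y - G(x, z)||_1 for two flattened images of equal length."""
    return sum(abs(t - g) for t, g in zip(target, generated))


def generator_objective(d_on_generated, targets, generated, lam=0.0001, eps=1e-12):
    """Batch estimate of the generator-dependent part of Eq. 3.

    d_on_generated: D(x, G(x, z)) for each generated image.
    targets:        the corresponding real images y.
    generated:      the generated images G(x, z).
    Returns mean log(1 - D(x, G(x, z))) + lam * mean L1 distance,
    which the generator tries to minimize.
    """
    n = len(generated)
    adversarial = sum(math.log(max(1.0 - d, eps)) for d in d_on_generated) / n
    l1 = sum(l1_distance(y, g) for y, g in zip(targets, generated)) / n
    return adversarial + lam * l1


# One generated two-pixel "image" that differs from its target by 1.0 in total.
value = generator_objective([0.5], [[1.0, 0.0]], [[0.0, 0.0]], lam=0.1)
```

The first term of Eq. 2, which depends only on the discriminator, is omitted here because it is constant with respect to the generator.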

Working example. We use the Baidu captcha scheme as a working example to illustrate the process for training a captcha synthesizer. The training process consists of multiple steps. In the initial step, we provide some (500 in this work) real captchas for the GAN learning engine. We also give the initial parameter values for the captcha image synthesizer. Similarly, the GAN captcha generator is initialized with random weights. During each iteration of the GAN training process, the captcha generation model (that consists of the captcha image synthesizer and the GAN captcha generator) produces a batch of synthetic captchas which are examined by the captcha discriminator. If the discriminator can successfully distinguish a large number of synthetic captchas from the real ones, the grid search method is employed to adjust the parameter values for synthesizing another batch of captchas. This iterative training process continues until the discriminator can distinguish less than 5% of the synthetic captchas from the real ones (see Section 6.5). When the process is terminated, the learning engine will output the optimal parameter values to be used by the captcha image synthesizer and the GAN captcha generator for synthesizing captcha images with security features. To generate captchas without security features, we simply turn off the feature option of the captcha image synthesizer. For example, Figure 4 (a) shows real Baidu captchas, and (c) and (d) in Figure 4 are the synthetic captchas with and without background security features produced by our captcha generation model. As can be seen from the figure, the security features of the synthetic captchas are visually similar to the real captchas.

Fig. 6. The training process of our GAN-based pre-processing model. Panels: (a) pre-training; (b) generator training; (c) discriminator training. The generator tries to remove as many noisy backgrounds and occluding lines as possible from the input captchas, while the discriminator tries to identify which of the input clean captchas are produced by the generator. All the captchas used in the training are generated by our captcha generation model.

4.2 Captcha Preprocessing

Modern captcha schemes often integrate advanced security features like a noisy background (Figure 1a, b, and c) and distorted hollow fonts (Figure 1d, e, and f). These features make prior pre-processing methods like [ ] ineffective (see Section 6.4). In our work, we build a GAN-based preprocessing model to remove these security features. Like the synthesizer, we train a preprocessing model for each captcha scheme. In our initial experiment, we also tried to build a general preprocessing model across different captcha schemes. However, we found that a scheme-specific model performs better. Note that we use only synthetic captchas to train the preprocessing model.

Specifically, we adopt the Pix2Pix image-to-image translation framework [ ], which was originally developed to transform an image from one style to another. In our case, the images to be translated are captcha images with background noise such as the Baidu captcha shown in Figure 1b or different font styles such as the Microsoft captcha shown in Figure 1d. Note that our model removes multiple security features (e.g., Figure 4b) at once.

Fig. 7. Our CNN-based captcha solver. Panels: (a) train the base solver on synthetic captchas and their labels; (b) train the fine-tuned solver on target captchas and their labels, with the early layers reused and the later layers retrained. The network consists of convolutional, pooling and fully connected layers. We first use synthetic captchas to train the base solver (a), which is then refined using a small number (500 in this work) of real captchas (b).

Our GAN-based preprocessing model also consists of a generator and a discriminator. Figure 6 depicts the training process. The generator works at the pixel level and tries to amend some pixels of the input captcha images (e.g., removing noise from the background, as shown in Figure 6b). By contrast, the discriminator tries to distinguish the preprocessed captchas from the clean captchas that are produced by the captcha generation model described in Section 4.1.

Training. Before adversarial training, we first pre-train an initial generator and discriminator using some synthetic captchas (Figure 6a). The captchas used in the pre-training process are organized as pairs where each pair contains (1) a synthetic captcha image with the target security features and (2) a corresponding image without these security features. Once the pre-training process is finished, we continue to train them under the generative adversarial framework. The training process is similar to how we train our captcha synthesizer (Section 4.1). Over time, the generator would become better at removing security features, and the discriminator would become better at recognizing security features (even when the changes are small). Training terminates when the discriminator can distinguish fewer than 5% of the preprocessed images from their clean counterparts (Figure 6c). After that, we use the trained generator to preprocess unseen captcha images of the target scheme.

4.3 Build and Fine-tune the Solver

To build a captcha solver, we follow a two-step approach. We first train a base solver from synthetic captchas. We then fine-tune the base solver using the same set of real captchas used to build the captcha synthesizer.

Network structure of our solver. Our captcha solver is built upon a classical CNN called LeNet-5 [ ], and it tries to identify the characters of the preprocessed captchas. Unlike LeNet-5, which was initially designed to recognize single characters, we introduce some additional layers (2 convolutional and 3 pooling layers) to extend its capability to recognize multiple characters. Figure 7a shows the structure of our solver, which has five convolutional layers and five pooling layers followed by two fully-connected layers. Each of the convolutional layers is followed by a pooling layer. We use a 3 × 3 filter for the convolutional layer and a max-pooling filter for the pooling layer. We use the default parameters of LeNet-5 for the rest of the network structures.
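To give a feel for how five convolution and pooling stages shrink a captcha image, the sketch below traces the spatial size of the feature map through the network. Only the 3 × 3 filter, max pooling, and the five-stage layout come from the text. The stride of 1, padding of 1, 2 × 2 pooling window with stride 2, and the 64 × 192 input size are assumptions chosen for illustration.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output length of one spatial dimension after max pooling."""
    return (size - window) // stride + 1


def feature_map_size(height, width, stages=5):
    """Spatial size after `stages` rounds of convolution followed by pooling."""
    for _ in range(stages):
        height, width = conv_out(height), conv_out(width)
        height, width = pool_out(height), pool_out(width)
    return height, width


# Hypothetical 64 x 192 input captcha (the paper does not state the input size here).
final_size = feature_map_size(64, 192)
```

Under these assumptions each stage halves both dimensions, so the fully-connected layers receive a much smaller feature map than the input image.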

Note that we have also considered other influential CNN structures including ResNet [ ], Inception [ ] and VGG [ ]. We found that there is little difference in solving text captchas among these models. We choose LeNet-5 due to its simplicity, which gives the quickest inference (i.e., prediction) time and requires the fewest training samples for fine-tuning the base solver.

Training the base solver. We train a base solver for a target captcha scheme. If the number of characters in a captcha image is not fixed for a scheme, we train a base solver for each possible number of characters. We use 200,000 synthetic captchas generated by our captcha generation model to train the base solver. Each training sample consists of a clean captcha (produced by the preprocessing model) and an integer vector that stores the character IDs of the captcha. Note that we assign a unique ID to each candidate character of the target captcha scheme. We use a Bayesian-based parameter tuner [ ] to automatically choose the hyperparameters for training the base solver. Training a base solver takes around five hours using four NVIDIA P40 GPUs on a cloud server (see Section 5.3). The trained base solver can then be applied to any unseen captcha image of the target scheme.
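The label format described above (a unique ID per candidate character, and an integer vector per captcha) can be sketched in a few lines of Python. This is only an illustration: the helper names `build_charset` and `encode_label` are hypothetical, the ID ordering is an arbitrary choice, and the example assumes an upper-case alphanumeric scheme that excludes the characters listed for Alipay in Table 1.

```python
import string

def build_charset(excluded, alphabet=string.digits + string.ascii_uppercase):
    """Map each candidate character to a unique integer ID (in alphabet order)."""
    candidates = [c for c in alphabet if c not in excluded]
    return {c: i for i, c in enumerate(candidates)}

def encode_label(text, char_to_id):
    """Turn a captcha string into the integer vector of its character IDs."""
    return [char_to_id[c] for c in text]

# Assumption for illustration: digits and upper-case letters, minus the
# characters that Table 1 lists as excluded for the Alipay scheme.
char_to_id = build_charset(excluded="01ILO")
label = encode_label("B7JK", char_to_id)
```

Excluded characters never receive an ID, so the output layer of a solver trained on such labels only needs one class per candidate character.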

Refining the base solver. To fine-tune the base solver, we apply transfer learning [ ] to update the later layers (i.e., those that are closer to the output layer) of the base solver, using the 500 labeled real captchas that were previously used to train the synthesizer. The idea of transfer learning, in a nutshell, is that in neural network classification, information learned at the early layers of neural networks (i.e., closer to the input layer) will be useful for multiple classification tasks. The later the network layers are, the more specialized the layers become [ ]. We exploit this property to calibrate the base solver to minimize any bias and over-fitting that may arise from the synthetic training data.

Figure 7b illustrates the process of applying transfer learning to refine the base solver. Transfer learning in our context is as simple as keeping the weights of the early layers and then updating the parameters of the later layers by applying the standard training process using the real captchas. This process takes less than 5 minutes on our training platform.
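The "keep the early layers, retrain the later ones" idea can be illustrated with a framework-free sketch. The `Layer` class, the layer names and the choice of `conv3` as the first retrained layer are all hypothetical; the sketch only shows which layers would be marked as trainable, not the training itself.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    trainable: bool = True

def freeze_early_layers(layers, first_trainable):
    """Freeze every layer before `first_trainable`; leave the rest trainable."""
    names = [layer.name for layer in layers]
    start = names.index(first_trainable)  # raises ValueError if the name is unknown
    for i, layer in enumerate(layers):
        layer.trainable = i >= start
    return layers

# Hypothetical layer names for a solver with five conv/pool pairs and two
# fully-connected layers, as in the structure described for Figure 7a.
solver = [Layer(n) for n in
          ["conv1", "pool1", "conv2", "pool2", "conv3", "pool3",
           "conv4", "pool4", "conv5", "pool5", "fc1", "fc2"]]
freeze_early_layers(solver, first_trainable="conv3")
trainable = [layer.name for layer in solver if layer.trainable]
```

In Keras, the same effect is obtained by setting a layer's `trainable` attribute to `False` before compiling the model; the frozen weights are then left untouched during the standard training loop.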

5 EXPERIMENTAL SETUP

5.1 Captcha Schemes

Our evaluation targets 11 current text captcha schemes used by 32 of the top-50 popular websites ranked by alexa.com³. We note that some of the websites use the same captcha scheme, e.g., YouTube uses the Google scheme, and Live, Office and Bing use the Microsoft scheme. The websites we examined cover a wide range of domains including e-commerce, social networks, search, and information portals. Table 1 gives some examples of the captcha schemes tested in this work. We note that, to improve usability, many captcha schemes exclude specific characters that are likely to cause confusion after character distortion. Examples of such characters include ‘o’ and ‘0’, ‘1’ and ‘l’, etc. (see Table 1).

In addition to the 11 current schemes, we also extend our evaluation to 22 other captcha schemes (see Table 5) used in prior studies to provide a fair comparison with previous attacks. It is worth mentioning that while we collected the captchas from the official websites, many of the captcha schemes we tested are also used by third-party websites and applications as a security mechanism.

³ We have refreshed the captcha dataset used in our previous work [74] when conducting this evaluation.


Table 1. Text-based captcha schemes tested in our experiments.

| Scheme | Website(s) | Anti-segmentation features | Anti-recognition features | Excluded characters |
|---|---|---|---|---|
| Google | google.{com, co.in, co.jp, co.uk, ru, com.br, fr, com.hk, it, ca, es, com.mx}, youtube.com | Overlapping characters, English letters | Varied font sizes and color, rotation, distortion and waving | – |
| Microsoft | {live, bing, microsoft, office, linkedin}.com | Overlapping characters, solid background | Different font styles, varied font sizes, rotation, waving | 0, 1, 5, D, G, I, O, U |
| Alipay | {alipay, tmall, taobao, login.tmall, alipayexpress}.com | English letters and arabic numerals, overlapping characters | Rotation and distortion | 0, 1, I, L, O |
| eBay | ebay.com | Overlapping characters, only arabic numerals | Rotation, distortion and waving | – |
| Wikipedia | wikipedia.org | Overlapping characters, English letters | Rotation, distortion and waving | – |
| Baidu | {baidu, qq}.com | Occluding lines, character overlapping, only English letters | Varied font size, color, rotation, distortion and waving | |
| Sina | sina.cn | English letters and arabic numerals, overlapping characters | Rotation, distortion and waving | 1, 9, 0, D, I, J, L, O, T, i, j, l, o, t, g, r |
| Weibo | weibo.cn | English letters and arabic numerals, overlapping characters, occluding lines | Rotation and distortion | 0, 1, 5, D, G, I, Q, U |
| Sohu | sohu.com | Complex background, occluding lines, and overlapping | Varied font size, color and rotation | 0, 1, i, l, o, z |
| Qihu360 | 360.cn | English letters and arabic numerals, overlapping characters | Varied font sizes, rotation and distortion | I, L, O, T, i, l, o, t, q |
| JD | jd.com | English letters and arabic numerals, overlapping characters | Rotation and distortion | 0, 1, 2, 7, 9, D, G, I, J, L, O, P, Q, Z |

5.2 Collecting and Synthesizing Captchas

We use two sets of captchas in evaluation: one for training and the other for testing. Most of the training data are synthetic captchas generated by our captcha generation model. The real captchas are collected from the target website and are used for training our GAN-based synthesizer, fine-tuning the solver, and testing.

Synthesizing training captchas. We first initialize the security feature parameters as described in Section 4.1 and then use the initial parameters to generate the first batch of synthetic captchas, which are then used together with 500 real captchas to train our synthesizer. After we have trained the synthesizer, we use it to generate synthetic samples to learn the preprocessing model and the base solver. Specifically, we use 20,000 and 200,000 synthetic captchas to train the preprocessing model and the base solver respectively.

Table 2. The overall success rate and solver running time.

| Scheme | Base solver success rate | Fine-tuned solver success rate | Running time per captcha (ms) |
|---|---|---|---|
| Sohu | 83% | 92% | 43.78 |
| eBay | 52% | 86.6% | 4.22 |
| JD | 60% | 86% | 43.18 |
| Wikipedia | 7% | 78% | 4.71 |
| Microsoft | 36.6% | 69.6% | 46.06 |
| Alipay | 23% | 61% | 3.75 |
| Qihu 360 | 48.6% | 56% | 41.03 |
| Sina | 40.6% | 52.6% | 42.81 |
| Weibo | 4.7% | 44% | 3.41 |
| Baidu | 6% | 34% | 41.57 |
| Google | 0% | 3% | 4.02 |

Collecting testing captchas. The real captchas are automatically collected using a web crawler written in Python. Each collected captcha is manually labeled by three paid participants (nine participants in total) recruited from our institution. We use only captchas where a consensus has been reached by all three annotators. In total, we have used 1,500 real captchas for each target scheme. We randomly divided the collected captchas into two sets: one set of 500 captchas for training our synthesizer and the final solver, and the other set of 1,000 captchas for testing our solver. It takes up to 30 minutes (less than 10 minutes for most schemes) to collect 500 captchas and less than 2 hours for one user to label them. This suggests that the effort and cost for launching our attack on a particular captcha scheme is low.
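The labeling and splitting procedure of this subsection can be sketched as follows. The function names and the toy annotations are hypothetical; only the rules (keep a captcha when all three annotators agree, then randomly split 1,500 captchas into 500 for training and 1,000 for testing) come from the text.

```python
import random

def consensus_label(labels):
    """Return the label if all annotators agree, otherwise None."""
    return labels[0] if len(set(labels)) == 1 else None

def split_dataset(samples, n_train=500, n_test=1000, seed=0):
    """Randomly split labeled samples into disjoint training and test sets."""
    if len(samples) < n_train + n_test:
        raise ValueError("not enough labeled captchas")
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Hypothetical annotations: image id -> labels from three annotators.
annotations = {i: ["abcd", "abcd", "abcd"] for i in range(1500)}
annotations[1500] = ["abcd", "abcd", "abed"]  # no consensus, so it is dropped

labeled = [(img, consensus_label(v)) for img, v in annotations.items()
           if consensus_label(v) is not None]
train_set, test_set = split_dataset(labeled)
```

Fixing the random seed keeps the split reproducible, which matters when the same 500 training captchas are reused for both the synthesizer and the fine-tuned solver.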

5.3 Implementation and Hardware Platforms

Our prototype system⁴ is implemented using Python. The preprocessing model is built upon the Pix2Pix framework [ ], implemented using TensorFlow v1.12, and the captcha solver is coded using Keras v2.1. We use two different hardware platforms. For training, we use a cloud server with a 2.4GHz Intel Xeon CPU, four NVIDIA Tesla P40 GPUs and 256GB of RAM, running the CentOS 7 operating system with Linux kernel 3.10. The trained models are then run and tested on a desktop PC with a 3.2GHz Intel Xeon CPU, an NVIDIA Titan GPU and 64GB of RAM, running the Ubuntu 16.04 operating system with Linux kernel 4.10. All trained models run on the Titan GPU for inference.

6 EXPERIMENTAL RESULTS

In this section, we first present the overall success rate of our approach for solving 11 current captcha schemes. We then compare our approach against prior attacks on another 22 schemes. Next, we analyze the working mechanism of our approach before discussing the impact of security features on user experience and the generalization ability of our approach.

6.1 Evaluation on Current Captcha Schemes

Table 2 presents the success rate and the average running time in solving a captcha image for 11 current schemes. There is no difference in solving time between the base and the fine-tuned solvers because they use the same network structure. For each captcha scheme, we report the average running time across 1,000 captchas. We observe little variation in the running time, less than 0.5% across test runs. Note that in this evaluation, all captcha images of a scheme contain the same number of characters. In Section 6.3, we show how our approach can be extended to target a variable number of characters.

6.1.1 Overall success rate. Our base solver, built from synthetic data, is able to solve most of the captcha schemes with a success rate of over 20%. This demonstrates the capability of CNN models in performing image recognition. However, it gives a low success rate for some of the schemes such as Weibo (4.7%) and Google (0%). The fine-tuned solver, refined using transfer learning, significantly boosts the performance of the base solver. In particular, it improves the success rate for Wikipedia from 7% to 78%, Weibo from 4.7% to 44%, Alipay from 23% to 61% and Microsoft from 36.6% to 69.6%. This result shows that transfer learning in combination with captcha synthesis can reduce the data collection efforts for building an effective text captcha solver.

⁴ Code and data are available at: https://goo.gl/92VxXC.

Table 3. Example text-based captchas that are incorrectly labeled by our fine-tuned solver.

| Scheme | Ground truth | Solver output | Human attempts |
|---|---|---|---|
| Sohu | d4sk | d4sh | 1.6 |
| eBay | 934912 | 994912 | 1.8 |
| JD | BHER | BFER | 1.5 |
| Wikipedia | druidsemi | druidseml | 1.5 |
| Microsoft | XK6NK | XK6VK | 1.2 |
| Alipay | B7JK | B7YK | 1.6 |
| Qihu 360 | s34Ea | s3VFa | 1.8 |
| Sina | nG3uu | nG3uv | 1.4 |
| Weibo | 4TXB | 4TX8 | 1.4 |
| Baidu | WFIH | WFEH | 1.8 |
| Google | irgandoca | igiruloca | >10 |

Fig. 8. Examples of real Google captchas (a), with different fonts and strong security features, and the synthetic versions (b) produced by our captcha generator.

The refined solver also improves the success rate for the Google captcha from 0% to 3%. This relatively low success rate is because of the strong security features like distorted, overlapping, waving characters and dynamic font styles employed by the scheme. These features make it difficult for our captcha generation model to generate high-quality synthetic data. Figure 8 shows that our synthetic captchas are not sufficiently similar to the real captchas (especially for the font styles). We also observe that some security features like overlapping, rotated, distorted characters and dynamic font styles can provide stronger protection under our attack than features like noisy background and occluding lines. Nevertheless, 3% is still above the 1% threshold for which a captcha is considered to be ineffective [ ]. We stress that no prior attack before ours can successfully crack the current Google captcha scheme under this criterion.


Table 4. How often a common English prefix and suffix appears in the 5,000 captcha images from Google and Wikipedia.

| Prefix | Google | Wikipedia | Suffix | Google | Wikipedia |
|---|---|---|---|---|---|
| dis- | 76 | 21 | -ing | 337 | 95 |
| pre- | 49 | 10 | -est | 166 | 105 |
| mis- | 44 | 9 | -ion | 129 | 26 |
| anti- | 15 | 3 | -ness | 77 | 6 |
| semi- | 7 | 2 | -tion | 63 | 12 |
| fore- | 3 | 2 | -less | 28 | 5 |
| inter- | 3 | 1 | -ation | 21 | 4 |
| under- | 1 | 0 | -ative | 8 | 2 |
| trans- | 0 | 1 | -itive | 3 | 0 |

6.1.2 Incorrectly labeled captchas. Table 3 gives some example captchas that are incorrectly labeled by our fine-tuned solver. For most of these captchas, our solver misrecognizes only one character, and the misidentified character is similar to the ground truth. For example, for the eBay captcha shown in Table 3, our solver mislabels the character "3" as "9" due to character overlapping. For the Google scheme, our solver often fails to label several characters in the middle due to excessive character distortion and overlapping. However, our annotators also struggled to recognize the characters for those captchas. To quantify the difficulty, we asked ten annotators to label those captchas and counted the number of attempts required to succeed. The last column of Table 3 gives the averaged number of attempts required by our annotators to successfully recognize images of a captcha scheme. The results suggest that our annotators found it difficult to recognize most of the captcha schemes in the first attempt. In particular, due to the strong distortion and occluding lines of the Google captcha scheme, more than half of our annotators failed to recognize a Google captcha image within ten attempts.

6.1.3 Exploiting captcha patterns to improve the success rate. Some captcha schemes like Google and Wikipedia have more than eight characters in a single captcha image. We call these long-character captcha schemes. We notice that the characters of a long-character captcha image tend to follow some patterns, where some English word prefixes or suffixes appear frequently. We think this might be a feature for helping a human user to better recognize the characters. To verify our hypothesis, we collected and manually labeled 5,000 captcha images in addition to the 1,000 testing captchas used for the Google and the Wikipedia schemes. We then count how often a commonly used English word prefix or suffix appears in the 5,000 captchas for each of the two schemes, using the list of prefixes and suffixes suggested in [45].

Table 4 lists some of the frequently appearing prefixes and suffixes, containing at least three characters. We see that a three-character prefix or suffix appears at least 9 times (up to 76) in the 5,000 captcha images of a scheme. This is greater than the averaged frequency of 1.99 that would be expected if those characters were evenly and randomly distributed across the 26 English alphabet letters over the 5,000 captchas of a scheme. We also observe a similar pattern for prefixes or suffixes with four or more characters, although they appear less frequently than their three-character counterparts.
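The text does not spell out how the 1.99 baseline is derived. One calculation that reproduces it, under the assumption (ours, not stated in the paper) that each captcha has 9 uniformly random letters and that a three-letter pattern may start at any of the 7 possible offsets, is sketched below.

```python
def expected_occurrences(num_captchas, captcha_len, pattern_len, alphabet_size=26):
    """Expected count of one fixed pattern in uniformly random strings."""
    positions = captcha_len - pattern_len + 1      # possible start offsets
    p_match = 1.0 / alphabet_size ** pattern_len   # chance of a match at one offset
    return num_captchas * positions * p_match

# Assumption: 9-character captchas (the middle of the 8-10 range mentioned in
# Section 6.3). Other lengths give a slightly different baseline.
baseline = expected_occurrences(num_captchas=5000, captcha_len=9, pattern_len=3)
longer = expected_occurrences(num_captchas=5000, captcha_len=9, pattern_len=4)
```

Under these assumptions the baseline evaluates to 5000 × 7 / 26³, and the corresponding baseline for four-letter patterns is far smaller, which is why even single-digit counts for longer affixes in Table 4 stand out.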

Fig. 9. Examples of the captcha schemes (left) tested in prior work, and the synthetic versions (right) generated by our captcha generation model: (a) eBay, (b) reCaptcha 2013, (c) Microsoft, (d) QQ, (e) Amazon, (f) Yahoo!. Our generation model is highly effective in synthesizing captcha images.

Heuristics. We wondered if one can exploit this observation to improve the success rate of a captcha solver. In other words, can we build a context-sensitive captcha solver to correct some of the characters after performing image recognition? To this end, we develop a heuristic to post-process the characters given by the fine-tuned solver to target the English word prefixes and suffixes listed in Table 4. Specifically, for a solved captcha word, we first identify whether the word contains a candidate pattern. A candidate pattern is a sequence of characters that is similar to a word prefix or suffix, but with a few characters that differ from the standard word prefix or suffix. For example, “trani” is a candidate pattern for the word prefix “trans” as the two differ only in the last character, ‘i’. A solved captcha word can also contain multiple candidate patterns. In this case, we use the prefixes and suffixes listed in Table 4 to search for the possible candidate patterns. Using this strategy, our heuristic would correct the candidate pattern “seml” to “semi”. Doing so gives a correct prediction for the Wikipedia captcha shown in Table 3. Applying this strategy to the 1,000 test captcha images for Google and Wikipedia, we improve the success rate for the Google scheme from 3% to 5.1% and the Wikipedia scheme from 78% to 79.8%.
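A minimal sketch of such a post-processing heuristic is given below. It is not the authors' implementation: the matching rule (a sliding window that differs from a Table 4 affix in exactly one position, with longer affixes tried first) and the function names are assumptions made for illustration.

```python
# Prefixes and suffixes from Table 4, without the hyphens.
AFFIXES = ["dis", "pre", "mis", "anti", "semi", "fore", "inter", "under", "trans",
           "ing", "est", "ion", "ness", "tion", "less", "ation", "ative", "itive"]

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def correct_word(word, affixes=AFFIXES, max_diff=1):
    """Replace the first window that is within `max_diff` of a known affix."""
    # Try longer affixes first: a near-match on five letters is stronger
    # evidence than a near-match on three.
    for affix in sorted(affixes, key=len, reverse=True):
        n = len(affix)
        for start in range(len(word) - n + 1):
            window = word[start:start + n]
            if 0 < hamming(window, affix) <= max_diff:
                return word[:start] + affix + word[start + n:]
    return word

fixed = correct_word("druidseml")   # solver output for the Wikipedia row of Table 3
```

A greedy rule like this can also damage a correct answer (any three-letter window that happens to be one character away from an affix would be rewritten), so a practical version would need an additional guard, for example only accepting a correction when the solver's confidence for the replaced character is low.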

6.1.4 Training and deployment overhead. It took us around 2 days to train a captcha synthesizer and the preprocessing model together on our training platform, and less than 50 milliseconds to solve a captcha on our evaluation platform using a desktop GPU. For captcha schemes with a confusing background or occluding lines (e.g., Baidu and Sina captchas in Table 2), our solver can take 10× longer than others to process a captcha image. This overhead comes from the preprocessing model. As we train a scheme-specific preprocessing model with different network structures, the stronger the security features are, the more complex the preprocessing will be (and hence the longer the running time). Nonetheless, our approach can solve all the testing schemes under the commonly used criterion [10] with a quick running time.

6.2 Comparison to Prior Attacks

We now compare our approach with four state-of-the-art methods [ ] on 24 distinct captcha schemes, including the eBay and Wikipedia schemes from Table 1 and 22 other schemes. To provide a fair comparison, we try to use captchas that prior methods were tested on. When possible, we use the same dataset or captchas from the original scheme on which the prior work was evaluated. For the obsolete captcha schemes (21 out of 24 schemes), we collected the test data from public datasets or generated it using captcha generation tools developed by independent researchers. Specifically, we use (1) public datasets of previous captcha schemes, (2) online captcha generators, such as captchas.net, which was used by some of the previous captcha schemes, and (3) open source captcha generators used by prior work.

For each captcha scheme, we collected 1,500 samples, from which we use 500 for training and 1,000 for testing. Figure 9 gives some examples of the real captchas and the ones produced by our generation model. The figure suggests that our generation model can produce captchas that are visually similar to real examples from the target scheme.

Table 5 compares our fine-tuned solver to previous attacks. Our approach outperforms all comparative schemes by delivering a significantly higher success rate. For many of the testing schemes, our approach boosts the success rate by 40%. It can successfully solve all the captchas of Blizzard, Megaupload and Authorize used in [ ]. Our approach achieves a success rate of 87.4% and 90% for reCAPTCHA 2011 and 2013 respectively. This scheme was previously deemed to be strong where the human accuracy is 87.4% [ ]. That is to say, our solver matches the capability of humans in solving reCAPTCHA. To achieve a comparable accuracy for reCAPTCHA, a CNN-based captcha solver [ ] would require 2.3 million unique real captcha images [ ], but our approach needs only 500. We note that unlike all the competing approaches, which require manually tuning a character segmentation method, we forgo this process. Thus, our approach requires less expert involvement but gives better performance.

Table 5. Comparing our approach against four prior attacks [ ] on 24 captcha schemes where prior methods were tested on. Here B-11 and B-14 represent the method of [10] and [11] respectively.

| Prior attack | Captcha scheme | Prior success rate | Our success rate |
|---|---|---|---|
| B-11 [10] | Megaupload | 93% | 100% |
| B-11 [10] | Blizzard | 70% | 100% |
| B-11 [10] | Authorize | 66% | 100% |
| B-11 [10] | Captcha.net | 73% | 99.6% |
| B-11 [10] | NIH | 72% | 99% |
| B-11 [10] | Reddit | 42% | 98% |
| B-11 [10] | Digg | 20% | 95% |
| B-11 [10] | eBay | 43% | 86.6% |
| B-11 [10] | Slashdot | 35% | 86.4% |
| B-11 [10] | Wikipedia | 25% | 78% |
| Gao's | Baidu (2016) | 46.6% | 97.5% |
| Gao's | QQ | 56% | 94% |
| Gao's | Taobao | 23.4% | 90.7% |
| Gao's | Sina | 9.4% | 90% |
| Gao's | reCAPTCHA (2011) | 77.2% | 87.4% |
| Gao's | eBay | 58.8% | 86.6% |
| Gao's | Amazon | 25.8% | 79% |
| Gao's | Wikipedia | 23.8% | 78% |
| Gao's | Microsoft | 16.2% | 72.1% |
| Gao's | Yahoo! (2016) | 5.2% | 63% |
| B-14 [11] | reCAPTCHA (2013) | 22.3% | 90% |
| B-14 [11] | Baidu (2013) | 55.2% | 89% |
| B-14 [11] | reCAPTCHA (2011) | 22.7% | 87.4% |
| B-14 [11] | eBay | 51.4% | 86.6% |
| B-14 [11] | Baidu (2011) | 38.7% | 83.1% |
| B-14 [11] | Wikipedia | 28.3% | 78% |
| B-14 [11] | Yahoo! (2014) | 5.3% | 75.1% |
| B-14 [11] | CNN | 51.1% | 51.6% |
| George's | PayPal | 57.1% | 92.4% |
| George's | reCAPTCHA (2011) | 66.6% | 87.4% |
| George's | Yahoo! (2016) | 57.4% | 63% |

Fig. 10. The success rate of our prediction model when targeting captchas with a variable number of characters. (Bar chart: prediction success rate for the Wikipedia and Google schemes, with bars for 8, 9 and 10 characters and the mean.)

6.3 Targeting Schemes with A Variable Number of Characters

One potential criticism of our approach described so far is that it only targets captcha schemes with a fixed number of characters. However, our approach can be extended to target schemes with a variable number of characters. One way of doing that is to have a model predict how many characters a preprocessed image may contain, and then use a captcha solver that is specifically built for that number of characters.
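The two-stage strategy described above amounts to a simple dispatch step. In the sketch below, `predict_length` and the entries of `solvers` are hypothetical stand-ins for the trained character-number predictor and the per-length solvers; they only exist to make the control flow concrete.

```python
def solve_variable_length(image, predict_length, solvers):
    """Pick the solver trained for the predicted character count and run it."""
    n_chars = predict_length(image)
    if n_chars not in solvers:
        raise KeyError(f"no solver trained for {n_chars} characters")
    return solvers[n_chars](image)

# Stand-ins for trained models (hypothetical): a real predictor and real
# solvers would be CNNs operating on the preprocessed image.
predict_length = lambda image: len(image["truth"])
solvers = {n: (lambda image: image["truth"]) for n in (8, 9, 10)}

answer = solve_variable_length({"truth": "druidsemi"}, predict_length, solvers)
```

Because a wrong length prediction routes the image to a solver with the wrong number of outputs, any error in the first stage carries straight through to the final answer.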

To test this strategy, we use a CNN to build a character number predictor. Our model consists of four convolutional layers, four pooling layers and a fully connected layer, with a max-pooling layer following each of the convolutional layers. The filter size in each convolutional layer is 5 × 5, and other parameters are the same as our base captcha solver.

We evaluated our predictor using Google and Wikipedia captchas, both of which use a variable number (8, 9 or 10) of characters. For each scheme, we use 100,000 synthetic captchas (around 33,333 captchas per character length) for training the predictor and 3,000 (1,000 per character length) real captchas for testing. Figure 10 shows the accuracy for predicting the number of characters in a captcha image. Our predictor gives an accuracy of 90.9% and 80.8% for the Wikipedia and Google schemes respectively.

When combining the predictor with our fine-tuned solver (but not using the context-aware heuristic described in Section 6.1.3), we see a slight drop in the accuracy. This is expected as our character-number predictor is not perfect. The combination gives a success rate of 70.9% and 2% for the Wikipedia and Google schemes respectively. The resulting success rates are still higher than the 1% threshold for which a captcha scheme is seen to be ineffective [10].
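As a sanity check on these figures, one can multiply the predictor accuracy by the fine-tuned solver success rate from Table 2. This assumes (our simplification, not a claim made in the paper) that a wrong length prediction always leads to a wrong answer and that the two stages succeed independently.

```python
def pipeline_success(length_accuracy, solver_success):
    """Success rate of predictor + solver if a wrong length always fails
    and the solver's success is independent of the length prediction."""
    return length_accuracy * solver_success

wikipedia = pipeline_success(0.909, 0.78)   # reported combined rate: 70.9%
google = pipeline_success(0.808, 0.03)      # reported combined rate: 2%
```

Under this simplification the products land close to the reported combined success rates, which is consistent with the explanation that the drop is caused by the imperfect character-number predictor.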

6.4 Preprocessing Security Features

Recall that the second step of our attacking pipeline is to remove the security features and standardize the font style of an input captcha. In this experiment, we compare our preprocessing model against prior preprocessing methods on removing noisy backgrounds [ ], and standardizing font styles [12, 17] and character gaps [18].

Fig. 11. For the input Baidu captchas (a), a filter-based method using a 2 × 1, 2 × 2 and 3 × 1 filter kernel respectively fails to remove security features (b) while our preprocessing model can (c).

Fig. 12. Comparing font style standardization between a state-of-the-art hollow captcha solver [ ] and our preprocessing model: (a) example hollow captchas from Sina and Microsoft schemes; (b) results given by Gao's approach. Our preprocessing model is able to fill the hollow parts more effectively.

Removing security features. The classical method used in prior attacks for preprocessing captchas is filtering [ ]. The idea is to apply a fixed-size window, or filter kernel, throughout the image to remove the occluding lines and noise while keeping the edges of the characters. As can be seen from Figure 11, finding the right filter kernel size is difficult. This is because the filter either fails to eliminate the background and occluding lines or it overdoes it by eroding the edges of the characters (Figure 11b). While filtering was effective for prior text-based captchas, the latest captcha schemes have introduced more sophisticated security features which make it no longer feasible. In contrast to filtering, our preprocessing model can successfully eliminate nearly all the background noise and occluding lines from the input image, leading to a much cleaner captcha image while keeping the character edges, as depicted in Figure 11c.
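The trade-off of fixed-window filtering can be seen on a toy example. The sketch below uses a 3 × 3 median filter as a representative fixed-size filter; the prior attacks may have used different kernels (Figure 11 mentions 2 × 1, 2 × 2 and 3 × 1), so this is an illustration of the general effect rather than a reproduction of any specific method.

```python
from statistics import median

def median_filter(img, k=3):
    """Apply a k x k median filter to a 2D list of pixel values (edges replicated)."""
    h, w, r = len(img), len(img[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
            out[y][x] = int(median(window))
    return out

# Toy binary image (1 = ink): a one-pixel-wide occluding line on row 1 and a
# three-pixel-thick character stroke on rows 4-6.
img = [[0] * 7,
       [1] * 7,
       [0] * 7,
       [0] * 7,
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0, 0, 1, 1, 1, 0, 0],
       [0] * 7]
filtered = median_filter(img)
```

In this toy image the thin line disappears, but the corners of the character stroke are eroded as well. With thicker occluding lines a larger window would be needed, which would erode the characters even more.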

Filling hollow characters. Figure 12 compares our preprocessing model against a state-of-the-art hollow captcha solver [ ]. The task in this experiment is to fill the hollow parts of the characters. Here, we apply both schemes to the testing hollow captchas from the Sina and Microsoft schemes. Figure 12a gives some of the examples from these two schemes, while Figures 12b and 12c present the corresponding results given by the hollow filling method in [ ] and our approach respectively. As can be seen from the diagrams, our preprocessing model is able to fill most of the hollow strokes, but the state-of-the-art method leaves some hollow strokes unfilled. Therefore, our approach is more effective in standardizing the font style. We also note that unlike prior attacks, which require manually designing and tuning an individual method to process each security feature, our approach automatically learns how to process all features in one go. Therefore, our approach requires less effort for implementing a holistic preprocessing model.

Fig. 13. Character segmentation produced by our preprocessing model for (a) Wikipedia, (b) Microsoft, (c) Sina and (d) Baidu. For each scheme, the left image is the input captcha, and the right image is the output of our preprocessing model.

Fig. 14. Using our character segmentation approach can help to improve the success rate of prior work [ ]. (Bar chart: success rate of our approach and Gao's approach on the Wikipedia, Sina, Microsoft and Baidu schemes; the data labels read 78.5%, 70.5%, 31%, 25% and 23.8%, 9.4%, 16.2%, 5%.)

ACM Transactions on Privacy and Security, Vol. 1, No. 1, Article 1. Publication date: January 2020.

Standardizing character gaps. Prior work has reported that the robustness of a text captcha scheme largely depends on the difficulty of character segmentation rather than character recognition [ ]. Many modern text captchas are designed to make it harder for a computer program to segment the characters. The examples given in Figure 13 show that our preprocessing model is effective in standardizing the gap between characters. To evaluate the effectiveness of our preprocessing model for character segmentation, we use the same network structure to train a model solely for character segmentation. We then use the preprocessing model to replace the native character segmentation model used in [ ], but keep the remaining parts unchanged. Figure 14 shows that our preprocessing alone can help to greatly improve the success rate of a previous solver.

6.5 Synthesizer Training Termination Criteria

Our captcha synthesizer is trained under the GAN framework, and training terminates when the discriminator fails to classify a certain ratio of synthetic captchas (Section 4.2). Figure 15 reports how the termination criterion affects the quality of the synthetic captchas. The x-axis shows the ratio (from 0.8 to 0.97) of synthetic captchas that are misclassified as real captchas by the discriminator when training terminates. The y-axis shows the success rate achieved by the fine-tuned solver for five current captcha schemes, where the base solver is trained on the resulting synthetic captchas using different termination criteria but the fine-tuned solver is trained on the same set of real captchas.


Fig. 15. How the synthesizer training termination criterion affects the solver performance. Training terminates when the discriminator fails to classify a certain ratio of synthetic captchas. (Plot: fine-tuned solver success rate versus the ratio of misclassified synthetic captchas, from 0.80 to 0.97, for the Microsoft, Wikipedia, eBay, Baidu and JD schemes.)

In general, the more synthetic captchas the discriminator fails on, the higher the quality of the generated synthetic captchas will be, which in turn leads to a more effective captcha solver. However, the increase in the success rate reaches a plateau at 0.95. Further increasing the similarity of the synthetic captchas to real ones does not improve the success rate due to overfitting. Based on this observation, we choose to terminate synthesizer training when the GAN discriminator can successfully distinguish less than 5% (i.e., fails on 95% or more) of the synthetic captchas. We found that this threshold works well for all captcha schemes tested in this work.
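The stopping rule can be written as a small check that is evaluated after each training round. The function name, the use of a held-out batch of synthetic captchas and the boolean input format are assumptions made for illustration; only the 0.95 threshold comes from the text.

```python
def should_stop(discriminator_says_real, fool_ratio=0.95):
    """Stop once at least `fool_ratio` of synthetic captchas are judged real.

    `discriminator_says_real` holds one boolean per synthetic captcha in a
    batch: True if the discriminator mistook it for a real captcha.
    """
    if not discriminator_says_real:
        return False
    fooled = sum(discriminator_says_real) / len(discriminator_says_real)
    return fooled >= fool_ratio

early = should_stop([True] * 80 + [False] * 20)   # 80% fooled: keep training
late = should_stop([True] * 96 + [False] * 4)     # 96% fooled: stop
```

Exposing the threshold as a parameter makes it easy to repeat the experiment of Figure 15 with other ratios between 0.80 and 0.97.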

6.6 Transfer Learning

Recall that we only use 500 real captchas to refine the base solver by employing transfer learning (Section 4.3). Our strategy for transfer learning is to only retrain some of the later neural network layers of the base solver (see Figure 7). In this experiment, we investigate how the choice of transfer learning layers affects the performance of the fine-tuned solver. To that end, we apply transfer learning to different levels of the base solver, by changing the starting point of transfer learning from the 2nd convolutional layer (CL) all the way down to the first fully-connected layer (FC).

6.6.1 Identify the best beginning layers. We apply transfer learning to different levels of the base solver. This is achieved by changing the starting point of transfer learning from the 2nd convolutional layer (CL) all the way down to the first fully-connected layer (FC). To determine the best starting layer for transfer learning, we apply cross-validation to the real captcha training dataset. Specifically, we divide the 500 real captchas into two parts: the first part of 450 captchas is used to refine the base solver, and the remaining 50 captchas are used to validate the refined solver. We vary the beginning layer for transfer learning and then test the refined base solver on the validation set to find out which beginning layer leads to the best performance. Figure 16 reports the performance of the resulting fine-tuned solvers trained under different transfer learning configurations for the 11 current captcha schemes given in Table 1. Overall, applying transfer learning from the second or third CL onward leads to the best performance. Furthermore, this refining process only takes several minutes as it uses just 500 captchas.
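The selection procedure can be sketched as a loop over candidate starting layers. Everything below except the 450/50 split and the candidate layer names is a placeholder: `fine_tune_and_score` stands in for "fine-tune from this layer, then measure accuracy on the validation split", and the accuracy values are made up purely to exercise the code, not measured results.

```python
import random

def pick_start_layer(real_captchas, candidates, fine_tune_and_score, seed=0):
    """Hold out 50 of the 500 real captchas and return the best starting layer."""
    data = real_captchas[:]
    random.Random(seed).shuffle(data)
    train, val = data[:450], data[450:500]
    scores = {layer: fine_tune_and_score(layer, train, val) for layer in candidates}
    return max(scores, key=scores.get), scores

# Made-up validation accuracies, used only so that the sketch runs end to end.
fake_accuracy = {"2nd CL": 0.70, "3rd CL": 0.74, "4th CL": 0.66,
                 "5th CL": 0.58, "1st FC": 0.41}

def fine_tune_and_score(layer, train, val):
    assert len(train) == 450 and len(val) == 50
    return fake_accuracy[layer]

best, scores = pick_start_layer(list(range(500)), list(fake_accuracy),
                                fine_tune_and_score)
```

Since each candidate only requires fine-tuning on 450 captchas, trying all five starting layers remains cheap compared with training the base solver.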

6.6.2 Finding suitable training data size. In this experiment, we evaluate how the number of real captchas used in transfer learning affects the success rate of the fine-tuned solver. Figure 17 shows the success rates of the fine-tuned solver when using different numbers of real captchas in transfer learning. Our approach reaches a high success rate when the number of training examples is 500. For most captcha schemes, the success rate drops significantly when fewer than 400 training examples are used. An attacker can easily collect 500 captchas from the target website.

Fig. 16. How the beginning layer for transfer learning affects the resulting performance of the fine-tuned solver. (Plot: fine-tuned solver success rate versus the beginning layer for transfer learning, from the 2nd CL to the 1st FC, for the Google, Wikipedia, eBay, Microsoft, Baidu, Alipay, JD, Sina, Sohu, Weibo and Qihu360 schemes.)

Fig. 17. The achieved success rates when the fine-tuned solver is trained using different numbers of real captchas. (Plot: fine-tuned solver success rate versus the number of real captchas, at 200, 400, 500, 800 and 1000, for the same 11 schemes.)

6.7 Captcha Usability Study

Our evaluation also includes a user study to quantify the impact of security features on user experience (i.e., captcha usability) and the success rate of our solver. Specifically, we have conducted an online survey by recruiting 20 participants to fill in an anonymous questionnaire. Our participants are all under 30 and are familiar with text captchas. In the questionnaire, we present 100 synthetic captchas with different security strengths. We divide the synthetic captchas into six categories based on the number of characters and the security parameters used for generating the captcha. In the survey, we give each participant one minute to label a captcha and ask each participant to rate the usability of five captchas from each category on a 5-point Likert scale, where 1 = very poor and 5 = excellent usability.


Table 6. Example captchas used in our user study, the success rates of humans and our approach, and the usability rating.

| No. | Example | Anti-segmentation features | Anti-recognition features | Success rate (humans) | Success rate (ours) | Usability |
|---|---|---|---|---|---|---|
| 1 | (image) | English letters, Arabic numerals | Rotation, varied font sizes | 95.25% | 100% | 4 |
| 2 | (image) | English letters | Rotation, varied font sizes | 90.25% | 88% | 2.75 |
| 3 | (image) | English letters, complex background | Rotation, distortion | 91% | 96% | 2.8 |
| 4 | (image) | English letters, overlapping characters, complex background | Varied font sizes, rotation, distortion | 89.25% | 86% | 2.7 |
| 5 | (image) | English letters | Varied font sizes, rotation, distortion | 79.75% | 77% | 2.8 |
| 6 | (image) | English letters, overlapping characters | Varied font sizes, rotation, distortion, waving | 68.75% | 40% | 2.1 |

Table 6 gives the criteria used to determine the captcha difficulties and an example captcha
for each category. For each category, we also give the averaged success rates achieved by our
participants and our solver, as well as the averaged rating given by the participants.

We see that using more security features increases the difficulty for a computer program to solve
a captcha challenge, but it also degrades the user experience. For example, the averaged
human success rate for the captchas in category 6 of Table 6 is below 70%, meaning that nearly
one-third of the time a user will enter a wrong answer for captchas in this category. It is therefore
not surprising that captchas in this category were given the lowest usability score of 2.1. We also
observe that various security features have a different impact on the effectiveness of our captcha
solver. For example, our solver can better handle captchas with noisy backgrounds in categories
3 and 4 than those with distorted characters in categories 5 and 6. As a result, although a captcha
image with a noisy background may have equally poor usability as another one with distorted or
overlapping characters, the two captcha images could have different degrees of robustness under
our attack. Moreover, as we expect, the success rate of a computer solver drops as the difficulty of
the captcha increases.
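The comparison between the noisy-background categories (3 and 4) and the distortion-only categories (5 and 6) can be made concrete with a small worked calculation on the success rates reported in Table 6. The sketch below uses an unweighted mean over categories, which is an assumption: the number of captchas per category is not given here.

```python
# Success rates (%) copied from Table 6: category -> (humans, our solver).
TABLE6 = {
    1: (95.25, 100.0),
    2: (90.25, 88.0),
    3: (91.0, 96.0),
    4: (89.25, 86.0),
    5: (79.75, 77.0),
    6: (68.75, 40.0),
}


def mean_rates(categories):
    """Unweighted mean of the human and solver success rates over categories."""
    humans = sum(TABLE6[c][0] for c in categories) / len(categories)
    solver = sum(TABLE6[c][1] for c in categories) / len(categories)
    return humans, solver


noisy_background = mean_rates([3, 4])  # categories with complex backgrounds
distorted_only = mean_rates([5, 6])    # categories without complex backgrounds

print(noisy_background)  # (90.125, 91.0)
print(distorted_only)    # (74.25, 58.5)
```

Under this simple averaging, the solver is roughly on par with humans on categories 3 and 4, whereas on categories 5 and 6 both rates fall and the solver's rate falls further.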

We also nd that noisy backgrounds have a negative impact on the user experience because our

participants gave an averaged usability score of less than 3 for captchas in categories 3 and 4 of

Table 6. On the other hand, background confusion has little contribution to the security strength

of captchas under our attack. This can be conrmed f rom t he s imilar, or e ven better-solving

performance given by our solver when compared to human participants for captchas in the two

categories. This nding suggests that complex background confusion perhaps should be abandoned

in future text captcha schemes. Overall, this user study shows that a

GAN

-based captcha solver

can achieve comparable performance for solving text captchas when compared to humans, but

balancing the security and usability of a text captcha scheme is not trivial.


Table 7. Success rates of our fine-tuned solver compared to RCN [20] for classifying the MNIST dataset.

| # of images per digit | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|
| RCN | 96.5% | 97.3% | 97.6% | 97.8% | 98% |
| Our fine-tuned solver | 96.2% | 98.3% | 98.9% | 99.2% | 99.8% |

6.8 Generalization Ability

Given the scope of this work, we cannot test our approach on all current captcha schemes. To
evaluate our approach’s generalization ability for character recognition, we apply it to the MNIST
dataset. This dataset contains a large number of handwritten digits of different forms.

We follow the same methodology as we have used throughout the evaluation to build an MNIST
solver, i.e., by first building a synthesizer, then a base solver and a fine-tuned solver. We train the
synthesizer using up to 500 real MNIST images. Next, we build the base solver using 100,000 synthetic
images before fine-tuning the base solver using the same set of real MNIST images. We compare
our solver with the recently proposed RCN [ ], which was shown to be effective when using a small
number of training samples. We test both approaches on images that are not seen in the training
phase.

As can be seen from Table 7, our approach gives a marginally lower accuracy when using 20 real
MNIST images per digit, but it outperforms the RCN when using 40 or more real images per digit. In
other words, our approach is effective on another image classification dataset, indicating that our
approach has a good generalization ability.
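To make the comparison in Table 7 explicit, the per-setting accuracy gap can be computed directly from the reported numbers. The snippet below is only a worked calculation on the table's values; the helper name `accuracy_gap` is introduced here for illustration.

```python
# Accuracy (%) from Table 7, keyed by the number of real images per digit.
RCN = {20: 96.5, 40: 97.3, 60: 97.6, 80: 97.8, 100: 98.0}
OURS = {20: 96.2, 40: 98.3, 60: 98.9, 80: 99.2, 100: 99.8}


def accuracy_gap(n):
    """Fine-tuned solver accuracy minus RCN accuracy, in percentage points."""
    return round(OURS[n] - RCN[n], 1)


gaps = {n: accuracy_gap(n) for n in RCN}
print(gaps)  # {20: -0.3, 40: 1.0, 60: 1.3, 80: 1.4, 100: 1.8}

# Smallest training size at which the fine-tuned solver is ahead of RCN.
crossover = min(n for n, gap in gaps.items() if gap > 0)
print(crossover)  # 40
```

The gap is 0.3 percentage points in favor of RCN at 20 images per digit and grows in favor of the fine-tuned solver from 40 images per digit onward.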

Extend to other captcha schemes. We believe our approach is generally applicable and can be
naturally extended to video and image captchas by adapting the network architecture to recognize
objects from the inputs; favorably, the process of synthetic data generation, model training
and tuning remains unchanged. This flexibility allows one to attack various types of captchas, not
just text-based ones. For example, to target NuCAPTCHA [ ], a motion-based captcha scheme,
we need to replace our CNN solver with a model similar to the Mask R-CNN [ ]. The idea is to
first segment the video frames into images and then recognize characters from individual images.
After replacing the solver structure, we also need to extend our GAN-based captcha synthesizer
to generate a sequence of synthetic images (as recognition is performed at the image level). For
motion-based captchas, the key is to maintain the temporal relationships among images, for which
a temporal CNN can be useful [40].

7 POTENTIAL COUNTERMEASURES

7.1 Security Enhancement through Adversarial Example Generation

Recent works have shown that adversarial examples generated by inserting some perturbations
onto a target image can confuse a machine-learned image classifier [ ]. Recent work has exploited
this observation to improve the security of captcha images [ ]. However, the perturbations
or noise generated by prior methods [ ] are often noticeable to the human eye and, as a result,
our preprocessing model is effective in removing these perturbations. Hence, a better approach is
to make the perturbations imperceptible, so that they have less impact on the user experience while
increasing the difficulty of training a successful preprocessing model.

However, one of the challenges for generating imperceptible perturbations is that the generation
scheme is tightly coupled to both the captcha image and the captcha solver. This raises a practical
issue because the captcha designer often does not have a copy of the solver implementation. To
demonstrate this point, we use synthetic Baidu captchas to train five captcha solvers (in addition
to our LeNet-5-based model), based on five established CNN models: MaxoutNet [ ], NetInNet [ ],
GoogleNet [ ], VGG [ ], and ResNet18 [ ]. We then apply each trained solver to a captcha image with
different imperceptible perturbations. Table 8 shows the original captcha image and its adversarial
versions, and the predictions given by different solvers. To aid readability, we mark the perturbations,
which are imperceptible to the participants in our user study, using a black box. As can be seen
from the table, a perturbation scheme tuned for a particular network cannot invalidate the others.

Table 8. Examples of the original captcha (No. 1) and the corresponding adversarial captchas with different perturbations (No. 2 - 6).

| No. | Captcha | Ours | MaxoutNet | NetInNet | GoogleNet | VGG | ResNet18 |
|---|---|---|---|---|---|---|---|
| 1 | (image) | LMGW(√) | LMGW(√) | LMGW(√) | LMGW(√) | LMGW(√) | LMGW(√) |
| 2 | (image) | NMGW(×) | VMGW(×) | IMGW(×) | IMGW(×) | LMGW(√) | NMGW(×) |
| 3 | (image) | LWWW(×) | LMMW(×) | LWNW(×) | LWWN(×) | VWNW(×) | LMGW(√) |
| 4 | (image) | LMGW(√) | LMSW(×) | LWSW(×) | LWSW(×) | LMGW(√) | LWGW(×) |
| 5 | (image) | LMGW(√) | LNGW(×) | VWGW(×) | LNNW(×) | LWWW(×) | LWGW(×) |
| 6 | (image) | LWMW(×) | LWGW(×) | LWWW(×) | LWGW(×) | LWWW(×) | LWMW(×) |
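The predictions in Table 8 can be tallied to see which solvers still recover the correct text for each adversarial version. The snippet below is a worked count over the table's entries only; the ground truth "LMGW" is taken from row No. 1, where every solver is marked correct.

```python
SOLVERS = ["Ours", "MaxoutNet", "NetInNet", "GoogleNet", "VGG", "ResNet18"]
TRUTH = "LMGW"  # text of the original captcha (row No. 1 of Table 8)

# Predictions copied from Table 8 for the adversarial versions (No. 2 - 6).
PREDICTIONS = {
    2: ["NMGW", "VMGW", "IMGW", "IMGW", "LMGW", "NMGW"],
    3: ["LWWW", "LMMW", "LWNW", "LWWN", "VWNW", "LMGW"],
    4: ["LMGW", "LMSW", "LWSW", "LWSW", "LMGW", "LWGW"],
    5: ["LMGW", "LNGW", "VWGW", "LNNW", "LWWW", "LWGW"],
    6: ["LWMW", "LWGW", "LWWW", "LWGW", "LWWW", "LWMW"],
}


def unaffected_solvers(row):
    """Solvers that still predict the correct text for adversarial row `row`."""
    return [s for s, p in zip(SOLVERS, PREDICTIONS[row]) if p == TRUTH]


for row in PREDICTIONS:
    print(row, unaffected_solvers(row))
# 2 ['VGG']
# 3 ['ResNet18']
# 4 ['Ours', 'VGG']
# 5 ['Ours']
# 6 []
```

In this data, four of the five adversarial versions leave at least one solver with the correct prediction, which illustrates why a perturbation tuned against one network may not carry over to another.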

One approach for improving the generalization ability of the perturbation scheme is to find ways
to generate perturbations that can invalidate the commonly used image classification models. To
this end, we implement a prototype adversarial generator to target the CNN-based models listed
in Table 8.

7.1.1 Countermeasure prototype. Our prototype has three components: a feature location module,
a perturbation generation module and an adversarial solver, described as follows.

The feature location module finds which areas of the captcha image are most important for
successful recognition of a given captcha image across network architectures. To do so, we first
apply sliding windows to divide a captcha image into a number of areas, from top to
bottom and left to right. We then add random noise to each area to observe whether the CNN model
misclassifies the captcha image; the areas that can confuse the CNN model are selected
as the critical locations. Once these critical areas are located, the perturbation generation module
(built upon [ ]) generates the adversarial captcha image by inserting perturbations into
each of these areas. Note that the perturbation generator may produce different perturbations for
different areas. We run the perturbed images through a set of pre-trained captcha solvers (built
upon different network architectures) to check if the perturbed images can confuse all the solvers.
If not, we ask the perturbation generator to create a new set of perturbations until this success
criterion is met or the generation time has exceeded a threshold (set to three seconds in our case).
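The feature location step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the prototype itself: images are plain nested lists of grayscale values in [0, 1], each solver is a hypothetical callable returning the predicted text, and the window size, stride, noise scale, and number of trials are illustrative values not taken from the article. An area is kept here if noise in it makes at least one solver misclassify, which is one possible reading of the selection rule.

```python
import random
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]          # grayscale pixels in [0, 1], row-major
Solver = Callable[[Image], str]    # hypothetical: returns the predicted text
Area = Tuple[int, int, int, int]   # (top, left, height, width)


def sliding_windows(height: int, width: int, size: int, stride: int) -> List[Area]:
    """Enumerate square windows row by row, left to right within each row."""
    return [
        (top, left, size, size)
        for top in range(0, height - size + 1, stride)
        for left in range(0, width - size + 1, stride)
    ]


def add_noise(image: Image, area: Area, scale: float, rng: random.Random) -> Image:
    """Return a copy of `image` with uniform noise added inside `area`."""
    top, left, h, w = area
    noisy = [row[:] for row in image]
    for y in range(top, top + h):
        for x in range(left, left + w):
            value = noisy[y][x] + rng.uniform(-scale, scale)
            noisy[y][x] = min(1.0, max(0.0, value))  # clip to the valid range
    return noisy


def critical_areas(
    image: Image,
    label: str,
    solvers: Sequence[Solver],
    size: int = 8,
    stride: int = 8,
    scale: float = 0.5,
    trials: int = 5,
    seed: int = 0,
) -> List[Area]:
    """Keep the windows where random noise makes at least one solver misclassify."""
    rng = random.Random(seed)
    selected: List[Area] = []
    for area in sliding_windows(len(image), len(image[0]), size, stride):
        for _ in range(trials):
            noisy = add_noise(image, area, scale, rng)
            if any(solver(noisy) != label for solver in solvers):
                selected.append(area)
                break
    return selected
```

A stricter variant could require every solver to be confused before an area is selected; replacing `any` with `all` gives that behavior.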

To enhance the transferability of the synthetic adversarial captchas, we follow Xie et
al. [ ] and apply random multiscale transformations to each critical area at each iteration. If
the generation time threshold is exceeded, we choose the image that can confuse the largest number
of target solvers.
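The generate-and-check loop, with its time budget and fallback to the best candidate, can be sketched as below. This is an illustration under assumptions: `perturb` is a hypothetical callback standing in for the perturbation generator, each solver is a hypothetical callable returning the predicted text, and the `clock` parameter is added only so the timing can be controlled in tests. The three-second default follows the threshold stated above.

```python
import time
from typing import Any, Callable, Sequence, Tuple


def generate_adversarial(
    image: Any,
    label: str,
    solvers: Sequence[Callable[[Any], str]],
    perturb: Callable[[Any], Any],
    budget_s: float = 3.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Any, int]:
    """Regenerate perturbations until all solvers are confused or time runs out.

    Returns (best_image, number_of_confused_solvers). If the budget is
    exhausted, the candidate that confused the most solvers is returned.
    """
    deadline = clock() + budget_s
    best, best_fooled = image, 0
    while True:
        candidate = perturb(image)  # hypothetical perturbation generator
        fooled = sum(1 for solver in solvers if solver(candidate) != label)
        if fooled > best_fooled:
            best, best_fooled = candidate, fooled
        if fooled == len(solvers) or clock() >= deadline:
            return best, best_fooled
```

Keeping track of the best candidate inside the loop means the fallback costs nothing extra when the budget expires.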

7.1.2 Evaluation of countermeasure. We evaluate our captcha generator using the 1,000 real Sohu
captchas that were used in the evaluation reported in Section 6. We choose this scheme because
our solver is highly effective in solving it, giving the highest success rate. The results with
and without our perturbation scheme are shown in Figure 18. Our perturbation scheme
significantly reduces the success rate for solving the Sohu scheme when using a CNN-based solver.

[Figure omitted. x-axis: different captcha solvers (Our solver, MaxoutNet, NetInNet, GoogleNet, VGG, ResNet18); y-axis: the success rate (20% to 100%); two series: original, adversarial.]
Fig. 18. The success rates when targeting the original Sohu captchas and the adversarial versions generated by our scheme.

7.1.3 Limitations of countermeasure. We acknowledge that our countermeasure does not eliminate
the vulnerability of text captchas under deep-learning-based attacks, as an attacker can still use
a network that is different from the ones targeted by our perturbation generator. However, we
find that changing the number of layers or neurons, or the size of the convolutional layers of a
solver has little impact on our perturbation scheme. We also find that using a deeper network
does not significantly improve the success rate for solving perturbed captcha images because most
of the captcha images are of small sizes and hence a deeper network does not offer additional
benefits. Nonetheless, we want to stress that while our countermeasure can help to improve the
security strength of current text captcha schemes, these schemes will inevitably become less secure
when more advanced deep neural network architectures are proposed. Therefore, the community should
revisit the use of text captchas.

7.2 Other Alternative Countermeasures

Some alternatives have been proposed to replace text captchas. These include video-based
captchas like NuCAPTCHA [ ] and game-based CAPTCHAs [ ]. The former was shown to be
vulnerable [ ]. The latter seems to offer some promise, but the recent breakthrough of deep
reinforcement learning in game playing may pose a threat to such schemes [ ]. To have a robust
countermeasure, one probably needs to combine multiple mechanisms, similar to the multi-factor
authentication protocol [ ]. Nonetheless, how to balance the security strength and usability of
a scheme is still an outstanding problem.

8 RELATED WORK

The work presented by Mori et al. [ ] was among the first text captcha solvers. Their approach
employs a set of analytical models and heuristics to attack Gimpy and EZ-Gimpy, two early simple
text-based captcha schemes. Since then, a large body of work has explored ways to improve
the security of text captchas, building upon attacks on existing captcha schemes. In response to
these successful attacks, text captchas have gone through an iterative development process, and they
are still preferred by many users, primarily for their familiarity and a sense of security and control [ ].

Segmentation-based aacks.

This type of attacks rst segments characters of a captcha image

and then identies each segmented character using machine-learning algorithms. Yan et al. show

1:26 Ye, G. et al

a simple character segmentation method [

], which counts the number of pixels of individual

characters, can break most of the captchas from

Captchaservices.org

. Later, they show an

improved segmentation method can be used to attack the early captcha schemes used by

Yahoo

Microsoft

and

Google

[

]. Unlike our approach, all the aforementioned attacks are tightly coupled

to the captcha scheme and hard to generalize. This means to target a new scheme, they would

require human involvement to revise the existing heuristics and possibly to design new heuristics.

Deep-learning-based aacks.

Decaptcha [

] employs machine-learning-based classiers to de-

velop a generic attack for text-based captchas. It can break 13 captcha schemes but achieves zero

success on more dicult schemes including

reCAPTCHA

and

Google

scheme. By contrast, our ap-

proach not only gives a higher accuracy on the schemes where Decpatcha succeeds but also delivers

a success rate of 87.4% on

reCAPTCHA

for which Decaptcha has a success rate of zero (see Table 5).

Recently, George et al. presents Recursive Cortical Network (

RCN

) for image recognition [

]. The

RCN

is eective in recognizing individual characters but are less eective for solving text-based

captchas when compared to our approach. In particular, on the

PayPal

dataset, our approach boosts

the success rate from 57.1% to 92.4%. Stark et al. [

] show that active learning can be used to

reduce the number of captchas required to learn a solver. However, this approach requires having

access to a captcha generator of the target scheme, which is often not available to the adversary.

On the other hand, active learning is complementary to our approach as it allows the learning

engine to use a fewer number of training samples to speed up the training process.

Other aacks.

The work presented by Gao et al. targets captchas of hollow characters [

]. Their

approach rst lls the hollow character strokes, and then searches for the possible combinations

of adjacent character strokes to recognize individual characters. While are eective on hollow

characters, this approach is ineective on captcha images with overlapping and distorted characters.

Their more recent work [18] uses the Log-Gabor lter to rst extract character components from

the captcha image; it then uses the k-Nearest Neighbor algorithm to recognize individual characters

using the extracted information. Due to the limitation of the Log-Gabor lter, their method is

ineective for captcha images with noisy backgrounds, e.g. Baidu captcha shown in Figure 1b.

Alternative captcha schemes. It is worth mentioning that there are also other captcha schemes
built around images [ ], audio data [ ] or, more recently, adversarial captchas [ ].
Many of these were proposed to replace text captchas. However, these alternative schemes are less
popular than text captchas and were shown to be vulnerable too [ ]. In particular,
a significant weakness of an image-based scheme is that the number of images used by the scheme
is typically limited. As a result, an adversary may exploit side channels to obtain and label a large
portion of the images used by a scheme [32].

Adversarial machine learning. As a final remark, we would like to point out that our work builds
upon the foundations of adversarial machine learning [ ]. This technique has been shown to be useful
in constructing adversarial applications to bypass malware detection [ ], escape from spam
mail filtering [ ], or confuse machine learning classifiers [ ]. However, no work to date has
employed the technique to construct a generic solver for text captchas, and our work is the first to
do so.

9 CONCLUSION

This article has presented the first GAN-based generic solver for text captchas. Our solver is built
by first learning a captcha synthesizer to automatically generate synthetic training examples to
build a base solver, and then refining the base solver using transfer learning. This feature allows
our approach to rely on fewer real captchas to construct the solver and to target a wide range of
schemes. As a result, our approach needs less human involvement compared to prior methods.


Our approach was evaluated on 33 text captcha schemes, including 11 schemes that were being
used by 32 of the top-50 popular websites at the time this study was conducted. Experimental
results show that our approach outperforms four state-of-the-art solvers by successfully solving more
captchas. We show that our approach is robust and generally applicable, and can break many
advanced security features used by modern text captchas. Our results suggest that these advanced
features only make it difficult for a legitimate user but would fail to stop automated programs. As a
countermeasure, we show that by inserting some imperceptible perturbations into a captcha image,
one can enhance the security strength of text captchas under deep-learning-based attacks.

REFERENCES

[1] Abdalnaser Algwil, Dan C Ciresan, Beibei Liu, and Jeff Yan. 2016. A security analysis of automated chinese turing tests. (2016), 520–532.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In International conference on machine learning. 214–223.

[3] Elias Athanasopoulos and Spiros Antonatos. 2006. Enhanced CAPTCHAs: using animation to tell humans and computers apart. In IFIP International Conference on Communications and Multimedia Security. 97–108.

[4] Charles Audet and J. E. Dennis Jr. 2006. Mesh Adaptive Direct Search Algorithms for Constrained Optimization. Siam Journal on Optimization 17, 1 (2006), 188–217.

[5] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D. Joseph, and J. D. Tygar. 2006. Can machine learning be secure?. In ACM Symposium on Information, Computer and Communications Security. 16–25.

[6] Jeffrey P. Bigham and Anna C. Cavender. 2009. Evaluating existing audio CAPTCHAs and an interface optimized for non-visual use. In Sigchi Conference on Human Factors in Computing Systems. 1829–1838.

[7] Elie Bursztein. 2012. How we Broke the NuCaptcha Video Scheme and What we Proposed to Fix it. https://elie.net/blog/security/how-we-broke-the-nucaptcha-video-scheme-and-what-we-propose-to-fix-it.

[8] Elie Bursztein, Jonathan Aigrain, Angelika Moscicki, and John C Mitchell. 2014. The end is nigh: generic solving of text-based CAPTCHAs. In USENIX WOOT.

[9] Elie Bursztein and Steven Bethard. 2009. Decaptcha: breaking 75% of eBay audio CAPTCHAs. In Usenix Conference on Offensive Technologies. 8–8.

[10] Elie Bursztein, Matthieu Martin, and John Mitchell. 2011. Text-based CAPTCHA strengths and weaknesses. In CCS. 125–138.

[11] Elie Bursztein, Angelique Moscicki, Celine Fabry, Steven Bethard, John C. Mitchell, and Jurafsky Dan. 2014. Easy does it: more usable CAPTCHAs. In ACM Conference on Human Factors in Computing Systems. 2637–2646.

[12] Kumar Chellapilla, Kevin Larson, Patrice Y. Simard, and Mary Czerwinski. 2005. Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs). In Conference on Email & Anti-Spam.

[13] Monica Chew and J Doug Tygar. 2004. Image recognition captchas. In International Conference on Information Security. Springer, 268–279.

[14] Jeremy Elson, John R. Douceur, Jon Howell, and Jared Saul. 2007. Asirra: a CAPTCHA that exploits interest-aligned manual image categorization. In ACM Conference on Computer and Communications Security, CCS 2007, Alexandria, Virginia, Usa, October. 366–374.

[15] Bent Fuglede and Flemming Topsoe. 2004. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings. IEEE, 31.

[16] Haichang Gao, Mengyun Tang, Yi Liu, Ping Zhang, and Xiyang Liu. 2017. Research on the Security of Microsoft's Two-Layer Captcha. IEEE Transactions on Information Forensics & Security 12, 7 (2017), 1671–1685.

[17] Haichang Gao, Wang Wei, Xuqin Wang, Xiyang Liu, and Jeff Yan. 2013. The robustness of hollow CAPTCHAs. In ACM Sigsac Conference on Computer & Communications Security. 1075–1086.

[18] Haichang Gao, Jeff Yan, Fang Cao, Zhengya Zhang, Lei Lei, Mengyun Tang, Ping Zhang, Xin Zhou, Xuqin Wang, and Jiawei Li. 2016. A Simple Generic Attack on Text Captchas. In NDSS.

[19] Song Gao. 2014. An evolutionary study of dynamic cognitive game CAPTCHAs: Automated attacks and defenses. Dissertations & Theses - Gradworks (2014).

[20] Dileep George, Wolfgang Lehrach, Ken Kansky, Miguel Lázaro-Gredilla, Christopher Laan, Bhaskara Marthi, Xinghua Lou, Zhaoshi Meng, Yi Liu, Huayan Wang, et al. 2017. A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs. Science 358, 6368 (2017), eaag2612.

[21] C Gold, A Holub, and P Sollich. 2005. Bayesian approach to feature selection and parameter tuning for support vector machine classifiers. Neural Networks 18, 5 (2005), 693–701.


[22] Philippe Golle. 2008. Machine learning attacks against the Asirra CAPTCHA. computer and communications security 2008 (2008), 535–542.

[23] Ian J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. 2014. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In International Conference on Learning Representations (ICLR).

[24] Ian J. Goodfellow, Jean Pougetabadie, Mehdi Mirza, Bing Xu, David Wardefarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. Advances in Neural Information Processing Systems 3 (2014), 2672–2680.

[25] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In ICML. 1–10.

[26] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout networks. arXiv preprint arXiv:1302.4389 (2013).

[27] Rich Gossweiler, Maryam Kamvar, and Shumeet Baluja. 2009. What’s up CAPTCHA?: a CAPTCHA based on image orientation. In International Conference on World Wide Web, WWW 2009, Madrid, Spain, April. 841–850.

[28] Greg Mori and Jitendra Malik. 2003. Recognizing objects in adversarial clutter: Breaking a visual CAPTCHA. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[29] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV). 2980–2988.

[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.

[31] Robert Hecht-Nielsen. 1989. Theory of the backpropagation neural network. Harcourt Brace & Co. 593–605 vol.1 pages.

[32] Carlos Javier Hernandezcastro, Arturo Ribagorda, and Yago Saez. 2009. Side-channel attack on labeling CAPTCHAs. Computer Science (2009).

[33] Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin I. P Rubinstein, and J. D Tygar. 2011. Adversarial machine learning. IEEE Internet Computing 15, 5 (2011), 4–6.

[34] Phillip Isola. 2017. Pix2Pix: Image-to-Image Translation with Conditional Adversarial Networks. https://github.com/phillipi/pix2pix.

[35] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2016. Image-to-Image Translation with Conditional Adversarial Networks. arxiv (2016).

[36] Wilkins J. 2009. Strong captcha guidelines v1. 2. (2009).

[37] Zhiping Jiang, Jizhong Zhao, Xiang-Yang Li, Jinsong Han, and Wei Xi. 2013. Rejecting the attack: Source authentication for wi-fi management frames using csi information. In IEEE INFOCOM. 2544–2552.

[38] Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. Computer Science (2014).

[39] Kat Krol, Simon Parkin, and M Angela Sasse. 2016. Better the Devil You Know: A User Study of Two CAPTCHAs and a Possible Replacement Technology. In NDSS Workshop on Usable Security.

[40] Colin Lea, Rene Vidal, Austin Reiter, and Gregory D Hager. 2016. Temporal convolutional networks: A unified approach to action segmentation. In European Conference on Computer Vision. 47–54.

[41] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.

[42] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial Learning for Neural Dialogue Generation. (2017).

[43] Bin Liang, Hongcheng Li, Miaoqiang Su, Xirong Li, Wenchang Shi, and XiaoFeng Wang. 2018. Detecting Adversarial Image Examples in Deep Neural Networks with Adaptive Noise Reduction. IEEE Transactions on Dependable and Secure Computing (2018).

[44] Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013).

[45] Elaine K. McEwan. 2008. Root Words, Roots and Affixes. http://www.readingrockets.org/article/root-words-roots-and-affixes.

[46] Hendrik Meutzner and Dorothea Kolossa. 2014. Reducing the Cost of Breaking Audio CAPTCHAs by Active and Semi-supervised Learning. In International Conference on Machine Learning and Applications. 67–73.

[47] Takeru Miyato, Shinichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2015. Distributional Smoothing by Virtual Adversarial Examples. arXiv (2015).

[48] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv (2013).

[49] Manar Mohamed, Niharika Sachdeva, Michael Georgescu, Song Gao, Nitesh Saxena, Chengcui Zhang, Ponnurangam Kumaraguru, Paul C. Van Oorschot, and Wei Bang Chen. 2014. A three-way investigation of a game-CAPTCHA: automated attacks, relay attacks and usability. In ACM Symposium on Information, Computer and Communications Security. 195–206.


[50] Manar Mohamed, Song Gao, Niharika Sachdeva, Nitesh Saxena, Chengcui Zhang, Ponnurangam Kumaraguru, and Paul C. Van Oorschot. 2017. On the security and usability of dynamic cognitive game CAPTCHAs. Journal of Computer Security (2017), 1–26.

[51] Margarita Osadchy, Julio Hernandez-Castro, Stuart Gibson, Orr Dunkelman, and Daniel Pérez-Cabo. 2017. No Bot Expects the DeepCAPTCHA! Introducing Immutable Adversarial Examples, with Applications to CAPTCHA Generation. IEEE Transactions on Information Forensics & Security PP, 99 (2017), 1–1.

[52] Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge & Data Engineering 22, 10 (2010), 1345–1359.

[53] Ishai Rosenberg, Asaf Shabtai, Lior Rokach, and Yuval Elovici. 2017. Generic Black-Box End-to-End Attack against RNNs and Other API Calls Based Malware Classifiers. arXiv (2017).

[54] Neil J. Rubenking. 2013. Are You a Human. https://www.areyouahuman.com.

[55] Andy Schlaikjer. 2010. A Dual-Use Speech CAPTCHA: Aiding Visually Impaired Web Users while Providing Transcriptions of Audio Streams. LTI (2010).

[56] NuData Security. 2010. NuCaptcha. www.nucaptcha.com.

[57] Muhammad Shahzad, Alex X Liu, and Arjmand Samuel. 2017. Behavior based human authentication on touch screen devices using gestures and signatures. IEEE Transactions on Mobile Computing 16, 10 (2017), 2726–2741.

[58] Chenghui Shi, Xiaogang Xu, Shouling Ji, Kai Bu, Jianhai Chen, Raheem A. Beyah, and Ting Wang. 2019. Adversarial CAPTCHAs. CoRR abs/1901.01107 (2019). arXiv:1901.01107 http://arxiv.org/abs/1901.01107

[59] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. 2017. Learning from Simulated and Unsupervised Images through Adversarial Training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. Computer Science (2014).

[61] Suphannee Sivakorn, Iasonas Polakis, and Angelos D. Keromytis. 2016. I am Robot: (Deep) Learning to Break Semantic Image CAPTCHAs. In IEEE European Symposium on Security and Privacy. 388–403.

[62] Fabian Stark, Caner Hazirbas, Rudolph Triebel, and Daniel Cremers. 2015. CAPTCHA Recognition with Active Deep Learning. In German Conference on Pattern Recognition Workshop.

[63] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.

[64] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. Computer Science (2015), 2818–2826.

[65] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. Computer Science (2013).

[66] Jennifer Tam, Jiří Simsa, Sean Hyde, and Luis Von Ahn. 2008. Breaking Audio CAPTCHAs. In Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December. 1625–1632.

[67] Luis Von Ahn, Manuel Blum, Nicholas J Hopper, and John Langford. 2003. CAPTCHA: Using Hard AI Problems for Security. Springer Berlin Heidelberg. 294–311 pages.

[68] Luis Von Ahn, Manuel Blum, and John Langford. 2004. Telling humans and computers apart automatically. Communications of the Acm 47, 2 (2004), 56–60.

[69] Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L. Yuille. 2019. Improving Transferability of Adversarial Examples With Input Diversity. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Weilin Xu, Yanjun Qi, and David Evans. 2016. Automatically Evading Classifiers: A Case Study on PDF Malware Classifiers. In Network and Distributed System Security Symposium.

[71] Yi Xu, Gerardo Reynaga, Sonia Chiasson, Jan-Michael Frahm, Fabian Monrose, and Paul C Van Oorschot. 2014. Security analysis and related usability of motion-based captchas: Decoding codewords in motion. IEEE transactions on dependable and secure computing 11, 5 (2014), 480–493.

[72] Jeff Yan and Ahmad Salah El Ahmad. 2007. Breaking Visual CAPTCHAs with Naive Pattern Recognition Algorithms. In Computer Security Applications Conference, 2007. ACSAC 2007. Twenty-Third Annual. 279–291.

[73] Jeff Yan and Ahmad Salah El Ahmad. 2008. A low-cost attack on a Microsoft captcha. In ACM Conference on Computer and Communications Security, CCS 2008, Alexandria, Virginia, Usa, October. 543–554.

[74] Guixin Ye, Zhanyong Tang, Dingyi Fang, Zhanxing Zhu, Yansong Feng, Pengfei Xu, Xiaojiang Chen, and Zheng Wang. 2018. Yet Another Text Captcha Solver: A Generative Adversarial Network Based Approach. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 332–348.

[75] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks?. In Advances in neural information processing systems. 3320–3328.


[76] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. (2016).

[77] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv preprint arXiv:1703.10593 (2017).