
1 Introduction
Artificial Intelligence (AI) has experienced unprecedented success in recent years thanks to the progress accomplished in Machine Learning (ML), and more specifically Deep Learning (DL). These advances raise several questions about AI safety and ethics [1]. In this work, we do not provide answers to these questions, but we show that AI systems based on ML algorithms, such as reCAPTCHA v3 [2], are still vulnerable to automated attacks. Google's reCAPTCHA system, which distinguishes humans from bots, is the most widely used defense mechanism on websites. Its purpose is to protect websites against automated agents, bots, attacks, and spam. Previous versions of Google's reCAPTCHA (v1 and v2) presented tasks (images, letters, audio) that are easily solved by humans but challenging for computers. reCAPTCHA v1 showed distorted text that the user had to type correctly to pass the test. This version was defeated by Bursztein et al. [3] with 98% accuracy using an ML-based system to segment and recognize the text. As a result, image-based and audio-based reCAPTCHAs were introduced as a second version. Researchers have also succeeded in breaking these versions using ML, and more specifically DL. For example, the authors in [4] designed an AI-based system called UnCAPTCHA to break Google's most challenging audio reCAPTCHAs. On 29 October 2018, the official third version was released [5]; it removed any user interface. Google's reCAPTCHA v3 uses ML to return a risk assessment score between 0.0 and 1.0. This score characterizes the trustworthiness of the user: a score close to 1.0 means that the user is human.
In this work, we introduce an RL formulation to defeat this version of reCAPTCHA. Our approach proceeds in three steps: first, we propose a plausible formalization of the problem as a Markov Decision Process (MDP) solvable by state-of-the-art RL algorithms; then, we introduce a new environment for interacting with the reCAPTCHA system; finally, we analyze how the RL agent learns or fails to defeat Google reCAPTCHA. Experimental results show that the RL agent passes the reCAPTCHA test with 97.4% accuracy. To our knowledge, this is the first attempt to defeat reCAPTCHA v3 using RL.
2 Method
2.1 Preliminaries
An agent interacting with an environment is modeled as a Markov Decision Process (MDP) [6]. An MDP is defined as a tuple $(S, A, P, r)$ where $S$ and $A$ are the sets of possible states and actions respectively, $P(s, a, s')$ is the transition probability between states, and $r$ is the reward function. Our objective is to find an optimal policy $\pi^*$ that maximizes the expected future rewards. Policy-based methods learn $\pi^*$ directly. Let us assume that the policy is parameterized by a set of weights $w$ such that $\pi = \pi(s, w)$. The objective is then defined as $J(w) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $\gamma$ is the discount factor and $r_t$ is the reward at time $t$.
Thanks to the policy gradient theorem and the gradient trick [7], the Reinforce algorithm [8] estimates gradients using
(1).
$$\nabla \, \mathbb{E}_\pi\!\left[\sum_{t=0}^{T} \gamma^t r_t\right] = \mathbb{E}_\pi\!\left[\sum_{t=0}^{T} \nabla \log \pi(a_t \mid s_t)\, R_t\right] \qquad (1)$$
$R_t$ is the future discounted return at time $t$, defined as $R_t = \sum_{k=t}^{T} \gamma^{k-t}\, r_k$, where $T$ marks the end of an episode.
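For illustration, a minimal Python sketch (not from the paper) of computing these discounted returns from a list of per-step rewards:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute R_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every time step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Iterate backwards so that each R_t reuses the already computed R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards 0, 0, 1 with gamma = 0.99 give R = [0.9801, 0.99, 1.0].
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.99))
```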
Usually, equation (1) is formulated as the gradient of a loss function $L(w)$ defined as $L(w) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \log \pi(a_t^i \mid s_t^i)\, R_t^i$, where $N$ is the number of collected episodes.
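An illustrative PyTorch-style sketch of this loss over $N$ collected episodes follows; the function and variable names are assumptions for the example, not from the paper's code:

```python
import torch

def reinforce_loss(episodes):
    """L(w) = -(1/N) * sum_i sum_t log pi(a_t^i | s_t^i) * R_t^i.

    `episodes` is a list of (log_probs, returns) pairs, one pair per episode:
    log_probs is a 1-D tensor of log pi(a_t | s_t) kept in the autograd graph,
    returns is a 1-D tensor of the discounted returns R_t (treated as constants).
    """
    total = torch.tensor(0.0)
    for log_probs, returns in episodes:
        total = total + torch.sum(log_probs * returns.detach())
    return -total / len(episodes)

# Minimizing this loss with a standard optimizer (e.g. torch.optim.Adam)
# performs stochastic gradient ascent on the expected return J(w).
```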
2.2 Settings
To pass the reCAPTCHA test, a human user moves the mouse from an initial position, performs a sequence of steps until reaching the reCAPTCHA check-box, and clicks on it. Based on this interaction, the reCAPTCHA system rewards the user with a score. In this work, we model this process as an MDP where the state space $S$ is the set of possible mouse positions on the web page and the action space is $A = \{up, left, right, down\}$. With these settings, the task becomes similar to a grid-world problem.
As shown in Figure 1, the starting point is the initial mouse position and the goal is the position of the reCAPTCHA check-box in the web page. For each episode, the starting point is randomly chosen from a top-right or a top-left region representing 2.5% of the browser window's area (5% on the x-axis and 5% on the y-axis). A grid is then constructed where each pixel between the initial and final points is a possible position for the mouse. We assume that a normal user will not necessarily move the mouse pixel by pixel. Therefore, we define a cell size $c$, which is the number of pixels between two consecutive positions. For example, if the agent is at position $(x_0, y_0)$ and takes the action left, the next position is $(x_0 - c, y_0)$.
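To make the transition dynamics concrete, here is a minimal Python sketch of such a grid-world step; the constants and function are illustrative assumptions, not part of the released environment:

```python
CELL_SIZE = 10  # the cell size c: pixels between two consecutive positions (illustrative value)

# Displacement of each action in (x, y) pixel coordinates,
# with the origin at the top-left corner and y growing downwards.
ACTIONS = {
    "up":    (0, -CELL_SIZE),
    "down":  (0,  CELL_SIZE),
    "left":  (-CELL_SIZE, 0),
    "right": ( CELL_SIZE, 0),
}

def step(position, action, width, height):
    """Apply one action and clip the new mouse position to the browser window."""
    dx, dy = ACTIONS[action]
    x, y = position
    new_x = min(max(x + dx, 0), width - 1)
    new_y = min(max(y + dy, 0), height - 1)
    return (new_x, new_y)

# Example: taking the action "left" from (100, 200) moves the mouse to (90, 200),
# i.e. from (x0, y0) to (x0 - c, y0).
print(step((100, 200), "left", 1920, 1080))
```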