JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, FEBRUARY 2021
TypeNet: Deep Learning Keystroke Biometrics
Alejandro Acien, Aythami Morales, John V. Monaco, Ruben Vera-Rodriguez, Julian Fierrez, Member, IEEE
Abstract—We study the performance of Long Short-Term Memory networks for keystroke biometric authentication at large scale in
free-text scenarios. To this end, we introduce TypeNet, a Recurrent Neural Network (RNN) trained with a moderate number of keystrokes
per identity. We evaluate different learning approaches depending on the loss function (softmax, contrastive, and triplet loss), number
of gallery samples, length of the keystroke sequences, and device type (physical vs touchscreen keyboard). With 5 gallery sequences
and test sequences of length 50, TypeNet achieves state-of-the-art keystroke biometric authentication performance with an Equal Error
Rate of 2.2% and 9.2% for physical and touchscreen keyboards, respectively, significantly outperforming previous approaches. Our
experiments show a moderate increase in error with up to 100,000 subjects, demonstrating the potential of TypeNet to operate
at an Internet scale. We utilize two Aalto University keystroke databases, one captured on physical keyboards and the second on
mobile devices (touchscreen keyboards). To the best of our knowledge, both databases are the largest existing free-text keystroke
databases available for research, with more than 136 million keystrokes from 168,000 subjects on physical keyboards, and 60,000
subjects with more than 63 million keystrokes acquired on mobile touchscreens.
Index Terms—Biometrics, keystroke dynamics, large scale, deep learning, TypeNet, keystroke authentication.
1 INTRODUCTION
Keystroke dynamics is a behavioral biometric trait aimed
at recognizing individuals based on their typing habits. The
velocity of pressing and releasing different keys [1], the
hand postures during typing [2], and the pressure exerted
when pressing a key [3] are some of the features taken
into account by keystroke biometric algorithms aimed at
discriminating among subjects. Although keystroke biometrics
suffers from high intra-class variability for person recognition,
especially in free-text scenarios (i.e. the input text typed
is not fixed between enrollment and testing), the ubiquity
of keyboards as a method of text entry makes keystroke
dynamics a near universal modality to authenticate subjects
on the Internet.
Text entry is prevalent in day-to-day applications: un-
locking a smartphone, accessing a bank account, chatting
with acquaintances, email composition, posting content on
a social network, and e-learning [4]. As a means of subject
authentication, keystroke dynamics is economical because
it can be deployed on commodity hardware and remains
transparent to the user. These properties have prompted
several companies to capture and analyze keystrokes. The
global keystroke biometrics market is projected to grow
from $129.8 million (2017 estimate) to $754.9 million
by 2025, a rate of up to 25% per year1. As an example,
Google has recently committed $7 million to fund
TypingDNA2, a startup company that authenticates people
based on their typing behavior.
At the same time, the security challenges that keystroke
biometrics promises to solve are constantly evolving and
getting more sophisticated every year: identity fraud, account
takeover, sending unauthorized emails, and credit card fraud
are some examples3.
A. Acien, A. Morales, R. Vera-Rodriguez, and J. Fierrez are with the
School of Engineering, Universidad Autonoma de Madrid, 28049 Madrid,
Spain (e-mail: alejandro.acien@uam.es; aythami.morales@uam.es;
ruben.vera@uam.es; julian.fierrez@uam.es).
J. V. Monaco is with the Naval Postgraduate School, Monterey CA, USA
(e-mail: vinnie.monaco@nps.edu).
1. https://www.prnewswire.com/news-releases/keystroke
2. https://siliconcanals.com/news/
These challenges are mag-
nified when dealing with applications that have hundreds
of thousands to millions of users. In this context, keystroke
biometric algorithms capable of authenticating individuals
while interacting with online applications are more neces-
sary than ever. As an example of this, Wikipedia struggles to
solve the problem of 'edit wars' that happen when different
groups of editors represent opposing opinions. According
to [5], up to 12% of the discussions in Wikipedia are
devoted to reverting changes and vandalism, suggesting that
the Wikipedia criteria to identify and resolve controversial
articles are highly contentious. Large-scale keystroke bio-
metrics algorithms could be used to detect these malicious
editors among the thousands of editors who write articles
in Wikipedia every day. Other applications of keystroke
biometric technologies are found in e-learning platforms;
student identity fraud and cheating are some challenges that
virtual education technologies need to address to become
a viable alternative to face-to-face education [4].
The literature on keystroke biometrics is extensive, but
to the best of our knowledge, previous systems have only
been evaluated with up to several hundred subjects and
cannot deal with the recent challenges that massive-scale
applications are facing. The aim of this paper is to explore
the feasibility and limits of deep learning architectures for
scaling up free-text keystroke biometrics to hundreds of
thousands of users. The main contributions of this work are
threefold:
1) We explore novel free-text keystroke biometrics ap-
proaches based on Deep Recurrent Neural Net-
works, suitable for authentication and identification
at large scale. We conduct an exhaustive exper-
imentation and evaluate how performance is af-
fected by the following factors: the length of the
keystroke sequences, the number of gallery samples,
3. https://150sec.com/fraudulent-fingertips
arXiv:2101.05570v2 [cs.CV] 18 Feb 2021
and the device (touchscreen vs physical keyboard).
We present TypeNet, a Recurrent Neural Network
trained with keystroke sequences from more than
100,000 subjects. We analyze the performance of
three different loss functions (softmax, contrastive,
triplet) used to train TypeNet.
2) The results reported by TypeNet represent the state
of the art in keystroke authentication based on free-
text, reducing the error obtained by previous works
by more than 50%. Processed data has been made
available so the results can be reproduced4. We
evaluate TypeNet in terms of Equal Error Rate (EER)
as the number of test subjects is scaled from 100
up to 100,000 (independent from the training data)
for the desktop scenario (physical keyboards) and
up to 30,000 for the mobile scenario (touchscreen
keyboard). TypeNet learns a feature representation
of a keystroke sequence without the need for re-
training if new subjects are added to the database,
as commonly happens in many biometric systems
[6]. Therefore, TypeNet is easily scalable.
3) We carry out a comparison with previous state-of-
the-art approaches for free-text keystroke biometric
authentication. The performance achieved by the
proposed method outperforms previous approaches
in the scenarios evaluated in this work. The results
suggest that authentication error rates achieved by
TypeNet remain low as thousands of new users are
enrolled.
A preliminary version of this article was presented in
[7]. This article significantly improves [7] in the following
aspects:
1) We add a new version of TypeNet trained and
tested with keystroke sequences acquired on mobile
devices, and report results in the mobile scenario.
Additionally, we provide cross-sensor interoperability
results [8], [9] between desktop and mobile datasets.
2) We include two new loss functions (softmax and
triplet loss) that improve performance in all scenarios.
Our experiments demonstrate that triplet loss can
be used to double the accuracy of free-text keystroke
authentication approaches.
3) We evaluate TypeNet in terms of Rank-n identifica-
tion rates using a background set of 1,000 subjects
(independent from the training data).
4) We add experiments on the dependencies between
input text and TypeNet performance, a common
issue in free-text keystroke biometrics.
In summary, we present the first evidence in the lit-
erature of competitive performance of free-text keystroke
biometric authentication at large scale (up to 100,000 test
subjects). The results reported in this work demonstrate
the potential of this behavioral biometric for widespread
deployment.
The paper is organized as follows: Section 2 summarizes
related works in free-text keystroke dynamics. Section 3
describes the datasets used for training and testing TypeNet
models. Section 4 describes the processing steps and learning
methods in TypeNet. Section 5 details the experimental
protocol. Section 6 reports the experiments and discusses
the results obtained. Section 7 summarizes the conclusions
and future work.
4. Data available at: https://github.com/BiDAlab/TypeNet
2 BACKGROUND AND RELATED WORK
The measurement of keystroke dynamics depends on the
acquisition of key press and release events. This can oc-
cur on almost any commodity device that supports text
entry, including desktop and laptop computers, mobile and
touchscreen devices that implement soft (virtual) keyboards,
and PIN entry devices such as those used to process credit
card transactions. Generally, each keystroke (the action of
pressing and releasing a single key) results in a keydown
event followed by a keyup event, and the sequence of these
timings is used to characterize an individual’s keystroke dy-
namics. Within a web browser, the acquisition of keydown
and keyup event timings requires no special permissions,
enabling the deployment of keystroke biometric systems
across the Internet in a transparent manner.
Keystroke biometric systems are commonly placed into
two categories: fixed-text, where the keystroke sequence
typed by the subject is prefixed, such as a username or
password, and free-text, where the keystroke sequence is
arbitrary, such as writing an email or transcribing a sen-
tence with typing errors. Notably, free-text input results in
different keystroke sequences between the gallery and test
samples as opposed to fixed-text input. Biometric authenti-
cation algorithms based on keystroke dynamics for desktop
and laptop keyboards have been predominantly studied
in fixed-text scenarios where accuracies higher than 95%
are common [18]. Approaches based on sample alignment
(e.g. Dynamic Time Warping) [18], Manhattan distances [19],
digraphs [20], and statistical models (e.g. Hidden Markov
Models) [21] have been shown to achieve the best results in
fixed-text.
Nevertheless, the performance of free-text algorithms
is generally far from that reached in the fixed-text sce-
nario, as the complexity and variability of the text entry
contribute to intra-subject variations in behavior, challeng-
ing the ability to recognize subjects [22]. Monrose and Rubin
[10] proposed in 1997 a free-text keystroke algorithm based
on subject profiling by using the mean latency and stan-
dard deviation of digraphs and computing the Euclidean
distance between each test sample and the reference pro-
file. Their correct classification rate worsened from 90%
to 23% when they changed both subject profiles and test
samples from fixed-text to free-text. Gunetti and
Picardi [11] extended the previous algorithm to n-graphs.
They calculated the duration of n-graphs common between
training and testing and defined a distance function based
on the duration and order of such n-graphs. Their results of
7.33% classification error outperformed the previous state of
the art. Nevertheless, their algorithm needs long keystroke
sequences (between 700 and 900 keystrokes) and many
keystroke sequences (up to 14) to build the subject profile,
which limits the usability of that approach. Murphy et al.
[15] more recently collected a very large free-text keystroke
Study | Scenario | #Subjects | #Seq. | Sequence Size | #Keys | Best Performance
Monrose and Rubin (1997) [10] | Desktop | 31 | N/A | N/A | N/A | ACC = 23%
Gunetti and Picardi (2005) [11] | Desktop | 205 | 1-15 | 700-900 keys | 688K | EER = 7.33%
Kim and Kang (2009) [12] | Mobile | 50 | 20 | 200 keys | 200K | EER = 0.05%
Gascon et al. (2014) [13] | Mobile | 315 | 1-10 | 160 keys | 67K | EER = 10.0%
Ceker and Upadhyaya (2016) [14] | Desktop | 34 | 2 | 7K keys | 442K | EER = 2.94%
Murphy et al. (2017) [15] | Desktop | 103 | N/A | 1,000 keys | 12.9M | EER = 10.36%
Monaco and Tappert (2018) [16] | Both | 55 | 6 | 500 keys | 165K | EER = 0.6%
Deb et al. (2019) [17] | Mobile | 37 | 180K | 3 seconds | 6.7M | 81.61% TAR at 0.1% FAR
Ours (2020) | Both | 228K | 15 | 70 keys | 199M | EER = 2.2%
TABLE 1
Comparison among different free-text keystroke datasets employed in relevant related works. N/A = Not Available. ACC = Accuracy, EER = Equal
Error Rate, TAR = True Acceptance Rate, FAR = False Acceptance Rate.
dataset (2.9M keystrokes) and applied the Gunetti and Pi-
cardi algorithm achieving 10.36% classification error using
sequences of 1,000 keystrokes and 10 genuine sequences to
authenticate subjects.
Since the pioneering works of Monrose and Gunetti,
some algorithms based on statistical models have been
shown to work very well with free-text, such as the POHMM
(Partially Observable Hidden Markov Model) [16]. This
algorithm is an extension of the traditional Hidden Markov
Model (HMM), but with the difference that each hidden
state is conditioned on an independent Markov chain. This
algorithm is motivated by the idea that keystroke timings
depend both on past events and the particular key that
was pressed. Performance achieved using this approach in
free-text is close to fixed-text, but it again requires several
hundred keystrokes and has only been evaluated with a
database containing fewer than 100 subjects.
The performance of keystroke biometric systems on
mobile devices can in some cases exceed that of desktop
systems. Unlike physical keyboards, touchscreen keyboards
support a variety of input methods, such as swipe, which
enables text entry by sliding the finger along a path that vis-
its each letter and lifting the finger only between words. The
ability to enter text in ways other than physical key pressing
has led to a greater variety of text entry strategies employed
by typists [23]. In addition, mobile devices are readily
equipped with additional sensors that offer more insight
into a user's keystroke dynamics. These include the touchscreen
itself, which can sense the location and pressure of each touch,
as well as accelerometer, gyroscope, and orientation sensors.
As with desktop keystroke biometrics, many mobile
keystroke biometric studies have focused on fixed-text se-
quences [24]. Some recent works have considered free-text
sequences on mobile devices. Gascon et al. [13] collected
freely typed samples from over 300 participants and de-
veloped a system that achieved a True Acceptance Rate
(TAR) of 92% at 1% False Acceptance Rate (FAR) (an EER of
about 10%). Their system utilized accelerometer, gyroscope,
time, and orientation features. Each user typed an English
pangram (sentence containing every letter of the alphabet)
approximately 160 characters in length, and classification
was performed by Support Vector Machine (SVM). In other
work, Kim and Kang [12] utilized microbehavioral features
to obtain an EER below 0.05% for 50 subjects with a single
reference sample of approximately 200 keystrokes for both
English and Korean input. The microbehavioral features
consist of angular velocities along three axes when each key
is pressed and released, as well as timing features and the
coordinate of the touch event within each key. See [24] for a
survey of keystroke biometrics on mobile devices.
Because mobile devices are not stationary, mobile
keystroke biometrics depend more heavily on environmen-
tal conditions, such as the user’s location or posture, than
physical keyboards which typically remain stationary. This
challenge of mobile keystroke biometrics was examined
by Crawford and Ahmadzadeh in [25]. They found that
authenticating a user in different positions (sitting, standing,
or walking) performed only slightly better than guessing,
but detecting the user’s position before authentication can
significantly improve performance.
Nowadays, with the proliferation of machine learning
algorithms capable of analysing and learning human behav-
iors from large scale datasets, the performance of keystroke
dynamics in the free-text scenario has been boosted. As an
example, [14] proposes a combination of the existing di-
graphs method for feature extraction plus an SVM classifier
to authenticate subjects. This approach achieves almost 0%
error rate using samples containing 500 keystrokes. These
results are very promising, even though it was evaluated
using a small dataset with only 34 subjects. In [17] the
authors employ an RNN within a Siamese architecture to
authenticate subjects based on 8 biometric modalities on
smartphone devices. They achieved results in a free-text
scenario of 81.61% TAR at 0.1% FAR using just 3-second
test windows with a dataset of 37 subjects.
Previous works in free-text keystroke dynamics have
achieved promising results with up to several hundred
subjects (see Table 1), but they have yet to scale beyond this
limit and leverage emerging machine learning techniques
that benefit from vast amounts of data. Here we take a step
forward in this direction of machine learning-based free-text
keystroke biometrics by using the largest datasets published
to date with 199 million keystrokes from 228,000 subjects
(considering both mobile and desktop datasets). We analyze
to what extent deep learning models are able to scale in
keystroke biometrics to recognize subjects at a large scale
while attempting to minimize the amount of data per subject
required for enrollment.
3 KEYSTROKE DATASETS
All experiments are conducted with two Aalto University
Datasets: 1) the Dhakal et al. dataset [26], which comprises
more than 5GB of keystroke data collected on desktop
keyboards from 168,000 participants; and 2) the Palin et
Fig. 1. Example of the 4 temporal features extracted between two con-
secutive keys: Hold Latency (HL), Inter-key Latency (IL), Press Latency
(PL), and Release Latency (RL).
al. dataset [23], which comprises almost 4GB of keystroke
data collected on mobile devices from 260,000 participants.
The same data collection procedure was followed for both
datasets. The acquisition task required subjects to memorize
English sentences and then type them as quickly and ac-
curately as they could. The English sentences were selected
randomly from a set of 1,525 examples taken from the
Enron mobile email and Gigaword Newswire corpora. The
example sentences contained a minimum of 3 words and a
maximum of 70 characters. Note that the sentences typed
by the participants could contain more than 70 characters
because each participant could forget or add new charac-
ters when typing. All participants in the Dhakal database
completed 15 sessions (i.e. one sentence for each session) on
either a desktop or a laptop physical keyboard. However,
in the Palin dataset only 23% of the participants (60,000 out
of the 260,000 who started the typing test) finished at least
15 sessions. In this paper we employ these 60,000 subjects
with their first 15 sessions in order to allow fair comparisons
between both datasets.
For the data acquisition, the authors launched an online
application that records the keystroke data from participants
who visit their webpage and agree to complete the acqui-
sition task (i.e. the data was collected in an uncontrolled
environment). Press (keydown) and release (keyup) event
timings were recorded in the browser with millisecond reso-
lution using the JavaScript function Date.now. The authors
also reported demographic statistics for both datasets: 72%
of the participants from the Dhakal database took a typing
course, 218 countries were involved, and 85% of them are
native English speakers; meanwhile, only 31% of the
participants from the Palin database took a typing course,
163 countries were involved, and 68% of them are
native English speakers.
4 SYSTEM DESCRIPTION
4.1 Pre-processing and Feature Extraction
The raw data captured in each session includes a time
series with three dimensions: the keycodes, press times, and
release times of the keystroke sequence. Timestamps are in
UTC format with millisecond resolution, and the keycodes
are integers between 0 and 255 according to the ASCII code.
We extract 4 temporal features for each sequence (see
Fig. 1 for details): (i) Hold Latency (HL), the elapsed time
Fig. 2. Architecture of TypeNet for free-text keystroke sequences. The
input x is a time series with shape M × 5 (keystrokes × keystroke
features) and the output f(x) is an embedding vector with shape 1 × 128.
between key press and release events; (ii) Inter-key Latency
(IL), the elapsed time between releasing a key and pressing
the next key; (iii) Press Latency (PL), the elapsed time
between two consecutive press events; and (iv) Release
Latency (RL), the elapsed time between two consecutive
release events. These 4 features are commonly used in both
fixed-text and free-text keystroke systems [27]. Finally, we
include the keycodes as an additional feature.
The 5 features are calculated for each keystroke in the
sequence. Let N be the length of the keystroke sequence,
such that each sequence provided as input to the model is a
time series with shape N × 5 (N keystrokes by 5 features).
All feature values are normalized before being provided
as input to the model. Normalization is important so that
the activation values of neurons in the input layer of the
network do not saturate (i.e. all close to 1). The keycodes
are normalized to between 0 and 1 by dividing each keycode
by 255, and the 4 timing features are converted to seconds.
This scales most timing features to between 0 and 1, as the
average typing rate over the entire dataset is 5.1 ± 2.1 keys
per second. Only latency features that occur either during
very slow typing or long pauses exceed a value of 1.
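As an illustration, the pre-processing above can be sketched as follows. This is a minimal reconstruction, not the authors' released code; the function name extract_features and the zero value assigned to the undefined latencies of the first keystroke (IL, PL, RL are only defined between two consecutive keys) are our assumptions.

```python
import numpy as np

def extract_features(keycodes, press_times, release_times):
    """Build the N x 5 TypeNet input from raw keystroke events.

    Features follow the paper: HL, IL, PL, RL plus the keycode.
    Timestamps are assumed in milliseconds; latencies are converted
    to seconds and keycodes scaled to [0, 1] by dividing by 255.
    """
    keycodes = np.asarray(keycodes, dtype=float)
    press = np.asarray(press_times, dtype=float)
    release = np.asarray(release_times, dtype=float)

    hl = (release - press) / 1000.0                       # Hold Latency
    il = np.r_[0.0, (press[1:] - release[:-1]) / 1000.0]  # Inter-key Latency
    pl = np.r_[0.0, np.diff(press) / 1000.0]              # Press Latency
    rl = np.r_[0.0, np.diff(release) / 1000.0]            # Release Latency

    return np.stack([hl, il, pl, rl, keycodes / 255.0], axis=1)
```

For example, two keystrokes pressed at 0 ms and 200 ms and released at 100 ms and 320 ms yield HL = [0.1, 0.12] s and IL = [0, 0.1] s.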
4.2 TypeNet Architecture
In keystroke dynamics, it is thought that idiosyncratic
behaviors that enable authentication are characterized by
the relationship between consecutive key press and release
events (e.g. temporal patterns, typing rhythms, pauses,
typing errors). In a free-text scenario, keystroke sequences
between enrollment and testing may differ in both length
and content. This motivates our choice of a Recurrent
Neural Network as our keystroke authentication algorithm.
RNNs have proven to be one of the best algorithms to
deal with temporal data (e.g. [28], [29]) and are well suited
for free-text keystroke sequences (e.g. [17], [30]).
Our RNN architecture is depicted in Fig. 2. It is com-
posed of two Long Short-Term Memory (LSTM) layers of
128 units (tanh activation function). Between the LSTM
layers, we perform batch normalization and dropout at a
rate of 0.5 to avoid overfitting. Additionally, each LSTM
layer has a recurrent dropout rate of 0.2.
One constraint when training an RNN using standard
backpropagation through time applied to a batch of se-
quences is that the number of elements in the time dimen-
sion (i.e. number of keystrokes) must be the same for all
sequences. We set the size of the time dimension to M. In
order to train the model with sequences of different lengths
N within a single batch, we truncate the end of the input
sequence when N > M and zero pad at the end when
N < M, in both cases to the fixed size M. Error gradients
are not computed for those zeros and do not contribute
to the loss function at the output layer as a result of the
masking layer shown in Fig. 2.
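The truncation/zero-padding step can be sketched as follows (illustrative only; fix_length is our name, and the default M = 50 simply matches the test sequence length used later in the paper):

```python
import numpy as np

def fix_length(seq, M=50):
    """Force a keystroke sequence of shape (N, 5) to shape (M, 5):
    truncate the end when N > M, zero-pad at the end when N < M.
    The zeros are later ignored by the masking layer."""
    seq = np.asarray(seq, dtype=float)
    if len(seq) >= M:
        return seq[:M]
    pad = np.zeros((M - len(seq), seq.shape[1]))
    return np.vstack([seq, pad])
```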
Finally, the output of the model f(x) is an array of size
1 × 128 that we will employ later as an embedding feature
vector to recognize subjects.
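The architecture just described can be sketched in Keras (the framework named in Sec. 4.4). This is a minimal reconstruction from the textual description, not the authors' released code; M = 50 is an assumed fixed time dimension.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential, layers

M = 50  # fixed time dimension (assumed here; any M works)

def build_typenet():
    """Sketch of the TypeNet embedding model of Sec. 4.2: masking of
    zero padding, two 128-unit LSTM layers (tanh activation,
    recurrent dropout 0.2), with batch normalization and dropout 0.5
    between them. The output is a 128-dimensional embedding f(x)."""
    return Sequential([
        layers.Input(shape=(M, 5)),
        layers.Masking(mask_value=0.0),
        layers.LSTM(128, activation="tanh", recurrent_dropout=0.2,
                    return_sequences=True),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.LSTM(128, activation="tanh", recurrent_dropout=0.2),
    ])
```

The Masking layer is what prevents the zero padding from contributing to the error gradients, as described above.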
4.3 LSTM Training: Loss Functions
Our goal is to build a keystroke biometric system capable of
generalizing to new subjects not seen during model training,
and therefore, having competitive performance when de-
ployed to applications with thousands of users. Our RNN
is trained only once on an independent set of subjects. This
model then acts as a feature extractor that provides input
to a distance-based recognition scheme. After training the
RNN once, we will evaluate in the experimental section the
recognition performance for a varying number of subjects
and enrollment samples per subject.
We train our deep model with three different loss func-
tions: Softmax loss, which is widely used in classification
tasks; Contrastive loss, a loss for distance metric learning
based on two samples [31]; and Triplet loss, a loss for metric
learning based on three samples [32]. These are each defined
as follows.
4.3.1 Softmax loss
Let x_i be a keystroke sequence of individual I_i, and let us
introduce a dense layer after the embeddings described in
the previous section, aimed at classifying the individuals
used for learning (see Fig. 3.a). The Softmax loss is applied
as

\mathcal{L}_S = -\log \frac{e^{f^C_{I_i}(x_i)}}{\sum_{c=1}^{C} e^{f^C_c(x_i)}}   (1)

where C is the number of classes used for learning (i.e. iden-
tities), f^C = [f^C_1, \ldots, f^C_C], and after learning all elements of
f^C will tend to 0 except f^C_{I_i}(x_i), which will tend to 1. Softmax is
widely used in classification tasks because it provides good
performance on closed-set problems. Nonetheless, Softmax
does not optimize the margin between classes. Thus, the
performance of this loss function usually decays for prob-
lems with high intra-class variance. In order to train the
architecture proposed in Fig. 2, we have added an output
classification layer with C units (see Fig. 3.a). During the
training phase, the model will learn discriminative infor-
mation from the keystroke sequences and transform this
information into an embedding space where the embedding
vectors f(x) (the outputs of the model) will be close when
both keystroke inputs belong to the same subject (genuine
pairs), and far apart in the opposite case (impostor pairs).
4.3.2 Contrastive loss
Let x_i and x_j each be a keystroke sequence that together
form a pair which is provided as input to the model. The
Contrastive loss calculates the Euclidean distance between
the model outputs,

d(x_i, x_j) = \| f(x_i) - f(x_j) \|   (2)

where f(x_i) and f(x_j) are the model outputs (embedding
vectors) for the inputs x_i and x_j, respectively. The model will
learn to make this distance small (close to 0) when the input
pair is genuine and large (close to α) for impostor pairs by
computing the loss function L_CL defined as follows:

\mathcal{L}_{CL} = (1 - L_{ij}) \frac{d^2(x_i, x_j)}{2} + L_{ij} \frac{\max^2\{0, \alpha - d(x_i, x_j)\}}{2}   (3)
where L_{ij} is the label associated with each pair, set
to 0 for genuine pairs and 1 for impostor ones, and α ≥ 0
is the margin (the maximum margin between genuine and
impostor distances). The Contrastive loss is trained using
a Siamese architecture (see Fig. 3.b) that minimizes the
distance between embedding vectors from the same class
(d(x_i, x_j) with L_{ij} = 0), and maximizes it for embeddings
from different classes (d(x_i, x_j) with L_{ij} = 1).
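For a single pair of embeddings, Eqs. (2) and (3) can be computed as in the following sketch (our illustration; the margin α = 1.5 is the value reported in Sec. 4.4):

```python
import numpy as np

def contrastive_loss(f_i, f_j, label, alpha=1.5):
    """Contrastive loss of Eq. (3) for one pair of embedding vectors.
    label = 0 for a genuine pair, 1 for an impostor pair;
    alpha is the margin between genuine and impostor distances."""
    d = np.linalg.norm(np.asarray(f_i) - np.asarray(f_j))  # Eq. (2)
    return ((1 - label) * d ** 2 / 2
            + label * max(0.0, alpha - d) ** 2 / 2)
```

A genuine pair with identical embeddings costs nothing, while an impostor pair at distance 0 costs the full margin penalty alpha^2 / 2.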
4.3.3 Triplet loss
The Triplet loss function enables learning from positive and
negative comparisons at the same time (note that the label
L_{ij} eliminates one of the distances for each pair in the
Contrastive loss). A triplet is composed of three different
samples from two different classes: Anchor (A) and Positive
(P) are different keystroke sequences from the same subject,
and Negative (N) is a keystroke sequence from a different
subject. The Triplet loss function is defined as follows:

\mathcal{L}_{TL} = \max\{0, d^2(x^i_A, x^i_P) - d^2(x^i_A, x^j_N) + \alpha\}   (4)
where α is a margin between positive and negative pairs
and d is the Euclidean distance calculated with Eq. 2. In
comparison with Contrastive loss, Triplet loss is capable
of learning intra- and inter-class structures in a single
operation (removing the label L_{ij}). The Triplet loss is trained
using an extension of a Siamese architecture (see Fig. 3.c) for
three samples. This learning process minimizes the distance
between embedding vectors from the same class (d(x_A, x_P)),
and maximizes it for embeddings from different classes
(d(x_A, x_N)).
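Eq. (4) for a single triplet can be sketched as follows (again our illustration, with the same assumed margin α = 1.5):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=1.5):
    """Triplet loss of Eq. (4) for one (Anchor, Positive, Negative)
    triplet of embedding vectors; alpha is the margin."""
    d_ap = np.sum((np.asarray(f_a) - np.asarray(f_p)) ** 2)  # squared anchor-positive distance
    d_an = np.sum((np.asarray(f_a) - np.asarray(f_n)) ** 2)  # squared anchor-negative distance
    return max(0.0, d_ap - d_an + alpha)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so gradient updates concentrate on triplets that still violate that ordering.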
Fig. 3. Learning architecture for the different loss functions a) Softmax loss, b) Contrastive loss, and c) Triplet loss. The goal is to find the most
discriminant embedding space f(x).
4.4 LSTM Training: Implementation Details
We train three RNN versions (i.e. one for each loss func-
tion) for each input device: desktop and mobile, using the
Dhakal and Palin databases, respectively. For the desktop
scenario, we train the models using only the first 68,000
subjects from the Dhakal dataset. For the Softmax function
we train a model with C = 10,000 subjects due to GPU
memory constraints, as the Softmax loss requires a very
wide final layer with many classes. In this case, we used
15 × 10,000 = 150,000 keystroke sequences for training
and the remaining 58,000 subjects were discarded. For the
Contrastive loss we generate genuine and impostor pairs
using all the 15 keystroke sequences available for each
subject. This provides us with 15 × 67,999 × 15 = 15.3
million impostor pair combinations and 15 × 14/2 = 105
genuine pair combinations for each subject. The pairs were
chosen randomly in each training batch ensuring that the
number of genuine and impostor pairs remains balanced
(512 pairs in total in each batch including impostor and
genuine pairs). Similarly, we randomly chose triplets for the
Triplet loss training.
The remaining 100,000 subjects were employed only for
model evaluation, so there is no data overlap between the
two groups of subjects. This reflects an open-set authen-
tication paradigm. The same protocol was employed for
the mobile scenario but adjusting the number of subjects
employed to train and test. In order to have balanced
subsets close to the desktop scenario, we divided the Palin
database in half such that 30,000 subjects were used
to train the models, generating 15 × 29,999 × 15 = 6.75
million impostor pair combinations and 15 × 14/2 = 105
genuine pair combinations for each subject. The other 30,000
subjects were used to test the mobile TypeNet models. Once
again 10,000 subjects were used to train the mobile TypeNet
model with Softmax loss.
Regarding the hyper-parameters employed during train-
ing, the best results for both models were achieved with
a learning rate of 0.05, the Adam optimizer with β1 = 0.9,
β2 = 0.999 and ε = 10^{-8}, and the margin set to α = 1.5.
The models were trained for 200 epochs with 150 batches
per epoch and 512 sequences in each batch. The models
were built in Keras-Tensorflow.
5 EXPERIMENTAL PROTOCOL
5.1 Authentication Protocol
We authenticate subjects by comparing gallery samples x_{i,g}
belonging to subject i in the test set to a query sample
x_{j,q} from either the same subject (genuine match, i = j)
or another subject (impostor match, i ≠ j). The test score
is computed by averaging the Euclidean distances between
each gallery embedding vector f(x_{i,g}) and the query embed-
ding vector f(x_{j,q}) as follows:
s^q_{i,j} = \frac{1}{G} \sum_{g=1}^{G} \| f(x_{i,g}) - f(x_{j,q}) \|   (5)
where G is the number of sequences in the gallery (i.e. the
number of enrollment samples) and q is the query sample of
subject j. Taking into account that each subject has a total of
15 sequences, we retain 5 sequences per subject as the test
set (i.e. each subject has 5 genuine test scores) and let G vary
between 1 ≤ G ≤ 10 in order to evaluate the performance
as a function of the number of enrollment sequences.
To generate impostor scores, for each enrolled subject we choose one test sample from each remaining subject. We define k as the number of enrolled subjects. In our experiments, we vary k in the range 100 ≤ k ≤ K, where K = 100,000 for the desktop TypeNet models and K = 30,000 for the mobile ones. Therefore, each subject has 5 genuine scores and k − 1 impostor scores. Note that we have more impostor scores than genuine ones, a common scenario in keystroke dynamics authentication. The results reported in the next section are computed in terms of Equal Error Rate (EER), which is the value where the False Acceptance Rate (FAR, proportion of impostors classified as genuine) and the False Rejection Rate (FRR, proportion of
genuine subjects classified as impostors) are equal. The error rates are calculated for each subject and then averaged over all k subjects [33].
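As a concrete illustration of this metric, the sketch below computes the EER from two lists of hypothetical distance scores by sweeping a decision threshold; a production evaluation would typically interpolate the FAR/FRR curves rather than take the nearest crossing.

```python
def compute_eer(genuine, impostor):
    """Equal Error Rate for distance scores (lower = more genuine-like).
    Sweeps every observed score as an acceptance threshold and returns
    the error rate where FAR and FRR are closest to equal."""
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s <= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s > t for s in genuine) / len(genuine)     # genuines rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

# Perfectly separated score distributions give an EER of 0:
print(compute_eer([0.1, 0.2, 0.3], [0.7, 0.8, 0.9]))  # 0.0
```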
5.2 Identification Protocol
Identification scenarios are common in forensic applications, where the final decision is based on a body of evidence and the biometric recognition technology can be used to provide a list of candidates, referred to as the background set B in this work. The Rank-1 identification rate reveals the ability to unequivocally identify the target subject among all the subjects in the background set. Rank-n represents the accuracy if we consider a ranked list of n profiles, from which the result is then manually or automatically determined based on additional evidence [34].
The 15 sequences from the k test subjects in the database were divided into two groups: Gallery (10 sequences) and Query (5 sequences). We evaluate the identification rate by comparing the Query set of samples x^Q_{j,q}, with q = 1, ..., 5, belonging to the test subject j against the Background Gallery set x^G_{i,g}, with g = 1, ..., 10, belonging to all background subjects. The distance was computed by averaging the Euclidean distances || · || between each gallery embedding vector f(x^G_{i,g}) and each query embedding vector f(x^Q_{j,q}) as follows:

s^Q_{i,j} = (1/(10 × 5)) Σ_{g=1}^{10} Σ_{q=1}^{5} ||f(x^G_{i,g}) − f(x^Q_{j,q})||    (6)
We then identify a query set (i.e. the query subject j = J is the same as the gallery subject i = I) as follows:

I = arg min_i s^Q_{i,J}    (7)
The results reported in the next section are computed in terms of Rank-n accuracy. A Rank-1 hit means that d_{I,J} < d_{i,J} for any i ≠ I, while Rank-n means that instead of selecting a single gallery profile, we select the n profiles closest to the query by increasing distance d_{i,J}. In forensic scenarios, it is traditional to use Rank-20, Rank-50, or Rank-100 in order to generate a short list of potential candidates that are finally identified by considering other evidence.
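The ranked-list decision of Equations (6)-(7) can be sketched as follows. The distance function and the tiny 1-D "embeddings" are hypothetical placeholders; the double averaging over gallery and query samples mirrors Equation (6), and taking the first element of the ranking corresponds to Equation (7).

```python
def identification_score(gallery, queries, dist):
    """Eq. (6): average distance between every gallery embedding of a
    background subject and every query embedding of the test subject."""
    total = sum(dist(g, q) for g in gallery for q in queries)
    return total / (len(gallery) * len(queries))

def rank_list(background, queries, dist):
    """Background subject indices sorted by increasing score. Eq. (7)
    corresponds to taking the first element; Rank-n to the first n."""
    scores = [identification_score(g, queries, dist) for g in background]
    return sorted(range(len(scores)), key=scores.__getitem__)

# Hypothetical 1-D embeddings: subject 0's gallery is closest to the queries.
background = [[[0.0], [0.1]], [[5.0], [5.1]], [[9.0], [9.1]]]
queries = [[0.05], [0.02]]
dist = lambda u, v: abs(u[0] - v[0])
print(rank_list(background, queries, dist))  # [0, 1, 2]
```

A Rank-n hit simply checks whether the true subject's index appears among the first n entries of the returned list.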
6 EXPERIMENTS AND RESULTS
6.1 Authentication: Varying Amount of Enrollment Data
As discussed in the related works section, one key factor when analyzing the performance of a free-text keystroke authentication algorithm is the amount of keystroke data per subject employed for enrollment. In this work, we study this factor with two variables: the keystroke sequence length M and the number of gallery sequences used for enrollment G.
Our first experiment reveals to what extent M and G affect the authentication performance of our TypeNet models. Note that the input to our models has a fixed size of M after the masking process shown in Fig. 2. For this experiment, we set k = 1,000 (where k is the number of enrolled subjects). Tables 2 and 3 summarize the error rates achieved by the TypeNet models in the desktop and mobile scenarios, respectively, for different values of the sequence length M and the number of enrollment sequences per subject G.
In the desktop scenario (Table 2) we observe that for sequences longer than M = 70 there is no significant improvement in performance. Tripling the number of key events (from M = 50 to M = 150) lowers the EER by only 0.7% on average across all values of G. However, adding more sequences to the gallery yields greater improvements, with about 50% relative error reduction when going from 1 to 10 sequences, independent of M. Comparing the different loss functions, the best results are always achieved by the model trained with triplet loss, with an error rate of 1.2% for M = 70 and G = 10, followed by the contrastive loss (3.9%); the worst results are achieved with the softmax loss (6.0%). For one-shot authentication (G = 1), our approach achieves an error rate of 4.5% using sequences of 70 keystrokes.
Similar trends are observed in the mobile scenario (Table 3). First, increasing the sequence length beyond M = 70 keystrokes does not significantly improve performance, but there is a significant improvement when increasing the number of sequences per subject. The best results are achieved for M = 100 and G = 10, with an error rate of 6.3% by the model trained with triplet loss, followed again by the contrastive loss (10.0%) and softmax (12.3%). For one-shot authentication (G = 1), the performance of the triplet model degrades to 10.7% EER using sequences of M = 100 keystrokes.
Comparing the performance achieved by the three TypeNet models between the mobile and desktop scenarios, we observe that in all cases the results achieved in the desktop scenario are significantly better than those achieved in the mobile scenario. These results are consistent with prior work that has obtained lower performance on mobile devices when only timing features are utilized [2], [24], [35].
Next, we compare TypeNet with our implementation of two state-of-the-art algorithms for free-text keystroke authentication: a statistical sequence model, the POHMM (Partially Observable Hidden Markov Model) from [16], and an algorithm based on digraphs and an SVM from [14]. To allow fair comparisons, all approaches are trained and tested with the same data and experimental protocol: G = 5 enrollment sequences per subject, M = 50 keystrokes per sequence, and k = 1,000 test subjects.
In Fig. 4 we plot the error rates of the three approaches (i.e. Digraphs, POHMM, and TypeNet) trained and tested on both the desktop (left) and mobile (right) datasets. The TypeNet models outperform the previous state-of-the-art free-text algorithms in both scenarios under this experimental protocol, in which the amount of enrollment data is reduced (5 × M = 250 enrollment keystrokes, in comparison to more than 10,000 in related works; see Section 2). This can largely be attributed to the rich embedding feature vector produced by TypeNet, which minimizes the amount of data needed for enrollment. The SVM generally requires a large number of training sequences per subject (around 100), whereas in this experiment we have only 5 training sequences per subject. We hypothesize that this lack of training samples contributes to the poor performance (near chance accuracy) of the Digraphs system based on SVMs.
TABLE 2
Equal Error Rates (%) achieved in the desktop scenario using Softmax/Contrastive/Triplet loss for different values of the parameters M (sequence length) and G (number of enrollment sequences per subject).

#keys per       #enrollment sequences per subject G
sequence M      G = 1           G = 2          G = 5          G = 7          G = 10
30              17.2/10.7/8.6   14.1/9.0/6.4   13.3/7.3/4.6   12.7/6.8/4.1   11.5/3.3/3.7
50              16.8/8.2/5.4    13.1/6.7/3.6   10.8/5.4/2.2   9.2/4.8/1.8    8.8/4.3/1.6
70              14.1/7.7/4.5    10.4/6.2/2.8   7.5/4.8/1.7    6.7/4.3/1.4    6.0/3.9/1.2
100             13.8/7.7/4.2    10.1/6.0/2.7   7.4/4.7/1.6    6.4/4.3/1.4    5.7/3.9/1.2
150             13.8/7.7/4.1    10.1/6.0/2.7   7.4/4.7/1.6    6.5/4.3/1.4    5.8/3.8/1.2
TABLE 3
Equal Error Rates (%) achieved in the mobile scenario using Softmax/Contrastive/Triplet loss for different values of the parameters M (sequence length) and G (number of enrollment sequences per subject).

#keys per       #enrollment sequences per subject G
sequence M      G = 1            G = 2           G = 5           G = 7           G = 10
30              17.7/15.7/14.2   16.0/14.1/12.5  15.2/13.0/11.3  14.9/12.6/10.9  14.5/12.1/10.5
50              17.2/14.6/12.6   15.4/13.1/10.7  13.8/12.1/9.2   13.4/11.5/8.5   12.7/11.0/8.0
70              17.8/13.8/11.3   15.5/12.4/9.5   13.5/11.2/7.8   13.0/10.7/7.2   12.1/10.4/6.8
100             18.4/13.6/10.7   15.8/12.3/8.9   13.6/10.9/7.3   13.0/10.4/6.6   12.3/10.0/6.3
150             18.4/13.7/10.7   15.9/12.3/8.8   13.7/10.8/7.3   13.0/10.4/6.6   12.3/10.0/6.3
Fig. 4. ROC comparisons in free-text biometric authentication for desktop (left) and mobile (right) scenarios between the three proposed TypeNet models and two state-of-the-art approaches: POHMM from [16] and digraphs/SVM from [14]. M = 50 keystrokes per sequence, G = 5 enrollment sequences per subject, and k = 1,000 test subjects.
6.2 Authentication: Varying Number of Subjects
In this experiment, we evaluate to what extent our best TypeNet models (those trained with triplet loss) are able to generalize without performance decay. For this, we scale the number of enrolled subjects k from 100 to K (with K = 100,000 for desktop and K = 30,000 for mobile). For each subject we have 5 genuine test scores and k − 1 impostor scores, one against each other test subject. The models used for this experiment are the same as those trained in the previous section (68,000 independent subjects included in the training phase for desktop and 30,000 for mobile).
Fig. 5 shows the authentication results for one-shot enrollment (G = 1 enrollment sequence, M = 50 keystrokes per sequence) and for the case (G = 5, M = 50) for different values of k. For desktop devices, we can observe that in both cases there is a slight performance decay when we scale from 1,000 to 10,000 test subjects, which is more pronounced in the one-shot case. However, for a large number of subjects
Fig. 5. EER (%) of our proposed TypeNet models when scaling up the number of test subjects k in one-shot (G = 1 enrollment sequence per subject) and 5-shot (G = 5) authentication cases. M = 50 keystrokes per sequence.
(k ≥ 10,000), the error rates do not appear to demonstrate continued growth. For the mobile scenario, the results when scaling from 100 to 1,000 test subjects show a similar tendency to the desktop scenario, with a slightly greater performance decay. However, we can observe an error rate reduction when we continue scaling the number of test subjects up to 30,000. In all cases, the variation of the performance across the number of test subjects is less than 2.5% EER. These results demonstrate the potential of the RNN architecture in TypeNet to authenticate subjects at large scale in free-text keystroke dynamics. We note that in the mobile scenario we have utilized only timing features; prior work has found that greater performance may be achieved by incorporating additional sensor features [12].
6.3 Authentication: Cross-device Interoperability
In this experiment we measure the cross-device interoperability of the best TypeNet models, those trained with the triplet loss. We also study the capacity of both the desktop and mobile TypeNet models to generalize to other input devices. For this, we test each model with a different keystroke dataset than the one employed in its training. Additionally, for this experiment we train a third TypeNet model, called Mixture-TypeNet, with triplet loss using keystroke sequences from both datasets (half of each training batch from each dataset) but keeping the same train/test subject division as the other TypeNet models to allow fair comparisons. To be consistent with the other experiments, we keep the same experimental protocol: G = 5 enrollment sequences per subject, M = 50 keystrokes per sequence, and k = 1,000 test subjects.
Table 4 shows the error rates achieved by the three TypeNet models when testing on the desktop (Dhakal) and mobile (Palin) datasets. We can observe that error rates increase significantly in the cross-device scenario for both the desktop and mobile TypeNet models. This performance decay is alleviated by the Mixture-TypeNet model, which still performs much worse than the other two models trained and
TABLE 4
Equal Error Rates (%) achieved in the cross-device scenario for the three TypeNet models (Desktop, Mobile, and Mixture) when testing on either the desktop [26] or mobile [23] dataset.

                      TypeNet model
Test dataset     Desktop   Mobile   Mixture
Desktop          2.2       21.4     17.9
Mobile           13.7      9.2      12.6
tested in the same-sensor scenario. These results suggest that multiple device-specific models may be superior to a single model when dealing with input from different device types. This would require device-type detection in order to route the enrollment and test samples to the correct model [8].
6.4 Identification based on Keystroke Dynamics
Table 5 presents the identification accuracy for a background of B = 1,000 subjects, k = 10,000 test subjects, G = 10 gallery sequences per subject, and M = 50 keystrokes per sequence. The accuracy obtained in the identification scenario is much lower than the accuracy reported for authentication. In general, the results suggest that keystroke identification enables a 90% size reduction of the candidate list while maintaining almost 100% accuracy (i.e., 100% Rank-100 accuracy with 1,000 subjects). Nonetheless, the results show the superior performance of the triplet loss function and significantly better performance compared to traditional keystroke approaches [14], [16]. While traditional approaches are not suitable for large-scale free-text keystroke applications, the results obtained by TypeNet demonstrate its usefulness in many applications.
The number of background profiles can be further reduced if auxiliary data is available to pre-screen the initial list of gallery profiles (e.g., by country or language). The Aalto University datasets contain auxiliary data including age, country, gender, and keyboard type (desktop vs. laptop), among others. Table 6 shows the subject identification accuracy over the 1,000 subjects with pre-screening by country (i.e., content generated in a country different from the country of the target subject is removed from the background set). The results show that pre-screening based on a single attribute is enough to largely improve the identification rate: Rank-1 identification with pre-screening ranges from 5.5% to 84.0%, while Rank-100 ranges from 42.2% to 100%. These results demonstrate the potential of keystroke dynamics for large-scale identification when auxiliary information is available.
6.5 Input Text Dependency in TypeNet Models
For the last experiment, we examine the effect of the text typed (i.e. the keycodes employed as an input feature in the TypeNet models) on the distances between embedding vectors, and how this may affect model performance. The main drawback of using the keycode as an input feature
TABLE 5
Identification accuracy (Rank-n in %) for a background size B = 1,000. Scenario: D = Desktop, M = Mobile.

Method                 Scenario   Rank-1   Rank-50   Rank-100
Digraph [14]           D          0.1      9.5       15.2
Digraph [14]           M          0.0      8.5       14.4
POHMM [16]             D          6.1      48.4      63.4
POHMM [16]             M          6.5      41.8      53.7
TypeNet (softmax)      D          47.5     96.3      98.7
TypeNet (softmax)      M          23.5     82.6      91.4
TypeNet (contrastive)  D          29.4     97.2      99.3
TypeNet (contrastive)  M          19.0     80.4      89.8
TypeNet (triplet)      D          67.4     99.8      99.9
TypeNet (triplet)      M          25.5     87.5      94.2
TABLE 6
Identification accuracy (Rank-n in %) for a background size B = 1,000 and pre-screening based on the location of the typist. Scenario: D = Desktop. There is no metadata related to the mobile scenario.

Method                 Scenario   Rank-1   Rank-50   Rank-100
Digraph [14]           D          5.5      37.6      42.2
POHMM [16]             D          21.8     78.3      89.7
TypeNet (softmax)      D          68.3     99.39     99.9
TypeNet (contrastive)  D          56.3     99.7      99.9
TypeNet (triplet)      D          84.0     99.9      100
to free-text keystroke algorithms is that the model could potentially learn text-based features (e.g. orthography, linguistic expressions) rather than keystroke dynamics features (e.g. typing speed and style). To analyze this phenomenon, we first introduce the Levenshtein distance (commonly referred to as edit distance) proposed in [36]. The Levenshtein distance dL measures the distance between two words as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. As an example, the Levenshtein distance between "kitten" and "sitting" is dL = 3, because we need to substitute "s" for "k", substitute "i" for "e", and insert "g" at the end (three edits in total). With the Levenshtein distance metric we can measure the similarity of two keystroke sequences in terms of the keys pressed, and analyze whether the TypeNet models could be learning linguistic expressions to recognize subjects. This would be revealed by a high correlation between the Levenshtein distance dL and the Euclidean distance dE of the test scores.
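A compact dynamic-programming implementation of this distance (our own illustrative sketch, not the bit-parallel algorithm of [36]) reproduces the example above:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # deleting i characters of a reaches ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```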
In Fig. 6 we plot the test scores (Euclidean distances) employed in the one-shot scenario (G = 1 enrollment sequence per subject, M = 50 keystrokes per sequence, k = 1,000 test subjects) versus the Levenshtein distance between the gallery and query samples that produced each test score (i.e. dE(f(x_g), f(x_q)) vs. dL(x_g, x_q)). To provide a quantitative comparison, we also calculate the Pearson coefficient p and the linear regression response as a measure of the correlation between the two distances (a smaller slope indicates a weaker relationship). In the mobile scenario (Fig. 6, bottom) we can observe a significant correlation (i.e. a higher slope in the linear regression response and a high p value) between the Levenshtein distances and the test scores: genuine scores show lower Levenshtein distances (i.e. more similar typed text) than impostor ones. This metric therefore provides some evidence that the TypeNet models in the mobile scenario could be using the similarity of linguistic expressions or keys pressed between the gallery and query samples to recognize subjects. These results suggest that the TypeNet models trained in the mobile scenario may perform worse than in the desktop scenario, among other factors, because the mobile TypeNet embeddings show a significant dependency on the entry text. On the other hand, in the desktop scenario (Fig. 6, top) this correlation between test scores and Levenshtein distances is absent (i.e. a small slope in the linear regression response and p ≈ 0), suggesting that the embedding vectors produced by TypeNet models trained with the desktop dataset are largely independent of the input text.
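The correlation analysis above reduces to computing the Pearson coefficient and the least-squares slope between the two lists of paired distances. A minimal sketch over hypothetical paired (dL, dE) values:

```python
import math

def pearson_and_slope(x, y):
    """Pearson correlation coefficient and least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    r = cov / math.sqrt(var_x * var_y)
    slope = cov / var_x
    return r, slope

# Hypothetical paired distances; a perfect linear relation gives r = 1:
r, slope = pearson_and_slope([1, 2, 3, 4], [2, 4, 6, 8])
print(r, slope)  # 1.0 2.0
```

Applied to real (dL, dE) pairs, an r near zero and a flat slope indicate text-independent embeddings, as observed in the desktop scenario.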
7 CONCLUSIONS AND FUTURE WORK
We have presented TypeNet, a new free-text keystroke biometrics system based on an RNN architecture trained with three different loss functions: softmax, contrastive, and triplet. Authentication and identification results were obtained with two datasets at very large scale: one composed of 136 million keystrokes from 168,000 subjects captured on desktop keyboards, and a second composed of more than 63 million keystrokes from 60,000 subjects captured on mobile devices. Deep neural networks have proven effective in face recognition tasks when scaling up to hundreds of thousands of identities [37]. The same capacity has been shown by the TypeNet models in free-text keystroke biometrics.
In all authentication scenarios evaluated in this work, the models trained with triplet loss have shown superior performance, especially when there are many subjects but few enrollment samples per subject. The results achieved in this work outperform previous state-of-the-art algorithms. Our results range from 17.2% to 1.2% EER in the desktop scenario and from 17.7% to 6.3% EER in the mobile scenario, depending on the amount of subject data enrolled. A good balance between performance and the amount of enrollment data per subject is achieved with 5 enrollment sequences and 50 keystrokes per sequence, which yields an EER of 2.2%/9.2% (desktop/mobile) for 1,000 test subjects. These results suggest that our approach achieves error rates close to those of state-of-the-art fixed-text algorithms [18], within 5% error even when the enrollment data is scarce.
Scaling up the number of test subjects does not significantly affect performance: the EER in the desktop scenario increases by only 5% in relative terms with respect to the previous 2.2% when scaling from 1,000 to 100,000 test subjects, while in the mobile scenario the EER decreases by up to 15% in relative terms. Evidence of the EER stabilizing around 10,000 subjects demonstrates the potential of this
Fig. 6. Levenshtein distances vs. test scores in the desktop (top) and mobile (bottom) scenarios for the three TypeNet models. For qualitative comparison we plot the linear regression fit (red line) and the Pearson correlation coefficient p.
architecture to perform well at large scale. However, the error rates of both models increase in the cross-device interoperability scenario: evaluating the TypeNet model trained in the desktop scenario with the mobile dataset, the EER increases from 2.2% to 13.7%, and it increases from 9.2% to 21.4% for the TypeNet model trained with the mobile dataset when testing with the desktop dataset. A mixture model trained with samples from both datasets outperforms the single-device TypeNet models in the cross-device scenario, but with significantly worse results compared to single-device development and testing.
In addition to the authentication results, identification experiments have also been conducted. Here again, the TypeNet models trained with triplet loss have shown superior performance at all ranks evaluated. For Rank-1, the TypeNet models trained with triplet loss achieve an accuracy of 67.4%/25.5% (desktop/mobile) with a background size of B = 1,000 identities, whereas previous related works barely achieve 6.5% accuracy. For Rank-50, the TypeNet model trained with triplet loss achieves almost 100% accuracy in the desktop scenario and up to 87.5% in the mobile one. The results improve further when auxiliary data is used to pre-screen the initial list of gallery profiles (e.g., by country or language), showing the potential of TypeNet models to perform well not only in authentication but also in identification tasks. Finally, we have demonstrated that text-entry dependencies in the TypeNet models are negligible in the desktop scenario, although in the mobile scenario there is some correlation between the input text typed and the performance achieved.
For future work, we will improve the way training pairs/triplets are chosen in Siamese/triplet training. Currently, the pairs are chosen randomly; however, recent work has shown that choosing hard pairs during the training phase can improve the quality of the embedding feature vectors [38]. We will also explore improved learning architectures based on a combination of short- and long-term modeling, which has proven very useful for modeling behavioral biometrics [39].
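A minimal sketch of what such hard-pair selection could look like, assuming a batch of embeddings and a distance function; all names here are hypothetical, and [38] describes more refined distance-weighted sampling than this hardest-negative heuristic:

```python
def hardest_negative_triplets(anchors, positives, negatives, dist):
    """For each (anchor, positive) pair, select the negative currently
    closest to the anchor, i.e. the hardest one, instead of sampling
    negatives uniformly at random."""
    triplets = []
    for a, p in zip(anchors, positives):
        hardest = min(negatives, key=lambda n: dist(a, n))
        triplets.append((a, p, hardest))
    return triplets

# Hypothetical 1-D embeddings: 0.3 is the hardest negative for anchor 0.0.
dist = lambda u, v: abs(u - v)
out = hardest_negative_triplets([0.0], [0.1], [5.0, 0.3], dist)
print(out)  # [(0.0, 0.1, 0.3)]
```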
In addition, we plan to test our model with other free-text keystroke databases to analyze its performance in other scenarios [40], and to investigate alternative ways to combine the multiple sources of information [34] originated in the proposed framework, e.g., the multiple distances in Equation (6). Integration of keystroke data with other information captured at the same time in desktop [4] and mobile [41] acquisition will also be explored.
Finally, the proposed TypeNet models will be valuable beyond user authentication and identification, for applications related to human behavior analysis such as profiling [42], bot detection [43], and e-health [44].
ACKNOWLEDGMENTS
This work has been supported by projects: PRIMA
(MSCA-ITN-2019-860315), TRESPASS-ETN (MSCA-ITN-
2019-860813), BIBECA (RTI2018-101248-B-I00 MINECO),
edBB (UAM), and Instituto de Ingenieria del Conocimiento
(IIC). A. Acien is supported by a FPI fellowship from the
Spanish MINECO.
REFERENCES
[1] S. Banerjee and D. Woodard, “Biometric authentication and iden-
tification using keystroke dynamics: A survey,” Journal of Pattern
Recognition Research, vol. 7, pp. 116–139, Jan. 2012. 1
JOURNAL OF L
A
T
E
X CLASS FILES, VOL. 14, NO. 8, FEBRUARY 2021 12
[2] D. Buschek, A. De Luca, and F. Alt, “Improving accuracy, applica-
bility and usability of keystroke biometrics on mobile touchscreen
devices,” in Proc. of the ACM Conference on Human Factors in
Computing Systems, 2015, pp. 1393–1402. 1,7
[3] A. Acien, A. Morales, R. Vera-Rodriguez, and J. Fierrez,
“Keystroke mobile authentication: Performance of long-term ap-
proaches and fusion with behavioral profiling,” in Proc. Iberian
Conf. on Pattern Recognition and Image Analysis (IBPRIA), ser. LNCS,
vol. 11868. Springer, July 2019, pp. 12–24. 1
[4] J. Hernandez-Ortega, R. Daza, A. Morales, J. Fierrez, and J. Ortega-
Garcia, “edBB: Biometrics and Behavior for assessing remote ed-
ucation,” in AAAI Workshop on Artificial Intelligence for Education
(AI4EDU), February 2020. 1,11
[5] T. Yasseri, R. Sumi, A. Rung, A. Kornai, and J. Kertesz, “Dynamics
of conflicts in Wikipedia,” PLOS ONE, vol. 7, no. 6, pp. 1–12, 06
2012. 1
[6] J. Fierrez-Aguilar, D. Garcia-Romero, J. Ortega-Garcia, and
J. Gonzalez-Rodriguez, “Adapted user-dependent multimodal
biometric authentication exploiting general information,” Pattern
Recognition Letters, vol. 26, no. 16, pp. 2628–2639, December 2005.
2
[7] A. Acien, J. V. Monaco, A. Morales, R. Vera-Rodriguez, and
J. Fierrez, “TypeNet: Scaling up keystroke biometrics,” in Proc.
IEEE/IAPR International Joint Conference on Biometrics (IJCB),
September 2020. 2
[8] F. Alonso-Fernandez, J. Fierrez, D. Ramos, and J. Gonzalez-
Rodriguez, “Quality-based conditional processing in multi-
biometrics: application to sensor interoperability,” IEEE Trans. on
Systems, Man and Cybernetics Part A, vol. 40, no. 6, pp. 1168–1179,
2010. 2,9
[9] R. Tolosana, R. Vera-Rodriguez, J. Fierrez, A. Morales, and
J. Ortega-Garcia, “Benchmarking desktop and mobile handwriting
across cots devices: the e-biosign biometric database,” PLOS ONE,
vol. 5, no. 12, 2017. 2
[10] F. Monrose and A. Rubin, “Authentication via keystroke dynam-
ics,” in Proc. of the 4th ACM Conference on Computer and Communi-
cations Security, 1997, pp. 48–56. 2,3
[11] D. Gunetti and C. Picardi, “Keystroke analysis of free text,” ACM
Transactions on Information and System Security, vol. 8, no. 3, pp.
312—-347, Aug. 2005. 2,3
[12] J. Kim and P. Kang, “Freely typed keystroke dynamics-based
user authentication for mobile devices based on heterogeneous
features,” Pattern Recognition, vol. 108, p. 107556, 2020. 3,9
[13] H. Gascon, S. Uellenbeck, C. Wolf, and K. Rieck, “Continuous
authentication on mobile devices by analysis of typing motion be-
havior,” Sicherheit 2014–Sicherheit, Schutz und Zuverl¨assigkeit, 2014.
3
[14] H. C¸ eker and S. Upadhyaya, “User authentication with keystroke
dynamics in long-text data,” in Proc. of IEEE 8th International
Conference on Biometrics Theory, Applications and Systems (BTAS),
2016. 3,7,8,9,10
[15] C. Murphy, J. Huang, D. Hou, and S. Schuckers, “Shared dataset
on natural human-computer interaction to support continuous
authentication research,” in Proc. of IEEE/IAPR International Joint
Conference on Biometrics (IJCB), 2017, pp. 525–530. 2,3
[16] J. V. Monaco and C. C. Tappert, “The partially observable Hidden
Markov Model and its application to keystroke dynamics,” Pattern
Recognition, vol. 76, pp. 449–462, 2018. 3,7,8,9,10
[17] D. Deb, A. Ross, A. K. Jain, K. Prakah-Asante, and K. V. Prasad,
“Actions speak louder than (pass)words: Passive authentication
of smartphone users via deep temporal features,” in Proc. of IAPR
International Conference on Biometrics (ICB), 2019. 3,4
[18] A. Morales, J. Fierrez, R. Tolosana, J. Ortega-Garcia, J. Galbally,
M. Gomez-Barrero, A. Anjos, and S. Marcel, “Keystroke Biometrics
Ongoing Competition,” IEEE Access, vol. 4, pp. 7736–7746, Nov.
2016. 2,10
[19] J. V. Monaco, “Robust keystroke biometric anomaly detection,”
arXiv preprint arXiv:1606.09075, Jun. 2016. 2
[20] F. Bergadano, D. Gunetti, and C. Picardi, “User authentication
through keystroke dynamics,” ACM Transactions on Information and
System Security, vol. 5, no. 4, pp. 367–397, Nov. 2002. 2
[21] M. L. Ali, K. Thakur, C. C. Tappert, and M. Qiu, “Keystroke
biometric user verification using Hidden Markov Model,” in Proc.
of IEEE 3rd International Conference on Cyber Security and Cloud
Computing (CSCloud), 2016, pp. 204–209. 2
[22] T. Sim and R. Janakiraman, “Are digraphs good for free-text
keystroke dynamics?” in Proc. of IEEE Conference on Computer
Vision and Pattern Recognition, 2007. 2
[23] K. Palin, A. Feit, S. Kim, P. O. Kristensson, and A. Oulasvirta,
“How do people type on mobile devices? observations from a
study with 37,000 volunteers.” in Proc. of 21st ACM International
Conference on Human-Computer Interaction with Mobile Devices and
Services (MobileHCI’19), 2019. 3,4,9
[24] P. S. Teh, N. Zhang, A. B. J. Teoh, and K. Chen, “A survey on touch
dynamics authentication in mobile devices,” Computers & Security,
vol. 59, pp. 210–235, 2016. 3,7
[25] H. Crawford and E. Ahmadzadeh, “Authentication on the go:
Assessing the effect of movement on mobile device keystroke
dynamics,” in Thirteenth Symposium on Usable Privacy and Security
(SOUPS 2017), 2017, pp. 163–173. 3
[26] V. Dhakal, A. M. Feit, P. O. Kristensson, and A. Oulasvirta,
“Observations on typing from 136 million keystrokes,” in Proc.
of the ACM CHI Conference on Human Factors in Computing Systems,
2018. 3,9
[27] A. Alsultan and K. Warwick, “Keystroke dynamics authentication:
A survey of free-text,” International Journal of Computer Science
Issues (IJCSI), vol. 10, pp. 1–10, 01 2013. 4
[28] R. Tolosana, R. Vera-Rodriguez, J. Fierrez, and J. Ortega-Garcia,
“BioTouchPass2: Touchscreen password biometrics using Time-
Aligned Recurrent Neural Networks,” IEEE Transactions on Infor-
mation Forensics and Security, 2020. 4
[29] Tolosana, Ruben and Vera-Rodriguez, Ruben and Fierrez, Julian
and Ortega-Garcia, Javier, “Deepsign: Deep on-line signature ver-
ification,” IEEE Transactions on Biometrics, Behavior, and Identity
Science, 2021. 4
[30] X. Lu, Z. Shengfei, and Y. Shengwei, “Continuous authentication
by free-text keystroke based on CNN plus RNN,” Procedia Com-
puter Science, vol. 147, pp. 314–318, 01 2019. 4
[31] R. Hadsell, S. Chopra, and Y. Lecun, “Dimensionality reduction
by learning an invariant mapping,” in Proc. Computer Vision and
Pattern Recognition Conference, 2006. 5
[32] K. Q. Weinberger and L. K. Saul, “Distance metric learning for
large margin nearest neighbor classification,” Journal of Machine
Learning Research, vol. 10, pp. 207–244, 2009. 5
[33] A. Morales, J. Fierrez, and J. Ortega-Garcia, “Towards predicting
good users for biometric recognition based on keystroke dynam-
ics,” in Proc. of European Conference on Computer Vision Workshops,
ser. LNCS, vol. 8926. Springer, September 2014, pp. 711–724. 7
[34] J. Fierrez, A. Morales, R. Vera-Rodriguez, and D. Camacho, “Mul-
tiple classifiers in biometrics. Part 2: Trends and challenges,”
Information Fusion, vol. 44, pp. 103–112, November 2018. 7,11
[35] N. Banovic, V. Rao, A. Saravanan, A. K. Dey, and J. Mankoff,
“Quantifying aversion to costly typing errors in expert mobile text
entry,” in Proc. of the CHI Conference on Human Factors in Computing
Systems, 2017, pp. 4229––4241. 7
[36] H. Hyyro, “Bit-parallel approximate string matching algorithms
with transposition,” Journal of Discrete Algorithms, vol. 3, no. 2, pp.
215–229, 2005. 10
[37] I. Kemelmacher-Shlizerman, S. M. Seitz, D. Miller, and E. Brossard,
“The megaface benchmark: 1 million faces for recognition at
scale,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 4873–4882. 10
[38] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, “Sam-
pling matters in deep embedding learning,” in Proc. of the IEEE
International Conference on Computer Vision, 2017, pp. 2840–2848. 11
[39] R. Tolosana, P. Delgado-Santos, A. Perez-Uribe, R. Vera-Rodriguez,
J. Fierrez, and A. Morales, “DeepWriteSYN: On-line handwriting
synthesis via deep short-term representations,” in AAAI Conf. on
Artificial Intelligence (AAAI), February 2021. 11
[40] A. Acien, A. Morales, R. Vera-Rodriguez, J. Fierrez, and O. Delgado, “Smartphone sensors for modeling human-computer interaction: General outlook and research datasets for user authentication,” in IEEE Conf. on Computers, Software, and Applications (COMPSAC), July 2020.
[41] A. Acien, A. Morales, R. Vera-Rodriguez, and J. Fierrez, “MultiLock: Mobile active authentication based on multiple biometric and behavioral patterns,” in Proc. ACM Intl. Conf. on Multimedia, Workshop on Multimodal Understanding and Learning for Embodied Applications (MULEA), October 2019, pp. 53–59.
[42] A. Acien, A. Morales, J. Fierrez, R. Vera-Rodriguez, and J. Hernandez-Ortega, “Active detection of age groups based on touch interaction,” IET Biometrics, vol. 8, no. 1, pp. 101–108, January 2019.
[43] A. Acien, A. Morales, J. Fierrez, R. Vera-Rodriguez, and O. Delgado-Mohatar, “BeCAPTCHA: Behavioral bot detection using touchscreen and mobile sensors benchmarked on HuMIdb,” Engineering Applications of Artificial Intelligence, vol. 98, p. 104058, February 2021.
[44] L. Giancardo, A. Sánchez-Ferro, T. Arroyo-Gallego, I. Butterworth, C. S. Mendoza, P. Montero, M. Matarazzo, J. A. Obeso, M. L. Gray, and R. S. J. Estépar, “Computer keyboard interaction as an indicator of early Parkinson’s disease,” Scientific Reports, vol. 6, October 2018.
Alejandro Acien received the MSc in Electrical Engineering in 2015 from Universidad Autonoma de Madrid. In October 2016, he joined the Biometric Recognition Group - ATVS at the Universidad Autonoma de Madrid, where he is currently collaborating as an assistant researcher pursuing the PhD degree. His current research activities focus on behavioral biometrics, human-machine interaction, cognitive biometric authentication, machine learning, and deep learning.
Aythami Morales received his M.Sc. degree in Telecommunication Engineering in 2006 from ULPGC, and his Ph.D. degree from ULPGC in 2011. He conducts his research in the BiDA Lab at Universidad Autónoma de Madrid, where he is currently an Associate Professor. He has performed research stays at the Biometric Research Laboratory at Michigan State University, the Biometric Research Center at Hong Kong Polytechnic University, the Biometric System Laboratory at University of Bologna, and the Schepens Eye Research Institute. His research interests include pattern recognition, machine learning, trustworthy AI, and biometrics. He is the author of more than 100 scientific articles published in international journals and conferences, and 4 patents. He has received awards from ULPGC, La Caja de Canarias, SPEGC, and COIT. He has participated in several National and European projects in collaboration with other universities and private entities such as ULPGC, UPM, EUPMt, Accenture, Unión Fenosa, Soluziona, and BBVA.
John V. Monaco is an Assistant Professor in the Computer Science Department at the Naval Postgraduate School in Monterey, CA. His research focuses on user and device fingerprinting, security and privacy in human-computer interaction, and the development of neural-inspired computer architectures. Dr. Monaco is the recipient of Best Paper Awards at the 2020 Conference on Human Factors in Computing Systems and the 2017 International Symposium on Circuits and Systems. His work is supported by the National Reconnaissance Office and the Army Network Enterprise Technology Command.
Ruben Vera-Rodriguez received the M.Sc. degree in telecommunications engineering from Universidad de Sevilla, Spain, in 2006, and the Ph.D. degree in electrical and electronic engineering from Swansea University, U.K., in 2010. Since 2010, he has been affiliated with the Biometric Recognition Group, Universidad Autonoma de Madrid, Spain, where he has been an Associate Professor since 2018. His research interests include signal and image processing, pattern recognition, HCI, and biometrics, with emphasis on signature, face, and gait verification and forensic applications of biometrics. He has published over 100 scientific articles in international journals and conferences. He is actively involved in several National and European projects focused on biometrics. He was Program Chair for the IEEE 51st International Carnahan Conference on Security and Technology (ICCST) in 2017, the 23rd Iberoamerican Congress on Pattern Recognition (CIARP) in 2018, and the International Conference on Biometric Engineering and Applications (ICBEA) in 2019.
Julian Fierrez received the MSc and PhD degrees from Universidad Politecnica de Madrid, Spain, in 2001 and 2006, respectively. Since 2004 he has been with Universidad Autonoma de Madrid, where he has been an Associate Professor since 2010. His research is on signal and image processing, AI fundamentals and applications, HCI, forensics, and biometrics for security and human behavior analysis. He is Associate Editor for Information Fusion, IEEE Trans. on Information Forensics and Security, and IEEE Trans. on Image Processing. He has received best paper awards at AVBPA, ICB, IJCB, ICPR, ICPRS, and Pattern Recognition Letters, and several research distinctions, including the EBF European Biometric Industry Award 2006, the EURASIP Best PhD Award 2012, the Miguel Catalan Award to the Best Researcher under 40 in the Community of Madrid in the general area of Science and Technology, and the IAPR Young Biometrics Investigator Award 2017. Since 2020 he has been a member of the ELLIS Society.