HAL Id: tel-02343930
https://tel.archives-ouvertes.fr/tel-02343930
Submitted on 3 Nov 2019
Tracking Versus Security: Investigating the Two Facets
of Browser Fingerprinting
Antoine Vastel
To cite this version:
Antoine Vastel. Tracking Versus Security: Investigating the Two Facets of Browser Fingerprinting. Computer Science [cs]. Université de Lille Nord de France, 2019. English. tel-02343930
Tracking Versus Security:
Investigating the Two Facets of
Browser Fingerprinting.
Antoine Vastel
Supervisors: Prof. Romain Rouvoy and Prof. Walter Rudametkin
Université de Lille
This dissertation is submitted for the degree of
Doctor of Philosophy in Computer Science
Thesis jury:
President: Prof. Gilles Grimaud, University of Lille
Reviewers: Prof. Daniel Le Métayer, Inria, and Prof. Christophe Rosenberger, ENSICAEN
Examiners: Dr. Nataliia Bielova, Inria, and Dr. Clémentine Maurice, CNRS
October 24th, 2019
Acknowledgements
I would like to thank everyone who contributed to the realization of this thesis.
First, I would like to thank my two supervisors, Romain Rouvoy and Walter Rudametkin. It has been a pleasure to work and exchange ideas with you during these 3 years of Ph.D. Thank you Walter Rudametkin, a second time, for having motivated me to do a Ph.D. when I was still a 4th-year engineering student at Polytech Lille. Besides working together, I have also greatly enjoyed our different trips to conferences and to Mexico. Thank you, Lionel Seinturier, for your team management. Thanks to you, I have been able to fully focus on my research.
During these 3 years of Ph.D., it has always been a pleasure to come to the office. That is why I would like to thank all the members of the Spirals team. In particular, Vikas Mishra, Antonin Durey, Guillaume Fieni, and Thomas Durieux, for their skills in designing crawler logos, as well as for the beers at "La Capsule". I would also like to thank Pierre Laperdrix. Thanks a lot for your supervision during my first internship at Inria, as well as for the two papers we wrote together.
I have also been lucky to do an internship at Brave during my Ph.D. I would therefore like to thank Peter Snyder and Ben Livshits for their supervision during this internship. Thank you also to everyone in the London office, in particular Blake Loring, Ruba Abu-Salma, Leo Feng and Yezi Li.
Thank you, Marcia Marron, for welcoming us before the deadlines. Your cooking skills
and the tequila really helped us to get our papers accepted.
Finally, I want to thank my friends and my family. In particular, my mother, for having supported my decision to do a Ph.D., as well as Amélie, my girlfriend, for her support all along this adventure.
Abstract
Nowadays, a wide range of devices can browse the web, from smartphones and desktop computers to connected TVs. To improve their browsing experience, users also customize settings in their browser, such as displaying the bookmark bar or setting their preferred languages. Customization and the diversity of devices are at the root of browser fingerprinting. Indeed, to manage this diversity, websites can access attributes about the device using JavaScript APIs, without asking for user consent. The combination of such attributes is called a browser fingerprint and has been shown to be highly unique, making fingerprinting a suitable tracking technique. Its stateless nature also makes it suitable for enhancing authentication or detecting bots. In this thesis, I report three contributions to the browser fingerprinting field:
1. I collect 122K fingerprints from 2,346 browsers and study their stability over more than 2 years. I show that, despite frequent changes in the fingerprints, a significant fraction of browsers can be tracked over a long period;
2. I design a test suite to evaluate fingerprinting countermeasures. I apply this test suite to 7 countermeasures, some of them claiming to generate consistent fingerprints, and show that all of them can be identified, which can make their users more identifiable;
3. I explore the use of browser fingerprinting for crawler detection. I measure its use in the wild, as well as the main detection techniques. Since fingerprints are collected on the client side, I also evaluate its resilience against an adversarial crawler developer that tries to modify the crawler's fingerprints to bypass security checks.
Résumé
Nowadays, a wide diversity of devices, such as smartphones, computers, or connected TVs, can browse the web. To adapt their browsing experience, users also modify various settings, such as displaying the bookmark bar or their preferred languages. This diversity of devices and configurations is at the origin of browser fingerprinting. Indeed, to manage this diversity, websites can access information related to the device's configuration through JavaScript APIs, without obtaining the user's prior consent. The combination of this information is called a browser fingerprint and is very often unique; it can therefore be used for marketing-oriented tracking. Nevertheless, since fingerprints are not stored on the machine, this technique is also interesting for applications related to web security. Through this thesis, I propose 3 contributions related to the field of browser fingerprinting:
1. I collect 122,000 fingerprints from 2,346 browsers and analyze their stability over more than 2 years. I show that, despite frequent changes in their fingerprint, a significant share of browsers can be tracked over long periods;
2. I design a test suite to evaluate the resistance of protection tools against browser fingerprint tracking. I apply it to 7 protection tools and show that all of them can be detected, which can make their users more easily identifiable, and therefore vulnerable to tracking;
3. Finally, I explore the use of browser fingerprints for crawler detection. After measuring the use of this technique on the web, I present the different attributes and tests enabling detection. Since browser fingerprints are collected on the client side, I also evaluate the resilience of this form of detection against an adversary developing crawlers whose fingerprints have been modified.
Table of contents
List of figures
List of tables

I Preface

1 Introduction
1.1 Motivations
1.2 Contributions
1.2.1 Tracking Browser Fingerprint Evolutions
1.2.2 Studying the Privacy Implications of Browser Fingerprinting Countermeasures
1.2.3 Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers
1.3 List of Scientific Publications
1.4 List of Tools and Prototypes
1.5 Outline
II State of the Art

2 State-of-the-art
2.1 Context
2.1.1 Browsers Evolution
2.1.2 Monetizing Content on the Web: Advertising and Tracking
2.2 Browser Fingerprinting
2.2.1 Definition
2.2.2 Building a Browser Fingerprint
2.2.3 Studying Browser Fingerprints Diversity
2.2.4 Use of Browser Fingerprinting on the Web
2.3 Countermeasures Against Fingerprinting
2.3.1 Blocking Fingerprinting Script Execution
2.3.2 Breaking Fingerprint Stability
2.3.3 Breaking the Uniqueness of Browser Fingerprints
2.3.4 Summary of Existing Countermeasures
2.3.5 Limits of Fingerprinting Countermeasures
2.4 Security Applications
2.4.1 Enhancing Web Security Using Browser Fingerprinting
2.4.2 Detecting Bots and Crawlers Without Fingerprinting
2.5 Conclusion
2.5.1 FP-Stalker: Tracking Browser Fingerprint Evolutions
III Contributions

3 FP-Stalker: Tracking Browser Fingerprint Evolutions
3.1 Browser Fingerprint Evolutions
3.2 Linking Browser Fingerprints
3.2.1 Browser Fingerprint Linking
3.2.2 Rule-based Linking Algorithm
3.2.3 Hybrid Linking Algorithm
3.3 Empirical Evaluation of FP-Stalker
3.3.1 Key Performance Metrics
3.3.2 Comparison With Panopticlick's Linking Algorithm
3.3.3 Dataset Generation Using Fingerprint Collect Frequency
3.3.4 Tracking Duration
3.3.5 Benchmark/Overhead
3.3.6 Threats to Validity
3.3.7 Discussion
3.4 Conclusion

4 FP-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies
4.1 Investigating Fingerprint Inconsistencies
4.1.1 Uncovering OS Inconsistencies
4.1.2 Uncovering Browser Inconsistencies
4.1.3 Uncovering Device Inconsistencies
4.1.4 Uncovering Canvas Inconsistencies
4.2 Empirical Evaluation
4.2.1 Implementing FP-Scanner
4.2.2 Evaluating FP-Scanner
4.2.3 Benchmarking FP-Scanner
4.3 Discussion
4.3.1 Privacy Implications
4.3.2 Perspectives
4.3.3 Threats to Validity
4.4 Conclusion

5 FP-Crawlers: Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers
5.1 Detecting Crawler Blocking and Fingerprinting Websites
5.1.1 Detecting Websites Blocking Crawlers
5.1.2 Detecting Websites that Use Fingerprinting
5.2 Analyzing Fingerprinting Scripts
5.2.1 Describing our Experimental Dataset
5.2.2 Detecting Crawler-Specific Attributes
5.2.3 Checking Browser Inconsistencies
5.2.4 Checking OS Inconsistencies
5.2.5 Checking Screen Inconsistencies
5.2.6 Other Non-fingerprinting Attributes
5.3 Detecting Crawler Fingerprints
5.3.1 Experimental Protocol
5.3.2 Experimental Results
5.4 Discussion
5.4.1 Limits of Browser Fingerprinting
5.4.2 Threats to Validity
5.4.3 Ethical Considerations
5.5 Conclusion

IV Final Remarks

6 Conclusion
6.1 Contributions
6.1.1 FP-Stalker: Tracking Browser Fingerprint Evolutions
6.1.2 FP-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies
6.1.3 FP-Crawlers: Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers
6.2 Future Work
6.2.1 Automating Crawler Detection Rules Learning
6.2.2 Investigate New Fingerprinting Attributes
6.2.3 Studying Fingerprinting for Authentication
6.2.4 Developing Web Red Pills
6.3 Future of Browser Fingerprinting

References

Appendix A List of fingerprinting attributes collected
A.1 Navigator properties
A.2 Screen properties
A.3 Window properties
A.4 Audio methods
A.5 WebGL methods
A.6 Canvas methods
A.7 WebRTC methods
A.8 Other methods

Appendix B Overriding crawlers' fingerprints
B.1 Overriding the user agent
B.2 Deleting the webdriver property
B.3 Adding a language header
B.4 Forging a fake Chrome object
B.5 Overriding permissions behavior
B.6 Overriding window and screen dimensions
B.7 Overriding codecs support
B.8 Removing traces of touch support
B.9 Overriding toStrings
List of figures
2.1 Schema representing the process to collect a browser fingerprint. To make the schema more comprehensible, we consider that all resources, including the fingerprinting script, are delivered by the first party.
2.2 Example of two canvas fingerprints used by commercial fingerprinting scripts.
2.3 Presentation of the different attributes related to the size of the screen and the window.
2.4 Examples of two 3D scenes generated with WebGL using Cao et al.'s [1] approach.
2.5 Example of a Google reCAPTCHA.
3.1 Number of fingerprints and distinct browser instances per month.
3.2 Browser fingerprint anonymity set sizes.
3.3 CDF of the elapsed time before a fingerprint evolution for all the fingerprints, and averaged per browser instance.
3.4 FP-Stalker: Overview of both algorithm variants. The rule-based algorithm is simpler and faster but the hybrid algorithm leads to better fingerprint linking.
3.5 First 3 levels of a single tree classifier from our forest.
3.6 Overview of our evaluation process that allows testing the algorithms using different simulated collection frequencies.
3.7 Example of the process to generate a simulated test set. The dataset contains fingerprints collected from browsers A and B, which we sample at a collect_frequency of 2 days to obtain a dataset that allows us to test the impact of collect_frequency on fingerprint tracking.
3.8 Average tracking duration against simulated collect frequency for the three algorithms.
3.9 Average maximum tracking duration against simulated collect frequency for the three algorithms. This shows averages of the longest tracking durations that were constructed.
3.10 Average number of assigned ids per browser instance against simulated collect frequency for the three algorithms (lower is better).
3.11 Average ownership of tracking chains against simulated collect frequency for the three algorithms. A value of 1 means the tracking chain contains only fingerprints of the same browser instance.
3.12 CDF of average and maximum tracking duration for a collect frequency of 7 days (FP-Stalker hybrid variant only).
3.13 Distribution of number of ids per browser for a collect frequency of 7 days (FP-Stalker hybrid variant only).
3.14 Speedup of average execution time against number of processes for FP-Stalker's hybrid variant.
3.15 Execution times for FP-Stalker hybrid and rule-based to link a fingerprint using 16 processes. Time is dependent on the size of the test set. The increased effectiveness of the hybrid variant comes at the cost of slower execution times.
4.1 Overview of the inconsistency test suite.
4.2 Two examples of canvas fingerprints: (a) a genuine canvas fingerprint without any countermeasure installed in the browser and (b) a canvas fingerprint altered by the Canvas Defender countermeasure, which applies a uniform noise to all the pixels in the canvas.
4.3 Detection accuracy and false positive rate using the transparent pixels test for different values of Ntp (number of transparent pixels).
4.4 Detection accuracy and false positive rate using the fonts test for different values of Nf (number of fonts associated with the wrong OS).
4.5 Detection accuracy and false positive rate of the browser feature test for different values of Ne (number of wrong features).
4.6 Execution time of FingerprintJS2 inconsistency tests and FP-Scanner with different settings.
5.1 Overview of FP-Crawlers: In Section 5.1 I crawl the Alexa Top 10K to measure the ratio of websites using fingerprinting for crawler detection. In Section 5.2, I explain the key fingerprinting techniques they use. Finally, in Section 5.3 I evaluate the resilience of fingerprinting against adversarial crawlers.
5.2 For each kind of crawler, we report on the average number of times per crawl it is blocked by websites that use and that do not use fingerprinting.
List of tables
2.1 Definition of different attributes that provide information about the size of the screen and the window. For each attribute we present a possible value of the attribute. All the possible values shown in the table come from the same user.
2.2 Overview of the different countermeasures and their strategies to protect against browser fingerprinting.
3.1 An example of a browser fingerprint collected by the AmIUnique extension.
3.2 Durations the attributes remained constant for the median, the 90th and the 95th percentiles.
3.3 Feature importances of the random forest model calculated from the fingerprint train set.
3.4 Number of fingerprints per generated test set after simulating different collect frequencies.
4.1 Mapping between common OS and platform values.
4.2 Mapping between OS and substrings in WebGL renderer/vendor attributes for common OSes.
4.3 List of attributes collected by our fingerprinting script.
4.4 List of relevant tests per countermeasure.
4.5 Optimal values of the different parameters to optimize, as well as the FPR and the accuracy obtained by executing the test with the optimal value.
4.6 Comparison of accuracies per countermeasure.
4.7 FP-Scanner steps failed by countermeasures.
5.1 Different fingerprinting tests associated with the scripts that use them. One symbol indicates that the attribute is collected and that a verification test is run directly in the script; the other symbol indicates that the attribute is collected but there is no verification test in the script.
5.2 Support of audio codecs for the main browsers.
5.3 Support of video codecs for the main browsers.
5.4 List of crawlers and altered attributes.
Part I
Preface
Chapter 1
Introduction
1.1 Motivations
As users, we all have our own way to browse the web. Some users browse the web using
a smartphone, while others prefer to use a laptop, sometimes with an external monitor.
Some users decide to have the bookmark bar visible in their browser, while others prefer
to increase the default font size because they usually sit far from their monitor. This
diversity of devices, browsers, and operating systems, as well as their customization,
is at the root of browser fingerprinting. In his thesis, Mayer [2] showed that browsers could be uniquely identified because of their configuration. Indeed, to adapt websites' behavior based on the user's device, browsers enable scripts to access information about the device and its configuration using JavaScript APIs. The combination of attributes collected from these APIs is called a browser fingerprint and can be collected by tracking scripts without obtaining the user's consent.
In 2010, Eckersley [3] studied browser fingerprint uniqueness. He created the Panopticlick website and collected more than 470K browser fingerprints, among which 83.6% were unique. He also showed that more than 94.2% of the fingerprints were unique when the Flash or Java plugins were activated. Because of this uniqueness, he argued browser fingerprints can be used for tracking. In particular, they can be used in addition to cookies, to respawn them when they have been deleted by the user. Indeed, while cookies are stored in the browser and can, therefore, be erased, browser fingerprints are collected in the browser but are then stored on a remote server the user has no control over.
After Eckersley's study, several studies [4-6] have measured the use of fingerprinting in the wild. These studies all showed that fingerprinting was used by a significant fraction of the most popular websites. They also showed that commercial fingerprinters adapt their behavior to leverage new APIs. Indeed, while Eckersley showed that both Flash and Java could be used to obtain the list of fonts, in their 2013 study, Nikiforakis et al. [4] showed that none of the commercial fingerprinters they studied were still using Java. They also noticed that since Flash was getting less popular due to its deprecation, one of the fingerprinters was using a new approach to obtain the list of fonts using JavaScript. More recently, Englehardt et al. [6] showed that fingerprinters had found new approaches to exploit APIs introduced by HTML5, such as the canvas, WebGL and audio APIs.
To protect against fingerprinting, several countermeasures have been proposed, ranging from simple browser extensions that lie about the nature of the device to forked browsers that lie about the list of available fonts [7-9]. Nikiforakis [4] and Acar [10] evaluated the effectiveness of fingerprinting countermeasures, such as simple user agent spoofers, or FireGloves, a browser extension that randomly lies about attributes constituting a fingerprint. Their evaluations showed that countermeasures could be detected because they generated inconsistent fingerprints. Thus, they argued that using these kinds of countermeasures could be counterproductive for a user since she could become more identifiable.
Besides tracking, browser fingerprinting can also be used to improve web security. The main use case studied in the literature is to enhance authentication [11-14] by using the fingerprint as a second factor. Bursztein et al. [15] showed that browser fingerprinting can also be used to detect crawlers. They proposed a dynamic challenge-response protocol that leverages the unpredictable and yet stable nature of canvas rendering to detect devices that lie about their nature—e.g., emulated devices or devices that modify the browser and OS contained in their user agent.
In this thesis, I aim at improving the understanding of browser fingerprinting, both concerning its impact on privacy and its applications to improve web security. Concerning its impact on privacy, this thesis aims to answer the following research questions:
1. Are fingerprints stable enough to be used for tracking?
2. How long can browsers be tracked using only their fingerprint?
3. What is the overhead of using browser fingerprinting tracking algorithms at scale?
4. Are fingerprinting countermeasures effective and what are the privacy implications of using these countermeasures?
Regarding the adoption of browser fingerprinting in a security context, this thesis aims to answer the following research questions:
1. How widespread is the use of fingerprinting for crawler detection among popular websites?
2. What fingerprinting techniques are used specifically to detect crawlers?
3. How resilient is browser fingerprinting against an adversary that alters its fingerprint to escape detection?
1.2 Contributions
1.2.1 Tracking Browser Fingerprint Evolutions
While browser fingerprints need to be both unique and stable for tracking, studies tend to focus only on uniqueness at the expense of stability. Nevertheless, browser fingerprints can change frequently for several reasons, ranging from a browser or a driver update to a change in the browser settings. Thus, I argue that it is essential to accurately measure browser fingerprint stability, in particular the stability of the different attributes constituting it, and whether or not it varies across browsers. Moreover, to better understand how effective browser fingerprinting is as a tracking mechanism, there is a need to measure how long browsers can be tracked using only their fingerprints.
To address the study of the stability and the tracking duration, I analyze more than 122K browser fingerprints from 2,346 distinct browsers collected over a two-year period using the AmIUnique browser extensions. My results confirm Eckersley's findings that fingerprints change frequently. I show that half of the browser instances display at least one change in their fingerprint in less than five days. Nevertheless, I observe discrepancies across browsers, with some browsers having frequent changes in their fingerprints and others with more stable fingerprints. I also study the stability of fingerprinting techniques that were not available when Eckersley's study was conducted. In particular, I show that—in addition to having a high entropy—the canvas fingerprint is one of the most stable attributes in a fingerprint. For half of the browsers, its value remains stable for more than 300 days. Then, I study how long browsers can be tracked using only their fingerprints. I propose two linking algorithms, one based on rules, and another hybrid one that leverages both rules and machine learning to link fingerprint evolutions over time. I show that while a significant fraction of browsers is immune to fingerprinting, mostly because their fingerprints are not unique or are too close to other fingerprints, around 32% of browsers can be tracked for more than 100 days. Moreover, I show that these linking algorithms can be easily parallelized to run on cheap public cloud instances, making fingerprinting a threat to privacy.
1.2.2 Studying the Privacy Implications of Browser Fingerprinting Countermeasures
Different defense strategies and countermeasures have been proposed to protect against browser fingerprinting. With new APIs being added frequently to browsers, it is difficult to always have up-to-date countermeasures that protect against new forms of fingerprinting. Moreover, studies [4, 10] revealed the risk of becoming more identifiable when using fingerprinting countermeasures. Thus, I study the privacy implications of using fingerprinting countermeasures, and whether or not they are counterproductive. I propose FP-Scanner, a test suite that detects inconsistent fingerprints created by fingerprinting countermeasures. I apply FP-Scanner to 7 different countermeasures, ranging from simple browser extensions to peer-reviewed forked browsers, and I show that even when countermeasures claim to generate consistent fingerprints, their presence can be revealed. Beyond spotting fingerprinting countermeasures, I demonstrate that FP-Scanner can also recover original values, such as the browser or the operating system. I leverage my findings to discuss different strategies for building more effective fingerprinting countermeasures that do not degrade user privacy.
1.2.3 Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers
Although some studies showed browser fingerprinting can be used in a security context, for example, to enhance authentication or to detect emulated devices, fingerprinting is often associated with unwanted tracking. I propose to study the use of browser fingerprinting in a security context, as a mechanism to detect bots, in particular crawlers, on the web. I show that fingerprinting for crawler detection is popular among websites of the Alexa Top 10,000. I study the techniques used by commercial fingerprinting scripts. While these scripts use techniques also used for tracking, such as canvas fingerprinting or font enumeration, they have also developed specific techniques that aim at identifying whether a fingerprint belongs to a known headless browser or whether a browser is instrumented. I also evaluate the effectiveness and resilience of such detection techniques. Indeed, using fingerprinting in a security context is challenging due to the adversarial nature of an attacker. Since browser fingerprints are collected in the browser, a skilled attacker can modify their values to bypass security checks. Thus, I show that, while crawler detection using fingerprinting provides better results against simple crawlers with few modifications to their fingerprints than other existing approaches, it fails to detect crawlers with more modifications, as well as non-headless crawlers. Therefore, my results show that fingerprinting can quickly detect simple headless crawlers, while its integration in a layered approach, in addition to other existing detection approaches, can strongly increase its resilience.
1.3 List of Scientific Publications
During the course of this thesis, I published papers in the following conferences and workshops:
[16] Vastel, A., Laperdrix, P., Rudametkin, W., & Rouvoy, R. (2018, May). FP-STALKER: Tracking Browser Fingerprint Evolutions. In IEEE S&P 2018 - 39th IEEE Symposium on Security and Privacy (pp. 1-14). IEEE: https://hal.inria.fr/hal-01652021 (note 1)
[17] Vastel, A., Laperdrix, P., Rudametkin, W., & Rouvoy, R. (2018). FP-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies. In 27th USENIX Security Symposium (USENIX Security 18) (pp. 135-150): https://hal.inria.fr/hal-01820197 (note 2)
[18] Vastel, A., Rudametkin, W., & Rouvoy, R. (2018, April). FP-TESTER: Automated Testing of Browser Fingerprint Resilience. In 2018 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW) (pp. 103-107). IEEE: https://hal.inria.fr/hal-01717158 (note 2)
Vastel, A., Blanc, X., Rudametkin, W., & Rouvoy, R. FP-Crawlers: Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers (under submission). (note 1)
[19] Vastel, A., Snyder, P., & Livshits, B. Who Filters the Filters: Understanding the Growth, Usefulness and Efficiency of Crowdsourced Ad Blocking (under submission): https://arxiv.org/abs/1810.09160 (note 3)
Note 1: I am the main author of the paper. I wrote the majority of its contents. I proposed the contributions and the evaluation protocol and I wrote the experimental framework.
Note 2: I am the main author of the paper. I wrote the majority of its contents. I proposed most of the contributions and the evaluation protocol and I wrote the experimental framework.
Note 3: I am the main author of the paper. I wrote a significant part of its contents. I proposed some of the contributions and some of the evaluation protocol. I was the main contributor of the experimental framework.
1.4 List of Tools and Prototypes
During the course of this thesis, I developed several algorithms, tools, prototypes, and libraries to gather data, test different research hypotheses, or simply make our research more accessible. To encourage the reproducibility of my results, I published the entirety of the code:
- Implementations of FP-Stalker, our two algorithms to link browser fingerprints over time [20];
- Implementation of FP-Scanner, our test suite to detect inconsistencies introduced by fingerprinting countermeasures [21];
- Code of the crawlers and the labeling interface used in Chapter 5 to explore the use of browser fingerprinting for crawler detection [22];
- An open-source implementation of Picasso canvas as described in Bursztein et al.'s [15] paper [23];
- FP-Collect, a browser fingerprinting library oriented towards bot detection [24];
- FP-Scanner (bis), a library that leverages FP-Collect browser fingerprints to detect bots [25].
1.5 Outline
The thesis is organized as follows.
Chapter 2 starts by introducing the context of this thesis. I present how the diversity of devices and customization are at the root of browser fingerprinting. Then, I define what browser fingerprinting is, what the main attributes constituting a fingerprint are, and how they are collected. I review the existing literature on browser fingerprinting. I analyze existing fingerprinting countermeasures along with their main shortcomings. I also explore existing approaches that use fingerprinting in a security context. Finally, I present other non-fingerprinting crawler detection approaches, such as time series analysis and CAPTCHAs, to explain how fingerprinting compares to them.
Chapter 3 presents my study on tracking using fingerprinting. Most large-scale studies focus on fingerprint uniqueness. In Chapter 3, I fill the gap by studying the stability of fingerprints over more than 2 years using data collected from the AmIUnique browser extensions. Moreover, I propose two linking algorithms that aim at linking evolutions of fingerprints of the same browser over time and show that, despite frequent changes, a significant fraction of browsers can be tracked for more than 100 days. This chapter is an extension of the FP-Stalker paper [16] published at S&P 18 and includes 25,000 more fingerprints than the original paper.
Chapter 4 investigates the privacy impact of fingerprinting countermeasures. Because countermeasures may generate inconsistent fingerprints, they can be detected and harm their users' privacy by making them more identifiable. I design a test suite that leverages inconsistencies to detect the presence of fingerprinting countermeasures and show that all of the 7 countermeasures I evaluate can be detected. This chapter was originally published as a conference paper entitled FP-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies [17] at USENIX Security 18.
Chapter 5 explores the use of fingerprinting in the context of crawler detection. I explore its popularity among websites of the Alexa Top 10K and describe the main detection techniques used by commercial fingerprinters to distinguish humans from bots. Because fingerprints can be modified, I also measure the resilience of this approach against an adversarial crawler developer.
Finally, Chapter 6 concludes this thesis by summarizing my contributions, proposing future work, and discussing a possible future for browser fingerprinting.
Part II
State of the Art
Chapter 2
State-of-the-art
2.1 Context
2.1.1 Browsers Evolution
The complexity of browsers has continuously increased over time. Before 1995 and the introduction of the JavaScript language, web pages consisted only of static content structured using HTML tags that describe the semantics of the content. Thus, it was not possible for pages to perform dynamic tasks on the client side, such as reacting to clicks or mouse movements. In 1995, Brendan Eich developed the JavaScript language while working at Netscape, the company behind the proprietary Netscape browser. The introduction of this new language in browsers started a new era of a more dynamic web.
An increasing diversity of APIs. Since then, browser vendors have kept adding new features to attract users. Applications that were once available only as heavy desktop clients are now available as web applications that run in a browser. For example, advanced text and slide editors, such as Microsoft Word or Open Office Impress, were only available as desktop clients. Nowadays, several online services propose similar tools running as web applications, such as Google Docs, Slides.com, and Prezi. From a developer's point of view, web applications are supposed to make the development process more convenient, as they remove the burden of managing device compatibility issues. Indeed, web applications should be able to run in all browsers that stick to the web standards. Besides text and slide editors, other complex applications, such as video games and real-time video chats, can now efficiently run in browsers. Supporting these applications required browser vendors to add several APIs, such as the canvas and WebGL APIs to efficiently generate 2D and 3D shapes, or the WebRTC API that enables real-time communication.
An increasing diversity of devices. In addition to the increasing number of APIs and features available in browsers, the diversity of devices capable of browsing the web has also increased drastically. While a few years ago only desktop computers could browse the web, a wide range of devices, from mobile phones to desktop computers and connected TVs embedding a browser, can now do so. To help websites manage this diversity of devices, for example, to better display the content or adapt the website to the performance of the device, browser vendors provide several JavaScript APIs that enable websites to access information about the device, which, as we show in this thesis, is at the root of browser fingerprinting.
Evolution of browsers' market share and its consequences. While during the two browser wars (https://en.wikipedia.org/wiki/Browser_wars) there was a race between the different browser vendors to continuously add more features, often at the expense of a proper evaluation of their impact on privacy, nowadays the situation has stabilized, with fewer browser vendors left. Google, with its Chrome browser, represents more than 62% of the browser market share (http://gs.statcounter.com/browser-market-share#monthly-201812-201812-map), followed by Safari with 15% and Firefox with less than 5% of the market share. Browser vendors and the World Wide Web Consortium (W3C) tend to better take into account the privacy aspects before introducing new APIs, in particular how an API could be used to fingerprint a browser. For example, in the case of the new navigator.deviceMemory attribute introduced in December 2017 (https://developer.mozilla.org/en-US/docs/Web/API/Navigator/deviceMemory), the W3C recommended rounding the returned value to reduce the fingerprinting risk (https://w3c.github.io/device-memory/#sec-security-considerations). Moreover, privacy and security have become strong commercial arguments (https://www.theverge.com/2019/3/14/18266276/apple-iphone-ad-privacy-facetime-bug). Thus, major browser vendors, such as Mozilla and Apple, added more user-friendly mechanisms to manage privacy preferences and countermeasures in their browsers, such as the anti-fingerprinting protection in Firefox (https://blog.mozilla.org/futurereleases/2019/04/09/protections-against-fingerprinting-and-cryptocurrency-mining-available-in-firefox-nightly-and-beta/), or the Intelligent Tracking Prevention (ITP) in Safari (https://webkit.org/blog/7675/intelligent-tracking-prevention/). New privacy-friendly browsers, such as Brave and Cliqz, have also emerged. Browser vendors are also more willing to take measures to fix security issues, even when they impact the user experience by adding significant performance overhead. For example, Google Chrome added site isolation (https://security.googleblog.com/2018/07/mitigating-spectre-with-site-isolation.html) to enhance security, in particular against side-channel attacks such as Spectre and Meltdown, even though this can lead to a memory increase of 10%. Similarly, browser vendors deprecated browser plugins because of the security issues they engendered, even though some of them—e.g., the Adobe Flash plugin—were used on popular websites like YouTube (https://youtube-eng.googleblog.com/2015/01/youtube-now-defaults-to-html5_27.html). Instead, they favored browser extensions, which have fewer privileges than plugins and use a system of permissions similar to the one used for mobile applications.
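As an illustration of this design trend, the snippet below reads the navigator.deviceMemory attribute mentioned above; the bucketing behavior follows the W3C Device Memory specification, and the concrete console output is an assumed example.

```javascript
// navigator.deviceMemory exposes the approximate amount of RAM in GiB.
// To limit its fingerprinting value, the specification requires the value
// to be rounded to a power of two and clamped to the range 0.25 to 8,
// so a 16 GiB machine and a 32 GiB machine both report 8.
if ('deviceMemory' in navigator) {
  console.log(navigator.deviceMemory); // e.g., 8 (assumed example output)
} else {
  // The attribute is not implemented by all browsers.
  console.log('deviceMemory not supported by this browser');
}
```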
2.1.2 Monetizing Content on the Web: Advertising and Tracking
Evolution of online advertising. Advertising is the most popular way to monetize content on the web [26]. Nevertheless, from the first online advertising banners in 1995 to today's advanced ad-targeting platforms, advertising has gone through multiple stages. At the beginning of online advertising, websites charged advertisers an upfront cost to occupy some space with a banner on their website. Because of the popularity of these banners, advertisers started to help their customers choose the most appropriate audience for their banners depending on the demography of the users they were trying to target. To help companies measure how their advertising campaigns were performing in real time, DoubleClick introduced a service called DART (Dynamic Advertising Reporting and Targeting) that aimed at helping companies measure the number of times their ads had been viewed and clicked on the different websites their ads were present on. This new feature was game-changing and led to the creation of a new pricing model. While advertisers used to pay websites to host their banner, no matter the amount of traffic, views, and clicks, after the introduction of DART the price started to depend on the number of times ads were viewed (cost per impression). Around 2000, search engines became increasingly important in the web ecosystem, providing users with a convenient way to find relevant content on an ever-growing web. Search engines monetized their popularity by enabling advertisers to target users based on the keywords they were searching for. This also created a new shift in the advertising pricing model, with the introduction of pay per click instead of pay per impression. Finally, around 2005, advertisers started to gather data to make advertising more relevant to users and therefore maximize their income. This technique, called behavioral or targeted advertising, consists in gathering data about users, such as their IP address, the pages they have visited, and the products they bought online, to build profiles of interests that are later used to serve more relevant ads. Since users only see ads they are interested in, they are more likely to click on them, which increases the advertisers' revenues.
The tracking industry. To build these user profiles, the advertising industry heavily relies on trackers. Trackers are scripts or images used to gather and transmit data to the tracking company's servers. To increase the amount of data collected, trackers are placed on several websites, most of the time not owned by the tracking company, as third-party resources. To incentivize websites to use trackers on their pages, trackers tend to provide a useful service. For example, trackers may take the form of social media widgets, such as the Facebook Like button or the Twitter retweet button, that aim at increasing the website's visibility by making it more easily shareable on social media. Trackers can also take the form of analytics services, e.g., Google Analytics, to help websites better understand their audience. To keep track of users over time and across different websites, trackers generate a unique user identifier (UUID) that they store in the browser using cookies or other storage APIs, such as local and session storage, as well as IndexedDB. Trackers also misuse the ETag cache header to store and retrieve user identifiers. The idea behind multiplying the number of storage mechanisms is that, if a user deletes only one of her stored identifiers, the deleted identifier can still be regenerated from the other storage mechanisms.
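The following sketch illustrates this redundancy idea with three common storage mechanisms; the key name is hypothetical, crypto.randomUUID assumes a recent browser, and real trackers combine many more mechanisms, including the ETag trick mentioned above.

```javascript
// Sketch of redundant identifier storage (hypothetical key name).
// The same UUID is written to several storage mechanisms; if the user
// clears one of them, the identifier survives in another and is restored.
const KEY = 'tracker_uuid';

function readAll() {
  const cookieEntry = document.cookie
    .split('; ')
    .find((c) => c.startsWith(KEY + '='));
  const fromCookie = cookieEntry ? cookieEntry.split('=')[1] : null;
  return [fromCookie, localStorage.getItem(KEY), sessionStorage.getItem(KEY)];
}

function writeAll(uuid) {
  document.cookie = KEY + '=' + uuid + '; max-age=31536000; path=/';
  localStorage.setItem(KEY, uuid);
  sessionStorage.setItem(KEY, uuid);
}

// Reuse any surviving copy of the identifier, otherwise generate a new one.
const uuid = readAll().find((v) => v) || crypto.randomUUID();
writeAll(uuid);
```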
Data protection laws. Because of the invasive nature of trackers, policymakers have proposed laws to protect users' data. One of the most recent and important laws is the European General Data Protection Regulation (GDPR), which requires websites and trackers to obtain user consent before they gather data. Moreover, websites are required to specify the purpose of the data collection, as well as the list of companies they will share the data with. While previous laws used to specifically target cookies (https://www.cookielaw.org/the-cookie-law/), the GDPR is more general. Thus, when it refers to the notion of a user identifier, it does not refer only to explicit identifiers stored in cookies, but to any form of data that could be used as an identifier, for example, a browser fingerprint.
Conclusion. To gather information about users, the online advertising industry heavily relies on trackers that take different forms, ranging from social media widgets to analytics services. To keep track of users' identities over time and across different websites, trackers store a unique user identifier in the browser using cookies or other storage mechanisms. Nevertheless, by using a single storage mechanism, trackers run the risk of losing track of a user, and her valuable information, when she deletes her cookies. Thus, some trackers have come up with a more invasive tracking technique: browser fingerprinting. This technique consists in gathering attributes about the user's device and configuration using APIs provided by browsers. Due to the high diversity of devices and configurations, the combination of these attributes, called a browser fingerprint, is often unique and can therefore be used for tracking. Contrary to cookies, which can be erased by the user, fingerprints cannot be deleted since they are not stored on the user's device, making it more difficult for users to protect themselves.
2.2 Browser Fingerprinting
2.2.1 Definition
A browser fingerprint is a set of attributes that can be used to identify a browser. The analogy with a digital fingerprint arises from the fact that this combination of attributes is often unique [3, 27]. Browser fingerprints are used for tracking purposes, as well as for security purposes, such as bot detection or enhancing authentication. One of the main differences between browser fingerprinting and cookies lies in the stateless nature of browser fingerprints. While cookies used for tracking rely on storing an identifier in the browser, browser fingerprints are totally stateless, which means they are not stored on the user's device, making their detection more difficult and their deletion impossible.
In this thesis, the words fingerprint and browser fingerprint, as well as the words fingerprinting and browser fingerprinting, are used interchangeably. Moreover, we consider only permissionless browser fingerprinting—i.e., attributes that can be accessed without requesting any permission from the user. Thus, it excludes several attributes, such as the precise geolocation using the navigator.geolocation API or advanced forms of WebRTC fingerprinting [6] that can obtain the names of multimedia peripherals connected to a device. While this definition of fingerprinting is widely accepted in the literature, the different analyses of fingerprinting scripts conducted during this thesis also show that commercial fingerprinters do not use attributes that require permissions. Nevertheless, in the case where fingerprinting is used for more legitimate purposes, such as enhancing authentication, we consider that these attributes could be part of the fingerprints, as users would probably have more incentives to grant their authorization to the fingerprinting script.
Attributes constituting a browser fingerprint can either be collected in the browser using JavaScript or plugins, such as Flash, or be sent by the browser, such as HTTP headers. Typically, the IP address, or the geolocation that can be derived from it, is not considered part of a browser fingerprint [3, 27, 28]. This definition also excludes other forms of fingerprinting techniques, such as TCP fingerprinting [29], a technique that leverages lower-level information from the TCP stack, such as the order of the TCP options. While our definition of browser fingerprinting allows fingerprints to be collected both on computers and on mobile devices, the only constraint is that they must be collected using a browser. Thus, it excludes all forms of fingerprinting conducted using applications, whether or not they require permissions, such as presented by Kurtz et al. [30] and Wu et al. [31].
Collecting browser fingerprints. Figure 2.1 provides an overview of the process to collect a browser fingerprint. When a user visits a website with her browser, it sends a GET request to the server to retrieve a page. Upon receiving the request, the server sends a response containing the content of the page. Fingerprinting scripts are included as JavaScript files in the returned HTML. These scripts may be served as first-party scripts by the domain visited, or by third-party domains to track users across different websites. Once loaded, the fingerprinting script executes to collect the different attributes. In practice, most fingerprinting scripts wait for the Document Object Model (DOM) to be loaded, since the script may need to interact with it to collect some fingerprinting attributes, such as the list of fonts. After the fingerprinting script finishes executing, it needs to transmit the collected fingerprint to a server. Some fingerprinting scripts transmit the whole list of attributes, while others simply compute a hash that is transmitted. Different techniques can be used to transmit the fingerprint to a remote server. If only a hash is transmitted, or if the collected fingerprint is small, the fingerprint can be sent using an image pixel, where the value of the fingerprint is added as a GET parameter of the image URL. When fingerprints are too big to be sent as images, scripts can trigger a POST request using the XMLHttpRequest API (https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest) or the navigator.sendBeacon API (https://developer.mozilla.org/en-US/docs/Web/API/Navigator/sendBeacon). The sendBeacon function has the advantage of being asynchronous, which means that data can be transmitted when a user closes a tab, without blocking it. This is particularly useful when fingerprinters also collect dynamic information, such as clicks and mouse movements, in addition to the fingerprinting attributes. Thus, besides the fingerprint, they can also monitor all of the user's activity on the page. While this feature is also interesting for security purposes, one should be careful since the beforeunload event, used to signal that a user is closing the page, is often badly implemented in headless browsers (https://github.com/GoogleChrome/puppeteer/issues/2386).

Figure 2.1: Schema representing the process to collect a browser fingerprint (the browser requests a page; the server returns the page along with JavaScript and CSS stylesheets; the fingerprinting script executes in the browser, collects a fingerprint, and transmits it to a remote server). To make the schema more comprehensible, we consider that all resources, including the fingerprinting script, are delivered by the first party.
Upon reception of the fingerprint, the server can also collect the HTTP headers associated with the GET or POST request used to send the fingerprint, add these attributes to the fingerprint, and then store it in a database.
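To make this process concrete, here is a minimal sketch of a fingerprinting script following the steps above; the attribute set is deliberately reduced, and the collection endpoint /fp is hypothetical.

```javascript
// Minimal fingerprinting sketch with a reduced attribute set.
// The collection endpoint '/fp' is hypothetical.
document.addEventListener('DOMContentLoaded', () => {
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d');
  ctx.font = '14px Arial';
  // The rendered pixels vary across devices, drivers, and font stacks.
  ctx.fillText('Browser fingerprint', 2, 20);

  const fingerprint = {
    userAgent: navigator.userAgent,
    languages: navigator.languages,
    platform: navigator.platform,
    screen: [screen.width, screen.height, screen.colorDepth],
    timezoneOffset: new Date().getTimezoneOffset(),
    canvas: canvas.toDataURL(),
  };

  // sendBeacon transmits asynchronously, even while the tab is closing;
  // a POST via XMLHttpRequest would work as well.
  navigator.sendBeacon('/fp', JSON.stringify(fingerprint));
});
```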
2.2.2 Building a Browser Fingerprint
In this subsection, we present the different attributes constituting a browser fingerprint.
While fingerprinting can be used for security purposes, we focus on attributes used for tracking. We provide more details about fingerprinting attributes used for security at the end of this chapter, as well as in Chapter 5, where we explain how commercial fingerprinters detect crawlers based on their fingerprint. Fingerprint attributes must satisfy two properties when used for tracking (see the sketch after this list):
1. Uniqueness. While each attribute need not be unique individually, their combination—i.e., the browser fingerprint—should be unique in order to distinguish between different browsers. Indeed, if different browsers have the same fingerprint, they cannot be tracked using browser fingerprinting.
2. Stability. Even in the case where a browser fingerprint is unique, tracking requires a certain stability of the fingerprint. Indeed, if we consider an extension that randomizes the value of a canvas at each visit, then the browser fingerprint keeps being unique solely because the canvas is unique. Nevertheless, since the canvas keeps changing, it becomes challenging for a fingerprinter to keep track of the fingerprint over time.
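To make the tension between the two properties concrete, the snippet below sketches a naive exact-match linking strategy; the attribute values are made up for illustration, and real linking algorithms are considerably more tolerant of changes.

```javascript
// Naive exact-match linking treats the serialized fingerprint as an
// identifier. If a countermeasure randomizes the canvas value at every
// visit, two visits from the same browser no longer match, even though
// every other attribute is identical, so they cannot be linked.
const base = { userAgent: 'Mozilla/5.0 ...', screen: '1920x1080' };

const visit1 = { ...base, canvas: 'randomized-noise-1' };
const visit2 = { ...base, canvas: 'randomized-noise-2' };

const id1 = JSON.stringify(visit1);
const id2 = JSON.stringify(visit2);
console.log(id1 === id2); // false: unique at each visit, but unstable
```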
We distinguish three main families of attributes constituting a fingerprint: HTTP headers, attributes collected using JavaScript, and attributes collected using Flash. For each family, we present its different attributes, explain how they are collected, and provide examples, as well as information about each attribute, such as its uniqueness.
2.2.2.1 HTTP Headers
When a browser sends an HTTP request to obtain a page, or to transmit data using the XMLHttpRequest API for example, it attaches headers to its request that provide information to the server receiving it. The role of these headers has been defined in different Requests For Comments (RFCs), in particular RFC 7231 [32], which defines the semantics and contents of the headers. The RFC also explains how some of the headers leak information about the user or the device, and the risk that they can be used for fingerprinting (Section 9.7 of the RFC, https://tools.ietf.org/html/rfc7231#section-9.7).
We present four different HTTP headers, as well as a fifth attribute, the order of the headers, that leak information about the device and its user and can therefore be used for fingerprinting.
User-Agent.
This header provides information about the device and the software, a browser in our case, sending the request. The semantics and the content of this header are defined in Section 5.5.3 of RFC 7231 [33]. It can be used by servers to gather analytics data, or for compatibility purposes when an application is only available on certain kinds of devices. The User-Agent header provides several pieces of information useful for fingerprinting, such as the browser and its version, as well as the Operating System (OS). To protect against fingerprinting, the RFC advises developers not to include fine-grained details about the device. Nevertheless, it does not specify any format for the User-Agent header. Thus, as we show in the table presenting examples of user agents, some applications on mobile devices with an embedded browser may indicate sensitive information, such as the name of the carrier.
Examples of User-Agent values and the devices they describe:
- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36": Chrome browser version 72 on MacOS.
- "Opera/9.30 (Nintendo Wii; U; ; 3642; en)": Opera browser on a Wii.
- "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/16B92 [FBAN/MessengerForiOS;FBAV/192.0.0.46.101;FBBV/131204877;FBDV/iPhone8,4;FBMD/iPhone;FBSN/iOS;FBSV/12.1;FBSS/2;FBCR/Play;FBID/phone;FBLC/pl_PL;FBOP/5]": Browser integrated in the Messenger app on an iPhone.
Accept-Language.
This header is sent by the browser to indicate the languages the user prefers [34]. The user can declare multiple languages, each one associated with a preference value. This preference value, also called quality value, is specified using a q parameter. Thus, both the list of languages and their associated quality values chosen by the user can be collected to be part of a fingerprint. Contrary to the majority of the fingerprinting attributes, which reflect the nature of the device or the browser, this attribute reflects the user's preferences.
Accept-Encoding.
This header is sent by the browser to indicate the accepted encodings for the response. Similarly to the Accept-Language header, the browser can indicate multiple encodings, each with a quality value. Nevertheless, quality values are not commonly used with this header in the main browsers.
Examples of Accept-Language values:
- "ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7": Russian in priority, then American English, then any form of English.
- "en,en-US;q=0.9,de-DE;q=0.8,de;q=0.7,fr;q=0.6,pl;q=0.5,uk;q=0.4,ru;q=0.3,sv;q=0.2,nb;q=0.1": American English or any form of English in priority, then German, French, Polish, Ukrainian, Russian, Swedish, Norwegian.
- "zh-CN,zh;q=0.9,en;q=0.8": Chinese, then English.
Examples of Accept-Encoding values:
- "br, gzip, deflate": Encoding header sent by Safari.
- "gzip, deflate, br": Encoding header sent by Chrome and Firefox.
Accept.
The Accept header specifies the response media types accepted by the browser. Similarly to the Accept-Language header, the browser can indicate multiple types, each with a quality value to indicate its preferences.
Examples of Accept values:
- "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8": Accept header when requesting a page on Chrome version 72.
- "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8": Accept header when requesting a page on Firefox version 65.
Order of the HTTP headers.
Besides the header values, different studies [11, 3, 35] also showed that the order of the HTTP headers depends on the browser and can be used to identify a browser. While the type of browser is already specified in the User-Agent header, this can be used for verification.
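As an illustration, a minimal server-side sketch could record the header order for such a verification. We use Node.js here, whose req.rawHeaders property preserves the order in which headers were received; this is only a sketch, not the implementation used by any particular fingerprinter.

```javascript
// Minimal Node.js sketch: record the order of HTTP header names,
// which varies across browsers. req.rawHeaders preserves reception
// order, as alternating name/value entries.
const http = require('http');

http.createServer((req, res) => {
  const headerOrder = req.rawHeaders
    .filter((_, i) => i % 2 === 0) // keep header names only
    .join(',');
  console.log(headerOrder); // e.g. "Host,Connection,User-Agent,Accept,..."
  res.end('ok');
}).listen(8080);
```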
2.2.2.2 JavaScript Attributes
Attributes collected using JavaScript are the main source of entropy for browser fingerprints. In order to help developers adapt their websites to their users' devices—for example, to change the style depending on the size of the screen—browsers expose different APIs that leak information about the device. We present how different JavaScript APIs accessible without any permission, such as the canvas or the audio API, are used by fingerprinters to gather highly unique fingerprinting attributes.
We first introduce several attributes that can be accessed using the navigator object,15 a special object exposed by default in all main browsers, which provides information about the browser and the OS.
navigator.userAgent.
The user agent value can also be accessed in JavaScript through the navigator.userAgent property. In normal conditions—i.e., in the absence of any user agent spoofer—this property returns the same value as the user agent contained in the HTTP headers.
navigator.plugins.
This attribute returns the list of plugins installed in the browser. For each plugin, it provides information about its name, the associated filename, a description, as well as the version of the plugin. Due to the deprecation of the Netscape Plug-in API (NPAPI),16 mostly for security reasons, the entropy of this attribute has decreased over time.
15 Navigator object: https://developer.mozilla.org/en-US/docs/Web/API/Navigator
16 https://blog.chromium.org/2013/09/saying-goodbye-to-our-old-friend-npapi.html
Examples of navigator.plugins values:
- "Chromium PDF Plugin::Portable Document Format::internal-pdf-viewer::__application/x-google-chrome-pdf pdf Portable Document Format": Plugins on a Chrome browser.
- "Shockwave Flash::Shockwave Flash 31.0 r0::NPSWF32_31_0_0_108.dll::31.0.0.108__application/x-shockwave-flash swf Adobe Flash movie,application/futuresplash spl FutureSplash movie": Browser with the Flash plugin. The .dll file extension indicates that the browser is running on Windows.
- "Edge PDF Viewer::Portable Document Format::::__application/pdf pdf Edge PDF Viewer": Plugins on an Edge browser.
navigator.mimeTypes.
The mimeTypes property returns an array containing the list of MIME types supported by the browser. Each MIME type object provides information about the type supported, a description, and the filename:
- Type: 'Portable Document Format', description: 'application/x-google-chrome-pdf', filename: 'pdf'
- Type: 'Widevine Content Decryption Module', description: 'application/x-ppapi-widevine-cdm'
navigator.platform.
It returns the platform the browser is running on. While this information is redundant with the OS contained in the User-Agent header, it can be used to verify whether the claimed OS has been modified.
Examples of navigator.platform values:
- Linux x86_64, Linux armv7l, Linux armv8l, Linux i686, Linux aarch64: possible values for browsers running on Linux.
- MacIntel: value for browsers running on MacOS.
- iPad, iPhone: possible values for browsers running on iOS.
- Win64, Win32: possible values for browsers running on Windows.
navigator.hardwareConcurrency.
This property returns an integer representing the
number of logical processors available to the browser.
navigator.oscpu.
The oscpu property returns a string corresponding to the operating system of the device. Similarly to the platform attribute, it is redundant with the OS contained in the User-Agent header. Contrary to navigator.platform, which is available in all the main browsers, this attribute is only available in Firefox.
Examples of oscpu values:
- Linux x86_64, Linux armv7l, Linux armv8l, Linux i686, Linux aarch64: possible values for browsers running on Linux.
- Intel Mac OS X 10.12, Intel Mac OS X 10.9, Intel Mac OS X 10.11: possible values for browsers running on MacOS.
- Windows NT 6.1; Win64; x64, Windows NT 10.0; WOW64, Windows NT 5.2; WOW64: possible values for browsers running on Windows.
navigator.languages.
It returns an array containing the user's preferred languages. The array is ordered by preference, with the most preferred language first. The value returned is based on the same value as the Accept-Language header; the main difference is that it does not include the quality values represented by the "q" parameter in the header.
Date.getTimezoneOffset.
The getTimezoneOffset method of the Date class returns the difference in minutes between the user's timezone and the UTC timezone. As pointed out by Gomez et al. [28], the entropy of this attribute mostly depends on the distribution of the locations of the users visiting the website that collects the fingerprints.
navigator.enumerateDevices.
The enumerateDevices function returns the list of input and output media devices, such as microphones or webcams. When no permission is granted, it can simply be used to count the distinct numbers of speakers, microphones, and webcams. Nevertheless, in the case where a media permission has been granted to access a webcam, for example, then enumerateDevices can provide more fine-grained information about the peripherals, such as their names or whether or not they are built-in.
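As an illustration, a minimal sketch that counts devices per kind could look as follows. We use the standard navigator.mediaDevices.enumerateDevices entry point; without permission, device labels are empty but the number of devices of each kind is still exposed.

```javascript
// Minimal sketch: count microphones, speakers, and webcams without
// any media permission being granted.
navigator.mediaDevices.enumerateDevices().then((devices) => {
  const counts = { audioinput: 0, audiooutput: 0, videoinput: 0 };
  for (const device of devices) {
    counts[device.kind] += 1;
  }
  console.log(counts); // e.g. { audioinput: 2, audiooutput: 1, videoinput: 1 }
});
```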
navigator.cookieEnabled.
This property returns a boolean indicating whether or not cookies are enabled in the browser. Since it has only two possible values, true or false, this attribute has a low entropy [27].
navigator.doNotTrack.
The doNotTrack property aims at indicating whether or not a user accepts to be tracked. Depending on the browser, it returns "0" if the user refuses to be tracked and "1" if she accepts to be tracked. Some browsers do not specify its value and return null instead. Nevertheless, starting from version 12 of Safari, Apple decided to remove the doNotTrack property from the navigator object because they consider it misleading.17 Indeed, users tend to believe it protects them from tracking even though there is no proof that advertisers, and trackers in general, respect its value.
navigator.getBattery.
The getBattery function returns an object containing information about the device's battery that can be used for tracking [36]. The returned object contains the following information:
- charging: a property that represents whether or not the battery is charging,
- chargingTime: a property that represents the time before the battery is fully charged,
- level: a property that represents the charging level of the battery.
navigator.deviceMemory.
The deviceMemory property returns the amount of memory of the device in gigabytes. It has only been available on Chromium-based browsers, such as Chrome and Opera, since December 2017 (Chrome version 63).18
17 https://developer.apple.com/safari/technology-preview/release-notes/
18 https://developer.mozilla.org/en-US/docs/Web/API/Navigator/deviceMemory
Navigator prototype.
Acar et al. [10] showed that the order of the properties of the navigator object, as well as the presence or absence of certain properties, can be used to fingerprint a browser and its version. For example, on Chrome 68 the navigator prototype has 58 properties, while the Samsung browser version 7 has only 56 properties, and Safari mobile 12 has between 33 and 39 properties. More generally, besides the special case of the navigator object, Mulazzani et al. [37] and Nikiforakis et al. [4] showed that the presence or absence of features could be used to accurately identify the version of a browser. While this feature does not bring any information not already contained in the User-Agent header, it can be used to verify whether the claimed browser has been modified by a spoofer.
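A minimal sketch of this technique could simply enumerate the prototype's own properties in order:

```javascript
// Minimal sketch: enumerate the properties of the Navigator prototype.
// Both the number of properties and their order vary across browsers
// and versions, so the result can serve as a fingerprinting attribute
// or as a consistency check against the User-Agent.
const props = Object.getOwnPropertyNames(Navigator.prototype);
console.log(props.length);    // e.g. 58 on Chrome 68
console.log(props.join(',')); // the order differs between browsers
```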
Canvas fingerprinting.
Mowery et al. [38] showed that the HTML canvas API can be used to generate images whose rendering depends on the browser and the device. These canvases use different techniques that, when combined, generate an image whose rendering is highly unique. For example, Acar et al. [5] showed that commercial fingerprinters use strings that are pangrams—i.e., strings constituted of all the letters of the alphabet—or use emojis, since their rendering depends on the OS and the kind of device. Figure 2.2 presents the canvases generated by the Akamai and PerimeterX fingerprinting scripts.
(a) Canvas fingerprint generated by the Akamai Bot Manager fingerprinting script.
(b) Canvas fingerprint generated by the PerimeterX fingerprinting script.
Figure 2.2 Example of two canvas fingerprints used by commercial fingerprinting scripts.
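For illustration purposes, a minimal canvas fingerprinting sketch in the spirit of the scripts described above could be written as follows. The text, emoji, colors, and sizes are arbitrary choices, not those of any particular commercial script.

```javascript
// Minimal canvas fingerprinting sketch. The rendered pixels depend on
// the OS, fonts, graphics stack, and browser, so the serialized image
// is highly discriminating across devices.
function canvasFingerprint() {
  const canvas = document.createElement('canvas');
  canvas.width = 240;
  canvas.height = 60;
  const ctx = canvas.getContext('2d');
  ctx.textBaseline = 'top';
  ctx.font = '14px Arial';
  ctx.fillStyle = '#f60';
  ctx.fillRect(100, 1, 62, 20);
  ctx.fillStyle = '#069';
  // A pangram plus an emoji, whose rendering varies across platforms.
  ctx.fillText('The quick brown fox jumps over the lazy dog 😃', 2, 15);
  // Serialize the rendered image; this is the value sent to the server.
  return canvas.toDataURL();
}
```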
Window and screen size.
The browser exposes different properties, through the screen and the window objects, that reflect the size of the screen and the window.
Table 2.1 presents and defines these different attributes. Figure 2.3 presents a screenshot
of a browser on MacOS that shows how these attributes relate to each other.
Table 2.1 Definition of different attributes that provide information about the size of the screen and the window. For each attribute, we present a possible value of the attribute. All the possible values shown in the table come from the same user.
- screen.width (1280): Width of the web-exposed screen area in pixels. In the case where there are multiple screens, it should return the value of the screen where the browser window is located. The value is not influenced by the size of the browser window.
- screen.height (1024): Height of the web-exposed screen area in pixels. Similar definition as screen.width.
- screen.availWidth (1280): Amount of horizontal space in pixels available to the browser window.
- screen.availHeight (1024): Amount of vertical space in pixels available to the browser window.
- window.innerWidth (1050): Width of the viewport in pixels, including the size of the scroll bar.
- window.innerHeight (932): Height of the viewport—i.e., the part of the webpage a user can see—in pixels, including the size of the scroll bar.
- window.outerWidth (1050): Width in pixels of the whole browser window.
- window.outerHeight (1004): Height in pixels of the whole browser window.
- screen.colorDepth (24): Color depth of the screen.
Figure 2.3 Presentation of the different attributes related to the size of the screen and the window.

Audio fingerprinting.
Similarly to canvas fingerprinting, which uses the HTML canvas API to generate highly unique images, audio fingerprinting leverages the Web Audio API to generate sound signals with high entropy. Englehardt et al. [6] showed that one popular fingerprinting script relied on an OscillatorNode object to generate and process an audio signal. Due to hardware and software differences, the resulting signal is slightly different depending on the device.
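A minimal sketch of this approach, loosely modeled on the oscillator-based script described by Englehardt et al., could be as follows. The oscillator parameters and the sample range are arbitrary choices.

```javascript
// Minimal audio fingerprinting sketch using an OfflineAudioContext.
// The oscillator signal is processed by a compressor; tiny hardware
// and software differences make the rendered samples device-dependent.
function audioFingerprint() {
  return new Promise((resolve) => {
    const ctx = new OfflineAudioContext(1, 44100, 44100);
    const oscillator = ctx.createOscillator();
    oscillator.type = 'triangle';
    oscillator.frequency.value = 10000; // arbitrary test tone
    const compressor = ctx.createDynamicsCompressor();
    oscillator.connect(compressor);
    compressor.connect(ctx.destination);
    oscillator.start(0);
    ctx.oncomplete = (event) => {
      // Sum a slice of the rendered samples into a compact value.
      const samples = event.renderedBuffer.getChannelData(0);
      let sum = 0;
      for (let i = 4500; i < 5000; i++) sum += Math.abs(samples[i]);
      resolve(sum);
    };
    ctx.startRendering();
  });
}
```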
WebGL.vendor/renderer.
The WebGL API enables drawing 3D shapes in the browser. Although it works in the majority of browsers and devices—even devices without a GPU, thanks to technologies such as SwiftShader,19 which provides a compatible API on a CPU—the WebGL API keeps exposing information about the user's device to help developers tailor their code to it. In particular, two attributes exposed by the WebGL API can be used for fingerprinting. The first attribute is the WebGL vendor and returns the name of the GPU vendor:
- Apple Inc.
- Intel Open Source Technology Center
- Qualcomm
- ATI Technologies Inc
The second attribute, the WebGL renderer, returns the name of the GPU:
- Adreno (TM) 405
- AMD PITCAIRN (DRM 2.50.0 / 4.15.0-43-generic, LLVM 6.0.0)
- ANGLE (AMD Radeon HD 7310 Graphics Direct3D9Ex vs_3_0 ps_3_0)
- NVIDIA Quadro K4000 OpenGL Engine
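These values are typically obtained through the WEBGL_debug_renderer_info extension of a WebGL context; a minimal sketch could be:

```javascript
// Minimal sketch: read the GPU vendor and renderer through the
// WEBGL_debug_renderer_info extension of a WebGL context.
function webglVendorRenderer() {
  const canvas = document.createElement('canvas');
  const gl = canvas.getContext('webgl');
  if (!gl) return null; // WebGL unavailable
  const info = gl.getExtension('WEBGL_debug_renderer_info');
  if (!info) return null; // extension unavailable
  return {
    vendor: gl.getParameter(info.UNMASKED_VENDOR_WEBGL),     // e.g. "Qualcomm"
    renderer: gl.getParameter(info.UNMASKED_RENDERER_WEBGL), // e.g. "Adreno (TM) 405"
  };
}
```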
WebGL canvas.
Besides static attributes, the WebGL API can also be used to generate a 3D canvas fingerprint. Laperdrix et al. [27] used the WebGL API to generate 3D shapes. Nevertheless, they did not succeed in crafting a stable and unique WebGL canvas. More recently, Cao et al. [1] contradicted Laperdrix et al.'s findings and showed that the WebGL API can be used to generate canvases that are both unique and stable, even across different browsers on the same machine. They carefully selected different parameters, such as the texture, the anti-aliasing, or the light intensity, to render more than 20 different tasks. To create unique 3D scenes, the tasks exploit different mechanisms, such as the fact that the interpolation algorithms used by fragment shaders vary depending on the graphics card.
19 https://developers.google.com/web/updates/2012/02/SwiftShader-brings-software-3D-rendering-to-Chrome
Figure 2.4 Examples of two 3D scenes generated with WebGL using Cao et al.'s [1] approach.

The tasks generate fingerprints that are also resilient when the screen or the window size changes, or when the zoom level is altered. Figure 2.4 presents two examples of 3D scenes they generate. They also showed that even when WebGL does not use the GPU—e.g., when the device has no GPU or a blacklisted GPU—and uses the SwiftShader library to run the computation on the CPU, the 3D scenes still have entropy.
Touch screen.
The presence of a touch screen, as well as its characteristics, can be used for fingerprinting. In order to test the presence of touch support on the device, one can create a TouchEvent and observe if it succeeds, or look at the presence of the ontouchstart property in the window object. In case the device has touch support, one can use the navigator.maxTouchPoints or the navigator.msMaxTouchPoints properties to obtain the number of simultaneous touch contact points supported by the device.
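A minimal sketch combining these tests could be:

```javascript
// Minimal sketch: detect touch support and the number of
// simultaneous touch points supported by the device.
function touchFingerprint() {
  let touchEvent = false;
  try {
    document.createEvent('TouchEvent'); // throws if unsupported
    touchEvent = true;
  } catch (e) { /* no TouchEvent support */ }
  return {
    touchEvent,
    touchStart: 'ontouchstart' in window,
    maxTouchPoints: navigator.maxTouchPoints || navigator.msMaxTouchPoints || 0,
  };
}
```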
Audio and video codecs.
Audio and video codec support depends on the browser and the OS.20 During the analyses conducted in this thesis, we observed some of the commercial fingerprinting scripts testing the presence of audio and video codecs using the HTMLMediaElement.canPlayType function. Given an audio or a video type, this function returns three possible values:
1. "probably", which means that the media type appears to be playable,
2. "maybe", which indicates that it is not possible to tell if the type can be played without playing it,
3. "", an empty string indicating that the type cannot be played.
20 https://developer.mozilla.org/en-US/docs/Web/HTML/Supported_media_formats
Font enumeration.
At the end of this section, we present how the whole list of fonts installed on the system can be obtained using Flash. Nevertheless, with the decrease in popularity of the Flash plugin caused by its deprecation,21 fingerprinters have come up with new approaches to obtain the list of fonts installed on the system [4]. The idea to test if a font is installed is to compare the size of two HTML elements, one that uses the system fallback font and the other that uses the font whose presence the fingerprinter wants to test. It can be done the following way (a code sketch is given after the list):
1. The script creates a div element containing a span element;
2. The script sets a predefined text with a fixed size. Moreover, it sets a font-family that does not exist. Thus, the browser will use the fallback font of the system;
3. The script measures and saves the size of the span element using its offsetWidth and offsetHeight properties;
4. For each font whose presence the script wants to test on the user's system, it creates a span element inside a div. Then, it sets the text of the span element using the same string and size as in step 2, and it specifies that the text should be rendered using the font that it wants to test. Finally, the script measures the size of the span element;
5. If the span element has the same dimensions as the span element that uses the fallback font, then it means the font is not present on the device. Otherwise, it means the font is installed.
To decrease the chance of false negatives—i.e., fonts that would not be detected—the font size should be large enough, so that even small differences in the font rendering are amplified and can be detected through the offsetWidth and offsetHeight properties. Gomez et al. [28] collected fonts from more than 2M users using this approach and showed that the list of fonts provides more than 6.9 bits of entropy.
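A minimal sketch of this JavaScript font enumeration technique could be written as follows; the test string, size, and fallback name are arbitrary choices.

```javascript
// Minimal sketch of JavaScript font enumeration. A font is considered
// installed if rendering a test string with it yields dimensions that
// differ from those obtained with the system fallback font.
function detectFonts(candidateFonts) {
  const testString = 'mmmmmmmmmmlli'; // glyphs whose widths vary across fonts
  const span = document.createElement('span');
  span.textContent = testString;
  span.style.fontSize = '72px'; // a large size amplifies differences
  span.style.fontFamily = 'no-font-with-this-name'; // forces the fallback font
  document.body.appendChild(span);
  const fallback = { w: span.offsetWidth, h: span.offsetHeight };
  const installed = [];
  for (const font of candidateFonts) {
    span.style.fontFamily = `'${font}', no-font-with-this-name`;
    if (span.offsetWidth !== fallback.w || span.offsetHeight !== fallback.h) {
      installed.push(font);
    }
  }
  document.body.removeChild(span);
  return installed;
}

// Usage: detectFonts(['Arial', 'Calibri', 'Ubuntu', 'Helvetica Neue']);
```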
Fifield et al. [39] showed that simply measuring how different Unicode glyphs are rendered can provide a stable and unique identifier. Indeed, the rendering of a font depends on different factors, such as the fonts installed or anti-aliasing. They measured the size of the glyph bounding boxes for different Unicode characters and found that, across the 1,016 different devices in their experiment, 349 could be identified solely using the font metrics.
21 https://www.bleepingcomputer.com/news/security/google-chrome-flash-usage-declines-from-80-percent-in-2014-to-under-8-percent-today/
Performance fingerprinting.
Mowery et al. [40] used the SunSpider and V8 benchmarks to build a fingerprint. In total, they ran 39 performance tests, each five times, and measured the time each test takes to execute. Using this timing information, they created different heuristics to predict the OS, the browser, as well as the CPU architecture. While the test sample is relatively small, less than 1,000 different configurations, they are still able to achieve a browser classification accuracy of more than 80%. In the case of the CPU architecture, they achieve an accuracy of 45.3%, which is still interesting considering that a random choice would have resulted in an accuracy of 6.7%. While the CPU architecture can be used as an additional attribute in a fingerprint, being able to properly classify the OS and the browser enables verifying whether the values displayed in the user agent have been spoofed. More recently, Sanchez et al. [41] proposed an approach that measures the time to execute sequences of cryptographic functions to generate fingerprints capable of distinguishing similar devices.
Extension probing.
Similarly to the list of plugins, the list of extensions can be used as a fingerprinting attribute. Nevertheless, the main difference between plugins and extensions is that there is no API to retrieve the list of extensions installed by a user. Thus, the different techniques we present to obtain the list of extensions rely either on bugs or on side effects caused by the usage of these extensions. Mowery et al. [40] showed that it was possible to infer the list of websites whitelisted by the NoScript extension by observing whether scripts from a certain domain could be executed or were blocked. Since these whitelists are often unique, they argued this could be used as an additional fingerprinting technique. Starov et al. [42] showed that browser extensions can be identified because of the way they interact with the DOM. Among the 10,000 most popular extensions of the Chrome store, around 15% had a unique way to interact with the DOM, making their presence detectable. They also showed that among 854 users, 14.1% had a unique set of browser extensions. Sjosten et al. [43] proposed an approach that leverages Web Accessible Resources (WAR) to test the presence of browser extensions. Their approach is able to detect more than 50% of the top 1,000 Chrome extensions. Even though Firefox protects against this kind of attack by randomizing each extension's identifier,22 Sjosten et al. [44] showed it was still possible to test the presence of extensions using a revelation attack. Their strategy is to convince the extension to inject content in the DOM using a WAR URL, therefore making the extension reveal its unique randomized identifier that can be used for tracking. Thus, with this approach, they can reveal the presence of an extension and also obtain a unique and stable identifier.
22 Protecting against extension probing: https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/manifest.json/web_accessible_resources
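For illustration purposes, a minimal sketch of WAR probing in Chrome could be as follows; the extension identifier and resource path are placeholders, not a real extension.

```javascript
// Minimal sketch of Web Accessible Resource (WAR) probing in Chrome:
// loading a chrome-extension:// URL only succeeds when the extension
// exposing that resource is installed. The id and path below are
// placeholders for a known extension's image resource.
function probeExtension(extensionId, resourcePath) {
  return new Promise((resolve) => {
    const img = new Image();
    img.onload = () => resolve(true);   // resource served: extension present
    img.onerror = () => resolve(false); // blocked/missing: extension absent
    img.src = `chrome-extension://${extensionId}/${resourcePath}`;
  });
}
```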
2.2.2.3 List of Fonts Using Flash
Flash usage went from 80% in 2014 to 8% in 2018.23 Flash, and plugins in general, have been deprecated by browser vendors,24 mainly because of the security risks they represent, and are now being replaced by browser extensions that have fewer rights than plugins used to have. In particular, plugins are able to access more information than JavaScript. We only present the Flash attribute with the most entropy, the list of fonts. As we show in the oldest contribution of this thesis, FP-Stalker, presented in Chapter 3, where we use fingerprinting to track browsers over time, the only Flash attribute still worth considering was the list of fonts. Nevertheless, we show that even on fingerprints collected around 2017, it does not bring significant information since most of the users had already disabled Flash. The Font.enumerateFonts function enables collecting the complete list of fonts using Flash. Contrary to JavaScript font enumeration, which needs to test the presence of each font, this method is straightforward and provides a simple mechanism to obtain the list of all the fonts installed on the system, even the most uncommon fonts. Thus, when Flash usage was still high, it was one of the attributes with the highest entropy [3, 27]. Eckersley [3] also showed that the order of the fonts depends on the system.
Even though the Flash plugin can be used to obtain other attributes, such as the platform, the preferred languages, or the screen resolution, we decide not to present these attributes because of the decline of Flash usage, and because they do not provide significantly more entropy than their JavaScript counterparts. Moreover, other plugins, such as Java or Silverlight, were also used by fingerprinters to obtain more fine-grained information than the one provided by JavaScript. Nevertheless, in a 2013 study, Nikiforakis et al. [4] showed that none of the three popular commercial fingerprinters they studied were still using Java.
23 https://www.bleepingcomputer.com/news/security/google-chrome-flash-usage-declines-from-80-percent-in-2014-to-under-8-percent-today/
24 https://blog.chromium.org/2013/09/saying-goodbye-to-our-old-friend-npapi.html
2.2.3 Studying Browser Fingerprints Diversity
Mayer [2] brought to light the privacy problems that arise from browser diversity and customization. Since there are different OSes, browsers, screen resolutions, and plugins, this diversity could be exploited to uniquely identify browsers. At the time the thesis was written, in 2009, the situation was even worse due to the widespread use of Java applets and Flash ActionScript, which had access to even more attributes than JavaScript programs. Over two weeks, Mayer collected fingerprints from 1,328 different browsers, among which 1,278 (96.23%) were unique.
Mayer’s work motivated the first large-scale study on browser fingerprinting uniqueness
conducted by Eckersley [
3
], with the collaboration of the Electronic Frontier Fondation
(EFF).
25
They created a website, Panopticlick,
26
on which they collected 470,161 finger-
prints between 27th January and 15th January 2010. Their results confirm Mayer’s initial
findings: 83.6% of the browsers had a unique fingerprint. Uniqueness was even higher,
94.2%, for browsers with either Flash or Java activated. Indeed, among Flash and Java
users, only 1% of the browsers had an anonymity set larger than two. They showed that
that the list of plugins and the list of fonts were the two attributes with the most entropy.
With this proportion of unique browser fingerprints, they argue that this technique can
be used for tracking, in particular as a mechanism to regenerate supercookies or deleted
cookies. To support this claim, they proposed a simple heuristic that aims at linking
multiple fingerprints of the same browser. First, they studied the stability of browser
fingerprints over time and showed that among the 8,833 users that had accepted a cookie
and that had visited the websites multiple times, more than 37% displayed at least one
change (besides activating or deactivating JavaScript) in their fingerprint. Nevertheless,
they are aware this number may be overestimated because of the nature of their website
that tends to make people change their fingerprint on purpose, e.g. by changing the list
languages they prefer or by deactivating a plugin. Nevertheless, they showed that despite
these frequent changes in the fingerprint, browser fingerprinting could still be used for
tracking. Their heuristic was able to make correct predictions 65% of the time, incorrect
predictions 0.56% of the times. Otherwise, 35% of the time, it made no prediction.
Laperdrix et al. [27] also created a website, AmIUnique, to study the diversity of fingerprints. Between 2014 and 2015, they collected more than 118,000 fingerprints. In addition to the attributes collected in the study conducted by Eckersley [3], they also collected new attributes, such as canvas [38] and WebGL fingerprints. They used the normalized Shannon entropy to compare their dataset with the Panopticlick dataset. They found similar results, except for the list of plugins and the list of fonts, where they obtained a lower entropy. This difference can be explained by the decrease of Flash usage,
which means that the list of fonts was not collected for all the fingerprints, therefore decreasing its entropy. The difference can also be explained by the rise of mobile usage, as mobile browsers have no plugins, which also excludes Flash. Besides the attributes also collected on Panopticlick, they analyzed the entropy of seven new attributes, such as canvas and WebGL fingerprints or the presence of an ad-blocker. They found that canvas was among the five most discriminating attributes, with a normalized entropy close to the entropy of the list of plugins. Among the 118,934 fingerprints they collected, they obtained 8,375 distinct canvas values, among which 5,533 were unique. They also studied the differences between computer fingerprints, from desktops or laptops, and mobile fingerprints. While 90% of desktop fingerprints were unique, only 81% of mobile fingerprints were unique. This difference was mostly explained by the low entropy of the list of fonts and the list of plugins on mobile. Nevertheless, mobile fingerprints still achieve a high uniqueness because of attributes, such as the user agent or the canvas, that are more unique on mobile. Indeed, in the case of the user agent, they noticed that some phone manufacturers were adding sensitive information to this header, such as the precise version of the model or the version of the Android firmware. In the case of the canvas, they noticed that the emoji included in it was also a great source of entropy since its rendering depends on the phone's OS version as well as the phone manufacturer.
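For reference, the normalized Shannon entropy used for these comparisons can be written as follows; this is a minimal formulation, assuming, as in the AmIUnique study, that the entropy is normalized by the size of the dataset so that datasets of different sizes remain comparable.

```latex
% Shannon entropy of an attribute whose values v_i occur with
% probability p_i in the dataset:
H = -\sum_{i} p_i \log_2 p_i
% Normalized entropy, with N the number of fingerprints collected:
\hat{H} = \frac{H}{\log_2 N}
```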
More recently, between 2016 and 2017, Gomez et al. [28] collected more than 2 million fingerprints on a popular French website from the Top 15 Alexa. Since it is a popular website visited by a wide range of users, it avoids the bias of the data collected by the Eckersley and Laperdrix studies. Indeed, as acknowledged by Eckersley, Panopticlick was mostly visited by users aware of privacy issues on the web. Thus, these users may have more customized browser and device configurations than random users. Gomez et al. compared the diversity of browser fingerprints in their dataset with the ones from the Eckersley and Laperdrix studies. They collected the same set of attributes, with the exception of the canvas, which was modified to obtain a higher uniqueness. They also collected the list of fonts using JavaScript instead of Flash, since Flash usage had already hugely decreased at the time their study was conducted. They found significantly different results compared to the two previous studies. While previous studies claimed that more than 80% of the fingerprints were unique, only 33.6% are unique in their dataset. The difference is even more important for mobile devices. While 81% of the mobile fingerprints collected on AmIUnique were unique, only 18.5% of the fingerprints in their dataset are unique. Despite this uniqueness difference, the attributes with the most entropy are still the same:
1. The list of plugins (9.49 bits of entropy),
2. The canvas fingerprint (8.55 bits of entropy),
3. The user agent (7.15 bits of entropy),
4. The list of fonts (6.90 bits of entropy).
Similarly to the AmIUnique study, they also observed a decrease of entropy on mobile for the list of plugins (10.3 bits on computers against 0.2 bits on mobile) and the list of fonts (7.0 bits on computers against 2.2 bits on mobile), even when fonts are obtained using JavaScript. Their study also confirmed that mobile user agents provide more information compared to computer user agents (6.3 bits on computers against 8.7 bits on mobile).
While their study shows that browser fingerprint uniqueness has probably been overestimated by previous studies, either because of small or biased datasets, it is unclear in which proportions. It is difficult to separate the variation caused by their use of a more representative dataset from the variation caused by the fact that attributes that used to have a high entropy, such as the list of plugins, became less unique over time due to the deprecation of plugins,27 which started being replaced by browser extensions. Indeed, to compare fingerprint uniqueness and attribute entropy with previous studies, they restricted themselves to 17 attributes already collected by Eckersley [3] and Laperdrix et al. [27]. While they improved the canvas and modified the way fonts are collected, they did not take into account several attributes available at the time their study was conducted. Thus, the main criticism of this study is that while it properly evaluates the entropy of the attributes studied, it underestimates the fingerprint uniqueness by excluding attributes that were available at the time the study was conducted, between 2016 and 2017. In particular, they did not consider the following attributes:
1. navigator.enumerateDevices. This function has been available since Chrome version 45 (September 2015)28,29 and provides information about the number of microphones, speakers, and webcams;
2. Audio fingerprinting. This technique relies on the HTML Audio API, available since Chrome version 35 and Firefox version 25. In a crawl conducted in January 2016, Englehardt et al. [6] already mentioned the use of audio fingerprinting by popular websites and estimated it had an entropy of 5.4 bits;
3. Screen and window properties. While they collected information about the screen width and height, as well as the color depth, they did not collect more advanced information, such as window.innerHeight/Width or window.outerHeight/Width, which enables inferring the presence of a desktop toolbar and its size, or whether or not a bookmark bar is displayed in the browser;
4. Audio and video codecs. The presence of audio and video codecs can be tested using the HTMLMediaElement.canPlayType function, which was already available at the time their study was conducted;30
5. Touch screen details. In the case of mobile devices, they did not collect any information about the maximum simultaneous touch points supported by the screen using the navigator.maxTouchPoints property, available since Chrome 35;31
6. Number of cores. The navigator.hardwareConcurrency property returns the number of logical processors available to the browser and has been available since Chrome 37 and Firefox 48.32
Thus, it is unclear how different the fingerprint uniqueness would have been, had they considered these attributes. In particular, when we consider their second research question, which studied the proportion of almost unique fingerprints—i.e., fingerprints that would become unique if a slight modification was applied to attributes that the user can naturally modify through the browser user interface, such as the value of Do Not Track or the list of preferred languages—they showed that, for computer fingerprints, applying small changes to random fingerprints would lead to a uniqueness rate of 80%. Therefore, the real fingerprint uniqueness should be considered with care. Moreover, adding to fingerprints new attributes presented in Section 2.2, such as navigator.deviceMemory or extension probing, would probably also raise fingerprint uniqueness. We consider evaluating the entropy of these attributes as part of future work.
27 https://blog.chromium.org/2013/09/saying-goodbye-to-our-old-friend-npapi.html
28 https://developer.mozilla.org/en-US/docs/Web/API/MediaDevices/enumerateDevices
29 Google Chrome version history: https://en.wikipedia.org/wiki/Google_Chrome_version_history
30 https://developer.mozilla.org/en-US/docs/Web/API/HTMLMediaElement/canPlayType
31 https://developer.mozilla.org/en-US/docs/Web/API/Navigator/maxTouchPoints
32 https://developer.mozilla.org/en-US/docs/Web/API/NavigatorConcurrentHardware/hardwareConcurrency
2.2.4 Use of Browser Fingerprinting on the Web
We present multiple large-scale studies that analyzed the use of browser fingerprinting on the web. We present these studies in chronological order to better convey the evolution of fingerprinting use and techniques over time.
The first large-scale studies on browser fingerprinting started in 2013, three years after Mayer [2] and Eckersley [3] brought to light the privacy risk arising from browser customization. Nikiforakis et al. [4] analyzed the code of three popular fingerprinters. They noticed that commercial fingerprinters used more aggressive techniques than those presented by Eckersley [3]. For example, commercial fingerprinters heavily relied on the Flash and ActiveX plugins to obtain information not available in JavaScript, such as whether or not the browser is behind a proxy. They noticed that even for simple attributes, such as the platform, which can be accessed using navigator.platform, or the user agent, the Flash platform attribute provides more detailed information, such as the exact version of the Linux kernel, which can be used both for tracking and to exploit vulnerabilities. They detected that fingerprinters adapted their behavior based on the nature of the browser and the plugins available. For example, when the script detected Internet Explorer, it tried to exploit specific APIs available only on Internet Explorer, such as navigator.systemLanguage. When specific plugins were detected, two of the fingerprinters even tried to invoke them to obtain sensitive information, such as the hard disk identifier, the computer's name, the installation date of Windows, as well as the list of installed system drivers. They also detected a shift in the way fonts were obtained because of the decline of Flash. Thus, while two of the fingerprinters used Flash to obtain the list of available fonts, one of the fingerprinters was using JavaScript [39].
They also crawled the Top Alexa 10K to study the adoption of these three fingerprinting scripts. They detected 40 sites (0.4%) using scripts provided by one of the three commercial fingerprinters. They also used Wepawet,33 an online platform for the detection of web-based threats, to detect if these scripts were used by less popular websites, and found that 3,804 domains analyzed by Wepawet used one of these scripts.
33 Wepawet: https://wepawet.cs.ucsb.edu
Also in 2013, Acar et al. [10] proposed FPDetective, a crawling framework to detect and analyze fingerprinting on the web. They applied FPDetective to the Top Alexa 1M websites and were able to detect 16 new fingerprinting scripts, as well as new fingerprinting techniques that had not been documented by previous studies. Instead of relying on lists of URLs to detect fingerprinting scripts, their crawler logs accesses to properties commonly used for fingerprinting, such as the properties of the navigator and screen objects, as well as properties used for JavaScript font enumeration, such as offsetWidth/Height. The crawler also intercepts calls to the getFontData function used in Flash to obtain the list of fonts. They consider that a script is doing fingerprinting if it loads more than 30 fonts, enumerates plugins or mimeTypes, and accesses screen and navigator properties. With this methodology, they detected 13 distinct fingerprinting scripts present on 404 websites doing JavaScript font enumeration.
In 2014, Acar et al. [5] conducted a large-scale study about stateful and stateless tracking mechanisms used in the wild. In particular, they were the first to measure the use of canvas fingerprinting at scale. To detect scripts that collect canvas fingerprints, they log the values returned by the toDataURL function used to obtain the value of a canvas. They also monitor the arguments of the fillText and strokeText functions used to draw text on a canvas. To decrease false positives, they consider that a script is using canvas for fingerprinting only if both toDataURL and either fillText or strokeText are called. Moreover, they also define a constraint on the size of the canvas, which should be at least 16x16 pixels. Finally, the image should not be requested in a lossy compression format, such as JPEG. They observed that 5.5% of the websites in the Top Alexa 100K were using canvas fingerprinting on their home page. While there were 20 different companies providing canvas fingerprinting scripts, one of the companies, AddThis, represented more than 95% of the scripts. Moreover, they noticed that fingerprinters had considerably improved the canvas fingerprinting techniques since the original study conducted by Mowery et al. [38]. For example, new canvas fingerprinting scripts draw the same text twice with different colors and trigger the default fallback font. These scripts also use pangrams—i.e., strings that include all the letters of the alphabet—as well as different emojis. While Acar et al. argued that emojis were used to check if the browser supported emojis, Laperdrix et al. [27] later showed that beyond testing emoji support, emojis were also a rich source of entropy since their representation depends on the OS and the device.
More recently, in 2016, Englehardt et al. [6] crawled the Top Alexa 1M to study the use of cookies and multiple fingerprinting techniques. They proposed OpenWPM, an extensible crawler framework that aims at making privacy studies at scale easier. They detected more than 81,000 third parties present on at least two first parties. Moreover, they showed that four companies, Google, Facebook, Twitter and AdNexus, were each present on more than 10% of the websites crawled.
To measure the use of fingerprinting, they monitored accesses to properties commonly used for fingerprinting, similarly to the approach proposed by Acar et al. [10]. They detected that among the Top Alexa 1M websites, canvas fingerprinting was only used by 1.6% of the websites. Nevertheless, canvas fingerprinting was used by 5.1% of the websites in the Top Alexa 1K. Thus, they showed a decrease in canvas fingerprinting use compared to the previous study conducted by Acar [5] in 2014. In particular, the popular fingerprinting script delivered by AddThis was no longer in use in 2016. They also measured the use of canvas-based font enumeration and showed that it was used by 2.5% of the websites in the Top Alexa 1M. Finally, they measured the use of audio fingerprinting at scale and detected 518 websites that compute an audio fingerprint, among which 512 delivered scripts from the same company.
2.3 Countermeasures Against Fingerprinting
In this section, we present the three main strategies to protect browsers against browser fingerprinting:
1. Blocking the execution of fingerprinting scripts.
This strategy can be
achieved by disabling JavaScript or by intercepting requests that load fingerprinting
scripts;
2. Breaking the stability of browser fingerprints.
Fingerprint tracking requires
both uniqueness and stability to be effective. This strategy aims at frequently
modifying the attributes constituting a fingerprint in order to break the fingerprint
stability, and thus make tracking impossible or less effective;
3. Breaking the uniqueness of browser fingerprints.
This strategy acts on the
uniqueness required for tracking. It aims at increasing the anonymity set of each
fingerprint so that multiple browsers from different users share the same fingerprint
or fingerprints with high similarity.
Countermeasures can achieve these strategies by implementing different mechanisms.
For example, in order to unify the value of fingerprints, one can either lie about the
values returned by fingerprint attributes so that all browsers return the same value or
one can block access to attributes with a high entropy so that browsers converge towards
a similar fingerprint. In the following subsections, we go through the three different
defense strategies. For each of the strategies, we present the different countermeasures
that use this strategy and the different mechanisms they implement to achieve it. Note that some countermeasures may implement multiple or hybrid strategies. For example, FaizKhademi et al. [45] proposed a modified Chromium with two modes: a first mode that aims at making all fingerprints look the same, and a second mode that aims at breaking the stability of fingerprints by randomizing their values.
We first present countermeasures that protect by blocking the execution of fingerprinting
scripts. Then, we present the countermeasures that aim at breaking the stability of
fingerprints and the countermeasures that unify browser fingerprints to make each browser
less unique. Finally, we present the main weaknesses of the countermeasures presented
in this section.
2.3.1 Blocking Fingerprinting Script Execution
The first strategy we present relies on blocking the execution of fingerprinting scripts. Blocking the execution of the script makes the collection of the fingerprint impossible. While the server can still collect a reduced version of the fingerprint using HTTP headers, not collecting JavaScript attributes hugely decreases the entropy of the fingerprint. Countermeasures that aim at blocking script execution are not specifically designed to counter browser fingerprinting. Nevertheless, they may include rules that block some fingerprinting scripts. These countermeasures are among the most popular privacy-enhancing technologies [46]. For example, in March 2019, four out of the ten most popular browser extensions for Firefox were ad-blockers and tracker-blockers [47]. In particular, the two most popular extensions, AdblockPlus [48] and uBlock Origin [49], represent more than 11% of the total browser extensions used on Firefox. The majority of these script-blocking countermeasures rely on crowdsourced filter lists that specify if a resource should be blocked. There exist different lists that serve different purposes.
One of the most popular, EasyList [50], focuses on blocking advertising content, while EasyPrivacy [51] focuses on blocking trackers, which can include fingerprinting-based trackers. These lists are used in popular browser extensions, such as AdblockPlus [48], uBlock Origin [49] or Adblock [52], as well as in browsers, such as Brave [53], which integrates a native ad-blocker. Other browser extensions, such as Ghostery [54], rely on proprietary filter lists to block content. One of the main problems of these lists, whether they are proprietary or not, is that they need to be manually updated and require a significant amount of work to be maintained [19]. Thus, other more dynamic approaches have been proposed to get rid of these lists. For example, Privacy Badger [55], a browser extension
developed by the EFF, uses heuristics to determine if a request should be blocked. It keeps track of third-party resources included in the pages visited and observes if their behavior is similar to that of trackers, based on their use of cookies, local storage, or even browser fingerprinting techniques. When it observes a suspicious third party on more than three domains, Privacy Badger automatically blocks its content. Umar et al. [56] apply a machine-learning-based approach that considers features extracted from HTML elements, HTTP requests, and JavaScript to determine if a request should be blocked. Merzdovnik et al. [57] quantified the effectiveness of ad-blockers and tracker blockers at scale. They show that rule-based extensions, such as uBlock Origin or Ghostery, outperform learning-based extensions, such as Privacy Badger, even though they took care of training Privacy Badger's heuristic on 1,000 websites before applying it during their evaluation. They show that while the majority of these blocking tools are effective against stateful third-party trackers, they all fail to block well-known stateless trackers that use browser fingerprinting. Englehardt et al. [6] also showed that popular filter lists tend to detect only a fraction of fingerprinting scripts.
Finally, a more radical approach is to block the execution of JavaScript code. The most popular tool for blocking JavaScript is the NoScript [58] browser extension. Other browser extensions, such as uBlock Origin [49] and uMatrix [59], as well as other browsers, such as Brave [53] or the Tor Browser [60], also propose convenient mechanisms to disable JavaScript execution. While this approach guarantees to block JavaScript-based browser fingerprinting, it may also render many websites unusable, since the majority of websites rely on JavaScript to make their site dynamic. Moreover, as shown by Yu et al. [61], breaking websites, and thus decreasing usability, can also lead to a decrease in privacy, as users are more tempted to disable their countermeasures without understanding the privacy implications of doing so.
2.3.2 Breaking Fingerprint Stability
Another defense strategy consists in frequently modifying the values of different attributes of a fingerprint to break the stability property required for tracking fingerprints over time. The user agent is a key attribute for fingerprinting, as its value reflects the browser and the OS used by the user. For this reason, a wide range of user agent spoofer extensions enable lying about the user agent sent by the browser. For example, Ultimate User Agent [62], a Chrome extension, enables changing the user agent enclosed in the HTTP requests, as the original purpose of this extension is to access websites that demand a specific browser. The main drawback of user agent spoofers as a protection against fingerprinting lies in the fact that they create inconsistent browser fingerprints [4]—i.e., combinations of attributes that cannot be found in the wild.
More advanced extensions, such as Random Agent Spoofer [63], aim to address this inconsistency problem. Random Agent Spoofer (RAS) is an extension that was available until Firefox 57—in version 57, Firefox changed the APIs for browser extensions so that they became compatible with the Chrome browser—that protects against fingerprinting by providing a mechanism to switch between different device profiles, composed of several attributes, such as the user agent, the platform, and the screen resolution. Even though RAS is not available on modern versions of Firefox, it has been forked and recently ported to the web extensions supported by the most recent versions of Firefox [64]. Since the device profiles used to spoof fingerprints are extracted from real browser configurations, all of the attributes contained in a profile are consistent with respect to each other. Besides spoofing attributes, RAS also enables blocking advanced fingerprinting techniques, such as canvas, WebGL or WebRTC fingerprinting.
Nikiforakis et al. [7] proposed PriVaricator, a modified Chromium browser that randomizes the list of plugins and the list of fonts. Besides the high entropy of these two attributes [3, 27, 28], the main reason PriVaricator focuses on them is to avoid inconsistencies in the generated fingerprints. Indeed, their strategy does not lie about the browser, its version, or the platform the browser is running on, making it more difficult for an adversarial fingerprinter to detect the use of a countermeasure. To randomize the list of plugins, they define a probability of hiding each individual entry in the list of plugins. Concerning the list of fonts, they focus on font enumeration using JavaScript. They override the values returned by two properties, offsetHeight and offsetWidth, as well as the getBoundingClientRect function, the three of them being used for font enumeration [65, 4, 45]. They proposed three font randomization policies that become active whenever a script accesses one of these properties more than a defined threshold. They implemented their changes directly in Chromium's C++ code for performance purposes, and also because the offsetWidth and offsetHeight properties are not properties of the HTMLElement prototype, making it more difficult to override these properties directly in JavaScript in an efficient way. They evaluated their approach based on three criteria:
1. Performance. They measured the performance overhead using three JavaScript benchmarks and noticed no statistically significant overhead;
2. Privacy protection. They also evaluated the privacy gain against four fingerprinters: BlueCava and Coinbase, two commercial fingerprinting scripts; FingerprintJS, an open-source fingerprinting script; and PetPortal, an academic research platform. Overall, their approach was able to generate unique and different fingerprints against the four fingerprinters, which means it was effective at protecting against browser fingerprinting;
3. Visual breakage. Finally, they evaluated the visual breakage caused by their tool. To do so, they instrumented PriVaricator with the different randomization policies that were considered successful during the evaluation step. They measured a negligible visual breakage of 0.6% on average with their third randomization policy.
Torres et al. [66] proposed FP-Block, a Firefox browser extension that ensures that any embedded party will see a different fingerprint for each site it is embedded in. Thus, a browser fingerprint can no longer be linked between different websites, even when the same third parties are included on these websites. Whenever a user visits a new site, FP-Block generates a new fingerprint, so that different websites observe different fingerprints for the same browser. Their approach focuses on the properties of the navigator and screen objects. It also adds random noise to canvas fingerprints. FP-Block's authors are aware of what they call the "fingerprinting countermeasure paradox"—i.e., the fact that using a fingerprinting countermeasure can make a user more identifiable if the countermeasure is detected, since she becomes more unique and identifiable. Thus, contrary to naive countermeasures that randomize the value of attributes without any constraints, FP-Block tries to ensure fingerprint consistency. To generate consistent fingerprints, they model how the different attributes of a fingerprint relate to each other using a Markov chain model.
FaizKhademi et al. [45] proposed FPGuard, a combination of a modified Chromium and a browser extension. The browser extension aims at detecting fingerprinting scripts to blacklist them, while the modified Chromium aims at spoofing attributes of a fingerprint. The browser extension monitors accesses to attributes and functions used for fingerprinting. To detect if a script uses fingerprinting, they rely on 9 different metrics, such as the number of navigator and screen properties accessed or whether a canvas element has been programmatically accessed. Then, using these metrics, they assign a score that represents the level of suspicion that the script is doing fingerprinting. Whenever a script is considered suspicious, it is added to a blacklist and is automatically blocked the next time a website tries to load it. Thus, FPGuard can also be partly classified in the set of countermeasures that block fingerprinting script execution. Their modified Chromium spoofs the values of attributes and functions used in fingerprinting whenever a user visits a website. They argue that naively randomizing the values of fingerprints can degrade the user experience, since some websites rely on the screen resolution or the type of device to properly display their content. Thus, they aim at generating fingerprints that "represent the properties of the browser almost correctly". To do so, they developed different randomization strategies for the screen and navigator objects, for the list of plugins, for the canvas, as well as for the list of fonts. For example, they may alter the browser sub-version or randomize the canvas by adding minor noise to its content. FPGuard's authors are aware that their countermeasure can be detected because of their extension module that overrides getters in JavaScript. Nevertheless, they argue that since a fingerprinter cannot recover the original values, it is still a privacy improvement.
Baumann et al. [8] proposed DCB, a modified Chromium that protects against browser fingerprinting and, in particular, Flash fingerprinting. To do so, they proposed two opposite strategies that leverage a dataset of real browser fingerprints:
1. 1:N. One browser, many configurations. The goal of this strategy is to have unique fingerprints that keep on changing to break the stability required for tracking;
2. N:1. Many browsers, one configuration. The goal of this strategy is to create collisions between the fingerprints of different browsers. To do so, they apply modifications to the fingerprints of different browsers so that they converge towards a unique fingerprint. Thus, browsers with this fingerprint become more difficult to track since they are not unique anymore.
The N:1 strategy means that DCB can also be partly classified in the set of countermeasures that aim at breaking the uniqueness of browser fingerprints. They proposed the notion of configuration groups to generate consistent fingerprints. While this notion is not well defined in the paper, they provide an example saying that a possible configuration group could consist of users with the same browser, OS and language. Thus, no actual user would have to adopt a configuration that contradicts their real system and browser configuration. In addition to modifying attributes related to the size of the screen or the browser language, they also proposed a particular strategy to randomize the canvas fingerprint. Instead of applying random noise to the canvas, DCB modifies the drawTextInternal function used to render the canvas in a deterministic way. Thus, between two sessions the canvas is different, but for a given session the modifications applied to the canvas are constant. This makes DCB resilient to replay attacks—i.e., attacks where a fingerprinter asks to generate the same canvas multiple times to observe if the values returned are the same. Similarly to Nikiforakis et al. [7], they tested DCB against BlueCava and Coinbase, two commercial fingerprinting scripts, and FingerprintJS, an open-source fingerprinting library. While their 1:N strategy was able to generate more than 99% of unique fingerprints, making it effective against long-term fingerprinting-based tracking, their N:1 strategy that aimed at breaking the uniqueness of browser fingerprints performed worse, since some of the fingerprinters were still able to distinguish different browsers that should have had the same fingerprint.
FPRandom [9] is a modified version of Firefox that adds randomness to the computation of the canvas fingerprint, as well as the audio fingerprint. FPRandom also randomizes the order of the navigator properties. They decided to focus on canvas fingerprinting since it is a strong source of entropy [27, 28]. Moreover, both audio and canvas fingerprints rely on multimedia functions that can be slightly altered without significantly degrading the user experience. Concerning the order of the navigator properties, it is not defined by the ECMAScript specification [67], which means that it is up to the browser vendor to decide the order. FPRandom includes two modes, one in which the noise added to the canvas and the audio fingerprint is different at every call, and a second mode where the noise remains constant during a session. The goal of the second mode is to protect against replay attacks. Contrary to canvas poisoning extensions that apply random noise independently for each pixel, FPRandom adds a more consistent noise. Indeed, they modify the parseColor function of the canvasRenderingContext2D class so that whenever a color is added to the canvas, it is slightly modified. Thus, with their approach, all the shapes of the canvas that use a given color will have the same color.
Canvas Defender [68] is a browser extension available on Chrome and Firefox that adds a uniform noise to a canvas. The first time Canvas Defender is installed, it generates four random numbers corresponding to the noise that will be applied to the red, green, blue and alpha components of each pixel. Thus, since the four random noise numbers are constant, the extension is not vulnerable to replay attacks.
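The following sketch illustrates this strategy: four constant noise values, drawn once, are added to every pixel, so repeated reads of the same canvas stay consistent. It is an assumed reconstruction of the idea, not Canvas Defender's actual source; a complete implementation would hook toDataURL-style reads as well.

```javascript
// Constant per-installation noise for the R, G, B and A channels.
const noise = Array.from({ length: 4 }, () => Math.floor(Math.random() * 10));

const originalGetImageData = CanvasRenderingContext2D.prototype.getImageData;
CanvasRenderingContext2D.prototype.getImageData = function (...args) {
  const imageData = originalGetImageData.apply(this, args);
  const d = imageData.data;                 // RGBA bytes, 4 per pixel
  for (let i = 0; i < d.length; i += 4) {
    d[i]     = (d[i]     + noise[0]) % 256; // red
    d[i + 1] = (d[i + 1] + noise[1]) % 256; // green
    d[i + 2] = (d[i + 2] + noise[2]) % 256; // blue
    d[i + 3] = (d[i + 3] + noise[3]) % 256; // alpha
  }
  return imageData;
};
```

Because the same noise vector is applied on every read, asking the browser to render the same canvas twice returns identical values, which defeats the replay attack described for DCB above.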
Finally, Blink [69] exploits reconfiguration through virtual machines or containers to clone real fingerprints. Contrary to countermeasures that lie about their identity by simply altering the values of the attributes collected, Blink generates virtual environments containing different fonts, plugins and browsers in order to break the stability of fingerprints without introducing inconsistencies. Thus, the main strength of Blink is that, contrary to other countermeasures that lie about the values of the attributes in order to change the fingerprint, Blink does not lie about the fingerprint. Indeed, all the fingerprints it generates are genuine: the modifications are done directly at the virtual machine level, which means that they could exist in the wild. Nevertheless, this approach has a cost: it requires using a virtual machine of multiple gigabytes to browse the web, which makes it less user-friendly than a browser extension or a forked browser. Moreover, since Blink runs in a virtual machine, users can be detected using red pill techniques [70].
2.3.3 Breaking the Uniqueness of Browser Fingerprints
The last defense strategy we present aims at breaking the uniqueness of browser fingerprints so that users become less unique. Contrary to the previous strategies, which aimed at breaking the stability of fingerprints by randomizing different attributes, this strategy aims at increasing the anonymity set of each user. To achieve this goal, one can either block access to attributes with high entropy or spoof their values so that multiple users return the same value.
Starting from version 41,34 Firefox implements anti-fingerprinting features similar to those available in the Tor Browser. Firefox standardizes the values of attributes used for fingerprinting, such as the user agent, the timezone or the number of cores of the CPU. For the user agent, it spoofs the browser version by replacing it with the version of the latest Extended Support Release (ESR) of Firefox. To spoof the timezone, Firefox pretends the user is located in the UTC timezone—i.e., it returns a timezone offset of 0 when new Date().getTimezoneOffset() is called. Concerning the number of cores, it modifies the value of navigator.hardwareConcurrency so that all browsers pretend to have two cores. Besides standardizing the values of attributes, Firefox also blocks access to critical functions used for canvas fingerprinting. Whenever a script tries to access a canvas value using toDataURL or getImageData, Firefox asks the user for permission,35 similarly to what is done with the geolocation API for example. Firefox also blocks access to several APIs, such as WEBGL_debug_renderer_info,36 the geolocation or the device sensors. Moreover, whenever the fingerprinting protection is activated, navigator.plugins and navigator.mimeTypes also return empty arrays.
34https://bugzilla.mozilla.org/show_bug.cgi?id=418986
35https://bugzilla.mozilla.org/show_bug.cgi?id=967895
36https://bugzilla.mozilla.org/show_bug.cgi?id=1337157
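Concretely, a fingerprinting script running with this protection enabled observes the standardized values described above. The short illustration below only restates the behavior documented in this section; it is not an exhaustive list of the protections.

```javascript
// Values observed by a script when Firefox's fingerprinting protection
// is active (illustration of the behavior described above):
console.log(new Date().getTimezoneOffset()); // 0: the browser pretends UTC
console.log(navigator.hardwareConcurrency);  // 2, whatever the real CPU
console.log(navigator.plugins.length);       // 0: empty plugin list
console.log(navigator.mimeTypes.length);     // 0: empty MIME type list
// Reading pixels via toDataURL() or getImageData() triggers a user
// permission prompt before any canvas data is returned.
```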
The Brave browser [53] is a Chromium-based browser oriented towards privacy that proposes specific countermeasures against browser fingerprinting. In particular, it can block attributes identified as having a high entropy, such as audio, canvas and WebGL fingerprinting. It also blocks local IP address leakage through WebRTC and disables the battery API since it can be used for fingerprinting purposes [36]. Moreover, the Brave browser natively integrates an ad-blocker and a tracker blocker that rely on crowdsourced filter lists, as well as a simple mechanism to disable the execution of JavaScript on a page.
Canvas Blocker [71] is a Firefox extension that blocks access to the HTML5 canvas API. Besides blocking, it also provides another mode, similar to Canvas Defender [68], that randomizes the value of a canvas every time it is retrieved. Thus, it can also be classified in the category of countermeasures that act by breaking the stability of fingerprints.
2.3.4 Summary of Existing Countermeasures
Table 2.2 provides an overview of the different countermeasures presented in this section, together with the strategies they use to protect against fingerprinting.
2.3.5 Limits of Fingerprinting Countermeasures
Eckersley discussed the impact of privacy-enhancing technologies on fingerprinting. He considers that some countermeasures can be counterproductive when they can be detected because of their side effects. For example, he noticed users in his dataset who were using Privoxy,37 a privacy-enhancing proxy that aims at blocking advertising and trackers. Privoxy altered the user agent sent by adding the "Privoxy" string to it. Thus, while the original goal of the tool was to protect users against trackers, this feature could be used by fingerprinters to specifically target privacy-aware users. More generally, he argues that whenever a countermeasure has observable side effects, it can be counterproductive if it is used by only a few users.
Nikiforakis et al. [4] also discussed the privacy impact of using simple countermeasures, such as user agent spoofers. They showed that, while the user agent provides information about the user's browser and OS, changing its value can be counterproductive.
37https://www.privoxy.org/
Table 2.2 Overview of the different countermeasures and their strategies to protect against browser fingerprinting.

Countermeasure            | Blocking script | Breaking stability | Unifying
uBlock Origin [49]        | ✓               |                    |
Adblock [52]              | ✓               |                    |
Ghostery [54]             | ✓               |                    |
Privacy Badger [55]       | ✓               |                    |
Adblock Plus [48]         | ✓               |                    |
NoScript [58]             | ✓               |                    |
uMatrix [59]              | ✓               |                    |
Tor browser/Firefox [60]  | ✓               |                    | ✓
Brave browser [53]        | ✓               |                    | ✓
FPGuard [45]              | ✓               | ✓                  |
PriVaricator [7]          |                 | ✓                  |
FP-Block [66]             |                 | ✓                  |
DCB [8]                   |                 | ✓                  | ✓
Canvas Defender [68]      |                 | ✓                  |
Random Agent Spoofer [63] | ✓               | ✓                  |
Canvas Blocker [71]       | ✓               | ✓                  |
Blink [69]                |                 | ✓                  |
FP-Random [9]             |                 | ✓                  |
Ultimate User Agent [62]  |                 | ✓                  |
Indeed, naive spoofers tend to generate inconsistent fingerprints—i.e., combinations of attributes that cannot be found in the wild. For example, the user agent may claim the browser runs on MacOS while the navigator.platform attribute returns Linux. Thus, they argue that using such countermeasures for privacy purposes can be counterproductive, since fingerprinters can more easily target users with this kind of extension. They also showed that it is unclear how effective user agent spoofers are at breaking the stability of fingerprints. Similarly to Mulazzani et al. [37], they showed that, using different techniques such as the presence or absence of different browser features, or the behavior (mutability and order) of special browser built-in objects such as navigator or screen, it was possible to infer the real browser of the user.
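A minimal sketch of this inference is shown below. The property choices are illustrative assumptions; real scripts test many more features and object behaviors.

```javascript
// Detect a spoofed user agent by comparing the claimed browser with
// feature-detected evidence (illustrative properties, not exhaustive).
const claimsChrome = navigator.userAgent.includes('Chrome');
// mozInnerScreenX is Firefox-specific; window.chrome is Chromium-specific.
const hasFirefoxFeature = typeof window.mozInnerScreenX !== 'undefined';
const hasChromeObject = typeof window.chrome !== 'undefined';

if (claimsChrome && hasFirefoxFeature && !hasChromeObject) {
  console.log('Inconsistency: user agent claims Chrome, behavior says Firefox');
}
```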
Acar et al. [10] studied the Tor browser protection against fingerprinting. They showed the difficulty of properly protecting against all fingerprinting channels. Indeed, while the Tor browser aimed at protecting against font enumeration by limiting the number of fonts that can be queried by a page, it was still possible to test the presence of fonts without any limit using the CSS font-face directive. They also studied Fireglove, a browser extension created for research purposes. Fireglove also aimed at protecting against font enumeration, by limiting the number of fonts that can be loaded by a tab and reporting wrong offsetHeight/Width values. Nevertheless, they could bypass the protection using the getBoundingClientRect function, which provides similar information as offsetHeight/Width. Fireglove also randomized the value of different attributes, such as the screen resolution or the user agent. Nevertheless, Acar et al. showed that Fireglove failed at properly lying about the real nature of the browser. For example, when it pretended to be a Chrome browser, it did not remove APIs only available in Firefox, such as navigator.mozCameras.
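The bypass can be sketched as follows: a font probe that never touches the spoofed offsetHeight/Width properties. The probe string and the hypothetical font name are illustrative, not taken from the original study.

```javascript
// Measure a text element with getBoundingClientRect(), which returns the
// same size information as offsetWidth/offsetHeight.
function measure(fontFamily) {
  const span = document.createElement('span');
  span.style.fontFamily = fontFamily;
  span.style.fontSize = '72px';
  span.textContent = 'mmmmmmmmmmlli';
  document.body.appendChild(span);
  const rect = span.getBoundingClientRect();
  document.body.removeChild(span);
  return { width: rect.width, height: rect.height };
}

// A font is present if text rendered with it differs in size from the
// fallback font ('monospace' is an arbitrary baseline; 'SomeFontToTest'
// is a hypothetical font name).
const baseline = measure('monospace');
const probe = measure('SomeFontToTest, monospace');
const fontPresent = probe.width !== baseline.width ||
                    probe.height !== baseline.height;
```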
More recently, in 2019, Schwarz et al. [72] proposed an approach to automatically learn the browser fingerprint differences between browsers. While their approach can be used to target exploits that only work on a specific OS, architecture or browser, it can also be used to detect the presence of privacy-enhancing extensions. They applied their automated approach to 6 privacy extensions and were able to detect all of them. In particular, similarly to the evaluation we conduct in Chapter 4, they were able to detect the presence of the Canvas Defender countermeasure because of its side effects.
2.4 Security Applications
In this section, we first present approaches that leverage browser fingerprinting in a security context. We show how it can be used to enhance authentication and to detect crawlers. Then, we present other non-fingerprinting crawler detection techniques to better understand how fingerprinting-based detection compares to them.
2.4.1 Enhancing Web Security Using Browser Fingerprinting
2.4.1.1 Enhancing Authentication Using Fingerprinting
Different studies focused on the use of browser fingerprinting to enhance the security of HTTP sessions and authentication, either as a second factor or in addition to other traditional second factors, such as SMS or emails. The idea is, for each user, to collect the browser fingerprints of her trusted devices. Then, whenever she tries to connect to a website, the server can verify both the correctness of her password and whether or not the device she is trying to connect from belongs to her list of trusted devices.
Unger et al. [11] proposed to use fingerprinting to protect against HTTP(S) session hijacking. Their framework, SHPF, collects the browser fingerprint when the user logs in. The fingerprint collected is composed of HTTP headers, as well as features corresponding to the presence or absence of several CSS and HTML5 features whose support was not widespread at the time the paper was written, in 2013. Then, continuously during a session, their framework collects and monitors the browser fingerprint to detect changes that could indicate that the session has been hijacked by an attacker, for example because the session cookie has been stolen using an XSS vulnerability [73]. Thus, if the fingerprint changes, going from Chrome 65 on Windows 10 to Firefox 55 on Ubuntu for example, or if the IP address suddenly changes, SHPF interrupts the session and redirects the user to a stronger authentication mechanism.
Preuveneers et al. [14] developed an authentication framework, SmartAuth, that uses dynamic context fingerprinting to enhance authentication. Contrary to Unger's [11] approach, which considers only static fingerprints whose values do not depend on any kind of context, SmartAuth takes different contextual information into account, such as the geolocation or the time. For example, when a user is connecting at work, she may use a monitor with a certain resolution, whereas when she comes home, she may use another monitor with a different resolution. Thus, their approach aims at taking this context into account to provide better security guarantees, as well as fewer false positives during the authentication process. The fingerprints they collect are composed of different attributes accessible using JavaScript, such as the user language, the screen resolution or the list of plugins. They also collect the IP address range and information provided by HTTP headers. They acknowledged one of the problems with using fingerprinting in a security context: fingerprint attribute spoofing. Since fingerprints are collected in the browser—i.e., on the client side—every attribute can be modified by an attacker, either when it is transmitted to the server or at runtime by overriding the JavaScript getters and functions used in the fingerprinting process. Thus, they added a checksum mechanism to ensure the fingerprint collected has not been modified during its transmission to the server. They also acknowledged the possibility of an attacker stealing a browser fingerprint in addition to the password, for example using a phishing website; the attacker could then try to replay the stolen fingerprint in order to bypass the security mechanisms. To protect against replay attacks, they added a counter whose value is incremented during the user session, as well as a timestamp whose value they verify on the server side whenever a fingerprint is collected. Thus, if a fingerprint corresponds to a non-trusted device, or if they detect that it has been tampered with, they redirect the user to a stronger fallback authentication mechanism. They evaluated their approach using more than 2,000 different system and network configurations in 10 different contexts and were able to achieve a 99% accuracy using only 10 fingerprinting attributes.
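A minimal server-side sketch of these replay protections follows. The names, the Map-based storage and the 5-minute freshness window are assumptions for illustration, not SmartAuth's actual implementation.

```javascript
// Reject replayed fingerprints using a monotonically increasing session
// counter and a server-verified timestamp.
const lastCounterPerDevice = new Map();

function acceptFingerprint(deviceId, fp) {
  // Reject stale or future-dated fingerprints (assumed 5-minute window).
  const age = Date.now() - fp.timestamp;
  if (age < 0 || age > 5 * 60 * 1000) return false;

  // Reject replays: the counter must strictly increase per device.
  const last = lastCounterPerDevice.get(deviceId) ?? -1;
  if (fp.counter <= last) return false;

  lastCounterPerDevice.set(deviceId, fp.counter);
  return true; // continue with checksum and trusted-device verification
}
```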
Alaca et al. [12] analyzed the strengths and weaknesses of several fingerprinting attributes when used in an authentication context. Contrary to tracking, when fingerprinting is used in a security context, it is important to consider the fact that an attacker may try to spoof fingerprint attributes or replay fingerprints. They analyzed 29 attributes, ranging from simple attributes provided by the browser using JavaScript to more complex attributes, such as TCP stack fingerprinting. For each of these attributes, they analyzed five important properties to consider when fingerprinting is used in a security context:
1. Entropy. Attributes with high entropy make it possible to distinguish between different users and thus reduce the chance of fingerprint collisions that could be exploited by an attacker to connect from an untrusted device;
2. Repeatability. Given the same browser, hardware and network configuration, does an attribute keep the same value?
3. Resource consumption. The CPU and memory overhead required to obtain the value of an attribute. While they consider that a high overhead is not necessarily a problem for authentication, since each attribute is collected only once, this property is important when fingerprinting is used continuously to protect against session hijacking [11];
4. Spoofing resistance. While some attributes, such as the IP address, are difficult to spoof, this is not the case for simple attributes, such as the user agent, which can be overridden using a few lines of JavaScript (see the sketch after this list) or a browser extension;
5. Stability. The attribute should be stable over time in order to link different fingerprints of the same browser.
Van Goethem et al. [13] focused on authentication on mobile devices. They proposed an approach that does not require any permission and that works in a browser. To build a fingerprint, they use data provided by the accelerometer sensor while making the phone vibrate using the navigator.vibrate API. They consider the accelerometer a good candidate for authentication since it enables an approach that is more resilient against replay and spoofing attacks. When a user registers, their fingerprinting script collects several traces of accelerometer data. Each trace is constituted of multiple chunks that correspond to varying periods of time during which the phone is vibrating. For each chunk, they collect the information returned by the accelerometer and extract features, such as the minimum, maximum and mean acceleration along the different axes. Then, when a user tries to connect to her account, they send her a challenge that consists in generating randomly selected chunks. If enough chunks are consistent, according to a determined threshold, the user is allowed to connect. They evaluated the uniqueness and stability of the different features they proposed using three different mobile devices. They showed that accelerometer data was both unique and stable; in particular, short chunks contained more entropy than long chunks. They evaluated their approach using 15 different devices and only one fingerprint was classified as accepted even though it should have been rejected.
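A rough sketch of the collection step is given below. The vibration pattern, timing and feature set are illustrative assumptions, not the authors' actual code.

```javascript
// Vibrate the phone while recording accelerometer readings, then
// summarize them with simple statistical features.
const readings = [];
window.addEventListener('devicemotion', (e) => {
  if (!e.acceleration || e.acceleration.x === null) return;
  readings.push(e.acceleration.x); // y and z are collected the same way
});

navigator.vibrate([200, 100, 300]); // chunks of varying vibration time

setTimeout(() => {
  const features = {
    min: Math.min(...readings),
    max: Math.max(...readings),
    mean: readings.reduce((a, b) => a + b, 0) / readings.length,
  };
  // Sent to the server and compared against the traces collected at
  // registration time.
  console.log(features);
}, 700);
```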
In his thesis, Laperdrix [74] proposed to use canvas fingerprinting in a challenge-response protocol as a second-factor authentication mechanism. The approach relies on canvas rendering because of its unpredictable yet stable nature. Contrary to other static fingerprinting attributes studied in the literature, it is less vulnerable to spoofing and replay attacks when used properly. The first time a user connects to her account, the server sends her a challenge that consists in drawing a canvas on which several random operations are applied, such as writing text or drawing geometric shapes. Once rendered, the browser sends the result of the canvas rendering to the server, which stores it. The next time the user tries to connect, the server sends her two different challenges. The first challenge is the same as the one used for the previous connection. After the first canvas is drawn, the browser sends the value back to the server, which verifies that it is equal to the canvas stored at the previous connection. If this is the case, the user can connect to her account; otherwise, the server may ask the user to verify her identity using another mechanism, such as an SMS or an email. The browser also solves the second challenge and sends the value back to the server, which stores it. Thus, the value of the second canvas will be used for the next connection of the user. Since canvas challenges are not reused, even if an attacker was able to steal a canvas value from her victim to replay it, she would not be able to solve the canvas rendering challenge because it will have changed. For this approach to work, they need, given a certain random seed, to generate unique and stable canvases. They ran three collection phases during which they collected different kinds of canvas fingerprints that use different canvas primitives, such as drawing text or curves. Their final canvas algorithm takes a seed as input and randomizes different parameters, such as the number of strings to draw, their size and their rotation, as well as the curves, color gradient and shadow applied to the canvas. They verified the uniqueness of their canvas challenges on more than 1 million browsers using different seeds and did not observe any collision. They also used the AmIUnique extension38 to study the stability of their canvases and showed that, of 27 browsers monitored for a year, half had fewer than three canvas changes. Concerning performance, they showed that, from end to end—considering the generation of the challenge, the time for the browser to solve it by drawing the canvas, and hashing the result—it takes less than 250 ms on average.
38https://amiunique.org/timeline
2.4.1.2 Detecting Bots and Crawlers Using Fingerprinting
Another security application of browser fingerprinting is bot and crawler detection. While Acar et al. [10] identified scripts from companies that claimed to use it for bot and crawler detection, few studies have focused on this topic.
Bursztein et al. [15] proposed Picasso, an approach that relies on canvas fingerprinting [38] to create dynamic challenges that aim at detecting emulated devices or devices that lie in their user agent. Their approach has several applications, such as distinguishing between real Android devices used by humans and emulated devices used to post fake reviews on app stores or to artificially inflate the number of video views. Similarly to the approach proposed in Laperdrix's thesis [74], Picasso is a challenge-response protocol that relies on the unpredictable yet stable nature of canvas rendering. Their approach aims at being resilient to replay attacks and to skilled adversaries with perfect knowledge of the code used for the detection. Picasso relies on drawing random canvas primitives, such as Bezier curves, polynomial curves or text, with random parameters. Contrary to situations where canvas is used for tracking, the generated canvas should be the same among devices and browsers of the same nature, but different otherwise. Their approach works as follows (a sketch of the client side is given after the list):
1. A server sends the canvas algorithm code, along with a random seed used to initialize a pseudo-random number generator (PRNG), as well as a number of rounds—i.e., the number of canvas primitives to draw;
2. The client initializes the PRNG and creates an empty canvas;
3. The client executes the challenge: at each round, it randomly selects a canvas primitive with random parameters and applies a shadow and color gradient to the canvas;
4. At the end of each round, the value of the canvas is hashed with the output of the previous round;
5. The final result is sent back to the server for verification.
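The client side of this protocol can be sketched as follows. The primitive set, canvas size and hash routine are illustrative assumptions, not Picasso's actual code.

```javascript
// Client side: seeded drawing rounds chained through a hash (steps 2-5).
async function solveChallenge(seed, rounds) {
  const prng = mulberry32(seed);                   // step 2: seeded PRNG
  const canvas = document.createElement('canvas');
  canvas.width = 300; canvas.height = 150;
  const ctx = canvas.getContext('2d');
  let response = '';
  for (let round = 0; round < rounds; round++) {   // step 3: random primitive
    ctx.strokeStyle = `rgb(${prng() * 255 | 0}, ${prng() * 255 | 0}, 0)`;
    ctx.beginPath();
    ctx.arc(prng() * 300, prng() * 150, prng() * 50 + 1, 0, 2 * Math.PI);
    ctx.stroke();
    // step 4: hash the canvas content with the previous round's output
    response = await sha256(response + canvas.toDataURL());
  }
  return response;                                 // step 5: sent to server
}

function mulberry32(a) {                           // small seedable PRNG
  return function () {
    var t = a += 0x6D2B79F5;
    t = Math.imul(t ^ t >>> 15, t | 1);
    t ^= t + Math.imul(t ^ t >>> 7, t | 61);
    return ((t ^ t >>> 14) >>> 0) / 4294967296;
  };
}

async function sha256(text) {                      // hex digest via Web Crypto
  const buf = await crypto.subtle.digest('SHA-256',
    new TextEncoder().encode(text));
  return [...new Uint8Array(buf)]
    .map(b => b.toString(16).padStart(2, '0')).join('');
}
```

Because the rendering depends on the GPU, drivers and browser, two devices of the same class produce the same response, while an emulator produces a different one.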
Their challenge can be seen as a proof of work since it requires the device to spend CPU or GPU resources, as well as RAM, to solve it. Moreover, the amount of resources needed to solve the challenge can easily be increased, either by increasing the size of the canvas or the number of rounds. A challenging part of their approach is the bootstrap phase. During this phase, they need to collect canvas values for different seeds and different classes of devices. To address this problem, they proposed several solutions, such as sending challenges to trusted devices, e.g. devices belonging to users logged in to the website and with a good reputation, or buying missing devices, which can be expensive.
To protect against replay attacks, or against attackers that would pre-compute canvas values for several challenges using other devices, they continuously generate new challenges. Their idea is similar to what was done in the first versions of Google reCAPTCHA, which displayed two words in the CAPTCHA: a known word and an unknown word. In their case, they send a challenge whose response they know, along with multiple challenges whose responses they do not know. If the known challenge is solved correctly, they consider the responses to all the other unknown challenges to be correct. Thus, they can continuously obtain values associated with new challenges. To protect against a pollution attack—i.e., an attacker that would submit the right answer to the known challenge but wrong answers to the unknown challenges—they apply techniques similar to those used to protect against cheating in user-generated labels [75].
They evaluated their approach on more than 52 million devices and were able to distinguish all device classes with 100% accuracy. In total, they obtained around 130K unique responses from 52M challenges. They showed that the time to generate a canvas is linear with respect to the number of rounds: when 50 rounds were used, it required at most 400 ms. Anecdotally, during their experiment, they were able to detect PhantomJS browsers that spoofed their browser and OS, running attacks from AWS EC2 instances.
2.4.2 Detecting Bots and Crawlers Without Fingerprinting
HTTP headers and traffic shape analysis.
Jacob et al. [76] proposed a system to manage unwanted crawlers. Their system builds a knowledge base of IP addresses that indicates if they belong to a human or to a crawler. When a request is made, they consult the knowledge base. If the IP address belongs to a human or to an allowed crawler, e.g. the Google bot, the request is allowed. Otherwise, if the request belongs to an unwanted crawler, it is blocked. When the IP address is not in the knowledge base, they need to determine whether it belongs to a human or a crawler. To do so, they proposed three approaches:
1. Heuristic-based detection. Their first approach leverages a set of heuristics that rely on features extracted from the HTTP headers, as well as the URLs requested by the user. Some of these features had already been proposed in the state of the art [77–82], such as the error rate, the proportion of pages revisited or whether or not the user ignores cookies. They also proposed new features, such as low URL parameter usage or whether or not URLs are accessed in alphabetical order. If the majority of the heuristics consider the user to be a human, then the IP address is whitelisted in the knowledge base. Otherwise, they consider the IP address is used by a crawler and it is blacklisted;
2. Traffic shape detection. Their second approach models the traffic of a user as a time series. It aims at being more effective than other state-of-the-art approaches that divide chunks of requests into sessions from which they extract features, such as the mean arrival time or the number of requests. In their case, the goal is to extract features from the user's time series to classify the traffic. First, they compute the sample auto-correlation function (SAC), a statistical function that estimates the stability of the traffic for an IP address (a sketch of this computation follows the list). They extract features from this function, such as the speed of decay, which captures the short-term stability of the traffic. For example, a slow decay is often linked to non-human activity, since it indicates the traffic is stable over a long period of time, contrary to a fast decay, which indicates more instability, often linked to human activity. They also extract the number of local spikes in the function, as well as whether or not these spikes are observed at defined lags, such as half days or days, which are often signs of human activity. Second, they use time-series decomposition [83]—i.e., a process that consists in decomposing a time series into a trend, a season and a noise component. Since naive crawlers tend to have a stable activity, the trend component should be almost stable—i.e., the derivative of the trend component should be almost 0—contrary to users, who have more erratic patterns. Concerning seasonality, humans tend to have patterns at day and week frequencies, contrary to crawlers, which do not follow human cycles. To classify the traffic as human or crawler, they proposed three machine learning classifiers, each using features extracted from the SAC function and from the time-series decomposition. The final result is the majority vote of the three classifiers.
3. Distributed crawlers. Their third approach enables the detection of crawling campaigns distributed over multiple hosts. To detect such campaigns, they apply incremental clustering techniques on the time series. The idea is to cluster time series with high similarity, which could represent crawlers with the same code launched from different IP addresses. Since crawlers are not launched at the exact same time, their clustering approach aims at being resilient to translations in the time series.
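As a concrete illustration of the SAC used in the second approach, the following minimal sketch computes the sample autocorrelation of a per-interval request-count series at a given lag. The hourly series and lag value are illustrative.

```javascript
// Sample autocorrelation of a request-count series at a given lag.
function sampleAutocorrelation(series, lag) {
  const n = series.length;
  const mean = series.reduce((a, b) => a + b, 0) / n;
  let numerator = 0, denominator = 0;
  for (let t = 0; t < n; t++) {
    denominator += (series[t] - mean) ** 2;
    if (t + lag < n) {
      numerator += (series[t] - mean) * (series[t + lag] - mean);
    }
  }
  return numerator / denominator;
}

// A slowly decaying SAC (high values at large lags) suggests stable,
// non-human traffic; spikes at lags of half a day or a day suggest
// human activity.
const requestsPerHour = [12, 15, 3, 0, 0, 1, 18, 22 /* ... */];
console.log(sampleAutocorrelation(requestsPerHour, 2));
```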
All three of their approaches require a significant amount of data before they can accurately classify the traffic. Thus, if an IP address is not yet in the knowledge base, they need to obtain data about it. To address this problem, they allow a number of requests for each IP address not in the knowledge base. Once the IP address has reached the limit, their system sends crawler traps, such as CAPTCHAs or hidden links.
They implemented their approach and evaluated it on a large social network, using 10 days of real-world traffic taken from the server logs. They excluded users who had fewer than 1,000 requests per day, since they consider this is not enough data for their second and third approaches based on time-series analysis. Nevertheless, they consider these sources will be handled by their active containment policy. For their evaluation, they needed to obtain the ground truth—i.e., whether an IP address belongs to a human or a crawler. To obtain these labels, they used a semi-automatic approach: first, they applied their approach based on HTTP headers and asked for feedback from the social network engineers; based on this feedback, they adjusted the labels. They then performed a manual analysis of the time series and user agent strings to further adjust the labels.
In order to choose the different parameters of their heuristics and train their machine learning models, they used a training set of more than 70M requests from 813 IP addresses. Then, they evaluated their three approaches on a test set of 62M requests from 763 IP addresses. Their first approach, which relies on a set of heuristics based on HTTP headers and URLs, achieved a detection rate of 71.6%. Their second approach, based on traffic shape analysis, achieved an accuracy of 94.89%. Finally, their third approach, which aims at detecting distributed crawling campaigns, achieved an accuracy of 91.89%. Combining the first two approaches did not improve the accuracy, since crawlers detected by the HTTP headers and URL-based heuristics were already detected by the traffic shape detection approach.
CAPTCHAs. CAPTCHAs [84] rely on Turing tests, such as image recognition, to determine if a user is human. Figure 2.5 shows an example of Google reCAPTCHA, a popular CAPTCHA service. While the use of CAPTCHAs is widespread, their main drawback is that they require user interaction. Moreover, recent progress in image and audio recognition, as well as crowdsourcing services [85, 86], have made it easier to break popular CAPTCHA services, such as Google's reCAPTCHA [87, 88].
Behavior biometrics. Chu et al. [89] leverage behavioral biometrics to detect bots that post spam in the comment sections of blogs. Their hypothesis is that human users need their mouse to navigate on a blog and their keyboard to type comments. They ran customized bots in a controlled environment to collect events, such as keystrokes or clicks, and trained a decision tree that relies on the collected features to predict if a user trying to post a comment is a human or a bot.
Video games & social networks. The problem of bot detection has also been studied in the context of video games [90, 91] and social networks, such as Twitter [92, 93]. Wang et al. [94] addressed the problem of detecting fake accounts used to send spam or spread malware via social media. They trained an SVM classifier using features
Figure 2.5 Example of a Google reCAPTCHA.
extracted from clickstreams—i.e., sequences of HTTP requests made by a user. They also proposed an unsupervised approach that requires less labeled data than the state of the art. Nevertheless, this problem is different from crawler detection, since social media users are logged in. Moreover, contrary to crawlers, which are not human by definition, fake accounts may be operated by humans.
Detecting virtual machines. Another strategy to detect crawlers is to look for features correlated with crawling, for example, using an IP address blacklist [95] to detect if a user is connected through a public proxy, or if the requests are coming from a cloud provider, such as AWS or Azure. Ho et al. [70] proposed several red pills that rely on the time needed to execute different operations, such as spawning web workers, to detect browsers running in virtual machines, which may also indicate the presence of a non-human user.
2.5 Conclusion
This thesis provides three main contributions to better understand fingerprinting in the contexts of tracking and security. My first contribution aims at providing a better measurement of the stability of browser fingerprints, as well as evaluating how long users can be tracked using only their browser fingerprints.
Through my second contribution, I evaluate the privacy implications of using fingerprinting countermeasures. I show that even countermeasures that claim to generate consistent fingerprints can be detected, which can harm user privacy. Based on these findings, I provide recommendations to build more effective fingerprinting countermeasures.
While several studies measured the use of fingerprinting on the web as a tracking mechanism, none of them focused on its use for bot detection. Through my third and last contribution, I aim to fill this gap by measuring the fraction of websites that use fingerprinting for crawler detection, describing the techniques they use and evaluating their resilience against adversarial attackers.
2.5.1 FP-Stalker: Tracking Browser Fingerprint Evolutions
Through my first contribution, I aim to provide better measurements of browser fingerprint stability and of how long browsers can be tracked using only their fingerprint. In 2010, Eckersley [3] measured fingerprint stability and how it could be used for tracking, using 470,161 fingerprints collected on the Panopticlick website. Eckersley acknowledged two main biases in the dataset collected:
First, users coming to Panopticlick are aware their fingerprint is collected and may be tempted to change it purposefully, e.g. by removing a plugin or adding a language, to see how it impacts their uniqueness;
Secondly, the Panopticlick website is likely to be visited by users with privacy countermeasures, in particular fingerprinting countermeasures, such as user agent spoofers, that artificially modify fingerprints.
These biases can impact the accuracy of the fingerprint stability and tracking measurements. Moreover, since their study was conducted, new fingerprinting attributes that rely on HTML5 features have been proposed. Thus, I conduct a large-scale study that aims to minimize these measurement biases by leveraging browser extensions over long periods of time, and that analyzes the most recent fingerprinting attributes. In particular, I study the stability of new techniques with high entropy, such as canvas fingerprinting, that did not exist when Eckersley's study was conducted. Moreover, no study has measured how the uniqueness and the stability of browser fingerprints translate into tracking duration. Thus, I address this issue by measuring how long browsers can be tracked using only their fingerprints.
FP-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies. My second contribution aims at better understanding the privacy implications of using browser fingerprinting countermeasures. Nikiforakis et al. [4] and Acar et al. [10] showed that countermeasures, such as user agent spoofers, can be detected because they generate inconsistent fingerprints, which can harm user privacy. Nevertheless, since these two studies were published, several countermeasures that claim to generate consistent fingerprints have been developed [9, 45, 66], but the consistency of the fingerprints they generate has not been properly evaluated. Moreover, during the crawls conducted in this thesis, I encountered fingerprinting scripts from a commercial tracking company, Augur [96], and discovered the presence of multiple tests that aim at detecting the presence of fingerprinting countermeasures. This finding shows the need to properly evaluate fingerprinting countermeasures, since commercial fingerprinters may try to detect their presence.
To better evaluate these countermeasures, I propose to extend the notion of fingerprint inconsistency introduced by Nikiforakis et al. to more advanced countermeasures, such as canvas poisoners. I design a test suite that leverages these inconsistencies and evaluate how existing countermeasures can be detected, as well as the impact of this detection on user privacy. I also evaluate how existing fingerprinting solutions, either open-source or commercial, perform at detecting these countermeasures. Finally, I address the lack of recommendations for countermeasure developers by providing good practices for developing fingerprinting countermeasures so that they do not end up being counterproductive.
FP-Crawlers: Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers. In 2013, Acar et al. [10] crawled 100,000 websites and showed that a significant fraction of fingerprinting scripts was used for security purposes, in particular crawler and bot detection. While several studies [10, 4, 6] focus on the techniques used by commercial fingerprinters for tracking, none investigates how fingerprinting can be used to detect bots.
Different non-fingerprinting-based techniques have been proposed for crawler detection. One of the most popular, CAPTCHA [84], relies on Turing tests, such as image or audio recognition, to detect if a user is human or not. However, with recent progress in automatic image and audio recognition, as well as different services that propose to solve CAPTCHAs for money [87, 85, 86], CAPTCHAs can easily be bypassed [88]. Other techniques rely on the analysis of the sequence of requests sent by the web client. Traditional rate-limiting techniques [97–99, 94] analyze features, such as the number of requests or the number of pages loaded, to classify the web client as a human or a malicious crawler. More advanced techniques [76] extract features from time series representing the data sent to a website in order to identify if the traffic originates from a human user or a crawler.
However, I argue that browser fingerprinting can address some of the weaknesses of state-of-the-art crawler detection techniques:
Contrary to CAPTCHAs, browser fingerprinting does not require any user interaction;
Contrary to methods based on HTTP requests or time-series analysis [76], fingerprinting requires a single request to decide whether or not a client is a crawler.
In this thesis, I aim to improve the understanding of the use of browser fingerprinting as an additional layer for crawler detection. First, I measure its use among popular websites of the Alexa Top 10K and analyze the different techniques used by commercial fingerprinters to detect crawlers. I explain how fingerprinting techniques for crawler detection differ from techniques used for tracking. Finally, one of the main challenges in using fingerprinting in a security context lies in the fact that fingerprints are collected on the client side and can, therefore, be modified by an attacker. Thus, I evaluate the resilience of current fingerprinting techniques against an adversarial crawler developer.
Part III
Contributions
Chapter 3
Fp-Stalker: Tracking Browser
Fingerprint Evolutions
The majority of the browser fingerprinting literature has focused on fingerprint uniqueness, a condition required for tracking. However, fingerprint uniqueness, by itself, is insufficient for tracking if fingerprints change frequently. Indeed, one needs to keep track of these evolutions to link them to previous fingerprints belonging to the same browser. In this chapter, I aim at measuring how vulnerable browsers are to fingerprinting-based tracking by measuring how long they can be tracked using solely their fingerprint. This chapter extends my Fp-Stalker paper published at S&P 18 [16]. I conduct an analysis on a dataset that contains more than 25,000 new fingerprints, which results in a dataset of 122,350 fingerprints from 2,346 browsers. First, in Section 3.1, I describe our dataset and highlight the limits of browser fingerprint uniqueness for tracking purposes by showing that fingerprints change frequently (around 50% of browser instances changed their fingerprints in less than 5 days, 70% in less than 10 days). Then, in Section 3.2, I propose Fp-Stalker, a novel algorithm that detects if two fingerprints originate from the same browser instance, which refers to an installation of a browser on a device. However, browser instances change over time, e.g. they are updated or configured differently, causing their fingerprints to evolve. Therefore, I introduce two variants of Fp-Stalker: a rule-based variant and a hybrid variant, which leverages both rules and a random forest. In Section 3.3, I evaluate my approach using the 122,350 browser fingerprints originating from 2,346 browser instances, which we collected over two years. The fingerprints were collected using two browser extensions advertised on the AmIUnique website, one for Firefox and the other for Chrome. I compare both variants of Fp-Stalker with an implementation of the algorithm proposed by Eckersley. In my experiments, I evaluate Fp-Stalker's ability to correctly link browser fingerprints originating from the same browser instance, as well as its ability to detect fingerprints that originate from unknown browser instances. I show that Fp-Stalker can link, on average, fingerprints from a given browser instance for more than 49.3 days, which represents an improvement of 34 days compared to the closest algorithm from the literature. I also discuss the impact of my findings, in particular when browser fingerprinting is used in addition to other stateful tracking techniques, such as cookies or ETags. Finally, I conclude this chapter in Section 3.4.
3.1 Browser Fingerprint Evolutions
This chapter focuses on the linkability of browser fingerprint evolutions over time. Using fingerprinting as a long-term tracking technique requires not only obtaining unique browser fingerprints, but also linking fingerprints that originate from the same browser instance. Most of the literature has focused on studying or increasing fingerprint uniqueness [1, 3, 27]. While uniqueness is a critical property of fingerprints, it is also critical to understand fingerprint evolution to build an effective tracking technique. Our study provides more insights into browser fingerprint evolution in order to demonstrate the effectiveness of such a tracking technique.
Input dataset
The raw input dataset we collected contains 199,909 fingerprints obtained from 8,898 different browser instances. All browser fingerprints were obtained from the AmIUnique extensions for Chrome and Firefox, installed from July 2015 to early October 2017 by the 8,898 participants in this study. The extensions load a page in the background that fingerprints the browser. Compared to a fingerprinting website, the only additional information we collect is a unique identifier we generate per browser instance when the extension is installed. This serves to establish the ground truth. Moreover, we pre-process the raw dataset by applying the following rules:
1. We remove browser instances with fewer than 7 browser fingerprints. This is because, to study the ability to track browsers, we need browser instances that have been fingerprinted multiple times;
2. We discard browser instances with inconsistent fingerprints due to the use of countermeasures that artificially alter the fingerprints. To know if a user installed such a countermeasure, we check if the browser or OS changes and we check that the attributes are consistent among themselves. Although countermeasures exist in the wild, they are used by a minority of users and, we argue, should be treated by a separate specialized anti-spoofing algorithm. We leave this task for future work.
Figure 3.1 Number of fingerprints and distinct browser instances per month
After applying these rules, we obtain a final dataset of 122,350 fingerprints from 2,346 browser instances. All following graphs and statistics are based on this final dataset. Figure 3.1 presents the number of fingerprints and distinct browser instances per month over the two-year period. The decrease in October 2017 is caused by the fact that we collected fingerprints until October 6th.
Most users heard of our extensions through posts published on popular tech websites, such as Reddit, Hacker News or Slashdot. Users install the extension to visualize the evolution of their browser fingerprints over a long period of time, and also to help researchers understand browser fingerprinting in order to design better countermeasures. We explicitly state the purpose of the extension and the fact that it collects their browser fingerprints. Moreover, we received approval from the Institutional Review Board (IRB) of our research center for the collection as well as the storage of these browser fingerprints. As ground truth, the extension generates a unique identifier per browser instance. The identifier is attached to all fingerprints, which are automatically sent every 4 hours. In this study, the browser fingerprints we consider are composed of the standard attributes described in Table 3.1.
Table 3.1 An example of a browser fingerprint collected by the AmIUnique extension.

Attribute      | Source      | Value Examples
Accept         | HTTP header | text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Connection     | HTTP header | close
Encoding       | HTTP header | gzip, deflate, sdch, br
Headers        | HTTP header | Connection Accept X-Real-IP DNT Cookie Accept-Language Accept-Encoding User-Agent Host
Languages      | HTTP header | en-US,en;q=0.8,es;q=0.6
User-agent     | HTTP header | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36
Canvas         | JavaScript  | (rendered image)
Cookies        | JavaScript  | yes
Do not track   | JavaScript  | yes
Local storage  | JavaScript  | no
Platform       | JavaScript  | MacIntel
Plugins        | JavaScript  | Plugin 0: Chrome PDF Viewer; ; mhiehjai. Plugin 1: Chrome PDF Viewer; Portable Document Format; internal-pdf-viewer. Plugin 2: Native Client; ; internal-nacl-plugin.
Resolution     | JavaScript  | 2560x1440x24
Timezone       | JavaScript  | -180
WebGL          | JavaScript  | NVIDIA GeForce GTX 750 Series; Microsoft .
Fonts          | Flash       | List of fonts installed on the device
Figure 3.2 Browser fingerprint anonymity set sizes
Figure 3.2 illustrates the anonymity set sizes against the number of participants involved in this study. The long tail reflects that 95% of the browser fingerprints are unique among all the participants and belong to a single browser instance, while only 20 browser fingerprints are shared by more than 5 browser instances.
Evolution triggers
Browser fingerprints naturally evolve for several reasons. We identified the following categories of changes:
Automatic evolutions happen automatically and without direct user intervention. This is mostly due to automatic software upgrades, such as the upgrade of a browser or a plugin, that may impact the user agent or the list of plugins;
Context-dependent evolutions are caused by changes in the user's context. Some attributes, such as resolution or timezone, are indirectly impacted by a contextual change, such as connecting a computer to an external screen or traveling to a different timezone; and
User-triggered evolutions require an action from the user. They concern configuration-specific attributes, such as cookies, do not track or local storage.
To know how long attributes remain constant and if their stability depends on the browser instance, we compute the average time, per browser instance, during which each attribute does not change.
Table 3.2 Durations the attributes remained constant for the median, the 90th and the 95th percentiles.

                      |                | Percentile (days)
Attribute             | Trigger        | 50th  | 90th  | 95th
Resolution            | Context        | Never | 3.0   | 1.8
User agent            | Automatic      | 31.9  | 11.7  | 7.4
Plugins               | Automatic/User | 42.6  | 12.9  | 8.8
Fonts                 | Automatic      | Never | 15.8  | 6.4
Headers               | Automatic      | 371.2 | 33.0  | 13.9
Canvas                | Automatic      | 306.9 | 36.5  | 17.8
Major browser version | Automatic      | 49.8  | 30.8  | 20.0
Timezone              | Context        | 291.8 | 58.6  | 30.1
Renderer              | Automatic      | Never | 85.2  | 33.7
Vendor                | Automatic      | Never | 146.1 | 56.5
Language              | User           | Never | 155.6 | 55.6
Dnt                   | User           | Never | 203.7 | 58.6
Encoding              | Automatic      | Never | 124.1 | 70.4
Accept                | Automatic      | Never | 194.6 | 128.5
Local storage         | User           | Never | Never | Never
Platform              | Automatic      | Never | Never | Never
Cookies               | User           | Never | Never | Never

Table 3.2 presents the median, the 90th and 95th percentiles of the duration
each attribute remains constant, on average, in browser instances. In particular, we observe that the User agent is rather unstable in most browser instances, as its value is systematically impacted by software updates. In comparison, attributes such as cookies, local storage and do not track rarely change, if ever. Moreover, we observe that attributes evolve at different rates depending on the browser instance. For example, canvas remains stable for 306.9 days in 50% of the browser instances, whereas it changes every 36.5 days for 10% of them. The same phenomenon can be observed for the screen resolution, where more than 50% of the browser instances never see a change, while 10% change every 3 days on average. This is likely explained by laptops that are regularly connected to external monitors. More generally, this points to some browser instances being quite stable, and thus more trackable, while others are not.
Figure 3.3 CDF of the elapsed time before a fingerprint evolution for all the fingerprints, and averaged per browser instance.
Evolution frequency
Another key indicator to observe is the elapsed time (Et) before a change occurs in a browser fingerprint. Figure 3.3 depicts the cumulative distribution function of Et for all fingerprints (blue), and averaged per browser instance (orange). After one day, at least one transition occurs in 57.35% of the observed fingerprints. The 90th percentile is observed after 9.8 days and the 95th percentile after 18.08 days. This means the probability that at least one transition occurs within 9.8 days is 0.9 (blue). It is important to point out that changes occur more or less frequently depending on the browser instance (orange). While some browser instances change often (22.6% change in less than two days), others, on the contrary, are much more stable (29.8% have no changes after 10 days). In this context, keeping pace with the frequency of change is likely a challenge for browser fingerprint linking algorithms and, to the best of our knowledge, has not been explored in the state of the art.
Evolution rules
While it is difficult to anticipate browser fingerprint evolutions, we can observe how individual attributes evolve. In particular, evolutions of the User agent attribute are often tied to browser upgrades, while evolutions of the Plugins attribute refer to the addition, deletion or upgrade of a plugin (upgrades change its version). Nevertheless, not all attribute changes can be explained in this manner; some values are difficult to anticipate. For example, the value of the canvas attribute is the result of an image rendered by the browser instance and depends on many different software and hardware layers. The same applies, although to a lesser extent, to the screen resolution, which can take unexpected values depending on the connected screen. Based on these observations, the accuracy of linking browser fingerprint evolutions depends on the inference of such evolution rules. The following section introduces the evolution rules we first identified empirically, and then learned automatically, to achieve an efficient algorithm to track browser fingerprints over time.
3.2 Linking Browser Fingerprints
Fp-Stalker's goal is to determine if a browser fingerprint comes from a known browser instance—i.e., it is an evolution—or if it should be considered as coming from a new browser instance. Because fingerprints change frequently, and for different reasons (see Section 3.1), a simple direct equality comparison is not enough to track browsers over long periods of time.
In Fp-Stalker, we have implemented two variant algorithms with the purpose of linking browser fingerprints, as depicted in Figure 3.4. The first variant is a rule-based algorithm that uses a static ruleset, and the second variant is a hybrid algorithm that combines both rules and machine learning. We explain the details and the tradeoffs of both algorithms in this section. Our results show that the rule-based algorithm is faster, but the hybrid algorithm is more precise while still maintaining acceptable execution times. We have also implemented a fully random forest-based algorithm, but the small increase in precision did not outweigh the large execution penalty, so we do not present it further in this chapter.
3.2.1 Browser fingerprint linking
When collecting browser fingerprints, a fingerprint may come from a previous visitor, i.e., a known browser instance, or from a new visitor, i.e., an unknown browser instance. The objective of fingerprint linking is to match fingerprints to their browser instance and follow the browser instance as long as possible by linking all of its fingerprint evolutions. In the case of a match, linked browser fingerprints are given the same identifier, which means the linking algorithm considers they originate from the same browser instance. If the browser fingerprint cannot be linked, the algorithm assigns a new identifier to the fingerprint.
More formally, given a set of known browser fingerprints F, each f ∈ F has an identifier f.id that links it to the browser instance it belongs to. Given an unknown fingerprint fu ∉ F whose real id is unknown, a linking algorithm returns the browser instance identifier fk.id of the fingerprint fk that maximizes the probability that fk and fu belong to the same browser instance. This computation can be done either by applying rules, or by training an algorithm to predict this probability. If no known fingerprint can be found, it assigns a new id to fu. For optimization purposes, we only hold and compare the last ν fingerprints of each browser instance bi in F. The reason is that if we linked, for example, 3 browser fingerprints fA, fB and fC to a browser instance bi, then, when trying to link an unknown fingerprint fu, it is rarely useful to compare fu to the oldest browser fingerprints of bi. That is, newer fingerprints are more likely to produce a match, hence we avoid comparing old fingerprints in order to improve execution times. In our case, we set the value of ν to 2.
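To make this procedure concrete, the following minimal Python sketch illustrates the generic linking loop described above; it is not the actual Fp-Stalker implementation, and the match_probability function and the min_probability threshold are hypothetical placeholders.

import uuid

NU = 2  # only keep the last nu fingerprints of each browser instance

def link_fingerprint(known, fu, match_probability, min_probability):
    """Return the id of the browser instance that most likely produced fu,
    or a fresh id if no known fingerprint is similar enough.
    `known` maps a browser instance id to its chronologically ordered
    fingerprints; `match_probability(fk, fu)` estimates the probability
    that two fingerprints come from the same browser instance."""
    best_id, best_p = None, min_probability
    for browser_id, fingerprints in known.items():
        for fk in fingerprints[-NU:]:  # compare only the last nu fingerprints
            p = match_probability(fk, fu)
            if p > best_p:
                best_id, best_p = browser_id, p
    if best_id is None:
        best_id = str(uuid.uuid4())  # unknown browser instance: new id
    known.setdefault(best_id, []).append(fu)
    return best_id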
3.2.2 Rule-based Linking Algorithm
The first variant of Fp-Stalker is a rule-based algorithm that uses static rules obtained from the statistical analyses performed in Section 3.1. The algorithm relies on rules designed from the attribute stability presented in Table 3.2 to determine if an unknown fingerprint fu belongs to the same browser instance as a known fingerprint fk. We also define rules based on constraints that we would not expect to be violated: a browser's family should be constant (e.g., the same browser instance cannot be Firefox one moment and Chrome at a later time), the operating system is constant, and the browser version is either constant or increases over time. The full list of rules is as follows:
1. The OS, platform and browser family must be identical for any given browser instance. Even if this may not always be true (e.g., when a user updates from Windows 8 to 10), we consider it reasonable for our algorithm to lose track of a browser when such a large change occurs, since it is not frequent;
2. The browser version remains constant or increases over time. This would not be true in the case of a downgrade, but this is also not a common event;
3. Due to the results from our statistical analyses, we have defined a set of attributes that must not differ between two fingerprints from the same browser instance. We consider that local storage, Dnt, cookies and canvas should be constant for any given browser instance. As observed in Table 3.2, these attributes do not change often, if at all, for a given browser instance. In the case of canvas, even if it seldom changes for most users (see Table 3.2), the changes are unpredictable, making them hard to model. Since canvas values are quite unique among browser instances [27], and do not change too frequently, it is still interesting to require that the canvas remains identical between two fingerprints of the same browser instance;

Figure 3.4 Fp-Stalker: Overview of both algorithm variants. (a) The rule-based variant uses a set of static rules to determine if fingerprints should be linked to the same browser instance or not. (b) In the hybrid variant, the training phase is used to learn the probability that two fingerprints belong to the same browser instance, and the testing phase uses the random forest-based algorithm to link fingerprints. The rule-based algorithm is simpler and faster but the hybrid algorithm leads to better fingerprint linking.
4. We impose a constraint on fonts: if both fingerprints have Flash activated (i.e., the list of fonts is available), then the fonts of fu must either be a subset or a superset of the fonts of fk, but not a disjoint set. That means that, between two fingerprints of a browser instance, the rule allows deletions or additions of fonts, but not both;
5. We define a set of attributes that are allowed to change, but only within a certain similarity. That means that their values must have a similarity ratio > 0.75, as computed by the Python library function difflib.SequenceMatcher().ratio() (illustrated by the sketch following this list). These attributes are the user agent, vendor, renderer, plugins, language, accept, and headers. We allow at most two changes of this kind;
6. We also define a set of attributes that are allowed to change, no matter their value. This set is composed of resolution, timezone and encoding. However, we only allow one of these three attributes to change at a time;
7. Finally, the total number of changes from rules 5 and 6 must be less than 2.
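As an illustration of rule 5, the similarity ratio it relies on can be computed with Python's standard difflib module, which provides the SequenceMatcher().ratio() function mentioned above; the example user agent values below are made up.

from difflib import SequenceMatcher

def similar_enough(value_a, value_b, threshold=0.75):
    # Rule 5: the values may differ, but their similarity ratio
    # must remain strictly greater than the threshold.
    return SequenceMatcher(None, value_a, value_b).ratio() > threshold

# A browser upgrade only changes the version number of the user agent,
# so the two values remain well above the 0.75 threshold.
ua_old = "Mozilla/5.0 (X11; Linux x86_64) Firefox/58.0"
ua_new = "Mozilla/5.0 (X11; Linux x86_64) Firefox/59.0"
print(similar_enough(ua_old, ua_new))  # True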
The order in which rules are applied matters for performance: we ordered them from the most to the least discriminating. The first rules discard many candidates, reducing the total number of comparisons. In order to link fu to a fingerprint fk, we apply the rules to each known fingerprint taken from F. As soon as a rule is not matched, the known fingerprint is discarded and we move on to the next. If a fingerprint matches all the rules, then it is added to a list of potential candidates, candidates. Moreover, in case the fingerprints fk and fu are identical, we add fk to the list of exact matching candidates, exact. Once the rule verification process is completed, we look at the two lists of candidates. If exact is not empty, we check if there is only one candidate or if all the candidates come from the same browser instance. If this is the case, then we link fu with this browser instance, otherwise we assign a new id to fu. In case no exact candidate is found, we look at candidates and apply the same technique as for exact. We summarize the rule-based approach in Algorithm 1.
Algorithm 1 Rule-based matching algorithm

function FingerprintMatching(F, fu)
    rules ← {rule1, ..., rule6}
    candidates ← ∅
    exact ← ∅
    for fk ∈ F do
        if VerifyRules(fk, fu, rules) then
            if nbDiff = 0 then
                exact ← exact ∪ ⟨fk⟩
            else
                candidates ← candidates ∪ ⟨fk⟩
            end if
        end if
    end for
    if |exact| > 0 and SameIds(exact) then
        return exact[0].id
    else if |candidates| > 0 and SameIds(candidates) then
        return candidates[0].id
    else
        return GenerateNewId()
    end if
end function
SameIds is a function that, given a list of candidates, returns true if all of them share the same id, and false otherwise.
On a side note, we established the rules using a simple univariate statistical analysis to
study attribute stability (see Table 3.2), as well as some objective (e.g., rule 1) and other
subjective (e.g., rule 4) decisions. Due to the difficulty in making complex yet effective
rules, the next subsection presents the use of machine learning to craft a more effective
algorithm.
3.2.3 Hybrid Linking Algorithm
The second variant of Fp-Stalker mixes the rule-based algorithm with machine learning to produce a hybrid algorithm. It reuses the first three rules of the previous algorithm, since we consider them as constraints that should not be violated between two fingerprints of the same browser instance. However, for the last four rules, the situation is fuzzier. Indeed, it is not as clear when to allow attributes to differ, how many of them can differ, and with what dissimilarity. Instead of manually crafting rules for each of these attributes, we propose to use machine learning to discover them. The benefit of combining rules and machine learning is that rules are faster to evaluate, while machine learning tends to be more precise. Thus, applying the rules first keeps only a subset of fingerprints on which to apply the machine learning algorithm.
3.2.3.1 Approach Description
The first step of this algorithm is to apply rules 1, 2 and 3 on fu and all fk ∈ F. We keep the subset of browser fingerprints fksub that verify these rules. If, during this process, we find any browser fingerprints that exactly match fu, we add them to exact. In case exact is not empty and all of its candidates are from the same browser instance, we stop here and link fu with the browser instance in exact. Otherwise, if there are multiple exact candidates but from different browser instances, we assign a new browser id to fu. In the case where the set of exact candidates is empty, we continue with a second step that leverages machine learning. In this step, for each fingerprint fk ∈ fksub, we compute the probability that fk and fu come from the same browser instance using a random forest model. We keep the set of fingerprint candidates whose probability is greater than a threshold parameter λ. If the set of candidates is empty, we assign a new id to fu. Otherwise, we keep the sets of candidates with the highest and second highest probabilities, ch1 and ch2. Then, we check if ch1 contains only one candidate or if all of its candidates come from the same browser instance. If this is not the case, we check either that the probability ph1 associated with the candidates of ch1 is greater than the probability ph2 associated with the candidates of ch2 plus diff, or that ch1 and ch2 contain only candidates from the same browser instance. Algorithm 2 summarizes the hybrid approach.
3.2.3.2 Machine Learning
Computing the probability that two fingerprints fu and fk originate from the same browser instance can be modeled as a binary classification problem where the two classes to predict are same browser instance and different browser instance. We use the random forest algorithm [100] to solve this binary classification problem. A random forest is an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class predicted by a majority of the individual trees. In the case of Fp-Stalker, each decision tree makes a prediction and votes on whether the two browser fingerprints come from the same browser instance; the result of the majority vote is chosen. Our main motivation for adopting a random forest instead of other classifiers is that it provides a good tradeoff between precision and the interpretability of the model. In particular, the notion of feature importance in random forests allows Fp-Stalker to interpret the importance of each attribute in the decision process.
In summary, two fingerprints fu ∉ F and fk ∈ F are reduced to a single feature vector of M features X = ⟨x1, x2, ..., xM⟩, where the feature xn is the comparison of attribute n in both fingerprints (the process of transforming two fingerprints into a feature vector is presented below). Our random forest model computes the probability P(fu.id = fk.id | ⟨x1, x2, ..., xM⟩) that fu and fk belong to the same browser instance.
Input Feature Vector To solve the binary classification problem, we provide an input vector X = ⟨x1, x2, ..., xM⟩ of M features to the random forest classifier. The features are mostly pairwise comparisons between the values of the attributes of both fingerprints (e.g., canvas, user agent). Most of these features are binary values (0 or 1) corresponding to the equality or inequality of an attribute, or similarity ratios between these attributes. We also include a number of changes feature that corresponds to the total number of differing attributes between fu and fk, as well as the time difference between the two fingerprints.
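The following sketch illustrates how such a feature vector could be built in Python; the attribute names and the fingerprint representation (a dictionary with a collect_date field) are hypothetical simplifications of the actual attributes listed in Table 3.3.

from difflib import SequenceMatcher

EQUALITY_ATTRIBUTES = ["canvas", "timezone", "resolution"]      # binary features
SIMILARITY_ATTRIBUTES = ["user_agent", "languages", "plugins"]  # ratio features

def feature_vector(fu, fk):
    """Turn a pair of fingerprints into the input vector X = <x1, ..., xM>."""
    features = [1.0 if fu[a] == fk[a] else 0.0 for a in EQUALITY_ATTRIBUTES]
    features += [SequenceMatcher(None, fu[a], fk[a]).ratio()
                 for a in SIMILARITY_ATTRIBUTES]
    # Total number of differing attributes between the two fingerprints.
    all_attributes = EQUALITY_ATTRIBUTES + SIMILARITY_ATTRIBUTES
    features.append(sum(fu[a] != fk[a] for a in all_attributes))
    # Elapsed time between the two collects, in days.
    features.append(abs((fu["collect_date"] - fk["collect_date"]).days))
    return features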
Algorithm 2 Hybrid matching algorithm

function FingerprintMatching(F, fu)
    rules ← {rule1, rule2, rule3}
    exact ← ∅
    Fksub ← ∅
    for fk ∈ F do
        if VerifyRules(fk, fu, rules) then
            if nbDiff = 0 then
                exact ← exact ∪ ⟨fk⟩
            else
                Fksub ← Fksub ∪ ⟨fk⟩
            end if
        end if
    end for
    if |exact| > 0 then
        if SameIds(exact) then
            return exact[0].id
        else
            return GenerateNewId()
        end if
    end if
    candidates ← ∅
    for fk ∈ Fksub do
        ⟨x1, x2, ..., xM⟩ ← FeatureVector(fu, fk)
        p ← P(fu.id = fk.id | ⟨x1, x2, ..., xM⟩)
        if p ≥ λ then
            candidates ← candidates ∪ ⟨fk, p⟩
        end if
    end for
    if |candidates| > 0 then
        ⟨ch1, ph1⟩ ← getCandidatesRank(candidates, 1)
        ⟨ch2, ph2⟩ ← getCandidatesRank(candidates, 2)
        if SameIds(ch1) and ph1 > ph2 + diff then
            return candidates[0].id
        end if
        if SameIds(ch1 ∪ ch2) then
            return candidates[0].id
        end if
    end if
    return GenerateNewId()
end function

getCandidatesRank is a function that, given a list of candidates and a rank i, returns the list of candidates with the i-th greatest probability, together with this probability.
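The rank-based decision at the end of Algorithm 2 can be sketched in Python as follows; candidates is assumed to be a list of (fingerprint, probability) pairs already filtered by the λ threshold, each fingerprint carries an id attribute, and returning None stands for generating a new id.

def pick_browser_id(candidates, diff=0.20):
    if not candidates:
        return None  # the caller generates a new id
    candidates.sort(key=lambda c: c[1], reverse=True)
    probabilities = sorted({p for _, p in candidates}, reverse=True)
    ch1 = [f for f, p in candidates if p == probabilities[0]]
    p2 = probabilities[1] if len(probabilities) > 1 else 0.0
    ch2 = [f for f, p in candidates if p == p2]
    # Link only when the top candidates unambiguously designate one instance.
    if len({f.id for f in ch1}) == 1 and probabilities[0] > p2 + diff:
        return candidates[0][0].id
    if len({f.id for f in ch1 + ch2}) == 1:
        return candidates[0][0].id
    return None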
In order to choose which attributes constitute the feature vector, we performed a feature selection. Indeed, having too many features does not necessarily ensure better results. It may lead to overfitting, i.e., our algorithm correctly fits our training data, but does not correctly predict on the test set. Moreover, having too many features also has a negative impact on performance. For the feature selection, we started with a model using all of the attributes in a fingerprint. Then, we looked at feature importance, as defined by [101], to determine the most discriminating features. In our case, feature importance is a combination of uniqueness, stability, and predictability (the possibility to anticipate how an attribute might evolve over time). We removed all the components of our feature vector that had a negligible impact (feature importance < 0.001). Finally, we obtained a feature vector composed of the attributes presented in Table 3.3. We see that the most important feature is the number of differences between two fingerprints, and the second most discriminating attribute is the time difference between the two fingerprints compared. Even attributes with low entropy, such as the list of languages, are among the most important features. Although this may seem surprising, it can be explained by the stability of the list of languages, as shown in Table 3.2: if two fingerprints have different languages, this often means that they do not belong to the same browser instance. In comparison, screen resolution also has low entropy, but it changes more often than the list of languages, leading to a low feature importance. This is mostly caused by the fact that, since screen resolution changes frequently, having two fingerprints with a different resolution does not add much information to determine whether or not they are from the same browser instance. Finally, we see a sharp drop in feature importance after rank 7 (from 0.018 to 0.004), which means that most of the information required for the classification is contained in the first seven features.
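Feature importances like those of Table 3.3 are directly exposed by scikit-learn's random forest implementation, which the thesis uses; the following sketch shows how such a ranking can be produced, with random placeholder data standing in for the real training vectors.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["number of changes", "time difference", "user agent",
                 "languages", "plugins", "canvas", "renderer",
                 "resolution", "timezone", "fonts"]

# Placeholder data standing in for the training vectors of Algorithm 3.
X = np.random.rand(1000, len(feature_names))
y = np.random.randint(0, 2, size=1000)

clf = RandomForestClassifier(n_estimators=10).fit(X, y)
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for rank, (name, importance) in enumerate(ranking, start=1):
    print(f"{rank:2d} {name:18s} {importance:.3f}")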
Training Random Forests This phase trains the random forest classifier to estimate the probability that two fingerprints belong to the same browser instance. To do so, we split the input dataset introduced in Section 3.1 chronologically into two sets: a training set and a test set. The training set is composed of the first 40% of fingerprints in our input dataset, and the test set of the last 60%. The random forest detects fingerprint evolutions by encoding the evolutions between fingerprints as feature vectors. During the training phase, it needs to learn about correct evolutions by computing relevant feature vectors from the training set. Algorithm 3 describes this training phase, which is split into two steps.
Table 3.3 Feature importances of the random forest model calculated from the fingerprint train set.

Rank  Feature            Importance
 1    Number of changes  0.324
 2    Time difference    0.225
 3    User agent HTTP    0.194
 4    Languages HTTP     0.112
 5    Plugins            0.094
 6    Canvas             0.020
 7    Renderer           0.018
 8    Resolution         0.004
 9    Timezone           0.002
10    Fonts              0.001
Algorithm 3 Compute input feature vectors for training

function BuildTrainingVectors(ID, F, δ, ν)
    T ← ∅
    for id ∈ ID do                              ▷ Step 1
        Fid ← BrowserFingerprints(id, F)
        for ft ∈ Fid do
            T ← T ∪ FeatureVector(ft, ft−1)
        end for
    end for
    for f ∈ F do                                ▷ Step 2
        fr ← random(F)
        if f.id ≠ fr.id then
            T ← T ∪ FeatureVector(f, fr)
        end if
    end for
    return T
end function
In Step 1, for every browser instance (id) of the training set, we compare each of its fingerprints (ft ∈ BrowserFingerprints(id, F)) present in the training set (F) with the previous one (ft−1). By doing so, Fp-Stalker captures the atomic evolutions that occur between two consecutive fingerprints from the same browser instance. We apply BuildTrainingVectors() for different collect frequencies (time differences between t and t−1) to teach our model to link fingerprints even when they are not equally spaced in time.

While Step 1 teaches the random forest to identify fingerprints that belong to the same browser instance, it is also necessary to identify when they do not. Step 2 compares fingerprints from different browser instances. Since the number of pairs of fingerprints from different browser instances is much larger than the number of pairs from the same browser instance, we limit the number of comparisons to one for each fingerprint. This technique is called undersampling [102] and it reduces overfitting by adjusting the ratio of input data labeled as true, i.e., two fingerprints belong to the same browser instance, against the number of data labeled as false, i.e., two fingerprints are from different browser instances. Otherwise, the algorithm would tend to simply predict false.
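A minimal Python sketch of the two steps of Algorithm 3, including the undersampling of negative pairs, could look as follows; instances maps a browser instance id to its chronologically ordered fingerprints, and feature_vector is the pairwise comparison function described earlier.

import random

def build_training_vectors(instances, feature_vector):
    T = []
    # Step 1: positive examples from consecutive fingerprints of the
    # same browser instance.
    for fingerprints in instances.values():
        for previous, current in zip(fingerprints, fingerprints[1:]):
            T.append((feature_vector(current, previous), True))
    # Step 2: negative examples, undersampled to a single comparison per
    # fingerprint so that both classes stay balanced.
    pool = [(bid, fp) for bid, fps in instances.items() for fp in fps]
    for bid, fp in pool:
        other_bid, other_fp = random.choice(pool)
        if other_bid != bid:
            T.append((feature_vector(fp, other_fp), False))
    return T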
Random forest hyperparameters. Concerning the number of trees of the random forest, there is a tradeoff between precision and execution time. Adding trees yields better results, but with diminishing returns, and it increases training and prediction times. Our goal is to balance precision and execution time. The number of features plays a role during the tree induction process: at each split, Nf features are randomly selected, among which the best split is chosen [103]. Usually, its default value is set to the square root of the length of the feature vector. The diff parameter enables the classifier to avoid selecting browser instances with very similar probabilities as the origin of the fingerprint; we would rather create a new browser instance than choose the wrong one. It is not directly a random forest hyperparameter, but rather a specificity of our approach. In order to optimize the hyperparameters number of trees and number of features, as well as the diff parameter, we define several possible values for each and run a grid search to optimize the accuracy. This results in setting the hyperparameters to 10 trees and 3 features, and the diff value to 0.20.
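Such a grid search over the tree hyperparameters can be expressed with scikit-learn; the candidate values below are illustrative, and the diff parameter, which is not a random forest hyperparameter, would be tuned in an outer loop by re-running the linking evaluation for each candidate value.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the training vectors of Algorithm 3.
X_train = np.random.rand(500, 10)
y_train = np.random.randint(0, 2, size=500)

param_grid = {
    "n_estimators": [5, 10, 50, 100],  # number of trees
    "max_features": [2, 3, "sqrt"],    # features drawn at each split
}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      scoring="accuracy", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # e.g., {'max_features': 3, 'n_estimators': 10}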
After training our random forest classifier, we obtain a forest of decision trees that predict the probability that two fingerprints belong to the same browser instance. Figure 3.5 illustrates the first three levels of one of the decision trees. These levels rely on the languages, the number of changes and the user agent to take a decision. If an attribute has a value below its threshold, the decision path goes to the left child node, otherwise it goes to the right child node. The process is repeated until we reach a leaf of the tree. The prediction corresponds to the class (same/different browser instance) that has the most instances over all the leaf nodes.

Figure 3.5 First 3 levels of a single tree classifier from our forest.
Lambda threshold parameter For each browser fingerprint in the test set, we compare it with its previous browser fingerprint and with another random fingerprint from a different browser, and compute the probability that it belongs to the same browser instance using our random forest classifier with the parameters determined previously. Using these probabilities and the true labels, we choose the λ value that minimizes the false positive rate while maximizing the true positive rate. However, this configuration parameter depends on the targeted application of browser fingerprinting. For instance, if browser fingerprinting is used as a second-tier security mechanism (e.g., to verify the user is connecting from a known browser instance), we set λ to a high value. This makes the algorithm more conservative, reducing the risk of linking a fingerprint to an incorrect browser instance, but it also increases false negatives and results in a reduction of the duration the algorithm can effectively track a browser. On the opposite end, a low λ value increases the false positive rate, meaning in this case that the algorithm tends to link browser fingerprints together even though they present differences. Such a use case might be acceptable for constructing ad profiles, because larger profiles are arguably more useful even if sometimes contaminated with someone else's information. By applying this approach, we obtained a λ threshold equal to 0.994.
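This kind of threshold selection can be sketched with scikit-learn's ROC utilities; the max_fpr tolerance below is a hypothetical parameter that encodes how conservative the targeted application needs to be, and this sketch is not the exact procedure used in the thesis.

import numpy as np
from sklearn.metrics import roc_curve

def choose_lambda(y_true, y_prob, max_fpr=0.005):
    """Pick the threshold with the highest true positive rate among those
    whose false positive rate stays below max_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    acceptable = fpr <= max_fpr
    best = np.argmax(np.where(acceptable, tpr, -1.0))
    return thresholds[best]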
3.3 Empirical Evaluation of Fp-Stalker
This section assesses Fp-Stalker's capacity to i) correctly link fingerprints from the same browser instance, and ii) correctly predict when a fingerprint belongs to a browser instance that has never been seen before. We show that both variants of Fp-Stalker are effective in linking fingerprints and in distinguishing fingerprints from new browser instances. However, the rule-based variant is faster, while the hybrid variant is more precise. Finally, we discuss the impact of the collect frequency on fingerprinting effectiveness, and we evaluate the execution times of both variants of Fp-Stalker.

Figure 3.6 illustrates the linking and evaluation process. Our database contains perfect tracking chains because of the unique identifiers our extensions use to identify browser instances. From there, we sample the database using different collection frequencies and generate a test set that removes the identifiers, resulting in a mix of fingerprints from different browsers. The resulting test set is then run through Fp-Stalker to reconstruct the best possible browser instance chains.
3.3.1 Key Performance Metrics
To evaluate the performance of our algorithms and measure how vulnerable users are to browser fingerprint tracking, we consider several metrics that represent the capacity to keep track of browser instances over time and to detect new browser instances. This section presents these evaluation metrics, as well as the related vocabulary. Figure 3.6 illustrates the different metrics with a scenario.

A tracking chain is a list of fingerprints that have been linked, i.e., fingerprints for which the linking algorithm assigned the same identifier. A chain may be composed of one or more fingerprints. In the case of a perfect linking algorithm, each browser instance would have a unique tracking chain, i.e., all of its fingerprints are grouped together and are not mixed with fingerprints from any other browser instance. However, in reality, fingerprinting is a statistical attack and mistakes may occur during the linking process, which means that:

1. Fingerprints from different browser instances may be included in the same tracking chain;
2. Fingerprints from a given browser instance may be split into different tracking chains.
The lower part of Figure 3.6 shows examples of these mistakes. Chain 1 has an incorrect fingerprint fpB1 from browser B, and chain 3 and chain 4 contain fingerprints from browser C that have not been correctly linked, i.e., fpC3 and fpC4 were not linked, leading to a split.
We present the tracking duration metric to evaluate the capacity of an algorithm to track browser instances over time. We define the tracking duration as the period of time a linking algorithm matches the fingerprints of a browser instance within a single tracking chain. More specifically, the tracking duration for a browser bi in a chain chaink is defined as CollectFrequency × (#bi∈chaink − 1), where #bi∈chaink is the number of fingerprints of bi in chaink. We subtract one because we consider a browser instance to have been tracked, by definition, from the second linked fingerprint onwards. The average tracking duration for a browser instance bi is the arithmetic mean of its tracking duration across all the tracking chains the instance is present in. For example, in Figure 3.6, the tracking duration of browser B in chain 1 is 0 × CollectFrequency, and the tracking duration in chain 2 is 1 × CollectFrequency, thus the average tracking duration is 0.5 × CollectFrequency. In the same manner, the average tracking duration of browser C is 1.5 × CollectFrequency.
The maximum tracking duration for a browser instance bi is defined as the maximum tracking duration across all of the tracking chains the browser instance is present in. In the case of browser C, the maximum tracking duration occurred in chain 3 and is equal to 2 × CollectFrequency.
The number of assigned ids represents the number of different identifiers that have been assigned to a browser instance by the linking algorithm. It can be seen as the number of tracking chains in which a browser instance is present. For each browser instance, a perfect linking algorithm would group all of the browser's fingerprints into a single chain. Hence, each browser instance would have a number of assigned ids of 1. Figure 3.6 shows an imperfect case where browser C has been assigned 2 different ids (chain 3 and chain 4).
The ownership ratio reflects the capacity of an algorithm to not link fingerprints from different browser instances. The owner of a tracking chain chaink is defined as the browser instance bi that has the most fingerprints in the chain. Thus, we define the ownership ratio as the number of fingerprints that belong to the owner of the chain divided by the length of the chain. For example, in chain 1, browser A owns the chain with an ownership ratio of 4/5 because it has 4 out of the 5 fingerprints. In practice, an ownership ratio close to 1 means that a tracking profile is not polluted with information from different browser instances.

Figure 3.6 Overview of our evaluation process that allows testing the algorithms using different simulated collection frequencies.
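Under the simplifying assumption that a tracking chain is represented by the list of true browser instance ids of its linked fingerprints, these metrics can be computed as in the following sketch; chain 1 of Figure 3.6 serves as the example.

from collections import Counter

def ownership_ratio(chain):
    # The owner is the browser instance with the most fingerprints.
    _, owner_count = Counter(chain).most_common(1)[0]
    return owner_count / len(chain)

def tracking_duration(chain, browser_id, collect_frequency):
    # CollectFrequency x (number of fingerprints of the browser - 1).
    return collect_frequency * max(chain.count(browser_id) - 1, 0)

chain1 = ["A", "A", "B", "A", "A"]        # chain 1 of Figure 3.6
print(ownership_ratio(chain1))            # 0.8, i.e., 4 out of 5
print(tracking_duration(chain1, "B", 7))  # 0: B appears only once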
3.3.2 Comparison With Panopticlick’s Linking Algorithm
We compare Fp-Stalker to the algorithm proposed by Eckersley [3] in the context of the Panopticlick project. To the best of our knowledge, there are no other algorithms to compare to. Although Eckersley's algorithm has been characterized as "naive" by its author, we use it as a baseline for our approach. The Panopticlick algorithm is summarized in Algorithm 4. It uses the following 8 attributes: user agent, accept, cookies enabled, screen resolution, timezone, plugins, fonts and local storage. Given an unknown fingerprint fu, Panopticlick tries to match it to a previous fingerprint of the same browser instance if a sufficiently similar one exists, i.e., if no more than one attribute changed. Otherwise, if it finds no similar fingerprints, or too many similar fingerprints that belong to different browser instances, it assigns a new id. Moreover, although at most one change is allowed, this change can only occur among the following attributes: cookies, resolution, timezone and local storage.
Algorithm 4 Eckersley fingerprint matching algorithm [3]

ALLOWED ← {cookies, resolution, timezone, local storage}
function FingerprintMatching(F, fu)
    candidates ← ∅
    for fk ∈ F do
        changes ← Diff(fu, fk)
        if |changes| = 1 then
            candidates ← candidates ∪ ⟨fk, changes⟩
        end if
    end for
    if |candidates| = 1 then
        ⟨fk, a⟩ ← candidates[0]
        if a ∈ ALLOWED then
            return fk
        else if MatchRatio(fu(a), fk(a)) > 0.85 then
            return fk
        else
            return NULL
        end if
    end if
end function

MatchRatio refers to the Python standard library function difflib.SequenceMatcher().ratio() for estimating the similarity of strings.
3.3.3 Dataset Generation Using Fingerprint Collect Frequency
To evaluate the effectiveness of Fp-Stalker, we start from our test set of 59,159 fingerprints collected from 1,395 browser instances (60% of our input dataset, see Section 3.2.3.2). However, we do not use this set directly. Instead, by sampling the test set, we generate new datasets using a configurable collect frequency. Because our input dataset is fine-grained, it allows us to simulate the impact the fingerprinting frequency has on tracking. The intuition is that if a browser is fingerprinted less often, it becomes harder to track.
To generate a dataset for a given collect frequency, we start from the test set of 59,159 fingerprints, and, for each browser instance, we look at the collection date of its first fingerprint. Then, we iterate in time with a step of collect_frequency days and recover the browser instance's fingerprint at time t + collect_frequency. It may be the same fingerprint as in the previous collect or a new one. We do this until we reach the last fingerprint collected for that browser id. This allows us to record the sequence of fingerprints that a fingerprinter would obtain if the browser instance was fingerprinted every collect_frequency days. The interest of sampling is that it is more realistic than using all of the fingerprints from our database, since they are very fine-grained. Indeed, the extension is capable of catching even short-lived changes in the fingerprint (e.g., connecting an external monitor), which is not always possible in the wild. Finally, it allows us to investigate how the fingerprint collection frequency impacts browser tracking. Figure 3.7 provides an example of the process used to generate a dataset with a collect_frequency of two days. Table 3.4 presents, for each simulated collect frequency, the number of fingerprints in the generated test sets.
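The sampling procedure can be sketched as follows, assuming each browser instance is given as a chronologically ordered list of (collect_date, fingerprint) pairs; this is a simplified reconstruction of the generation process, not the exact evaluation code.

from datetime import timedelta

def sample_instance(fingerprints, collect_frequency):
    """Simulate fingerprinting a browser instance every
    collect_frequency days."""
    sampled = []
    t = fingerprints[0][0]           # date of the first collected fingerprint
    last_date = fingerprints[-1][0]  # date of the last collected fingerprint
    while t <= last_date:
        # The fingerprint exposed at time t is the most recent one
        # collected on or before t.
        current = max((fp for fp in fingerprints if fp[0] <= t),
                      key=lambda fp: fp[0])
        sampled.append((t, current[1]))
        t += timedelta(days=collect_frequency)
    return sampled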
The browser fingerprints in a generated test set are ordered chronologically. At the beginning of our experiment, the set of known fingerprints (F) is empty. At each iteration, Fp-Stalker tries to link an unknown fingerprint fu with one of the fingerprints in F. If it can be linked to a fingerprint fk, then Fp-Stalker assigns the id fk.id to fu, otherwise it assigns a new id. In both cases, fu is added to F. The chronological order of the fingerprints implies that, at time t, a browser fingerprint can only be linked with a former fingerprint collected at a time t′ < t. This approach ensures a more realistic scenario, similar to online fingerprint tracking approaches, than if we allowed fingerprints from the past to be linked with fingerprints collected in the future.

Figure 3.7 Example of the process used to generate a simulated test set. The dataset contains fingerprints collected from browsers A and B, which we sample at a collect_frequency of 2 days to obtain a dataset that allows us to test the impact of collect_frequency on fingerprint tracking.
3.3.4 Tracking Duration
Figure 3.8 plots the average tracking duration against the collect frequency for the three algorithms. On average, browser instances from the test set were present for 129.4 days, which corresponds to the maximum value our linking algorithm could potentially achieve. We see that the hybrid variant of Fp-Stalker is able to keep track of browser instances for a longer period of time than the two other algorithms. In the case where a browser gets fingerprinted every three days, Fp-Stalker can track it for 50.8 days, on average. More generally, the hybrid variant of Fp-Stalker has an average tracking duration about 9 days longer than the rule-based variant and 34 days longer than the Panopticlick algorithm.
Figure 3.9 presents the average maximum tracking duration against the collect frequency for the three algorithms. We see that the hybrid algorithm still outperforms the two other algorithms because it constructs longer tracking chains with fewer mistakes. On average, the average maximum tracking duration for Fp-Stalker's hybrid version is in the order of 82 days, meaning that users were generally tracked for at most this duration.
Figure 3.10 shows the number of ids assigned, on average, to each browser instance. We see that Panopticlick's algorithm often assigns new browser ids, which is caused by its conservative nature. Indeed, as soon as there is more than one change, or multiple candidates for linking, Panopticlick's algorithm assigns a new id to the unknown browser instance. However, we can observe that both Fp-Stalker's hybrid and rule-based variants perform similarly.

Table 3.4 Number of fingerprints per generated test set after simulating different collect frequencies.

Collect frequency (days)   Number of fingerprints
 1                         227,706
 2                         114,288
 3                         76,480
 4                         57,569
 5                         46,247
 6                         38,681
 7                         33,285
 8                         29,219
10                         23,560
15                         16,003
20                         12,231

Figure 3.8 Average tracking duration against simulated collect frequency for the three algorithms.

Figure 3.9 Average maximum tracking duration against simulated collect frequency for the three algorithms. This shows averages of the longest tracking durations that were constructed.

Figure 3.10 Average number of assigned ids per browser instance against simulated collect frequency for the three algorithms (lower is better).
Finally, Figure 3.11 presents the average ownership of tracking chains against the collect frequency for the three algorithms. We see that, despite its conservative nature, Panopticlick's ownership is 0.94, which means that, on average, 6% of a tracking chain is constituted of fingerprints that do not belong to the browser instance that owns the chain, i.e., it is contaminated with other fingerprints. We also see that Fp-Stalker's hybrid variant has an average ownership of 0.924, against 0.977 for the rule-based variant. Thus, while the hybrid version can track browser instances for longer periods of time than the rule-based version, this is partly explained by the fact that the hybrid version more frequently links fingerprints from different browser instances.
When it comes to linking browser fingerprints, Fp-Stalker's hybrid variant is better than, or as good as, the rule-based variant. The next paragraphs focus on a few more results we obtain with the hybrid algorithm. Figure 3.12 presents the cumulative distribution of the average and maximum tracking duration when collect_frequency equals 7 days for the hybrid variant. We observe that, on average, 12.4% of the browser instances are tracked for more than 100 days. When it comes to the longest tracking chains, we observe that more than 32.4% of the browser instances have been tracked at least once for more than 100 days during the experiment. These numbers show how tracking may depend on the browser and its configuration. Indeed, while some browsers are never tracked for a long period of time, others may be tracked for multiple months. This is also due to the duration of presence of browser instances in our experiments. Few browser instances were present for the whole experiment, most were present for a few weeks, and at best we can track a browser instance only as long as it was present. The graph also shows the results of the perfect linking algorithm (grey line), which can also be interpreted as the distribution of the duration of presence of browser instances in our test set.
The boxplot in Figure 3.13 depicts the number of ids generated by the hybrid algorithm for a collect frequency of 7 days. It shows that half of the browser instances have been assigned 2 identifiers, which means they have one mistake, and more than 90% have fewer than 9 identifiers.
Finally, we also look at the distribution of the chains to see how often fingerprints from different browser instances are mixed together. For the Fp-Stalker hybrid variant, more than 95% of the chains have an ownership greater than 0.8, and more than 90% have a perfect ownership of 1. This shows that a small percentage of browser instances become highly mixed in the chains, while the majority of browser instances are properly linked into clean and relatively long tracking chains.

Figure 3.11 Average ownership of tracking chains against simulated collect frequency for the three algorithms. A value of 1 means the tracking chain contains only fingerprints of the same browser instance.

Figure 3.12 CDF of average and maximum tracking duration for a collect frequency of 7 days (Fp-Stalker hybrid variant only).

Figure 3.13 Distribution of the number of ids per browser for a collect frequency of 7 days (Fp-Stalker hybrid variant only).
Comparison with the original dataset. Compared to the original version of the paper [16], published in 2018 with fewer fingerprints, we observe that the hybrid algorithm has a similar average tracking duration. The average maximum tracking duration increases, which can be explained by the fact that our evaluation dataset spans a longer period. However, this difference does not necessarily imply a performance improvement. It is likely caused by the fact that, as browser instances remain longer in the dataset, the chance increases that some of them can be tracked longer. Concerning the number of identifiers per user, we also observe results similar to the original dataset: the hybrid and rule-based algorithms have similar performance, while significantly outperforming the Panopticlick algorithm. Nevertheless, we observe a significant difference concerning the average ownership, i.e., the quality of the tracking chains. With the original dataset, the hybrid and rule-based algorithms performed similarly, with an average ownership > 0.975, no matter the collect frequency. However, in this chapter, we observe a significant drop for the hybrid algorithm, going from an average ownership of 0.985 in the original paper to 0.924 with the new dataset. This difference illustrates the difficulty of building machine learning models whose accuracy remains stable over time. Indeed, even though the model was trained on 40% of the new dataset, its performance decreased. This drop can be partly explained by the fact that the random forest model is trained on old fingerprints but applied to recent fingerprints for evaluation. Browser fingerprinting is a constantly evolving field, which means that what we learn about fingerprint evolutions on the first 40% of the dataset does not fully transfer to the way the latest 60% of the collected fingerprints evolved. Thus, for better performance, the model would have to be trained on more recent fingerprints. While this is not possible in our case, because of the number of fingerprints available and our need for an evaluation dataset of significant size, it is not a problem for commercial fingerprinters that have more fingerprints available to train their models.

A second factor that can explain the average ownership drop lies in the increase of the dataset size. Since there are more fingerprints, there is a higher chance that the dataset contains similar or close fingerprints. This problem has already been discussed by Eckersley [3] and shows the limits of using relatively small datasets to study browser fingerprinting. We discuss this problem of dataset representativity, as well as other limits, in Section 3.3.6. Thus, it is possible that our machine learning model lacked the discriminative power to distinguish between fingerprints originating from the same browser instance and fingerprints originating from different browser instances. Nevertheless, there exist solutions to address this problem:

1. Doing more advanced feature engineering;
2. Adding new attributes to the collected fingerprints and using them in the machine learning model.

Concerning the second solution, I argue that adding new attributes, such as audio fingerprinting or more detailed attributes related to the size of the screen and the window, would help to improve the accuracy of the tracking algorithm.
3.3.5 Benchmark/Overhead
This section presents a benchmark that evaluates the performance of Fp-Stalker's hybrid and rule-based variants. We start by providing more details about our implementation, then we explain the protocol used for this benchmark, demonstrate that our approach can scale, and show how our two variants behave when the number of browser instances increases.
The implementations of Fp-Stalker used for this benchmark are developed in Python, and the implementation of the random forest comes from the Scikit-Learn library. In order to study the scalability of our approach, we parallelized the linking algorithm to run on multiple nodes. A master node is responsible for receiving linkability requests; it then sends the unknown fingerprint fu to slave nodes that compare fu with all of the fk held by their process. Each slave node then sends back its set of candidates, associated either with a probability in the case of the hybrid algorithm, or with the number of changes in the case of the rule-based version. Finally, the master node takes the final decision according to the policy defined by the rule-based or hybrid algorithm. After the decision is made, it sends a message to each node to announce whether or not they should keep fu in their local memory. In this benchmark, we do not implement an optimization for exact matching; normally, the master node would hold a list of the exact matches associated with their ids.
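A simplified version of this master/slave scheme can be sketched with Python's multiprocessing module; match_probability is a placeholder for the random forest prediction, and the final decision step is omitted.

from multiprocessing import Pool

def match_probability(fk, fu):
    # Placeholder for the random forest probability P(fu.id = fk.id | X).
    return 1.0 if fk["attributes"] == fu["attributes"] else 0.0

def local_candidates(args):
    # Slave: compare fu with the shard of known fingerprints held by
    # this process and return the local candidates with their probability.
    shard, fu, lam = args
    candidates = []
    for fk in shard:
        p = match_probability(fk, fu)
        if p >= lam:
            candidates.append((fk, p))
    return candidates

def link_parallel(shards, fu, lam=0.994, processes=16):
    # Master: fan fu out to the slaves, then gather all candidate sets;
    # the final decision follows the rule-based or hybrid policy.
    with Pool(processes) as pool:
        results = pool.map(local_candidates, [(s, fu, lam) for s in shards])
    return [candidate for local in results for candidate in local]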
The experimental protocol aims to study scalability. We evaluate our approach on a standard Azure cloud instance. We generate fake browser fingerprints to increase the test set size. Thus, this part does not evaluate the previous metrics, such as the tracking duration, but only the execution times required to link synthetic browser fingerprints, as well as how well the approach scales across multiple processes.

The first step of the benchmark is to generate fake fingerprints from real ones. The generation process consists of taking a real fingerprint from our database and applying random changes to the canvas and timezone attributes. We apply only two random changes so that generated fingerprints are unique, but do not have too many differences, which would reduce the number of comparisons. This point is important because our algorithms include heuristics related to the number of differences. Thus, by applying a small number of random changes, we do not discard all fk fingerprints, making it the worst-case scenario for testing execution times. Regarding the browser ids, we assign two generated fingerprints to each browser instance. It would not have been useful to generate more fingerprints per browser instance, since we compare an unknown fingerprint only with the last 2 fingerprints of each browser instance. Then, the master node creates n slave processes and sends the generated fingerprints to them. The fingerprints are spread evenly over the processes.
Once the fingerprints are stored in the slave processes' memory, we start our benchmark. We take 100 real fingerprints and try to link them with our generated fingerprints. For each fingerprint, we measure the execution time of the linking process, while varying:

1. The number of fingerprints and browser instances;
2. The number of processes spawned.

Figure 3.14 Speedup of the average execution time against the number of processes for Fp-Stalker's hybrid variant.
We execute our benchmark on a Standard D16 v3 Azure instance with 16 virtual processors and 64 GB of RAM, which has an associated cost of $576 USD per month. Figure 3.14 shows the execution time speedup, in percentage, against the number of processes for the hybrid approach. We see that as the number of processes increases, we obtain a speedup in execution time. Going from 1 to 8 processes yields a speedup of more than 80%. Figure 3.15 shows the execution time to link a fingerprint against the number of browser fingerprints for Fp-Stalker's hybrid and rule-based variants, using 16 processes. The better tracking duration of the hybrid variant (see Section 3.3.4) is obtained at the cost of execution speed. Indeed, for any given number of processes and browser instances, the rule-based variant links fingerprints about 5 times faster. That said, the results show that the hybrid variant links fingerprints relatively quickly.
However, the raw execution times should not be used directly. The algorithm was implemented in Python, whose primary focus is not performance. Moreover, although we scaled by adding processes, it is possible to scale further by splitting the linking process (e.g., depending on the combination of OS and browser, send the fingerprint to more specialized nodes). In our current implementation, if an unknown fingerprint from a Chrome browser on Linux is being matched, it will be compared to fingerprints from Firefox on Windows, causing us to wait even though they have no chance of being linked. By adopting a hierarchical structure where nodes or processes are split depending on their OS and browser, it is possible to increase the throughput of our approach.

Figure 3.15 Execution times for the Fp-Stalker hybrid and rule-based variants to link a fingerprint using 16 processes. Time depends on the size of the test set. The increased effectiveness of the hybrid variant comes at the cost of slower execution times.
Furthermore, the importance of the raw execution speeds depends highly on the use case. In the case where fingerprinting is used as a way to regenerate cookies (e.g., for advertising), a fingerprint only needs to be linked when the cookie is missing or has been erased, a much less frequent event. Another use case is using browser fingerprinting as a way to enhance authentication [12]. In this case, one only needs to match the fingerprint of the browser attempting to sign in with the previous fingerprints from the same user, drastically reducing the number of comparisons.
3.3.6 Threats to Validity
First, the results we report in this work depend on the representativity of our browser fingerprint dataset. We developed extensions for Chrome and Firefox, the two most popular web browsers, and distributed them through standard channels. This provides long-term data, and mitigates a possible bias we would have introduced by choosing a user population ourselves, but it is possible that the people interested in our extensions are not representative of the average web surfer.
Second, there is a reliability threat due to the difficulty in replicating the experiments.
Unfortunately, this is inherent to scientific endeavors in the area of privacy: these works
must analyze personal data (browser fingerprints in our case) and the data cannot be
publicly shared. Yet, the code to split the data, generate input data, train the algorithm,
as well as evaluate it, is publicly available online on GitHub.1
Finally, a possible internal threat lies in our experimental framework. We did extensive testing of our machine learning algorithms, and checked classification results as thoroughly as possible. We paid attention to split the data and generate a scenario close to what would happen in a web application. However, as for any large-scale experimental infrastructure, there are surely bugs in this software. We hope that they only affect marginal quantitative results, and not the qualitative essence of our findings.

1 https://github.com/Spirals-Team/FPStalker
3.3.7 Discussion
This chapter studies browser fingerprint linking in isolation, which is its worst-case scenario. In practice, browser fingerprinting is often combined with stateful tracking techniques (e.g., cookies, ETags) to respawn stateful identifiers [5]. In such cases, fingerprint linking is performed much less frequently, since most of the time a cookie is sufficient and inexpensive to track users. Our work shows that browser fingerprinting can provide an efficient solution to extend the lifespan of cookies, which are increasingly being deleted by privacy-aware users.
Browser vendors and users would do well to minimize the differences that are so easily
exploited by fingerprinters. Our results show that some browser instances have highly
trackable fingerprints, to the point that very infrequent fingerprinting is quite effective.
In contrast, other browser instances appear to be untrackable using the attributes we
collect. Vendors should work to minimize the attack surfaces exploited by fingerprinters,
and users should avoid customizing their browsers in ways that make them expose unique
and linkable fingerprints.
Depending on the objectives, browser fingerprint linking can be tuned to be more
conservative and avoid false positives (e.g., for second-tier security purposes), or more
permissive (e.g., ad tracking). Tuning could also be influenced by how effective other
tracking techniques are. For example, it could be tuned very conservatively and simply
serve to extend cookie tracking in cases where privacy-aware users, which are in our
opinion more likely to have customized (i.e., unique and linkable) browser configurations,
delete their cookies.
3.4 Conclusion
In this chapter, we investigated the stability of browser fingerprints and showed that, besides having a high entropy, new techniques, such as canvas fingerprinting, remain stable for long periods of time. Then, we measured how long browsers can be tracked using solely their fingerprint and proposed Fp-Stalker, an approach to link fingerprint changes over time. We address the problem with two variants of Fp-Stalker. The first one builds on a ruleset identified from an analysis grounded in programmer knowledge. The second variant combines the most discriminating rules with machine learning to sort out the more subtle ones.
We trained the Fp-Stalker hybrid variant with a training set of fingerprints that we collected for 2 years through browser extensions installed by 2,346 volunteers. By analyzing the feature importance of our random forest, we identified the number of changes, the time difference, as well as the user agent, as the three most important features.
We ran Fp-Stalker on our test set to assess its capacity to link fingerprints, as well as to detect new browser instances. Our experiments demonstrate that the hybrid variant can correctly link fingerprint evolutions from a given browser instance for 49.3 consecutive days on average, against 40.9 days for the rule-based variant. When it comes to the maximum tracking duration, with the hybrid variant, more than 32.4% of the browsers can be tracked for more than 100 days.
Concerning the differences with the paper published at S&P 18 [16], while the performance of the hybrid algorithm in terms of tracking duration and number of ids per user is quite similar, we observe a significant difference concerning the average ownership, i.e., the quality of the tracking chains. We observe a significant drop for the hybrid algorithm, going from an average of 0.985 in the original paper to 0.924 with the new dataset. This can be partly explained by the fact that the model is trained on old fingerprints but applied to recent fingerprints during the evaluation. Moreover, browser fingerprinting is a constantly evolving field in which new features are added and deprecated. Thus, when our random forest model learns how fingerprints evolve on the first 40% of the fingerprints, this does not fully transfer to the most recent 60%. To obtain better performance, the linking model would have to be trained on more recent fingerprints. While this was not possible in our case, because of the number of fingerprints we have and our need for an evaluation dataset of significant size, it is not a problem for commercial fingerprinters that have more fingerprints available to train their models. A second reason that explains this drop is that, as the size of the dataset increases, so does the probability that two fingerprints are the same or close, even though they belong to different browser instances. To address this problem, one either needs to improve the feature engineering of the machine learning model, or to add new fingerprinting attributes, such as audio fingerprinting or attributes related to the size of the screen and the window.
Regarding the usability of Fp-Stalker, we measured the average execution time to link an unknown fingerprint as the number of known fingerprints grows. We showed that both our rule-based and hybrid variants scale horizontally. However, even though our hybrid variant is better in terms of tracking duration, we also showed that it introduces a non-negligible overhead compared to the pure rule-based approach.
As we showed in this chapter, browser fingerprinting is a threat to privacy and can be used in addition to other stateful mechanisms, such as cookies, to track users for longer periods of time. To protect against browser fingerprinting, several countermeasures have been proposed. Nevertheless, these countermeasures may generate inconsistent fingerprints that can make their users more identifiable, and therefore more trackable. Thus, in the next chapter, we evaluate the effectiveness of fingerprinting countermeasures and how using them impacts the privacy of their users.
Chapter 4

Fp-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies
A wide range of countermeasures has been developed to protect against browser fingerprinting. Some of them lie about the nature of the device in order to fool trackers, while others add random noise to the pixels of canvas fingerprints in order to break the stability required for tracking. Nevertheless, the way these countermeasures have been evaluated does not properly assess their impact on user privacy, in particular regarding the quantity of information they may indirectly leak by revealing their presence.
My work was motivated by Nikiforakis et al. [4], who first demonstrated how inconsistencies introduced by user agent spoofers could be used to reveal their presence, which could help fingerprinters track browsers with such extensions. This chapter goes beyond the specific case of user agent spoofers and studies a wider range of state-of-the-art fingerprinting countermeasures. Moreover, I also challenge the claim that being more distinguishable necessarily makes tracking more accurate.
My motivation to conduct a better evaluation of fingerprinting countermeasures was strengthened by recent findings when inspecting the code of a commercial fingerprinting script used by Augur [96]. I discovered that this script computes an attribute called spoofed, which is the result of multiple tests that evaluate the consistency between the user agent, the platform, navigator.oscpu, navigator.productSub, as well as the value returned by eval.toString().length, used to detect a browser. Moreover, the code also tests for the presence of touch support on devices that claim to be mobiles. Similar
tests are also present in the widely used open source library FingerprintJS2 [65]. While we cannot know the motivations of fingerprinters when it comes to detecting browsers with countermeasures—i.e., this could be used to identify bots, to block fraudulent activities, or to apply additional tracking heuristics—I argue that countermeasures should avoid revealing their presence, as this can be used to better target the browser. Thus, I consider it necessary to evaluate the privacy implications of using fingerprinting countermeasures.
In this chapter, I show how most of the fingerprinting countermeasures presented in the state of the art can negatively impact user privacy. First, in Section 4.1, I propose FP-Scanner, a test suite that leverages fingerprint inconsistencies to detect if a user has a fingerprinting countermeasure installed in her browser. In Section 4.2, I detail my implementation of a fingerprinting script and an inconsistency scanner capable of detecting altered fingerprints. I evaluate it against state-of-the-art countermeasures and show that none of them can hide their presence from our scanner. Then, I go further and show that even when countermeasures modify attributes such as the user agent or the platform, our scanner can still recover the original OS and browser values. In Section 4.3, I discuss how fingerprinters can leverage this information to improve their tracking algorithms. Finally, I conclude in Section 4.4. This chapter was originally published as a conference paper entitled Fp-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies [17] at USENIX Security 18.
4.1 Investigating Fingerprint Inconsistencies
Based on our study of existing browser fingerprinting countermeasures published in the literature, we organized our test suite to detect fingerprint inconsistencies along 4 distinct components. The sequence of tests is ordered by the increasing complexity required to detect an inconsistency. In particular, the first two tests aim at detecting inconsistencies at the OS and browser levels, respectively. The third one focuses on detecting inconsistencies at the device level. Finally, the fourth test aims at revealing canvas poisoning techniques. Each test focuses on detecting specific inconsistencies that could be introduced by a countermeasure. While some of the tests we integrate, such as checking the values of both user agents or browser features, have already been proposed by Nikiforakis et al. [4], we also propose new tests to strengthen our capacity to detect inconsistencies. Figure 4.1 depicts the 4 components of our inconsistency test suite.
Figure 4.1 Overview of the inconsistency test suite.
4.1.1 Uncovering OS Inconsistencies
Although checking the browser’s identity is straightforward for a browser fingerprinting algorithm, verifying the host OS is more challenging because of the sandbox mechanisms used by the script engines. In this section, we present the heuristics applied to check the OS claimed by a fingerprint.
User Agent.
We start by checking the user agent consistency [4], as it is a key attribute to retrieve the OS and browser of a user. The user agent is available both from the client side, through the navigator object (navigator.userAgent), and from the server side, as an HTTP header (User-Agent). The first heuristic we apply checks the equality of these two values, as naive browser fingerprinting countermeasures, such as basic user agent spoofers, tend to only alter the HTTP header. A difference between the two user agent attributes reflects a coarse-grained inconsistency that can be due to the OS and/or the browser. While extracting the OS and browser substrings can help to reveal the source of the inconsistency, the similarity of each substring does not necessarily guarantee that the OS and browser values are true, as both might be spoofed. Therefore, we extract and store the OS, browser and version substrings as internal variables OSRef, browserRef and browserVersionRef for further investigation.
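As an illustration, the sketch below shows how this first heuristic could be implemented on a Node.js endpoint that receives the uploaded fingerprint; the names checkUserAgents and fp, and the choice of the ua-parser-js library, are illustrative assumptions, not the actual Fp-Scanner implementation.

const UAParser = require('ua-parser-js'); // illustrative choice of UA parsing library

function checkUserAgents(req, fp) {
  const headerUA = req.headers['user-agent'];   // server-side User-Agent header
  const jsUA = fp.userAgent;                    // navigator.userAgent collected client side
  if (headerUA !== jsUA) {
    return { consistent: false, reason: 'HTTP and JavaScript user agents differ' };
  }
  // Store the OS, browser and version substrings for the next steps of the suite.
  const parsed = new UAParser(jsUA).getResult();
  return {
    consistent: true,
    OSRef: parsed.os.name,
    browserRef: parsed.browser.name,
    browserVersionRef: parsed.browser.major,
  };
}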
Table 4.1 Mapping between common OS and platform values.

OS | Platforms
Linux | Linux i686, Linux x86_64, Linux, Linux armv8l
Windows | Win32, Win64
iOS | iPhone, iPad
Android | Linux armv7l, Linux i686, Linux armv8l
macOS | MacIntel
FreeBSD | FreeBSD amd64, FreeBSD i386
Navigator platform.
The value of navigator.platform reflects the platform on which the browser is running. This attribute is expected to be consistent with the variable OSRef extracted in the first step [4]. Nevertheless, consistent does not mean equal: for example, the user agent of a 32-bit browser running on a 64-bit Windows will contain the substring WOW64, which stands for Windows on Windows 64-bit, while the attribute navigator.platform will report the value Win32. Table 4.1 therefore maps OSRef to the possible values of navigator.platform for the most commonly used OSes.
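A minimal sketch of this test, with the content of Table 4.1 hard-coded, could look as follows; the function and variable names are hypothetical.

// Content of Table 4.1, hard-coded for the sketch.
const OS_TO_PLATFORMS = {
  'Linux':   ['Linux i686', 'Linux x86_64', 'Linux', 'Linux armv8l'],
  'Windows': ['Win32', 'Win64'],
  'iOS':     ['iPhone', 'iPad'],
  'Android': ['Linux armv7l', 'Linux i686', 'Linux armv8l'],
  'macOS':   ['MacIntel'],
  'FreeBSD': ['FreeBSD amd64', 'FreeBSD i386'],
};

// Consistent does not mean equal: a WOW64 user agent legitimately coexists
// with navigator.platform === "Win32".
function checkPlatform(OSRef, platform) {
  return (OS_TO_PLATFORMS[OSRef] || []).includes(platform);
}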
WebGL. WebGL is a JavaScript API that extends the HTML5 canvas API to render 3D objects from the browser. In particular, we propose a new test that focuses on two WebGL attributes related to the OS: renderer and vendor. The first attribute reports the name of the GPU, for example ANGLE (VMware SVGA 3D Direct3D11 vs_4_0 ps_4_0). Interestingly, the substring VMware indicates that the browser is executed in a virtual machine. Also, the ANGLE substring stands for Almost Native Graphics Layer Engine, which has been designed to bring OpenGL compatibility to Windows devices. The second WebGL attribute (vendor) is expected to provide the name of the GPU vendor, whose value actually depends on the OS. On a mobile device, the attribute vendor can report the string Qualcomm, which corresponds to the vendor of the mobile chip, while values like Microsoft are returned for Internet Explorer on Windows, or Google Inc for a Chrome browser running on a Windows machine. We summarize the mapping between the attributes renderer and vendor and the OS in Table 4.2.
Browser plugins.
Plugins are external components that add new features to the browser. When querying for the list of plugins via the navigator.plugins object, the browser returns an array of plugins containing detailed information, such as their filename and the associated extension, which reveals some indication of the OS.
Table 4.2 Mapping between OS and substrings in WebGL renderer/vendor attributes for common OSes.

OS | Renderer | Vendor
Linux | Mesa, Gallium | Intel, VMWare, X.Org
Windows | ANGLE | Microsoft, Google Inc
iOS | Apple, PowerVR | Apple, Imagination
Android | Adreno, Mali, PowerVR | Qualcomm, ARM, Imagination
macOS | OpenGL, Iris | Intel, ATI
Windows Phone | Qualcomm, Adreno | Microsoft
On Windows, plugin file extensions are .dll, on macOS they are .plugin or .bundle, and on Linux-based OSes they are .so. Thus, we propose a test that ensures that OSRef is consistent with its associated plugin filename extensions. Moreover, we also consider constraints imposed by some systems, such as mobile browsers that do not support plugins. Thus, reporting plugins on a mobile device is also considered an inconsistency.
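A possible sketch of this test, with illustrative names and the extension mapping hard-coded, is given below.

// Plugin filename extensions expected for each desktop OS.
const OS_TO_PLUGIN_EXT = {
  'Windows': ['.dll'],
  'macOS':   ['.plugin', '.bundle'],
  'Linux':   ['.so'],
};

function checkPlugins(OSRef, isMobile, plugins) {
  // Mobile browsers do not support plugins: any reported plugin is inconsistent.
  if (isMobile) return plugins.length === 0;
  const extensions = OS_TO_PLUGIN_EXT[OSRef] || [];
  return plugins.every((p) =>
    extensions.some((ext) => p.filename.toLowerCase().endsWith(ext)));
}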
Media queries.
Media queries are a feature included in CSS 3 that applies different style properties depending on specific conditions. The most common use case is the implementation of responsive web design, which adjusts the stylesheet depending on the size of the device, so that users have a different interface depending on whether they are using a smartphone or a computer. In this step, we consider a set of media queries provided by the Firefox browser to adapt the content depending on the value of desktop themes or Windows OS versions. Indeed, it is possible to detect the Mac graphite theme using the -moz-mac-graphite-theme media query [104]. It is also possible to test for specific themes present on Windows by using -moz-windows-theme. However, in the case of Windows, there is a more precise way to detect its presence, and even its version: the -moz-os-version media query reveals whether a browser runs on Windows XP, Vista, 7, 8 or 10. Thus, it is possible to detect some Mac users, as well as Windows users, when they are using Firefox. Moreover, since these media queries are only available in Firefox, if one of the previous media queries is matched, then it likely means that the real browser is Firefox.
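The following client-side sketch shows how these media queries can be probed with window.matchMedia; detectMozOS is an illustrative name and the -moz-os-version keywords follow Firefox's documented values.

// Probes Firefox-specific media queries and returns the hinted OS, or null.
function detectMozOS() {
  if (window.matchMedia('(-moz-mac-graphite-theme)').matches) {
    return 'macOS';
  }
  const winVersions = ['windows-xp', 'windows-vista', 'windows-win7',
                       'windows-win8', 'windows-win10'];
  for (const v of winVersions) {
    if (window.matchMedia('(-moz-os-version: ' + v + ')').matches) {
      return 'Windows'; // a match also implies that the real browser is Firefox
    }
  }
  return null;
}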
Fonts.
Saito et al. [105] demonstrated that fonts may be dependent on the OS. Thus, if a user claims to be on a given OS A but does not list any font linked to this OS A and, at the same time, displays many fonts from another OS B, then we may assume that OS A is not its real OS.
This first step of Fp-Scanner aims to check whether the OS declared in the user agent is the device’s real OS. In the next step, we extend our verification process by checking whether the browser and the associated version declared by the user agent have been altered.
4.1.2 Uncovering Browser Inconsistencies
This step requires the extraction of the variables browserRef and browserVersionRef from the user agent to further investigate their consistency.
Error.
In JavaScript, Error objects are thrown when a runtime error occurs. There exist 7 different types of errors for client-side exceptions, which depend on the problem that occurred. However, for a given error, such as a stack overflow, not all browsers throw the same type of error. In the case of a stack overflow, Firefox throws an InternalError while Chrome throws a RangeError. Besides the type of error, depending on the browser, error instances may also contain different properties. While two of them—message and name—are standard, others, such as description, lineNumber or toSource, are not supported by all browsers. Even properties such as message and name, which are implemented in all major browsers, may have different values for a given error. For example, executing null[0] on Chrome generates the error message "Cannot read property ’0’ of null", while Firefox generates "null has no properties", and Safari "null is not an object (evaluating ’null[0]’)".
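The snippet below sketches how such error signals can be collected client side; collectErrorSignals is an illustrative name, and the exact names and messages depend on the browser and its version.

// Triggers two runtime errors and records browser-dependent signals.
function collectErrorSignals() {
  const signals = {};
  try {
    const f = () => f(); // force a stack overflow
    f();
  } catch (e) {
    // "InternalError" on Firefox, "RangeError" on Chrome.
    signals.stackOverflowName = e.name;
  }
  try {
    null[0]; // dereference null
  } catch (e) {
    // e.g. "Cannot read property '0' of null" on Chrome,
    // "null has no properties" on Firefox.
    signals.nullMessage = e.message;
  }
  return signals;
}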
Function’s internal representation.
It is possible to obtain a string representation of any object or function in JavaScript by using the toString method. However, such representations—e.g., eval.toString()—may differ depending on the browser, with a length that characterizes it. Firefox and Safari return the same string, with a length of 37 characters, while on Chrome it has a length of 33 characters, and 39 on Internet Explorer. Thus, we are able to distinguish most major desktop browsers, except for Firefox and Safari. Then, we consider the property navigator.productSub, which returns the build number of the current browser. On Safari, Chrome and Opera, it always returns the string 20030107 and, combined with eval.toString().length, it can therefore be used to distinguish Firefox from Safari.
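A minimal sketch combining these two signals could look as follows; the hard-coded lengths reflect the values reported above and may evolve with browser versions.

// Combines eval.toString().length and navigator.productSub to guess the browser.
function guessBrowser() {
  const len = eval.toString().length; // 33: Chrome, 37: Firefox/Safari, 39: IE
  if (len === 33) return 'Chrome';
  if (len === 39) return 'Internet Explorer';
  if (len === 37) {
    // Safari, Chrome and Opera always report productSub === "20030107".
    return navigator.productSub === '20030107' ? 'Safari' : 'Firefox';
  }
  return 'Unknown';
}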
Navigator object. Navigator is a built-in object that represents the state and the identity of the browser. Since it characterizes the browser, its prototype differs depending not only on the browser’s family, but also on the browser’s version. These differences come from the availability of some browser-specific features, but also from two other reasons:
1. The order of the navigator properties is not specified and differs across browsers [4];
2. For a given feature, different browsers may name it differently. For example, the feature getUserMedia is available as mozGetUserMedia on Firefox and webkitGetUserMedia on a Webkit-based browser.
Moreover, as navigator properties play an important role in browser fingerprinting, our test suite detects if they have been overridden by looking at their internal string representation. In the case of a genuine fingerprint whose attributes have not been overridden in JavaScript, this representation should contain the substring native code. However, if a property has been overridden, it returns the code of the overriding function.
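The test can be sketched as follows, assuming the countermeasure overrides properties on the navigator prototype; findOverriddenNavigatorProperties is an illustrative name.

// Returns navigator prototype members whose string representation no longer
// contains "native code", a symptom of a JavaScript override.
function findOverriddenNavigatorProperties() {
  const overridden = [];
  const proto = Object.getPrototypeOf(navigator);
  for (const name of Object.getOwnPropertyNames(proto)) {
    const descriptor = Object.getOwnPropertyDescriptor(proto, name);
    const fn = descriptor.get || descriptor.value;
    if (typeof fn === 'function' && !fn.toString().includes('native code')) {
      overridden.push(name);
    }
  }
  return overridden;
}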
Browser features.
Browsers are complex pieces of software that evolve at a fast pace by adding new features, some being specific to a browser. By observing the availability of specific features, it is possible to detect if a browser is the one it claims to be [4, 37]. Since, for a given browser, features evolve depending on the version, we can also check if the available features are consistent with browserVersionRef. Otherwise, this may indicate that the browser version displayed in the user agent has been manipulated.
4.1.3 Uncovering Device Inconsistencies
This section aims at detecting if the device belongs to the class of devices it claims to
be—i.e., mobile or computer.
Browser events.
Some events are unlikely to happen, such as touch-related events (touchstart, touchmove) on a desktop computer. Conversely, mouse-related events (onclick, onmousemove) may not happen on a smartphone. Therefore, the availability of an event may reveal the real nature of a device.
Browser sensors.
Like events, some sensors may have different outputs depending on the nature of the device. For example, the accelerometer, which is generally assumed to only be available on mobile devices, can be read from a browser without requesting any authorization. The value of the acceleration will always slightly deviate from 0 on a real mobile device, even when it is lying on a table.
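As an illustration, the following sketch listens for a single devicemotion event and reports whether a non-zero acceleration was observed; the one-second timeout is an arbitrary assumption.

// Reports whether a devicemotion event with a non-zero acceleration is observed.
function checkAccelerometer(callback) {
  let done = false;
  function handler(event) {
    if (done) return;
    done = true;
    window.removeEventListener('devicemotion', handler);
    const acc = event.accelerationIncludingGravity;
    // On a real mobile device the values always slightly deviate from 0.
    callback(acc !== null && (acc.x !== 0 || acc.y !== 0 || acc.z !== 0));
  }
  window.addEventListener('devicemotion', handler);
  // No event within one second: assume no accelerometer is available.
  setTimeout(() => { if (!done) { done = true; callback(false); } }, 1000);
}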
(a) Canvas fingerprint with no countermeasure.
(b) Canvas fingerprint with a countermeasure.
Figure 4.2 Two examples of canvas fingerprints: (a) a genuine canvas fingerprint without any countermeasure installed in the browser and (b) a canvas fingerprint altered by the Canvas Defender countermeasure, which applies a uniform noise to all the pixels in the canvas.
4.1.4 Uncovering Canvas Inconsistencies
Canvas fingerprinting uses the HTML5 canvas API to draw 2D shapes using JavaScript. This technique, discovered by Mowery et al. [38], is used to fingerprint browsers. To do so, one scripts a sequence of instructions to be rendered, such as writing text, drawing shapes or coloring parts of the image, and collects the rendered output. Since the rendering of this canvas relies on the combination of different hardware and software layers, it produces small differences from device to device. An example of the rendering obtained on a Chrome browser running on Linux is presented in Figure 4.2a.
As we mentioned, the rendering of the canvas depends on characteristics of the device, and if an instruction has been added to the script, one can expect to observe its effects in the rendered image. Thus, we consider these scripted instructions as constraints that must be checked in the rendered image. For example, the canvas in Figure 4.2b has been obtained with the Canvas Defender extension installed. We observe that, contrary to the vanilla canvas obtained without any countermeasure (Figure 4.2a), the canvas with the countermeasure has a background that is not transparent, which can be seen as a constraint violation. We did not develop a new canvas test; we reused the one adopted by state-of-the-art canvas fingerprinting [27]. From the rendered image, our test suite checks the following properties:
1. Number of transparent pixels: as the background of our canvas must be transparent, we expect to find a majority of these pixels;
2. Number of isolated pixels, i.e., pixels whose rgba value differs from (0, 0, 0, 0) and that are only surrounded by transparent pixels. In the rendered image, we should not find this kind of pixel because the shapes and texts drawn are closed;
3. Number of pixels per color, which should be checked against the input canvas rendering script: even if it is not possible to know in advance the exact number of pixels with a given color, we expect to find the colors defined in the canvas script.
We also check if canvas-related functions, such as toDataURL, have been overridden.
4.2 Empirical Evaluation
This section compares the accuracy of Fp-Scanner with FingerprintJS2 [65], an open source fingerprinting script, and Augur [96], a commercial fingerprinting script, at classifying genuine and altered browser fingerprints modified by state-of-the-art fingerprinting countermeasures.
4.2.1 Implementing FP-Scanner
Instead of directly implementing and executing our test suite within the browser, thus being exposed to countermeasures, we split Fp-Scanner into two parts. The first part is a client-side fingerprinter, which uploads raw browser fingerprints to a remote storage server. For the purpose of our evaluation, this fingerprinter extends state-of-the-art fingerprinters, like FingerprintJS2, with the list of attributes covered by Fp-Scanner (e.g., the WebGL fingerprint). Table 4.3 reports the list of attributes collected by this fingerprinter. The resulting dataset of labeled browser fingerprints is made available to support the reproducibility of our results. 1
The second part of Fp-Scanner is the server-side implementation, in Python, of the test suite we propose (cf. Section 4.1). This section reports on the relevant technical issues related to the implementation of the 4 components of our test suite.
4.2.1.1 Checking OS Inconsistencies
OSRef is defined as the OS claimed by the user agent attribute sent by the browser and is extracted using a UA Parser library [106]. We used the browser fingerprint
1 FP-Scanner dataset: https://github.com/Spirals-Team/FP-Scanner
Table 4.3 List of attributes collected by our fingerprinting script.

Attribute | Description
HTTP headers | List of HTTP headers sent by the browser and their associated values
User agent navigator | Value of navigator.userAgent
Platform | Value of navigator.platform
Plugins | List of plugins (description, filename, name) obtained by navigator.plugins
ProductSub | Value of navigator.productSub
Navigator prototype | String representation of each property and function of the navigator object prototype
Canvas | Base64 representation of the image generated by the canvas fingerprinting test
WebGL renderer | WebGLRenderingContext.getParameter("renderer")
WebGL vendor | WebGLRenderingContext.getParameter("vendor")
Browser features | Presence or absence of certain browser features
Media queries | Whether media queries related to the presence of certain OSes match, using window.matchMedia
Errors type 1 | Generate a TypeError and store its properties and their values
Errors type 2 | Generate an error by creating a socket not pointing to a URL and store its string representation
Stack overflow | Generate a stack overflow and store the error name and message
Eval toString length | Value of eval.toString().length
Media devices | Value of navigator.mediaDevices.enumerateDevices
Touch support | Value of navigator.maxTouchPoints, whether a TouchEvent can be created, and whether the window object has the ontouchstart property
Accelerometer | true if the value returned by the accelerometer sensor is different from 0, else false
Screen resolution | Values of screen.width/height and screen.availWidth/availHeight
Fonts | Font enumeration using JavaScript [39]
Overwritten properties | String representation of the screen.width/height getters, as well as of the toDataURL and getTimezoneOffset functions
dataset from AmIUnique [27] to analyze whether some of the fonts they collected were only available on a given OS. We considered that if a font appeared at least 100 times for a given OS family, then it could be associated with this OS. We chose this relatively conservative value because the AmIUnique database contains many fingerprints that are spoofed, without us knowing which ones. Thus, by setting a threshold of 100, we may miss some fonts linked to a certain OS, but we limit the number of false positives—i.e., fonts that we would classify as linked to an OS but which should not be linked to it. Fp-Scanner checks if the fonts are consistent with OSRef by counting the number of fonts associated with each OS present in the user’s font list. If more than Nf = 1 fonts are associated with another OS than OSRef, or if no font is associated with OSRef, then Fp-Scanner reports an OS inconsistency. It also tests if -moz-mac-graphite-theme and @media(-moz-os-version: $win-version), with $win-version equal to Windows XP, Vista, 7, 8 or 10, are consistent with OSRef.
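The font part of this test can be sketched as follows, assuming a precomputed FONT_TO_OS mapping derived from the AmIUnique dataset; names and the default threshold are illustrative.

// FONT_TO_OS maps a font name to the OS family it was associated with.
function checkFonts(OSRef, fonts, FONT_TO_OS, Nf = 1) {
  let matchingOSRef = 0;
  let otherOS = 0;
  for (const font of fonts) {
    const os = FONT_TO_OS[font];
    if (os === undefined) continue; // font not linked to any OS
    if (os === OSRef) matchingOSRef++;
    else otherOS++;
  }
  // Inconsistent when more than Nf fonts belong to another OS,
  // or when no font at all is associated with OSRef.
  return otherOS <= Nf && matchingOSRef > 0;
}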
4.2.1.2 Checking Browser Inconsistencies
We extract browserRef using the same user agent parsing library as for OSRef. With regard to JavaScript errors, we check if the error collected in the fingerprint has a prototype, an error message, as well as a type consistent with browserRef. Moreover, for each attribute and function of the navigator object, Fp-Scanner also checks if the string representation reveals that it has been overridden. Testing if the features of the browser are consistent with browserRef is achieved by comparing the features collected using Modernizr [107] with the open source data file provided by the website Caniuse [108]. The file is freely available on Github 2 and represents most of the features present in Modernizr as a JSON file. For each of them, it details whether they are available on the main browsers, and for which versions. We consider that a feature can be present either if it is present by default or if it can be activated. Then, for each Modernizr feature we collected in the browser fingerprint, we check if it should be present according to the Caniuse dataset. If there are more than Ne = 1 errors—either features that should be available but are not, or features that should not be available but are—then we consider the browser as inconsistent.
2 List of available features per browser: https://github.com/Fyrd/caniuse/blob/master/data.json
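A sketch of this comparison is given below, assuming caniuse holds the parsed data.json file and that the Modernizr feature names have already been mapped to Caniuse feature names; this is an approximation of the actual implementation, which is written in Python.

function countFeatureErrors(caniuse, features, browserRef, browserVersionRef) {
  let errors = 0;
  for (const [feature, observed] of Object.entries(features)) {
    const entry = caniuse.data[feature];
    if (!entry) continue; // feature not described in the Caniuse file
    const perVersion = entry.stats[browserRef.toLowerCase()] || {};
    const support = perVersion[browserVersionRef];
    // "y" = supported by default; "a"/"d" = partial or activatable support.
    const expected = support !== undefined && support[0] !== 'n';
    if (expected !== observed) errors++;
  }
  return errors; // the browser is flagged inconsistent when errors > Ne
}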
4.2.1.3 Checking Device Inconsistencies
We verify that, if the device claims to be mobile, then the accelerometer value is set to true. We apply the same technique to touch-related events. However, we do not check the opposite—i.e., that computers have no touch-related events—as some new generations of computers include touch support. Concerning the screen resolution, we first check if the screen height and width getters have been overridden.
4.2.1.4 Checking Canvas Poisoning
To detect if a canvas has been altered, we extract the 3 metrics proposed in Section 4.1. We first count the number of pixels whose rgba value is (0, 0, 0, 0). If the image contains fewer than Ntp = 4,000 transparent pixels, or if it is entirely made of transparent pixels, then we consider that the canvas has been poisoned or blocked. Secondly, we count the number of isolated pixels. If the canvas contains more than 10 of them, then we consider it as poisoned. We did not set a lower threshold, as we observed that some canvases rendered with Safari on macOS include a small number of isolated pixels that are not generated by a countermeasure. Finally, the third metric tests the presence of the orange color (255, 102, 0, 100) by counting the number of pixels having this exact value, and also the number of pixels whose color is slightly different—i.e., pixels whose color vector $v_c$ satisfies $\lVert (255, 102, 0, 100) - v_c \rVert < 4$. Our intuition is that canvas poisoners inject a slight noise; thus, we should find no or few pixels with the exact value, and many pixels with a slightly different color.
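These checks can be sketched as follows (the isolated-pixel metric is omitted for brevity); ctx is assumed to be the 2D context of the rendered canvas, and the thresholds are those discussed above.

// "ctx" is the 2D context of the rendered canvas.
function analyzeCanvas(ctx, width, height) {
  const data = ctx.getImageData(0, 0, width, height).data;
  let transparent = 0, exactOrange = 0, nearOrange = 0;
  for (let i = 0; i < data.length; i += 4) {
    const r = data[i], g = data[i + 1], b = data[i + 2], a = data[i + 3];
    if (r === 0 && g === 0 && b === 0 && a === 0) transparent++;
    // Distance to the scripted orange color; we follow the document's
    // (255, 102, 0, 100) vector here.
    const dist = Math.hypot(r - 255, g - 102, b - 0, a - 100);
    if (dist === 0) exactOrange++;
    else if (dist < 4) nearOrange++;
  }
  const total = data.length / 4;
  const poisoned =
    transparent < 4000 || transparent === total ||  // Ntp test, or canvas blocked
    (exactOrange === 0 && nearOrange > 0);          // slightly shifted colors only
  return { transparent, exactOrange, nearOrange, poisoned };
}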
For each test of our suite, Fp-Scanner stores the details of the test so that it is possible to know whether the fingerprint is consistent, and which steps of the analysis failed.
Estimating the parameters.
Different parameters of Fp-Scanner, such as the number of transparent pixels, may influence its accuracy, resulting in different values of true and false positives. The strategy we use to optimize the value of a given parameter is to run the scanner test that relies on this parameter, and to tune the value of the parameter to minimize the false positive rate (FPR)—i.e., the ratio of fingerprints that would be wrongly marked as altered by a countermeasure, while being genuine. The reason why we do not run all the tests of the scanner to optimize a given parameter is that there may be some redundancy between different tests. Thus, changing a parameter value may not necessarily result in a modification of the detection,
Table 4.4 List of relevant tests per countermeasure.

Test (scope) | RAS | UA spoofers | Canvas extensions | FPRandom | Brave | Firefox
User Agents (global) ✓ ✓ ✓
Platform (OS) ✓ ✓ ✓
WebGL (OS) ✓ ✓ ✓
Plugins (OS) ✓ ✓ ✓
Media Queries (OS, browser) ✓ ✓ ✓
Fonts (OS) ✓ ✓ ✓
Error (browser) ✓ ✓ ✓
Function representation (browser) ✓ ✓
Product (browser) ✓ ✓
Navigator (browser) ✓ ✓ ✓
Enumerate devices (browser)
Features (browser) ✓ ✓ ✓ ✓
Events (device) ✓ ✓
Sensors (device) ✓ ✓
ToDataURL (canvas) ✓ ✓
Pixels (canvas) ✓ ✓ ✓ ✓ ✓
as a given countermeasure may be detected by multiple tests. Moreover, we ensure that countermeasures are detected for the appropriate symptoms. Indeed, while it is normal for a canvas countermeasure to be detected because some pixels have been modified, we consider it a false positive when it is detected because of a wrong browser feature threshold, as the countermeasure does not act on the browser claimed in the user agent. Table 4.4 describes, for each countermeasure, the tests that may reveal its presence. If a countermeasure is detected by a test that is not allowed, then the detection is considered a false positive.
Figures 4.3, 4.4 and 4.5 show the detection accuracy and the false positive rate (FPR) for different tests and different values of the parameters to optimize. We define the accuracy as $\frac{\#TP + \#TN}{\#Fingerprints}$, where true positives (TP) are the browser fingerprints correctly classified as inconsistent, and true negatives (TN) are fingerprints correctly classified as genuine. Table 4.5 shows, for each parameter, the optimal value we considered for the evaluation.
Figure 4.3 Detection accuracy and false positive rate using the transparent pixels test for different values of Ntp (number of transparent pixels).
The last column of Table 4.5 reports the false positive rate, as well as the accuracy obtained by running only the test that makes use of the parameter to optimize.
In the case of the number of transparent pixels Ntp, we observe no difference between 100 and 16,500 pixels. Between 16,600 and 18,600, there is a slight improvement in terms of accuracy caused by a change in the true positive rate. Thus, we chose a value of 17,200 transparent pixels, since it provides a false positive rate of 0 while maximizing the accuracy.
Table 4.5 Optimal values of the different parameters to optimize, as well as the FPR and the accuracy obtained by executing the test with the optimal value.

Attribute | Optimal value | FPR (accuracy)
Pixels: Ntp | 17,200 | 0 (0.93)
Fonts: Nf | 2 | 0 (0.42)
Features: Ne | 1 | 0 (0.51)
Figure 4.4 Detection accuracy and false positive rate using the fonts test for different values of Nf (number of fonts associated with the wrong OS).
Figure 4.5 Detection accuracy and false positive rate of the browser feature test for different values of Ne (number of wrong features).
Concerning the number of wrong fonts Nf, we obtained an accuracy of 0.646 with a threshold of one font, but this resulted in a false positive rate of 0.197. Thus, we chose a value of Nf = 2 fonts, which makes the accuracy of the test decrease to 0.42 but provides a false positive rate of 0.
Finally, concerning the number of wrong browser features Ne, increasing the threshold resulted in a decrease of the accuracy and an increase of the false negative rate. Nevertheless, only the false negative and true positive rates are impacted, not the false positive rate, which remains constant for the different values of Ne. Thus, we chose a value of Ne = 1.
Even if the detection accuracy of these tests may seem low—0.42 for the fonts and 0.51 for the browser features—they are only two tests among many, such as the media queries, WebGL or toDataURL tests, that can also be used to verify the authenticity of the information provided in the user agent or in the canvas.
4.2.2 Evaluating FP-Scanner
4.2.2.1 Building a Browser Fingerprint Dataset
To collect a relevant dataset of browser fingerprints, we created a web page that includes the browser fingerprinting script we designed. Besides collecting fingerprints, we also collect the system ground truth—i.e., the real OS, browser family and version, as well as the list of countermeasures installed. In the scope of our experiment, we consider the countermeasures listed in Table 4.6, as they are representative of the diversity of strategies we reported in the state of the art (Section 2.3). Although other academic countermeasures have been published [45, 69, 7, 66], it was not possible to consider them due to the unavailability of their code or because they could not be run anymore. Moreover, we still consider Random Agent Spoofer even though it is not available as a web extension—i.e., for Firefox versions > 57—since it modifies many attributes commonly considered by browser fingerprinting countermeasures.
We built this browser fingerprint dataset by accessing this web page from different browsers, virtual machines and smartphones, with and without countermeasures installed. The resulting dataset is composed of 147 browser fingerprints, randomly challenged by 7 different countermeasures. Table 4.6 reports the number of browser fingerprints per countermeasure. The number of browser fingerprints per countermeasure differs, since some countermeasures are deterministic in the way they operate. For example, Canvas Defender always adds a uniform noise to all the pixels of a canvas.
Table 4.6 Comparison of accuracies per countermeasure.

Countermeasure | Number of fingerprints | Accuracy FP-Scanner | Accuracy FP-JS2/Augur
Random Agent Spoofer (RAS) | 69 | 1.0 | 0.55
User agent spoofers (UAs) | 22 | 1.0 | 0.86
Canvas Defender | 26 | 1.0 | 0.0
Firefox protection | 6 | 1.0 | 0.0
Canvas FP Block | 3 | 1.0 | 0.0
FPRandom | 7 | 1.0 | 0.0
Brave | 4 | 1.0 | 0.0
No countermeasure | 10 | 1.0 | 1.0
Conversely, some countermeasures, such as Random Agent Spoofer, add more randomness due to the usage of real profiles, which requires more tests.
4.2.2.2 Measuring the Accuracy of FP-Scanner
We evaluate the effectiveness of Fp-Scanner, FingerprintJS2 and Augur at correctly classifying a browser fingerprint as genuine or altered. Our evaluation metric is the accuracy, as defined in Section 4.2.1. Overall, Fp-Scanner reaches an accuracy of 1.0, against 0.45 for FingerprintJS2 and Augur, which perform equally on this dataset. When inspecting the Augur and FingerprintJS2 scripts, and despite Augur’s obfuscation, we observe that they seem to perform the same tests to detect inconsistencies. As the number of fingerprints per countermeasure is unbalanced, Table 4.6 compares the accuracy achieved per countermeasure.
We observe that Fp-Scanner outperforms FingerprintJS2 at classifying a browser fingerprint as genuine or altered. In particular, Fp-Scanner detects the presence of canvas countermeasures, while FingerprintJS2 and Augur spotted none of them.
4.2.2.3 Analyzing the Detected Countermeasures
For each browser fingerprint, Fp-Scanner outputs the result of each test and the value that made the test fail. Thus, it enables us to extract a kind of signature for the different countermeasures. In this section, we execute Fp-Scanner in depth mode—i.e., for each fingerprint, Fp-Scanner executes all of the steps, even if an inconsistency is detected. For each countermeasure considered in the experiment, we report on the steps that revealed its presence.
User agent spoofers are easily detected, as they only operate on the user agent. Even when both values of the user agent are changed, they are detected by simple consistency checks, such as the platform test for the OS, or the function’s internal representation test for the browser.
Brave is detected because of the side effects it introduces, such as blocking canvas fingerprinting. Fp-Scanner distinguishes Brave from a vanilla Chromium browser by detecting that it overrides the navigator.plugins and navigator.mimeTypes getters. Thus, when Fp-Scanner analyzes Brave’s navigator prototype to check if any properties have been overridden, it observes the following output for the string representation of the plugins and mimeTypes getters: () => { return handler }. Moreover, Brave also overrides navigator.mediaDevices.enumerateDevices to block device enumeration, which can also be detected by Fp-Scanner, as it returns a Proxy object instead of an object representing the devices.
Random Agent Spoofer (RAS). By using a system of profiles, RAS aims at introducing fewer inconsistencies than purely random values. Indeed, RAS passes simple checks, such as having identical user agents or having a user agent consistent with navigator.platform. Nevertheless, Fp-Scanner still detects inconsistencies, as RAS only ensures consistency between the attributes contained in the profile. First, since RAS is a Firefox extension, it is vulnerable to the media query technique. Indeed, if the user is on a Windows device, or if the selected profile claims to be on Windows, then the OS inconsistency is directly detected. When this is not enough to detect its presence, plugins or fonts linked to the OS enable us to detect it. Browser inconsistencies are also easily detected, either using the function’s internal representation test or the error attributes. When only the browser version was altered, Fp-Scanner detects it by using the combination of Modernizr and Caniuse features.
RAS overrides most of the navigator attributes from the Firefox configuration file. However, the navigator.vendor attribute is overridden in JavaScript, which makes it detectable. Fp-Scanner also detects devices that claimed to be mobile devices but whose accelerometer value was undefined.
Firefox fingerprinting protection standardizes the user agent when the protection is activated and replaces it with Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0, thus lying about the browser version and the operating system for users not on Windows 7 (Windows NT 6.1). While OS-related attributes, such as navigator.platform, are updated, other attributes, such as the WebGL vendor and renderer, are not consistent with the OS. For privacy reasons, Firefox disabled the OS-related media queries presented earlier in this chapter for its versions > 57, whether or not the fingerprinting protection is activated. Nevertheless, when the fingerprinting protection is activated, Firefox pretends to be version 52 running on Windows 7. Thus, it should match the media query -moz-os-version for Windows 7, which is not the case. Additionally, when the browser was not running on Windows, the list of installed fonts was not consistent with the OS claimed.
Canvas poisoners, including Canvas Defender, Canvas FP Block and FPRandom, were all detected by Fp-Scanner. For the first two, as they are browser extensions that override canvas-related functions using JavaScript, we always detect that the function toDataURL has been altered. For all of them, we detect that the pixel constraints derived from our canvas definition were not satisfied. Indeed, we did not find enough occurrences of the color (255, 102, 0, 100), but we found pixels with a slightly different color. Moreover, in the case of the browser extensions, we also detected an inconsistent number of transparent pixels, as they apply noise to all the canvas pixels.
Table 4.7 summarizes, for each countermeasure, the steps of our test suite that detected inconsistencies. In particular, one can observe that Fp-Scanner extends the work of Nikiforakis et al. [4] by succeeding in detecting a wider spectrum of fingerprinting countermeasures that previously escaped their test suite (e.g., canvas extensions, FPRandom [9] and Brave [109]). We also observe that the tests that reveal the presence of countermeasures are consistent with the tests presented in Table 4.4.
4.2.2.4 Recovering the Ground Values
Beyond uncovering inconsistencies, we enhanced Fp-Scanner with the capability to restore the ground value of key attributes, namely the OS, browser family and browser version. To recover these attributes, we rely on the hypothesis that some attributes are harder to spoof, and hence more likely to reflect the true nature of the device. When Fp-Scanner does not detect any inconsistency in the browser fingerprint, the algorithm simply returns the values obtained from the user agent. Otherwise, it uses the same tests used to spot inconsistencies, but to restore the ground values:
OS value. To recover the real OS, we combine multiple sources of information, including plugin extensions, the WebGL renderer, media queries, and fonts linked to
Table 4.7 Fp-Scanner steps failed by countermeasures.

Test (scope) | RAS | UA spoofers | Canvas extensions | FPRandom | Brave | Firefox
User Agents (global)
Platform (OS)
WebGL (OS) ✓ ✓ ✓
Plugins (OS) ✓ ✓
Media Queries (OS, browser) ✓ ✓ ✓
Fonts (OS) ✓ ✓
Error (browser) ✓ ✓
Function representation (browser)
Product (browser) ✓ ✓
Navigator (browser) ✓ ✓
Enumerate devices (browser)
Features (browser) ✓ ✓
Events (device) ✓ ✓
Sensors (device) ✓ ✓
ToDataURL (canvas) ✓ ✓
Pixels (canvas) ✓ ✓ ✓ ✓
the OS. For each step, we obtain a candidate OS. Finally, we select the OS predicted by the majority of the steps (see the sketch after this list);
Browser family. Concerning the browser family, we rely on the function’s internal representation (eval.toString().length), which we combine with the value of productSub. Since these two attributes are discriminatory enough to distinguish most of the major browsers, we do not run more tests;
Browser version. To infer the browser version, we test the presence or absence of each Modernizr feature for the recovered browser family. Then, for each browser version, we count the number of detected features. Finally, we keep the list of versions with the maximum number of features in common.
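Below is a minimal sketch of the majority vote used for the OS; the getOSFrom* helpers stand for the per-test predictions described above and are assumed rather than shown.

// Each heuristic votes for a candidate OS; the majority wins.
function recoverOS(fp) {
  const votes = [
    getOSFromPluginExtensions(fp.plugins),
    getOSFromWebGL(fp.webGLRenderer, fp.webGLVendor),
    getOSFromMediaQueries(fp.mediaQueries),
    getOSFromFonts(fp.fonts),
  ].filter((os) => os !== null);

  const counts = {};
  for (const os of votes) counts[os] = (counts[os] || 0) + 1;
  let best = null;
  for (const os of Object.keys(counts)) {
    if (best === null || counts[os] > counts[best]) best = os;
  }
  return best;
}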
Evaluation.
We applied this recovery algorithm to fingerprints altered only by countermeasures that change the OS or the browser—i.e., RAS, user agent spoofers and the Firefox fingerprinting protection. Fp-Scanner was able to correctly recover the browser ground value for 100% of the devices. Regarding the OS, Fp-Scanner was always capable of predicting the OS family—i.e., Linux, macOS, Windows—but often failed to recover the correct version of Windows, as the technique we use to detect the version of Windows relies on Mozilla media queries, which stopped working after version 58, as already mentioned. Finally, Fp-Scanner failed to faithfully recover the browser version. Given the lack of discriminatory features in Modernizr, Fp-Scanner can only recover a range of candidate versions. Nevertheless, this could be addressed by applying natural language processing to browser release notes in order to learn the discriminatory features introduced by each version.
4.2.3 Benchmarking FP-Scanner
This part evaluates the overhead introduced by Fp-Scanner to scan a browser fingerprint. The benchmark we report has been executed on a laptop with an Intel Core i7 and 8 GB of RAM.
Performance of FP-Scanner.
We compare the performance of FP-Scanner with FingerprintJS2 in terms of processing time to detect inconsistencies. First, we automate Chrome headless version 64 using Puppeteer and run 100 executions of FingerprintJS2. In the case of FingerprintJS2, the reported time is the sum of the execution times of each function used to detect inconsistencies—i.e., getHasLiedLanguages, getHasLiedResolution, getHasLiedOs and getHasLiedBrowser. Then, we execute
Figure 4.6 Execution time of FingerprintJS2 inconsistency tests and Fp-Scanner with different settings.
different versions of FP-Scanner on our dataset. Input datasets, such as the Caniuse features file, are only loaded once, when Fp-Scanner is initialized. We start measuring the execution time after this initialization step, as it is only done once. Depending on the tested countermeasure, Fp-Scanner may execute more or fewer tests to scan a browser fingerprint. Indeed, against a simple user agent spoofer, the inconsistency might be quickly detected by checking the two user agents, while it may require analyzing the canvas pixels for more advanced countermeasures, like FPRandom. Thus, in Figure 4.6, we report 4 boxplots representing the processing time for the following situations:
1. FingerprintJS2 inconsistency tests;
2. The scanner stops upon detecting one inconsistency (Fp-Scanner (default) mode);
3. All inconsistency tests are executed (Fp-Scanner (depth) mode);
4. Only the test that manipulates the canvas (pixels) is executed (Fp-Scanner (canvas only) mode).
One can observe that, when all the tests are executed (3)—which corresponds to genuine fingerprints—90% of the fingerprints are processed in less than 513 ms. However, we observe a huge speedup when stopping the processing upon the first occurrence of an inconsistency (2). Indeed, while 83% of the fingerprints are processed in less than 0.21 ms, the remaining 17% need more than 440 ms. This is caused by the fact that most of the fingerprints we tested had countermeasures installed that could be detected using straightforward tests, such as media queries or testing for overridden functions, whereas the other fingerprints, having either no countermeasure or FPRandom (17 fingerprints), require running all the tests. This observation is confirmed by the fourth boxplot, which reports the performance of the pixel analysis step and shows the additional processing time imposed by analyzing all the canvas pixels. We recall that the pixel analysis step is required only to detect FPRandom, since the other canvas countermeasures can be detected by looking at the string representation of toDataURL. Thus, when disabling the pixel analysis test, Fp-Scanner outperforms FingerprintJS2 with a better accuracy (> 0.92) and a faster execution (90th percentile of 220 ms).
Based on this evaluation, we can conclude that adopting an inconsistency test suite like
Fp-Scanner in production is a viable solution to detect users with countermeasures.
4.3 Discussion
In this chapter, we demonstrated that state-of-the-art fingerprinting countermeasures
could be detected by scanning for inconsistencies they introduce in browser fingerprints.
We first discuss the privacy implications of such a detection mechanism and then explain
how these techniques could be used to detect browser extensions in general.
4.3.1 Privacy Implications
We identify and present the two main privacy implications that can arise from the use of
browser fingerprinting countermeasures: discrimination and tracking.
4.3.1.1 Discrimination
Being detected with a countermeasure could lead to discrimination. For example, Hannak et al. [110] demonstrated that some websites adjust prices depending on the user agent. Moreover, many websites refuse to serve browsers with ad blockers or users of the Tor browser and network. We can imagine users being delivered altered content or being denied access if they do not share their true browser fingerprint. Similarly to ad blocker extensions, discrimination may also happen with a countermeasure intended to block fingerprinting scripts.
4.3.1.2 Trackability
Detecting countermeasures can, in some cases, be used to improve tracking. Nikiforakis et al. [4] discuss the counterproductiveness of using user agent spoofers, as they make browsers more identifiable. We extend this line of thought to argue, more generally, that being detected with a fingerprinting countermeasure can make browsers more trackable, albeit this is not always the case. We assert that the ease of tracking depends on different factors, such as being able to identify the countermeasure, the number of users of the countermeasure, the ability to recover the real fingerprint values, and the volume of information leaked by the countermeasure. To support this claim, we analyze how it could impact the countermeasures we studied in this chapter.
Anonymity Set.
In the case of countermeasures with large user bases, like Firefox with fingerprinting protection or Brave, although their presence can be detected, these countermeasures tend to increase the anonymity set of their users by blocking different attributes and, in the case of Firefox, by sharing the same user agent, platform and timezone. Since they are used by millions of users at the time we wrote this chapter, the information obtained by knowing that someone uses them does not compensate for the loss in entropy resulting from the removal of fingerprinting attributes. Nevertheless, for countermeasures with small user bases, such as Canvas Defender (21k downloads on Chrome, 5k on Firefox) or RAS (160k downloads on Firefox), it is unlikely that the anonymity gained by the countermeasure compensates for the information obtained by knowing that someone uses it.
Increasing Targetability.
In the case of RAS, we show that it is possible to detect its presence and recover the original browser and OS family. Also, since the canvas attribute has been shown to have high entropy, and since RAS neither randomizes nor blocks it by default, the combination of a few attributes of a fingerprint may be enough to identify a RAS user. Thus, under the hypothesis that no, or few, RAS users share the same canvas, many of them could be identified by looking at the following subset of attributes: being a RAS user, predicted browser, predicted OS, and canvas.
Blurring Noise.
In the case of Canvas Defender, we show that, even though they claim to have a safer solution than other canvas countermeasure extensions, the way they operate makes it easier for a fingerprinter to track their users. Indeed, Canvas Defender applies a uniform noise vector to all the pixels of a canvas. This vector is composed of 4 random numbers between −10 and 30, corresponding to the red, green, blue and alpha (rgba) components of a color. With a small user base, it is unlikely that two or more users share both the same noise and the same original canvas. In particular, the formula hereafter represents the probability that two or more users of Canvas Defender among $k$ share the same noise vector, which is similar to the birthday paradox: $1 - \prod_{i=1}^{k}\left(1 - \frac{1}{40^4 - i}\right)$. Thus, if we consider that the 21k Chrome users are still active, there is a probability of 0.0082 that at least two users share the same noise vector. Moreover, by default, Canvas Defender does not change the noise vector. It requires the user to trigger the change, which means that if a user does not change the default settings or does not click on the button to update the noise, she may keep the same noise vector for a long period. Thus, when detecting that a browser has Canvas Defender installed—which can be easily detected, as the string representation of the toDataURL function leaks its code—if the fingerprinting algorithm encounters different fingerprints with the same canvas value, it can conclude that they originate from the same browser with high confidence. In particular, we discovered that Canvas Defender injects a script element in the DOM (cf. Listing 4.1). This script contains a function that overrides canvas-related functions and takes the noise vector as a parameter, which is not updated by default and has a high probability of being unique among Canvas Defender users. By using the JavaScript MutationObserver API [111] and a regular expression (cf. Listing 4.2), it is possible to extract the noise vector associated with the browser, which can then be used as an additional fingerprinting attribute.
function overrideMethods(docId, data) {
  const s = document.createElement('script');
  s.id = getRandomString();
  s.type = "text/javascript";
  const code = document.createTextNode('try { (' + overrideDefaultMethods +
    ')(' + data.r + ',' + data.g + ',' + data.b + ',' + data.a + ',"' + s.id +
    '","' + storedObjectPrefix + '"); } catch (e) { console.error(e); }');
  s.appendChild(code);
  var node = document.documentElement;
  node.insertBefore(s, node.firstChild);
  node[docId] = getRandomString();
}
Listing 4.1 Script injected by Canvas Defender to override canvas-related functions
var o = new MutationObserver((ms) => {
  ms.forEach((m) => {
    var script = "overrideDefaultMethods";
    if (m.addedNodes[0].text.indexOf(script) > -1) {
      var noise = m.addedNodes[0].text
        .match(/\d{1,2},\d{1,2},\d{1,2},\d{1,2}/)[0].split(",");
    }
  });
});
o.observe(document.documentElement, { childList: true, subtree: true });
Listing 4.2 Script to extract the noise vector injected by Canvas Defender
Protection Level.
While it may seem tempting to install an aggressive fingerprinting countermeasure—i.e., a countermeasure, like RAS, that blocks or modifies a wide range of attributes used in fingerprinting—we believe it may be wiser to use a countermeasure with a large user base, even if it does not modify many fingerprinting attributes. Moreover, in the case of widely-used open source projects, this may lead to the code base being audited more regularly than less adopted proprietary extensions. We also argue that all the users of a given countermeasure should adopt the same defense strategy. Indeed, if a countermeasure can be configured, it may be possible to infer the settings chosen by a user by detecting side effects, which may be used to target the subset of users that have a less common combination of settings. Finally, we recommend a defense strategy that either blocks access to an attribute or unifies the value returned for all users, rather than a strategy that randomizes the returned value based on the original value. Concretely, if the value results from a randomization process based on the original value, as with Canvas Defender, it may be possible to infer information about the original value.
4.3.2 Perspectives
In this chapter, we focused on evaluating the effectiveness of browser fingerprinting countermeasures. We showed that these countermeasures can be detected because of their side effects, which may then be used to target some of their users more easily. We think that the same techniques could be applied, in general, to any browser extension. Starov et al. [42] showed that browser extensions can be detected because of the way they interact with the DOM. Techniques similar to the ones we used to detect and characterize fingerprinting countermeasures could also be used for browser extension detection. Moreover, if an extension has different settings resulting in different fingerprintable side effects, we argue that these side effects could be used to characterize the combination of settings used by a user, which may make the user more trackable.
4.3.3 Threats to Validity
A possible threat lies in our experimental framework. We did extensive testing of Fp-Scanner to ensure that browser fingerprints were appropriately detected as altered. Table 4.7 shows that no countermeasure failed the steps unrelated to its defense strategy. However, as with any experimental infrastructure, there might be bugs. We hope that they only change marginal quantitative results and not the quality of our findings. In any case, we make the dataset, as well as the algorithm, publicly available online 1, making it possible to replicate the experiment.
We use a ruleset to detect inconsistencies, even though it may be time-consuming to maintain an up-to-date set of rules that minimizes the number of false positives while still detecting new countermeasures. Moreover, in this chapter, we focused on browser fingerprinting to detect inconsistencies. Nonetheless, we are aware of other techniques, such as TCP fingerprinting [29], that are complementary to our approach. Fp-Scanner aims to be general in its approach to detecting countermeasures. Nevertheless, it is possible to develop code that targets specific countermeasures, as we showed in the case of Canvas Defender. Thus, we consider our study as a lower bound on the vulnerability of current browser fingerprinting countermeasures.
4.4 Conclusion
In this chapter, we identified a set of attributes explored by Fp-Scanner to detect inconsistencies and to classify browser fingerprints into 2 categories: genuine fingerprints and fingerprints altered by a countermeasure. Thus, instead of taking the value of a fingerprint for granted, fingerprinters can check whether the attributes of a fingerprint have been modified to escape tracking algorithms, and apply different heuristics accordingly. To support this study, we collected 147 browser fingerprints extracted from browsers using state-of-the-art fingerprinting countermeasures, and we showed that Fp-Scanner was capable of accurately distinguishing genuine from altered fingerprints. We measured the overhead imposed by Fp-Scanner and observed that both the fingerprinter and the test suite impose a marginal overhead on a standard laptop, making our approach feasible for use by fingerprinters in production. Finally, we discussed how the possibility of detecting fingerprinting countermeasures, as well as being capable of predicting the ground value of the browser and the OS family, may impact user privacy. We argued that being detected with a fingerprinting countermeasure does not necessarily imply being tracked more easily. We took as an example the different countermeasures analyzed in this chapter to explain that the vulnerability to tracking depends on the capability of identifying the countermeasure used, the number of users having the countermeasure, the capacity to recover the original fingerprint values, and the information leaked by the countermeasure. Although Fp-Scanner is general in its approach to detecting the presence of countermeasures, we showed, using Canvas Defender as an example, that it is possible to develop countermeasure-specific code to extract more detailed information.
In the first two contribution chapters of this thesis, I showed that browser fingerprinting is a threat to privacy. Nevertheless, in the next chapter, I show that browser fingerprinting can also be applied to increase security on the web. In particular, I focus on how fingerprinting can be used to detect crawlers in addition to other approaches, such as CAPTCHAs. After measuring the use of fingerprinting for crawler detection among popular websites, I study the detection techniques they use and show some similarities with the approaches proposed in this chapter for detecting fingerprinting countermeasures.
Chapter 5
FP-Crawlers: Evaluating the
Resilience of Browser Fingerprinting
to Block Adversarial Crawlers
In 2017, bot traffic represented more than 40% of the traffic on the web [112]. Bots are
used for various purposes, ranging from ad fraud to automatically creating fake social
media accounts to spread malware. Bots are also used to automatically gather data on the
web, such as competitor prices, or to steal a website's content. In this chapter, I focus
on crawlers—i.e., bots specialized in the collection of data available on the web. While
some of them collect data with the agreement of the websites they crawl, the majority
often infringe the websites' terms of service.
To protect websites against unwanted crawlers, different techniques have been proposed,
such as CAPTCHAs or techniques that rely on features extracted from a series of HTTP
requests. In this chapter, I study how browser fingerprinting can be used to complement
existing crawler detection techniques. Indeed, browser fingerprinting addresses some of
the weaknesses of state-of-the-art crawler detection techniques:
Contrary to CAPTCHAs, browser fingerprinting does not require any user interaction;
Contrary to methods based on HTTP requests or time series analysis [76], fingerprinting requires a single request to decide whether or not a client is a crawler.
In Section 5.1, I first study the adoption of browser fingerprinting for crawler detection.
I crawl the Alexa Top 10K and identify 291 websites that block crawlers and show
that browser fingerprinting is used by 31.96% of these websites (93 websites).

Figure 5.1 Overview of FP-Crawlers: In Section 5.1 I crawl the Alexa Top 10K to
measure the ratio of websites using fingerprinting for crawler detection. In Section 5.2, I
explain the key fingerprinting techniques they use. Finally, in Section 5.3 I evaluate the
resilience of fingerprinting against adversarial crawlers.

Then,
in Section 5.2, I report on the key crawler detection techniques implemented by the
major fingerprinters and show that they use techniques similar to the ones presented in
Chapter 4 to reveal the presence of fingerprinting countermeasures. In Section 5.3, I
evaluate the resilience of these fingerprinting scripts against adversarial crawlers that try
to hide their identity to bypass security checks. I show that, while browser fingerprinting
is a good candidate for crawler detection, it can be bypassed by an adversary with some
knowledge of the fingerprints collected. In Section 5.4, I discuss the limits and the
challenges of browser fingerprinting for crawler detection. Finally, I provide a conclusion
to this chapter in Section 5.5.
Figure 5.1 provides an overview of this chapter and of the contributions of each section.
5.1 Detecting Crawler Blocking and Fingerprinting
Websites
In this section, we describe our experimental protocol to classify websites that adopt
browser fingerprinting to detect and block crawlers.
This protocol is composed of two main steps:
1. Detecting websites that block crawlers.
From Alexa's Top 10K, we identify
the websites that detect and block crawlers. We consider that these websites at
least use the user agent in the HTTP headers to detect and block the crawlers.
The subset of detected websites then provides us with an oracle that we use to evaluate
the resilience of browser fingerprinting (cf. Section 5.3);
2. Detecting websites that use fingerprinting.
Among the websites that block
crawlers (step 1), we detect the ones that use fingerprinting for crawler detection
and the ones that either do not use fingerprinting, or use fingerprinting but not to
detect crawlers. We then use the set of websites that use fingerprinting to conduct
our analysis (cf. Section 5.2).
5.1.1 Detecting Websites Blocking Crawlers
We first identify websites from Alexa’s Top 10K that block crawlers based on their user
agent.
Crawler 1: easy to detect.
For each website of the Alexa Top 10K, a first crawler (crawler 1) visits the homepage
of the website and then browses up to 4 random links of the same domain accessible
from the home page. We crawl only 4 links since the crawler can be easily identified by
its user agent. We consider that a page is loaded when there are no more than two active
network connections for at least 500 ms (networkidle2 event of the Puppeteer library).
If the page is not loaded after 30 seconds, we consider it failed to load and we add the
link of the page to a queue of links that are retried at the end of the crawl. If a page is
loaded, the crawler waits for 3 seconds, dumps the downloaded HTML, and takes a
screenshot of the rendered page. Crawler 1 is based on the Chromium headless version
bundled with the Puppeteer library [113] and is instrumented using Puppeteer. Using a
Chrome-headless-based crawler enables us to crawl the majority of the websites as it
supports modern features found in common non-headless browsers. We do not modify
the crawler's user agent, allowing it to
be detected solely from its user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2)
AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/72.0.3582.0 Safari/537.36.
Thus, we can identify websites that block crawlers when these crawlers do not
try to hide their identity. Although the blocking decision might not be based on the
crawler’s entire fingerprint, but mostly on its user agent, it still provides an indication
concerning the way the website reacts to crawlers. Since Chrome headless is a popular
solution for crawling and has replaced older headless browsers, such as PhantomJS, we
consider it unlikely that websites that aim at blocking crawlers do not detect Chrome
headless. Moreover, its user agent has been added to popular user agent lists used for
crawler detection in 2017 [114]. We also make the hypothesis that websites that try
to block bots using advanced approaches, such as traffic shape analysis, are also likely
to use lists of crawlers. Indeed, these lists have no false positives and enable websites to
block crawlers before they can even load a single page. Thus, they protect websites
against large distributed campaigns where each crawler visits only a few pages.
Crawler 2: verify errors.
In parallel, on a second machine that uses another IP address, a second crawler
(crawler 2) visits the homepage of each website of the Alexa Top 10K. Contrary to
crawler 1, this crawler modifies its user agent to pretend to be a vanilla Chrome browser.
We use this second crawler to verify, in case of errors encountered by crawler 1, whether
the website is lying about the error in order to stop the crawler, or whether it really
exhibits an error. On each home page, crawler 2 waits for the page to be loaded (using
the same criteria as the first crawler), waits 3 seconds, dumps the HTML and takes a
screenshot.
Labeling screenshots.
Then, we label websites as blocking crawlers or not. To do
so, we developed a web interface that displays the screenshots taken by crawler 1 and
crawler 2, side by side, for each visited URL. Each URL is assigned 1 of 3 possible labels:
1.
"not blocked": we consider that the crawler has not been blocked if the page does
not show any sign of blocking, such as an explicit message or a CAPTCHA;
2.
"blocked": we consider that the crawler has been blocked if the page reports some
obvious symptom of blocking, such as a CAPTCHA not linked to any form or a
message indicating that we are blocked.Moreover, in the case an error reports that
the page is not available or that the website is down in the screenshot taken by
crawler 1, we verify if it is actually the case using the screenshot taken by crawler 2.
If only crawler 1 has an error, then we consider it has been blocked;
3.
"unknown": corresponds to cases where we cannot assess with certainty the crawler
has been blocked. This situation can occur because the page timed-out. Moreover,
5.1 Detecting Crawler Blocking and Fingerprinting Websites 137
sometimes both crawler 1 and crawler 2 report a
403
error. In this case, we verify
that the website always returns a
403
error for human users. To do so, we manually
visit the website using a computer with a residential IP address, which has not been
used for the experiment and we verify if the website keeps returning the
403
error.
If it still returns a
403
error, then we classify the URL as "unknown" as it blocks
all users, and not only crawlers. Otherwise, the URL is classified as "blocked".
We consider that a website blocks crawlers if a crawler has been blocked on at least one
of its pages.
5.1.2 Detecting Websites that Use Fingerprinting
In the second phase, we focus on the websites we labeled as "blocked" because they block
crawlers. We crawl these websites to classify them as either using fingerprinting for
crawler detection or not.
Crawler modifications.
We apply two modifications to the crawler’s fingerprint to
escape detection based on HTTP headers. Indeed, in order to detect if a website uses
fingerprinting, the crawler needs to load and execute the JavaScript included in the pages.
First, we modify the user agent to look like a user agent from a vanilla Chrome. Second,
we add an accept-language header field as it is not included by default in Chrome
headless [115].
The crawler visits each home page of the websites identified as blocking crawlers and, for
each website, visits up to 3 randomly selected links on the same domain. We visit only 3
links as the goal of this crawl is to detect websites using fingerprinting on popular pages
that can easily be reached from their home page. Thus, it does not aim at detecting if
the website uses fingerprinting on specific sensitive pages, such as login or payment pages.
We consider the same heuristics as in the previous step to check if a page is loaded. If
the page fails to load, it is added to a queue of URLs that are retried at the end of the
crawl. For each visited URL, the crawler records the attributes commonly accessed in
browser fingerprinting [10, 28, 6]. To record the access, we inject a JavaScript snippet
executed in each page to override the default behavior of getters and functions, and
that stores, for each script of the page, when it accesses an attribute or calls a function.
We override attributes commonly associated with browser fingerprinting in the literature,
such as the properties of the navigator and the screen objects. We also override
functions related to canvas, audio, and WebGL fingerprinting. Finally, we monitor access
to attributes considered by security fingerprinting scripts, such as window._phantom or
navigator.webdriver, which are known to belong to crawlers. We explain the role of
these attributes in more detail in the next subsection. We monitor these attributes to
detect if a fingerprinting script tries to detect crawlers or if it uses fingerprinting for
tracking purposes. A complete list of the attributes and functions monitored is available
in Appendix A.
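To illustrate, here is a minimal sketch of such a monitoring snippet; the monitorGetter helper, the __accessLog global and the monitored properties are illustrative choices rather than our exact implementation:

// Wrap a getter so that each access is logged together with the calling
// script, roughly recovered from a stack trace (illustrative sketch).
function monitorGetter(proto, objName, prop) {
  const descriptor = Object.getOwnPropertyDescriptor(proto, prop);
  Object.defineProperty(proto, prop, {
    get: function () {
      // The third stack frame approximately points to the accessing script
      const caller = (new Error().stack || '').split('\n')[2] || 'unknown';
      (window.__accessLog = window.__accessLog || [])
        .push({ attribute: objName + '.' + prop, caller: caller.trim() });
      return descriptor.get.call(this);
    }
  });
}
monitorGetter(Navigator.prototype, 'navigator', 'userAgent');
monitorGetter(Screen.prototype, 'screen', 'width');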
Finally, we consider that a website uses fingerprinting for crawler detection if:
1. There is at least one script on one of its pages that calls one or more functions
related to canvas, WebGL, audio or WebRTC fingerprinting;
2. This script also accesses one crawler-related attribute, such as window._phantom
or navigator.webdriver;
3. This script also retrieves at least 12 fingerprinting attributes.
We adopt this definition since there is no clear agreement on how to characterize
fingerprinting, in particular when used in the context of crawler detection. For example,
Acar et al. [10] consider font enumeration as a good indicator of fingerprinting. However,
as we show in the next section, it is not the most discriminant feature for crawler detection.
Englehardt et al. [6] do not define fingerprinting in general, but detect whether functions
used for complex fingerprinting attributes, such as canvas, canvas font enumeration or
WebRTC fingerprinting, are called. Thus, our definition of fingerprinting ensures that a
script accesses
a sufficient number of fingerprinting attributes, in particular attributes considered as
strong indicators of fingerprinting, such as canvas. As we study fingerprinting for crawler
detection, we also add a constraint to check that the script must access at least one
crawler-related attribute, given that these are widely known and show intent to block
crawlers [116].
5.2 Analyzing Fingerprinting Scripts
In this section, we first study the ratio of websites that adopt browser fingerprinting for
crawler detection. Then, we present the detection techniques implemented by the main
fingerprinters present on the Alexa Top 10K.
5.2.1 Describing our Experimental Dataset
The different crawls to detect websites that block crawlers and websites using fingerprint-
ing were conducted in December 2018.
Sites blocking crawlers.
Among the 10,000 crawled websites, we identified 291 websites
that block crawlers (2.91%). The median Alexa rank of websites blocking crawlers is
4,946, against 5,001 for websites that do not block crawlers. If we compare the two
distributions of the Alexa rank among websites that block crawlers and those that do
not, using a Kolmogorov-Smirnov test [117], we obtain a p-value of 0.324. This statistical
test indicates that there is no significant difference between the distributions of the ranks
of websites that block crawlers and websites that do not block crawlers.
Fingerprinting attributes.
For each website that blocks crawlers, we study the number
of fingerprinting attributes it accesses. For a given website, we look at the script that
accesses the maximum number of distinct fingerprinting attributes. The median number
of distinct fingerprinting attributes accessed is 12, and 10% of the websites access more
than 33 distinct fingerprinting attributes. Concerning crawler-related attributes, such as
navigator.webdriver or window._phantom, 51.38% of the websites do not access even
one such attribute, while 10% of them access 10 crawler attributes. Based on our
definition of browser fingerprinting, we classified 93 websites as using fingerprinting for
crawler detection, which represents 31.96% of the websites that block crawlers.
Diversity of fingerprinting scripts.
We group fingerprinting scripts by the combination of attributes they access. In total,
we observe 20 distinct groups among websites blocking crawlers. While each group is
constituted of scripts from the same company, we also observe that some companies are
present in different clusters, as they have multiple versions of their script. In order to
analyze the main fingerprinting techniques used for crawler detection, we focus on the
scripts of 4 fingerprinting companies, as they represent more than 90% of the scripts
among the websites that block crawlers. Since these companies have multiple versions of
their script, we chose the script that accesses the greatest number of distinct
fingerprinting attributes. We decided not to
disclose the names of these companies since it does not contribute to the understanding
of fingerprinting and our findings could be used by crawler developers to specifically
target some websites. Moreover, for copyright reasons, we cannot distribute the original
scripts nor their deobfuscated version. We discuss this point and other ethical issues in
Section 5.4.3.
In the remainder of this section, we present the techniques used by the 4 main fingerprinting
scripts to detect crawlers. These scripts collect fingerprinting attributes and
then either perform a detection test directly in the browser or transmit the fingerprints
to a server that applies heuristics to perform the detection test. Table 5.1 provides an
overview of the different attributes and tests. In particular, for each given attribute and
script, there are three possible values:
1. ✓ indicates that the script collects the attribute and tests it in the browser—i.e.,
its value is explicitly verified or we know the attribute is used for the detection
because of the evaluation conducted in Section 5.3;
2. ∼ indicates that the script collects the attribute, but no test is run directly in the
script. This means that the fingerprinter may use the value collected to run the
test on the server side. The empirical evaluation we conduct in Section 5.3 helps
us to understand if some of the attributes are used on the server side to detect
inconsistencies;
3. The absence of a symbol indicates that the attribute is not collected by the script.
It should be noted that the 4 scripts we analyze are obfuscated. We thus cannot use
the names of the variables or functions to infer their purpose. Instead, we use different
techniques, such as open source fingerprinting libraries [65], the state-of-the-art literature,
as well as an empirical evaluation conducted in Section 5.3, to explain how different
combinations of collected attributes are used to detect crawlers.
5.2.2 Detecting Crawler-Specific Attributes
The first and most widely used detection technique in the 4 fingerprinting scripts relies
on the presence of specific attributes injected into the JavaScript execution context or in
the HTML DOM by headless browsers or instrumenting frameworks, such as Selenium.
For example, in the case of Google Chrome or Firefox, it is possible to detect if
a browser instance is automated by checking if the attribute webdriver is included in
the navigator object and if its value is equal to true. The scripts test for the presence
of the following properties, which used to be added to the document object by older
versions of Selenium: 1. __fxdriver_unwrapped, 2. __selenium_unwrapped, and
3. __webdriver_script_fn.
Older versions of Selenium and its associated plugins also used to inject variables into
the window object, such as _Selenium_IDE_Recorder, callSelenium, or _selenium.
While all the scripts test for the presence of these variables, they are not added anymore
by the most recent versions of Selenium [118] or its associated plugins, such as Selenium
IDE [119]. Besides Selenium, the scripts also detect headless browsers and automation
libraries, such as PhantomJS and NightmareJS, by checking the presence of specific
attributes in the window object: _phantom, callPhantom, phantom or __nightmare.

Table 5.1 Different fingerprinting tests associated with the scripts that use them. The
✓ symbol indicates the attribute is collected and that a verification test is run directly
in the script. The ∼ symbol indicates that the attribute is collected but there is no
verification test in the script.

                                          Scripts
Name of the test                       1    2    3    4
Crawler-related attributes             ✓    ✓    ✓    ✓
Browser
  productSub                           ∼    ∼    ∼
  eval.toString()                      ✓    ✓
  Error properties                     ✓    ✓
  Browser-specific/prefixed APIs       ✓    ✓    ✓    ✓
  Basic features                       ✓    ✓    ✓
  Different feature behaviour
  CSS features
  Codecs supported
  HTTP headers
OS
  Touch screen support                 ∼    ∼    ∼
  Oscpu and platform                   ∼    ∼    ∼
  WebGL vendor and renderer            ∼    ∼    ∼
  List of plugins                      ∼    ∼    ∼
  List of fonts                        ∼    ∼
  Screen dimensions                    ∼    ∼
  Overridden attributes/functions      ✓    ✓    ✓
Other
  Events                               ∼    ∼
  Crawler trap
  Red pill
  Audio fingerprint
  Canvas fingerprint                   ∼    ∼    ∼
  WebGL fingerprint                    ∼    ∼
  WebRTC
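As a minimal sketch, and assuming the attribute names listed above, such a check boils down to probing the window and document objects; the isAutomatedBrowser helper below is illustrative rather than any fingerprinter's actual code:

// Probe for attributes injected by automation frameworks and headless
// browsers (illustrative sketch).
function isAutomatedBrowser() {
  const windowProps = ['_phantom', 'callPhantom', 'phantom', '__nightmare',
                       '_Selenium_IDE_Recorder', 'callSelenium', '_selenium'];
  const documentProps = ['__fxdriver_unwrapped', '__selenium_unwrapped',
                         '__webdriver_script_fn'];
  if (navigator.webdriver === true) return true;
  return windowProps.some(p => p in window)
      || documentProps.some(p => p in document);
}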
While the presence of any of these attributes provides a straightforward heuristic to
detect crawlers with certainty, these attributes can be easily removed, or renamed to
escape detection. Thus, we describe more robust detection techniques, based on browser
fingerprint inconsistencies, to overcome this limitation. We structure the presentation of
the inconsistencies searched by fingerprinters into four categories and we present a fifth
family of common non-fingerprinting tests found in the fingerprinting scripts: 1. browser
and version inconsistencies, 2. OS inconsistencies, 3. screen inconsistencies, 4. overridden
functions inconsistencies, 5. other tests.
5.2.3 Checking Browser Inconsistencies
Commercial fingerprinting scripts also leverage the notion of fingerprint inconsistency
presented in Chapter 4 to detect automated browsers. The first set of inconsistency
verifications found across the 4 fingerprinting scripts aims at verifying if the browser
claimed in the user agent has been altered. Before we present the different tests used to
verify the nature of a browser, we provide a brief overview of what inconsistencies are
and how they can help reveal crawlers.
Fingerprint inconsistencies.
Similarly to the detection test suite I proposed in
the previous chapter, fingerprinters also use inconsistencies to detect combinations of
attributes that cannot be found in the wild for non-headless or non-automated browsers.
In the case of crawlers, inconsistencies can occur in different situations:
When crawler developers alter the user agent of their crawlers to avoid detection.
By doing this, they may introduce inconsistencies between the browser and the OS
claimed in the user agent, and the different attributes composing the fingerprint;
When a crawler is based on a headless browser. Because headless browsers do not
always implement all the features of modern browsers, or they implement them
differently, they can introduce inconsistencies between the browser claimed in the
user agent and the features it should expose.
5.2.3.1 Explicit browser consistency tests
One of the scripts implements tests similar to the function getHasLiedBrowser proposed
by the open source FingerprintJS2 library [65]:
productSub. First, it extracts the browser from the user agent and verifies if it has a
consistent navigator.productSub value. While originally this attribute returned
the build number of the browser, it always returns 20030107 on Chromium-based
browsers and Safari, and it returns 20100101 on Firefox;
eval.toString. Then, it runs the following snippet of code: eval.toString().length,
which returns the length of the string representation of the native eval function.
While on Safari and Firefox it is equal to 37, on Internet Explorer it is equal to 39,
and it is equal to 33 on Chromium-based browsers;
Error properties. It throws an exception and catches it to analyze the properties of
the error. While some of the properties of the Error objects, such as message and
name, are standard across different browsers, some of them, such as toSource, exist
only in Firefox. Thus, the script verifies that if the toSource property is present
in the error, then the browser is Firefox.
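As an illustration, the three tests could be combined as follows; this is a sketch under the assumption that a claimedBrowser value has been parsed from the user agent, not the fingerprinter's actual code:

function hasLiedBrowser(claimedBrowser) {
  // eval.toString().length: 33 on Chromium, 37 on Firefox/Safari, 39 on IE
  const evalLen = eval.toString().length;
  if (claimedBrowser === 'Chrome' && evalLen !== 33) return true;
  if ((claimedBrowser === 'Firefox' || claimedBrowser === 'Safari')
      && evalLen !== 37) return true;

  // productSub: 20030107 on Chromium-based browsers and Safari, 20100101 on Firefox
  if (claimedBrowser === 'Firefox' && navigator.productSub !== '20100101') return true;
  if ((claimedBrowser === 'Chrome' || claimedBrowser === 'Safari')
      && navigator.productSub !== '20030107') return true;

  // Error properties: toSource only exists on (older) Firefox
  try {
    null[0];
  } catch (err) {
    if (('toSource' in err) !== (claimedBrowser === 'Firefox')) return true;
  }
  return false;
}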
5.2.3.2 Feature Detection
The 4 scripts test for the presence of different features. The previous set of tests was the
same as the getHasLiedBrowser function of the FingerprintJS2 library; thus, we did
not need to infer its purpose. Here, we present how different features tested across the 4
fingerprinting scripts can be used to reveal inconsistencies in the nature of the browser
and its version, even when the tests are not executed in the fingerprinting scripts.
Browser-specific and browser-prefixed APIs.
Instead of relying on the user agent string to infer the nature of a browser, it is possible
to test for the presence of specific features linked to certain browsers and certain
versions [37]. In particular, all the scripts test for the presence of the window.chrome
object, a utility for extension developers available on Chromium-based browsers. This
feature can also help to reveal Chrome headless since it does not have this object, even
though it is based on Chrome. To verify if a browser is Safari, one of the scripts tests
for the presence of the pushNotification function in window.safari. For Opera, the
script verifies the presence of window.opera. Finally, for Firefox, it verifies if the
InstallTrigger variable is defined, and for Internet Explorer, it verifies the value
returned by eval("/*@cc_on!@*/false"). The latter test relies on conditional
compilation, a feature available in old versions of Internet Explorer but not in recent
browsers. Thus, the code returns true on Internet Explorer and false on modern browsers.
One of the scripts tests for the presence of features whose name depends on the browser
vendor. For example, it verifies that the function requestAnimationFrame is present
together with msRequestAnimationFrame or webkitRequestAnimationFrame.
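As a rough sketch, these vendor-specific globals allow the browser family to be inferred without ever reading the user agent; the function below and its return labels are illustrative:

// Infer the browser family from vendor-specific features (illustrative).
function inferBrowserFromFeatures() {
  if (typeof window.chrome === 'object') return 'Chromium-based';  // absent in early Chrome headless
  if (typeof InstallTrigger !== 'undefined') return 'Firefox';
  if (window.safari && window.safari.pushNotification) return 'Safari';
  if (typeof window.opera !== 'undefined') return 'Opera';
  if (eval('/*@cc_on!@*/false')) return 'Internet Explorer';       // conditional compilation
  return 'unknown';
}

Comparing the inferred family with the one claimed in the user agent directly exposes a lie.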
Basic features.
Two of the studied scripts verify the presence of the bind function. While this test does
not help in detecting recent headless browsers, it used to detect PhantomJS [120], as
PhantomJS did not have this function. Another script collects the first 100 properties
of the window object, returned by Object.keys, to verify if they are consistent with the
browser claimed. Finally, one of the scripts tests a set of 18 basic features, such as
creating or removing event listeners using addEventListener and removeEventListener.
It also tests other APIs that have been available in mainstream browsers for a long time,
such as Int8Array [121], included since Internet Explorer 10, or the MutationObserver
API [122], available since Internet Explorer 11. Since the majority of these features are
present in all the recent versions of mainstream browsers, such as Chrome, Safari or
Firefox, they can be used to detect non-standard or headless browsers that do not
implement these features.
Different feature behavior.
Even when a feature is present, its behavior may depend on the browser. For example,
in a blog post [123], I showed that Chrome headless fails to handle permissions [124] in a
consistent way. When requesting permissions using two different techniques, as shown in
Listing 5.1—Notification.permission and navigator.permissions.query—Chrome
headless returns conflicting values, which differs from the behavior of a vanilla Chrome.
One of the scripts exploits this inconsistency to detect crawlers based on Chrome headless.
navigator.permissions.query({ name: 'notifications' })
  .then(function (permissionStatus) {
    if (Notification.permission === 'denied' &&
        permissionStatus.state === 'prompt') {
      console.log('This is Chrome headless')
    } else {
      console.log('This is not Chrome headless')
    }
  });
Listing 5.1 Checking if permissions are consistent.
Another feature whose behavior depends on the browser, and on whether or not it is in
headless mode, is the image error placeholder. When an image cannot be loaded, the
browser replaces it with a placeholder. Nevertheless, in the early versions of Chrome
headless, there was no placeholder [123]. Thus, these versions of Chrome headless can be
detected because the width and the height of the placeholder are equal to 0 pixels. One
of the scripts detects this by creating an image whose src attribute points to a random
URL that does not exist and then measures the size of the placeholder. Since the size
also depends on the browser, it can be used to verify its nature:
On Chromium-based browsers it measures 16x16 pixels and its size does not depend
on the zoom level;
On Safari it measures 20x20 pixels and its size depends on the zoom level;
On Firefox it measures 24x24 pixels and its size does not depend on the zoom level.
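A minimal sketch of this test follows; the intentionally invalid URL is illustrative:

// Broken-image placeholder test: early Chrome headless renders a 0x0
// placeholder, while vanilla browsers render 16x16/20x20/24x24 pixels.
const img = document.createElement('img');
img.onerror = function () {
  if (img.width === 0 && img.height === 0) {
    console.log('Likely Chrome headless: no error placeholder');
  } else {
    console.log('Placeholder size: ' + img.width + 'x' + img.height);
  }
};
img.src = 'https://this-domain-does-not-exist-' + Math.random() + '.invalid/img.png';
document.body.appendChild(img);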
CSS features.
One of the scripts we studied also collects the CSS properties applied to the body
element using the getComputedStyle function. Similarly to JavaScript features that are
browser-dependent, such as webkit/msRequestAnimationFrame, it is possible to test if
some of the browser-dependent CSS features are consistent with the browser claimed in
the user agent.
5.2.3.3 Audio & Video Codecs
One of the scripts tests for the presence of different audio and video codecs. To do so,
it creates an audio and a video element on which it applies the canPlayType method
to test the availability of the audio and video codecs presented in Tables 5.2 and 5.3.
The canPlayType function returns 3 possible values:
1. "probably", which means that the media type appears to be playable;
2. "maybe", which indicates that it is not possible to tell if the type can be played
without playing it;
3. "", an empty string indicating that the type cannot be played.
Tables 5.2 and 5.3 report on the audio and video codecs supported by vanilla browsers.
The tables are based on the dataset from Caniuse [125–131], as well as data collected on
the personal website of the thesis author. We can observe that some codecs are not
supported by all browsers, which means that they can be used to check the browser
claimed in the user agent; a minimal probe is sketched after Table 5.3.
Table 5.2 Support of audio codecs for the main browsers.
Audio codec Chrome Firefox Safari
ogg vorbis probably probably ""
mp3 probably maybe maybe
wav probably maybe maybe
m4a maybe maybe maybe
aac probably maybe maybe
Table 5.3 Support of video codecs for the main browsers.
Video codec Chrome Firefox Safari
ogg theora probably probably ""
h264 probably probably probably
webm vp8 probably probably ""
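The probe announced above could look as follows; the selection of codecs is illustrative:

// Probe codec support; comparing the answers with Tables 5.2 and 5.3
// reveals a browser that lies in its user agent.
const audio = document.createElement('audio');
const video = document.createElement('video');
const codecSupport = {
  oggVorbis: audio.canPlayType('audio/ogg; codecs="vorbis"'), // "" on Safari
  mp3:       audio.canPlayType('audio/mpeg'),
  wav:       audio.canPlayType('audio/wav; codecs="1"'),
  h264:      video.canPlayType('video/mp4; codecs="avc1.42E01E"'),
  webmVp8:   video.canPlayType('video/webm; codecs="vp8"')    // "" on Safari
};
// e.g., a Chrome user agent combined with codecSupport.mp3 !== 'probably'
// is inconsistent with Table 5.2.
console.log(codecSupport);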
5.2.3.4 HTTP headers
Contrary to JavaScript and CSS features, which are collected in the browser, HTTP
headers are collected on the server side. Thus, we cannot directly observe if fingerprinters
collect these headers. Nevertheless, because of side-effects, such as being blocked, we
observe that all fingerprinters collect at least the user agent header. Moreover, we also
detect that 2 of the fingerprinters test for the presence of the accept-language header.
Indeed, by default, Chrome headless does not send this header. In the evaluation, we
show that its absence enables some of the fingerprinters to block crawlers based on
Chrome headless.
5.2.4 Checking OS Inconsistencies
Only one script among the four performs an explicit OS verification. Nevertheless, this
does not mean that the others do not conduct such tests on the server side, using
attributes collected by the fingerprinting script or using other techniques, such as TCP
fingerprinting [29]. However, while users with fingerprinting countermeasures, such as
spoofers, may be tempted to lie about their OS in order to increase their privacy [17],
this is not required for crawlers, which only need to hide their nature to escape detection.
Thus, even though verifying OS consistency can catch crawlers that modified the OS
displayed in the user agent, it does not help against crawlers that only modify the nature
of the browser.
5.2.4.1 Explicit OS consistency tests
The set of tests conducted by the only fingerprinter that verifies the OS in its script is
similar to the getHasLiedOs function of the FingerprintJS2 library [65]. It extracts the
OS claimed in the user agent to use it as a reference and then runs the following set of
tests:
1. Touch screen verification. It tests if the device supports a touch screen by
verifying the following properties: the presence of the ontouchstart property in the
window object, and whether navigator.maxTouchPoints or navigator.msMaxTouchPoints
is greater than 0. If the device claims to have touch support, then it should be
running one of the following operating systems: Windows Phone, Android or iOS;
2. Oscpu and platform. Oscpu is an attribute, only available on Firefox, that
returns a string representing the platform on which the browser is executing.
The script verifies that the OS claimed in the user agent is consistent with the
navigator.oscpu attribute. For example, if the platform attribute indicates that
the device is running on arm, then the OS should be Android or Linux. Similarly,
if the platform is iPad or iPhone, then the OS should be iOS. The script conducts
similar tests with the navigator.platform attribute.
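A hedged sketch of these two tests, in the spirit of getHasLiedOs, is given below; the claimedOs argument (parsed from the user agent) is an assumption of this sketch:

function hasLiedOs(claimedOs) {
  // Touch screen verification
  const hasTouch = 'ontouchstart' in window
    || navigator.maxTouchPoints > 0
    || navigator.msMaxTouchPoints > 0;
  if (hasTouch && !['Windows Phone', 'Android', 'iOS'].includes(claimedOs)) return true;

  // Oscpu and platform verification
  const platform = (navigator.oscpu || navigator.platform || '').toLowerCase();
  if (platform.includes('arm') && !['Android', 'Linux'].includes(claimedOs)) return true;
  if ((platform.includes('ipad') || platform.includes('iphone'))
      && claimedOs !== 'iOS') return true;
  return false;
}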
Only one fingerprinter runs the above set of tests directly in its script. Nevertheless,
the other three fingerprinting scripts also collect information about the presence of a
touch screen, navigator.platform and navigator.oscpu. Thus, they may run similar
verifications on the server side.
WebGL information.
Three of the scripts use the WebGL API to collect information about the vendor and
the renderer of the graphics drivers. As we explained in Chapter 4, these values are
linked to the OS and can be used to verify OS consistency [17]. For example, a renderer
containing "Adreno" indicates the presence of an Android device, while a renderer
containing "Iris OpenGL" reveals the presence of macOS. One of the scripts also verifies
if the renderer is equal to "Mesa OffScreen", which is one of the values returned by the
first versions of Chrome headless [132, 123].
List of plugins.
The four scripts collect the list of plugins using the navigator.plugins property. While
some of the plugins are browser dependent and can be used to verify the claimed browser,
they can also be used to verify the OS [17]. Indeed, the file extension of a plugin provides
indications about the current OS: plugin file extensions should be .so on Linux, .plugin
on Mac and .dll on Windows.
List of fonts.
Two of the fingerprinting scripts collect a list of fonts using JavaScript font
enumeration [4]. While it can be used to increase the uniqueness of the fingerprint [39],
it can also be used to reveal the underlying OS [105, 17] since, by default, some fonts
are only found on specific OSes.
5.2.5 Checking Screen Inconsistencies
The four scripts collect information related to the screen and window sizes. In particular,
they all collect the following attributes:
screen.width/height
screen.availWidth/Height
screen.colorDepth
window.innerWidth/Height
window.devicePixelRatio
For example, screen.width and screen.height represent the width and the height of
the web-exposed screen, respectively. The screen.availWidth and screen.availHeight
attributes represent the horizontal and vertical space in pixels available to the window,
respectively. Thus, one of the scripts verifies that the available height and width are
always less than (in case there is a desktop toolbar) or equal to the height and the
width. Another property used to detect some headless browsers is the fact that, by
definition, window.outerHeight/Width should be greater than window.innerHeight/Width.
Nevertheless, one should be careful when using this test since it does not hold on iOS
devices [133], where the outerHeight is always equal to 0.
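These two invariants translate into a short sketch; the function name and the iOS exception handling below are illustrative:

function hasScreenInconsistency() {
  // Available dimensions can never exceed the screen dimensions
  if (screen.availWidth > screen.width
   || screen.availHeight > screen.height) return true;
  // Outer window dimensions should not be smaller than inner ones,
  // except on iOS where outerHeight is always 0
  const isIOS = /iPad|iPhone|iPod/.test(navigator.platform || '');
  if (!isIOS && (window.outerWidth < window.innerWidth
              || window.outerHeight < window.innerHeight)) return true;
  return false;
}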
5.2.5.1 Overridden Inconsistencies
Crawler developers may be aware of the detection techniques presented in this section
and try to hide such inconsistencies by forging the expected responses—i.e., providing a
fingerprint that could come from a vanilla browser, and thus not be detected as a crawler.
To do so, one solution is to intercept the outgoing requests containing the fingerprint
to modify them on the fly; however, this cannot always be easily done when scripts are
carefully obfuscated and randomized. Another solution is to use JavaScript to override
the functions and getters used to collect the fingerprint attributes. However, when doing
this, the developer should be careful to hide the fact that she is overriding native
functions and attributes. If not, checking the string representation of the functions will
reveal that a native function has been intentionally overridden. While a standard
execution of functionName.toString() returns a string containing native code in the
case of a native function, it returns the code of the new function if it has been overridden.
We observe that all the scripts check whether fingerprinting functions, such as
getImageData, used to obtain a canvas value, or the WebRTC class constructor, have
been overridden. In particular, they verify functions such as setTimeout,
requestAnimationFrame and bind. Beyond native functions, one of the scripts also
checks if native objects, such as navigator.geolocation, have been overridden. By
default, the toString representation of the geolocation object is "[object Geolocation]".
Nevertheless, if a developer overrides the function that returns a geolocation object, its
string representation will be similar to that of any other object: "[object Object]".
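A minimal sketch of this verification follows; the looksOverridden helper name is illustrative:

// A genuine native function stringifies to "... { [native code] }";
// an overridden one exposes its JavaScript body instead.
function looksOverridden(fn) {
  try {
    return !/\{\s*\[native code\]\s*\}$/.test(Function.prototype.toString.call(fn));
  } catch (e) {
    return true; // a toString that throws is itself suspicious
  }
}
// Example checks on functions targeted by the scripts:
looksOverridden(CanvasRenderingContext2D.prototype.getImageData);
looksOverridden(window.setTimeout);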
Detection using side effects.
Besides looking at the string representation of native
functions and objects, one script goes further by verifying the value returned by a native
function. It verifies whether the getImageData function used to collect the value of a
canvas has been overridden by looking at the value of specific pixels.
5.2.6 Other Non-fingerprinting Attributes
Events.
Crawlers may programmatically generate fake mouse movements and fake
clicks to simulate human behavior and fool behavioral analysis detection systems. To
detect such events, two of the fingerprinting scripts check that events originate from
human actions. If an event has been generated programmatically, the browser sets its
isTrusted property to false. Nevertheless, this approach does not help in detecting
crawlers automated using Selenium or the Chrome DevTools protocol, since the events
they generate are considered trusted.
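A minimal sketch of this check is shown below; where the flag is reported is left out:

// Events synthesized from JavaScript (e.g., element.dispatchEvent) carry
// isTrusted === false, unlike real user input.
document.addEventListener('click', function (event) {
  if (!event.isTrusted) {
    console.log('Programmatically generated click detected');
  }
});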
Crawler trap.
One script creates a crawler trap using an invisible link with the "nofollow" property
and appends a unique random identifier to the URL pointed to by the link. Thus, if a
user selects the link or loads the URL, it can be identified as a crawler that does not
respect the nofollow policy.
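The trap can be sketched as follows; the URL pattern and token generation are illustrative:

// Invisible nofollow link carrying a unique token: any client that requests
// this URL reveals itself as a crawler ignoring the nofollow policy.
const trap = document.createElement('a');
const token = Math.random().toString(36).slice(2);
trap.href = '/trap/' + token;   // illustrative trap URL
trap.rel = 'nofollow';
trap.style.display = 'none';
document.body.appendChild(trap);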
Red pill.
One script collects a red pill similar to the one presented by Ho et al. [70] to test if a
browser is running in a virtual machine or an emulated device. The red pill exploits
performance differences caused by caching and virtual hardware.
WebRTC.
WebRTC is an API that enables real-time communication in the browser. One of the
scripts uses the WebRTC API to collect the IP address of the user. Indeed, it can be
used to obtain the public IP address of a user, even when she is behind a proxy or a
VPN [6].
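A minimal sketch of this collection technique follows; the STUN server is an illustrative public one:

// ICE candidates gathered for a dummy data channel expose IP addresses,
// even behind some proxies or VPNs.
const pc = new RTCPeerConnection({
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }] // illustrative server
});
pc.createDataChannel('');
pc.onicecandidate = function (event) {
  if (event.candidate) {
    console.log('ICE candidate:', event.candidate.candidate);
  }
};
pc.createOffer().then(offer => pc.setLocalDescription(offer));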
Other fingerprinting techniques.
Beyond the features reported in this section, all
the scripts collect common attributes identified in the browser fingerprinting literature,
such as WebGL, canvas and audio fingerprinting. Fingerprinters only collect a hash of
these values to save the cost of storing raw values. Besides one of the scripts that tests the
value of a given pixel in a canvas to verify if it has been altered, none of the other scripts
extract any features that could be used to assess some sort of consistency. In the next
section, we show that these attributes do not influence the detection of crawlers. Our
main hypothesis is that these attributes may be used for short-term tracking purposes in
order to identify crawlers that may change their IP address [134].
In this section, we showed that 291 sites from the Alexa Top 10K block crawlers using
the user agent. Among these, 93 websites (31.96%) use fingerprinting for crawler
detection. They use different techniques that leverage attributes added by automated
browsers or fingerprint inconsistencies to detect crawlers.
5.3 Detecting Crawler Fingerprints
In this section, we first evaluate the effectiveness of browser fingerprinting to detect
crawlers. Then, we study the resilience of browser fingerprinting against an adversary
who alters its fingerprint to escape detection.
5.3.1 Experimental Protocol
Ground truth challenge.
The main challenge in evaluating crawler detection approaches is to obtain ground truth
labels for the evaluation. The typical approach to obtain labels is to ask experts from
the field, who check raw data, such as fingerprints and HTTP logs, and use their
knowledge to label these samples. The main problem with this approach is that the
labels assigned by the experts are only as good as the current knowledge of the experts
labeling the data. Similarly to machine learning models that struggle to generalize to
new data, these experts may be good at labeling old crawlers they have already
encountered, but not at labeling new kinds of crawlers they are unaware of, which may
artificially increase or decrease the measured performance of the approach evaluated. To
address this issue, we decided to run different kinds of crawlers ourselves on websites
that have been identified as blocking crawlers. Thus, no matter how a crawler tries to
alter its fingerprint, we can always assert that it is a crawler because it is under our
control. Then, in order to measure the effectiveness of fingerprinting for crawler
detection, we rely on the fact that the crawled websites have been identified as websites
that block crawlers. We consider that, whenever they detect a crawler, they will block it.
We use this blocking information as an oracle for the evaluation. We discuss the limits
of this hypothesis in Section 5.4.3. A solution to obtain the ground truth would have
been to subscribe to the different bot detection services. Nevertheless, besides the
significant cost, bot detection companies verify the identity of their customers to ensure
their solution is not used by competitors trying to reverse engineer it or by bot creators
trying to obtain an oracle to maximize their ad-fraud income, for example.
5.3.1.1 Crawler Family
In order to evaluate the resilience of fingerprinting, we send 7 different crawlers that
incrementally modify their fingerprints to become increasingly more difficult to detect.
Table 5.4 presents the crawlers and the attributes they modify. The first six crawlers
are based on Chrome headless for the following reasons:
1. It has become a popular headless browser for crawling. Since its first release, the
once popular PhantomJS stopped being maintained [?];
2. It implements the majority of the features present in popular non-headless browsers,
therefore making its detection more challenging compared to older headless browsers;
3. Had we used older headless browsers, such as PhantomJS (not maintained since
March 2018) or SlimerJS (works only with Firefox versions < 59, released in 2017),
crawlers would have been easily detected because of the lack of modern web
features [120].
The last crawler is based on a vanilla Chrome browser. We use this crawler to better
understand why blocking occurs, and to assess that crawlers are blocked because of their
fingerprint. Indeed, since this crawler is based on a vanilla Chrome, the only difference
in its fingerprint is the navigator.webdriver attribute. Once this attribute is removed,
it can no longer be detected through fingerprinting.
We restrict the evaluation to 7 different crawlers. Ideally, a perfect protocol would
randomly mutate fingerprint attributes to provide a fine-grained understanding. However,
this was not feasible in practice, as our evaluation requires residential IP addresses, of
which we have a limited supply, and because of the exponential complexity resulting
from testing all attribute permutations on the set of evaluated websites.

Table 5.4 List of crawlers and altered attributes.
Crawler        Attributes modified
Chrome headless based
  Crawler 1    User agent
  Crawler 2    Crawler 1 + webdriver
  Crawler 3    Crawler 2 + accept-language
  Crawler 4    Crawler 3 + window.chrome
  Crawler 5    Crawler 4 + permissions
  Crawler 6    Crawler 5 + screen resolution + codecs + touch screen
Vanilla Chrome based
  Crawler 7    webdriver

While we could have used residential proxy services to acquire more residential IP
addresses, this approach still has several drawbacks:
1. Ethical issues. Mi et al. [135] showed that a majority of the devices proposed
by residential proxy services, such as Luminati, did not give their consent, which
raises ethical issues since we may pollute the reputation of their IP addresses;
2. Inconsistencies. Since these residential proxy services do not provide mechanisms
to ensure the nature of the device that will act as a proxy, there can be inconsistencies
between the browser fingerprint of our crawlers and the TCP, TLS, and HTTP
fingerprints of the proxy, making it more difficult to understand why a crawler was
blocked.
Details of the modified attributes.
Crawlers 2 to 6 each build on the previous one, adding new modifications each time to
increase the difficulty of detection. For example, crawler 4 implements the changes made
by crawlers 1, 2 and 3.
1. Crawler 1 is based on Chrome headless with a modified user agent that looks like
a vanilla Chrome user agent;
2. In the case of Crawler 2, we delete the navigator.webdriver property;
3. By default, Chrome headless does not add an accept-language header to its
requests. Thus, for Crawler 3, we add an accept-language header whose value is
set to "en-US" for all the requests sent by the crawler;
4. Crawler 4 injects a chrome property into the window object;
5. For Crawler 5, we override the management of the permissions for notifications
to hide the inconsistency exposed by Chrome headless [123]. Since we override
the behavior of native functions, we also override their toString method, as well
as Function.prototype.toString—i.e., the toString of the Function type—in
order to hide our changes (a minimal sketch of this kind of override is given after
this list);
6. For Crawler 6, we apply modifications related to the size of the screen, the
availability of touch support and the codecs supported by the browser. First,
we override the following properties of the window object: innerWidth/Height,
outerWidth/Height and window.screenX/Y. We also modify properties of the
screen object: availWidth/Height and width/height. By default, Chrome
headless simulates touch screen support, even when it is running on a device that
does not support it. To emulate a desktop computer without touch support, we
override the document.createEvent function so that it throws an exception when
trying to create a TouchEvent. We also override navigator.maxTouchPoints to
return 0 and we delete the ontouchstart property of the window object. We also
lie about the codecs supported, to return the same values as a vanilla Chrome, by
overriding the canPlayType function for both HTMLAudioElement and
HTMLVideoElement. In order to hide the changes made to native functions, we
override their toString;
7. Contrary to the first six crawlers, Crawler 7 is based on a vanilla Chrome—i.e.,
non-headless. Thus, we only remove the webdriver attribute from the navigator
object.
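As a hedged sketch of the kind of override applied by Crawlers 5 and 6 (the exact code differs), the permissions patch and its toString concealment could look like this:

// Override the permissions API so both techniques of Listing 5.1 agree,
// then hide the override from toString-based checks (illustrative sketch).
const originalQuery =
  window.navigator.permissions.query.bind(window.navigator.permissions);
window.navigator.permissions.query = function (parameters) {
  return parameters.name === 'notifications'
    ? Promise.resolve({ state: Notification.permission })
    : originalQuery(parameters);
};
const nativeToString = Function.prototype.toString;
Function.prototype.toString = function () {
  if (this === window.navigator.permissions.query) {
    return 'function query() { [native code] }';
  }
  return nativeToString.call(this);
};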
5.3.1.2 Evaluation Dataset
We present how we select websites used for the evaluation.
Cross-domain detection.
Since we want to evaluate fingerprinting for crawler detection, we try to eliminate other
detection factors that could interfere with our evaluation. One such factor is cross-domain
detection. This occurs when a company provides a crawler detection service that is
present on multiple domains being crawled. In this situation, the company can leverage
metrics collected on different domains, such as the number of requests, to classify traffic
no matter the website. In order to minimize the risk that cross-domain detection
interferes with our evaluation, we need to decrease the number of websites that belong
to the same company in the evaluation dataset. Thus, there is a tradeoff between the
number of websites in the evaluation dataset and the capacity to eliminate other factors,
such as cross-domain detection. While, to our knowledge, no research has been published
on cross-domain detection, we encountered this phenomenon during the different crawls
we conducted. Moreover, during informal discussions with a crawler detection company,
engineers also mentioned this practice.
Selection of websites.
We group websites identified as blocking crawlers and using
fingerprinting (as defined in Section 5.1) based on the combination of fingerprinting
attributes they access. We obtain 20 groups of fingerprinting scripts and, for each of
the groups, we randomly select one website. We argue this selection is a good tradeoff.
Even though it does not totally eliminate cross-domain detection since, as shown in
Section 5.2, fingerprinters can have different scripts, it still enables us to evaluate all the
different fingerprinting scripts present in the dataset. Then, we randomly select 20
websites that block crawlers without using fingerprinting to compare fingerprinting-based
detection against other approaches.
Crawling protocol.
For each of the 7 crawlers, we run 5 crawls on the previously selected websites. Each
crawl is run from a machine with a residential IP address that has not been used for
crawling for at least 2 days, in order to limit the risk of being blocked because of a bad
IP address reputation. Studying how long crawler detection systems block crawlers, and
how/if the previous history of the IP address has an impact on the duration of the
blocking, is out of the scope of this study and is left as future work.
A crawl consists of the following steps:
a) We randomly shuffle the order of the websites in the evaluation dataset. This
minimizes, and lets us measure, the side effects that can occur because of cross-domain
detection;
b) For each website, the crawler visits the home page and then visits up to 10 randomly-
selected pages from the same domain. As explained later in this section, we crawl only
10 links to ensure we evaluate the effectiveness of browser fingerprinting detection
and not the effectiveness of other state-of-the-art detection approaches;
c) Once a page is loaded, the crawler takes a screenshot and stores the HTML of the
page for further analysis;
d) Between two consecutive crawled pages, the crawler waits for 15 seconds plus a random
time between 1 and 5 seconds.
5.3.1.3 Crawler behaviors
In the previous subsection, we explained how we select websites in a way that minimizes
cross-domain detection. Here, we present how we adapt the behavior of the 7 crawlers
so that other approaches, such as rate limiting techniques or behavioral analysis, do not
interfere with our evaluation. Thus, the crawlers should not be detected by the
state-of-the-art techniques presented in Section 2.4.2 that rely on the following features:
1. Number of HTTP requests,
2. Number of bytes requested from the server,
3. Number and percentage of HTML requests,
4. Percentage of PDF requests,
5. Percentage of image requests,
6. Duration of a session,
7. Percentage of 4xx error requests,
8. Robots.txt file request,
9. Page popularity index,
10. Hidden links.
To address points (1) to (6), the crawlers request few pages so that it looks like the
requests originate from a human. Moreover, we do not block any resources, such as
images or PDFs, nor do we ask for these resources in particular. The crawlers visit only
up to 10 pages for a given website. Since the attributes used in fingerprinting are
constant over short time periods, such as a crawling session, a fingerprinter does not
need multiple pages to detect if a fingerprint belongs to a crawler, which means that this
criterion should not affect our evaluation. Moreover, the navigation delay between 2
pages is 15 seconds plus a random delay between 1 and 5 seconds. We chose a mean
value of 15 seconds since it has been observed that a majority of users do not stay more
than 15 seconds on a page on average [136]. We add some randomness so that, if a
website measures the time between two requested pages, it does not look deterministic.
Points (7) and (9) are addressed by only following internal links exposed from the home
page or pages directly linked by the home page, which are more likely to point to both
popular and existing pages. To address point (8), the crawlers never request the
robots.txt file, which means that we do not take into account the policy of the website
concerning crawlers. Nevertheless, since we crawl only a few pages, it should have little
impact. We discuss this point in more detail in Section 5.4.3 about ethical considerations.
5.3.2 Experimental Results
5.3.2.1 Presentation of the dataset
In total, we crawl 40 different websites, randomly selected from the list of websites
blocking crawlers, between December 2018 and January 2019. 22 of them use browser
fingerprinting and 18 do not. Initially, we selected two equal sets of 20 websites using
and not using fingerprinting. Nevertheless, we noticed that 2 of the websites had been
misclassified. We did not detect fingerprinting on these websites, but we observed the
side effects of cross-domain fingerprinters. Since the crawler used for fingerprinting
detection had been detected on some other websites, its IP address was blacklisted.
Thus, when the crawler visited other websites with the fingerprinter that blocked it
earlier, it was blocked at the first request because of its IP address, without having the
possibility to load and execute the JavaScript present on the page. In total, we run
35 crawls—i.e., 5 per crawler—each with a residential IP address that has not been used
for crawling for at least two days.
5.3.2.2 Blocking results
Figure 5.2 reports on the results of the crawls for the 7 crawlers. The results have been
obtained by labeling the data using the same web interface we used in Section 5.1. For
each crawler, we present the average number of times per crawl it is blocked by websites
that use fingerprinting and by websites that do not use fingerprinting.
Influence of fingerprint modifications.
We see that the more changes are applied to the crawler's fingerprint, the less it gets
blocked. While Crawler 1 gets blocked 11.8 times on average, the detection falls to 1.0
time for Crawler 6, which applies more extensive modifications to its fingerprint. We
also observe an important decrease in the number of times crawlers are blocked between
crawlers 1 and 2: it goes from 11.8 for Crawler 1 to 3.6 for Crawler 2. The only
difference is the removal of the webdriver attribute from the navigator object, which
means that fingerprinters heavily rely on this attribute to detect crawlers.
Blocking speed.
We also analyze the speed at which crawlers get blocked—i.e., after how many pages
crawled on a given website a crawler is blocked. Except for Crawler 5, which gets
blocked after 3.1 pages crawled on average, crawlers are blocked after they have crawled
fewer than 3 pages of a website on average.
Figure 5.2 For each kind of crawler, we report on the average number of times per crawl
it is blocked by websites that use and that do not use fingerprinting.
Fingerprinters detect more crawlers.
We also observe that, on average, websites using fingerprinting block more crawlers than
websites without fingerprinting. For example, on average, 93.2% (11.0) of the websites
blocking Crawler 1 use fingerprinting. The only exception is Crawler 7: 75% of the time
it gets blocked, it is by a website not using fingerprinting. This is the expected result,
since Crawler 7 is based on a vanilla Chrome, which means that its fingerprint is not
different from that of a standard browser.
Analysis of other detection factors.
The fact that Crawler 7 still gets blocked despite having a normal fingerprint raises the question of other detection factors used in addition to fingerprinting. Even though we take care to adapt the behavior of the crawlers to minimize the chance that they get detected by other techniques, we cannot exclude that it occurs. Thus, we verify whether crawlers are detected because of their fingerprint or because of other state-of-the-art detection techniques.
First, we investigate if some of the crawlers have been blocked because of cross-domain detection. To do so, we manually label, for each fingerprinting script in the evaluation dataset, the company it belongs to. Whenever we cannot identify the company, we assign a random identifier. We identify 4 fingerprinters present on more than 2 websites in the evaluation dataset that could use their presence on multiple domains to perform cross-domain detection. We focus only on websites that blocked Crawlers 4, 5 and 6. Indeed, only one fingerprinting company succeeds in detecting Crawlers 4, 5 and 6. Thus, we argue that Crawlers 1, 2 and 3, when detected by websites using fingerprinting, are indeed detected because of their fingerprint. If their detection had relied on other techniques, then some of Crawlers 4, 5, 6 and 7 would also have been blocked by these websites. Moreover, the analysis of the fingerprinting scripts we conduct in Section 5.2 shows that some of these fingerprinters have the information needed to detect Crawlers 1, 2 and 3, but not to detect the more advanced crawlers.
We analyze in more detail the only fingerprinter that detected Crawlers 4, 5 and 6. At each crawl, the order of the websites is randomized. Thus, for each crawler and each crawl, we extract the rank of each of the websites that include a fingerprinting script from this company. Then, we test if the order in which the websites from this fingerprinter are crawled impacts the chance of a crawler being detected. We observe that crawlers get blocked on websites independently of their rank. For example, Crawler 4 is blocked on the first website crawled where the fingerprinter is present, and Crawler 6 on the second one.
Non-stable blocking behavior.
We also notice that the websites that use the fingerprinting scripts provided by the only fingerprinter that blocked Crawlers 4, 5 and 6 do not all behave the same way. Indeed, depending on the website, some of the advanced crawlers have never been blocked. This can occur for several reasons: 1. the websites have different versions of the scripts, which collect different attributes; 2. on its website, the fingerprinter offers different service plans: while some of them are oriented towards blocking crawlers, others only aim at detecting crawlers to improve the quality of the analytics data.
Even on the same website, the blocking behavior is not always stable over time. Indeed, some of the websites do not always block a given crawler. Moreover, some of the websites able to block advanced crawlers do not block crawlers that are easier to detect. For example, the only website that is able to block both Crawlers 5 and 6 blocked only 13 times over the 35 crawls made by all the crawlers. It means that this website blocked crawlers only 37.1% of the time, even though it could have done so more often. In particular, this website never blocked Crawlers 1 and 2, even though they are easier to detect than Crawlers 5 and 6.
Undetected crawlers.
We also observe that some websites could have detected Crawlers 3 and 4 using the information they collected. Indeed, these websites verify the consistency of the notification permission, which, as we show in Section 5.2, makes it possible to detect crawlers based on Chrome headless. A possible explanation for why the fingerprinter present on these websites blocked Crawlers 1 and 2, but not Crawlers 3 and 4, is that the first two crawlers can be detected solely using information contained in the HTTP headers (lack of an Accept-Language header), whereas Crawlers 3 and 4 require information collected in the browser, which may be handled differently by the fingerprinter.
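For reference, the notification permission inconsistency of Chrome headless can be tested in a few lines. The following snippet is a minimal sketch based on the technique described in [123]; it is an illustration, not the code of the fingerprinters we observed:

// Sketch: in Chrome headless, Notification.permission reports 'denied'
// while the Permissions API reports 'prompt' for the same origin,
// an inconsistency that betrays the headless environment [123].
async function hasHeadlessPermissionInconsistency() {
  const status = await navigator.permissions.query({ name: 'notifications' });
  return Notification.permission === 'denied' && status.state === 'prompt';
}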
In this section, we showed that fingerprinting helps to detect more crawlers than non-fingerprinting techniques. For example, 93.2% (11 websites) of the websites that detected Crawler 1 use fingerprinting. Nevertheless, the important decrease in the average number of times crawlers are blocked between Crawlers 1 and 2, from 11.8 websites to 3.6, indicates that websites rely on simple features, such as the presence of the webdriver attribute, to block crawlers. Finally, we show that only 2.5% of the websites detect Crawler 6, which applied heavier modifications to its fingerprint to escape detection. This illustrates one of the main flaws of fingerprinting for crawler detection: its lack of resilience against adversarial crawlers.
5.4 Discussion
5.4.1 Limits of Browser Fingerprinting
The analysis of the major fingerprinting scripts shows that browser fingerprinting is heavily used to detect older generations of headless browsers and automation frameworks, such as PhantomJS. These browsers and frameworks used to be easily identifiable because of the attributes they injected into the window or document objects. In addition to these attributes, older headless browsers lacked basic features that were present by default in mainstream browsers, making them easily detectable using feature detection. Since 2017, Chrome headless has provided a more realistic headless browser that implements most of the features available in a vanilla Chrome. Even though we show that fingerprinters use differences between vanilla Chrome and headless Chrome for detection, it is much harder to find such differences compared to older headless browsers. One of the implications is that, since there are fewer differences, it is easier for an adversarial crawler developer to escape detection by altering the fingerprint of her crawlers. Indeed, these changes require few lines of code (less than 300 lines in the case of Crawler 6) and can be made directly in JavaScript, without the need to modify and compile a whole Chromium browser.
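As an illustration of how little code such a change requires, the following sketch overrides a single attribute from JavaScript, in the style of the overrides listed in Appendix B; the attribute and the reported value are illustrative:

// Sketch: override one fingerprint attribute without recompiling Chromium.
// Run before any page script via Puppeteer's evaluateOnNewDocument.
await page.evaluateOnNewDocument(() => {
  Object.defineProperty(navigator, 'hardwareConcurrency', {
    get: () => 8, // report a plausible number of cores
  });
});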
Unsurprisingly, we also show that fingerprinting-based approaches are totally ineffective against non-headless automated browsers (like Crawler 7), since the fingerprint of such crawlers is the same as the one of a vanilla browser (modulo the navigator.webdriver attribute, which is easy to remove).
5.4.2 Threats to Validity
While the goal of our study is to evaluate the effectiveness of browser fingerprinting for crawler detection, a first threat lies in the possibility that we may have missed external techniques, other than browser fingerprinting and the techniques presented in Section 2.4.2, that could have contributed to the detection of the crawlers. A second threat lies in the choice of our oracle—i.e., being blocked by a website when a crawler is detected. While we ensured that all the websites used in the evaluation block crawlers upon detection, the blocking we observed may have been caused by simple user-agent blacklisting. Thus, we make the hypothesis that, if fingerprinting was also used for crawler detection, then the website would be consistent in its strategy against the crawlers. However, it is possible that a website adopts fingerprinting not against crawlers, but against credit card fraudsters, or to label crawlers in its analytics reports, and thus does not focus on blocking crawlers. Finally, a possible threat lies in our experimental framework. We did extensive testing of our code, and we manually verified the data from our experiments. However, as for any experimental infrastructure, there may be bugs. We hope that they only affect marginal quantitative results and not the quality of our findings.
5.4.3 Ethical Considerations
Concerning the crawled websites, we asked our IRB for permission to conduct our study, and we used the collected data only for research purposes—i.e., verifying if we were blocked. Even though we may not have respected the policy of the robots.txt file, since we did not read it to avoid triggering crawler detection techniques that could have interfered with our research, we visited only a few pages of each website, all ranked in the Alexa Top 10K. Thus, we consider our crawls had a negligible impact on their resources. Moreover, we decided not to disclose the names of the websites, nor the names of the fingerprinters, since we think the benefits to the reader do not outweigh the risk that our findings could be used specifically against these domains to bypass their crawler detection.
5.5 Conclusion
Crawler detection has become widespread among popular websites to protect their data. While existing approaches, such as CAPTCHAs or traffic shape analysis, have been shown to be effective, they either require the user to solve a difficult problem, or they require enough data to accurately classify the traffic.
In this chapter, we show that, beyond its adoption for tracking, browser fingerprinting is also used as a crawler detection mechanism. Browser fingerprinting exploits the lack of basic features in some headless browsers, or the inconsistencies introduced by crawler developers, to detect if a fingerprint belongs to a crawler. We analyze the scripts of the main fingerprinters present in the Alexa Top 10K and show that they exploit the lack of browser features, errors, or overridden native functions to detect crawlers. Then, using 7 crawlers that apply different modifications to their fingerprint, we show that websites with fingerprinting are better and faster at detecting crawlers compared to websites that use other state-of-the-art detection techniques. Nevertheless, while 29.5% of the evaluated websites are able to detect our most naive crawler, which applies only one change to its fingerprint, this rate decreases to 2.5% for the most advanced crawler, which applies more extensive modifications to its fingerprint. We also show that fingerprinting does not help detect crawlers based on standard browsers, since they do not expose inconsistent fingerprints.
These findings demonstrate the strengths and the limits of fingerprinting for crawler detection. While fingerprinting can help to quickly detect crawlers based on headless or non-standard browsers, it remains unable to detect standard automated browsers. Moreover, we also showed that it requires little effort to develop crawlers with modified fingerprints capable of escaping detection. Thus, we argue that fingerprinting provides clear benefits for crawler detection, but it should be used in a layered approach, in addition to other crawler detection techniques, such as crawler traps or rate limiting.
Part IV
Final Remarks
Chapter 6
Conclusion
Browsers have become increasingly complex. Applications that were once possible only as desktop clients can now be accessed from a browser. Moreover, the diversity of devices capable of browsing the web has also steadily increased. While browsers used to run only on computers, they now run on a wide range of devices, from smartphones to connected TVs. This diversity of devices, combined with the possibility for browsers to obtain information about the device through JavaScript APIs, made it possible to identify devices based on their characteristics.
Building a browser fingerprint consists in gathering a set of attributes that can be accessed by any website a user visits. Thus, fixing browser fingerprinting tracking is challenging: contrary to cookies, which can simply be blocked or deleted, fingerprints rely on APIs that are also used legitimately by websites. Browser fingerprinting is also a constantly evolving field. As new APIs are added to the browser, there is a risk they get used by commercial fingerprinters to create more robust, unique and stable fingerprints. On the other hand, when browser vendors take action, such as decreasing the precision of an API or blocking its access, this can render certain forms of fingerprinting unusable.
Because of its stateless nature, detecting fingerprinting is also challenging. While detecting stateful tracking mechanisms that rely on cookies or other storage APIs is more straightforward, since the tracking identifier is stored in the browser, detecting if a website fingerprints users is more complicated:
First, there is no formal definition of browser fingerprinting. Does simple analytics become fingerprinting after more than 5, 6, 7 or more attributes have been accessed? While some techniques, such as font enumeration or canvas fingerprinting, are clear markers of fingerprinting, there is a blurry line where it is unclear whether a script collects data such as the user agent, the screen resolution and the number of cores of the device for analytical purposes, or for fingerprinting;
Second, while monitoring the use of known fingerprinting techniques at scale has become easier with modern crawling tools, detecting new and unknown techniques is more challenging, as any API or any data obtained from a sensor can be used to create a fingerprint.
Fortunately, browser vendors have started to take privacy issues more into account when creating new features. Saying that this is only due to research on web privacy would be an overstatement. It is also likely the result of other factors, such as the increasing number of data breaches1 and important privacy scandals, like Facebook and Cambridge Analytica. These scandals have made privacy a strong commercial argument for browser vendors and tech companies in general.2 Nevertheless, studies on privacy and browser fingerprinting also played a role in the way browsers have evolved. For example, two days after Acar et al. [5] published that a popular social widget was using canvas fingerprinting for tracking, the company behind it decided to stop using canvas fingerprinting.3 When browser vendors design new APIs, their privacy impact, in particular the way they could be used for fingerprinting, is also taken more into account. Even though browser vendors pay more attention to the privacy impact of new APIs, this does not mean new fingerprintable APIs are no longer added to browsers. For example, in 2017, Chrome and other Chromium-based browsers added a new property, deviceMemory, to the navigator object that provides the memory of the device. Nevertheless, evaluating how a new API could be used for fingerprinting is challenging. Indeed, even APIs that do not explicitly return information about the device can be misused to extract information through side effects, as is the case with the canvas, audio and WebGL APIs. Even increasing the accuracy of timing measurements or adding thread support can be used for fingerprinting:
1. Timers can be used to conduct timing attacks that extract information about the device;
2. Threads can be used to run costly operations, such as 3D scene rendering, without blocking the main thread and, therefore, to extract more unique fingerprints.
1https://breachlevelindex.com/
2https://www.theverge.com/2019/3/14/18266276/apple-iphone-ad-privacy-facetime-bug
3https://www.addthis.com/blog/2014/07/23/the-facts-about-our-use-of-a-canvas-element-in-our-recent-rd-test/#.XLcfbpPRZQI
The state of browser fingerprinting has evolved these past years, from techniques heavily
relying on plugins such as Flash and Java to techniques that leverage HTML5 APIs and
exploit their side effects, such as canvas and audio fingerprinting. In this thesis, I tried
to provide an up-to-date picture of browser fingerprinting, both in terms of its impact on
privacy, as well as its application to detect crawlers.
6.1 Contributions
6.1.1 FP-Stalker: Tracking Browser Fingerprint Evolutions
I collected a dataset of more than 120K browser fingerprints from 2,346 distinct browsers over two years using the AmIUnique browser extensions. First, I analyzed the stability of browser fingerprints and of the different attributes constituting them. My results confirm Eckersley's findings: the majority of fingerprints change frequently. More than half of the browsers in our dataset displayed at least one change in their fingerprint after five days. Nevertheless, I also showed that not all browser fingerprints change at the same pace. Through a two-year study, I also measured the stability of attributes that did not exist at the time Eckersley's study was conducted. In particular, I showed that, besides being highly unique, canvas fingerprints are also highly stable. For half of the browsers in our dataset, the canvas remained stable for more than 300 days.
I also evaluated how long browsers could be tracked using only their fingerprint. I proposed two linking algorithms and showed that, while a significant fraction of browsers are immune to fingerprinting, either because their fingerprint is not unique or because it is too similar to the fingerprints of other browsers, more than 32% of the browsers can be tracked for more than 100 days.
6.1.2 FP-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies
Several fingerprinting countermeasures claim to generate consistent browser fingerprints—i.e., fingerprints that could be found in the wild. The reason countermeasures aim at generating consistent fingerprints is to avoid being detected, which would make their users more unique and trackable. In Chapter 4, I proposed FP-Scanner, a test suite that aims at detecting the presence of fingerprinting countermeasures through the inconsistencies they introduce. I used it to evaluate 7 countermeasures, ranging from browser extensions that lie about the user device, to more complex peer-reviewed forked browsers that modify the value of canvas and audio fingerprints. FP-Scanner was able to spot inconsistencies for all the countermeasures, even those claiming to generate consistent fingerprints, therefore making their presence detectable. Moreover, since different APIs or techniques can be used to obtain certain attributes, I showed that we can correlate them to infer the real OS and browser, even when they have been modified by a countermeasure. Thus, my findings show the difficulty of designing effective countermeasures that do not end up being counterproductive. I argue that, while it may not be possible to design undetectable countermeasures, countermeasures' defense strategies should avoid leaking unnecessary information. In particular, the defense strategy should not be tunable since, otherwise, the tuning itself could be used as a fingerprinting feature. Moreover, I argue that, for countermeasures to be effective, they should be used by enough users, so that knowing that a user has a countermeasure installed is not discriminating in itself. Thus, a possible way to make a countermeasure available to many users is to integrate it natively into the browser. Since my paper on fingerprinting countermeasures was published in 2018, several browser vendors, such as Brave, Firefox and Safari, have either added fingerprinting countermeasures natively to their browser or made them more easily available to their users.
6.1.3 FP-Crawlers: Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers
The use of browser fingerprinting in a security context has often been overlooked. I focus on crawler detection and show that, among the 3% of websites of the Alexa Top 10K that block crawlers, around 30% of them use browser fingerprinting. I analyze the detection techniques of the most popular scripts encountered during my crawls and show that, while some of the techniques, such as canvas or font enumeration, are similar to the ones used for tracking, these scripts also rely on techniques specifically developed for crawler detection. In particular, crawler detection scripts look for traces left by instrumentation frameworks, such as Selenium, as well as for inconsistent fingerprints that could reflect the use of a headless browser. To evaluate the effectiveness of fingerprinting for crawler detection, I developed multiple crawlers, each gradually more difficult to detect, and show that browser fingerprinting can quickly detect crawlers.
An important challenge of using browser fingerprinting for security lies in the fact that fingerprints are collected on the client side and can, therefore, be modified by an attacker. Thus, I evaluate the resilience of fingerprinting against an adversarial attacker that tries to hide its presence by lying about its fingerprint. My results show that, while fingerprinting can detect crawlers with few modifications applied to their fingerprints, it cannot detect crawlers that apply more changes to their fingerprint, nor crawlers based on non-headless browsers, hence the need to use fingerprinting in addition to other crawler detection approaches.
6.2 Future work
6.2.1 Automating Crawler Detection Rules Learning
Detecting crawlers and bots can benefit several applications, ranging from website content protection to ad-fraud detection. In particular, improving fingerprinting-based detection could help to protect against certain forms of attacks that cannot be fully addressed using only other existing detection techniques:
Traffic-shape approaches.
In an ad-fraud context, one cannot wait for bots to view many ads from the same IP address before they get blacklisted. Indeed, bots may use residential proxies to frequently change their IP addresses [135];
CAPTCHAs.
In an ad-fraud context, it is also unrealistic to require users to solve CAPTCHAs before they see ads.
While fingerprinting can help to detect bots, there is a continuous arms race between bot developers, who try to make their bots look more human, and the fingerprinters that try to detect them. I argue that automating the learning of detection rules could help fingerprinters win the arms race. Indeed, whenever new features are added to a browser, there may be a short time window during which the behavior differs between the normal browser and its headless counterpart. During this time window, bot developers may not be aware of these new detection rules and are less likely to have modified the fingerprints of their bots to lie consistently. Thus, I propose to extend the approach proposed by Schwarz et al. [72] to automatically learn rules capable of distinguishing non-headless browsers from headless browsers. In particular, I plan to address one of the main challenges they identified: exploring properties hidden behind function calls. To call JavaScript functions with correct parameters, I propose the following two strategies:
1. Use fuzzing techniques to generate valid parameters [137];
2. Crawl specialized programming websites, such as GitHub and StackOverflow, to gather code snippets that use specific APIs, or crawl any websites to monitor the execution of several functions and obtain the values of their arguments at runtime (a minimal sketch of this monitoring step follows).
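The monitoring part of the second strategy can be sketched in a few lines; the snippet below is an illustrative example (the choice of fillText as the wrapped function is an assumption), not a complete implementation:

// Sketch: wrap a native function to record the arguments that real
// pages pass to it, so these values can later be replayed when probing
// the same API in headless and non-headless browsers.
const observedArgs = [];
const originalFillText = CanvasRenderingContext2D.prototype.fillText;
CanvasRenderingContext2D.prototype.fillText = function (...args) {
  observedArgs.push(args); // e.g., the text, x and y used by the page
  return originalFillText.apply(this, args);
};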
6.2.2 Investigate New Fingerprinting Attributes
Due to the difficulty of collecting unbiased fingerprints over a long period, it is challenging to accurately measure the entropy and the stability of browser fingerprints. The latest large-scale study was conducted in 2016 [28] on more than 2 million browser fingerprints collected on a popular French website, and focuses only on fingerprint uniqueness. As I explained in the state of the art, the study did not examine several fingerprinting techniques, such as audio fingerprinting and navigator.enumerateDevices, that were available at the time the study was conducted. More recently, new APIs that can be used for fingerprinting, such as navigator.deviceMemory, have also been added to browsers. Moreover, approaches that did not apply to real-world traffic because of performance issues may have become applicable. For example, Cao et al. [1] proposed to generate complex 3D scenes to create stable fingerprints, even across different browsers of the same device. While, a few years ago, the rendering of the 3D scenes blocked the main JavaScript execution thread for more than 10 seconds, it could now leverage the OffscreenCanvas API4 to generate the scenes in web workers.5 Therefore, evaluating the entropy and the stability of these different fingerprinting techniques on real-world traffic would provide valuable information on the current state of browser fingerprinting.
4OffscreenCanvas API: https://developer.mozilla.org/en-US/docs/Web/API/OffscreenCanvas
5Web Workers API: https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API
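The following sketch makes this concrete: it renders a canvas-based fingerprint in a web worker through the OffscreenCanvas API, so the main thread is not blocked. It is a simplified illustration, not the 3D scenes of Cao et al. [1]; the drawn text and canvas size are arbitrary:

// Sketch: render a canvas-based fingerprint off the main thread.
const workerCode = `
  onmessage = async () => {
    const canvas = new OffscreenCanvas(200, 50);
    const ctx = canvas.getContext('2d');
    ctx.font = '16px Arial';
    ctx.fillText('fingerprint', 2, 20);
    const blob = await canvas.convertToBlob();
    postMessage(await blob.arrayBuffer()); // this buffer can then be hashed
  };
`;
const worker = new Worker(URL.createObjectURL(
  new Blob([workerCode], { type: 'text/javascript' })));
worker.onmessage = e => console.log('canvas bytes:', e.data.byteLength);
worker.postMessage('start');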
6.2.3 Studying Fingerprinting for Authentication
With an ever-increasing number of data breaches,6 there is a need for additional mechanisms to protect against credential stuffing.7 Different studies have proposed to use browser fingerprinting as a semi-transparent second authentication factor. While the guarantees it provides are not as strong as the ones provided by traditional second factors, such as U2F,8 it can still be used to enhance security in a relatively transparent manner for the user. Moreover, fingerprinting is also convenient for companies, since it requires neither installing authentication applications on the employees' mobile phones, nor buying external devices, such as YubiKeys, which can be costly at the scale of a company. Nevertheless, studies on browser fingerprinting for authentication are limited. While they propose several approaches, ranging from the use of the accelerometer sensor to the generation of dynamic canvas, none of them has been evaluated against real-world authentication traffic. Besides providing a more accurate estimation of the security gain provided by browser fingerprinting, evaluating it in a more realistic context would also enable to better understand its usability for end users, which, as shown by several studies [138], often impacts security.
6https://breachlevelindex.com/
7https://www.owasp.org/index.php/Credential_stuffing
8https://www.yubico.com/solutions/fido-u2f/
6.2.4 Developing Web Red Pills
Crawler detection approaches can act at different levels:
1. Behavioral.
Approaches such as traffic shape analysis and CAPTCHAs analyze a user's behavior to determine if it is a human or a bot;
2. Execution environment.
Detection using device and browser fingerprinting targets the browser or the system the bot is running on to distinguish between humans and bots.
I propose to investigate red pills, sequences of instructions that aim at detecting whether a browser is running in a virtualized environment, for bot detection. Indeed, since crawling at scale requires a large infrastructure, a significant fraction of crawlers likely runs on virtual machines from public cloud providers. Ho et al. [70] proposed several red pills capable of detecting, from within the browser, whether the host system is a virtual machine. Nevertheless, the paper was published in 2014 and the red pills have not been evaluated on popular public cloud providers. Moreover, the underlying implementations of some of the APIs used in the red pills may have evolved, which can impact their accuracy. Thus, I argue there is a need for evaluating these red pills on the main public cloud providers and for developing new red-pill techniques.
6.3 Future of Browser Fingerprinting
Recent privacy scandals have made privacy a strong commercial argument. Firefox and Safari, two popular browsers, now natively integrate fingerprinting protections. New privacy-friendly browsers, such as Brave and Cliqz, have also emerged and propose native fingerprinting countermeasures. As I explained in Chapter 4, having enough users is one of the conditions required to have more effective fingerprinting countermeasures that do not make their users more vulnerable to tracking. Thus, this shift from fingerprinting countermeasures running as browser extensions to countermeasures natively integrated into the browser is a positive change from a privacy point of view. Moreover, new mechanisms, such as the Feature-Policy header,9 also help websites control the APIs that can be accessed in different frames. However, these features do not help to protect against websites that willingly allow third parties to track their users. Browser vendors also tend to take fingerprinting more into account when designing new APIs. Thus, before releasing new APIs, browser vendors thoroughly evaluate how they could be used for fingerprinting. As shown by Gómez-Boix et al. [28], these changes have helped to decrease the entropy of fingerprinting techniques used in the past.
Nevertheless, fingerprintable APIs are still added to browsers: e.g., the navigator.deviceMemory property, which provides information about the device memory, was added to Chrome in December 2017.10 Moreover, even benign APIs, such as canvas and WebGL, can be misused to create highly unique and stable fingerprints. I believe this is likely to occur again as browser vendors keep adding new features to their browsers.
9https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Feature-Policy
10deviceMemory: https://developer.mozilla.org/en-US/docs/Web/API/Navigator/deviceMemory
From static to dynamic fingerprints.
For a long time, fingerprinters have built fingerprints constituted of static attributes, such as the user agent, the list of plugins, or a canvas whose value is the same between consecutive executions. With recent progress in machine learning, I argue there is a risk that fingerprinters shift towards more dynamic and less stable attributes, such as the approach proposed by Sanchez-Rola et al. [41], which measures the performance of different cryptographic operations. Due to their instability, these attributes cannot simply be hashed and transformed into a fingerprint. Nevertheless, one can extract features that are later integrated into machine learning models used to link fingerprints of the same browser. This shift also raises challenges concerning the measurement of fingerprint uniqueness. Indeed, while entropy and anonymity sets are meaningful metrics for stable attributes, this is not necessarily the case for attributes whose value constantly changes.
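To illustrate the difference, the sketch below derives summary statistics from repeated timings, in the spirit of (but much simpler than) the approach of Sanchez-Rola et al. [41]; the timed operation and the statistics kept are assumptions:

// Sketch: a dynamic attribute is a distribution, not a stable value.
// Features such as the minimum and median of the samples can feed a
// learning-based linker, whereas hashing the raw samples is meaningless.
function timingFeatures(iterations = 1000) {
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    crypto.getRandomValues(new Uint8Array(1024)); // operation to time
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return { min: samples[0], median: samples[iterations >> 1] };
}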
Fingerprinting for security.
Current headless browsers used for crawling will probably become more and more similar to their non-headless counterparts. For example, until September 2018, there was no mechanism to manage permissions in Chrome headless using the Chrome DevTools protocol. This was due to the novelty of Chrome headless rather than to a deliberate choice of the developers [139]. Thus, these kinds of inconsistencies will likely be fixed in the future. Nevertheless, I argue that, even if headless browsers become more realistic, there is still room for fingerprinting to be used in addition to other crawler detection techniques. Indeed, since browser vendors keep adding new features, for a certain period of time these new features may not be implemented in the headless browser, or they may be implemented in a different way, which could be used for detection.
References
[1] Yinzhi Cao, Song Li, Erik Wijmans, et al. (Cross-)browser fingerprinting via OS and hardware level features. In NDSS, 2017.
[2] Jonathan R Mayer. “Any person... a pamphleteer”: Internet anonymity in the age of Web 2.0. Undergraduate Senior Thesis, Princeton University, page 85, 2009.
[3] Peter Eckersley. How unique is your web browser? In International Symposium on Privacy Enhancing Technologies, pages 1–18. Springer, 2010.
[4] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christopher Kruegel, Frank Piessens, and Giovanni Vigna. Cookieless monster: Exploring the ecosystem of web-based device fingerprinting. In Security and Privacy (SP), 2013 IEEE Symposium on, pages 541–555. IEEE, 2013.
[5] Gunes Acar, Christian Eubank, Steven Englehardt, Marc Juarez, Arvind Narayanan, and Claudia Diaz. The web never forgets: Persistent tracking mechanisms in the wild. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 674–689. ACM, 2014.
[6] Steven Englehardt and Arvind Narayanan. Online tracking: A 1-million-site measurement and analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1388–1401. ACM, 2016.
[7] Nick Nikiforakis, Wouter Joosen, and Benjamin Livshits. PriVaricator: Deceiving fingerprinters with little white lies. In Proceedings of the 24th International Conference on World Wide Web, pages 820–830. International World Wide Web Conferences Steering Committee, 2015.
[8] Peter Baumann, Stefan Katzenbeisser, Martin Stopczynski, and Erik Tews. Disguised Chromium browser: Robust browser, Flash and canvas fingerprinting protection. In Proceedings of the 2016 ACM on Workshop on Privacy in the Electronic Society, pages 37–46. ACM, 2016.
[9] Pierre Laperdrix, Benoit Baudry, and Vikas Mishra. FPRandom: Randomizing core browser objects to break advanced device fingerprinting techniques. In International Symposium on Engineering Secure Software and Systems, pages 97–114. Springer, 2017.
[10] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: Dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 1129–1140. ACM, 2013.
[11] Thomas Unger, Martin Mulazzani, Dominik Fruhwirt, Markus Huber, Sebastian Schrittwieser, and Edgar Weippl. SHPF: Enhancing HTTP(S) session security with browser fingerprinting. In Availability, Reliability and Security (ARES), 2013 Eighth International Conference on, pages 255–261. IEEE, 2013.
[12] Furkan Alaca and Paul C van Oorschot. Device fingerprinting for augmenting web authentication: Classification and analysis of methods. In Proceedings of the 32nd Annual Conference on Computer Security Applications, pages 289–301. ACM, 2016.
[13] Tom Van Goethem, Wout Scheepers, Davy Preuveneers, and Wouter Joosen. Accelerometer-based device fingerprinting for multi-factor mobile authentication. In International Symposium on Engineering Secure Software and Systems, pages 106–121. Springer, 2016.
[14] Davy Preuveneers and Wouter Joosen. SmartAuth: Dynamic context fingerprinting for continuous user authentication. In Proceedings of the 30th Annual ACM Symposium on Applied Computing, pages 2185–2191. ACM, 2015.
[15] Elie Bursztein, Artem Malyshev, Tadek Pietraszek, and Kurt Thomas. Picasso: Lightweight device class fingerprinting for web clients. In Proceedings of the 6th Workshop on Security and Privacy in Smartphones and Mobile Devices, pages 93–102. ACM, 2016.
[16] Antoine Vastel, Pierre Laperdrix, Walter Rudametkin, and Romain Rouvoy. FP-Stalker: Tracking browser fingerprint evolutions. In IEEE S&P 2018 - 39th IEEE Symposium on Security and Privacy, pages 1–14. IEEE, 2018.
[17] Antoine Vastel, Pierre Laperdrix, Walter Rudametkin, and Romain Rouvoy. FP-Scanner: The privacy implications of browser fingerprint inconsistencies. In Proceedings of the 27th USENIX Security Symposium, 2018.
[18] Antoine Vastel, Walter Rudametkin, and Romain Rouvoy. FP-Tester: Automated testing of browser fingerprint resilience. In 2018 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pages 103–107. IEEE, 2018.
[19] Antoine Vastel, Peter Snyder, and Benjamin Livshits. Who filters the filters: Understanding the growth, usefulness and efficiency of crowdsourced ad blocking. arXiv preprint arXiv:1810.09160, 2018.
[20] Antoine Vastel. Repository of the code used in FP-Stalker, 2017. URL https://github.com/Spirals-Team/FPStalker.
[21] Antoine Vastel. Repository of the code used in FP-Scanner, 2018. URL https://github.com/Spirals-Team/FP-Scanner.
[22] Antoine Vastel. Repository of the code used in FP-Crawler, 2018. URL https://github.com/antoinevastel/exp_fp_bot.
[23] Antoine Vastel. Open source implementation of a Picasso-like canvas fingerprinting algorithm, 2019. URL https://github.com/antoinevastel/picasso-like-canvas-fingerprinting.
[24] Antoine Vastel. FP-Collect: Fingerprinting script of the FP-Scanner library, 2018. URL https://github.com/antoinevastel/fp-collect.
[25] Antoine Vastel. FP-Scanner: A browser fingerprinting-based bot detection library, 2018. URL https://github.com/antoinevastel/fpscanner.
[26] Jean-Samuel Beuscart and Kevin Mellet. Business models of the Web 2.0: Advertising or the tale of two stories. Communications & Strategies, Special Issue, 2008.
[27] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 878–894. IEEE, 2016.
[28] Alejandro Gómez-Boix, Pierre Laperdrix, and Benoit Baudry. Hiding in the crowd: An analysis of the effectiveness of browser fingerprinting at large scale. In WWW 2018: The 2018 Web Conference, 2018.
[29] Michal Zalewski. p0f v3, 2019. URL http://lcamtuf.coredump.cx/p0f3/.
[30] Andreas Kurtz, Hugo Gascon, Tobias Becker, Konrad Rieck, and Felix Freiling. Fingerprinting mobile devices using personalized configurations. Proceedings on Privacy Enhancing Technologies, 2016(1):4–19, 2016.
[31] Wenjia Wu, Jianan Wu, Yanhao Wang, Zhen Ling, and Ming Yang. Efficient fingerprinting-based Android device identification with zero-permission identifiers. IEEE Access, 4:8073–8083, 2016.
[32] Internet Engineering Task Force. RFC 7231: Hypertext Transfer Protocol (HTTP/1.1): Semantics and content, 2014. URL https://tools.ietf.org/html/rfc7231.
[33] Internet Engineering Task Force. RFC 7231: Semantics and content of the User-Agent header, 2014. URL https://tools.ietf.org/html/rfc7231#section-5.5.3.
[34] Internet Engineering Task Force. RFC 7231: Semantics and content of the Accept-Language header, 2014. URL https://tools.ietf.org/html/rfc7231#section-5.3.5.
[35] Ting-Fang Yen, Yinglian Xie, Fang Yu, Roger Peng Yu, and Martin Abadi. Host fingerprinting and tracking on the web: Privacy and security implications. In NDSS, volume 62, page 66, 2012.
[36] Lukasz Olejnik, Steven Englehardt, and Arvind Narayanan. Battery status not included: Assessing privacy in web standards. In 3rd International Workshop on Privacy Engineering (IWPE'17), 2017.
[37] Martin Mulazzani, Philipp Reschl, Markus Huber, Manuel Leithner, Sebastian Schrittwieser, Edgar Weippl, and FC Wien. Fast and reliable browser identification with JavaScript engine fingerprinting. In Web 2.0 Workshop on Security and Privacy (W2SP), volume 5. Citeseer, 2013.
[38] Keaton Mowery and Hovav Shacham. Pixel perfect: Fingerprinting canvas in HTML5. Proceedings of W2SP, pages 1–12, 2012.
[39] David Fifield and Serge Egelman. Fingerprinting web users through font metrics. In International Conference on Financial Cryptography and Data Security, pages 107–124. Springer, 2015.
[40] Keaton Mowery, Dillon Bogenreif, Scott Yilek, and Hovav Shacham. Fingerprinting information in JavaScript implementations. In Helen Wang, editor, Proceedings of W2SP 2011. IEEE Computer Society, May 2011.
[41] Iskander Sanchez-Rola, Igor Santos, and Davide Balzarotti. Clock around the clock: Time-based device fingerprinting. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 1502–1514. ACM, 2018.
[42] Oleksii Starov and Nick Nikiforakis. XHound: Quantifying the fingerprintability of browser extensions. In 2017 IEEE Symposium on Security and Privacy (SP), pages 941–956. IEEE, 2017.
[43] Alexander Sjösten, Steven Van Acker, and Andrei Sabelfeld. Discovering browser extensions via web accessible resources. In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, pages 329–336. ACM, 2017.
[44] Alexander Sjösten, Steven Van Acker, Pablo Picazo-Sanchez, and Andrei Sabelfeld. Latex gloves: Protecting browser extensions from probing and revelation attacks. Power, page 57, 2018.
[45] Amin FaizKhademi, Mohammad Zulkernine, and Komminist Weldemariam. FPGuard: Detection and prevention of browser fingerprinting. In IFIP Annual Conference on Data and Applications Security and Privacy, pages 293–308. Springer, 2015.
[46] Enric Pujol, Oliver Hohlfeld, and Anja Feldmann. Annoyed users: Ads and ad-block usage in the wild. In Proceedings of the 2015 Internet Measurement Conference, pages 93–106. ACM, 2015.
[47] Firefox. Firefox public data report, 2019. URL https://data.firefox.com/dashboard/usage-behavior.
[48] Eyeo GmbH. Adblock Plus, 2018. URL https://adblockplus.org/.
[49] Raymond Hill. uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean, 2018. URL https://github.com/gorhill/uBlock.
[50] EasyList. About EasyList, 2018. URL https://easylist.to/pages/about.html.
[51] EasyPrivacy. EasyPrivacy, 2018. URL https://easylist.to/easylist/easyprivacy.txt.
[52] AdBlock. AdBlock, 2018. URL https://getadblock.com/.
[53] Brave Software Inc. Brave browser, 2018. URL https://brave.com/.
[54] Cliqz International GmbH. Ghostery, 2018. URL https://www.ghostery.com.
[55] Electronic Frontier Foundation. Privacy Badger, 2019. URL https://www.eff.org/fr/node/99095.
[56] Umar Iqbal, Zubair Shafiq, Peter Snyder, Shitong Zhu, Zhiyun Qian, and Benjamin Livshits. AdGraph: A machine learning approach to automatic and effective adblocking. arXiv preprint arXiv:1805.09155, 2018.
[57] Georg Merzdovnik, Markus Huber, Damjan Buhov, Nick Nikiforakis, Sebastian Neuner, Martin Schmiedecker, and Edgar Weippl. Block me if you can: A large-scale study of tracker-blocking tools. Proceedings - 2nd IEEE European Symposium on Security and Privacy, EuroS&P 2017, pages 319–333, 2017. doi: 10.1109/EuroSP.2017.26.
[58] Giorgio Maone. NoScript Firefox extension, 2018. URL https://noscript.net/.
[59] Raymond Hill. uMatrix: Point and click matrix to filter net requests according to source, destination and type, 2019. URL https://github.com/gorhill/uMatrix.
[60] The Tor Project. Tor Browser, 2018. URL https://www.torproject.org/projects/torbrowser.html.en.
[61] Zhonghao Yu, Sam Macbeth, Konark Modi, and Josep M Pujol. Tracking the trackers. In Proceedings of the 25th International Conference on World Wide Web, pages 121–132. International World Wide Web Conferences Steering Committee, 2016.
[62] Smart Software. Ultimate User Agent Switcher, URL sniffer, 2019. URL http://iblogbox.com/chrome/useragent/alert.php.
[63] dillbyrne. Random Agent Spoofer, 2019. URL https://github.com/dillbyrne/random-agent-spoofer.
[64] sereneblue. Chameleon, a WebExtension port of Random Agent Spoofer, 2019. URL https://github.com/sereneblue/chameleon.
[65] Valentin Vasilyev. Modern and flexible browser fingerprinting library, 2019. URL https://github.com/Valve/fingerprintjs2.
[66] Christof Ferreira Torres, Hugo Jonker, and Sjouke Mauw. FP-Block: Usable web privacy by controlling browser fingerprinting. In European Symposium on Research in Computer Security, pages 3–19. Springer, 2015.
[67] ECMA International. ECMAScript® 2016 language specification, 2016. URL http://www.ecma-international.org/ecma-262/7.0/index.html.
[68] Multilogin. Canvas Defender browser extension (canvas fingerprint blocker). URL https://multiloginapp.com/canvasdefender-browser-extension/.
[69] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. Mitigating browser fingerprint tracking: Multi-level reconfiguration and diversification. In Proceedings of the 10th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, pages 98–108. IEEE Press, 2015.
[70] Grant Ho, Dan Boneh, Lucas Ballard, and Niels Provos. Tick tock: Building browser red pills from timing side channels. In WOOT, 2014.
[71] kkapsner. CanvasBlocker: A Firefox plugin to block the <canvas>-API, July 2017. URL https://github.com/kkapsner/CanvasBlocker.
[72] Michael Schwarz, Florian Lackner, and Daniel Gruss. JavaScript template attacks: Automatically inferring host information for targeted exploits. In NDSS, 2019.
[73] OWASP. Cross-site scripting (XSS), 2018. URL https://www.owasp.org/index.php/Cross-site_Scripting_(XSS).
[74] Pierre Laperdrix. Browser Fingerprinting: Exploring Device Diversity to Augment Authentication and Build Client-Side Countermeasures. Theses, INSA de Rennes, October 2017. URL https://tel.archives-ouvertes.fr/tel-01729126.
[75] Luis Von Ahn and Laura Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 319–326. ACM, 2004.
[76] Gregoire Jacob and Christopher Kruegel. PUB CRAWL: Protecting users and businesses from CRAWLers, 2009.
[77] Derek Doran and Swapna S Gokhale. Web robot detection techniques: Overview and limitations. Data Mining and Knowledge Discovery, 22(1-2):183–210, 2011.
[78] Weigang Guo, Shiguang Ju, and Yi Gu. Web robot detection techniques based on statistics of their requested URL resources. In Computer Supported Cooperative Work in Design, 2005. Proceedings of the Ninth International Conference on, volume 1, pages 302–306. IEEE, 2005.
[79] Xiaozhu Lin, Lin Quan, and Haiyan Wu. An automatic scheme to categorize user sessions in modern HTTP traffic. In Global Telecommunications Conference, 2008. IEEE GLOBECOM 2008. IEEE, pages 1–6. IEEE, 2008.
[80] Anália Lourenço and Orlando Belo. Applying clickstream data mining to real-time web crawler detection and containment using ClickTips platform. In Advances in Data Analysis, pages 351–358. Springer, 2007.
[81] Athena Stassopoulou and Marios D Dikaiakos. Crawler detection: A Bayesian approach. In Internet Surveillance and Protection, 2006. ICISP'06. International Conference on, pages 16–16. IEEE, 2006.
[82] Pang-Ning Tan and Vipin Kumar. Discovery of web robot sessions based on their navigational patterns. In Intelligent Technologies for Information Analysis, pages 193–222. Springer, 2004.
[83] Stephen Beveridge and Charles R Nelson. A new approach to decomposition of economic time series into permanent and transitory components with particular attention to measurement of the 'business cycle'. Journal of Monetary Economics, 7(2):151–174, 1981.
[84] Luis Von Ahn, Manuel Blum, Nicholas J Hopper, and John Langford. CAPTCHA: Using hard AI problems for security. In International Conference on the Theory and Applications of Cryptographic Techniques, pages 294–311. Springer, 2003.
[85] 2Captcha. Online CAPTCHA solving and image recognition service, 2018. URL https://2captcha.com/.
[86] Anti CAPTCHA. Anti Captcha: CAPTCHA solving service. Bypass reCAPTCHA, FunCaptcha, image CAPTCHA, 2018. URL https://anti-captcha.com/mainpage.
[87] Suphannee Sivakorn, Jason Polakis, and Angelos D Keromytis. I'm not a human: Breaking the Google reCAPTCHA.
[88] Kevin Bock, Daven Patel, George Hughey, and Dave Levin. unCaptcha: A low-resource defeat of reCAPTCHA's audio challenge. In Proceedings of the 11th USENIX Conference on Offensive Technologies, pages 7–7. USENIX Association, 2017.
[89] Zi Chu, Steven Gianvecchio, Aaron Koehl, Haining Wang, and Sushil Jajodia. Blog or block: Detecting blog bots through behavioral biometrics. Computer Networks, 57(3):634–646, 2013.
[90] Ah Reum Kang, Jiyoung Woo, Juyong Park, and Huy Kang Kim. Online game bot detection based on party-play log analysis. Computers & Mathematics with Applications, 65(9):1384–1395, 2013.
[91] Binh Nguyen, Bryan D Wolf, and Brian Underdahl. Detecting and preventing bots and cheating in online gaming, January 29 2013. US Patent 8,360,838.
[92] Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia. Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Transactions on Dependable and Secure Computing, 9(6):811–824, 2012.
[93] Gianluca Stringhini, Christopher Kruegel, and Giovanni Vigna. Detecting spammers on social networks. In Proceedings of the 26th Annual Computer Security Applications Conference, pages 1–9. ACM, 2010.
[94] Gang Wang, Tristan Konolige, Christo Wilson, Xiao Wang, Haitao Zheng, and Ben Y Zhao. You are how you click: Clickstream analysis for Sybil detection. In USENIX Security Symposium, volume 9, pages 1–008, 2013.
[95] Jing Zhang, Ari Chivukula, Michael Bailey, Manish Karir, and Mingyan Liu. Characterization of blacklists and tainted network traffic. In International Conference on Passive and Active Network Measurement, pages 218–228. Springer, 2013.
[96] Augur Software. URL https://www.augur.io/.
[97] Athena Stassopoulou and Marios D Dikaiakos. Web robot detection: A probabilistic reasoning approach. Computer Networks, 53(3):265–278, 2009.
[98] Dusan Stevanovic, Aijun An, and Natalija Vlajic. Feature evaluation for web crawler detection with data mining techniques. Expert Systems with Applications, 39(10):8707–8717, 2012.
[99] Andoena Balla, Athena Stassopoulou, and Marios D Dikaiakos. Real-time web crawler detection. In Telecommunications (ICT), 2011 18th International Conference on, pages 428–432. IEEE, 2011.
[100] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[101] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts. Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems, pages 431–439, 2013.
[102] Octavio Loyola-González, Milton García-Borroto, Miguel Angel Medina-Pérez, José Fco Martínez-Trinidad, Jesús Ariel Carrasco-Ochoa, and Guillermo De Ita. An empirical study of oversampling and undersampling methods for LCMine an emerging pattern based classifier. In Mexican Conference on Pattern Recognition, pages 264–273. Springer, 2013.
[103] Simon Bernard, Laurent Heutte, and Sébastien Adam. Influence of hyperparameters on random forest accuracy. In International Workshop on Multiple Classifier Systems, pages 171–180. Springer, 2009.
[104] Naoki Takei, Takamichi Saito, Ko Takasu, and Tomotaka Yamada. Web browser fingerprinting using only cascading style sheets. In 2015 10th International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA), pages 57–63. IEEE, 2015.
[105] Takamichi Saito, Kazushi Takahashi, Koki Yasuda, Takayuki Ishikawa, Ko Takasu, Tomotaka Yamada, Naoki Takei, and Rio Hosoi. OS and application identification by installed fonts. In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA), pages 684–689, 2016. doi: 10.1109/AINA.2016.55. URL http://ieeexplore.ieee.org/document/7474155/.
[106] ua-parser. Python implementation of ua-parser, 2018. URL https://github.com/ua-parser/uap-python.
[107] Modernizr. Modernizr: The feature detection library for HTML5/CSS3, 2019. URL https://modernizr.com.
[108] Alexis Deveria. Can I use... Support tables for HTML5, CSS3, etc, 2019. URL https://caniuse.com/.
[109] Brave Software. URL https://brave.com/.
[110] Aniko Hannak, Gary Soeller, David Lazer, Alan Mislove, and Christo Wilson. Measuring price discrimination and steering on e-commerce web sites. In Proceedings of the 2014 Conference on Internet Measurement Conference, pages 305–318. ACM, 2014.
[111] MDN Web Docs. MutationObserver API, 2018. URL https://developer.mozilla.org/en-US/docs/Web/API/MutationObserver.
[112] Distil Networks. Distil's bad bot report 2018: The year bad bots went mainstream, 2018. URL https://resources.distilnetworks.com/all-blog-posts/bad-bot-report-now-available.
[113] Google. Puppeteer, 2019. URL https://pptr.dev/.
[114] Martin Monperrus. Crawler-user-agents, 2019. URL https://github.com/monperrus/crawler-user-agents.
[115] Google. Issue 775911 in Chromium: Missing accept languages in request for headless mode, October 2017. URL https://groups.google.com/a/chromium.org/forum/#!topic/headless-dev/8YujuBps0oc.
[116] Erti-Chris Eelmaa. Can a website detect when you are using Selenium with ChromeDriver?, 2016. URL https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver/41220267#41220267.
[117] Frank J Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.
[118] Selenium HQ. What is Selenium?, 2019. URL https://www.seleniumhq.org.
[119] Selenium HQ. Selenium IDE, 2019. URL https://www.seleniumhq.org/projects/ide/.
[120] Sergey Shekyan. Detecting PhantomJS based visitors, 2015. URL https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/.
[121] Alexis Deveria. Support of typed arrays, 2019. URL https://caniuse.com/#search=Int8Array.
[122] Alexis Deveria. Support of mutation observers, 2019. URL https://caniuse.com/#search=MutationObserver.
[123] Antoine Vastel. Detecting Chrome headless, 2017. URL https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headless.html.
[124] MDN Web Docs. Permissions API, 2018. URL https://developer.mozilla.org/en-US/docs/Web/API/Permissions_API.
[125] Alexis Deveria. Support of Ogg Vorbis audio format, 2019. URL https://caniuse.com/#search=ogg.
[126] Alexis Deveria. Support of MP3 audio format, 2019. URL https://caniuse.com/#feat=mp3.
[127] Alexis Deveria. Support of Waveform audio file format, 2019. URL https://caniuse.com/#search=wav.
[128] Alexis Deveria. Support of Advanced Audio Coding format, 2019. URL https://caniuse.com/#feat=aac.
[129] Alexis Deveria. Support of Ogg/Theora video format, 2019. URL https://caniuse.com/#feat=ogv.
[130] Alexis Deveria. Support of MPEG-4/H.264 video format, 2019. URL https://caniuse.com/#search=h264.
[131] Alexis Deveria. Support of WebM video format, 2019. URL https://caniuse.com/#search=webm.
[132] Chromium Bug Tracker. Support WebGL in headless, 2016. URL https://bugs.chromium.org/p/chromium/issues/detail?id=617551.
[133] Apple Inc. iOS SDK release notes for iOS 8.0 GM, 2014. URL https://developer.apple.com/library/archive/releasenotes/General/RN-iOSSDK-8.0/.
[134] Distil Networks. Can Comic Sans detect cyber attacks?, September 2016. URL https://resources.distilnetworks.com/all-blog-posts/can-comic-sans-detect-cyber-attacks.
[135] Xianghang Mi, Ying Liu, Xuan Feng, Xiaojing Liao, Baojun Liu, XiaoFeng Wang, Feng Qian, Zhou Li, Sumayah Alrwais, and Limin Sun. Resident Evil: Understanding residential IP proxy as a dark service. IEEE, 2019.
[136] Imperva Incapsula. Bot traffic report 2016, March 2014. URL http://time.com/12933/what-you-think-you-know-about-the-web-is-wrong/.
[137] Renáta Hodován and Ákos Kiss. Fuzzing JavaScript engine APIs. In International Conference on Integrated Formal Methods, pages 425–438. Springer, 2016.
[138] Christina Braz, Ahmed Seffah, and David M'Raihi. Designing a trade-off between usability and security: A metrics based-model. In IFIP Conference on Human-Computer Interaction, pages 114–126. Springer, 2007.
[139] Google. DevTools: Make it possible to control permissions via protocol, 2016. URL https://bugs.chromium.org/p/chromium/issues/detail?id=631464.
Appendix A
List of fingerprinting attributes collected
This appendix presents the list of fingerprinting attributes monitored to detect scripts
that use browser fingerprinting for crawler detection.
A.1 Navigator properties
userAgent,
platform,
plugins,
mimeTypes,
doNotTrack,
languages,
productSub,
language,
vendor,
oscpu,
hardwareConcurrency,
cpuClass,
webdriver,
chrome.
A.2 Screen properties
width,
height,
availWidth,
availHeight,
availTop,
availLeft,
colorDepth,
pixelDepth.
A.3 Window properties
ActiveXObject,
webdriver,
domAutomation,
domAutomationController,
callPhantom,
spawn,
emit,
Buffer,
awesomium,
_Selenium_IDE_Recorder,
__webdriver_script_fn,
_phantom,
callSelenium,
_selenium.
A.4 Audio methods
createAnalyser,
createOscillator,
createGain,
createScriptProcessor,
createDynamicsCompressor,
copyFromChannel,
getChannelData,
getFloatFrequencyData,
getByteFrequencyData,
getFloatTimeDomainData,
getByteTimeDomainData.
A.5 WebGL methods
getParameter,
getSupportedExtensions,
getContextAttributes,
getShaderPrecisionFormat,
getExtension,
readPixels,
getUniformLocation,
getAttribLocation.
A.6 Canvas methods
toDataURL,
toBlob,
getImageData,
getLineDash,
measureText,
isPointInPath.
A.7 WebRTC methods
createOffer,
createAnswer,
setLocalDescription,
setRemoteDescription.
A.8 Other methods
Date.getTimezoneOffset,
SVGTextContentElement.getComputedTextLength.
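As an illustration of how accesses to these attributes can be observed, the following is a minimal sketch, not the actual instrumentation used in the thesis: it wraps the getter of a monitored property on the Navigator prototype so that every access is logged before the genuine value is returned. The choice of userAgent as the monitored attribute is purely illustrative.

// Minimal monitoring sketch (illustrative, not the thesis instrumentation).
// Navigator properties are accessors on Navigator.prototype, so wrapping
// the getter logs every access before delegating to the original one.
const monitored = 'userAgent';
const proto = Object.getPrototypeOf(navigator);
const original = Object.getOwnPropertyDescriptor(proto, monitored);
Object.defineProperty(proto, monitored, {
  get() {
    console.log(`fingerprinting-related access: navigator.${monitored}`);
    return original.get.call(this);
  },
});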
Appendix B
Overriding crawlers' fingerprints
This appendix presents the code used to override the fingerprints of the 7 crawlers used for the evaluation in Chapter 5. The page variable used in the code snippets refers to an instance of a Page object of the Puppeteer library.
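For context, such a page instance can be obtained as follows; this is a minimal sketch and not necessarily the exact launch configuration used for the crawlers.

// Minimal Puppeteer setup sketch; launch options are assumptions,
// not the thesis configuration.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  // The overrides of Listings B.1 to B.9 would be applied to `page` here.
  await browser.close();
})();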
B.1 Overriding the user agent
The seven crawlers override their user agent to appear as a non-headless Chrome, since Headless Chrome otherwise advertises itself through a HeadlessChrome token in its user agent. The code modifies both the user agent sent in the HTTP headers and the user agent contained in the navigator object.
await page.setUserAgent(userAgent);
Listing B.1 Setting the user agent of the crawler to the value contained in the userAgent variable.
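The userAgent variable is left undefined in the listing. As a purely hypothetical example, it could hold the user agent string of a regular Chrome of the same version:

// Hypothetical value for illustration only; any user agent collected from
// a non-headless Chrome could be used instead.
const userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36';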
B.2 Deleting the webdriver property
Crawlers 2 to 7 delete the webdriver property of the navigator object. Since the navigator object is handled specially by the browser, the property cannot be deleted from it directly. Instead, we create a reference to the navigator prototype, delete the webdriver property from this reference, and finally assign the modified prototype back to the navigator object.
await page.evaluateOnNewDocument(() => {
  const newProto = navigator.__proto__;
  delete newProto.webdriver;
  navigator.__proto__ = newProto;
});
Listing B.2 Deleting the webdriver property from the navigator object.
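A minimal check, added here for illustration and not part of the original listings, can confirm the effect of the override:

// Prints `undefined` once Listing B.2 has been applied; in an unmodified
// headless Chrome driven by Puppeteer it would print `true`.
console.log(await page.evaluate(() => navigator.webdriver));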
B.3 Adding a language header
Crawlers 3 to 6 add an Accept-Language header field to all the HTTP requests they send.
await page.setExtraHTTPHeaders({
  'Accept-Language': 'en-US'
});
Listing B.3 Adding an Accept-Language header field to all the HTTP requests sent by the crawler.
B.4 Forging a fake Chrome object
Crawlers 4 to 6 add a realistic chrome property to the window object; this property is present in a regular Chrome but absent from Headless Chrome, which makes its absence a straightforward detection signal.
await page.evaluateOnNewDocument(() => {
  // Forged chrome object mimicking the one exposed by a regular Chrome.
  window.chrome = {
    app: {
      isInstalled: false,
    },
    webstore: {
      onInstallStageChanged: {},
      onDownloadProgress: {},
    },
    runtime: {
      PlatformOs: {
        MAC: 'mac',
        WIN: 'win',
        ANDROID: 'android',
        CROS: 'cros',
        LINUX: 'linux',
        OPENBSD: 'openbsd',
      },
      PlatformArch: {
        ARM: 'arm',
        X86_32: 'x86-32',
        X86_64: 'x86-64',
      },
      PlatformNaclArch: {
        ARM: 'arm',
        X86_32: 'x86-32',
        X86_64: 'x86-64',
      },
      RequestUpdateCheckStatus: {
        THROTTLED: 'throttled',
        NO_UPDATE: 'no_update',
        UPDATE_AVAILABLE: 'update_available',
      },
      OnInstalledReason: {
        INSTALL: 'install',
        UPDATE: 'update',
        CHROME_UPDATE: 'chrome_update',
        SHARED_MODULE_UPDATE: 'shared_module_update',
      },
      OnRestartRequiredReason: {
        APP_UPDATE: 'app_update',
        OS_UPDATE: 'os_update',
        PERIODIC: 'periodic',
      },
    },
  };
});
Listing B.4 Adding a chrome property to the window object.
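As a minimal illustrative check, not part of the original listings:

// In a regular Chrome, window.chrome is an object; in the headless builds
// targeted by detection tests it is missing, hence the forgery above.
const hasChrome = await page.evaluate(
  () => typeof window.chrome === 'object' && window.chrome.runtime !== undefined
);
console.log(hasChrome); // true once Listing B.4 is applied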
B.5 Overriding permissions behavior
Crawlers 5 and 6 override the way permissions for notifications are handled. In an unmodified headless Chrome, Notification.permission reports 'default' while a permissions query for notifications resolves to 'denied'; the override removes this detectable inconsistency.
await page.evaluateOnNewDocument(() => {
  const originalQuery = window.navigator.permissions.query;
  window.navigator.permissions.__proto__.query = parameters =>
    parameters.name === 'notifications'
      ? Promise.resolve({ state: Notification.permission })
      : originalQuery(parameters);
});
Listing B.5 Overriding the way permissions are handled.
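The inconsistency that this override hides can be tested directly. The following check, added for illustration only, mirrors the detection technique:

// In an unmodified headless Chrome, Notification.permission is 'default'
// while the permissions query for notifications resolves to 'denied'.
// After Listing B.5 the two answers agree and this prints `false`.
const inconsistent = await page.evaluate(async () => {
  const { state } = await navigator.permissions.query({ name: 'notifications' });
  return Notification.permission === 'default' && state === 'denied';
});
console.log(inconsistent);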
B.6 Overriding window and screen dimensions
Crawler 6 overrides several attributes related to the size of the screen and the window, using consistent values collected on a non-headless browser.
await page.evaluateOnNewDocument(() => {
  // Values collected on a non-headless browser with a 1920x1080 screen.
  Object.defineProperty(window, 'innerWidth', {
    get: function() { return 1919; }
  });
  Object.defineProperty(window, 'innerHeight', {
    get: function() { return 1007; }
  });
  Object.defineProperty(window, 'outerWidth', {
    get: function() { return 1919; }
  });
  Object.defineProperty(window, 'outerHeight', {
    get: function() { return 1007; }
  });
  Object.defineProperty(window, 'pageXOffset', {
    get: function() { return 0; }
  });
  Object.defineProperty(window, 'pageYOffset', {
    get: function() { return 0; }
  });
  Object.defineProperty(window, 'screenX', {
    get: function() { return 1680; }
  });
  Object.defineProperty(window, 'screenY', {
    get: function() { return 0; }
  });
  Object.defineProperty(screen, 'availWidth', {
    get: function() { return 1920; }
  });
  Object.defineProperty(screen, 'availHeight', {
    get: function() { return 1080; }
  });
  Object.defineProperty(screen, 'width', {
    get: function() { return 1920; }
  });
  Object.defineProperty(screen, 'height', {
    get: function() { return 1080; }
  });
  document.addEventListener("DOMContentLoaded", () => {
    Object.defineProperty(document.body, 'clientWidth', {
      get: function() { return 1919; }
    });
    Object.defineProperty(document.body, 'clientHeight', {
      get: function() { return 541; }
    });
  });
});
Listing B.6 Overriding properties related to the size of the screen and the window using real values collected on a non-headless browser.
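A short illustrative consistency check, not part of the original listings:

// With Listing B.6 applied, the reported dimensions are those of a
// maximized window on a 1920x1080 screen, as on the reference machine.
const dims = await page.evaluate(() => ({
  inner: [window.innerWidth, window.innerHeight],
  screen: [screen.width, screen.height],
}));
console.log(dims); // { inner: [1919, 1007], screen: [1920, 1080] }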
B.7 Overriding codecs support
Crawler 6 overrides the support for several audio and video codecs.
await page.evaluateOnNewDocument(() => {
  HTMLAudioElement.prototype.canPlayType = (t) => {
    const tTrimed = t.replace(/[" ';]/g, '');
    if (tTrimed === 'audio/oggcodecs=vorbis') {
      return "probably";
    } else if (tTrimed === 'audio/mpeg') {
      return "probably";
    } else if (tTrimed === 'audio/wavcodecs=1') {
      return "probably";
    } else if (tTrimed === 'audio/x-m4a') {
      return "maybe";
    } else if (tTrimed === 'audio/aac') {
      return "probably";
    }
    return '';
  };

  HTMLVideoElement.prototype.canPlayType = (t) => {
    const tTrimed = t.replace(/[" ';]/g, '');
    if (tTrimed === 'video/oggcodecs=theora') {
      return "probably";
    } else if (tTrimed === 'video/mp4codecs=avc1.42E01E') {
      return "probably";
    } else if (tTrimed === 'video/webmcodecs=vp8,vorbis') {
      return "probably";
    } else if (tTrimed === 'video/mp4codecs=mp4v.20.8mp4a.40.2') {
      return "";
    } else if (tTrimed === 'video/mp4codecs=mp4v.20.240,mp4a.40.2') {
      return "";
    } else if (tTrimed === 'video/x-matroskacodecs=theora,vorbis') {
      return "";
    }
    return '';
  };
});
Listing B.7 Overriding the canPlayType function for HTMLAudioElement and HTMLVideoElement.
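As an illustrative check, not part of the original listings: Chromium builds that ship without proprietary codecs may answer such queries with an empty string, whereas the override reproduces the answers of a regular Chrome.

// Queries the overridden canPlayType; prints 'probably' once Listing B.7
// is applied, mimicking a regular Chrome with proprietary codecs.
const mp3 = await page.evaluate(() => new Audio().canPlayType('audio/mpeg'));
console.log(mp3);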
B.8 Removing traces of touch support
Crawler 6 removes attributes and events that indicate the presence of touch support on the device.
await page.evaluateOnNewDocument(() => {
  // Make document.createEvent throw when asked for a TouchEvent,
  // as a browser without touch support would.
  document.createEvent = (function (orig) {
    return function () {
      let args = arguments;
      if (args[0] === 'TouchEvent') {
        throw 'error';
      }
      return orig.apply(this, args);
    };
  }(document.createEvent));

  Object.defineProperty(navigator, 'maxTouchPoints', { get: () => 0 });

  delete window.ontouchstart;
});
Listing B.8 Removing events and functions related to touch screens.
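A minimal illustrative check of the resulting fingerprint, again not part of the original listings:

// After Listing B.8, the page no longer exposes any trace of touch
// support, as expected from a desktop browser.
const touch = await page.evaluate(() => ({
  maxTouchPoints: navigator.maxTouchPoints,
  touchEvents: 'ontouchstart' in window,
}));
console.log(touch); // { maxTouchPoints: 0, touchEvents: false }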
B.9 Overriding toStrings
Crawlers 5 and 6 override the behavior of native functions. In order to hide their changes, they also override the toString function of each function they modify. Moreover, they override Function.prototype.toString, the toString of the Function class, so that inspecting toString itself does not reveal the tampering.
await page.evaluateOnNewDocument(() => {
  const oldCall = Function.prototype.call;
  function call() {
    return oldCall.apply(this, arguments);
  }
  Function.prototype.call = call;

  const nativeToStringFunctionString = Error.toString().replace(/Error/g, "toString");
  const oldToString = Function.prototype.toString;

  function functionToString() {
    if (this === window.navigator.permissions.query) {
      return "function query() { [native code] }";
    }
    if (this === HTMLVideoElement.prototype.canPlayType) {
      return "function canPlayType() { [native code] }";
    }
    if (this === HTMLAudioElement.prototype.canPlayType) {
      return "function canPlayType() { [native code] }";
    }
    if (this === document.createEvent) {
      return "function createEvent() { [native code] }";
    }
    if (this === functionToString) {
      return nativeToStringFunctionString;
    }
    return oldCall.call(oldToString, this);
  }
  Function.prototype.toString = functionToString;
});
Listing B.9 Overriding the toString functions of the overridden functions.
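A final illustrative check, not part of the original listings, shows that the overridden functions now claim to be native, and that the replacement toString itself also passes inspection:

// Both calls print strings of the form "function ...() { [native code] }",
// hiding the overrides of Listings B.5 to B.8 from toString-based checks.
console.log(await page.evaluate(() => navigator.permissions.query.toString()));
console.log(await page.evaluate(() => Function.prototype.toString.toString()));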