
news and weather. The probe collected fingerprints from December 7, 2016, to June
7, 2017. Only the visitors that consented to cookies were fingerprinted, in compli-
ance with the European directives 2002/58/CE and 2009/136/CE in effect at the
time. To differentiate browsers, we assigned them a unique identifier (UID) as a
6-month cookie. Similarly to [3, 7], we coped with cookie deletion by storing a
one-way hash of the IP address, computed by a secure cryptographic hash function.
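The storage of a one-way hash in place of the raw IP address can be illustrated by the minimal sketch below. The text does not name the hash function used by the probe, so SHA-256 is an assumption here, and both the function name `hash_ip` and the optional `salt` parameter are hypothetical.

```python
import hashlib


def hash_ip(ip_address: str, salt: bytes = b"") -> str:
    """Return a hex digest that stands in for the raw IP address.

    Assumption: SHA-256 is used; the source only states that a secure
    cryptographic hash function was applied. The salt is hypothetical.
    """
    return hashlib.sha256(salt + ip_address.encode("utf-8")).hexdigest()


# The same address always maps to the same digest, so entries can still be
# grouped by IP address without the address itself being stored.
digest = hash_ip("192.0.2.1")
```

Because the digest is deterministic, it can serve as the second element of a (fingerprint, IP address hash) pair when entries are compared.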
Previous datasets were collected through dedicated websites, and are biased
towards privacy-aware and technically skilled persons [3, 7]. Our population is
closer to a general audience, but the visitors of the website are mainly
French-speaking users, which biases the dataset towards this population. The
timezone is set to −1 for 98.48% of browsers, 98.59% of them have daylight
saving time enabled, and fr is present in 98.15% of the Accept-Language HTTP
header values.
3.2 Dataset Filtering and Preprocessing
Given the experimental aspect of fingerprints and the scale of our collection, the raw
dataset contained erroneous or irrelevant samples. We remove 70,460 entries
that have a wrong format (e.g., empty or truncated data), that are duplicated, or that
come from a robot.
Cookies are an unreliable identification method, hence we perform a resynchro-
nization similar to [3]. We consider the entries that have the same (fingerprint, IP
address hash) pair to come from the same browser, and assign them the same UID.
Similarly to [3], we do not synchronize interleaved UIDs, that is, the pairs
whose UID values are b1, b2, then b1 again. We replace 181,676 UIDs with 116,708
replacement UIDs using this method.
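The resynchronization described above can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: the tuple layout, the function name `resynchronize`, and the choice of the earliest UID as the replacement are assumptions, and the sketch does not resolve chains where a UID is replaced through several pairs.

```python
from collections import defaultdict


def resynchronize(entries):
    """Map UIDs to replacement UIDs for entries sharing a (fingerprint, IP hash) pair.

    entries: iterable of (fingerprint, ip_hash, uid, timestamp) tuples
    (a hypothetical layout). Returns a dict {replaced_uid: replacement_uid}.
    """
    groups = defaultdict(list)
    for fingerprint, ip_hash, uid, timestamp in entries:
        groups[(fingerprint, ip_hash)].append((timestamp, uid))

    replacements = {}
    for observations in groups.values():
        observations.sort()  # chronological order within the pair
        # Collapse consecutive repeats to obtain the sequence of UID runs.
        runs = []
        for _, uid in observations:
            if not runs or runs[-1] != uid:
                runs.append(uid)
        # A UID that reappears after another one (b1, b2, then b1 again)
        # marks an interleaved pair, which is left untouched.
        if len(runs) != len(set(runs)):
            continue
        # Assumption: the earliest UID becomes the replacement for the others.
        for uid in runs[1:]:
            replacements[uid] = runs[0]
    return replacements
```

For instance, a pair observed with UIDs a then b yields the replacement of b by a, whereas a pair observed with c, d, then c again yields no replacement.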
To avoid counting multiple entries of identical fingerprints coming from the same
browser, the usual way is to ignore them during collection [3, 7]. Our probe collects
a fingerprint on each visit, and to stay consistent with common methodologies we
deduplicate the fingerprints afterward. For each browser, we keep the first
entry having a given fingerprint, and ignore the consecutive entries that follow
it if they have this same fingerprint. For example, if a browser b has the
entries {(f1, b, t1), (f2, b, t2), (f2, b, t3), (f1, b, t4)}, we only keep the
entries {(f1, b, t1), (f2, b, t2), (f1, b, t4)}. The deduplication constitutes
the biggest cut in our dataset, with 2,420,217 entries filtered out.
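The deduplication rule can be illustrated by a short sketch. It assumes that entries are (fingerprint, browser, timestamp) tuples as in the example, that timestamps are comparable, and that only consecutive repetitions of a fingerprint are dropped, which is the reading implied by the example where (f1, b, t4) is kept. The function name `deduplicate` is hypothetical.

```python
def deduplicate(entries):
    """Drop entries that repeat the previous fingerprint of the same browser.

    entries: iterable of (fingerprint, browser, timestamp) tuples.
    Returns the kept entries in chronological order.
    """
    last_fingerprint = {}  # browser -> fingerprint of its latest entry
    kept = []
    for fingerprint, browser, timestamp in sorted(entries, key=lambda e: e[2]):
        is_repeat = (
            browser in last_fingerprint
            and last_fingerprint[browser] == fingerprint
        )
        if not is_repeat:
            kept.append((fingerprint, browser, timestamp))
        last_fingerprint[browser] = fingerprint
    return kept


# Example from the text, with timestamps t1..t4 encoded as 1..4.
example = [("f1", "b", 1), ("f2", "b", 2), ("f2", "b", 3), ("f1", "b", 4)]
kept = deduplicate(example)
```

On the example of the text, only the entry at t3 is filtered out, since it repeats the fingerprint f2 observed at t2.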
We extract 46 additional attributes from 9 original attributes, which are of two
types. The first type consists of extracted attributes composed of parts of original
attributes, like the screen resolution that is split into the values of width and height.
The second type consists of information sourced from an original attribute, like the
number of plugins extracted from the list of plugins.
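The two types of extracted attributes can be illustrated by the sketch below. The serialization formats are assumptions: the screen resolution is taken to be a string such as "1920x1080", and the list of plugins a string of names joined by a separator. Both function names are hypothetical.

```python
def split_resolution(resolution: str):
    """First type: split an original attribute into its parts.

    Assumption: the resolution is serialized as "<width>x<height>".
    """
    width, height = resolution.lower().split("x")
    return int(width), int(height)


def count_plugins(plugins: str, separator: str = ";") -> int:
    """Second type: derive information from an original attribute.

    Assumption: plugin names are joined by `separator`; empty items are ignored.
    """
    return len([name for name in plugins.split(separator) if name.strip()])


width, height = split_resolution("1920x1080")
n_plugins = count_plugins("Plugin A;Plugin B")
```

The first function yields two attributes from one original attribute, and the second yields a single derived value.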