
to evaluate DNS cache algorithms [6] and to train statistical models for malicious
domain detection [7,8]. Although the specific type and granularity of informa-
tion extracted from DNS traces may vary for different applications, the demand
for DNS traces is generally increasing.
Despite their practical value, DNS traces may raise significant privacy
concerns. For example, DNS queries triggered by the prefetching mechanisms
of popular browsers can leak users’ search engine queries [9]; DNS queries
can also reveal the types of operating systems [10]. In this project, we study a
new privacy risk introduced by passively collected DNS traffic: to what extent
can network users be uniquely identified merely by the way they issue DNS
queries? In other words, we intend to derive behavioral fingerprints from DNS
traces, where each behavioral fingerprint aims to uniquely identify its
corresponding user and to remain stable over time. Such DNS-based behavioral
fingerprints, once successfully derived, have strong privacy implications. For
example, they can be used to de-anonymize DNS traces whose sources have been
anonymized. More specifically, when DNS traces are shared, the source (e.g., the
IP address) that issues each DNS query is usually anonymized (e.g., by obscuring
the IP address using a hash function). However, one can learn behavioral
fingerprints from un-anonymized DNS traces and use the acquired fingerprints
to reveal the presence of specific users in (other) anonymized traces. In
addition, if one can get access to DNS traces collected from multiple access
networks (e.g., through open DNS services or by collecting traces from multiple
networks), one can track users’ locations across different networks by using
behavioral fingerprints to identify those users in the DNS traces.
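
As a minimal sketch of the de-anonymization scenario described above, the
following Python fragment learns a fingerprint per user from an un-anonymized
trace and then links hashed pseudonyms in a second trace back to those users.
Everything in it is an illustrative assumption rather than the method used in
this paper: the fingerprint is simply the set of domains a source queries, the
pseudonyms are truncated salted SHA-256 digests, similarity is the Jaccard
index, and the names anonymize, domain_sets, jaccard, and link_users as well as
the 0.5 threshold are hypothetical.

```python
import hashlib
from collections import defaultdict


def anonymize(ip, salt="demo-salt"):
    """Obscure a source IP with a salted hash (hypothetical scheme)."""
    return hashlib.sha256((salt + ip).encode()).hexdigest()[:12]


def domain_sets(trace):
    """Group queried domains by source: {source: set of domains}."""
    sets = defaultdict(set)
    for source, domain in trace:
        sets[source].add(domain)
    return dict(sets)


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_users(known_trace, anonymized_trace, threshold=0.5):
    """Map each pseudonym to its most similar known user, if similar enough."""
    known = domain_sets(known_trace)
    links = {}
    for pseudonym, domains in domain_sets(anonymized_trace).items():
        # Pick the known user whose fingerprint best matches this pseudonym.
        best_user, best_score = max(
            ((user, jaccard(domains, fp)) for user, fp in known.items()),
            key=lambda pair: pair[1],
            default=(None, 0.0),
        )
        if best_user is not None and best_score >= threshold:
            links[pseudonym] = best_user
    return links


# Toy data (made up): an un-anonymized trace and a later anonymized one.
known_trace = [
    ("10.0.0.1", "mail.example.com"),
    ("10.0.0.1", "news.example.org"),
    ("10.0.0.1", "git.example.net"),
    ("10.0.0.2", "video.example.com"),
    ("10.0.0.2", "games.example.org"),
]
later_trace = [
    (anonymize("10.0.0.1"), "mail.example.com"),
    (anonymize("10.0.0.1"), "git.example.net"),
    (anonymize("10.0.0.2"), "video.example.com"),
    (anonymize("10.0.0.2"), "games.example.org"),
    (anonymize("10.0.0.2"), "cdn.example.net"),
]
links = link_users(known_trace, later_trace)
```

In this toy example, hashing hides the IP addresses but leaves the queried
domains intact, so each pseudonym can still be matched to the user with the
most similar query behavior. A real trace would contain many users with
overlapping domains, which is why richer patterns than a plain domain set are
of interest.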
This paper aims to investigate the extent to which behavioral fingerprints
can be derived and to measure their accuracy in identifying the presence of
the corresponding network users. To this end, we propose a set of new
patterns, which collectively form behavioral fingerprints. We also built a
distributed, scalable system, namely DNSMiner, to automatically derive
DNS-based behavioral fingerprints from a massive amount of DNS traces.
Specifically, we make the following contributions in this paper.
– We have designed five new patterns, namely domain set, domain sequence,
window-aware domain sequence, period behavior, and hourly behavior, which
collectively form behavioral fingerprints. These patterns systematically
characterize DNS behaviors from three aspects: the domain name, the
inter-domain relationship, and the temporal behavior. Although more patterns
might be discovered to enhance behavioral fingerprints, the proposed patterns
establish a lower bound on how well DNS behaviors can be used to fingerprint
network users.
– We have built a system, namely DNSMiner, to automatically mine behavioral
fingerprints from a massive amount of DNS traces. The design of the system
leverages the MapReduce distributed infrastructure to scale up system
performance. Deployed on a 15-node Hadoop platform, DNSMiner can process more
than 467 million DNS queries in approximately 4 hours.