
patterns are likely to be consistent across these periods [
8
,
10
]. As
a matter of fact, the online visiting behaviors of each individual
user are recurrent and consistent during a short continuous period
such as days or weeks. For example, users who have been highly
frequent in some services such as “healthcare.com” are likely to
remain frequent in the near future. Secondly, domain name look
ups in a period are unique from users to users according to their
personal behaviors. For example, some always prefer movies and
shopping while others focus on stock market news and banking.
A few methods [
8
,
10
] have been proposed to generate DNS-
based fingerprints from DNS queries and use the extracted fin-
gerprints to match and attribute DNS queries to their originators.
Specifically, Herrmann et al. [
8
] employed top-visited domains as
fingerprints whereas Kim et al. [
10
] leveraged temporal behaviors
of domain queries (e.g., the order in which an arbitrary pair of
domains are visited). Both methods follow the supervised learning
paradigm, where a large set of DNS queries with labeled users are
required to generate fingerprints for each user. Unfortunately, it is
extremely challenging to acquire such pre-labeled dataset in real
networks, which fundamentally limit the practical application of
these types of research methods.
On the other hand, unsupervised learning such as clustering can
be used to cluster DNS queries of different sessions into groups, each
of which is considered being generated by a unique user. Specifically,
we assume that queries issued from the same IP in a session are
from the same user, where a session length is a continuous duration
such as a day. As illustrated in Figure 1, there are assumed three IP
addresses and each issues a set of DNS queries in four consecutive
sessions. The goal of clustering is to group DNS query sessions into
distinct groups, where sessions in the same group belong to the
same user. For example, we believe that sessions labeled in orange
belong to the same user since they frequently access particular types
of domains such as “Yahoo” and “Learning”. Another group linking
all sessions labeled in blue tends to be a user who likes particularly
shopping and online TV sites, and the last group labeled in green
belongs to a user who likes news from “sina.com” and social media
actives via “qq.com”.
Most existing work such as Kmeans and constrained Kmeans [
11
]
has been proposed under this framework, where top-visited do-
mains are used as features for clustering user sessions. These meth-
ods need the number of clusters to bootstrap, which is ideally
the number of distinct users in the network. However, due to the
bursty nature of human activities, such a design fundamentally
undermines the applicability of this method since accurately esti-
mating the number of users is extremely challenging in practice,
if not entirely unrealistic. On the other hand, traditional Dirichlet
multinomial mixture (DMM) and its variations [
2
,
7
,
22
] frequently
used in Bayesian nonparametric modeling can automatically adjust
the number of clusters. However, each cluster may contain many
sessions generated from multiple users, which does not fit into the
scenario of DNS tracking since we need to identify pure clusters
and each corresponds to a particular user.
As a means towards systematically solving these challenges, we
propose the constrained Dirichlet multinomial mixture (CDMM)
clustering model, which is a generalization of the traditional Dirich-
let multinomial mixture clustering (DMM) model. The key inno-
vation of the proposed model is to sample a session’s cluster label
only from the qualified candidate clusters whose current sizes are
smaller than a threshold, which is the maximum possible number
of sessions a user can contribute during the observation period. For
example, if session length is defined as a day and all session data
is collected in a week, the threshold is 7assuming each user only
contributes one session each day. The constrained model fits the
scenarios of DNS clustering better and can potentially improve the
clustering performance. To our knowledge, this is the first research
about imposing cluster size constraint on nonparametric Bayesian
clustering methods. To summarize, we have made the following
contributions:
•
We have designed a new clustering model without requiring
the specification of the number of clusters and meanwhile
enforcing the maximum number of sessions in each cluster,
which fits the DNS tracking problem nicely.
•
Experimental results on DNS queries collected from real
networks demonstrate that our method accomplishes a high
clustering accuracy and outperforms the existing methods.
2 RELATED WORK
Active research has been conducted to infer users’ behaviors through
collected network traffic including DNS queries. The focuses of
existing methods generally fall into two categories including i)
attributing network events to individual network users and ii) in-
ferring users’ demographic information. Our method focuses on
attributing sessions of DNS queries to their corresponding users
(and thus falls into the first category).
A few methods have been proposed to attribute network events
to individual users [
6
,
8
–
11
,
17
]. Most of these methods [
6
,
8
–
10
,
17
]
first generate behavioral fingerprints for network users and then
use these fingerprints to attribute network events whose users are
unknown, which requires a significantly large set of events whose
users are pre-labeled. This fundamentally limits the practical us-
age of these systems. In contrast, our method does not require
any DNS sessions to be pre-labeled. In addition, methods includ-
ing [
6
,
9
,
17
] generate user fingerprints using data sources (e.g.,
encrypted wireless traffic, HTTP traffic, and search engine traffic)
that offer much finer-grained information compared to DNS traffic.
In comparison with our method, Herrmann et al. [
8
] and Kim et
al. [
10
] generate user fingerprints using the same data source (i.e.,
DNS queries). However, both methods follow the supervised learn-
ing paradigm, where a long period (multiple days) of DNS queries
need to be labelled for bootstrapping. The work closest to ours
is [
11
]. Similar to [
11
], our method uses clustering method to aggre-
gate DNS sessions into clusters, where each cluster is expected to
contain sessions generated by one user. In spite of such similarity,
the method in [
11
] mandates the number of clusters. Unfortunately,
it is extremely challenging, if not entirely impossible, to specify
the number of clusters in practice, fundamentally limiting the ap-
plication of [
11
]. In comparison, our method does not require the
specification of the number of clusters.
A substantial body of studies [
1
,
3
–
5
,
12
–
16
,
19
–
21
,
24
,
25
] have
focused on inferring users’ application-level activity information