Network-based HTTPS Client Identification Using
SSL/TLS Fingerprinting
Martin Hus´
ak, Milan ˇ
Cerm´
ak, Tom´
aˇ
s Jirs´
ık, Pavel ˇ
Celeda
Institute of Computer Science, Masaryk University, Brno, Czech Republic
{husakm, cermak, jirsik, celeda}@ics.muni.cz
Abstract—The growing share of encrypted network traffic com-
plicates network traffic analysis and network forensics. In this
paper, we present real-time lightweight identification of HTTPS
clients based on network monitoring and SSL/TLS fingerprinting.
Our experiment shows that it is possible to estimate the User-
Agent of a client in HTTPS communication via the analysis of the
SSL/TLS handshake. The fingerprints of SSL/TLS handshakes,
including a list of supported cipher suites, differ among clients
and correlate to User-Agent values from a HTTP header. We built
up a dictionary of SSL/TLS cipher suite lists and HTTP User-
Agents and assigned the User-Agents to the observed SSL/TLS
connections to identify communicating clients. We discuss host-
based and network-based methods of dictionary retrieval and
estimate the quality of the data. The usability of the proposed
method is demonstrated on two case studies of network forensics.
I. INTRODUCTION
The rising popularity of encrypted network traffic is a
double-edged sword. On the one hand, it provides secure data
transmission, protects against eavesdropping, and improves
the trustworthiness of communicating hosts. On the other
hand, it complicates the legitimate monitoring of network
traffic, including traffic classification and host identification.
Nowadays, we are able to monitor, identify, and classify plain-
text network traffic, such as HTTP, but it is hard to analyse
encrypted communication. The more secure the connection is,
from the point of view of communicating partners, the harder
it is to understand the network traffic and identify anomalous
and malicious activity. Furthermore, the attackers can evade
disclosure by hiding malicious network behaviour in encrypted
connections, where it is invisible to detection mechanisms.
In this paper, we discuss the most common encrypted
network traffic, HTTPS - HTTP over SSL/TLS protocols. In
communication encrypted by SSL/TLS, the hosts have to first
agree on encryption methods and their parameters. Therefore,
the initial packets contain unencrypted messages containing
information about the client and server. The fingerprint of a
SSL/TLS handshake varies among different clients and their
versions. One similar client identifier is a User-Agent value
in a HTTP header, which is commonly used for identifying
the client and classifying traffic. However, only the SSL/TLS
handshake can be observed in a HTTPS connection without
decrypting the payload. Therefore, we approach the problem of
identifying the SSL/TLS client and classifying HTTPS traffic
by building up a dictionary of SSL/TLS handshake fingerprints
and their corresponding User-Agents.
A. Motivation
The motivation for our work came from two case studies of
HTTPS client identification in the field of network forensics.
The first example addresses client identification in network
traffic destined to a HTTPS server. The second case study
addresses a HTTPS-based client identification in a Network
Address Translation (NAT) environment.
In the first case study, a HTTPS server is facing malicious
network traffic. The goal is to analyse the incoming traffic
and identify malicious connections. We do not have access
to the server, therefore we have to analyse incoming network
traffic. If the server was running HTTP, we could analyse the
User-Agents of the clients and filter out legitimate ones, e. g.,
common web browsers, to get a set of potentially malicious
clients. However, as more and more services move from
HTTP to HTTPS, it is common to see web servers running
HTTPS only. In this case, we have to rely on HTTPS client
fingerprinting as the only source of information which leads
to the identification of clients.
Fig. 1. Network-based HTTPS client identification.
In the second case study, a network of clients located behind
a NAT is monitored. Therefore, we have to evaluate other
identifiers than IP addresses. The first task is to identify
and enumerate distinct clients. The second task is to detect
presence of particular clients. Basic information, such as the
operating system, can be obtained via packet fingerprinting [1].
Other valuable information for client identification is provided
by the User-Agent, which can be obtained from any HTTP
connection. However, we are missing a significant portion of
HTTPS traffic, which would not be a problem if we didn’t
have a limited time for monitoring or we had a more specific
query to process. For example, if we are tracing a mobile
device which is connecting to a particular HTTPS server,
1
such as Gmail, we are again left with only HTTPS client
fingerprinting.
Fig. 2. HTTPS-based identification of clients behind NAT.
B. Research Questions
We set up an experiment, which shall answer three research
questions. To sum up our goals, the research questions are:
(i) Which parameters of a SSL/TLS handshake can be used
for client identification?
(ii) How can we build a dictionary of SSL/TLS handshakes
and HTTP User-Agents?
(iii) How large does the dictionary need to be to cover a
significant portion of network traffic?
First, we aim to observe of real network traffic to gain
insight into contemporary SSL/TLS handshakes. We deployed
network traffic monitoring, filtered HTTPS connections, and
made a list of the SSL/TLS handshakes and their fingerprints.
We focused on analysing information provided during the
handshake by the client, i. e., the ClientHello message contain-
ing the protocol version, list of supported cipher suites, and
other data. Apart from the usable information for identifying
the client, we are particularly interested in the share of old
and vulnerable protocol versions. Recent discoveries of severe
vulnerabilities, such as POODLE [2], might have significantly
changed the proportion of protocol versions in use.
Second, we correlated selected parts of SSL/TLS hand-
shakes and HTTP headers. We suppose that the list of sup-
ported cipher suites (declared by the client in the ClientHello
message) can be used as an identifier similarly to a User-Agent
in a HTTP header. However, it is not possible to get the User-
Agent from the HTTPS request without decryption. We use
two approaches to obtain pairs of cipher suites and evaluate
User-Agents. The host-based approach is based on advanced
logging on the server side. The novel network-based method
is based on simultaneously monitoring HTTP and HTTPS
connections. We assume that clients mostly communicate on
both protocols. Therefore, we look for HTTP and HTTPS
connections from the same client over a short time period and
pair cipher suites and User-Agents from such connections.
Third, we used the pairs of SSL/TLS fingerprints and
User-Agents as a dictionary to assign User-Agents to the
HTTPS connections observed during the measurement. We
shall discuss the quality of the obtained pairs with respect to
the dictionary size and accuracy of the User-Agent estimation.
The goal of this part is to estimate the size and accuracy of
a dictionary which could be used for identifying clients on a
large-scale and classifying network traffic.
C. Paper Organization
This paper is divided into seven sections. Section II presents
related work and a brief introduction into SSL/TLS protocols.
The experiment design, measurement tools, and measurement
environment are described in Section III. The results are
presented in Section IV and evaluated in Section V. The
applicability of the results, with respect to the proposed
examples, is discussed in Section VI. Finally, Section VII
concludes the paper.
II. STATE-OF-THE-ART
In this section we shall first present an introduction to
SSL/TLS protocols and a survey of network-based SSL/TLS
analysis. Then we will present a short survey of case studies
in network forensics and related fields.
Transport Layer Security (TLS) [3] is a new version of
the Secure Sockets Layer version 3 (SSLv3) protocol [4],
which is no longer recommended for use due to its secu-
rity vulnerabilities. It provides confidentiality, data integrity,
non-repudiation, replay protection, and authentication through
digital certificates directly on top of the TCP protocol. The
TLS protocol is currently used for securing the most com-
mon network protocols, such as HTTP, FTP, and SMTP, and
is part of Voice over Internet Protocol (VoIP) and Virtual
Private Network (VPN) protocols. In this paper, we shall
focus on SSL/TLS’s use within the HTTP protocol, known
as HTTPS [5], which is the most common use of the TLS.
The TLS connection can be divided into two phases: an
initial handshake and application data transfer, both depicted
in detail in Fig. 3. The initial handshake begins with a
ClientHello message identifying which protocol version is
used, the cipher suite list, and extensions. The full list of
these identifiers is available on the IANA web page [6].
The following messages of the initial handshake are used for
peer authentication using X.509 certificates [7] and shared
secret establishment based on agreed parameters. All TLS
messages exchanged during the initial handshake are not
encrypted until shared keys are established and confirmed by
Finished messages. The TLS protocol consists of configurable
cryptographic algorithms and different sub-protocols which
form a layered design [8]. The main part of this design is
the Record Protocol [3], which is used as an envelope for all
TLS messages and encrypted application data.
The long-term monitoring and measurement of SSL/TLS
traffic in the Internet was presented by Levillain et al. [9]
for the period 2010–2011. They observed a lot of servers
which were intolerant to some cipher suite lists, and detected
certificate chains which did not comply with the standard.
Another study of SSL traffic was conducted by Holz et al. [10],
who focused on SSL/TLS certificate properties. They revealed
a great number of invalid certificates and certificates shared
among a large number of hosts. The work of Holz et al. was
2
followed by the work of Durumeric et al. [11] which focused
on an assessment of certification authorities.
Fig. 3. A SSL/TLS handshake.
The SSL/TLS protocol and its applications are continuously
analysed by Qualys SSL Lab [12]. In addition to SSL/TLS
applications testing, they present the idea of HTTP client
fingerprinting using an analysis of the SSL/TLS handshake.
The idea is represented by the SSLhaf [13] proof-of-concept
tool for a simultaneous host-based analysis of HTTP and
SSL/TLS connections. The idea was also implemented by
Majkovski [1] as the p0f tool module used for fingerprinting
operating systems. Another idea was presented by Bernaille
and Teixeira [14], who identify underlying applications in a
SSL encrypted connection based on the first SSL/TLS packet
size.
The case studies of network forensic analysis, including
analysis based on User-Agent identification, were presented by
Raftopoulos and Dimitropoulos [15]. They cite Win32/Hotbar
as an example of malware, whose activity can be detected
by searching for HTTP requests with a specific User-Agent.
One related problem is the identification of network address
translation (NAT) in the network. A traffic flow-based method
was proposed by Gokcen et al. [16] and Krm´
ıˇ
cek et al. [17].
III. EXP ER IM EN T DESIGN
We designed a three-phase experiment to answer our re-
search questions and verify the idea of using HTTPS client
identification with SSL/TLS fingerprinting. In the first phase,
we set up measurement of live network traffic in the campus
network of Masaryk University. The monitoring was primarily
focused on SSL/TLS connections. In the second phase, we
created a dictionary of SSL/TLS fingerprints and HTTP User-
Agents, based on an analysis of the captured network traffic.
In the third phase, we applied this dictionary to assign User-
Agents to the measured traffic and verify the capabilities of
HTTPS client identification.
A. SSL/TLS Traffic Measurement
We measured live network traffic in the campus network of
Masaryk University. The network has more than 40,000 users
and 15,000 active IP addresses on average per day. We used
a flow-based network probe deployed in a 10 Gbps link that
connects the university and the network of CESNET, Czech
National Research and Education Network (NREN).
The measured network flow represents one-way network
communications defined as a series of packets with a shared
set of L3/L4 parameters: protocol, IP addresses, and port
numbers [18]:
F={proto, srcI P, dstI P, srcP ort, dstP ort }
Since the standard flow does not contain detailed informa-
tion about HTTP and HTTPS traffic, we used two extensions
for flow measurement, which add new elements to the flow
record. The first extension adds a User-Agent (ua) element to
the flow, based on the analysis of the HTTP traffic [19]:
FHT T P ={F∪ {ua} | dstP ort = 80 ua 6=∅ }
The second flow measurement extension adds elements from
the ClientHello message exchanged during the initial SSL/TLS
handshake of the HTTPS connection. We measured only those
elements which do not change with each client connection,
namely the SSL/TLS protocol version (vr), cipher suite list
(cs), compression (cm) and TLS extensions (ex):
Hello ={vr, cs, cm, ex }
FHT T P S ={FHello |dstP or t = 443 cs 6=∅ }
The aim of the measurement is to get the base data for the
subsequent phases of the experiment. In the next phase, we
created a dictionary wchich allows us to transform elements
of the SSL/TLS fingerprint into HTTP User-Agents. This
dictionary was then applied to all the measured data to verify
its usability and gain more information about HTTPS clients.
The secondary aim of the measurement is to get a closer look
at SSL/TLS connections and to obtain basic statistics about
network traffic, primarily focused on SSL/TLS traffic.
B. Pairing Cipher Suite Lists and User-Agents
To identify HTTPS clients, it is necessary to create a
dictionary containing pairs of SSL/TLS handshake elements
and User-Agents. This represents the second phase of our
experiment. We decided to use only a cipher suite list from
the ClientHello message to build up a dictionary. Cipher suite
lists are the most varied elements of the SSL/TLS handshake
and we suppose that they should be sufficient for identifying
clients. Other elements of the handshake only have a few
different values, therefore we do not plan to include them
in the dictionary. However, we assume they could clarify
ambiguous results.
We take two approaches, host-based and flow-based, to pair
a cipher suite list to a User-Agent. The host-based method
uses the information from a single HTTPS connection on the
server side, where the unencrypted data including the HTTP
header, are available. This method is very accurate, but it
requires clients to visit the server where the monitoring is
deployed. We set up a HTTPS server running Apache web
server and SSLhaf plugin [13]. SSLhaf enables us to log
the SSL/TLS parameters of a HTTPS connection. We logged
3
SSL/TLS connection parameters, including the cipher suite list
in a ClientHello message, and the User-Agent from the HTTP
header for each incoming connection. The dictionary is created
with a simple combination of the cipher suite list and the User-
Agent from a single connection.
Fig. 4. A flow-based cipher suite list and User-Agent pairing.
The flow-based method is based on network monitoring,
the extraction of cipher suite lists and User-Agents, and the
correlation of HTTP and HTTPS connections from a single
client, see Fig. 4. We assume that web clients commonly
communicate via both HTTP and HTTPS protocols. SSL/TLS
connections monitoring, as well as HTTP monitoring, is
utilized in this phase of the experiment. The method of pairing
cipher suite lists and User-Agents is described as follows:
Dict ={(cs, ua)| ∃ FH T T P S,FH T T P :
FHTTPS.cs =cs FHT T P .ua =ua
FHTTPS.srcIP =FH T T P .srcI P }
We searched for HTTP and HTTPS connections with the same
source IP address. We selected a cipher suite list from the
HTTPS connections and paired it to the User-Agent from the
HTTP connection which was the closest in time. We assume
that the flow-based approach would better reflect the structure
of live network traffic and allow us to cover more cipher suite
lists observed in the network.
We didn’t expect that the dictionary would provide an
unambiguous translation of one cipher suite list to one User-
Agent, but there would be the one cipher suite list with more
User-Agents and vice versa. However, we assume that User-
Agents assigned to the one cipher suite list will only have
slight differences, such as the software version. Therefore, it
does not affect the identification of general properties, e. g.,
the operating system or web browser. We also expected there
to be some significant deviations caused by ambiguity in the
flow-based approach or, for example, by forged connections
by a malicious crawler pretending to be a legitimate search
engine. In this case, we took only the most similar User-Agents
substrings.
C. Assigning User-Agents to Measured HTTPS Flows
The third phase of our experiment is a conjunction of the
results from the previous phases, as depicted in Fig. 5. We
combined both types of pairs of cipher suite lists and User-
Agents which were generated in the second phase. Then we
applied them to the measured data from the first phase. The
result was HTTPS flows extended by information about the
corresponding User-Agent or list of User-Agents:
F0
HTTPS ={FHT T P S ∪ {ua}| (FHT T P S .cs, ua)Dict}
We plan to validate the results of the assignment and
verify our idea of HTTPS client identification using SSL/TLS
fingerprinting. We are interested in the share of SSL/TLS
traffic for which we are able to assign a correlating User-
Agent. The connections containing cipher suite lists for which
we failed to assign a User-Agent are potentially relevant from
a security perspective. We expected most of the traffic to be
initiated by common web browsers, for which we are able
to easily get the pair of a cipher suite list and User-Agent
using the host-based method. However, we expected that a
combination of host-based and flow-based dictionaries are
needed to cover the majority of the traffic.
Fig. 5. Assigning User-Agents to cipher suite lists.
If multiple User-Agents which correspond to a single cipher
suite list are found, we plan to evaluate what the minimal
set of information provided by the set of User-Agents is.
For example, when multiple User-Agents, which correspond
to the same cipher suite list, point to different versions of
a client application. Similarities in grouped User-Agents can
indicate at least if the client is web browser, what its operating
system is, or if it is a mobile device. Major differences would
otherwise indicate an error in the pairing method.
IV. EXP ER IM EN T RES ULTS
In this section, we shall present the results of the experi-
ment. First, we will sum up the results of measuring SSL/TLS
connections in the campus network. Second, we will describe
the set of User-Agents and cipher suite list pairs obtained via
the host-based and flow-based methods. The section closes
with the results of assigning User-Agents to the SSL/TLS
connection obtained in the first phase of the experiment.
A. Measurement
We conducted measurements over a 7 day period in January
2015. We filtered the HTTPS connections, processed the
SSL/TLS handshakes, and saved the content of ClientHello
messages. The SSL/TLS version, cipher suite list, compres-
sion, and extensions were recorded for each connection. In
total, we processed 85,250,090 HTTPS connections.
TABLE I
DISTRIBUTION OF SSL/TLS VERSIONS
Version Number of Connections
TLS 1.2 49,140,929
TLS 1.0 33,827,182
SSL 3.0 1,365,409
TLS 1.1 913,014
other 3,556
The observed versions are listed in Table I. Over 57 % of
connections used the TLS 1.2 protocol followed by almost
40 % for TLS 1.0. Only 1.6 % of connections used the older
and more vulnerable SSL 3.0 protocol. TLS 1.1 represented
4
around 1 % of connections. The remaining connections were
unrecognized. However, the number of such connections is
insignificant.
We can confirm that only a small number of cipher suites
and cipher suite lists cover the majority of live network traffic.
As we can see in Fig. 6, the top 10 cipher suite lists represent
68.5 % of live network traffic and top 31 (out of 1598) cipher
suite lists are enough to cover 90 % of the traffic.
0%
20%
40%
60%
80%
90%
100%
0 10 20 40 60 80 100 120 140
Portion of traffic
Top X cipher suite lists
Fig. 6. Network traffic represented by Top X cipher suite lists.
B. Pairing Cipher Suite Lists and User-Agents
First, we created a base set of pairs using the host-based
method and SSLhaf. We manually contacted the monitoring
server with common available clients, such as web browsers
and tools such as curl [20], to create an initial dataset. We then
made the server publicly accessible and spread the links to lure
more clients, such as web crawlers. In total, we obtained 72
unique cipher suite lists and 293 unique User-Agents, forming
307 pairs. Multiple User-Agents, with the same cipher suite
list, were similar in most cases. The differences were usually
in the version of the client in the User-Agent.
Then, we moved on to the flow-based method, i. e., com-
bined monitoring of cipher suite lists from SSL/TLS hand-
shakes in HTTPS connections and the User-Agents from
HTTP connections. We analysed a 1-hour sample of peak
network traffic from our campus network and selected the
hosts which initiated both HTTP and HTTPS connections.
User-Agents from HTTP connections, and cipher suite lists
(from HTTPS connections), from the same client created a
new pair. We observed 10,890 clients communicating on both
protocols in a short period of time, 305 unique cipher suite
lists, and 5,043 unique User-Agents. In total, we derived
12,832 unique pairs during the measurement.
Following this, we investigated the relationship between
cipher suite lists and User-Agents by determining the cardi-
nality of the relationship. Both methods provided more User-
Agents which correspond to one cipher suite list, i. e., a 1:n
relation. After a manual inspection, we discovered, that these
User-Agents differ mostly in the system versions while the
information about the client, e. g. browser type, stays more
or less constant. Therefore, it is possible to identify a client
with high accuracy. The flow-based method also generated
a single User-Agent which corresponds to more cipher suite
lists. However, this is most likely caused by inaccuracy in the
method which cannot distinguish more clients communicating
at the same time.
C. Assigning User-Agents to Measured HTTPS Flows
We first used the dictionary provided by the host-based
method and then filled in the supplemented results with a
dictionary provided by the flow-based method. The host-
based dictionary contains only 72 unique cipher suite lists,
which represents 4.5 % of all cipher suite lists measured
during the first phase. However, we observed that those unique
cipher suite lists cover 78.0 % of all the measured HTTPS
flows. When we combined the host-based and flow-based
dictionaries, we obtained 316 unique cipher suite lists (19.8 %
of all) covering 99.6 % of the measured HTTPS connections.
Therefore, we assigned a User-Agent to almost all observed
HTTPS connections using a combined dictionary based on
data from a single server, and the correlations from a 1-hour
sample of network traffic.
As we already mentioned, multiple User-Agents were as-
signed to one cipher suite list, which causes ambiguity in
the translation. Even hundreds of different User-Agents were
found to correspond to a single cipher suite list. However,
we discovered that multiple User-Agents assigned to a single
cipher suite list differ only in details, such as the version of
the software used. Fig. 7 shows the results of the User-Agent
assignment to the measured HTTPS connections according to
the level of certainty. Certainty represents the number of User-
Agents per one cipher suite list.
0
100
200
300
400
500
600
700
800
012345678910ALL
0%
20%
40%
60%
80%
100%
Number of unique pairs
Cumulative portion of traffic identified
Number of User-Agents per cipher suite list
Portion of traffic identified
Number of unique pairs
Fig. 7. Relation of dictionary size and covered portion of network traffic.
The results show that in the event of unambiguous pairing,
i. e., one User-Agent per one cipher suite list, the dictionary
contains 104 unique pairs and covers only 6.3 % of all the
HTTPS flows measured. If we gradually decrease the level of
certainty, we are able to cover more cipher suite lists and a
greater proportion of HTTPS flows. If we use up to 10 User-
Agents per one cipher suite list, we are able to cover 66.0 % of
all HTTP flows, using 704 unique pairs with 253 unique cipher
suite lists. In this case, User-Agents are relatively different,
nevertheless, we are able to derive a general identification from
the client, e. g., if it is a web browser, mobile device, or web
crawler.
V. EX PE RI ME NT EVAL UATIO N
In this section, we shall discuss the experiment’s circum-
stances and results. First, we evaluate the measurement phase
5
and shares of protocol versions and cipher suite lists in the ob-
served network traffic. Second, the quality of dictionaries used
for client identification is discussed. Methods are proposed
which can be used to increase the accuracy of the dictionary.
Finally, we shall evaluate the structure of the network traffic
according to the estimated User-Agents and compare our
results to the related work to estimate the credibility of our
results.
A. Measurement
The plain measurement results were similar to our expecta-
tions. We analysed the shares of SSL/TLS versions and cipher
suite lists in the monitored connections. It is not surprising that
the majority of the HTTPS connections use the latest TLS 1.2
protocol. However, such a high share of TLS 1.0 should not
remain unnoticed. We further confirmed that the majority of
SSL/TLS connections are represented only by a small number
of cipher suites and cipher suite lists.
The interesting figure is the 1.6 % share of SSL 3.0 in the
observed HTTPS connections. The SSL 3.0 protocol is no
longer considered safe due to serious vulnerabilities discovered
in 2014, such as the POODLE attack [2]. We naturally wonder
if the discovery of serious vulnerability in a protocol leads to a
decrease in its usage. In comparison to earlier results, the share
of connections initiated over SSL is decreasing. For example,
Levillain et al. [9] reported 5 % share of SSL 3.0 in 2011.
B. Pairing Cipher Suite Lists and User-Agents
The host-based and flow-based method of pairing cipher
suite lists and User-Agents provided diverse, but complemen-
tary, results. The host-based method was more precise and
feasible in a controlled environment. However, the quantity
of data depends on the popularity of the server where the
monitoring is deployed. It could be interesting to perform
host-based monitoring on a popular web server with a variety
of clients. However, even the high attractiveness of a server
does not guarantee capturing traffic from all the common
clients in the network. Client applications like Spotify and
Instagram, which communicate only to specific servers, are
typical examples.
The flow-based method, focusing directly on clients, pro-
vided more pairs than the host-based method but at the cost
of uncertainty. However, clients like web crawlers, uncommon
clients and even suspicious ones, are hard to capture without
access to a live network. The 1-hour sample of network traffic
was sufficient to capture connections from almost all the
different clients observed over a longer period of time.
The combination of both methods provided a usable dictio-
nary which was sufficient for the needs of our experiment. The
pairs obtained via the host-based method were also obtained
via the flow-based method, which suggests that the flow-based
method can provide acceptable results.
C. Assigning User-Agents to the Measurement Results
The assignment of User-Agents to the observed HTTPS
connections was successful. The size of the dictionary was
sufficient to describe almost every cipher suite list observed
during the measurement. The number of connections with an
unknown cipher suite list was negligible. The accuracy of
the assignment can be disputed because more User-Agents
corresponded to a single cipher suite list. However, multiple
User-Agents with the same cipher suite list are typically
similar.
To improve the quality and accuracy of the assignment, we
have to improve the dictionary. We do not expect to gain more
results from the host-based method without distributing the
measurement among more attractive HTTPS servers. However,
we can improve the quality of results in the flow-based
method. First, we can manually process the multiple User-
Agents assigned to a single cipher suite list and select the
minimal common identifier in the event of their similarity, e. g.,
only the name of the web browser instead of its name, version,
and operating system. Second, we can repeat the measurement
to get a larger set of pairs and enrich the dictionary by the
most statistically significant ones, e. g., the pairs that appear
repeatedly.
VI. DISCUSSION
In this section we shall discuss the data which can be
derived from a cipher suite list (and its corresponding User-
Agent) and their application. We will also present a breakdown
of the dictionary according to identifiers found in User-Agents,
e. g., types of a client application or a device. An application
of the results is demonstrated on the case studies outlined in
Section I.
A. Grouping of Client Types
As our experiment shows, we can assign a User-Agent to a
known cipher suite list. However, the assignment is not exact
as there are typically multiple User-Agents which correspond
to a single cipher suite list. The multiple User-Agents are
typically similar and we can extract common information, e. g.,
a type of client, from them.
Fig. 8. Shares of HTTPS client types in the dictionary.
We extracted two pieces of information from the User-
Agents, device type and application type. First, we distin-
guished three device types: desktop, mobile, and unknown
device. Second, we distinguished five application types: web
6
browser, GUI application, command line application, auto-
matic update, and unknown application. In both cases, the
unknown value means that we are not able to determine the
device or application type from the given User-Agent. The
final client type is a concatenation of device and application
type. Major client types, such as desktop browsers, could
have been further split into subtypes for particular browsers,
however, the results of such fine-grained classification are not
presented.
The share of client types in the dictionary is presented in
Fig. 8. This figure represents only the structure of a dictionary,
not the relevance of particular client types. However, we
can see significant shares of client types which are hard to
detect using a host-based pairing method. Desktop and mobile
applications typically communicate only to specific servers
with a specific service. This demonstrates the contribution of
the flow-based pairing method.
The shares of client types in the live network traffic are
presented in Fig. 9. As we can see, more than half of the
connections were initiated by desktop browsers. Desktop,
mobile, and unknown device browsers together stand for the
vast majority of the network traffic, which is as could be
expected. One interesting figure is the relatively high amount
of traffic initiated by mobile applications.
Fig. 9. Shares of HTTPS client types in a live network traffic.
B. Application of SSL/TLS Fingerprinting on Case Studies
In the first example, we have to identify clients which
access a server. We do not have access to the server and the
server provides only HTTPS. Therefore, the only option is to
fingerprint the HTTPS connections. Although the correlation
between a cipher suite list and a User-Agent is not exact, we
can identify common legitimate traffic and highlight poten-
tially malicious connections. For example, if a server is hosting
a common web page, legitimate clients are web browsers and
other GUI clients. The presence of connections initiated by
command line tools indicates unusual traffic, e. g., connections
initiated by malware and an automated communication (click
fraud). Conversely, a GUI client that connects to dedicated de-
vices, e. g., SIP devices and hosts in a technological network,
indicate potentially malicious traffic.
In the second example we have to enumerate and identify
clients behind a NAT. Enumeration and identification of oper-
ating systems behind a NAT is possible via TCP/IP fingerprint-
ing [21]. However, an analysis of User-Agents provides better
results, which may even lead to the identification of individual
users. As in the previous example, a significant share of
HTTPS traffic makes the User-Agent analysis unsuitable. The
clients do not have to use HTTP at all, or the monitoring time
window can be so small that there is no HTTP traffic. Using
our proposed approach, we can assign the User-Agent to a
HTTPS cipher suite list or at least estimate which type of a
client it is. We are thus able to detect the activity of users with
a specific web browser or detect rogue devices, e.g., mobile
devices in a network where only desktops should appear.
VII. CONCLUSION
In this paper, we have shown that it is possible to estimate
the User-Agent of a client in HTTPS communication. This is
done for further identifying the client using network monitor-
ing and fingerprinting the SSL/TLS handshake, which is the
main contribution of this paper. We designed an experiment in
which we measured HTTPS traffic in a campus network. We
processed only the initial SSL/TLS handshake in which the
client and server negotiate the parameters of the encryption.
Therefore, our approach is lightweight and avoids decrypting
traffic.
First, we investigated the parameters of the SSL/TLS hand-
shake, which can be used to identify the client. The client
identifies itself in a ClientHello message during the handshake.
The most varied part of the ClientHello was the list of cipher
suite lists supported by the client. The cipher suite list differs
among various client applications and their versions, which
makes them suitable for further identification. In total, we
observed 305 unique cipher suite lists during our measurement.
The other parts of the ClientHello message, such as the
SSL/TLS version, compression, and supported extensions, are
interesting for analysis, but unusable for client identification
due to the limited number of distinct values.
Second, we studied the relationship between SSL/TLS ci-
pher suite lists and HTTP User-Agents. The User-Agent is
a common client identifier in HTTP. However, in HTTPS, it
is not accessible without decrypting the transferred data. We
deployed two methods for monitoring SSL/TLS handshakes
and HTTP headers simultaneously in order to pair cipher
suite lists and User-Agents. The host-based method, i. e.,
measurement on the server side, provided accurate results.
However, this method is limited by the set of clients accessing
the monitoring server and we obtained a smaller number of
pairs. The flow-based method uses network monitoring and
is not limited to a single server. We were looking for clients
communicating on HTTP and HTTPS protocols over a short
period of time, and paired the observed cipher suite lists
and User-Agents from both connections. We gained a large
dictionary of more than 12,000 pairs. However, this method is
less accurate compared to the host-based method.
Third, we assigned the corresponding User-Agents from
the dictionary to the results from monitoring the SSL/TLS
connections and discussed the required size and accuracy of
7
the dictionary. We found that we need a dictionary of about
300 cipher suite lists with assigned User-Agents. Therefore,
the dictionary which was created using the host-based method
was not sufficient to cover all the distinct cipher suite lists
which appear in network traffic. On the other hand, only a
1-hour sample of the HTTPS traffic contained almost all the
cipher suite lists which were observed over the week-long
measurement. This led us to use the dictionary created via
the flow-based method. However, many cipher suite lists were
paired with more than one User-Agent. We were able to assign
a User-Agent to almost every observed cipher suite list with a
certain level of probability. Fortunately, in many cases a lot of
User-Agents which corresponded to a single cipher suite list
share the same client identifier, and differ only in their version
or a similarly attainable value.
In conclusion, our work enhances the capabilities of network
forensics by introducing the network-based identification of
HTTPS clients. Our network-based approach is lightweight,
not limited to a single server, and does not approach the en-
crypted data. Therefore, we can identify clients while preserv-
ing the communication’s privacy. Our results are applicable for
identifying clients in the network, detecting the activity of a
specific client, and breaking down the structure of HTTPS
traffic in a whole network. This was demonstrated in the
experiment and two case studies of network forensics.
ACK NOW LE DG EM EN T
This material is based on work supported by the Security
Research for the Needs of the State 2010–2015 programme
funded by the Ministry of the Interior of the Czech Republic.
REFERENCES
[1] M. Majkowski, “SSL fingerprinting for p0f,” Web page, June 2012,
accessed 2015-01-28. [Online]. Available: https://idea.popcount.org/
2012-06- 17-ssl- fingerprinting-for-p0f/
[2] B. M¨
oller, T. Duong, and K. Kotowicz, “This POODLE Bites:
Exploiting The SSL 3.0 Fallback,” PDF online, 2014, accessed
2015-01-12. [Online]. Available: https://poodlebleed.com/ssl-poodle.pdf
[3] T. Dierks and E. Rescorla, “The Transport Layer Security (TLS) Protocol
Version 1.2,” RFC 5246 (Proposed Standard), Internet Engineering Task
Force, Aug. 2008, updated by RFCs 5746, 5878, 6176.
[4] A. Freier, P. Karlton, and P. Kocher, “The Secure Sockets Layer (SSL)
Protocol Version 3.0,” RFC 6101 (Historic), Internet Engineering Task
Force, Aug. 2011.
[5] E. Rescorla, “HTTP Over TLS,” RFC 2818 (Informational), Internet
Engineering Task Force, May 2000, updated by RFCs 5785, 7230.
[6] IANA – Internet Assigned Numbers Authority, “Protocol Registries,”
Web page, 2014, accessed 2015-01-28. [Online]. Available: http:
//www.iana.org/protocols
[7] D. Cooper, S. Santesson, S. Farrell, S. Boeyen, R. Housley, and W. Polk,
“Internet X.509 Public Key Infrastructure Certificate and Certificate
Revocation List (CRL) Profile,” RFC 5280 (Proposed Standard), Internet
Engineering Task Force, May 2008, updated by RFC 6818.
[8] C. Meyer, “20 Years of SSL/TLS Research: An Analysis of
the Internet’s Security Foundation,” Ph.D. dissertation, Ruhr-
University Bochum, February 2014, accessed 2015-01-15. [Online].
Available: http://www-brs.ub.ruhr-uni- bochum.de/netahtml/HSS/Diss/
MeyerChristopher/diss.pdf
[9] O. Levillain, A. ´
Ebalard, B. Morin, and H. Debar, “One Year of SSL
Internet Measurement,” in Proceedings of the 28th Annual Computer
Security Applications Conference, ser. ACSAC ’12. New York, NY,
USA: ACM, 2012, pp. 11–20.
[10] R. Holz, L. Braun, N. Kammenhuber, and G. Carle, “The SSL Land-
scape: A Thorough Analysis of the x.509 PKI Using Active and
Passive Measurements,” in Proceedings of the 2011 ACM SIGCOMM
Conference on Internet Measurement Conference, ser. IMC ’11. New
York, NY, USA: ACM, 2011, pp. 427–444.
[11] Z. Durumeric, J. Kasten, M. Bailey, and J. A. Halderman, “Analysis
of the HTTPS Certificate Ecosystem,” in Proceedings of the 2013
Conference on Internet Measurement Conference, ser. IMC ’13. New
York, NY, USA: ACM, 2013, pp. 291–304.
[12] Qualys SSL Lab, “HTTP Client Fingerprinting Using SSL Handshake
Analysis,” Web page, 2014, accessed 2015-01-23. [Online]. Available:
https://www.ssllabs.com/projects/client-fingerprinting/
[13] I. Risti´
c, “Passive SSL client fingerprinting using handshake analysis,”
GitHub repository, 2014, accessed 2015-01-30. [Online]. Available:
https://github.com/ssllabs/sslhaf
[14] L. Bernaille and R. Teixeira, “Early Recognition of Encrypted Applica-
tions,” in Passive and Active Network Measurement, ser. Lecture Notes
in Computer Science, S. Uhlig, K. Papagiannaki, and O. Bonaventure,
Eds. Springer Berlin Heidelberg, 2007, vol. 4427, pp. 165–175.
[15] E. Raftopoulos and X. Dimitropoulos, “Understanding network foren-
sics analysis in an operational environment,” in Security and Privacy
Workshops (SPW), 2013 IEEE. IEEE, 2013, pp. 111–118.
[16] Y. Gokcen, V. A. Foroushani, and A. Heywood, “Can we identify
NAT behavior by analyzing Traffic Flows?” in Security and Privacy
Workshops (SPW), 2014 IEEE. IEEE, 2014, pp. 132–139.
[17] V. Krm´
ıˇ
cek, J. Vykopal, and R. Krejˇ
c´
ı, “Netflow Based System for NAT
Detection,” in Proceedings of the 5th International Student Workshop
on Emerging Networking Experiments and Technologies, ser. Co-Next
Student Workshop ’09. New York, NY, USA: ACM, 2009, pp. 23–24.
[18] R. Hofstede, P. ˇ
Celeda, B. Trammell, I. Drago, R. Sadre, A. Sperotto,
and A. Pras, “Flow Monitoring Explained: From Packet Capture to Data
Analysis with NetFlow and IPFIX,” Communications Surveys Tutorials,
IEEE, vol. 16, no. 4, pp. 2037–2064, Fourthquarter 2014.
[19] P. Velan, T. Jirs´
ık, and P. ˇ
Celeda, “Design and Evaluation of HTTP Pro-
tocol Parsers for IPFIX Measurement,” in Advances in Communication
Networking, T. Bauschert, Ed., vol. 8115. Heidelberg: Springer Berlin
Heidelberg, 2013, pp. 136–147.
[20] cURL Contributors, “cURL - command line tool and library for
transferring data with URL syntax,” 2015, accessed 2015-01-25.
[Online]. Available: http://curl.haxx.se/
[21] R. Beverly, “A Robust Classifier for Passive TCP/IP Fingerprinting,”
in Passive and Active Network Measurement, ser. Lecture Notes in
Computer Science, C. Barakat and I. Pratt, Eds. Springer Berlin
Heidelberg, 2004, vol. 3015, pp. 158–167.
8