Accurate TLS Fingerprinting using Destination Context and
Knowledge Bases
Blake Anderson
Cisco
blake.anderson@cisco.com
David McGrew
Cisco
mcgrew@cisco.com
ABSTRACT
Network fingerprinting is used to identify applications, provide
insight into network traffic, and detect malicious activity. With
the broad adoption of TLS, traditional fingerprinting techniques
that rely on clear-text data are no longer viable. TLS-specific
techniques have been introduced that create a fingerprint string from
carefully selected data features in the client_hello to facilitate
process identification before data is exchanged. Unfortunately, this
approach fails in practice because hundreds of processes can map to
the same fingerprint string. We solve this problem by presenting a
TLS fingerprinting system that makes use of the destination address,
port, and server name in addition to a carefully constructed
fingerprint string. The destination context is used to disambiguate the
set of processes that match a fingerprint string by applying a weighted
naïve Bayes classifier, resulting in far greater performance.
Our methods are made possible by a data fusion system that
continuously collects and fuses host and network data, building
up-to-date fingerprint knowledge bases that correlate TLS
fingerprint strings, processes, and destinations for 50+ million real-world
sessions each day. Using data collected from two geographically
distinct sites and a malware analysis sandbox, we demonstrate
that our solution can achieve an F1 score of greater than 0.99 for
process identification and high efficacy malware detection with
99.9% precision and 88.7% recall. We provide specific results for
the set of most common processes and a set of cloud orchestration
tools. In the case of no exact fingerprint string matches, we
demonstrate that our system can accommodate approximate fingerprint
string matching with an F1 score of 0.90. Finally, we have released
an open source tool, mercury [38], that implements the proposed
techniques and provide weekly updates to an open source TLS
fingerprint knowledge base to assist reproducibility of our work.
CCS CONCEPTS
• Security and privacy → Network security; Malware and its
mitigation;
KEYWORDS
TLS Fingerprinting; Process Identification; Malware
1 INTRODUCTION
Process identification from network traffic aids many use cases
including network segmentation, malware detection, and vulnerable
application detection. The HTTP User-Agent [30] has been
used as a proxy for process identification, but with the increasing
use of Transport Layer Security (TLS) [26, 42], methods that rely
on clear-text data have become obsolete. Existing solutions have
been proposed for identifying processes that have initiated TLS
connections [46, 47], but these solutions must observe complete
sessions, making them unsuitable for real-time enforcement.
TLS fingerprinting has been proposed as a technique to enable
real-time enforcement by providing the initiating process’s identity
after observing the client’s initial TLS packet. Traditional TLS
fingerprinting extracts metadata presented in the TLS client_hello and
generates a fingerprint string using a pre-defined schema. These
techniques are relevant for all versions of the TLS protocol,
including TLS 1.3 [42] where all needed data features are still
presented unencrypted. Given a fingerprint string, TLS fingerprinting
then maps that string to a process by using a dictionary of known
fingerprint-to-process mappings. Unfortunately, TLS fingerprint
strings are often more indicative of a TLS library than they are of a
specific process, with fingerprint strings often mapping to tens or
hundreds of unique processes.
This limitation can be seen by analyzing publicly available malware
fingerprint feeds. Abuse.ch, which cautions that their feed has
“not been tested against known good traffic yet and may cause a
significant amount of FPs”, provides a list of 70 JA3 [15] hashes used
by malware. We reverse engineered 68 of the hashes and mapped
them to our data format. 59 of the indicators were more strongly
associated with benign processes, such as Internet Explorer, Python,
and Java, than they were with malware. 12 of the indicators led to
1 million or more false positives during a 10-day period for one of
our testing sites. While the TLS fingerprint string taken by itself is
often a poor indicator, additional contextual information can help
to increase performance.
In this paper, we generalize TLS fingerprinting by incorporating
contextual information contained within the client_hello
packet. Our approach uses the destination IP address, port, and
server_name value (if available) to disambiguate potential
processes. We define equivalence classes for the destination features
to help generalize to unseen destination values. As an example, the
classification system uses both the IP address and the autonomous
system of the IP address. We combine the features using a simple
weighted naïve Bayes classifier, which relies on probability
estimates provided by our fingerprint knowledge base. We show that
our approach of simultaneously considering the TLS fingerprint
string and the destination information is a significant improvement
over systems based solely on the fingerprint string or
the destination information.
An underlying assumption of TLS fingerprinting is that there
exists a well-curated database that maps TLS fingerprint strings
to process or library names. This is especially true for our work,
where the naïve Bayes algorithm requires a knowledge base that
provides prevalence information for each process associated with a
fingerprint string, along with counts for each destination feature
that a given process visited while using a specific TLS fingerprint
arXiv:2009.01939v1 [cs.CR] 3 Sep 2020
string. We built a system that continuously fuses real-world host
and network data, which we use to build a TLS knowledge base
that reflects the most recent TLS usage on the monitored networks
based on billions of connections. Furthermore, the automated nature
of our knowledge base generation ensures that our system stays
current with destination information that frequently changes.
To demonstrate the efficacy of the proposed solution, the clas-
sifier is trained on data collected from a site in GMT+19 during
the month of May 2020 and applied to data collected during the
first ten days of June 2020 from the same site as well as a site in
GMT+8. We provide the process identification F1 score, as well as
precision/recall results highlighting how the classifier performs on
the top applications and cloud orchestration and processing tools.
Malware’s use of TLS has been well-documented [19], with many
malware families prone to using standard libraries as discussed
above. We use the results of a malware analysis sandbox to demon-
strate that our techniques are able to achieve 99.9% precision and
88.7% recall for the malware detection task.
Operational concerns or parameters are often ignored in the
construction and application of TLS fingerprint databases. TLS
fingerprint strings and the way applications use TLS are constantly
evolving [18, 36], leading to a steady introduction of new
fingerprint strings. Accommodating approximate matching helps manage
unseen fingerprint strings, and we show that an approximate matching
scheme based on edit distance can still achieve an F1 score of up
to 0.902 on sessions with an unknown fingerprint string. We also
demonstrate the importance of a well-tuned knowledge base by
looking at the classifier’s performance when it is not kept up to
date. A knowledge base in our format that captures all process and
destination information for hundreds of billions of sessions can be
up to several hundred megabytes, reducing the efficiency of online
classification. We show that incorporating data that is older than
one month reduces classification performance and unnecessarily
increases the size of the database.
To support this work, we developed an open source C/C++ tool,
mercury [38], that uses Linux’s AF_PACKET TPACKET_V3 zero-copy
shared memory ring buffers to collect and classify data on network
links with capacities of 30 Gbps+. We also released a Python-based
implementation to facilitate rapid prototyping. We are committed
to releasing up-to-date TLS fingerprint knowledge bases, which
we have done weekly during the first seven months of
2020. Some information has been removed from the open source
fingerprint knowledge base relative to the internal knowledge bases,
and we highlight the difference in expected performance in Section 7.
Our novel contributions include:
• The introduction of a TLS fingerprinting system that
  incorporates destination information and accommodates
  approximate matching to provide state-of-the-art process and
  malware identification results.
• A study of the impact of different knowledge base
  configuration options on the classifier’s performance.
• An open source tool that implements all presented
  techniques along with a TLS fingerprint knowledge base that
  is updated weekly and currently has over 8,400 TLS
  fingerprint strings with detailed process and destination
  information.
[Figure 1 diagram: the function f maps the inputs (Version: 0303;
CipherSuites: 0a0a. . .; Extensions: 0000. . .; server_name: google.com;
destination IP: 8.8.8.8; destination port: 443) to the labels
(Process: chrome.exe; Version: 79.0.3945; SHA-256: 5616. . .;
Category: browser; Malware: False; OS: WinNT;
OS version: 10.0.18363; OS edition: Enterprise).]
Figure 1: Classical TLS fingerprinting aims to map
parameters extracted from the client_hello to a set of informative
labels such as the process name or operating system. Gray
features are only used in generalized TLS fingerprinting.
2 TLS FINGERPRINTING
Transport Layer Security (TLS) [26, 42] is the most popular
protocol to secure communications over the Internet. A client begins a
TLS session by sending a client_hello message, which contains
a TLS version, a list of supported cipher suites ordered by
preference, and a list of extensions that provide additional context, e.g.,
the server_name extension provides the DNS hostname so that
front-end servers can route connections without having to perform
decryption [28]. The server then responds with a server_hello
message that selects a set of cryptographic parameters offered by
the client and a certificate message proving the server’s
identity. The server then initiates the key exchange, after which the
client and server send finished messages and begin exchanging
encrypted data.
TLS fingerprinting operates on the initial client_hello message,
allowing for real-time policy enforcement before the TLS
handshake completes or encrypted data is exchanged. Figure 1
shows a simple example of an idealized TLS fingerprinting system,
where the goal is to learn some function, f, that maps parameters
offered in the client_hello to a set of informative labels such as
the process name or operating system. In this work, we focus on
identifying the process name.
Several goals are important for the generation of a TLS finger-
print string. The fingerprint string format should be unambiguous
and reversible. It should provide the most discriminating power
possible, taking advantage of every informative data feature. It should
be able to accommodate complex patterns such as TLS GREASE
[21]. Lastly, it should be computationally inexpensive, and robust
against Denial of Service (DoS) attacks that aim to exhaust system
resources. Our system uses the following fingerprint schema:
(version)(cipher suites)((ext1)(ext2)...)
where each field is a hex string corresponding to the bytes
observed in the client_hello to facilitate reversibility, i.e., it is
possible to determine substrings that were in the packet from
the fingerprint string. To help ensure discriminating power,
all cipher suite and extension orderings are maintained, and
((ext1)(ext2)...) includes the data for 21 extensions along with
all type codes. The data associated with session-specific extensions
is omitted, e.g., server_name and key_share, and data associated
with client-specific parameters is kept, e.g., supported_groups,
supported_versions, and compress_certificate. A full list of
extensions that retain their data in the fingerprint schema is given
in Appendix A. Along with normalizing session-specific extension
data, GREASE [21] cipher suites, extension types, and extension
data values are normalized to the value 0a0a, but their ordering is
preserved. Finally, we avoid any cryptographic computations on
the fingerprint string itself, such as computing an MD5 hash, for
computational efficiency.
Figure 2: Data fusion system that correlates network and
host logs in order to attribute network sessions to processes.
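The fingerprint schema described above can be illustrated with a minimal
Python sketch. It is not the mercury implementation: the function names
are hypothetical, KEEP_DATA is only a three-entry subset standing in for
the Appendix A list, and the assumption that a retained extension is
emitted as type, length, and data is ours. GREASE normalization inside
extension data is omitted for brevity.

```python
# Hypothetical sketch of the (version)(cipher suites)((ext1)(ext2)...) schema.
# Extension type codes are IANA values; KEEP_DATA is an illustrative subset,
# not the full Appendix A list.
KEEP_DATA = {
    0x000A,  # supported_groups
    0x002B,  # supported_versions
    0x001B,  # compress_certificate
}


def is_grease(value: int) -> bool:
    """GREASE code points have two equal bytes whose low nibble is 0xa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def norm(value: int) -> str:
    """Hex-encode a 16-bit code point, mapping any GREASE value to 0a0a."""
    return "0a0a" if is_grease(value) else f"{value:04x}"


def fingerprint(version: int, cipher_suites: list[int],
                extensions: list[tuple[int, bytes]]) -> str:
    """Build a fingerprint string; ordering of all fields is preserved."""
    ciphers = "".join(norm(c) for c in cipher_suites)
    fields = []
    for ext_type, data in extensions:
        field = norm(ext_type)
        if ext_type in KEEP_DATA:
            # Assumption: client-specific extensions keep length and data.
            field += f"{len(data):04x}" + data.hex()
        fields.append(f"({field})")
    return f"({version:04x})({ciphers})({''.join(fields)})"
```

Because the output is plain hex with explicit grouping, the observed
cipher suites and extension types can be read back directly from the
string, which is the reversibility property the schema aims for.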
Given a TLS fingerprint string, traditional fingerprinting systems
would return a single process or set of processes that have been
observed using that fingerprint string as defined in a fingerprint
database. Unfortunately, TLS fingerprint strings often map to many
processes due to those processes using the same underlying TLS
library. For example, in the data we collected during the month
of May 2020, the median number of unique process names for the
top-100 most prevalent fingerprint strings was 24.5. To provide
actionable intelligence, a TLS fingerprinting system needs signifi-
cantly more specificity than current approaches provide.
3 GENERALIZED TLS FINGERPRINTING
To overcome the limitations of previous systems, we put forward
an approach to generalize TLS fingerprinting as explained in this
section. Our approach is centered around the construction of a TLS
fingerprint knowledge base that relies on continuous, large-scale
data collection, curation, and fusion. We also propose an approxi-
mate matching scheme to accommodate the introduction of new
fingerprint strings for completeness. To further improve the robust-
ness of our system, we introduce equivalence classes on destination
features to better handle unseen destination/process combinations.
Finally, given a match in our knowledge base, we use a weighted
naïve Bayes classifier to provide the most probable process name
given the list of potential processes and their destinations.
[Figure 3 diagram: packets, protocol identification, fingerprint
extraction, substring normalization, then exact matcher or
approximate matcher.]
Figure 3: Control flow for online matching.
3.1 Knowledge Base Generation
Traditional TLS fingerprint databases provide mappings from TLS
fingerprint strings to processes, libraries, or operating systems,
but lack the context needed to disambiguate the results. The TLS
fingerprint knowledge base provides this context by associating
each TLS fingerprint string with a list of processes observed using
it, along with destination and operating system information. We
have built a data fusion system as shown in Figure 2 to collect this
data. For the host data, we use records sent by the AnyConnect
Network Visibility Module (NVM) [3], which contain the network
5-tuple, an event start timestamp, the name of the communicating
process, the SHA-256 hash of the process executable, and the host’s
operating system. For the network data, we use our custom open
source tool, mercury [38], which reports the TLS fingerprint string
along with the network 5-tuple and an event start timestamp.
The host and network data are joined daily by the network
5-tuple and event start timestamps. If there are network 5-tuple
collisions, the records with the minimal timestamp delta are joined.
We discard records with timestamp deltas greater than 5 seconds,
which resulted in less than a 0.1% data reduction. The joined records
contain the TLS fingerprint string, destination information such
as the IP address and server_name extension value, and process
attribution information.
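The join described above can be sketched in a few lines of Python. This
is a minimal illustration under assumed record layouts: the dictionary
keys (five_tuple, ts, process, sha256) and the function name are
hypothetical and do not reflect the NVM or mercury output formats.

```python
from collections import defaultdict

MAX_DELTA = 5.0  # seconds; pairs further apart than this are discarded


def join_records(host_records, net_records, max_delta=MAX_DELTA):
    """Join network records to host records on the 5-tuple.

    When several host records share a 5-tuple, the one with the smallest
    timestamp difference is chosen.
    """
    by_tuple = defaultdict(list)
    for h in host_records:
        by_tuple[h["five_tuple"]].append(h)

    joined = []
    for n in net_records:
        candidates = by_tuple.get(n["five_tuple"], [])
        if not candidates:
            continue  # no host log for this session
        best = min(candidates, key=lambda h: abs(h["ts"] - n["ts"]))
        if abs(best["ts"] - n["ts"]) > max_delta:
            continue  # timestamps too far apart to attribute reliably
        joined.append({**n, "process": best["process"], "sha256": best["sha256"]})
    return joined
```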
The fused records are then used to condition a knowledge base.
Records are grouped by their TLS fingerprint string, collecting a
list of all associated processes. For each process, we record the total
number of sessions and all destinations with associated counts. Each
destination is represented as a 3-tuple comprised of the IP address,
port, and server_name value (if present). Separate knowledge bases
are generated for each day’s traffic, and then merged to create an
operational knowledge base. We use this flexibility to discard older
data as described in Section 6.
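A minimal sketch of this grouping and merging step is shown below. The
nested-dictionary layout and the function names are assumptions chosen
for illustration; the released knowledge base format may differ.

```python
from collections import Counter, defaultdict


def _new_kb():
    # fingerprint -> process -> {"count": sessions, "dst": destination counts}
    return defaultdict(lambda: defaultdict(lambda: {"count": 0, "dst": Counter()}))


def build_knowledge_base(joined_records):
    """Group joined records by fingerprint string, then by process."""
    kb = _new_kb()
    for r in joined_records:
        entry = kb[r["fingerprint"]][r["process"]]
        entry["count"] += 1
        # Destination is the (IP, port, server_name) 3-tuple; server_name may be absent.
        entry["dst"][(r["dst_ip"], r["dst_port"], r.get("server_name"))] += 1
    return kb


def merge(daily_kbs):
    """Merge per-day knowledge bases by summing all counts."""
    merged = _new_kb()
    for kb in daily_kbs:
        for fp, procs in kb.items():
            for proc, entry in procs.items():
                merged[fp][proc]["count"] += entry["count"]
                merged[fp][proc]["dst"].update(entry["dst"])
    return merged
```

Keeping one knowledge base per day means that dropping stale data is
just a matter of leaving older days out of the list passed to merge.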
In addition to real-world host and network data, we also generate
knowledge bases from the artifacts of a malware analysis sandbox.
The joining procedure is similar but relies on packet captures and
analysis files to create the joined records.
3.2 Approximate Matching
TLS fingerprint strings are constantly evolving. Kotzias et al. [36]
demonstrated this evolution and how the disclosure of security
flaws in TLS impacts the TLS versions, cipher suites, and extensions
clients offer. Our full knowledge base, collected between July 2019
and June 2020, contains over 8,000 fingerprint strings with process
information, but we still consistently add 10-20 new fingerprint
strings with process information per day.
Similar to previous work [31], we dealt with this issue by
implementing a fingerprint string similarity metric based on the
Levenshtein distance, which measures the number of cipher suite and
extension insertions, deletions, or substitutions that are needed
to transform one fingerprint string into another. When we see a
fingerprint string that is missing from our database during online
analysis, we compute the Levenshtein distance between it and each
known fingerprint string and select the known fingerprint string with
the minimal distance. The fingerprint string’s prevalence is used to
break any ties. Results specific to approximate matching are given in
Section 5.4.
Figure 3 demonstrates the system’s control flow from observ-
ing a packet on the wire to reporting the appropriate entry in the
knowledge base. We first perform protocol identification using a
bit mask over the first 8 bytes of application data, matching known
TLS versions and the client_hello’s record and message identifier.
We then extract the fingerprint string conforming to the schema
described in Section 2, and normalize session-specific extension
data and GREASE values. We then report the exact match if the
fingerprint string is currently in our knowledge base or use the
approximate matching technique as described above. While approx-
imate matching is computationally expensive, we store the results
in the knowledge base which leads to a low amortized cost.
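The approximate matcher can be sketched as follows. This is an
illustrative Python version, not the mercury code: fingerprints are
assumed to be pre-tokenized into tuples of cipher suite and extension
hex strings, and we assume that the more prevalent fingerprint wins a
tie, which the text does not state explicitly.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def approximate_match(unknown, known):
    """Return the known fingerprint closest to `unknown`.

    `known` maps a tokenized fingerprint (tuple of hex strings) to its
    session count. Assumption: higher prevalence breaks distance ties.
    """
    return min(known, key=lambda fp: (levenshtein(unknown, fp), -known[fp]))
```

Caching the result for each previously unseen fingerprint string, as
the text describes, avoids repeating this linear scan for every session.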
3.3 Equivalence Classes
Similar to the introduction of new fingerprint strings, the des-
tination information associated with a process can change over
time. The changes often take the form of a new subdomain in the
server_name or a different IP address within the same public cloud
environment. Our classification system described in Section 3.4
uses probabilities conditioned on real-world observations to select
the most probable process. If a server_name has a random
component in the subdomain but the domain name remains constant, this
would result in a zero probability despite the server_name being
obviously related to the known process.
We solve the problem of unseen destination values by intro-
ducing equivalence classes for the observable features, and we use
those equivalence classes in the classifier. Each equivalence relation
partitions the set of features into distinct subsets. There may be
more than one useful equivalence relation for a feature, since there
are multiple ways that addresses, ports, and domain names can be
related. All the addresses within a BGP Autonomous System (AS)
[48] are equivalent in some way, as are the addresses in a corporate
offering such as Microsoft Office 365, Azure, AWS, or Cloudflare.
Feature                 Weight
server_name             0.97192
server_name → Domain    0.16200
server_name → TLD       0.01044
IP                      0.53294
IP → AS                 0.10343
Port                    0.00396
Port → Port Class       0.00265
Table 1: Weighted naïve Bayes feature weights as found with
the information gain ratio.
In the current work, we test four equivalence classes. For the
server_name feature, we extract the domain name and the top-level
domain using Mozilla’s Public Suffix List [12]. The IP address
is mapped to the corresponding BGP AS using MaxMind’s GeoLite2
database [11]. The ports are mapped to a known application layer
protocol, e.g., 443 → HTTPS and 993/995 → email, with unknown
and ephemeral ports mapping to “unknown”. While a more
interesting set of equivalence classes would undoubtedly improve the
performance of the classification system, it would not change the
mechanics of the classifier and we leave these investigations to
future work as discussed in Section 9.
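The four equivalence mappings can be illustrated with a small Python
sketch. It deliberately avoids external data: the domain split is a
naive last-two-labels heuristic rather than a Public Suffix List
lookup, and AS_TABLE is a hypothetical one-entry table built from
documentation-reserved values instead of the GeoLite2 database. All
names here are illustrative.

```python
import ipaddress

# Port classes named in the text; everything else maps to "unknown".
PORT_CLASSES = {443: "HTTPS", 993: "email", 995: "email"}

# Hypothetical prefix-to-AS table (TEST-NET-1 and a documentation ASN).
AS_TABLE = {ipaddress.ip_network("192.0.2.0/24"): "AS64500"}


def server_name_classes(server_name: str) -> dict:
    """Naive split: last label is the TLD, last two labels the domain.

    A production system would consult the Public Suffix List so that
    suffixes such as co.uk are handled correctly.
    """
    labels = server_name.lower().rstrip(".").split(".")
    return {"domain": ".".join(labels[-2:]), "tld": labels[-1]}


def ip_as(ip: str) -> str:
    """Map an IP address to its AS using the illustrative prefix table."""
    addr = ipaddress.ip_address(ip)
    for net, asn in AS_TABLE.items():
        if addr in net:
            return asn
    return "unknown"


def port_class(port: int) -> str:
    """Map a port to an application-layer protocol class."""
    return PORT_CLASSES.get(port, "unknown")
```

With these mappings, a server_name whose subdomain has never been seen
still contributes evidence through its domain and TLD classes.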
3.4 Weighted Naïve Bayes Model
Given an exact or approximate fingerprint string match and a set
of destination features along with their equivalence mappings, our
goal is now to select the most probable process from the fingerprint
string’s list of possible processes.
We formalize the system as follows. Each session is associated
with a process $z$, which is not directly observed, as well as the
observed destination features and equivalence mappings,
$f_1, \ldots, f_n$. $Z_{fp}$ denotes the set of processes previously
observed using the matched fingerprint string, $fp$, as given by the
knowledge base. Our goal is to construct a classifier $c$ that, given
a TLS fingerprint string, $fp$, returns the process that maximizes
$P(z \mid f_1, \ldots, f_n)$ for $z \in Z_{fp}$.
For interpretability and computational efficiency, we chose the
naïve Bayes model:
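\begin{align}
c(f_1, \ldots, f_n) &= \operatorname*{argmax}_{z \in Z_{fp}} P(z \mid f_1, \ldots, f_n) \tag{1} \\
&= \operatorname*{argmax}_{z \in Z_{fp}} P(z) \prod_{i=1}^{n} P(f_i \mid z) \tag{2} \\
&= \operatorname*{argmax}_{z \in Z_{fp}} \log P(z) + \sum_{i=1}^{n} \log P(f_i \mid z). \tag{3}
\end{align}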
Equation 1 simply defines the classifier as the function that returns
the most probable process given the destination feature set. Equa-
tion 2 applies Bayes’ theorem, removes the irrelevant denominator,
and applies the naïve Bayes assumption that each observed feature
is conditionally independent from all other observed features. Equa-
tion 3 simplifies the computation and helps to prevent underflow
by moving to sums of logarithms.
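For readers who want the intermediate step between Equations 1 and 2
spelled out, the following short derivation may help. It uses only the
symbols defined above; the last equality holds only under the naïve
Bayes independence assumption.

```latex
\begin{align*}
P(z \mid f_1, \ldots, f_n)
  &= \frac{P(z)\, P(f_1, \ldots, f_n \mid z)}{P(f_1, \ldots, f_n)}
     && \text{(Bayes' theorem)} \\
  &\propto P(z)\, P(f_1, \ldots, f_n \mid z)
     && \text{(denominator does not depend on } z\text{)} \\
  &= P(z) \prod_{i=1}^{n} P(f_i \mid z)
     && \text{(conditional independence)}
\end{align*}
```

Because the logarithm is monotonically increasing, taking the log of
the final expression leaves the maximizing process unchanged, which
gives Equation 3.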
$P(f_i \mid z)$ is computed by using the empirical probability estimates
provided by the knowledge base. In cases where $P(f_i \mid z) = 0$, we
use a prior probability of $1/t$, where $t$ is the total number of
sessions observed using a given fingerprint string in the knowledge
base. Computing $P(f_i \mid z)$ directly from the knowledge base and
leveraging a naïve Bayes model helps to avoid an unreasonably large
number of features that would be needed by other machine
learning alternatives such as deep neural networks or support vector
machines.
Algorithm 1 Process identification.
Given: fingerprint, destination_features
proc_list.initialize_to_empty_list()
for z ∈ fingerprint.processes do
    q ← log P(z)
    for f ∈ destination_features do
        q ← q + w_f · log P(f|z)
        for γ ∈ f.eqv_classes do
            e ← γ(f)
            q ← q + w_γ · log P(e|z)
        end for
    end for
    proc_list.append(q, z)
end for
return proc_list.get_maximum()
Because the destination features and equivalence mappings
provide varying levels of information about the process initiating a
TLS session, we opted to modify the naïve Bayes algorithm to use
a weighted combination of the features as described by Zhang and
Sheng [50]. The feature weights are computed using the information
gain ratio conditioned on the knowledge base. The weights found
with this method using a knowledge base constructed from data
collected during May 2020 are given in Table 1. Both the server_name
and IP address have high weights and are stronger indicators of a
process given a fingerprint string relative to the port information,
which is heavily biased towards port 443 as shown in Section 4.
Section 5.5 provides results when we consider subsets of destination
features.
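To make the weighting idea concrete, the sketch below computes the
textbook information gain ratio of a feature F with respect to the
process label Z, that is, (H(Z) - H(Z|F)) / H(F). This is the standard
definition only; any additional normalization applied by Zhang and
Sheng [50] or by the system described here is not reproduced, and the
function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a collection of non-negative counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def gain_ratio(pairs):
    """Information gain ratio from (feature_value, process) observations."""
    pairs = list(pairs)
    h_z = entropy(Counter(z for _, z in pairs).values())
    f_counts = Counter(f for f, _ in pairs)
    h_f = entropy(f_counts.values())
    # H(Z | F): entropy of the process within each feature value,
    # weighted by how often that value occurs.
    h_z_given_f = sum(
        n / len(pairs) * entropy(Counter(z for f, z in pairs if f == v).values())
        for v, n in f_counts.items()
    )
    # A constant feature carries no information, so return 0 for it.
    return (h_z - h_z_given_f) / h_f if h_f else 0.0
```

A feature that perfectly separates the processes scores 1, while a
feature that is independent of the process scores 0, which matches the
intuition behind the ordering of weights in Table 1.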
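With the feature weights, $w_{f_i}$, computed, the new equation to
compute the most probable process becomes:
\begin{equation}
c(f_1, \ldots, f_n) = \operatorname*{argmax}_{z \in Z_{fp}} \log P(z) + \sum_{i=1}^{n} w_{f_i} \cdot \log P(f_i \mid z). \tag{4}
\end{equation}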
Algorithm 1 summarizes the procedure to find the most probable
process given a fingerprint string from the knowledge base and the
destination information from the session.
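Algorithm 1 translates almost directly into Python. The version below
is a minimal sketch rather than the mercury implementation: the
knowledge base entry layout, the function name, and the choice to pass
equivalence-class values as ordinary entries of the observed feature
dictionary are all assumptions made for illustration.

```python
import math


def identify_process(fp_entry, observed, weights):
    """Weighted naive Bayes over the processes listed for one fingerprint.

    fp_entry: {"total": t,
               "processes": {name: {"count": c,
                                    "features": {feat: {value: count}}}}}
    observed: {feat: value} for the session, including equivalence classes
    weights:  {feat: weight}
    """
    t = fp_entry["total"]
    best, best_score = None, -math.inf
    for name, proc in fp_entry["processes"].items():
        score = math.log(proc["count"] / t)  # log P(z)
        for feat, value in observed.items():
            seen = proc["features"].get(feat, {}).get(value, 0)
            # Empirical P(f|z); fall back to 1/t when the value was never seen.
            p = seen / proc["count"] if seen else 1.0 / t
            score += weights[feat] * math.log(p)
        if score > best_score:
            best, best_score = name, score
    return best, best_score
```

Working in log space keeps the score numerically stable even when many
small probabilities are combined, and the 1/t fallback prevents a single
unseen destination from eliminating a process outright.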
4 DATA
We used mercury [38] to collect network data from a site located
within GMT+19, referred to as Site 1 (GMT+19) in Section 5, that also
reports host logs from the AnyConnect Network Visibility Module
[3]. The host and network datasets are joined daily as described in
Section 3.1. The data for this paper was collected between
2019-07-01 and 2020-06-10, where the data from June 2020 was
exclusively used for testing. Over 70,000 unique hosts generated the
data from Site 1.
In total, we observed 9,312 unique fingerprint strings from over
13 billion TLS sessions with associated host data from Site 1 during
our monitoring. Table 2 lists the ten most prevalent TLS ports at
Site 1. 99.4% of the TLS sessions used port 443, the typical port for
HTTPS. The second most prevalent port, 993, is mainly used for
IMAP-over-TLS. This dataset has a large diversity of processes with
22,969 unique process names and 243,736 unique process executable
SHA-256s. During the first 10 days in June used for testing, there
were 39,768 hosts, 2,320 fingerprint strings, 4,073 process names,
16,474 process executable SHA-256s, and 278,570,891 TLS sessions.
Site 1 (GMT+19)            Malware Sandbox
Port    Sessions           Port    Sessions
443     13,148,433,441     443     53,082,740
993     40,608,706         465     701,814
5228    3,430,729          9001    70,075
80      2,940,342          80      10,434
995     2,744,560          449     5,443
8443    2,458,762          26      4,928
8080    2,418,693          8443    4,370
5986    2,222,676          993     3,762
465     1,801,507          9002    3,690
5223    1,527,906          8080    3,475
Table 2: Top-10 TLS ports from Site 1 (GMT+19) and the
malware analysis sandbox.
To perform additional validation and to test how well the knowl-
edge base generalizes to new locations, we collected the same joined
data from a geographically distinct site located in GMT+8, referred
to as Site 2 (GMT+8) in Section 5. This site belonged to the same
enterprise; we leave validation on distinct entities for future work
as described in Section 9. We collected this data from 2020-06-01
to 2020-06-10, and it was only used for testing. There were 10,175
hosts, 824 fingerprint strings, 1,471 process names, 5,210 process
executable SHA-256s, and 33,820,842 TLS sessions.
In terms of the operating systems, nearly 70% of the data collected
from Sites 1 and 2 came from MacOS 10.14.6 and Windows 10.0.17134.
Another 28% of the data came from other versions of Windows 10
and MacOS 10.15.x. The remaining data is primarily from older versions
of Windows and MacOS. Only 0.01% of the data is Linux-based,
mainly Ubuntu 19.04 and 19.10. We further discuss this limitation
in Section 9.
Finally, we collected data from a malware analysis sandbox run-
ning Windows 7 and 10 between 2019-07-01 and 2020-06-10. Simi-
lar to Site 1, the data collected in June 2020 was used exclusively
for testing. The full malware dataset has 53,958,368 TLS sessions,
9,348 unique TLS fingerprint strings, and 37,841 unique process
executable SHA-256s. As Table 2 shows, the TLS ports for the mal-
ware analysis sandbox data were dominated by HTTPS similar to
Site 1, with over 98.4% of the sessions using port 443. There were
over 1,200 unique anti-virus signatures associated with 10 or more
samples. The most common malware families were Troldesh [20],
Tofsee [35], Emotet [32], and DarkComet [29]. The testing data
collected in June 2020 had 347 fingerprint strings, 9,117 process
executable SHA-256s, and 298,126 TLS sessions.
5 RESULTS
Throughout this section, we present results related to process iden-
tification, identifying cloud orchestration and processing tools, mal-
ware detection, and the importance of destination feature subsets.
Method           Process Family F1 Score    Process F1 Score
W-Naïve Bayes    0.9941 (+/- 0.0004)        0.9650 (+/- 0.0013)
Naïve Bayes      0.9879 (+/- 0.0007)        0.9571 (+/- 0.0012)
Top Process      0.8953 (+/- 0.0043)        0.8860 (+/- 0.0052)
server_name      0.8556 (+/- 0.0073)        0.8215 (+/- 0.0068)
dst_ip           0.8537 (+/- 0.0154)        0.8181 (+/- 0.0154)
Table 3: Process inference results on data collected from Site
1 (GMT+19). The server_name and dst_ip methods do not use
the fingerprint information but rely strictly on the specified
destination information.
We compare the performance of the weighted naïve Bayes, un-
weighted naïve Bayes, and “top process" classification methods,
where top process ignores the destination features and simply se-
lects the process with the most observations given a TLS fingerprint
string. The top process method still takes advantage of the process
prevalence information in the knowledge base. We present results
using the weighted naïve Bayes classification method if not stated
otherwise. We further compare our approach to methods that make
no use of the fingerprint string, instead relying solely on either the
TLS server_name or the destination IP address.
To summarize overall performance, we use the micro-averaged
F1 score where the label set is either a process name or a process
family name. The process name label set has been normalized so
that processes appearing on different platforms, e.g., MacOS and
WinNT, map to the same label. For example, chrome.exe on WinNT
and google chrome on MacOS both map to chrome. The process
family name labels group sets of processes that share an underlying
architecture and purpose. The two largest and most diverse process
families are Microsoft Office, including processes like excel.exe,
outlook.exe, and word.exe, and Chromium-based web browsers,
including processes like chrome.exe, brave.exe, and msedge.exe.
The process family name labels generalize the process name labels.
We use precision, tp/(tp + fp), and recall, tp/(tp + fn), to highlight
the system’s performance on individual processes and malware. For
the malware detection results, we use a binary label where samples
are considered malware if five or more anti-virus engines labeled
the SHA-256 associated with the process as malicious.
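The metrics defined above can be computed with a few lines of Python.
This is a minimal sketch with hypothetical function names; it assumes
one true label and one predicted label per session, in which case the
micro-averaged F1 score coincides with plain accuracy.

```python
def precision_recall(y_true, y_pred, label):
    """Per-label precision tp/(tp + fp) and recall tp/(tp + fn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def micro_f1(y_true, y_pred):
    """Pool tp, fp, and fn over all labels, then compute F1 once."""
    pairs = list(zip(y_true, y_pred))
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        tp += sum(t == label and p == label for t, p in pairs)
        fp += sum(t != label and p == label for t, p in pairs)
        fn += sum(t == label and p != label for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
```

Because counts are pooled before the ratio is taken, prevalent processes
dominate the micro-averaged score, which is why per-process precision
and recall are reported separately.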
All results in this section use a knowledge base constructed
on data collected during May 2020 from Site 1 (GMT+19) and the
malware analysis sandbox. As we show in Section 6, including the
training data from the months prior to May does not increase the
performance of the system. The testing data is collected during the
first 10 days of June 2020 from Site 1 (GMT+19), Site 2 (GMT+8), and
the malware analysis sandbox as specified throughout this section.
5.1 Process Identification
The core feature of the system described in Section 3 is to infer
the process name from the TLS fingerprint string and destination
information. Table 3 lists an overview of these results when
applied to the first ten days of data collected in June 2020 from Site 1
(GMT+19). The F1 score is averaged over each day and presented
with its standard deviation. The baseline method of selecting the
process with the most observations in the knowledge base for a
given fingerprint string resulted in an F1 score of 0.8953. Both the
unweighted and weighted naïve Bayes methods improved
significantly on the baseline, with the weighted naïve Bayes method
achieving an F1 score of 0.9941 for process family identification. A
strategy that ignores the fingerprint information and selects the
process most closely associated with either the TLS server_name
or destination IP address performed significantly worse. A diverse
set of processes often communicate with the same set of
destinations, and the TLS fingerprint string is needed to achieve superior
process identification performance.
Figure 4: Confusion matrix for the top-25 processes on data
collected from Site 1 (GMT+19).
The sessions misclassified by the weighted naïve Bayes algorithm
were skewed towards a small set of misclassifications between
Microsoft Outlook, Cisco Webex, and Safari. Cisco Webex has
components that integrate with Microsoft Outlook, resulting in
sessions initiated by Cisco Webex that communicate with
outlook.office365.com on both WinNT and MacOS, confusing the
classifier. There is an overlap in the fingerprint strings that
Microsoft Outlook, Cisco Webex, and Safari present on MacOS.
When Microsoft Outlook or Cisco Webex communicate with CDNs
or advertising sites using the default CoreTLS library, both the
fingerprint string and the destination are more strongly correlated
with Safari in the knowledge base, resulting in misclassifications.
The aforementioned cases account for 40% of the misclassified
sessions. The other major outlier is Chromium-based applications
like Electron and Slack being misclassified as Chromium-based web
browsers. This case is responsible for 20% of the misclassifications.
Figure 4 and Table 4 present a more detailed view of the weighted
naïve Bayes classifier’s performance on the most prevalent
processes in the test data. Figure 4 presents the confusion matrix
for the top-25 process names in the test data. The “Other” category
in the confusion matrix consists of all remaining processes. In
general, we can correctly identify individual processes using the
proposed methods. The primary weakness is disambiguating processes
that share a common architecture and purpose. For example, Microsoft
Office applications are often confused. The Chromium-based Microsoft
Edge is also misclassified as Chrome. In terms of implementing a
network security policy, process specificity may not be needed, and
general process families may suffice. For example, a security policy
could allow all Microsoft Office applications to bypass the firewall
and communicate directly with Microsoft servers.
Process             Sessions     W-Naïve Bayes       Top Process         server_name         dst_ip
                                 Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
Cisco AMP           105,889,766  0.9999     1.0000   0.9851     0.9999   0.9576     0.9999   0.9997     0.9935
Chromium            45,931,514   0.9920     0.9998   0.9781     0.9993   0.6439     0.3059   0.6293     0.8496
Cisco Webex         38,403,011   0.9989     0.9923   0.8141     0.9680   0.9819     0.8972   0.9805     0.9653
Microsoft Office    26,855,990   0.9788     0.9911   0.7887     0.3963   0.9290     0.9745   0.9098     0.9647
Firefox             22,234,838   0.9994     0.9999   0.9992     0.9999   0.6329     0.3059   0.6124     0.2689
Safari              7,633,503    0.9787     0.9903   0.3751     0.9072   0.5094     0.2210   0.4672     0.1861
Internet Explorer   4,373,033    0.9903     0.9969   0.9490     0.8761   0.6521     0.2875   0.5343     0.2731
iCloud              4,328,783    0.9658     0.9803   0.6135     0.2546   0.9242     0.8512   0.8770     0.8230
Creative Cloud      1,891,238    0.9955     0.9950   0.5110     0.1246   0.9852     0.9881   0.9222     0.6824
Box                 664,518      0.9992     0.9961   0.9822     0.9239   0.9555     0.9992   0.9508     0.9072
Table 4: Process inference results for the top-10 most prevalent process families on data collected from Site 1 (GMT+19).
Method          Process Family F1 Score   Process F1 Score
W-Naïve Bayes   0.9858 (+/- 0.0019)       0.9702 (+/- 0.0021)
Naïve Bayes     0.9786 (+/- 0.0040)       0.9599 (+/- 0.0038)
Top Process     0.9077 (+/- 0.0057)       0.9035 (+/- 0.0061)
server_name     0.8381 (+/- 0.0108)       0.8297 (+/- 0.0091)
dst_ip          0.8145 (+/- 0.0286)       0.7904 (+/- 0.0271)
Table 5: Process inference results on data collected from Site
2 (GMT+8).
Table 4 lists the precision and recall for the 10 most prevalent
process families. These process families all have precision and recall
greater than 0.99 for the weighted naïve Bayes classifier with three
exceptions: Microsoft Office, Safari, and iCloud. As was the case for
specific process categories, the lower performance on these process
families is due to other applications integrating with underlying
services and shared fingerprint strings communicating with generic
CDNs. These top-10 process families accounted for 93.41% of the
total test traffic from Site 1 (GMT+19). If we remove the data
associated with these top-10 families from the testing data, the
weighted naïve Bayes classifier still achieves an F1 score of 0.9711,
with over 60% of the misclassifications due to Chromium-based
applications being misclassified as Chromium-based web browsers. The
performance of the classifiers based solely on destination information
is significantly worse for browsers that go to a wide variety of
destinations and generally underperforms on all processes.
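As a reminder of how per-family precision and recall such as those in
Table 4 are derived from labeled sessions, the following is a minimal
sketch. The function name and the toy (true label, predicted label)
pairs are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def precision_recall(pairs):
    """Compute per-label (precision, recall) from (truth, prediction) pairs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in pairs:
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    results = {}
    for label in set(tp) | set(fp) | set(fn):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        results[label] = (precision, recall)
    return results


# Toy sessions: two non-Safari sessions are misclassified as Safari.
pairs = [
    ("safari", "safari"), ("safari", "safari"),
    ("outlook", "safari"), ("webex", "safari"),
    ("outlook", "outlook"), ("webex", "webex"),
]
metrics = precision_recall(pairs)
```

The toy data mirrors the pattern discussed above: a family that absorbs
other applications' sessions keeps high recall but loses precision.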
In order to assess how well the techniques generalize to new
networks, we used the knowledge base trained on data from Site 1
(GMT+19) in May 2020 to test on data collected from Site 2 (GMT+8)
during the first ten days of June 2020. The results are summarized in
Table 5. Selecting the most prevalent process without considering
destination information performed slightly better compared to the
results presented in Table 3, but this is simply because there was
less process diversity on Site 2 (GMT+8).
The weighted naïve Bayes method remained competitive with
a process family F1 score of 0.9858 (compared with 0.9941 on Site
1). The reduction in performance is in part due to observing
processes not seen on Site 1. There were over 300 unknown processes
initiating over 100,000 TLS sessions, or 0.3% of the total number
of TLS sessions collected from Site 2. The remaining discrepancy
is explained by geography-specific sub-domains and IP addresses
that did not appear in the original database, biasing the classifier
to select processes with higher prior probabilities.
As we discuss in Section 9, the knowledge base would ideally
be conditioned on enterprise and geography-specific data before
deployment. In cases where this isn’t feasible, Table 5 demonstrates
that competitive performance is still possible. It remains an open
question how well the knowledge base would translate to data
collected from a distinct enterprise, which we leave for future work.
5.2 Cloud Orchestration and Processing Tools
Tools specifically designed to facilitate cloud orchestration and
processing serve an important role in the current network ecosys-
tem. Identifying and prioritizing network traffic associated with
these tools can provide an enhanced user experience resulting in
increased productivity. Given applications’ reliance on cloud
computing resources, it is difficult to differentiate processes
consuming cloud services from those responsible for running
business-critical services by relying only on IP addresses and
domain names.
Table 6 provides the classification results for six different cloud
and virtualization tools. For these results, we selected processes that
facilitated cloud or virtualized workflows and also were represented
in our datasets. Terraform [13] generally manages infrastructure
resources, with terraform-provider-aws explicitly exposing AWS
APIs. AWS Kinesis [6] is a streaming data processing platform.
Docker [8], Kubernetes [10], and Helm [9] all support deploying
and managing containerized applications.
As shown in Table 6, the weighted naïve Bayes classifier
outperformed all competing approaches. Our approach performed well
with respect to precision and recall for most applications, despite
many of these applications connecting to generic AWS services
like S3, which highlights the importance of incorporating the TLS
fingerprint string into the analysis.
Process                 Sessions  W-Naïve Bayes       Top Process         server_name         dst_ip
                                  Precision  Recall   Precision  Recall   Precision  Recall   Precision  Recall
terraform-provider-aws  81,784    0.9634     0.9975   0.5116     1.0000   0.9418     0.9247   0.8459     0.7754
docker                  45,412    0.9979     0.9993   0.1882     0.3043   0.5008     0.9914   0.4708     0.8991
terraform               7,382     0.9849     0.6816   0.4038     0.0573   0.6607     0.6140   0.3120     0.2614
kubectl                 5,653     0.9684     0.9858   0.8847     0.6474   0.9975     0.6215   0.9285     0.6282
helm3                   690       0.9568     0.7696   0.9649     0.6783   0.8500     0.0986   0.8599     0.2580
awskinesistap           425       1.0000     0.9642   0.0000     0.0000   0.8182     0.2084   0.4889     0.0463
Table 6: Process inference results for various cloud orchestration and processing applications found in the data collected from
Site 1 (GMT+19) during June 2020.
Method          Precision  Recall
W-Naïve Bayes   0.9993     0.8868
Naïve Bayes     0.9992     0.7038
Top Process     0.9731     0.0682
server_name     0.7792     0.6949
dst_ip          0.6359     0.6499
Table 7: Malware detection results for Site 1 (GMT+19) and
the malware analysis sandbox.
5.3 Malware Detection
Malware has been replacing HTTP with TLS over the past several
years [19], and we observed malware samples from the malware
analysis sandbox more frequently using TLS compared to HTTP
during our collection period. The top two benign domains visited
were twitter.com (14.5% of the connections) and www.google.com
(1.8% of the connections). The server_name extension was present
in 98.2% of connections, higher than the 90% observed at Site 1
(GMT+19) during the same time. For the testing data, a total of
12,249 unique server_name values were observed.
Table 7 presents the malware detection precision and recall for
the malware analysis sandbox data and data collected from Site 1
(GMT+19). While the method of simply selecting the most prevalent
process had a relatively high precision of 97.31%, the recall
was poor. The top process method only detected 6.82% of the
malware connections. This is unsurprising due to malware’s reliance
on system-provided TLS libraries. In our dataset, over 90% of the
malicious connections used the default Windows Schannel library
[2], which generates TLS fingerprint strings used by many popular
Windows processes such as Microsoft Office and Internet Explorer.
By leveraging the destination information contained within
the knowledge base and the weighted naïve Bayes algorithm, we
increased the recall to 88.68%. Malware-initiated connections to
*.baidu.com and *.googleusercontent.com were responsible
for 45% of the false negatives. In these cases, the malware samples
used an Schannel-generated TLS fingerprint string, and there
were an overwhelming number of connections to those domains
by benign processes. Due to malware’s reliance on popular hosting
services like *.googleusercontent.com, classifiers looking strictly
at destination information underperformed, with precisions and
recalls between 63% and 77%. The added information from the TLS
fingerprint string helps in disambiguating many of the sessions
that would be misclassified by the destination information alone.
Site              Process Family F1 Score   Process F1 Score
Site 1 (GMT+19)   0.9024 (+/- 0.0577)       0.9003 (+/- 0.0582)
Site 2 (GMT+8)    0.7897 (+/- 0.0843)       0.7622 (+/- 0.0849)
Table 8: Process inference results when restricted to fingerprint
strings not in the database. In these cases, the analysis
algorithm must rely on approximate matching.
To evade detection, malware authors could shift to using the
TLS libraries of popular processes such as Chrome to avoid a trivial
detection through unique or rare TLS fingerprint strings. These
libraries also offer the most flexibility in terms of potential desti-
nations that would blend into previous observations from those
libraries, e.g., CDNs. But, unlike the developers of benign applica-
tions, malware authors are under additional constraints such as
avoiding any noticeable user experience differences on the infected
machine and the potential for take-down requests impacting their
server infrastructure. If malware were to mimic popular fingerprint
strings, the authors would need to make frequent updates to ensure
their selected fingerprint string is still relevant. Our system auto-
matically incorporates the latest fingerprint string, process, and
destination information so that we are as robust as possible to the
changing TLS landscape and can incorporate the latest malware
trends into the knowledge base.
5.4 Approximate Matching
New TLS fingerprint strings are continuously introduced into the
ecosystem, and a robust system needs to handle these cases. As
described in Section 3.2, we use Levenshtein distance to find “close”
fingerprint strings, and then perform process identification on the
close fingerprint string’s process list. Similar to the process
identification results using Site 2 (GMT+8), this method will fail if
the process isn’t in the knowledge base or the close fingerprint
string’s process list.
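The lookup of a “close” fingerprint string can be sketched as follows.
This is a generic character-level Levenshtein distance with a
nearest-neighbor search; the example fingerprint strings are invented,
and the exact string representation and tie-breaking used by the system
in Section 3.2 may differ.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def closest_fingerprint(unknown, known):
    """Return the known fingerprint string with the smallest edit distance."""
    return min(known, key=lambda fp: levenshtein(unknown, fp))


# Hypothetical fingerprint strings, for illustration only.
KNOWN = ["(0303)(c02bc02f)(0000000a)", "(0301)(002f0035)(000a)"]
match = closest_fingerprint("(0303)(c02bc030)(0000000a)", KNOWN)
```

Process identification would then proceed on the process list stored
for `match`, which is why the approach fails when the true process is
absent from that list.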
Feature Set        Process Family F1 Score   Process F1 Score
fp, sni, ip, port  0.9941 (+/- 0.0004)       0.9650 (+/- 0.0013)
fp, sni, ip        0.9940 (+/- 0.0004)       0.9656 (+/- 0.0012)
fp, sni, port      0.9876 (+/- 0.0006)       0.9567 (+/- 0.0012)
fp, ip, port       0.9811 (+/- 0.0039)       0.9485 (+/- 0.0040)
fp, sni            0.9938 (+/- 0.0004)       0.9651 (+/- 0.0014)
fp, ip             0.9885 (+/- 0.0039)       0.9578 (+/- 0.0041)
fp, port           0.8955 (+/- 0.0042)       0.8862 (+/- 0.0051)
fp                 0.8953 (+/- 0.0043)       0.8860 (+/- 0.0052)
sni                0.8556 (+/- 0.0073)       0.8215 (+/- 0.0068)
ip                 0.8537 (+/- 0.0154)       0.8181 (+/- 0.0154)
Table 9: Process identification results when only considering
subsets of the destination features using data collected from
Site 1 (GMT+19). server_name is represented as “sni” and only
using the fingerprint is represented as “fp”.
To understand the performance of approximate matching, we
analyzed the results of the process identification system when the
test data is restricted to only contain fingerprint strings not in the
knowledge base. During the first 10 days of June 2020, there were
159 fingerprint strings that were not in the May 2020 knowledge
base constructed from Site 1; there were 8,781 TLS sessions (out of
278,570,891 total sessions) generated from these fingerprint strings.
We also analyze the data collected from Site 2, where there were
53 unknown fingerprint strings and 3,217 TLS sessions (out of
33,820,842 total sessions).
Table 8 lists the process identification results for both sites. The
system achieved an F1 score of 0.9024 for the process family
problem on the data from Site 1. The system had problems classifying
processes not seen in the training data, as well as TLS scanners
[27, 41] that exhibit a large diversity in TLS fingerprint strings. The
process family F1 score for Site 2 when restricted to fingerprint
strings not in the knowledge base was 0.7897. The discrepancy
between the results on the two sites is explained almost entirely by
processes observed on Site 2 that were never observed on Site 1.
While the results for approximate matching are worse than re-
sults when there is an exact match, we believe the approximate
matching technique provides a valuable addition to a TLS finger-
printing system. Additionally, with a well-curated knowledge base,
the number of sessions requiring approximate matching should
be low. Only 0.003% and 0.010% of sessions from Site 1 and Site 2
required approximate matching.
5.5 Feature Importance
The destination features have varying levels of information. With
the weighted naïve Bayes algorithm, we take the feature’s
importance into account through the weights listed in Table 1. But
there are several ongoing initiatives to increase the privacy of TLS
sessions by obfuscating destination information. For example,
encrypting the entire ClientHello (ECHO) [43] is one approach to
obfuscate the server_name value. ECHO would result in all TLS
sessions to a single service provider (CDN, cloud provider, etc.)
offering the same server_name value.
The server_name is the destination feature with the highest
weight in our system, and it is natural to question the efficacy of
the system if that feature were to contain significantly less
information. To better understand the importance of particular
destination features in our system, we performed process
identification using the weighted naïve Bayes algorithm on Site 1’s
June 2020 data with different subsets of features. Table 9 provides
results for each combination of feature sets, where the feature set
includes the destination feature and related equivalence mappings.
Table 9 also provides results when only destination information is
utilized, which is denoted by the lack of the “fp” identifier in the
table.
For the weighted naïve Bayes classifier, the first row in Table 9
with “fp, sni, ip, port” is the performance when using all destination
features and “fp” is the performance when ignoring all destination
features, which then defaults to selecting the most prevalent process.
If the server_name extension were completely removed from all TLS
sessions, our approach would still achieve an F1 score of 0.9811
for process family identification. On the other hand, if IP
addresses no longer contained the same amount of information,
e.g., all servers were hosted on Cloudflare, the system can still
achieve an F1 score of 0.9938 with the server_name value alone.
The port features add little information when considering
aggregate statistics like the F1 score but do help in niche cases
such as email and remote desktop application identification.
Considering evasion in light of the destination features and
Tables 1 and 9 provides some insights. Simply omitting the
server_name may not have the desired effect because this will
alter the fingerprint string. For example, Psiphon [4] exhibits many
different fingerprint strings, one of which attempts to imitate
Chrome. In some cases, Psiphon imitates Chrome but omits the
server_name, which causes it to be identifiable from just its TLS
fingerprint string. Robust evasion needs to jointly consider all
destination features and the TLS fingerprint string, while at the
same time making sure the destinations and fingerprint string
remain prevalent in real-world traffic.
6 OPERATIONALIZING
In the previous section, all results use a knowledge base constructed
from data collected during May 2020 to classify data collected during
the first 10 days of June 2020. Maintaining an up-to-date knowl-
edge base that captures the relevant real-world traffic statistics
is critical to the success of our solution. In this section, we study
how parameters of the knowledge base, such as its age, affect the
performance of our classifier. In the previous section, we also took
the most probable process returned by the classifier, ignoring the
score. Here, we show the impact of considering the score on the
classifier’s performance and the amount of data discarded.
6.1 Knowledge Base Age
Maintaining the knowledge base with current, real-world data is
not a trivial task, but is necessary due to the introduction of new
fingerprint strings, processes, and destination information. We now
investigate the importance of continuously updating the knowledge
base by examining the performance of the system when only
considering older data. For this experiment, we built 11 separate
knowledge bases by merging daily knowledge bases for each month
between July 2019 and May 2020. The May 2020 knowledge base
was used in the experiments of the previous section.
Figure 5: The effect of the knowledge base’s age on classification
performance, where the results use a knowledge base
trained on the month specified by the x-axis.
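Merging daily knowledge bases into a monthly one amounts to summing
observation counts. The sketch below assumes a simplified layout in
which each daily knowledge base maps a fingerprint string to
per-process session counts; the real knowledge base also stores
destination information, and the names here are hypothetical.

```python
from collections import Counter, defaultdict


def merge_knowledge_bases(daily_kbs):
    """Sum per-process counts for each fingerprint string across days."""
    merged = defaultdict(Counter)
    for kb in daily_kbs:
        for fingerprint, process_counts in kb.items():
            merged[fingerprint].update(process_counts)  # adds counts
    return {fp: dict(counts) for fp, counts in merged.items()}


# Two hypothetical daily knowledge bases.
day1 = {"fp_a": {"chrome": 5, "slack": 1}}
day2 = {"fp_a": {"chrome": 3}, "fp_b": {"curl": 2}}
monthly = merge_knowledge_bases([day1, day2])
```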
Figure 5 shows the process family and process classification
results when the classifier only has access to data from the specified
month, where the y-axis is from 0.93 to 0.99. Similar to the previous
section, the testing data for each result is taken from the first 10
days of June 2020. As expected, the performance of the classifier
on both label sets consistently decreases as the knowledge base is
trained on older months.
The decreasing performance is due to the evolution of fingerprint
strings and destination information. For example, while the testing
data only contained 159 fingerprint strings with 8,781 TLS sessions
that did not have a match in the knowledge base from May 2020,
there were 1,238 fingerprint strings with 33 million TLS sessions
in the testing data without a match in the July 2019 knowledge
base. A heavier reliance on approximate matching will reduce the
performance of the classifier as explained in Section 5.4.
Even with approximate matching, Figure 5 illustrates the clear
need to keep an up-to-date knowledge base conditioned on recent
real-world data as described in Section 3.1.
6.2 Aging Out Data
In addition to keeping the knowledge base up to date with the most
recent observations, one must consider the effect of older data that
may no longer be relevant. Removing processes associated with
a fingerprint string that have not been recently observed slightly
increases performance and leads to a considerable reduction in the
knowledge base’s size.
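Aging out stale entries can be sketched as a filter on the date a
process was last observed with a fingerprint string. The `last_seen`
field, the dictionary layout, and the cutoff below are assumptions for
illustration; the paper does not specify how observation dates are
stored.

```python
from datetime import date


def age_out(knowledge_base, cutoff):
    """Drop processes not observed on or after `cutoff` for each fingerprint."""
    pruned = {}
    for fingerprint, processes in knowledge_base.items():
        kept = {name: info for name, info in processes.items()
                if info["last_seen"] >= cutoff}
        if kept:  # discard fingerprints with no recent processes left
            pruned[fingerprint] = kept
    return pruned


# Hypothetical entries: counts plus the last day each process was observed.
kb = {
    "fp_a": {"chrome": {"count": 900, "last_seen": date(2020, 5, 30)},
             "old_browser": {"count": 12, "last_seen": date(2019, 8, 2)}},
    "fp_b": {"legacy_tool": {"count": 3, "last_seen": date(2019, 7, 15)}},
}
recent = age_out(kb, cutoff=date(2020, 5, 1))
```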
To measure the impact of older data on our system, we again
constructed 11 separate knowledge bases. For the current
experiment, each knowledge base includes data starting from each of
the months between July 2019 and May 2020. The 11 knowledge bases
include all data between their starting month and the end of May
2020. The size of the knowledge bases became progressively smaller
as we considered less data. The size of the knowledge base
constructed from data between July 2019 and May 2020 was 195
megabytes. The size of the knowledge base only considering the May
2020 data was 58 megabytes.
Figure 6: The effect of aging out older observations in the
knowledge base on classification performance, where the results
use a knowledge base constructed from data spanning
the month specified by the x-axis until May 2020.
Figure 7: Effect of adjusting the classifier’s threshold with
minimum probability values in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95,
0.99, 0.999, 0.9999, 1.0].
Figure 6 illustrates the performance of the classifier when
conditioned on each of the knowledge bases, using the first 10 days
of June 2020 as testing data. There was no advantage in maintaining
data older than one month with respect to process family and
process classification F1 scores. In fact, the performance of the
process classification system decreased as the knowledge base kept
data for longer periods of time. The decrease in performance was
driven primarily by Chromium-based browsers being misclassified
as Chrome due to lagging updates of BoringSSL [7].
6.3 Classifier Threshold
The classification algorithm described in Section 3.4 returns the
most probable process associated with a fingerprint string and
destination along with its probability. If network operators are
unwilling or unable to accept misclassifications, they can define a
minimum probability threshold and ignore any inferences that do
not meet that threshold.
Figure 8: The difference in process family classification
when using the open source and internal knowledge bases
to test data collected from Site 1 (GMT+19) during June 2020.
In this experiment, we used the May 2020 knowledge base and
the June 2020 testing data from Site 1. Figure 7 shows the impact
of adjusting the classifier’s threshold on the process inference
F1 score and the fraction of data discarded if we ignore results below
the given threshold. For a threshold of 0.5, we classify over 99%
of the data and have a process inference F1 score of 0.9737. At a
threshold of 0.999, we classify 58% of the data with an F1 score of
0.9981. Finally, at a threshold of 1.0, we classify 6% of the data with
an F1 score of 0.9991.
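The trade-off between coverage and correctness can be sketched as
follows. The inference records are invented, and plain accuracy on the
retained sessions is used here as a simple stand-in for the F1 score
reported in Figure 7.

```python
def apply_threshold(inferences, threshold):
    """Return (fraction classified, accuracy on the classified sessions).

    `inferences` is a list of (probability, was_correct) pairs.
    """
    kept = [correct for prob, correct in inferences if prob >= threshold]
    coverage = len(kept) / len(inferences)
    accuracy = sum(kept) / len(kept) if kept else None
    return coverage, accuracy


# Hypothetical classifier outputs: (probability of top process, correct?).
inferences = [(0.55, False), (0.70, True), (0.95, True),
              (0.9995, True), (1.0, True)]

for threshold in (0.5, 0.999):
    coverage, accuracy = apply_threshold(inferences, threshold)
    print(f"threshold={threshold}: coverage={coverage:.2f}, accuracy={accuracy:.2f}")
```

Raising the threshold discards the low-confidence inferences first,
which is where most misclassifications concentrate in this toy data.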
As the threshold is increased, the system begins to ignore
(fingerprint string, destination) tuples that are observed with many
processes. At the 0.999 threshold, the system ignores all off-diagonal
sessions in Figure 4 except for 60% of the Microsoft Edge
connections because of Chrome’s dominance with respect to its number
of observations.
7 REPRODUCIBILITY
To assist reproducibility, we have open sourced the data collection
and analysis system described in this paper. Our core tool is a C/C++
program that uses Linux’s AF_PACKET TPACKETv3 zero-copy
shared memory ring buffers to collect and analyze data on network
links with capacities of 30 Gbps+. This tool supports the generation
of TLS fingerprint strings and process inference with destination
context as described in Sections 2 and 3. mercury [38] additionally
generates client fingerprint strings for DHCP, DTLS, HTTP, SSH,
and TCP, as well as server fingerprint strings for HTTP, DTLS, and
TLS. We also released a pip-installable Python implementation to
facilitate rapid prototyping.
We released an open source version of our internal TLS finger-
print knowledge base along with the open source tools. We are
committed to releasing up-to-date TLS fingerprint knowledge bases
to the open source community, which we have currently done each
week during the first seven months of 2020. The current open source
knowledge base contains 8,405 unique TLS fingerprint strings with
associated prevalence, process, and destination information. This
dataset was constructed by considering 9.5 billion TLS sessions.
There are currently 2,695 unique process names and 11,600 unique
process executable SHA-256s in the open source knowledge base.
To comply with the policies of the organization whose sites we
monitored, we had to remove some of the knowledge base’s content.
In the open source knowledge base, we only report the top-10 most
prevalent processes per fingerprint string. For each process, we
report only the equivalence mappings for the destination features.
For example, the FQDN in the server_name data is only reported
as a domain name and TLD. There is a 30-day delay before
observations on the monitored sites are introduced into the open
source knowledge base. Finally, we did not open source any data from
the malware analysis sandbox.
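The server_name equivalence mapping mentioned above can be
approximated with a naive sketch that keeps the last two labels of the
FQDN. This is an assumption about the mapping's general shape, not the
released implementation; in particular, it ignores multi-label public
suffixes such as co.uk, which a production mapping would need to handle.

```python
def domain_and_tld(server_name):
    """Reduce an FQDN to its last two labels (naive domain name + TLD)."""
    labels = server_name.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


reduced = [domain_and_tld(name)
           for name in ("outlook.office365.com", "WWW.Google.com.", "localhost")]
```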
To better understand the impact of omitting data from the open
source knowledge base, we compared the performance of the open
source and internal knowledge bases when applied to data collected
during the first 10 days of June 2020 from Site 1. We used the
May 2020 internal knowledge base and the open source knowledge
base that was available on June 1st, 2020. Figure 8 illustrates the
difference in performance when only selecting the most prevalent
process and applying both the unweighted and weighted naïve
Bayes algorithms.
As expected, the difference in performance between the two
knowledge bases was small when relying on the most prevalent
process. The algorithms based on naïve Bayes had a significant
advantage when using the internal knowledge base. As Figure 5
demonstrated, the 30-day delay had a small impact on performance.
Removing the most informative destination features, the server_name
and IP address according to Table 1, was the primary cause for the
degraded performance. While the process inference performance
based on the open source knowledge base is somewhat reduced,
we still believe this data represents a significant contribution to
the community and the first to associate TLS fingerprint strings,
processes, and destination information at large scale.
8 RELATED WORK
The work presented in this paper builds on a rich history of
network traffic fingerprinting and analysis. TLS fingerprinting first
became popular in 2009 when Ivan Ristić released an Apache module
to monitor SSL handshakes and correlate offered cipher suite
lists with HTTP User-Agent strings [44]. This led to several open
source packages that implemented methods to extract TLS
fingerprint strings and provided TLS fingerprint databases
[1, 15, 24, 45]. The previous fingerprint databases did not provide
real-world prevalence or contextual information about the
destinations and therefore could not rely on that information to
disambiguate the set of processes that mapped to the same
fingerprint string. Husák et al. [34] provided the first academic
study of TLS fingerprinting, but again, did not consider destination
information or have the infrastructure in place to develop detailed
knowledge bases.
While our goal was to perform process inference using TLS
fingerprinting, several efforts have used TLS fingerprinting as a
means to perform measurement studies [18, 31, 33, 36, 40]. For
example, Kotzias et al. [36] used a combination of open source
fingerprint databases and their own data to examine how popular
browsers modified the cryptographic parameters offered in their
client_hello messages in response to the disclosure of high-profile
attacks against TLS [14, 23]. Frolov and Wustrow [31] studied the
uniqueness of censorship circumvention tools’ TLS fingerprint strings
to motivate the development of uTLS [5], a TLS library to mimic and
randomize the client_hello. Our work illustrates the importance
of considering destination features for libraries like uTLS when
constructing a client_hello.
Performing protocol, application, and process identification on
encrypted traffic [16, 17, 22, 25, 37, 39, 46, 47, 49, 51] has been an
active area of research over the past 15 years. Initial work focused
on identifying the application layer protocols, e.g., FTP, HTTP, and
SMTP, within an encrypted tunnel [39, 49]. For example, Wright
et al. [49] used the sequence of TCP packets and a hidden Markov
model to identify application layer protocols.
More recent work has focused on the mobile and IoT domains
[46, 47, 51]. FlowPrint [47] takes a semi-supervised approach that
allows it to fingerprint previously unseen mobile applications.
FlowPrint considers timing, device, and destination features, where
the optimal batch window was found to be 300 seconds. Our work
differs in several key areas but is complementary. Our system
provides process identification results after having only seen the
first packet in a TLS session, as opposed to 300 seconds in the worst
case. We use a continuous data collection system to ensure our
knowledge base has the most recent information, but we are unable
to identify previously unseen processes.
9 DISCUSSION
The system presented in this paper relies on the large-scale col-
lection, curation, and fusion of real-world data. We were able to
achieve our results by creating a custom tool to collect network
data and leveraging data from a pre-existing host agent. This ap-
proach led to quick results, but also created some critical gaps in
our system’s coverage. As detailed in Section 4, the endpoints that
generated host logs were almost entirely MacOS and WinNT-based
desktop systems. There was a small Linux component, but a
complete absence of mobile and IoT devices. While future work will
include expanding the capabilities of the data collection system to
remove these blind spots, we believe the underlying system and
process inference strategy are sound and can naturally incorporate
this new data.
Our data being limited to a single enterprise was another byprod-
uct of our data collection strategy. We believe that the diversity of
processes in our knowledge base and the results when applying
the classification system to Site 2 (GMT+8) provide some evidence
that our approach would scale to networks operated by a distinct
enterprise. But, for optimal performance, having a knowledge base
that was at least in part conditioned on data observed from the
target site would be best. Any site with standard endpoint visibility
agents with similar capabilities to the AnyConnect Network
Visibility Module [3] and the capacity to perform network monitoring
could create custom knowledge bases, but this does require a
significant initial investment. Further experiments into understanding
how well the knowledge base transfers between distinct enterprises
are left for future work.
Evading a system that continuously learns from billions of
real-world TLS connections is not trivial, but we provided some best
practices that privacy-enhancing technologies could employ in
Section 5.3, e.g., using system-provided TLS libraries. On the other
hand, it is possible for malware to use these same techniques to
evade detection. We hypothesize that the additional constraints
placed on many classes of malware, e.g., maintaining prolonged
periods of not being detected, make evading a continuously
updating knowledge base substantially more difficult. While techniques
based on information extracted only from the TLS client_hello
are not immune to evasion, the results of Section 5.3 indicate
that our system does have value. Further investigation into
the security-privacy tradeoff of our system with respect to
privacy-enhancing technology and malware detection is needed.
There exist several avenues to extend the core methods of Section 3.
The most straightforward extension is to expand the set
of destination feature equivalence mappings. Obvious examples
include the global popularity or a binned consonant-to-vowel ratio
of the server_name. Adding further destination features may
also improve the performance of the system. In the May 2020 data,
only 25% of the TLS fingerprint strings and 35% of the TLS ses-
sions signaled support for TLS 1.3, and TLS 1.2 will most likely
remain a large fraction of the TLS traffic for years to come. Because
TLS 1.2 sends the server’s certificate unencrypted, including
certificate features for TLS 1.2 sessions would provide additional
information about the server’s identity to the classification system.
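As an illustration of the binned consonant-to-vowel mapping mentioned above, the following Python sketch maps a server_name to a coarse bin label. The function names, the bin edges, and the label format are hypothetical choices made for this example; they are not taken from mercury or from the system evaluated in this paper.

```python
import bisect

VOWELS = set("aeiou")


def consonant_vowel_ratio(server_name: str) -> float:
    """Ratio of ASCII consonants to ASCII vowels in a server name.

    Digits, dots, hyphens, and non-ASCII characters are ignored.
    The vowel count is floored at 1 to avoid division by zero.
    """
    letters = [c for c in server_name.lower() if c.isascii() and c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    return consonants / max(vowels, 1)


def binned_cv_ratio(server_name: str, edges=(1.0, 2.0, 4.0)) -> str:
    """Map a server name to a bin label (hypothetical bin edges)."""
    ratio = consonant_vowel_ratio(server_name)
    # bisect_right returns how many edges are <= ratio, i.e. the bin index.
    return f"cv_bin_{bisect.bisect_right(edges, ratio)}"
```

Under these assumed bin edges, example.com falls into cv_bin_1 while a vowel-poor name such as xkqzvbnm.net falls into cv_bin_3, so names that look algorithmically generated would tend to share an equivalence class.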
Finally, maintaining proper ethics when performing a project
analyzing real-world network and host data is critical. The unprocessed
data was stored on a platform with an institution-approved
access control system. The data in the knowledge base was stripped
of any indicators that could be used to identify users, such as the
source IP addresses, detailed timestamps, and host agent identifiers.
We followed all institutional procedures, including signing institu-
tional agreements declaring that we would “minimize personally
identifiable information, maintain the confidentiality of all raw and
processed data, receive written consent from your direct manage-
ment chain before releasing any data, and pledge to not follow any
practices that could be deemed discriminatory”.
10 CONCLUSION
In this paper, we presented a system that continuously collects and
fuses billions of real-world TLS sessions and host logs to generate
a knowledge base correlating TLS client fingerprint strings, host
processes, and destination features. With the generated knowledge
base, we built a system that uses a weighted naïve Bayes algorithm
to infer processes and detect malware using only the TLS fingerprint
string and destination information contained within the first data
packet of a TLS session. We demonstrated that our system was
able to achieve an F1 score of over 0.99 when inferring the process
family, and high efficacy malware detection with 99.9% precision
and 88.7% recall. We additionally examined the performance of our
system when used to identify cloud orchestration and processing
tools and found that the precision and recall were greater than 0.99
for several popular processes belonging to this category.
To assist in reproducibility, we contributed mercury [38] to the
open source community for collecting and classifying network
traffic. We also released an open source version of our internal
TLS fingerprint knowledge base, which is updated weekly and
is, to our knowledge, the largest and most informative open source
TLS fingerprint knowledge base currently available.
ACKNOWLEDGMENTS
We thank Brandon Enright for his support in developing mercury.
We thank both Brandon and Adam Weller for their feedback and
support. We thank Lucas Messenger, Eddie Allan Jr., and Joey Rosen
for their assistance in maintaining and providing access to the data
capture infrastructure. We also thank and acknowledge Ed Paradise
for his ongoing support of this work.
REFERENCES
[1] 2012. SSL Fingerprinting for p0f. (2012).
https://idea.popcount.org/2012-06-17-ssl-fingerprinting-for-p0f/.
[2] 2018. Protocols in TLS/SSL (Schannel SSP). (2018).
https://docs.microsoft.com/en-us/windows/win32/secauthn/protocols-in-tls-ssl--schannel-ssp-.
[3] 2019. Cisco AnyConnect Secure Mobility Client.
http://www.cisco.com/go/anyconnect. (2019).
[4] 2019. Psiphon. (2019). https://www.psiphon3.com.
[5] 2019. uTLS. (2019). https://github.com/refraction-networking/utls.
[6] 2020. Amazon Kinesis. https://aws.amazon.com/kinesis/. (2020).
[7] 2020. BoringSSL. (2020). https://boringssl.googlesource.com/boringssl/.
[8] 2020. Docker. https://www.docker.com/. (2020).
[9] 2020. Helm. https://helm.sh/. (2020).
[10] 2020. Kubernetes. https://kubernetes.io/. (2020).
[11] 2020. MaxMind’s GeoLite2. (2020). https://www.maxmind.com/.
[12] 2020. Mozilla’s Public Suffix List. (2020). https://publicsuffix.org/list/.
[13] 2020. Terraform. https://www.terraform.io/. (2020).
[14]
Nadhem AlFardan, Daniel J Bernstein, Kenneth G Paterson, Bertram Poettering,
and Jacob CN Schuldt. 2013. On the Security of RC4 in TLS. In USENIX Security
Symposium. 305–320.
[15] John B. Althouse, Jeff Atkinson, and Josh Atkins. 2017. JA3. (2017).
https://github.com/salesforce/ja3.
[16]
Blake Anderson and David McGrew. 2016. Identifying Encrypted Malware
Traffic with Contextual Flow Data. In ACM Workshop on Artificial Intelligence
and Security (AISec). 35–46.
[17]
Blake Anderson and David McGrew. 2017. Machine Learning for Encrypted
Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity.
In ACM SIGKDD International Conference on Knowledge Discovery in Data Mining
(KDD). 1723–1732.
[18]
Blake Anderson and David McGrew. 2019. TLS Beyond the Browser: Combin-
ing End Host and Network Data to Understand Application Behavior. In ACM
SIGCOMM Internet Measurement Conference (IMC). 379–392.
[19]
Blake Anderson, Subharthi Paul, and David McGrew. 2017. Deciphering Mal-
ware’s Use of TLS (without Decryption). Journal of Computer Virology and
Hacking Techniques (2017), 1–17.
[20] Pieter Arntz. 2019. Spotlight on Troldesh Ransomware, aka ’Shade’.
https://blog.malwarebytes.com/threat-analysis/2019/03/spotlight-troldesh-ransomware-aka-shade/.
(2019).
[21] David Benjamin. 2017. Applying GREASE to TLS Extensibility. Internet-Draft
(Informational). (2017). https://tools.ietf.org/html/draft-ietf-tls-grease-03.
[22]
Laurent Bernaille and Renata Teixeira. 2007. Early Recognition of Encrypted Ap-
plications. In International Conference on Passive and Active Network Measurement.
165–175.
[23]
Karthikeyan Bhargavan and Gaëtan Leurent. 2016. On the Practical (in-) Security
of 64-bit Block Ciphers: Collision Attacks on HTTP over TLS and OpenVPN.
In ACM SIGSAC Conference on Computer and Communications Security (CCS).
456–467.
[24] Lee Brotherston. 2015. FingerprinTLS. (2015).
https://github.com/synackpse/tls-fingerprinting.
[25]
Manuel Crotti, Maurizio Dusi, Francesco Gringoli, and Luca Salgarelli. 2007.
Traffic classification through simple statistical fingerprinting. Computer Com-
munication Review 37, 1 (2007), 5–16. https://doi.org/10.1145/1198255.1198257
[26] Tim Dierks and Eric Rescorla. 2008. The Transport Layer Security (TLS) Protocol
Version 1.2. RFC 5246 (Proposed Standard). (2008).
http://www.ietf.org/rfc/rfc5246.txt.
[27] Alban Diquet. 2019. SSLyze. (2019). https://github.com/nabla-c0d3/sslyze.
[28] Donald Eastlake. 2011. Transport Layer Security (TLS) Extensions: Extension
Definitions. Internet-Draft (Standards Track). (2011).
https://tools.ietf.org/html/rfc6066.
[29]
Brown Farinholt, Mohammad Rezaeirad, Damon McCoy, and Kirill Levchenko.
2020. Dark Matter: Uncovering the DarkComet RAT Ecosystem. In ACM Inter-
national World Wide Web Conference. 2109–2120.
[30] Roy Fielding and Julian Reschke. 2014. Hypertext Transfer Protocol (HTTP/1.1):
Semantics and Content. RFC 7231 (Proposed Standard). (2014).
http://www.ietf.org/rfc/rfc7231.txt.
[31]
Sergey Frolov and Eric Wustrow. 2019. The use of TLS in Censorship Circum-
vention. In Network and Distributed System Security Symposium (NDSS).
[32] Colin Grady, William Largent, and Jaeson Schultz. 2019. Emotet is
Back After a Summer Break.
https://blog.talosintelligence.com/2019/09/emotet-is-back-after-summer-break.html.
(2019).
[33]
Ralph Holz, Johanna Amann, Olivier Mehani, Matthias Wachs, and Mohamed Ali
Kaafar. 2016. TLS in the Wild: An Internet-wide Analysis of TLS-based Proto-
cols for Electronic Communication. In Network and Distributed System Security
Symposium (NDSS).
[34] Martin Husák, Milan Čermák, Tomáš Jirsík, and Pavel Čeleda. 2015. Network-
Based HTTPS Client Identification using SSL/TLS Fingerprinting. In Availability,
Reliability and Security (ARES). 389–396.
[35] Jaroslaw Jedynak. 2017. A Deeper Look at Tofsee Modules.
https://www.cert.pl/en/news/single/a-deeper-look-at-tofsee-modules/#4-proxyrdll.
(2017).
[36]
Platon Kotzias, Abbas Razaghpanah, Johanna Amann, Kenneth G. Paterson,
Narseo Vallina-Rodriguez, and Juan Caballero. 2018. Coming of Age: A Lon-
gitudinal Study of TLS Deployment. In ACM SIGCOMM Internet Measurement
Conference (IMC). 415–428.
[37] Marc Liberatore and Brian Neil Levine. 2006. Inferring the Source of Encrypted
HTTP Connections. In Proceedings of the Thirteenth ACM Conference on Computer
and Communications Security (CCS). 255–263.
[38]
David McGrew, Brandon Enright, and Blake Anderson. 2020. Mercury: Fast TLS,
TCP, and IP Fingerprinting. https://github.com/cisco/mercury. (2020).
[39]
Andrew W Moore and Denis Zuev. 2005. Internet Traffic Classification Using
Bayesian Analysis Techniques. SIGMETRICS Performance Evaluation Review 33
(2005), 50–60.
[40]
Abbas Razaghpanah, Arian Akhavan Niaki, Narseo Vallina-Rodriguez, Srikanth
Sundaresan, Johanna Amann, and Phillipa Gill. 2017. Studying TLS Usage in
Android Apps. In International Conference on emerging Networking EXperiments
and Technologies (CoNEXT). 350–362.
[41] ioerror rbsec. 2019. sslscan. (2019). https://github.com/rbsec/sslscan.
[42]
Eric Rescorla. 2018. The Transport Layer Security (TLS) Protocol Version 1.3.
RFC 8446 (Proposed Standard). (2018). http://www.ietf.org/rfc/rfc8446.txt.
[43] Eric Rescorla, Kazuho Oku, Nick Sullivan, and Christopher Wood. 2020.
Encrypted Server Name Indication for TLS 1.3. Internet-Draft (Experimental).
(2020). https://tools.ietf.org/html/draft-ietf-tls-esni-06.
[44] Ivan Ristić. 2009. HTTP Client Fingerprinting using SSL Handshake
Analysis. (2009).
https://blog.ivanristic.com/2009/06/http-client-fingerprinting-using-ssl-handshake-analysis.html.
[45] Ivan Ristić. 2012. sslhaf. (2012). https://github.com/ssllabs/sslhaf.
[46]
Vincent F. Taylor, Riccardo Spolaor, Mauro Conti, and Ivan Martinovic. 2016.
AppScanner: Automatic Fingerprinting of Smartphone Apps From Encrypted
Network Traffic. In IEEE European Symposium on Security and Privacy. 439–454.
[47] Thijs van Ede, Riccardo Bortolameotti, Andrea Continella, Jingjing Ren, Daniel J
Dubois, Martina Lindorfer, David Choffnes, Maarten van Steen, and Andreas
Peter. 2020. FLOWPRINT: Semi-Supervised Mobile-App Fingerprinting on
Encrypted Network Traffic. In Network and Distributed System Security Symposium
(NDSS).
[48] Quaizar Vohra and Enke Chen. 2012. BGP Support for Four-Octet Autonomous
System (AS) Number Space. Internet-Draft (Standards Track). (2012).
https://tools.ietf.org/html/rfc6793.
[49]
Charles V Wright, Fabian Monrose, and Gerald M Masson. 2006. On Inferring
Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine
Learning Research (JMLR) (2006), 2745–2769.
[50]
Harry Zhang and Shengli Sheng. 2004. Learning Weighted Naive Bayes with
Accurate Ranking. In IEEE International Conference on Data Mining (ICDM’04).
567–570.
[51]
Wei Zhang, Yan Meng, Yugeng Liu, Xiaokuan Zhang, Yinqian Zhang, and Haojin
Zhu. 2018. HoMonit: Monitoring Smart Home Apps from Encrypted Traffic.
In ACM SIGSAC Conference on Computer and Communications Security (CCS).
1074–1088.
A TLS EXTENSIONS WITH DATA
Extension Name Extension Hex Code
max_fragment_length 0001
status_request 0005
client_authz 0007
server_authz 0008
cert_type 0009
supported_groups 000a
ec_point_formats 000b
signature_algorithms 000d
heartbeat 000f
application_layer_protocol_negotiation 0010
status_request_v2 0011
client_certificate_type 0013
server_certificate_type 0014
token_binding 0018
compress_certificate 001b
record_size_limit 001c
supported_versions 002b
psk_key_exchange_modes 002d
signature_algorithms_cert 0032
channel_id 5500
GREASE 0a0a
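
The table above can be read as a lookup set. The Python sketch below shows one way a fingerprinting tool might use it, assuming (as the appendix title suggests) that extension data is kept only for the listed types and that GREASE extension types are collapsed to 0a0a. The function names and the parenthesized hex output are assumptions made for illustration and are not necessarily the format that mercury emits.

```python
# Extension type codes copied from the appendix table (hex).
EXTENSIONS_WITH_DATA = {
    0x0001, 0x0005, 0x0007, 0x0008, 0x0009, 0x000A, 0x000B,
    0x000D, 0x000F, 0x0010, 0x0011, 0x0013, 0x0014, 0x0018,
    0x001B, 0x001C, 0x002B, 0x002D, 0x0032, 0x5500, 0x0A0A,
}


def is_grease(ext_type: int) -> bool:
    """True for GREASE values 0x0a0a, 0x1a1a, ..., 0xfafa."""
    high, low = ext_type >> 8, ext_type & 0xFF
    return high == low and (low & 0x0F) == 0x0A


def normalize_extension(ext_type: int, data: bytes) -> str:
    """Render one extension as a hex token (hypothetical layout).

    Listed types keep type, length, and data; all others keep the type only.
    """
    if is_grease(ext_type):
        ext_type = 0x0A0A  # collapse every GREASE value to one code
    if ext_type in EXTENSIONS_WITH_DATA:
        return f"({ext_type:04x}{len(data):04x}{data.hex()})"
    return f"({ext_type:04x})"
```

For example, an ec_point_formats extension (000b) carrying the two bytes 01 00 would be rendered with its data, while an extension type that is not in the table would be reduced to its type code alone.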