Accurate TLS Fingerprinting using Destination Context and Knowledge Bases

Blake Anderson

Cisco

blake.anderson@cisco.com

David McGrew

Cisco

mcgrew@cisco.com

ABSTRACT

Network fingerprinting is used to identify applications, provide

insight into network traffic, and detect malicious activity. With

the broad adoption of TLS, traditional fingerprinting techniques

that rely on clear-text data are no longer viable. TLS-specific tech-

niques have been introduced that create a fingerprint string from

carefully selected data features in the

client_hello

to facilitate

process identification before data is exchanged. Unfortunately, this

approach fails in practice because hundreds of processes can map to

the same fingerprint string. We solve this problem by presenting a

TLS fingerprinting system that makes use of the destination address,

port, and server name in addition to a carefully constructed finger-

print string. The destination context is used to disambiguate the set

of processes that match a fingerprint string by applying a weighted

naïve Bayes classifier, resulting in far greater performance.

Our methods are made possible by a data fusion system that

continuously collects and fuses host and network data, building

up-to-date fingerprint knowledge bases that correlate TLS finger-

print strings, processes, and destinations for 50+ million real-world

sessions each day. Using data collected from two geographically

distinct sites and a malware analysis sandbox, we demonstrate

that our solution can achieve an F1 score of greater than 0.99 for

process identification and high efficacy malware detection with

99.9% precision and 88.7% recall. We provide specific results for

the set of most common processes and a set of cloud orchestration

tools. In the case of no exact fingerprint string matches, we demon-

strate that our system can accommodate approximate fingerprint

string matching with an F1 score of 0.90. Finally, we have released

an open source tool, mercury [

], that implements the proposed

techniques and provide weekly updates to an open source TLS

fingerprint knowledge base to assist reproducibility of our work.

CCS CONCEPTS

• Security and privacy → Network security; Malware and its mitigation;

KEYWORDS

TLS Fingerprinting; Process Identification; Malware

1 INTRODUCTION

Process identification from network traffic aids many use cases

including network segmentation, malware detection, and vulner-

able application detection. The HTTP

User-Agent

[

] has been

used as a proxy for process identification, but with the increasing

use of Transport Layer Security (TLS) [

], methods that rely

on clear-text data have become obsolete. Existing solutions have

been proposed for identifying processes that have initiated TLS

connections [

], but these solutions must observe complete

sessions making them unsuitable for real-time enforcement.

TLS fingerprinting has been proposed as a technique to enable

real-time enforcement by providing the initiating process’s identity

after observing the client’s initial TLS packet. Traditional TLS finger-

printing extracts metadata presented in the TLS

client_hello

and

generates a fingerprint string using a pre-defined schema. These

techniques are relevant for all versions of the TLS protocol, in-

cluding TLS 1.3 [

] where all needed data features are still pre-

sented unencrypted. Given a fingerprint string, TLS fingerprinting

then maps that string to a process by using a dictionary of known

fingerprint-to-process mappings. Unfortunately, TLS fingerprint

strings are often more indicative of a TLS library than they are of a

specific process, with fingerprint strings often mapping to tens or

hundreds of unique processes.

This limitation can be seen by analyzing publicly available mal-

ware fingerprint feeds. Abuse.ch, which cautions that their feed has

“not been tested against known good traffic yet and may cause a

significant amount of FPs”, provides a list of 70 JA3 [

] hashes used

by malware. We reverse engineered 68 of the hashes and mapped

them to our data format. 59 of the indicators were more strongly

associated with benign processes, such as Internet Explorer, Python,

and Java, than they were with malware. 12 of the indicators led to

1 million or more false positives during a 10-day period for one of

our testing sites. While the TLS fingerprint string taken by itself is

often a poor indicator, additional contextual information can help

to increase performance.

In this paper, we generalize TLS fingerprinting by incorporat-

ing contextual information contained within the

client_hello

packet. Our approach uses the destination IP address, port, and

server_name

value (if available) to disambiguate potential pro-

cesses. We define equivalence classes for the destination features

to help generalize to unseen destination values. As an example, the

classification system uses both the IP address and the autonomous

system of the IP address. We combine the features using a simple

weighted naïve Bayes classifier, which relies on probability esti-

mates provided by our fingerprint knowledge base. We show that

our approach of simultaneously considering the TLS fingerprint string and the destination information is a significant improvement compared to systems based solely on the fingerprint string or the destination information.

An underlying assumption of TLS fingerprinting is that there

exists a well-curated database that maps TLS fingerprint strings

to process or library names. This is especially true for our work,

where the naïve Bayes algorithm requires a knowledge base that

provides prevalence information for each process associated with a

fingerprint string, along with counts for each destination feature

that a given process visited while using a specific TLS fingerprint

arXiv:2009.01939v1 [cs.CR] 3 Sep 2020

string. We built a system that continuously fuses real-world host

and network data, which we use to build a TLS knowledge base

that reflects the most recent TLS usage on the monitored networks

based on billions of connections. Furthermore, the automated nature

of our knowledge base generation ensures that our system stays

current with destination information that frequently changes.

To demonstrate the efficacy of the proposed solution, the clas-

sifier is trained on data collected from a site in GMT+19 during

the month of May 2020 and applied to data collected during the

first ten days of June 2020 from the same site as well as a site in

GMT+8. We provide the process identification F1 score, as well as

precision/recall results highlighting how the classifier performs on

the top applications and cloud orchestration and processing tools.

Malware’s use of TLS has been well-documented [

], with many

malware families prone to using standard libraries as discussed

above. We use the results of a malware analysis sandbox to demon-

strate that our techniques are able to achieve 99.9% precision and

88.7% recall for the malware detection task.

Operational concerns or parameters are often ignored in the

construction and application of TLS fingerprint databases. TLS fingerprint strings and the way applications use TLS are constantly

evolving [

], leading to a steady introduction of new finger-

print strings. Accommodating approximate matching helps manage

unseen fingerprint strings, and we show that an approximate matching scheme based on edit distance can still achieve an F1 score of up to 0.902 on sessions with an unknown fingerprint string. We also

demonstrate the importance of a well-tuned knowledge base by

looking at the classifier’s performance when it is not kept up to

date. A knowledge base in our format that captures all process and

destination information for hundreds of billions of sessions can be

up to several hundred megabytes, reducing the efficiency of online

classification. We show that incorporating data that is older than

one month reduces classification performance and unnecessarily

increases the size of the database.

To support this work, we developed an open source C/C++ tool,

mercury [

], that uses Linux’s AF_PACKET TPACKETv3 zero-copy

shared memory ring buffers to collect and classify data on network

links with capacities of 30Gbps+. We also released a python-based

implementation to facilitate rapid prototyping. We are committed

to releasing up-to-date TLS fingerprint knowledge bases, which

we have currently done weekly during the first seven months of

2020. Some information has been removed from the open source

fingerprint knowledge base relative to the internal knowledge bases,

and we highlight the difference in expected performance in Section

Our novel contributions include:

• The introduction of a TLS fingerprinting system that incorporates destination information and accommodates approximate matching to provide state-of-the-art process and malware identification results.

• A study of the impact of different knowledge base configuration options on the classifier’s performance.

• An open source tool that implements all presented techniques along with a TLS fingerprint knowledge base that is updated weekly and currently has over 8,400 TLS fingerprint strings with detailed process and destination information.

[Figure 1 content. Input parameters: Version: 0303; CipherSuites: 0a0a...; Extensions: 0000...; server_name: google.com; destination IP: 8.8.8.8; destination port: 443. Output labels: Process: chrome.exe; Version: 79.0.3945; SHA-256: 5616...; Category: browser; Malware: False; OS: WinNT; OS version: 10.0.18363; OS edition: Enterprise.]

Figure 1: Classical TLS fingerprinting aims to map parameters extracted from the client_hello to a set of informative labels such as the process name or operating system. Gray features are only used in generalized TLS fingerprinting.

2 TLS FINGERPRINTING

Transport Layer Security (TLS) [

] is the most popular proto-

col to secure communications over the Internet. A client begins a

TLS session by sending a

client_hello

message, which contains

a TLS version, a list of supported cipher suites ordered by prefer-

ence, and a list of extensions that provide additional context, e.g.,

the

server_name

extension provides the DNS hostname so that

front-end servers can route connections without having to perform

decryption [

]. The server then responds with a

server_hello

message that selects a set of cryptographic parameters offered by

the client and a

certificate

message proving the server’s iden-

tity. The server then initiates the key exchange, after which the

client and server send

finished

messages and begin exchanging

encrypted data.

TLS fingerprinting operates on the initial

client_hello

mes-

sage, allowing for real-time policy enforcement before the TLS

handshake completes or encrypted data is exchanged. Figure 1

shows a simple example of an idealized TLS fingerprinting system,

where the goal is to learn some function that maps parameters

offered in the client_hello to a set of informative labels such as

the process name or operating system. In this work, we focus on

identifying the process name.

Several goals are important for the generation of a TLS finger-

print string. The fingerprint string format should be unambiguous

and reversible. It should provide the most discriminating power pos-

sible, taking advantage of every informative data feature. It should

be able to accommodate complex patterns such as TLS GREASE

[

]. Lastly, it should be computationally inexpensive, and robust

against Denial of Service (DoS) attacks that aim to exhaust system

resources. Our system uses the following fingerprint schema:

(version)(cipher suites)((ext1)(ext2)...)

where each field is a hex string corresponding to the bytes

observed in the

client_hello

to facilitate reversibility, i.e., it is

possible to determine substrings that were in the packet from

the fingerprint string. To help ensure discriminating power,

all cipher suite and extension orderings are maintained, and

((ext1)(ext2)...)

includes the data for 21 extensions along with

all type codes. The data associated with session-specific extensions is omitted, e.g., server_name and key_share, and data associated with client-specific parameters is kept, e.g., supported_groups, supported_versions, and compress_certificate. A full list of extensions that retain their data in the fingerprint schema is given in Appendix A. Along with normalizing session-specific extension data, GREASE [ ] cipher suites, extension types, and extension data values are normalized to the value 0a0a, but their ordering is preserved. Finally, we avoid any cryptographic computations on the fingerprint string itself, such as computing an MD5 hash, for computational efficiency.

Figure 2: Data fusion system that correlates network and host logs in order to attribute network sessions to processes.
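The schema above can be illustrated with a minimal Python sketch. It assumes the client_hello has already been parsed into a version, a list of cipher suite code points, and a list of (type, data) extension pairs. The set RETAINED_EXT_DATA is a hypothetical three-entry stand-in for the full list in Appendix A, and encoding retained data as a two-byte length followed by the raw bytes is an assumption made for illustration. GREASE values embedded inside extension data are not normalized here.

```python
GREASE_PLACEHOLDER = "0a0a"

# Hypothetical subset of extension types whose data is retained:
# supported_groups (10), supported_versions (43), compress_certificate (27).
RETAINED_EXT_DATA = {0x000A, 0x002B, 0x001B}


def is_grease(value: int) -> bool:
    """True for the GREASE code points 0x0a0a, 0x1a1a, ..., 0xfafa."""
    high, low = value >> 8, value & 0xFF
    return high == low and (low & 0x0F) == 0x0A


def hex16(value: int) -> str:
    """Hex-encode a 16-bit code point, mapping every GREASE value to 0a0a."""
    return GREASE_PLACEHOLDER if is_grease(value) else f"{value:04x}"


def fingerprint_string(version, cipher_suites, extensions):
    """Build (version)(cipher suites)((ext1)(ext2)...), preserving order."""
    suites = "".join(hex16(cs) for cs in cipher_suites)
    ext_fields = []
    for ext_type, data in extensions:
        field = hex16(ext_type)
        if ext_type in RETAINED_EXT_DATA:
            # Assumed encoding: two-byte length followed by the raw data.
            field += f"{len(data):04x}" + data.hex()
        ext_fields.append(f"({field})")
    return f"({version:04x})({suites})({''.join(ext_fields)})"


example = fingerprint_string(
    0x0303,
    [0x1A1A, 0xC02B],                      # GREASE value, then a real suite
    [(0x0000, b"ignored"),                 # server_name: data omitted
     (0x000A, bytes.fromhex("0004001d0017"))],
)
```

Because no hash is applied, the individual cipher suites and extensions can be read back out of the resulting string, which is the reversibility property described above.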

Given a TLS fingerprint string, traditional fingerprinting systems

would return a single process or set of processes that have been

observed using that fingerprint string as defined in a fingerprint

database. Unfortunately, TLS fingerprint strings often map to many

processes due to those processes using the same underlying TLS

library. For example, in the data we collected during the month

of May 2020, the median number of unique process names for the

top-100 most prevalent fingerprint strings was 24.5. To provide

actionable intelligence, a TLS fingerprinting system needs signifi-

cantly more specificity than current approaches provide.

3 GENERALIZED TLS FINGERPRINTING

To overcome the limitations of previous systems, we put forward

an approach to generalize TLS fingerprinting as explained in this

section. Our approach is centered around the construction of a TLS

fingerprint knowledge base that relies on continuous, large-scale

data collection, curation, and fusion. We also propose an approxi-

mate matching scheme to accommodate the introduction of new

fingerprint strings for completeness. To further improve the robust-

ness of our system, we introduce equivalence classes on destination

features to better handle unseen destination/process combinations.

Finally, given a match in our knowledge base, we use a weighted

naïve Bayes classifier to provide the most probable process name

given the list of potential processes and their destinations.

[Figure 3 diagram. Stages shown: packets, protocol identification, fingerprint extraction, substring normalization, exact matcher, approximate matcher.]

Figure 3: Control flow for online matching.

3.1 Knowledge Base Generation

Traditional TLS fingerprint databases provide mappings from TLS

fingerprint strings to processes, libraries, or operating systems,

but lack the context needed to disambiguate the results. The TLS

fingerprint knowledge base provides this context by associating

each TLS fingerprint string with a list of processes observed using

it, along with destination and operating system information. We

have built a data fusion system as shown in Figure 2 to collect this

data. For the host data, we use records sent by the AnyConnect

Network Visibility Module (NVM) [

], which contain the network

5-tuple, an event start timestamp, the name of the communicating

process, the SHA-256 hash of the process executable, and the host’s

operating system. For the network data, we use our custom open

source tool, mercury [

], which reports the TLS fingerprint string

along with the network 5-tuple and an event start timestamp.

The host and network data are joined daily by the network

5-tuple and event start timestamps. If there are network 5-tuple

collisions, the records with the minimal timestamp delta are joined.

We discard records with timestamp deltas greater than 5 seconds,

which resulted in less than a 0.1% data reduction. The joined records

contain the TLS fingerprint string, destination information such

as the IP address and

server_name

extension value, and process

attribution information.
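The join described above can be sketched as follows. This is an illustrative Python sketch rather than the production data fusion system: the record layout (dictionaries with five_tuple, ts, process, and sha256 keys) is hypothetical, and only the 5-second threshold and the minimal-delta rule come from the text.

```python
from collections import defaultdict

MAX_DELTA_SECONDS = 5.0  # records further apart than this are discarded


def join_records(host_records, network_records):
    """Attribute each network record to the closest-in-time host record
    that shares its network 5-tuple."""
    hosts_by_tuple = defaultdict(list)
    for host in host_records:
        hosts_by_tuple[host["five_tuple"]].append(host)

    joined = []
    for net in network_records:
        candidates = hosts_by_tuple.get(net["five_tuple"], [])
        if not candidates:
            continue
        # On 5-tuple collisions, keep the minimal timestamp delta.
        best = min(candidates, key=lambda h: abs(h["ts"] - net["ts"]))
        if abs(best["ts"] - net["ts"]) > MAX_DELTA_SECONDS:
            continue
        joined.append({**net, "process": best["process"],
                       "sha256": best["sha256"]})
    return joined
```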

The fused records are then used to condition a knowledge base.

Records are grouped by their TLS fingerprint string, collecting a

list of all associated processes. For each process, we record the total

number of sessions and all destinations with associated counts. Each

destination is represented as a 3-tuple comprised of the IP address,

port, and

server_name

value (if present). Separate knowledge bases

are generated for each day’s traffic, and then merged to create an

operational knowledge base. We use this flexibility to discard older

data as described in Section 6.
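A minimal sketch of this grouping and merging step is given below. The nested dictionary layout and the field names are assumptions chosen for illustration and do not reflect the format of the released knowledge base.

```python
from collections import Counter


def _empty_entry():
    return {"count": 0, "destinations": Counter()}


def build_daily_kb(joined_records):
    """Group one day's joined records by fingerprint string and process."""
    kb = {}
    for rec in joined_records:
        processes = kb.setdefault(rec["fingerprint"], {})
        entry = processes.setdefault(rec["process"], _empty_entry())
        entry["count"] += 1
        # Destination 3-tuple: IP address, port, server_name (may be absent).
        dest = (rec["dst_ip"], rec["dst_port"], rec.get("server_name"))
        entry["destinations"][dest] += 1
    return kb


def merge_kbs(daily_kbs):
    """Merge per-day knowledge bases into one operational knowledge base.
    Older data is discarded simply by leaving its daily file out."""
    merged = {}
    for kb in daily_kbs:
        for fp, processes in kb.items():
            for name, entry in processes.items():
                target = merged.setdefault(fp, {}).setdefault(
                    name, _empty_entry())
                target["count"] += entry["count"]
                target["destinations"].update(entry["destinations"])
    return merged
```

Keeping one knowledge base per day means the retention window can be changed without reprocessing the raw records.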

In addition to real-world host and network data, we also generate

knowledge bases from the artifacts of a malware analysis sandbox.

The joining procedure is similar but relies on packet captures and

analysis files to create the joined records.

3.2 Approximate Matching

TLS fingerprint strings are constantly evolving. Kotzias et al. [

]

demonstrated this evolution and how the disclosure of security

flaws in TLS impacts the TLS versions, cipher suites, and extensions

clients offer. Our full knowledge base, collected between July 2019 and June 2020, contains over 8,000 fingerprint strings with process

information, but we still consistently add 10-20 new fingerprint

strings with process information per day.

Similar to previous work [

], we dealt with this issue by imple-

menting a fingerprint string similarity metric based on the Leven-

shtein distance, which measures the number of cipher suite and

extension insertions, deletions, or substitutions that are needed

to transform one fingerprint string into another. When we see a fingerprint string that is missing from our database during online analysis, we compute the Levenshtein distance between it and every known fingerprint string and select the known fingerprint string with the minimal distance. The fingerprint string’s prevalence is used to break any ties. Results specific to approximate matching are given in Section 5.4.
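The approximate matching step can be sketched in Python as follows. The sketch assumes each fingerprint string has already been split into a tuple of tokens (one per cipher suite and extension), so that the edit distance counts whole-token insertions, deletions, and substitutions. The `known` mapping from token tuples to prevalence counts is a hypothetical stand-in for the knowledge base.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (dynamic programming,
    keeping only the previous row)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def closest_fingerprint(unknown, known):
    """Return the known fingerprint with minimal distance to `unknown`;
    ties are broken in favour of the more prevalent fingerprint."""
    return min(known, key=lambda fp: (levenshtein(unknown, fp), -known[fp]))
```

The search is linear in the number of known fingerprint strings, which is why caching the result for each newly seen string keeps the amortized cost low.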

Figure 3 demonstrates the system’s control flow from observ-

ing a packet on the wire to reporting the appropriate entry in the

knowledge base. We first perform protocol identification using a

bit mask over the first 8 bytes of application data, matching known

TLS versions and the

client_hello

’s record and message identifier.

We then extract the fingerprint string conforming to the schema

described in Section 2, and normalize session-specific extension

data and GREASE values. We then report the exact match if the

fingerprint string is currently in our knowledge base or use the

approximate matching technique as described above. While approx-

imate matching is computationally expensive, we store the results

in the knowledge base which leads to a low amortized cost.
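The protocol identification step can be illustrated with a short Python sketch of a bit mask applied to the first 8 bytes of the payload. The MASK and VALUE constants below are assumptions chosen to match a handshake record (0x16), a record version of 0x0300 through 0x0303, and a client_hello message type (0x01); they are not claimed to be the exact constants used by mercury.

```python
# Byte layout of the first 8 bytes of a TLS handshake record:
#   0: content type, 1-2: record version, 3-4: record length,
#   5: handshake message type, 6-7: start of the handshake length.
MASK = bytes.fromhex("fffffc0000ff0000")   # assumed mask
VALUE = bytes.fromhex("1603000000010000")  # assumed expected value


def looks_like_client_hello(payload: bytes) -> bool:
    """Cheap pre-filter: compare masked bytes against the expected value."""
    if len(payload) < len(MASK):
        return False
    return all((p & m) == v for p, m, v in zip(payload, MASK, VALUE))
```

Length bytes are masked out entirely, so the check costs a handful of byte comparisons per packet before any parsing is attempted.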

3.3 Equivalence Classes

Similar to the introduction of new fingerprint strings, the des-

tination information associated with a process can change over

time. The changes often take the form of a new subdomain in the

server_name

or a different IP address within the same public cloud

environment. Our classification system described in Section 3.4

uses probabilities conditioned on real-world observations to select

the most probable process. If a

server_name

has a random compo-

nent in the subdomain but the domain name remains constant, this

would result in a zero probability despite the

server_name

being

obviously related to the known process.

We solve the problem of unseen destination values by intro-

ducing equivalence classes for the observable features, and we use

those equivalence classes in the classifier. Each equivalence relation

partitions the set of features into distinct subsets. There may be

more than one useful equivalence relation for a feature, since there

are multiple ways that addresses, ports, and domain names can be

related. All the addresses within a BGP Autonomous System (AS)

[

] are equivalent in some way, as are the addresses in a corporate

offering such as Microsoft Office 365, Azure, AWS, or Cloudflare.

In the current work, we test four equivalence classes. For the

server_name

feature, we extract the domain name and the top-

level domain using Mozilla’s Public Suffix List [

]. The IP address

is mapped to the corresponding BGP AS using MaxMind’s GeoLite2 database [11]. The ports are mapped to a known application layer protocol, e.g., 443 → HTTPS and 993/995 → email, with unknown and ephemeral ports mapping to “unknown”. While a more interesting set of equivalence classes would undoubtedly improve the performance of the classification system, it would not change the mechanics of the classifier and we leave these investigations to future work as discussed in Section 9.

Table 1: Weighted naïve Bayes feature weights as found with the information gain ratio.

Feature                   Weight
server_name               0.97192
server_name → Domain      0.16200
server_name → TLD         0.01044
IP                        0.53294
IP → AS                   0.10343
Port                      0.00396
Port → Port Class         0.00265
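The four equivalence mappings can be sketched in Python as shown below. To keep the example self-contained, the Public Suffix List and the GeoLite2 database are replaced by tiny hypothetical lookup tables (PUBLIC_SUFFIXES, AS_PREFIXES), and PORT_CLASSES contains only the port examples given in the text.

```python
import ipaddress

# Hypothetical stand-ins for the Public Suffix List and an IP-to-AS database.
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}
AS_PREFIXES = {ipaddress.ip_network("8.8.8.0/24"): 15169}
PORT_CLASSES = {443: "https", 993: "email", 995: "email"}


def domain_and_tld(server_name):
    """Return (registrable domain, public suffix) using longest-suffix match."""
    labels = server_name.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):]), suffix
    return server_name, labels[-1]  # unknown suffix: fall back to last label


def equivalence_classes(server_name, dst_ip, dst_port):
    """Map raw destination features to their equivalence classes."""
    domain, tld = domain_and_tld(server_name)
    addr = ipaddress.ip_address(dst_ip)
    asn = next((a for net, a in AS_PREFIXES.items() if addr in net), None)
    return {
        "domain": domain,
        "tld": tld,
        "as": asn,
        "port_class": PORT_CLASSES.get(dst_port, "unknown"),
    }
```

With these mappings, a server_name with a previously unseen random subdomain still shares its domain class with earlier observations, which avoids the zero-probability problem described above.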

3.4 Weighted Naïve Bayes Model

Given an exact or approximate fingerprint string match and a set

of destination features along with their equivalence mappings, our

goal is now to select the most probable process from the fingerprint

string’s list of possible processes.

We formalize the system as follows. Each session is associated with a process $z$, which is not directly observed, as well as the observed destination features and equivalence mappings, $f_1, \ldots, f_n$. $Z_{fp}$ denotes the set of processes previously observed using the matched fingerprint string as given by the knowledge base. Our goal is to construct a classifier $c$ that, given a TLS fingerprint string $fp$, returns the process that maximizes $P(z \mid f_1, \ldots, f_n)$ for $z \in Z_{fp}$. For interpretability and computational efficiency, we chose the naïve Bayes model:

\begin{align}
c(f_1, \ldots, f_n) &= \operatorname*{argmax}_{z \in Z_{fp}} \; P(z \mid f_1, \ldots, f_n) \tag{1} \\
&= \operatorname*{argmax}_{z \in Z_{fp}} \; P(z) \prod_{i=1}^{n} P(f_i \mid z) \tag{2} \\
&= \operatorname*{argmax}_{z \in Z_{fp}} \; \log P(z) + \sum_{i=1}^{n} \log P(f_i \mid z). \tag{3}
\end{align}

Equation 1 simply defines the classifier as the function that returns

the most probable process given the destination feature set. Equa-

tion 2 applies Bayes’ theorem, removes the irrelevant denominator,

and applies the naïve Bayes assumption that each observed feature

is conditionally independent from all other observed features. Equa-

tion 3 simplifies the computation and helps to prevent underflow

by moving to sums of logarithms.

$P(f_i \mid z)$ is computed by using the empirical probability estimates provided by the knowledge base. In cases where $P(f_i \mid z) = 0$, we use a prior probability equal to the reciprocal of the total number of sessions observed using a given fingerprint string in the knowledge base.

Computing $P(f_i \mid z)$ directly from the knowledge base and leveraging a naïve Bayes model helps to avoid an unreasonably large number of features that would be needed by other machine learning alternatives such as deep neural networks or support vector machines.

Algorithm 1 Process identification.

Given: fingerprint, destination_features
proc_list.initialize_to_empty_list()
for z ∈ fingerprint.processes do
    q ← log P(z)
    for f ∈ destination_features do
        q ← q + w_f · log P(f|z)
        for γ ∈ f.eqv_classes do
            e ← γ(f)
            q ← q + w_γ · log P(e|z)
        end for
    end for
    proc_list.append(q, z)
end for
return proc_list.get_maximum()

Because the destination features and equivalence mappings pro-

vide varying levels of information about the process initiating a

TLS session, we opted to modify the naïve Bayes algorithm to use

a weighted combination of the features as described by Zhang and

Sheng [

]. The feature weights are computed using the information

gain ratio conditioned on the knowledge base. The weights found with this method using a knowledge base constructed from data collected during May 2020 are given in Table 1. Both the

server_name

and IP address have high weights and are stronger indicators of a

process given a fingerprint string relative to the port information,

which is heavily biased towards port 443 as shown in Section 4.

Section 5.5 provides results when we consider subsets of destination

features.
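For intuition, the standard information gain ratio of a single feature with respect to the process label can be computed as in the Python sketch below. This is the textbook definition (information gain divided by the entropy of the feature); any additional normalization applied by Zhang and Sheng or by our system is not reproduced here, and the (feature value, process) sample format is an assumption.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a collection of non-negative counts."""
    counts = [c for c in counts if c]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)


def gain_ratio(samples):
    """samples: list of (feature_value, process) pairs.
    Returns (H(Z) - H(Z|F)) / H(F), or 0.0 if the feature is constant."""
    n = len(samples)
    h_z = entropy(Counter(z for _, z in samples).values())
    value_counts = Counter(f for f, _ in samples)
    h_f = entropy(value_counts.values())
    if h_f == 0:
        return 0.0
    h_z_given_f = 0.0
    for value, count in value_counts.items():
        per_process = Counter(z for f, z in samples if f == value)
        h_z_given_f += count / n * entropy(per_process.values())
    return (h_z - h_z_given_f) / h_f
```

A feature that almost always takes the same value, such as a port that is nearly always 443, yields little information gain, which is consistent with the low port weights in Table 1.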

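With the feature weights, $w_{f_i}$, computed, the new equation to compute the most probable process becomes:

\begin{equation}
c(f_1, \ldots, f_n) = \operatorname*{argmax}_{z \in Z_{fp}} \; \log P(z) + \sum_{i=1}^{n} w_{f_i} \cdot \log P(f_i \mid z). \tag{4}
\end{equation}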

Algorithm 1 summarizes the procedure to find the most probable

process given a fingerprint string from the knowledge base and the

destination information from the session.
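The procedure can be sketched in Python as follows. This is an illustrative sketch, not the mercury implementation: the layout of `fp_entry` is hypothetical, and the equivalence-class features are assumed to have been expanded into the `features` list beforehand, which flattens the inner loop of Algorithm 1. Unseen feature values fall back to a probability of one over the total number of sessions for the fingerprint string.

```python
import math


def identify_process(fp_entry, features, weights):
    """Weighted naive Bayes over the processes of one fingerprint string.

    fp_entry: {"total": int,
               "processes": {name: {"count": int,
                                    "features": {(kind, value): count}}}}
    features: list of (kind, value) pairs observed for the session
    weights:  {kind: weight}, e.g. the values in Table 1
    """
    total = fp_entry["total"]
    best_name, best_score = None, -math.inf
    for name, proc in fp_entry["processes"].items():
        score = math.log(proc["count"] / total)          # log P(z)
        for kind, value in features:
            seen = proc["features"].get((kind, value), 0)
            # Empirical estimate, or the 1/total fallback when unseen.
            p = seen / proc["count"] if seen else 1.0 / total
            score += weights[kind] * math.log(p)         # w_f * log P(f|z)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

With an empty feature list the function reduces to selecting the most prevalent process for the fingerprint string.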

4 DATA

We used mercury [

] to collect network data from a site located

within GMT+19, referred to as Site 1 (GMT+19) in Section 5, that also

reports host logs from the AnyConnect Network Visibility Module

[

]. The host and network datasets are joined daily as described in

Section 3.1. The data for this paper was collected between 2019-07-

01 and 2020-06-10, where the data from June 2020 was exclusively

used for testing. Over 70,000 unique hosts generated the data from

Site 1.

In total, we observed 9,312 unique fingerprint strings from over

13 billion TLS sessions with associated host data from Site 1 during

our monitoring. Table 2 lists the ten most prevalent TLS ports at

Site 1. 99.4% of the TLS sessions used port 443, the typical port for

HTTPS. The second most prevalent port, 993, is mainly used for

IMAP-over-TLS. This dataset has a large diversity of processes with 22,969 unique process names and 243,736 unique process executable SHA-256s. During the first 10 days in June used for testing, there were 39,768 hosts, 2,320 fingerprint strings, 4,073 process names, 16,474 process executable SHA-256s, and 278,570,891 TLS sessions.

Table 2: Top-10 TLS ports from Site 1 (GMT+19) and the malware analysis sandbox.

Site 1 (GMT+19)              Malware Sandbox
Port    Sessions             Port    Sessions
443     13,148,433,441       443     53,082,740
993     40,608,706           465     701,814
5228    3,430,729            9001    70,075
80      2,940,342            80      10,434
995     2,744,560            449     5,443
8443    2,458,762            26      4,928
8080    2,418,693            8443    4,370
5986    2,222,676            993     3,762
465     1,801,507            9002    3,690
5223    1,527,906            8080    3,475

To perform additional validation and to test how well the knowl-

edge base generalizes to new locations, we collected the same joined

data from a geographically distinct site located in GMT+8, referred

to as Site 2 (GMT+8) in Section 5. This site belonged to the same

enterprise; we leave validation on distinct entities for future work

as described in Section 9. We collected this data from 2020-06-01

to 2020-06-10, and it was only used for testing. There were 10,175

hosts, 824 fingerprint strings, 1,471 process names, 5,210 process

executable SHA-256s, and 33,820,842 TLS sessions.

In terms of the operating systems, nearly 70% of the data collected from Sites 1 and 2 came from MacOS 10.14.6 and Windows 10.0.17134 hosts. ∼28% of the data came from other versions of Windows 10 and MacOS 10.15.x. The remaining data is primarily from older versions of Windows and MacOS. Only ∼0.01% of the data is Linux-based,

mainly Ubuntu 19.04 and 19.10. We further discuss this limitation

in Section 9.

Finally, we collected data from a malware analysis sandbox run-

ning Windows 7 and 10 between 2019-07-01 and 2020-06-10. Simi-

lar to Site 1, the data collected in June 2020 was used exclusively

for testing. The full malware dataset has 53,958,368 TLS sessions,

9,348 unique TLS fingerprint strings, and 37,841 unique process

executable SHA-256s. As Table 2 shows, the TLS ports for the malware analysis sandbox data were dominated by HTTPS similar to Site 1, with about 98.4% of the sessions using port 443. There were

over 1,200 unique anti-virus signatures associated with 10 or more

samples. The most common malware families were Troldesh [ ], Tofsee [ ], Emotet [ ], and DarkComet [ ]. The testing data

collected in June 2020 had 347 fingerprint strings, 9,117 process

executable SHA-256s, and 298,126 TLS sessions.

5 RESULTS

Throughout this section, we present results related to process iden-

tification, identifying cloud orchestration and processing tools, mal-

ware detection, and the importance of destination feature subsets.

Method            Process Family F1 Score    Process F1 Score
W-Naïve Bayes     0.9941 (+/- 0.0004)        0.9650 (+/- 0.0013)
Naïve Bayes       0.9879 (+/- 0.0007)        0.9571 (+/- 0.0012)
Top Process       0.8953 (+/- 0.0043)        0.8860 (+/- 0.0052)
server_name       0.8556 (+/- 0.0073)        0.8215 (+/- 0.0068)
dst_ip            0.8537 (+/- 0.0154)        0.8181 (+/- 0.0154)

Table 3: Process inference results on data collected from Site 1 (GMT+19). The server_name and dst_ip methods do not use the fingerprint information but rely strictly on the specified destination information.

We compare the performance of the weighted naïve Bayes, unweighted naïve Bayes, and “top process” classification methods,

where top process ignores the destination features and simply se-

lects the process with the most observations given a TLS fingerprint

string. The top process method still takes advantage of the process

prevalence information in the knowledge base. We present results

using the weighted naïve Bayes classification method if not stated

otherwise. We further compare our approach to methods that make

no use of the fingerprint string, instead relying solely on either the

TLS server_name or the destination IP address.

To summarize overall performance, we use the micro-averaged F1 score where the label set is either a process name or a process family name. The process name label set has been normalized so that processes appearing on different platforms, e.g., MacOS and WinNT, map to the same label. For example, chrome.exe on WinNT and google chrome on MacOS both map to chrome. The process family name labels group sets of processes that share an underlying architecture and purpose. The two largest and most diverse process families are Microsoft Office, including processes like excel.exe, outlook.exe, and word.exe, and Chromium-based web browsers, including processes like chrome.exe, brave.exe, and msedge.exe. The process family name labels generalize the process name labels.

We use precision, $tp/(tp + fp)$, and recall, $tp/(tp + fn)$, to highlight

the system’s performance on individual processes and malware. For

the malware detection results, we use a binary label where samples

are considered malware if five or more anti-virus engines labeled

the SHA-256 associated with the process as malicious.
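The metrics used in this section can be computed as in the following Python sketch, which assumes one true label and one predicted label per session. The function names are illustrative and not part of any released tool.

```python
def precision_recall(y_true, y_pred, label):
    """Per-label precision tp/(tp+fp) and recall tp/(tp+fn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool tp, fp, and fn over all labels first."""
    pairs = list(zip(y_true, y_pred))
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        tp += sum(1 for t, p in pairs if t == label and p == label)
        fp += sum(1 for t, p in pairs if t != label and p == label)
        fn += sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

When every session has exactly one true and one predicted label, the micro-averaged F1 score equals the fraction of sessions classified correctly.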

All results in this section use a knowledge base constructed

on data collected during May 2020 from Site 1 (GMT+19) and the

malware analysis sandbox. As we show in Section 6, including the

training data from the months prior to May does not increase the

performance of the system. The testing data is collected during the

first 10 days of June 2020 from Site 1 (GMT+19), Site 2 (GMT+8), and

the malware analysis sandbox as specified throughout this section.

5.1 Process Identification

The core feature of the system described in Section 3 is to infer

the process name from the TLS fingerprint string and destination

information. Table 3 lists an overview of these results when ap-

plied to the first ten days of data collected in June 2020 from Site 1

(GMT+19). The

score is averaged over each day and presented

with its standard deviation. The baseline method of selecting the

Figure 4: Confusion matrix for the top-25 processes on data

collected from Site 1 (GMT+19).

process with the most observations in the knowledge base for a

given fingerprint string resulted in an F1score of 0.8953. Both the

unweighted and weighted naïve Bayes methods improved signif-

icantly on the baseline, with the weighted naïve Bayes method

achieving an

score of 0.9941 for process family identification. A

strategy that ignores the fingerprint information and selects the

process most closely associated with either the TLS

server_name

or destination IP address performed significantly worse. A diverse

set of processes often communicate with the same set of destina-

tions, and the TLS fingerprint string is needed to achieve superior

process identification performance.
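To illustrate why destination context helps, the following is a minimal sketch that contrasts the top-process baseline with a weighted naïve Bayes decision for a single fingerprint string. Everything in it is an assumption made for illustration: the toy knowledge-base entry, the single `server_name` feature, the weight of 1.0 (the weights of Table 1 are not reproduced here), the log-linear form of the weighting, and the probability floor for unseen values. It is not the classifier of Section 3.

```python
import math

# Hypothetical knowledge-base entry for ONE fingerprint string:
# process name -> (session count, {feature name -> {feature value -> count}})
KB_ENTRY = {
    "chrome.exe": (900, {"server_name": {"www.google.com": 600, "cdn.example.com": 300}}),
    "slack.exe": (100, {"server_name": {"slack.com": 95, "cdn.example.com": 5}}),
}
WEIGHTS = {"server_name": 1.0}  # placeholder weight, not taken from Table 1


def top_process(kb_entry):
    """Baseline: pick the process with the most observations for the fingerprint."""
    return max(kb_entry, key=lambda proc: kb_entry[proc][0])


def weighted_naive_bayes(kb_entry, observed, weights, floor=1e-6):
    """Return (most probable process, probability) given observed destination features."""
    total = sum(count for count, _ in kb_entry.values())
    log_scores = {}
    for proc, (count, features) in kb_entry.items():
        score = math.log(count / total)  # log prior from prevalence
        for name, value in observed.items():
            likelihood = features.get(name, {}).get(value, 0) / count
            # Unseen values get a small floor instead of zero probability.
            score += weights.get(name, 1.0) * math.log(max(likelihood, floor))
        log_scores[proc] = score
    peak = max(log_scores.values())
    unnormalized = {p: math.exp(s - peak) for p, s in log_scores.items()}
    norm = sum(unnormalized.values())
    best = max(unnormalized, key=unnormalized.get)
    return best, unnormalized[best] / norm
```

In this toy entry the baseline always answers `chrome.exe`, while the sketch selects `slack.exe` once the observed `server_name` is `slack.com`.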

The sessions misclassified by the weighted naïve Bayes algo-

rithm were skewed towards a small set of misclassifications be-

tween Microsoft Outlook, Cisco Webex, and Safari. Cisco Webex

has components that integrate with Microsoft Outlook, result-

ing in sessions initiated by Cisco Webex that communicate with

outlook.office365.com

on both WinNT and MacOS, confusing

the classifier. There is an overlap in the fingerprint strings that

Microsoft Outlook, Cisco Webex, and Safari present on MacOS.

When Microsoft Outlook or Cisco Webex communicate with CDNs

or advertising sites using the default CoreTLS library, both the fin-

gerprint string and the destination are more strongly correlated

with Safari in the knowledge base, resulting in misclassifications.

The aforementioned cases account for ∼40% of the misclassified

sessions. The other major outlier is Chromium-based applications

like Electron and Slack being misclassified as Chromium-based web

browsers. This case is responsible for ∼20% of the misclassifications.

Figure 4 and Table 4 present a more detailed view of the weighted

naïve Bayes classifier’s performance on the most prevalent pro-

cesses in the test data. Figure 4 presents the confusion matrix for the

top-25 process names in the test data. The “Other” category in the

confusion matrix consists of all remaining processes. In general, we

can correctly identify individual processes using the proposed meth-

ods. The primary weakness is disambiguating processes that share

a common architecture and purpose. For example, Microsoft Office

applications are often confused. The Chromium-based Microsoft

Edge is also misclassified as Chrome. In terms of implementing a network security policy, process specificity may not be needed, and general process families may suffice. For example, a security policy could allow all Microsoft Office applications to bypass the firewall and communicate directly with Microsoft servers.

| Process | Sessions | W-Naïve Bayes Precision | W-Naïve Bayes Recall | Top Process Precision | Top Process Recall | server_name Precision | server_name Recall | dst_ip Precision | dst_ip Recall |
|---|---|---|---|---|---|---|---|---|---|
| Cisco AMP | 105,889,766 | 0.9999 | 1.0000 | 0.9851 | 0.9999 | 0.9576 | 0.9999 | 0.9997 | 0.9935 |
| Chromium | 45,931,514 | 0.9920 | 0.9998 | 0.9781 | 0.9993 | 0.6439 | 0.3059 | 0.6293 | 0.8496 |
| Cisco Webex | 38,403,011 | 0.9989 | 0.9923 | 0.8141 | 0.9680 | 0.9819 | 0.8972 | 0.9805 | 0.9653 |
| Microsoft Office | 26,855,990 | 0.9788 | 0.9911 | 0.7887 | 0.3963 | 0.9290 | 0.9745 | 0.9098 | 0.9647 |
| Firefox | 22,234,838 | 0.9994 | 0.9999 | 0.9992 | 0.9999 | 0.6329 | 0.3059 | 0.6124 | 0.2689 |
| Safari | 7,633,503 | 0.9787 | 0.9903 | 0.3751 | 0.9072 | 0.5094 | 0.2210 | 0.4672 | 0.1861 |
| Internet Explorer | 4,373,033 | 0.9903 | 0.9969 | 0.9490 | 0.8761 | 0.6521 | 0.2875 | 0.5343 | 0.2731 |
| iCloud | 4,328,783 | 0.9658 | 0.9803 | 0.6135 | 0.2546 | 0.9242 | 0.8512 | 0.8770 | 0.8230 |
| Creative Cloud | 1,891,238 | 0.9955 | 0.9950 | 0.5110 | 0.1246 | 0.9852 | 0.9881 | 0.9222 | 0.6824 |
| Box | 664,518 | 0.9992 | 0.9961 | 0.9822 | 0.9239 | 0.9555 | 0.9992 | 0.9508 | 0.9072 |

Table 4: Process inference results for the top-10 most prevalent process families on data collected from Site 1 (GMT+19).

| Method | Process Family $F_1$ Score | Process $F_1$ Score |
|---|---|---|
| W-Naïve Bayes | 0.9858 (+/- 0.0019) | 0.9702 (+/- 0.0021) |
| Naïve Bayes | 0.9786 (+/- 0.0040) | 0.9599 (+/- 0.0038) |
| Top Process | 0.9077 (+/- 0.0057) | 0.9035 (+/- 0.0061) |
| server_name | 0.8381 (+/- 0.0108) | 0.8297 (+/- 0.0091) |
| dst_ip | 0.8145 (+/- 0.0286) | 0.7904 (+/- 0.0271) |

Table 5: Process inference results on data collected from Site 2 (GMT+8).

Table 4 lists the precision and recall for the 10 most prevalent

process families. These process families all have precision and recall

greater than 0.99 for the weighted naïve Bayes classifier with three

exceptions: Microsoft Office, Safari, and iCloud. As was the case for

specific process categories, the lower performance on these process

families is due to other applications integrating with underlying

services and shared fingerprint strings communicating with generic

CDNs. These top-10 process families accounted for 93.41% of the

total test traffic from Site 1 (GMT+19). If we remove the data associ-

ated with these top-10 families from the testing data, the weighted

naïve Bayes classifier still achieves an $F_1$ score of 0.9711, with over

60% of the misclassifications due to Chromium-based applications

being misclassified as Chromium-based web browsers. The perfor-

mance of the classifiers based solely on destination information

is significantly worse for browsers that go to a wide variety of

destinations and generally underperforms on all processes.

In order to assess how well the techniques generalize to new

networks, we used the knowledge base trained on data from Site 1

(GMT+19) in May 2020 to test on data collected from Site 2 (GMT+8)

during the first ten days of June 2020. The results are summarized in

Table 5. Selecting the most prevalent process without considering

destination information performed slightly better compared to the

results presented in Table 3, but this is simply because there was

less process diversity on Site 2 (GMT+8).

The weighted naïve Bayes method remained competitive with

a process family $F_1$ score of 0.9858 (compared with 0.9941 on Site

1). The reduction in performance is in part due to observing pro-

cesses not seen on Site 1. There were over 300 unknown processes

initiating over 100,000 TLS sessions, or 0.3% of the total number

of TLS sessions collected from Site 2. The remaining discrepancy

is explained by geography-specific subdomains and IP addresses

that did not appear in the original database, biasing the classifier

to select processes with higher prior probabilities.

As we discuss in Section 9, the knowledge base would ideally

be conditioned on enterprise and geography-specific data before

deployment. In cases where this isn’t feasible, Table 5 demonstrates

that competitive performance is still possible. It remains an open

question how well the knowledge base would translate to data

collected from a distinct enterprise, which we leave for future work.

5.2 Cloud Orchestration and Processing Tools

Tools specifically designed to facilitate cloud orchestration and

processing serve an important role in the current network ecosys-

tem. Identifying and prioritizing network traffic associated with

these tools can provide an enhanced user experience resulting in

increased productivity. Given applications’ reliance on cloud com-

puting resources, simply relying on IP addresses and domain names

to differentiate processes consuming cloud services versus those

responsible for running business critical services is difficult.

Table 6 provides the classification results for six different cloud

and virtualization tools. For these results, we selected processes that

facilitated cloud or virtualized workflows and also were represented

in our datasets. Terraform [

] generally manages infrastructure

resources, with terraform-provider-aws explicitly exposing AWS

APIs. AWS Kinesis [

] is a streaming data processing platform.

Docker [

], Kubernetes [

], and Helm [

] all support deploying

and managing containerized applications.

As shown in Table 6, the weighted naïve Bayes classifier generally outperformed the competing approaches. Our approach performed well with respect to precision and recall for most applications, despite many of these applications connecting to generic AWS services like S3, which highlights the importance of incorporating the TLS fingerprint string into the analysis.

| Process | Sessions | W-Naïve Bayes Precision | W-Naïve Bayes Recall | Top Process Precision | Top Process Recall | server_name Precision | server_name Recall | dst_ip Precision | dst_ip Recall |
|---|---|---|---|---|---|---|---|---|---|
| terraform-provider-aws | 81,784 | 0.9634 | 0.9975 | 0.5116 | 1.0000 | 0.9418 | 0.9247 | 0.8459 | 0.7754 |
| docker | 45,412 | 0.9979 | 0.9993 | 0.1882 | 0.3043 | 0.5008 | 0.9914 | 0.4708 | 0.8991 |
| terraform | 7,382 | 0.9849 | 0.6816 | 0.4038 | 0.0573 | 0.6607 | 0.6140 | 0.3120 | 0.2614 |
| kubectl | 5,653 | 0.9684 | 0.9858 | 0.8847 | 0.6474 | 0.9975 | 0.6215 | 0.9285 | 0.6282 |
| helm3 | 690 | 0.9568 | 0.7696 | 0.9649 | 0.6783 | 0.8500 | 0.0986 | 0.8599 | 0.2580 |
| awskinesistap | 425 | 1.0000 | 0.9642 | 0.0000 | 0.0000 | 0.8182 | 0.2084 | 0.4889 | 0.0463 |

Table 6: Process inference results for various cloud orchestration and processing applications found in the data collected from Site 1 (GMT+19) during June 2020.

| Method | Precision | Recall |
|---|---|---|
| W-Naïve Bayes | 0.9993 | 0.8868 |
| Naïve Bayes | 0.9992 | 0.7038 |
| Top Process | 0.9731 | 0.0682 |
| server_name | 0.7792 | 0.6949 |
| dst_ip | 0.6359 | 0.6499 |

Table 7: Malware detection results for Site 1 (GMT+19) and the malware analysis sandbox.

5.3 Malware Detection

Malware has been replacing HTTP with TLS over the past several

years [

], and we observed malware samples from the malware

analysis sandbox more frequently using TLS compared to HTTP

during our collection period. The top two benign domains visited

were

twitter.com

(14.5% of the connections) and

www.google.com

(1.8% of the connections). The

server_name

extension was present

in 98.2% of connections, higher than the 90% observed at Site 1

(GMT+19) during the same time. For the testing data, a total of

12,249 unique server_name values were observed.

Table 7 presents the malware detection precision and recall for

the malware analysis sandbox data and data collected from Site 1

(GMT+19). While the method of simply selecting the most preva-

lent process had a relatively high precision of 97.31%, the recall

was poor. The top process method only detected 6.82% of the mal-

ware connections. This is unsurprising due to malware’s reliance

on system-provided TLS libraries. In our dataset, over 90% of the

malicious connections used the default Windows Schannel library

[

], which generates TLS fingerprint strings used by many popular

Windows processes such as Microsoft Office and Internet Explorer.

By leveraging the destination information contained within

the knowledge base and the weighted naïve Bayes algorithm, we

increased the recall to 88.68%. Malware-initiated connections to

*.baidu.com

and

*.googleusercontent.com

were responsible

for ∼45% of the false negatives. In these cases, the malware samples used an Schannel-generated TLS fingerprint string, and there

were an overwhelming number of connections to those domains

by benign processes. Due to malware’s reliance on popular hosting services like *.googleusercontent.com, the classifiers that look strictly at destination information underperformed, with precision and recall values between 63% and 78% (Table 7). The added information from the TLS fingerprint string helps in disambiguating many of the sessions that would be misclassified by the destination information alone.

| Site | Process Family $F_1$ Score | Process $F_1$ Score |
|---|---|---|
| Site 1 (GMT+19) | 0.9024 (+/- 0.0577) | 0.9003 (+/- 0.0582) |
| Site 2 (GMT+8) | 0.7897 (+/- 0.0843) | 0.7622 (+/- 0.0849) |

Table 8: Process inference results when restricted to fingerprint strings not in the database. In these cases, the analysis algorithm must rely on approximate matching.

To evade detection, malware authors could shift to using the

TLS libraries of popular processes such as Chrome to avoid a trivial

detection through unique or rare TLS fingerprint strings. These

libraries also offer the most flexibility in terms of potential desti-

nations that would blend into previous observations from those

libraries, e.g., CDNs. But, unlike the developers of benign applica-

tions, malware authors are under additional constraints such as

avoiding any noticeable user experience differences on the infected

machine and the potential for take-down requests impacting their

server infrastructure. If malware were to mimic popular fingerprint

strings, the authors would need to make frequent updates to ensure

their selected fingerprint string is still relevant. Our system auto-

matically incorporates the latest fingerprint string, process, and

destination information so that we are as robust as possible to the

changing TLS landscape and can incorporate the latest malware

trends into the knowledge base.

5.4 Approximate Matching

New TLS fingerprint strings are continuously introduced into the

ecosystem, and a robust system needs to handle these cases. As

described in Section 3.2, we use Levenshtein distance to find “close”

fingerprint strings, and then perform process identification on the

close fingerprint string’s process list. Similar to the process iden-

tification results using Site 2 (GMT+8), this method will fail if the

process isn’t in the knowledge base or the close fingerprint string’s

process list.
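The approximate matching step can be sketched as follows. This is an illustration only: it computes a character-level Levenshtein distance, whereas the system of Section 3.2 may operate on a different unit of the fingerprint string, and the example fingerprint strings in the usage note are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_fingerprint(unknown: str, known: list) -> str:
    """Return the known fingerprint string with the smallest edit distance."""
    return min(known, key=lambda fp: levenshtein(unknown, fp))
```

Given the hypothetical strings `"(0303)(c02bc030)"` and `"(0301)(002f0035)"` as the known set, the unknown string `"(0303)(c02bc02f)"` maps to the first one, and process identification would then run on that fingerprint string's process list.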

| Feature Set | Process Family $F_1$ Score | Process $F_1$ Score |
|---|---|---|
| fp, sni, ip, port | 0.9941 (+/- 0.0004) | 0.9650 (+/- 0.0013) |
| fp, sni, ip | 0.9940 (+/- 0.0004) | 0.9656 (+/- 0.0012) |
| fp, sni, port | 0.9876 (+/- 0.0006) | 0.9567 (+/- 0.0012) |
| fp, ip, port | 0.9811 (+/- 0.0039) | 0.9485 (+/- 0.0040) |
| fp, sni | 0.9938 (+/- 0.0004) | 0.9651 (+/- 0.0014) |
| fp, ip | 0.9885 (+/- 0.0039) | 0.9578 (+/- 0.0041) |
| fp, port | 0.8955 (+/- 0.0042) | 0.8862 (+/- 0.0051) |
| fp | 0.8953 (+/- 0.0043) | 0.8860 (+/- 0.0052) |
| sni | 0.8556 (+/- 0.0073) | 0.8215 (+/- 0.0068) |
| ip | 0.8537 (+/- 0.0154) | 0.8181 (+/- 0.0154) |

Table 9: Process identification results when only considering subsets of the destination features using data collected from Site 1 (GMT+19). server_name is represented as “sni” and only using the fingerprint is represented as “fp”.

To understand the performance of approximate matching, we

analyzed the results of the process identification system when the

test data is restricted to only contain fingerprint strings not in the

knowledge base. During the first 10 days of June 2020, there were

159 fingerprint strings that were not in the May 2020 knowledge

base constructed from Site 1; there were 8,781 TLS sessions (out of

278,570,891 total sessions) generated from these fingerprint strings.

We also analyze the data collected from Site 2, where there were

53 unknown fingerprint strings and 3,217 TLS sessions (out of

33,820,842 total sessions).

Table 8 lists the process identification results for both sites. The

system achieved an $F_1$ score of 0.9024 for the process family prob-

lem on the data from Site 1. The system had problems classifying

processes not seen in the training data, as well as TLS scanners

[

] that exhibit a large diversity in TLS fingerprint strings. The

process family $F_1$ score for Site 2 when restricted to fingerprint

strings not in the knowledge base was 0.7897. The discrepancy

between the results on the two sites is explained almost entirely by

processes observed on Site 2 that were never observed on Site 1.

While the results for approximate matching are worse than re-

sults when there is an exact match, we believe the approximate

matching technique provides a valuable addition to a TLS finger-

printing system. Additionally, with a well-curated knowledge base,

the number of sessions requiring approximate matching should

be low. Only 0.003% and 0.010% of sessions from Site 1 and Site 2

required approximate matching.
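These percentages follow directly from the session counts reported earlier in this section for Site 1 and Site 2, respectively:

```latex
\[
\frac{8{,}781}{278{,}570{,}891} \approx 3.2 \times 10^{-5} \approx 0.003\%,
\qquad
\frac{3{,}217}{33{,}820{,}842} \approx 9.5 \times 10^{-5} \approx 0.010\%.
\]
```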

5.5 Feature Importance

The destination features have varying levels of information. With

the weighted naïve Bayes algorithm, we take the feature’s impor-

tance into account through the weights listed in Table 1. But there

are several ongoing initiatives to increase the privacy of TLS ses-

sions by obfuscating destination information. For example, encrypt-

ing the entire ClientHello (ECHO) [

] is one approach to obfuscate

the

server_name

value. ECHO would result in all TLS sessions to

a single service provider (CDN, cloud provider, etc.) offering the

same server_name value.

The

server_name

is the destination feature with the highest

weight in our system, and it is natural to question the efficacy of

the system if that feature were to contain significantly less informa-

tion. To better understand the importance of particular destination

features in our system, we performed process identification using

the weighted naïve Bayes algorithm on Site 1’s June 2020 data with

different subsets of features. Table 9 provides results for each combi-

nation of feature sets, where the feature set includes the destination

feature and related equivalence mappings. Table 9 also provides

results when only destination information is utilized, which is de-

noted by the lack of the “fp” identifier in the table.

For the weighted naïve Bayes classifier, the first row in Table 9 with “fp, sni, ip, port” is the performance when using all destination features and “fp” is the performance when ignoring all destination features, which then defaults to selecting the most prevalent process. If the server_name extension were completely removed from all TLS sessions, our approach would still achieve an $F_1$ score of 0.9811 for process family identification. On the other

hand, if IP addresses no longer contained the same amount of

information, e.g., all servers were hosted on Cloudflare, the system can still achieve an $F_1$ score of 0.9938 with the server_name as the only destination feature. The port features add little information when considering aggregate statistics like the $F_1$ score but do help in niche cases such

as email and remote desktop application identification.

Considering evasion with respect to destination features and

Tables 1 and 9 provides some insights. Simply omitting the

server_name

may not give the desired effects because this will

alter the fingerprint string. For example, Psiphon [

] exhibits many

different fingerprint strings, one of which attempts to imitate

Chrome. In some cases, Psiphon imitates Chrome but omits the

server_name

, which causes it to be identifiable from just its TLS

fingerprint string. Robust evasion needs to jointly consider all

destination features and the TLS fingerprint string, while at the

same time making sure the destinations and fingerprint string

remain prevalent in real-world traffic.

6 OPERATIONALIZING

In the previous section, all results use a knowledge base constructed

from data collected during May 2020 to classify data collected during

the first 10 days of June 2020. Maintaining an up-to-date knowl-

edge base that captures the relevant real-world traffic statistics

is critical to the success of our solution. In this section, we study

how parameters of the knowledge base, such as its age, affect the

performance of our classifier. In the previous section, we also took

the most probable process returned by the classifier, ignoring the

score. Here, we show the impact of considering the score on the

classifier’s performance and the amount of data discarded.

6.1 Knowledge Base Age

Figure 5: The effect of the knowledge base’s age on classification performance, where the results use a knowledge base trained on the month specified by the x-axis.

Maintaining the knowledge base with current, real-world data is not a trivial task, but is necessary due to the introduction of new fingerprint strings, processes, and destination information. We now investigate the importance of continuously updating the knowledge base by examining the performance of the system when only considering older data. For this experiment, we built 11 separate

knowledge bases by merging daily knowledge bases for each month

between July 2019 and May 2020. The May 2020 knowledge base

was used in the experiments of the previous section.
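As a rough sketch of what merging daily knowledge bases into a monthly one could look like, the snippet below sums per-process observation counts for each fingerprint string. The nested-dictionary layout is a simplifying assumption; the actual knowledge base also carries destination information, which is omitted here.

```python
from collections import Counter, defaultdict


def merge_knowledge_bases(daily_kbs):
    """Sum per-process session counts across daily knowledge bases.

    Each daily knowledge base is assumed to have the (hypothetical) shape
    {fingerprint string: {process name: session count}}.
    """
    merged = defaultdict(Counter)
    for kb in daily_kbs:
        for fingerprint, process_counts in kb.items():
            merged[fingerprint].update(process_counts)  # adds counts per process
    return {fingerprint: dict(counts) for fingerprint, counts in merged.items()}
```

Because the merge is a plain sum of counts, the order in which the daily knowledge bases are supplied does not affect the result.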

Figure 5 shows the process family and process classification

results when the classifier only has access to data from the specified

month, where the y-axis is from 0.93 to 0.99. Similar to the previous

section, the testing data for each result is taken from the first 10

days of June 2020. As expected, the performance of the classifier

on both label sets consistently decreases as the knowledge base is

trained on older months.

The decreasing performance is due to the evolution of fingerprint

strings and destination information. For example, while the testing

data only contained 159 fingerprint strings with 8,781 TLS sessions

that did not have a match in the knowledge base from May 2020,

there were 1,238 fingerprint strings with 33 million TLS sessions

in the testing data without a match in the July 2019 knowledge

base. A heavier reliance on approximate matching will reduce the

performance of the classifier as explained in Section 5.4.

Even with approximate matching, Figure 5 illustrates the clear

need to keep an up-to-date knowledge base conditioned on recent

real-world data as described in Section 3.1.

6.2 Aging Out Data

In addition to keeping the knowledge base up to date with the most

recent observations, one must consider the effect of older data that

may no longer be relevant. Removing processes associated with

a fingerprint string that have not been recently observed slightly

increases performance and leads to a considerable reduction in the

knowledge base’s size.
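One way to age out stale entries is sketched below. The per-process `last_seen` field and the 30-day default are assumptions made for illustration; the experiment in this section instead rebuilds the knowledge base from progressively shorter date ranges.

```python
from datetime import date, timedelta


def age_out(kb, today, max_age_days=30):
    """Drop processes not observed within max_age_days of today.

    kb is assumed to have the (hypothetical) shape
    {fingerprint string: {process name: {"count": int, "last_seen": date}}}.
    Fingerprint strings left with no processes are removed entirely.
    """
    cutoff = today - timedelta(days=max_age_days)
    pruned = {}
    for fingerprint, processes in kb.items():
        recent = {name: info for name, info in processes.items()
                  if info["last_seen"] >= cutoff}
        if recent:
            pruned[fingerprint] = recent
    return pruned
```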

To measure the impact of older data on our system, we again

constructed 11 separate knowledge bases. For the current experi-

ment, each knowledge base includes data starting from each of the

months between July 2019 and May 2020. The 11 knowledge bases

include all data between their starting month and the end of May

2020. The size of the knowledge bases became progressively smaller

as we considered less data. The size of the knowledge base constructed from data between July 2019 and May 2020 was 195 megabytes. The size of the knowledge base only considering the May 2020 data was 58 megabytes.

Figure 6: The effect of aging out older observations in the knowledge base on classification performance, where the results use a knowledge base constructed from data spanning the month specified by the x-axis until May 2020.

Figure 7: Effect of adjusting the classifier’s threshold with minimum probability values in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.999, 0.9999, 1.0].

Figure 6 illustrates the performance of the classifier when con-

ditioned on each of the knowledge bases, using the first 10 days

of June 2020 as testing data. There was no advantage in maintain-

ing data older than one month with respect to process family and

process classification $F_1$ scores. In fact, the performance of the process classification system decreased as the knowledge base kept

data for longer periods of time. The decrease in performance was

driven primarily by Chromium-based browsers being misclassified

as Chrome due to lagging updates of BoringSSL [7].

6.3 Classifier Threshold

The classification algorithm described in Section 3.4 returns the most probable process associated with a fingerprint string and destination along with its probability. If network operators are unwilling or unable to accept misclassifications, they can define a minimum probability threshold and ignore any inferences that do not meet that threshold.

Figure 8: The difference in process family classification when using the open source and internal knowledge bases to test data collected from Site 1 (GMT+19) during June 2020.

In this experiment, we used the May 2020 knowledge base and

the June 2020 testing data from Site 1. Figure 7 shows the impact

of adjusting the classifier’s threshold on the process inference $F_1$ score and the fraction of data discarded if we ignore results below the given threshold. For a threshold of 0.5, we classify over 99% of the data and have a process inference $F_1$ score of 0.9737. At a threshold of 0.999, we classify 58% of the data with an $F_1$ score of 0.9981. Finally, at a threshold of 1.0, we classify 6% of the data with an $F_1$ score of 0.9991.

As the threshold is increased, the system begins to ignore (fin-

gerprint string, destination) tuples that are observed with many

processes. At the 0.999 threshold, the system ignores all off-diagonal

sessions in Figure 4 except for ∼60% of the Microsoft Edge connections because of Chrome’s dominance with respect to its number

of observations.
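The thresholding described in this section can be sketched as a simple filter over (process, probability) inferences. The function name and the tuple layout are illustrative assumptions, not the interface of the released tool.

```python
def apply_threshold(inferences, min_probability):
    """Keep inferences whose probability meets the threshold.

    inferences is a list of (process name, probability) pairs.
    Returns (kept inferences, fraction of inferences discarded).
    """
    kept = [(proc, prob) for proc, prob in inferences if prob >= min_probability]
    discarded = 1 - len(kept) / len(inferences) if inferences else 0.0
    return kept, discarded
```

Raising `min_probability` trades coverage for accuracy, which is the tradeoff summarized in Figure 7.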

7 REPRODUCIBILITY

To assist reproducibility, we have open sourced the data collection

and analysis system described in this paper. Our core tool is a C/C++

program that uses Linux’s AF_PACKET TPACKETv3 zero-copy

shared memory ring buffers to collect and analyze data on network

links with capacities of 30Gbps+. This tool supports the generation

of TLS fingerprint strings and process inference with destination

context as described in Sections 2 and 3. mercury [

] additionally

generates client fingerprint strings for DHCP, DTLS, HTTP, SSH,

and TCP, as well as server fingerprint strings for HTTP, DTLS, and

TLS. We also released a pip-installable Python implementation to

facilitate rapid prototyping.

We released an open source version of our internal TLS finger-

print knowledge base along with the open source tools. We are

committed to releasing up-to-date TLS fingerprint knowledge bases

to the open source community, which we have currently done each

week during the first seven months of 2020. The current open source

knowledge base contains 8,405 unique TLS fingerprint strings with

associated prevalence, process, and destination information. This

dataset was constructed by considering 9.5 billion TLS sessions.

There are currently 2,695 unique process names and 11,600 unique

process executable SHA-256s in the open source knowledge base.

To comply with the policies of the organization whose sites we

monitored, we had to remove some of the knowledge base’s content.

In the open source knowledge base, we only report the top-10 most

prevalent processes per fingerprint string. For each process, we

report only the equivalence mappings for the destination features.

For example, the FQDN in the

server_name

data is only reported

as a domain name and TLD. There is a 30-day delay before obser-

vations on the monitored sites are introduced into the open source

knowledge base. Finally, we did not open source any data from the

malware analysis sandbox.
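The two reductions described above, keeping only the ten most prevalent processes and coarsening the server_name, can be sketched as follows. The helper names are hypothetical, and taking the last two DNS labels is a deliberate simplification: it does not handle multi-label public suffixes such as `co.uk`, which would require a public suffix list.

```python
def top_n_processes(process_counts, n=10):
    """Keep the n most prevalent processes for a fingerprint string."""
    ranked = sorted(process_counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


def domain_and_tld(fqdn):
    """Reduce an FQDN to its last two labels (naive domain name plus TLD)."""
    labels = fqdn.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])
```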

To better understand the impact of omitting data from the open

source knowledge base, we compared the performance of the open

source and internal knowledge bases when applied to data collected

during the first 10 days of June 2020 from Site 1. We used the

May 2020 internal knowledge base and the open source knowledge

base that was available on June 1st, 2020. Figure 8 illustrates the

difference in performance when only selecting the most prevalent

process and applying both the unweighted and weighted naïve

Bayes algorithms.

As expected, the difference in performance between the two

knowledge bases was small when relying on the most prevalent

process. The algorithms based on naïve Bayes had a significant ad-

vantage when using the private knowledge base. As Figure 5 demon-

strated, the 30-day delay had a small impact on performance. Re-

moving the most informative destination features, the

server_name

and IP address according to Table 1, was the primary cause for the

degraded performance. While the process inference performance

based on the open source knowledge base is somewhat reduced,

we still believe this data represents a significant contribution to the community and is the first to associate TLS fingerprint strings, processes, and destination information at large scale.

8 RELATED WORK

The work presented in this paper builds on a rich history of net-

work traffic fingerprinting and analysis. TLS fingerprinting first

became popular in 2009 when Ivan Ristić released an Apache mod-

ule to monitor SSL handshakes and correlate offered cipher suite

lists with HTTP

User-Agent

strings [

]. This led to several open

source packages that implemented methods to extract TLS fingerprint strings and provided TLS fingerprint databases [ ].

The previous fingerprint databases did not provide real-world preva-

lence or contextual information about the destinations and therefore

could not rely on that information to disambiguate the set of pro-

cesses that mapped to the same fingerprint string. Husák et al. [

]

provided the first academic study of TLS fingerprinting, but again,

did not consider destination information or have the infrastructure

in place to develop detailed knowledge bases.

While our goal was to perform process inference using TLS

fingerprinting, several efforts have used TLS fingerprinting as a

means to perform measurement studies [

]. For

example, Kotzias et al. [

] used a combination of open source

fingerprint databases and their own data to examine how popular

browsers modified the cryptographic parameters offered in their

client_hello messages in response to the disclosure of high-profile attacks

against TLS [

]. Frolov and Wustrow [

] studied the unique-

ness of censorship circumvention tools’ TLS fingerprint strings to

motivate the development of uTLS [

], a TLS library to mimic and

randomize the

client_hello

. Our work illustrates the importance

of considering destination features for libraries like uTLS when

constructing a client_hello.

Performing protocol, application, and process identification on

encrypted traffic [

] has been an

active area of research over the past 15 years. Initial work focused

on identifying the application layer protocols, e.g., FTP, HTTP, and

SMTP, within an encrypted tunnel [

]. For example, Wright

et al. [

] used the sequence of TCP packets and a hidden Markov

model to identify application layer protocols.

More recent work has focused on the mobile and IoT domains

[

]. FlowPrint [

] takes a semi-supervised approach that

allows it to fingerprint previously unseen mobile applications. Flow-

Print considers timing, device, and destination features, where the

optimal batch window was found to be 300 seconds. Our work

differs in several key areas but is complementary. Our system provides process identification results after having only seen the first

packet in a TLS session, as opposed to 300 seconds in the worst

case. We use a continuous data collection system to ensure our

knowledge base has the most recent information, but we are unable

to identify previously unseen processes.

9 DISCUSSION

The system presented in this paper relies on the large-scale col-

lection, curation, and fusion of real-world data. We were able to

achieve our results by creating a custom tool to collect network

data and leveraging data from a pre-existing host agent. This ap-

proach led to quick results, but also created some critical gaps in

our system’s coverage. As detailed in Section 4, the endpoints that

generated host logs were almost entirely MacOS and WinNT-based

desktop systems. There was a small Linux component, but a com-

plete absence of mobile and IoT devices. While future work will

include expanding the capabilities of the data collection system to

remove these blind spots, we believe the underlying system and

process inference strategy is sound and can naturally incorporate

this new data.

Our data being limited to a single enterprise was another byprod-

uct of our data collection strategy. We believe that the diversity of

processes in our knowledge base and the results when applying

the classification system to Site 2 (GMT+8) provide some evidence

that our approach would scale to networks operated by a distinct

enterprise. But, for optimal performance, having a knowledge base

that was at least in part conditioned on data observed from the

target site would be best. Any site with standard endpoint visibility

agents with similar capabilities to the AnyConnect Network Visi-

bility Module [

] and the capacity to perform network monitoring

could create custom knowledge bases, but this does require a significant initial investment. Further experiments to understand how well the knowledge base transfers between distinct enterprises are left for future work.

Evading a system that continuously learns from billions of real-

world TLS connections is not trivial, but we provided some best

practices that privacy enhancing technologies could employ in Sec-

tion 5.3, e.g., using system-provided TLS libraries. On the other

hand, it is possible for malware to use these same techniques to

evade detection. We hypothesize that the additional constraints

placed on many classes of malware, e.g., remaining undetected for

prolonged periods, make evading a continuously updating

knowledge base substantially more difficult. While techniques

based on information extracted only from the TLS

client_hello

are not immune to evasion, the results of Section 5.3 indicate

that our system does have value. More investigation into

the security-privacy tradeoff of our system with respect to privacy

enhancing technology and malware detection is needed.

There exist several avenues to extend the core methods of Section 3.

The most straightforward extension is to expand the set

of destination feature equivalence mappings. Obvious examples

include the global popularity or a binned consonant-to-vowel ratio

of the

server_name

. Adding new destination features may

also improve the performance of the system. In the May 2020 data,

only 25% of the TLS fingerprint strings and 35% of the TLS ses-

sions signaled support for TLS 1.3, and TLS 1.2 will most likely

remain a large fraction of the TLS traffic for years to come. For TLS

1.2 sessions, including features around the server’s certificate will

provide additional information about the server’s identity to the

classification system.
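
As a minimal sketch of one such equivalence mapping, the binned consonant-to-vowel ratio of a server_name could be computed as shown below. The function name, the bin edges, and the decision to ignore digits and punctuation are illustrative assumptions and are not taken from the system described in Section 3.

```python
import bisect

VOWELS = set("aeiou")
# Hypothetical bin edges; no binning scheme is specified in the text.
BIN_EDGES = [1.0, 2.0, 4.0]


def consonant_vowel_bin(server_name: str) -> int:
    """Map a server_name to a coarse bin of its consonant-to-vowel ratio."""
    # Only ASCII letters are counted; digits, dots, and hyphens are ignored.
    letters = [c for c in server_name.lower() if c.isascii() and c.isalpha()]
    vowels = sum(1 for c in letters if c in VOWELS)
    consonants = len(letters) - vowels
    if vowels == 0:
        # No vowels (or an empty name): assign the highest bin.
        return len(BIN_EDGES)
    ratio = consonants / vowels
    # bisect_right returns the number of edges that are <= ratio.
    return bisect.bisect_right(BIN_EDGES, ratio)
```

Under these assumed edges, names with an unusually high ratio, as is often the case for algorithmically generated domains, fall into the top bin, so many distinct names collapse to a single feature value.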

Finally, maintaining proper ethics when performing a project

analyzing real-world network and host data is critical. The unprocessed

data was stored on a platform with an institution-approved

access control system. The data in the knowledge base was stripped

of any indicators that could be used to identify users, such as the

source IP addresses, detailed timestamps, and host agent identifiers.

We followed all institutional procedures, including signing institu-

tional agreements declaring that we would “minimize personally

identifiable information, maintain the confidentiality of all raw and

processed data, receive written consent from your direct manage-

ment chain before releasing any data, and pledge to not follow any

practices that could be deemed discriminatory”.

10 CONCLUSION

In this paper, we presented a system that continuously collects and

fuses billions of real-world TLS sessions and host logs to generate

a knowledge base correlating TLS client fingerprint strings, host

processes, and destination features. With the generated knowledge

base, we built a system that uses a weighted naïve Bayes algorithm

to infer processes and detect malware using only the TLS fingerprint

string and destination information contained within the first data

packet of a TLS session. We demonstrated that our system was

able to achieve an F1 score of over 0.99 when inferring the process

family, and high efficacy malware detection with 99.9% precision

and 88.7% recall. We additionally examined the performance of our

system when used to identify cloud orchestration and processing

tools and found that the precision and recall were greater than 0.99

for several popular processes belonging to this category.
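
The inference step summarized above can be illustrated with a generic weighted naïve Bayes classifier over destination features. This is a sketch under stated assumptions rather than the implementation in mercury: the class name, the feature names, the weights, and the additive smoothing are all hypothetical, and one instance is assumed to hold the observations for a single fingerprint string.

```python
import math
from collections import Counter, defaultdict

# Hypothetical per-feature weights; not the values used in the paper.
WEIGHTS = {"dst_ip": 0.5, "dst_port": 1.0, "server_name": 1.0}


class WeightedNaiveBayes:
    """Toy knowledge base for the sessions of one fingerprint string."""

    def __init__(self, weights, alpha=1.0):
        self.weights = weights
        self.alpha = alpha  # additive smoothing constant (an assumption)
        self.process_counts = Counter()
        # feature_counts[feature][process][value] -> observation count
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        self.values = defaultdict(set)  # distinct values seen per feature

    def observe(self, process, features):
        """Record one labeled session (process name plus destination features)."""
        self.process_counts[process] += 1
        for name, value in features.items():
            self.feature_counts[name][process][value] += 1
            self.values[name].add(value)

    def scores(self, features):
        """Return log P(process) + sum_i w_i * log P(feature_i | process)."""
        total = sum(self.process_counts.values())
        out = {}
        for process, n in self.process_counts.items():
            score = math.log(n / total)
            for name, value in features.items():
                count = self.feature_counts[name][process][value]
                vocab = len(self.values[name]) + 1  # +1 reserves mass for unseen values
                prob = (count + self.alpha) / (n + self.alpha * vocab)
                score += self.weights.get(name, 1.0) * math.log(prob)
            out[process] = score
        return out

    def predict(self, features):
        """Return the process with the highest weighted log score."""
        scores = self.scores(features)
        return max(scores, key=scores.get)
```

Working in log space turns each weight into a simple multiplier on that feature's contribution, so a weight below 1.0 dampens a feature that is less reliable without removing it from the decision.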

To assist in reproducibility, we contributed mercury [38] to the

open source community for collecting and classifying network

traffic. We also released an open source version of our internal

TLS fingerprint knowledge base, which is updated weekly and

is currently the largest and most informative open source TLS

fingerprint knowledge base in existence.

ACKNOWLEDGMENTS

We thank Brandon Enright for his support in developing mercury.

We thank both Brandon and Adam Weller for their feedback and

support. We thank Lucas Messenger, Eddie Allan Jr., and Joey Rosen

for their assistance in maintaining and providing access to the data

capture infrastructure. We also thank and acknowledge Ed Paradise

for his ongoing support of this work.

REFERENCES

[1]

2012. SSL Fingerprinting for p0f. (2012). https://idea.popcount.org/2012-06-17-ssl-fingerprinting-for-p0f/.

[2]

2018. Protocols in TLS/SSL (Schannel SSP). (2018). https://docs.microsoft.com/en-us/windows/win32/secauthn/protocols-in-tls-ssl--schannel-ssp-.

[3]

2019. Cisco AnyConnect Secure Mobility Client. http://www.cisco.com/go/anyconnect. (2019).

[4] 2019. Psiphon. (2019). https://www.psiphon3.com.

[5] 2019. uTLS. (2019). https://github.com/refraction-networking/utls.

[6] 2020. Amazon Kinesis. https://aws.amazon.com/kinesis/. (2020).

[7] 2020. BoringSSL. (2020). https://boringssl.googlesource.com/boringssl/.

[8] 2020. Docker. https://www.docker.com/. (2020).

[9] 2020. Helm. https://helm.sh/. (2020).

[10] 2020. Kubernetes. https://kubernetes.io/. (2020).

[11] 2020. MaxMind’s GeoLite2. (2020). https://www.maxmind.com/.

[12] 2020. Mozilla’s Public Suffix List. (2020). https://publicsuffix.org/list/.

[13] 2020. Terraform. https://www.terraform.io/. (2020).

[14]

Nadhem AlFardan, Daniel J Bernstein, Kenneth G Paterson, Bertram Poettering,

and Jacob CN Schuldt. 2013. On the Security of RC4 in TLS. In USENIX Security

Symposium. 305–320.

[15]

John B. Althouse, Jeff Atkinson, and Josh Atkins. 2017. JA3. (2017). https://github.com/salesforce/ja3.

[16]

Blake Anderson and David McGrew. 2016. Identifying Encrypted Malware

Traffic with Contextual Flow Data. In ACM Workshop on Artificial Intelligence

and Security (AISec). 35–46.

[17]

Blake Anderson and David McGrew. 2017. Machine Learning for Encrypted

Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity.

In ACM SIGKDD International Conference on Knowledge Discovery in Data Mining

(KDD). 1723–1732.

[18]

Blake Anderson and David McGrew. 2019. TLS Beyond the Browser: Combin-

ing End Host and Network Data to Understand Application Behavior. In ACM

SIGCOMM Internet Measurement Conference (IMC). 379–392.

[19]

Blake Anderson, Subharthi Paul, and David McGrew. 2017. Deciphering Mal-

ware’s Use of TLS (without Decryption). Journal of Computer Virology and

Hacking Techniques (2017), 1–17.

[20]

Pieter Arntz. 2019. Spotlight on Troldesh Ransomware, aka ’Shade’.

https://blog.malwarebytes.com/threat-analysis/2019/03/spotlight-troldesh-ransomware-aka-shade/. (2019).

[21]

David Benjamin. 2017. Applying GREASE to TLS Extensibility. Internet-Draft

(Informational). (2017). https://tools.ietf.org/html/draft-ietf-tls-grease-03.

[22]

Laurent Bernaille and Renata Teixeira. 2007. Early Recognition of Encrypted Ap-

plications. In International Conference on Passive and Active Network Measurement.

165–175.

[23]

Karthikeyan Bhargavan and Gaëtan Leurent. 2016. On the Practical (in-) Security

of 64-bit Block Ciphers: Collision Attacks on HTTP over TLS and OpenVPN.

In ACM SIGSAC Conference on Computer and Communications Security (CCS).

456–467.

[24]

Lee Brotherston. 2015. FingerprinTLS. (2015). https://github.com/synackpse/tls-fingerprinting.

[25]

Manuel Crotti, Maurizio Dusi, Francesco Gringoli, and Luca Salgarelli. 2007.

Traffic classification through simple statistical fingerprinting. Computer Com-

munication Review 37, 1 (2007), 5–16. https://doi.org/10.1145/1198255.1198257

[26]

Tim Dierks and Eric Rescorla. 2008. The Transport Layer Security (TLS) Protocol

Version 1.2. RFC 5246 (Proposed Standard). (2008). http://www.ietf.org/rfc/rfc5246.txt.

[27] Alban Diquet. 2019. SSLyze. (2019). https://github.com/nabla-c0d3/sslyze.

[28]

Donald Eastlake. 2011. Transport Layer Security (TLS) Extensions: Extension

Definitions. Internet-Draft (Standards Track). (2011). https://tools.ietf.org/html/rfc6066.

[29]

Brown Farinholt, Mohammad Rezaeirad, Damon McCoy, and Kirill Levchenko.

2020. Dark Matter: Uncovering the DarkComet RAT Ecosystem. In ACM Inter-

national World Wide Web Conference. 2109–2120.

[30]

Roy Fielding and Julian Reschke. 2014. Hypertext Transfer Protocol (HTTP/1.1):

Semantics and Content. RFC 7231 (Proposed Standard). (2014). http://www.ietf.org/rfc/rfc7231.txt.

[31]

Sergey Frolov and Eric Wustrow. 2019. The use of TLS in Censorship Circum-

vention. In Network and Distributed System Security Symposium (NDSS).

[32]

Colin Grady, William Largent, and Jaeson Schultz. 2019. Emotet is

Back After a Summer Break. https://blog.talosintelligence.com/2019/09/emotet-is-back-after-summer-break.html. (2019).

[33]

Ralph Holz, Johanna Amann, Olivier Mehani, Matthias Wachs, and Mohamed Ali

Kaafar. 2016. TLS in the Wild: An Internet-wide Analysis of TLS-based Proto-

cols for Electronic Communication. In Network and Distributed System Security

Symposium (NDSS).

[34]

Martin Husák, Milan Cermák, Tomáš Jirsík, and Pavel Celeda. 2015. Network-

Based HTTPS Client Identification using SSL/TLS Fingerprinting. In Availability,

Reliability and Security (ARES). 389–396.

[35]

Jaroslaw Jedynak. 2017. A Deeper Look at Tofsee Modules.

https://www.cert.pl/en/news/single/a-deeper-look-at-tofsee-modules/#4-proxyrdll. (2017).

[36]

Platon Kotzias, Abbas Razaghpanah, Johanna Amann, Kenneth G. Paterson,

Narseo Vallina-Rodriguez, and Juan Caballero. 2018. Coming of Age: A Lon-

gitudinal Study of TLS Deployment. In ACM SIGCOMM Internet Measurement

Conference (IMC). 415–428.

[37]

Marc Liberatore and Brian Neil Levine. 2006. Inferring the Source of Encrypted

HTTP Connections. In Proceedings of the Thirteenth ACM Conference on Computer

and Communications Security (CCS). 255–263.

[38]

David McGrew, Brandon Enright, and Blake Anderson. 2020. Mercury: Fast TLS,

TCP, and IP Fingerprinting. https://github.com/cisco/mercury. (2020).

[39]

Andrew W Moore and Denis Zuev. 2005. Internet Traffic Classification Using

Bayesian Analysis Techniques. SIGMETRICS Performance Evaluation Review 33

(2005), 50–60.

[40]

Abbas Razaghpanah, Arian Akhavan Niaki, Narseo Vallina-Rodriguez, Srikanth

Sundaresan, Johanna Amann, and Phillipa Gill. 2017. Studying TLS Usage in

Android Apps. In International Conference on emerging Networking EXperiments

and Technologies (CoNEXT). 350–362.

[41] ioerror rbsec. 2019. sslscan. (2019). https://github.com/rbsec/sslscan.

[42]

Eric Rescorla. 2018. The Transport Layer Security (TLS) Protocol Version 1.3.

RFC 8446 (Proposed Standard). (2018). http://www.ietf.org/rfc/rfc8446.txt.

[43]

Eric Rescorla, Kazuho Oku, Nick Sullivan, and Christopher Wood. 2020. En-

crypted Server Name Indication for TLS 1.3. Internet-Draft (Experimental).

(2020). https://tools.ietf.org/html/draft-ietf-tls-esni-06.

[44]

Ivan Ristic. 2009. HTTP Client Fingerprinting using SSL Handshake Analysis. (2009).

https://blog.ivanristic.com/2009/06/http-client-fingerprinting-using-ssl-handshake-analysis.html.

[45] Ivan Ristić. 2012. sslhaf. (2012). https://github.com/ssllabs/sslhaf.

[46]

Vincent F. Taylor, Riccardo Spolaor, Mauro Conti, and Ivan Martinovic. 2016.

AppScanner: Automatic Fingerprinting of Smartphone Apps From Encrypted

Network Traffic. In IEEE European Symposium on Security and Privacy. 439–454.

[47]

Thijs van Ede, Riccardo Bortolameotti, Andrea Continella, Jingjing Ren, Daniel J

Dubois, Martina Lindorfer, David Choffnes, Maarten van Steen, and Andreas

Peter. 2020. FLOWPRINT: Semi-Supervised Mobile-App Fingerprinting on Encrypted

Network Traffic. In Network and Distributed System Security Symposium

(NDSS).

[48]

Quaizar Vohra and Enke Chen. 2012. BGP Support for Four-Octet Autonomous

System (AS) Number Space. Internet-Draft (Standards Track). (2012). https://tools.ietf.org/html/rfc6793.

[49]

Charles V Wright, Fabian Monrose, and Gerald M Masson. 2006. On Inferring

Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine

Learning Research (JMLR) (2006), 2745–2769.

[50]

Harry Zhang and Shengli Sheng. 2004. Learning Weighted Naive Bayes with

Accurate Ranking. In IEEE International Conference on Data Mining (ICDM’04).

567–570.

[51]

Wei Zhang, Yan Meng, Yugeng Liu, Xiaokuan Zhang, Yinqian Zhang, and Haojin

Zhu. 2018. HoMonit: Monitoring Smart Home Apps from Encrypted Traffic.

In ACM SIGSAC Conference on Computer and Communications Security (CCS).

1074–1088.

A TLS EXTENSIONS WITH DATA

Extension Name                            Extension Hex Code
max_fragment_length                       0001
status_request                            0005
client_authz                              0007
server_authz                              0008
cert_type                                 0009
supported_groups                          000a
ec_point_formats                          000b
signature_algorithms                      000d
heartbeat                                 000f
application_layer_protocol_negotiation    0010
status_request_v2                         0011
client_certificate_type                   0013
server_certificate_type                   0014
token_binding                             0018
compress_certificate                      001b
record_size_limit                         001c
supported_versions                        002b
psk_key_exchange_modes                    002d
signature_algorithms_cert                 0032
channel_id                                5500
GREASE                                    0a0a
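
The table above could be used as a lookup set when rendering the extensions of a client_hello, as in the following minimal sketch. The helper names and the output format (type code followed by the hex-encoded data) are hypothetical, and the way GREASE is handled here (any GREASE type code replaced by 0a0a, with its data dropped) is an assumption rather than a description of the fingerprint format used in this paper.

```python
# Extension type codes listed in the table above.
EXTENSIONS_WITH_DATA = {
    0x0001, 0x0005, 0x0007, 0x0008, 0x0009, 0x000A, 0x000B, 0x000D,
    0x000F, 0x0010, 0x0011, 0x0013, 0x0014, 0x0018, 0x001B, 0x001C,
    0x002B, 0x002D, 0x0032, 0x5500,
}
GREASE_PLACEHOLDER = 0x0A0A


def is_grease(code: int) -> bool:
    """GREASE values are 0x0a0a, 0x1a1a, ..., 0xfafa (both bytes equal)."""
    return (code & 0x0F0F) == 0x0A0A and (code >> 8) == (code & 0xFF)


def normalize_extension(code: int, data: bytes) -> str:
    """Render one extension as hex, keeping data only for listed types."""
    if is_grease(code):
        # Assumption: collapse every GREASE value to a single placeholder.
        return f"{GREASE_PLACEHOLDER:04x}"
    if code in EXTENSIONS_WITH_DATA:
        return f"{code:04x}{data.hex()}"
    # Extensions not in the table contribute only their type code.
    return f"{code:04x}"
```

Collapsing GREASE values matters because clients choose them at random per connection; without normalization, the same client would produce a different string on each session.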