Research Article
Characterizing Anomalies in Malware-Generated HTTP Traffic
Piotr Białczak¹ and Wojciech Mazurczyk²
¹CERT Polska/Research and Academic Computer Network (NASK), Kolska 1 , Warsaw 01-045, Poland
²Warsaw University of Technology, Nowowiejska 15/19, Warsaw 00-665, Poland
Correspondence should be addressed to Piotr Białczak; piotr.bialczak@cert.pl
Received 14 April 2020; Revised 18 June 2020; Accepted 7 August 2020; Published 1 September 2020
Academic Editor: Clemente Galdi
Copyright ©2020 Piotr Białczak and Wojciech Mazurczyk. This is an open access article distributed under the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Currently, we are witnessing a significant rise in various types of malware, which has an impact not only on companies, in-
stitutions, and individuals, but also on entire countries and societies. Malicious software developers try to devise increasingly
sophisticated ways to perform nefarious actions. In consequence, the security community is under pressure to develop more
effective defensive solutions and to continuously improve them. To accomplish this, the defenders must understand and be able to
recognize the threat when it appears. That is why, in this paper, a large dataset of recent real-life malware samples was used to
identify anomalies in the HTTP traffic produced by the malicious software. The authors analyzed malware-generated HTTP
requests, as well as benign traffic of the popular web browsers, using 3 groups of features related to the structure of requests, header
field values, and payload characteristics. It was observed that certain attributes of the HTTP traffic can serve as an indicator of
malicious actions, including lack of some popular HTTP headers and their values or usage of the protocol features in an
uncommon way. The findings of this paper can be conveniently incorporated into the existing detection systems and network
traffic forensic tools, making it easier to spot and eliminate potential threats.
1. Introduction
In the present-day Internet, one of the most commonly
used protocols is the Hypertext Transfer Protocol (HTTP)
[1, 2]. Its utilization is widespread as it is an essential
component of web browsing. It also serves as a “back-
bone” of many services, even those standardized with
other network protocols like e-mail and instant mes-
saging. However, HTTP protocol prevalence is steadily
decreasing in favor of TLS, HTTP/2, and FB-ZERO
protocols, according to [1]. Deployment of an HTTP
server is easy even for users who are not tech-savvy,
with many tutorials available in national languages. It is
also often provided as a service by webhosting companies.
On top of this, there is a lack of monitoring or blocking in
many networks and easily achievable blending with le-
gitimate network traffic. It is not surprising that malware
developers use HTTP as a primary protocol to enable
malicious communication. For example, according to
Miller and Smith [3], HTTP is the most popular protocol
used in C&C traffic, surpassing HTTPS. All this led the
authors of this paper to focus on analyzing the HTTP
protocol solely.
HTTP is used by malware for various purposes, for
example, for connecting with the Command and Control
(C&C) server to register/download commands, checking the
external IP address of the infected host, and downloading
additional modules. It is also used to perform DDoS
(Distributed Denial of Service) attacks or create revenue by
clicking on referral links. Such communication is masked by
benign HTTP traffic which can be vastly different,
depending on the application and its usage purpose. It must
be noted that the HTTP protocol can be used by applications
other than web browsers, for example, updaters, operating
system mechanisms, application shops, and messengers. The
main difference between the network traffic of such appli-
cations and the network traffic of web browsers lies in the
characteristics of the addresses used. The latter traffic can be
potentially directed to any address, while in the former, the
addresses are constant: they are either a set of domain names
or an IP address range. For example, addresses of servers used by
Windows telemetry services or Windows update mecha-
nisms are widely known and are listed in many manuals
focusing on blocking these services with network firewalls
[4, 5] or dedicated tools such as WindowsSpyBlocker
(https://github.com/crazy-max/WindowsSpyBlocker/).
Network traffic of these applications can be easily identified
using, for example, publicly available address lists or a short
analysis of the traffic in the network proxy log. Considering
the above, the authors decided to focus only on the web
browser traffic as the other popular HTTP-based applica-
tions are relatively easy to be identified and filtered out from
the network traffic.
The analysis of HTTP traffic characteristics presented in
the current malware behavior research [6–9] suggests that
some malware families’ HTTP requests differ from those
generated by benign applications. This is especially visible
when compared to the network traffic of applications op-
erated by humans, e.g., web browsers. However, to the
authors’ best knowledge, there is no extensive study which
systematically identifies and analyzes dissimilarities between
the malicious (malware) and benign (web browsers) HTTP
traffic.
To fill this gap, the authors have thoroughly analyzed
HTTP requests of both malware and browser traffic (using
recent traffic sources), in order to establish their distinctive
features. The research has focused solely on the Microsoft
Windows operating systems family, as it is still the most
frequently attacked platform—in 2018, more than half of the
newly developed malware targeted these systems [10]. The
conducted investigation explores a set of features created
based on the authors’ own experience with real
malware samples’ analyses and previous research work in
this area (see Section 2). The chosen features reflect the
structure of requests, values of different HTTP protocol
fields, and the analysis of payload data. The main objective is
to identify which parts of HTTP requests are different in
malware and the browser network traffic and which can be
identified as general features for distinguishing between
these two types. The features and their values deviating from
standards defined by network traffic originating from
browsers will be defined as anomalies. Some of the analyzed
features can be seen as anomalies because they do not
conform to standards or registered values, and they are
present in both malware and browser network traffic. In
such cases, the frequency of such occurrences will be
quantified.
The main motivation behind this work is to provide
other researchers with a list of identified anomalies of
malware HTTP traffic. Such a list can be used directly by
analysts when analyzing network traffic (e.g., during digital
investigation) but also as an entry point for the design of
malware detection systems. Availability of a well-described
set of network anomalies can also help in developing other
monitoring systems, for example, malware fingerprinting
solutions. Therefore, the authors believe that this work will
help in fighting malicious software.
Considering the above, the main contributions of this
paper are
(i) Conducting a survey of HTTP requests’ features
previously used to detect malware in the existing
research and performing an academic verification of
usefulness of features proposed by previous non-
academic work
(ii) Proposing an improved set of HTTP requests’
features, including original ones, which can be
potentially utilized for malware identification
(iii) Identifying and analyzing malware HTTP requests’
distinctive features and performing analysis of their
influence on each other
(iv) Providing a list of malware HTTP requests’ dis-
tinctive features, along with practical usage
scenarios
The contributions of this paper in a summarized and
concise form are presented in Sections 6.1 and 6.2.
Due to a substantial number of performed analyses, not
all of them were described in this paper, in order to maintain
its clarity. Included are only those results which can help
distinguish between malware and browser HTTP network traffic.
The rest of the paper is structured as follows. Section 2
describes the existing work related to the HTTP-based
anomaly detection. Section 3 explains the fundamentals of
the HTTP protocol. In Section 4, an experimental meth-
odology used in this paper is outlined in detail. Section 5
presents obtained experimental results. Section 6 investi-
gates how our discoveries can be applied in practice to the
existing detection solutions. In Section 7, several limitations
of this work are discussed. Finally, Section 8 concludes this
paper and outlines future work.
2. Related Work
This section reviews the existing works which are most
closely related to the research conducted in this paper. To
start with, academic research papers exploring the behavior
of malware are described, as they can be directly compared
with the work presented below. Several nonacademic sources are also
investigated; they show or use features for identification of
the malware HTTP requests.
Rossow et al. presented in [11] the results of analysis of
malware network traffic. They analyzed more than 100,000
samples, from which about 43. % performed network ac-
tivity. The authors provide observations about DNS and
HTTP traffic, but only the latter will be summarized here.
Analysis of the HTTP requests revealed that 9.5% of
samples sent GET and 56.3% sent POST requests. Further-
more, 144 unique header names were observed. 9 .6% of
samples specified the User-Agent header; however, only 31%
of samples included correct values. Additionally, 50.6% of
samples changed this value during execution. 44.3% of
samples included the Accept-Language header; however,
24.1% of them did not respect the operating system language
locale.
In [12], Nelson presented a framework for analyzing and
visualizing malware network traffic. The framework expands
on the Sandnet framework [11] and its analyses. It provides a
means for execution of malware samples and capturing their
network traffic; it also provides analysis and clustering of
protocols and visualization of the obtained results. To
evaluate the framework, an analysis was conducted, pro-
viding manual inspection of 5 malware families and semi-
automatic inspection of the whole dataset of 16,967 pcap
files. In the latter part, the author analyzed network protocol
breakdown and characteristics of DNS and HTTP protocols.
Analysis of 11 ,035 HTTP requests included in the dataset
was performed on multiple features, such as the request
method, URI, or popular header values. The results reveal
that in 6.7% of the captured files, GET requests were
present. Also, in 6. % of these files, POST requests were
present. 24 unique header names were observed, as well as 33
unique User-Agent header values. The author also performed
an analysis of Accept-Language and Content-Type headers,
stating that their values can be used to identify malicious
network traffic.
Calzarossa and Massari in [13] presented an evaluation
of headers’ usage in HTTP traffic. The authors monitored
network traffic generated towards web servers at their
university, focusing mainly on capturing HTTP requests.
They analyzed a dataset which consisted of 315,000 requests,
sent by about 6100 clients. The results indicated interesting
characteristics of HTTP traffic. About 4% of requests were
sent using HTTP/1.0, and the number of header fields was
distributed between 0 and 14 (with mean 6.34). The number
of unique header names was about 60, but the number of
occurrences was different. Host and User-Agent headers were
the most popular ones and appeared in more than 99% of
requests, followed by Connection, Accept, and From. The
authors also analyzed headers’ usage patterns, i.e., popularity
of headers’ sets among requests. The 10 most popular
patterns occurred in 1% of requests, and about two-thirds
of requests shared one pattern. The authors also observed
that the number of headers and usage patterns differed
between the browsers and web robots, thus allowing them
to be distinguished easily.
As already mentioned, some nonacademic sources re-
lated to this research are presented below.
In a presentation “HTTP Header Hunter: Looking for
Malicious Behavior into Your HTTP Header Traffic,”
Montoro presents the scoring system for the HTTP re-
quest headers [14]. The system inspects HTTP requests’
features, whitelists and blacklists of the User-Agent header
values, or top-level domains and the third-party data
sources such as geoIP. The analyzed HTTP requests’
features include presence of common headers (for ex-
ample, Cookie, Accept-Encoding, and Connection), number
of header fields in a request, protocol version, User-Agent
header values’ size, type of files being requested in URI,
and presence of the Host header in HTTP/1.0 requests. He
also proposed the usage of headers’ ordering, response
headers, and parameter names; however, this was not
implemented in his work. The author’s analysis showed
that the malware sometimes does not include User-Agent
or its value length is usually shorter than 90 bytes. Also,
malware tends to send 13 headers in requests, and
nonmalicious applications usually send more than 9
headers. The presented system adds a score to the features
to provide information about maliciousness of requests,
and it was tested on 6127 data streams. A detection rate
of 9.1% and a false-positive rate of 9.15% were achieved.
Cuckoo malware sandbox system (https://cuckoosandbox.
org/) provides community modules which analyze HTTP
protocol traffic. The network_cnc_http module [15] provides
information about “suspicious features which may be in-
dicative of malware-related traffic.” It analyzes the lack of
the Referer header in the POST request, the lack of the User-
Agent header in the POST and GET requests, the presence of
HTTP 1.0 version requests, and the presence of the IP
address in the Host header. The multiple_useragent module
[16] verifies whether multiple User-Agent header values are
used.
Lewis presented a paper about HTTP headers’ heuristics
for malware detection [17]. The author proposes utilization
of some particular anomalies to help in the malware rec-
ognition. These include observing the User-Agent string for
values which are nonstandard and different from the usual
for the particular network, typographic errors in headers’
names and values (additional whitespaces and misspellings
of header names), and complexity of the requested resource,
e.g., the length of the requested URL.
It must be noted that a large portion of academic re-
search papers focus on describing malware detection sys-
tems using the HTTP protocol. A selected representation has
been described below.
Mizuno et al. presented in [18] the malware detection
system called BotDetector, which uses HTTP requests’
header patterns. The system creates HTTP templates based
on header fields; it does not focus on chosen fields, but on all
of them. Each field is split into words, which are then
evaluated using conditional probability of their appearance
in a particular position of the header field. After performing
calculations, header fields are clustered using the DBSCAN
algorithm, thus producing the HTTP request template.
Li et al. presented a framework for detection and clas-
sification of network traffic of malicious Android applications
[19]. It is based on analysis of the HTTP protocol and or-
ganized into 3 components: training module, clustering
module, and malware classification and detection module.
The training module uses 5 features for model building. It in-
cludes the values of the headers Host, Referer, User-Agent, and
Content-Type, and the value of the request URI. This module
uses the scoring mechanism which incorporates the header
value occurrence frequency and its previous presence in the
database.
Kheir in [20] introduced the malware taxonomy based
on the User-Agent header values, which is used for detecting
anomalous values proposed by the author. The User-Agent
header values are clustered in a two-step process. During the
first phase, they are clustered based on high-level features
like length of the string and different character type fre-
quencies. In the second step, the values are clustered in a
fine-grained manner, based on the similarity of value parts. Finally,
clusters are tokenized to produce HTTP signatures, used for
detection purposes.
Li et al. in [21] presented the detection system for
malware traffic. The system uses HTTP requests’ features
such as character distribution and the length of the URL,
values of Content-Type and User-Agent headers, and or-
dering of the headers in the request.
In [22], the authors presented the malware detection
system utilizing analysis of multiple HTTP requests to create
behavior models. Statistical models were created for coarse-
grained clustering, based on multiple requests of malware
using, for example, the average length of the payload, re-
sponse, or URI. For fine-grained clustering, they used the
request method and lexical features of the URI.
To the authors’ best knowledge, the papers presented
above have the following disadvantages when compared
with the research presented in this paper:
(i) The number of malware families, samples, or ana-
lyzed HTTP requests is smaller than that in this
analysis
(ii) The scope and the number of analyzed features are
limited
(iii) The existing work focuses on presentation of the
detection system, without thoroughly (or only to the
limited extent) exploring feature analysis
(iv) The identification of anomalies is not proven within
the existing analyses
(v) Features of HTTP requests identified by nonaca-
demic sources were not verified academically
Considering the above, the work presented below aims at filling
these gaps by analyzing more extensively the dataset and
providing systematic analysis of a large number of HTTP
requests’ features.
Some academic sources provide different approaches to
the problem of malware and browser distinctiveness or to a
broader problem of detection of malicious behavior. Two
examples are presented below, along with discussion about
connection to this paper.
Mimura and Tanaka in [23] presented a generic attack
detection method based on proxy server logs and URL. The
method is independent of attack methods and does not
require designing features for classifiers. The authors used
the paragraph vector algorithm to capture the context be-
tween multiple lines of proxy logs and produce vectors for
three classifiers: support vector machine, random forests,
and multilayer perceptron. The experimental results proved
that the method can detect unseen drive-by-download at-
tacks and C&C traffic in proxy server logs. Mimura et al. in
their paper focused on detection of malware behavior, while
in this paper, the authors focus on the search for particular
features which distinguish malware and browser traffic.
Moreover, the method used by Mimura et al. does not re-
quire designing of features and can be seen as independent of
particular features, whereas in this paper, the authors
provided a static list of features, which are analyzed.
Nia et al. in [24] presented a detection method of new
generations of cyber threats using the pattern-based random
walk. The authors proposed to use a limited method of
random walk called the self-avoiding walk, in order to create
a behavioral graph based on network traffic. The method
uses the ordered triple of time, size, and direction, created for
packets in analyzed network flows. The authors created a
database of behavioral graphs for known threats. If the
analyzed packet set has a similar graph created by the self-
avoiding walk algorithm, then it is detected as malicious. The
authors reported a true detection rate of 95% for malicious
traffic. Nia et al. focused on detection of threats, while
authors of this paper focus on identification of features
distinguishing malware and browser traffic. Moreover, the
method by Nia et al. works on flow-level features and is
independent of higher network-level protocols, while au-
thors of this paper focused on specific parts of the HTTP
protocol, which is an application-level protocol.
3. HTTP Protocol Basics
HTTP protocol in version 1.1 was originally defined in RFC
2616 [25] in June 1999. The RFC has been obsoleted by RFCs
7230–7235 [26–31]. Two earlier versions of the protocol
exist: 0.9 and 1.0, where the latter was defined in RFC
1945 [32]. However, of these two, only version 1.0 should still be
supported.
The HTTP protocol is based on the client-server ar-
chitecture, where the client sends a request and the server
replies to this request with a response. Request methods
defined by RFC 7231 [27] are presented below with short
descriptions:
(i) GET: the primary method for resource retrieval; it
usually does not carry payload data, but this is not
forbidden
(ii) HEAD: this method is similar to GET, but the
server must not send any data in the response body
(except for the header section)
(iii) POST: the method of signalling to the server a re-
quest for processing the data enclosed in the payload
(iv) PUT: this method is used to create or replace the
state of the target resource with the state enclosed
in the message payload
(v) DELETE: according to RFC 7231, this method
“requests that the origin server removes the as-
sociation between the target resource and its
current functionality”
(vi) CONNECT: this method is used to signal proxy
request for creation of a connection with a des-
tination server provided in the request
(vii) OPTIONS: this method is used for discovering
information about the communication options
available for the requested resource
(viii) TRACE: this method is used to request the server to
resend the request back to the client; it must not
contain payload data
An example of an HTTP GET request is presented in
Figure 1.
Figure 1: Listing of an exemplary HTTP request (some of the lines
were wrapped to fit the table).
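The figure itself is not reproduced here; a representative request of the kind discussed below could look as follows (an illustrative reconstruction, not the exact listing from Figure 1):

GET / HTTP/1.1
Host: www.example.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.5
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0
Referer: http://www.example.com/start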
The first line in the request in Figure 1 indicates the
request method, “GET,” requested Uniform Resource
Identifier (URI) “/,” and the protocol statement with the
protocol version “HTTP/1.1.” At the end, the Carriage
Return Line Feed (CRLF) is added. Further lines contain
header fields, each with a field name and a field value,
separated by a colon “:”. Field names are case insensitive,
and field values consist of printable US-ASCII characters.
In practice, the majority of the HTTP client imple-
mentations follow such behavior and use printable US-
ASCII characters in the header fields. Additionally, RFC
7230 obsoleted the usage of non-US-ASCII characters in
these fields.
According to RFC 7230, there cannot be any space or
horizontal tabulator between the field name and the colon.
There can however be any number of such characters be-
tween the colon and the field value, as well as between the
field value and the end of the header field. Usually, there is
only one space before the field value and no whitespace
characters at the end of the field. Additionally, the header
field order does not have any special meaning.
According to Section 5 of RFC 7231, the reason behind
the usage of header fields is “[...] to provide more infor-
mation about the request context, make the request con-
ditional based on the target resource state, suggest preferred
formats for the response, supply authentication credentials,
or modify the expected request processing.” Moreover, the
header fields “ought to be registered with IANA”
(according to Section 3.2.1. of RFC 7230). The registry can be
accessed at https://www.iana.org/assignments/message-
headers/message-headers.xhtml.
Despite a rather extensive number of registered header
field names (or shortly, headers), some of them are more
popular than others. Many of these fields are included in this
analysis. They are listed below:
(i) Host—carries information about the host, port, and
target URI; this header field must be present in all
requests of the HTTP protocol version 1.1
(ii) Accept—specifies response media types that are
acceptable by the client
(iii) Accept-Language—characterizes the set of natural
languages preferred by the client in the response
(iv) Accept-Encoding—depicts acceptable content
codings
(v) User-Agent—indicates which application is the
source of the request
(vi) Connection—specifies control options of the con-
nection desired by the client
(vii) Referer—points to the source URI from which
requested URI originates
Once the set of header fields is in place, an additional
CRLF tag is inserted. At this point, a message body can be
added. The message body may end with the CRLF.
The message body can contain encoded data if the
original payload is compressed. The popular methods used
for this purpose are deflate and gzip. Additionally, data can
be divided and encoded with chunked transfer coding. In
such a way, parts of the data are sent using chunk-size
information.
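For illustration, a chunked message body carrying the nine-byte string "Wikipedia" could be encoded as follows (each chunk is preceded by its size in hexadecimal and terminated with CRLF; a zero-sized chunk ends the body):

4
Wiki
5
pedia
0

After dechunking, the receiver reassembles the original payload "Wikipedia".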
RFC 7230 also defines pipelining mechanism. When
using this communication mode, the client can send mul-
tiple requests without waiting for corresponding responses.
4. HTTP Traffic Analysis Overview and
Experimental Methodology
4.1. HTTP Traffic Analysis Overview. An overview of the
malware HTTP requests’ analysis workflow is presented in
Figure 2. The process of analysis begins with choosing pcap
files which contain HTTP request traffic from datasets. The
files are filtered with the tshark (https://www.wireshark.org/
docs/man-pages/tshark.html) filter so that only the TCP
protocol segments containing HTTP requests that are not
OCSP (Online Certificate Status Protocol) requests remain.
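The exact filter expression is not given in the text, but such a step could be expressed roughly as follows (an illustrative tshark invocation that keeps only non-OCSP HTTP requests):

tshark -r input.pcap -Y "http.request and not ocsp" -w http_requests.pcap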
Files with HTTP requests are then fed to an IDS system. In
the proposed approach, Snort IDS with ET Pro rules (https://
www.proofpoint.com/us/threat-insight/et-pro-ruleset) and
Snort registered rules (https://www.snort.org/downloads/
#rule-downloads) are used to check for alert logs trig-
gered by the network traffic within the pcap files. Labeling of
HTTP requests begins with semimanual check of the gen-
erated alerts. They are reviewed in order to filter out those
which do not relate to HTTP traffic, lack information
about maliciousness, or alert on nonmalicious ap-
plications or services, for example, Tor traffic. The top 10
most common Snort IDs (SIDs) with alert messages after
semimanual filtering are presented in Table 1. Alerts for
Trojan Dridex and ransomware Locky are frequent in the
dataset, and their impact will be discussed in Section 4.2. In
the next step of request labeling, every unique SID is labeled
manually with the malware family name, depending on the
name provided by the alert message. If no family name is
present, it is labeled as No-name. If different variants of
family names are present (occurring when various vendors/
malware analysts provide different names), they are nor-
malized to one name. In the final step of HTTP request
labeling, every request is assigned with a set of Snort IDs
alerted for a particular request. The assignment is done
automatically, on the basis of correlation of tuple: source IP
address, source port number, destination IP address, desti-
nation port number, and timestamp between the tshark
output for every request and the corresponding tuple in the
IDS alert set. Please note that the timestamp values are
transformed and normalized, in order to prevent any time
deviations if the request timestamp reported by tshark is
different than that reported by the IDS. If the tuples are the
same, the request is assigned with a particular SID of the
alert, along with the malware family name. In the case of
multiple SIDs assigned to one request, but with different
family names, the request is analyzed manually to provide
the final label.
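A minimal sketch of this correlation step is given below (Python; the field names and the time tolerance are illustrative assumptions, not the authors' actual implementation):

def label_requests(tshark_requests, ids_alerts, time_tolerance=1.0):
    # Each item is a dict with src_ip, src_port, dst_ip, dst_port and a
    # normalized timestamp ts; alerts additionally carry sid.
    # (Illustrative field names, not the authors' exact data model.)
    labeled = []
    for req in tshark_requests:
        sids = set()
        for alert in ids_alerts:
            same_tuple = all(req[k] == alert[k]
                             for k in ("src_ip", "src_port", "dst_ip", "dst_port"))
            if same_tuple and abs(req["ts"] - alert["ts"]) <= time_tolerance:
                sids.add(alert["sid"])
        labeled.append((req, tuple(sorted(sids))))
    return labeled

reqs = [{"src_ip": "10.0.0.5", "src_port": 49152,
         "dst_ip": "198.51.100.7", "dst_port": 80, "ts": 12.0}]
alerts = [{"src_ip": "10.0.0.5", "src_port": 49152,
           "dst_ip": "198.51.100.7", "dst_port": 80, "ts": 12.3, "sid": 2023577}]
print(label_requests(reqs, alerts))  # the request is assigned SID 2023577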
HTTP requests labeled as malicious are evaluated using
feature analyzers. The feature list is static and is discussed in
Section 4.3. The analyzers utilize popular tools to perform
the actual analysis of the features. Feature extraction and
analysis process can be divided into three steps: (i) analysis
of basic features, (ii) analysis of complex features, and (iii)
analysis of payload features.
Basic feature extraction and analysis covers all features
which can be analyzed by their direct value extracted from
the HTTP request, for example, version of the protocol, the
type of the request method, or the popular header values. For
the extraction process, tshark is used to provide values of
particular fields, supported by the tool, for example, the
http.request.method for extraction of the request method.
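For example, basic field values can be pulled out of a capture with an invocation along the following lines (illustrative; the exact set of extracted fields used by the authors is broader):

tshark -r http_requests.pcap -Y "http.request" -T fields -e http.request.method -e http.request.version -e http.host -e http.user_agent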
Extraction and analysis of complex features covers the
process of obtaining values for features which are not direct
values of request fields but involves further analysis of such
fields. These features include, for example, verifying the
presence of unusual whitespace characters or non-US-ASCII
characters in the header values. The process is performed by
using the Scapy Python library (https://scapy.net/) for
analysis of pcap files and extraction of HTTP requests. The
actual analysis of data provided by Scapy is continued using
Python scripts, depending on the particular feature.
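The listing below illustrates, in simplified form, the kind of checks involved (plain Python over a single raw header line; the authors' actual scripts operate on requests extracted with Scapy and cover more cases):

def header_anomalies(raw_header_line: bytes):
    # Flag simple structural anomalies in a single raw header line
    # (the line is assumed to be given without its trailing CRLF).
    findings = []
    name, _, value = raw_header_line.partition(b":")
    if name != name.rstrip(b" \t"):
        findings.append("whitespace between the field name and the colon")
    if value.endswith((b" ", b"\t")):
        findings.append("whitespace at the end of the field (before CRLF)")
    if b"  " in value:
        findings.append("double space in the field value")
    if any(b > 0x7E or (b < 0x20 and b != 0x09) for b in value):
        findings.append("non-US-ASCII or control character in the field value")
    return findings

print(header_anomalies(b"User-Agent : Mozilla/4.0  (compatible)\xff "))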
The final step of feature extraction and analysis is per-
formed for requests which have payload. Firstly, tshark Lua
scripts are used for extraction of the payload data. Then, a
Perl script is used to detect the presence of the non-US-
ASCII characters in the extracted payload. Finally, the
payload entropy is calculated using ent—a pseudorandom
number sequence test program (http://www.fourmilab.ch/
random/).
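The per-byte entropy reported by ent corresponds to the Shannon entropy of the byte distribution, which can be sketched as follows (Python; illustrative, not the authors' code):

import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    # Shannon entropy in bits per byte (0.0 for empty input), comparable to
    # the "bits per byte" value reported by ent.
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())

print(shannon_entropy(b"aaaa"))                      # 0.0 - fully predictable
print(round(shannon_entropy(bytes(range(256))), 2))  # 8.0 - uniform byte values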
After analyzing request features, the requests are
assigned to request groups and malware categories in order
to prepare data for statistic calculation. This part of the
process is extensively described in Section 4.2.
It must be noted that the benign browser-based HTTP
traffic was directly analyzed using the same set of analyzers
as described above. Additionally, it was also fed into the IDS
system in search for any traces of malicious traffic. The
results did not show any significant alerts.
4.2. HTTP Traffic Statistic Calculation. The traffic statistics
for the malicious dataset rely on grouping analyzed requests
into different sets of requests, called request groups.
Figure 2: An overview of the malware HTTP requests’ analysis workflow used in the paper.
Table 1: The top 10 most common Snort IDs and alert messages observed in the analyzed traffic.
Snort ID Alert message
1:436 5:1 MALWARE-OTHER Win.Trojan.Nemucod variant outbound connection
1:2023577:1 ET TROJAN Locky CnC Checkin HTTP Pattern
1:3267 :2 MALWARE-CNC Win.Trojan.Dridex variant outbound connection
1:33145:2 MALWARE-CNC Win.Trojan.Dridex initial outbound connection
1:201947 :1 ET TROJAN Dridex POST Checkin
1:2023551:1 ET TROJAN Locky CnC checkin Nov 21
1:2023552:1 ET TROJAN Locky CnC checkin Nov 21 M2
1:2 07610:2 ETPRO TROJAN DirtJumper DDoS (INBOUND)
1:2016 79:2 ET POLICY Unsupported/Fake Windows NT Version 5.0
1:2 21731:3 ETPRO CURRENT_EVENTS MalDoc Request for Payload Aug 17, 2016
Requests form a request group when they trigger the same
alerts at the IDS. Exemplary relations between inspected
HTTP requests and request groups are presented in Figure 3.
Figure 3: An example of assigning HTTP requests into request
groups based on the Snort ID rules reported by this IDS.
The mechanism of assigning HTTP requests to particular
request groups is as follows. For all HTTP requests reported
as malicious, Snort IDs (SIDs) indicated for this particular
request are analyzed, and a corresponding vector of SIDs is
created (this part of the procedure is presented in Section
4.1). All HTTP requests with the identical SID set (i.e., the
same vector) are put into the same request group. As pre-
sented in Figure 3, SIDs can overlap between the request
groups (see, e.g., SID 1); however, only unique SID vectors
are treated as distinct. Therefore, {SID 1} and {SID 1, SID 2}
groups are treated as different groups, as well as distinct
from the {SID 1, SID 2, SID 3} request group. The motivation
behind this is that HTTP requests triggering similar but
slightly different sets of IDS rules are assumed to be slightly
different from each other as well.
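A minimal sketch of this grouping rule, assuming each alerted request is paired with the list of SIDs reported for it (illustrative names, not the authors' implementation):

from collections import defaultdict

def assign_request_groups(alerted_requests):
    # Group HTTP requests by the exact set of Snort IDs (SIDs) they triggered;
    # alerted_requests is an iterable of (request_id, iterable_of_sids).
    groups = defaultdict(list)
    for request_id, sids in alerted_requests:
        sid_vector = tuple(sorted(set(sids)))  # identical vectors share a group
        groups[sid_vector].append(request_id)
    return groups

requests = [("req1", [1]), ("req2", [1]), ("req3", [1, 2]),
            ("req4", [1, 2]), ("req5", [3, 4, 5]), ("req6", [3, 4, 5])]
for sid_vector, members in assign_request_groups(requests).items():
    print(sid_vector, members)
# (1,) ['req1', 'req2']
# (1, 2) ['req3', 'req4']
# (3, 4, 5) ['req5', 'req6']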
If not specified otherwise, statistics of the HTTP features
are calculated based on the request groups and not based on
single requests. An example of statistic calculation is pre-
sented in Figure 4. When considering how often request
methods are prevalent, it must be verified in how many of
the request groups a particular method is present. Every
request in a request group is checked; if in all of them the
method is present, the request group is treated as one entity
from the statistical point of view. In the example presented
in Figure 4, such a situation occurs for request groups 1 and
3 with the GET method and for the request group 4 with the
POST method. In the request group 2, both GET and POST
methods are present; thus, such a request group is reported
as having multiple values. Depending on the feature, such
cases are rather infrequent. Final results for the above-
mentioned example indicate that the GET method was
present in 50% of request groups, POST in 25%, and multiple
values were present in 25% of request groups.
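The calculation from Figure 4 can be sketched as follows (Python; request groups are represented simply as lists of per-request feature values, which is an illustrative simplification):

from collections import Counter

def group_feature_statistics(request_groups):
    # Count each feature value once per request group; groups whose requests
    # carry mixed values are reported under the MULTI label.
    counts = Counter()
    for group in request_groups:
        values = set(group)
        label = values.pop() if len(values) == 1 else "MULTI"
        counts[label] += 1
    total = len(request_groups)
    return {label: 100.0 * n / total for label, n in counts.items()}

# The request-method example from Figure 4: four request groups.
groups = [["GET", "GET", "GET"], ["GET", "POST", "POST"],
          ["GET", "GET", "GET"], ["POST", "POST", "POST"]]
print(group_feature_statistics(groups))
# {'GET': 50.0, 'MULTI': 25.0, 'POST': 25.0}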
It is worth noting that the statistics of the benign dataset
are calculated directly on requests—in this case, the requests
are not grouped as they do not trigger IDS alerts; thus, it is
impossible to group them.
Request groups are further divided depending on the
type of malware they represent. The main idea behind such a
presentation of results is to provide potential insights into
characteristics of various malware categories, which can
often demonstrate different operational behavior.
As in the example introduced earlier, statistics of the
GET request occurrences are calculated using request
groups of a particular malware category. The algorithm is
the same as the one described before; i.e., every request in
a request group is checked, and if all of them are GET
requests, the request group of the category is treated as
one unit for the statistics. When a certain feature is present
in 2 out of the 4 request groups of a category, the results
will show a 50% occurrence in this category.
The name of the related malware was obtained from the
IDS alerts and was used for classification purposes. The
request groups were divided into 20 categories, presented in
Table 2, along with the number of request groups in each
category. These categories were based on information
provided by the IDS rule comments, information from
malware dissection articles from the Internet, and from the authors’ own
experience. Request groups were labeled semimanually.
Many of the malware categories mentioned in Table 2 are
self-explanatory, e.g., Banker, Spambot, Ransomware, or
RAT. However, some other classes need to be explained in
more detail. The IP check category groups requests which
were sent to the IP address identification services. In that
way, malware typically checks the external IP address of the
infected machine or whether there is an Internet connection
available. The UA problem category contains only requests
for which the IDS alerted on a problematic User-Agent
header value but provided no information about the malware
family. The Downloader type groups all malware families
which are used to download other malware. This class is
different from Downloader/JS: in the latter, the actual
code is JavaScript, while in the former, it is a binary
(EXE) file. Similar to these categories is Maldoc,
where additional malware is downloaded using malicious
documents, for example, Microsoft Word or Excel macros.
When the IDS labeled a request as a malware download, but
no information about the malware family name was
provided, the request was treated as the Malicious download
type. Request groups of Trojan malware which cannot be
assigned to any other specialized category (such as Stealer,
Banker, or Clicker) were treated as the catch-all Trojan kind.
Finally, request groups which could not be incorporated
into any other category were ascribed to the Other class.
Figure 4: An example presenting statistic calculation methodology used for result analysis.
Table 2: Malware categories used to organize the obtained experimental results.
Category Description Number of request groups
Downloader Downloading other malware 134
Banker Banking Trojan 125
Trojan Trojan malware 117
Ransomware Encrypting files and demanding ransom 5
Stealer Stealing users’ information 45
PUA/Adware Potentially unwanted applications or adware 30
IP check Checking IP address or connectivity 2
UA problem Problem with User-Agent header value 26
DDoS DDoS attack malware 24
Spambot Sending spam e-mails 20
Malicious download Downloading other malware 20
Miner Cryptocurrency mining 1
Maldoc Downloading other malware 16
Clicker Ad and link clicking 13
Downloader/JS Downloading other malware 12
Backdoor Backdoor Trojan 11
RAT Remote access Trojan 9
Bruteforce Bruteforcing, e.g., login panels 9
Other Other malware
Keylogger User key stroke logging 6
The reason behind such a request arrangement (both in
request groups and in malware categories) is to limit the
effect of inequality of the number of requests between
malware families. For example, the Locky ransomware
family is represented in our dataset by one of the largest
numbers of requests (ca. 1 0,000), which constitutes almost
30% of all requests. Without the proposed categorization,
such requests would have a tremendous impact on the
presented results.
It must also be noted that the quantity bias was not fully
eliminated. Even after introduction of request groups,
families with a large number of requests can still have a
higher number of request groups. The approach taken in this
paper regarding malware categories can limit this impact,
but it cannot eliminate it completely. The authors of this
paper believe that it is a trade-off between bias of the dataset
and identification of potential anomalies in the broader
datasets.
4.3. Analyzed HTTP Traffic Features. The analyzed HTTP
requests’ features for the purpose of this paper were assigned
into 3 categories, with each of them representing different
aspects of the request characteristics: (i) HTTP request
structure, (ii) header field values, and (iii) HTTP request
payload feature groups. Such an approach was based on the
authors’ malware behavior analysis experience and previous
works of other researchers, as discussed in Section 2. Table 3
outlines the relation between the research source and the
corresponding feature category. The description of the
proposed feature groups is presented below, along with
information on which features were proposed by the authors
of this paper based on their experience. Such features in this
paper are marked with an asterisk (*).
4.3.1. HTTP Request Structure Features. The analysis of the
HTTP request structure involves checking the form of the
request, i.e., occurring headers, protocol control informa-
tion, structure of the fields, and, as an extension, TCP
protocol destination port of the request. The features are
presented in Table 4.
Repetition of some headers is a known method for HTTP
requests’ smuggling through network devices such as fire-
walls and web proxy servers (cf. [33]). In this research, it is
utilized to identify errors of malware developers, such as
an unskillful change of a header value.
4.3.2. Header Field Value Features. Header field values were
examined in order to verify whether any of them are invalid
or significantly different from others. Also, the presence of
some additional or unusual whitespace characters was
verified as well as the presence of non-US-ASCII characters
(from this point onward, non-US-ASCII and non-ASCII will
be used interchangeably). Evaluation of the User-Agent
header was performed to obtain a list of names under which
malware presents itself to the server. During analysis of the
Host header value, the type of the value was determined; it
was verified whether it was an IP address, domain name, or
some other value. The feature list is presented in Table 5.
The Host header value was analyzed for the value types as
presented in Table 6.
The Host header value is frequently analyzed in the existing
works (for example, Li et al. [19]). In this research, it was
inspected to establish its value type, including error values.
As such, according to the best of the authors’ knowledge, it is
the first attempt to provide such information in a general
manner.
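A simplified classifier of Host header values into the types listed in Table 6 might look as follows (Python; the regular expression and the lumping of error and other values are illustrative assumptions):

import ipaddress
import re

DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)+$",
    re.IGNORECASE)

def host_value_type(value: str) -> str:
    # Classify a Host header value into the types listed in Table 6.
    host, _, port = value.rpartition(":")
    if port.isdigit() and host:
        suffix = " with the port number"
    else:
        host, suffix = value, ""
    try:
        ipaddress.ip_address(host)
        return "IP address" + suffix
    except ValueError:
        pass
    if DOMAIN_RE.match(host):
        return "Domain name" + suffix
    return "Error in the domain / other value"

print(host_value_type("203.0.113.5"))       # IP address
print(host_value_type("example.com:8080"))  # Domain name with the port number
print(host_value_type("exa mple..com"))     # Error in the domain / other value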
4.3.3. HTTP Request Payload Features. Analysis of the
payload data includes features as presented in Table 7.
Its evaluation was performed on the data after decoding/
decompression and dechunking and not on the data as seen
on the wire. The reason behind it was to analyze the final
payload, excluding the influence of compression and
chunking mechanisms on the analyzed payload data. Please
note that in the “presence of non-ASCII characters in the
payload” feature, “non-ASCII” is meant as “non-US-ASCII,”
and it will be used in the shorter form in this text.
4.4. Data Sources. Two data sources were used in the
conducted investigations. As the sources of malicious HTTP
traffic, pcap files from CERT Polska’s sandbox systems and
Malware Capture Facility Project (MCFP) (https://www.
stratosphereips.org/datasets-malware) were used. Basic in-
formation about these two sources is presented in Table 8.
PCAP files from CERT Polska’s sandbox environment
were generated by Windows-based malware analyzed in
2016–2018. The malware samples originate from automatic
systems and incident reports. The former represents systems
which collect samples from CERT Polska’s internal malware
hunting systems and publicly available sources provided by
various entities, including Shadowserver (https://www.
shadowserver.org/) or Abuse.ch (https://abuse.ch/). The
incident reports which provided malware samples were
reported mainly by Polish citizens (CERT Polska acts as a
Polish national CSIRT) and also by researchers and other
entities outside of Poland. All malware samples were ac-
quired during the period of 2016–2018 and represent
malware encountered in the wild. The malware analysis
system consisted of Windows 7 virtual machines orches-
trated by a modified Cuckoo Sandbox system. The main
modifications involved hardening the system against
anti-VM and anti-analysis techniques and improving the
process monitoring services. The MCFP repository is maintained
at the Czech Technical University in Prague and consists of
pcap files from the long-term Windows malware observa-
tions. Both repositories represent popular malware families.
For the legitimate browser traffic, the authors decided to
generate it on their own. Various web browsers under
control of different versions of the Windows OS were used,
as depicted in Table 9. This table also contains the number of
analyzed requests for each web browser. The browsers were
instrumented using the Selenium automation toolset
(https://www.seleniumhq.org/) to visit websites from the list
of 500 most popular websites worldwide. The list was created
using the Alexa top 1 million websites worldwide (http://s3.
amazonaws.com/alexa-static/top-1m.csv.zip). The websites
were accessed between 9 and 15 February 2017 and between
13 and 1 October 2017, depending on the browser.
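A minimal sketch of such instrumentation with Selenium is shown below (illustrative; the authors' harness additionally varied browsers, operating systems, and capture tooling):

import csv
from selenium import webdriver

# Read the first 500 entries of the Alexa top-1M list (rank,domain per row).
with open("top-1m.csv", newline="") as f:
    sites = [row[1] for _, row in zip(range(500), csv.reader(f))]

driver = webdriver.Firefox()            # or webdriver.Chrome(), webdriver.Ie(), ...
driver.set_page_load_timeout(60)
for site in sites:
    try:
        driver.get("http://" + site)    # plain HTTP so the requests stay observable
    except Exception:
        pass                            # skip sites that fail to load in time
driver.quit()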
Table 10 presents the top 5 malware families in the
categories grouped by the number of request groups. It
should be noted that 1 .67% of request groups were sent by
unknown malware. In the table, such request groups are
marked as No-name.
5. Experimental Results
In this section, experimental results of analysis of features
presented before in Section 4.3 are outlined. The presen-
tation of the results uses the categorization introduced there.
5.1. HTTP Request Structure Features. Some of the analyzed
features did not yield any occurrences in either the malware or
the browser HTTP traffic. These are the repetitions of the header (two
header fields with the same name) and presence of pipelined
requests; i.e., multiple requests are sent without waiting for
their corresponding responses. The lack of colon in the
header field was not observed in malware traffic and was
present in only 4 requests of Internet Explorer 11 on
Windows 7 and Windows 8.1 (two in each browser), i.e., in
about 0.01% of requests. Also, analysis of HTTP request
methods showed that it cannot be directly used to distin-
guish between malware and browser traffic. It should, however,
be noted that browsers mostly sent GET requests, while a
significant portion of malware requests were POST requests.
5.1.1. HTTP Version. The results of the analysis of the HTTP
protocol version in malware traffic are presented in Figure 5.
The majority of the analyzed malware families grouped in
categories usually used version 1.0 of HTTP. The highest
level of occurrence of version 1.1 (i.e., 42.22% of request
groups) was observed for malware in the Stealer category, while the
other categories have a lower number of request groups for
this version of the protocol. Banker, Downloader, and Trojan
have about 15% of request groups with version 1.1. RAT and
PUA/Adware have about 10%. There are 6 categories with
nonzero levels of up to 10% of request groups. Finally, 6
categories of malware do not have any requests with version
1.1.
Table 3: The relation between the research source and the analyzed HTTP request feature groups.
Feature group Research sources
HTTP request
structure Montoro [14], Cuckoo, Calzarossa et al. [13], Rossow et al. [11], Nelson [12], Li et al. [21]
Header field values Montoro [14], Lewis [17], Mizuno et al. [18], Calzarossa et al. [13], Li et al. [19], Kheir [20], Rossow et al. [11], Nelson
[12], Li et al. [21], Perdisci et al. [22]
HTTP request
payload Perdisci et al. [22]
Table 4: List of HTTP requests’ structure features.
Feature name
HTTP protocol version
Request method
Repetitions of the header (two header fields with the same name)
Lack of colon in the header field
Number of headers in the request
Frequency of the headers’ occurrence
Misspellings of the header names
Presence of request pipelining
TCP destination port in the request
Features proposed by the authors of this paper are marked with * (an asterisk).
Table 5: List of header field value features.
Feature name
First character of the header field is a whitespace
Whitespace before CRLF tag
Space before colon, semicolon, or comma
New line character other than CRLF
Double space
Nonstandard whitespace characters in the header field
Non-ASCII value in the header
Accept-Language header value
Accept-Encoding header value
Connection header value
Host header value
User-Agent header value
Features proposed by the authors of this paper are marked with * (an asterisk).
Table 6: List of Host header value types.
Host header value type
IP address
Domain name
IP address with the port number
Domain name with the port number
Error in the domain
Other value
Table 7: List of HTTP request payload features.
Feature name
Payload data length
Payload entropy value
Presence of non-ASCII characters in the payload
Presence of non-POST requests with the payload
Presence of Referer header in the POST request
Features proposed by the authors of this paper are marked with * (an asterisk).
The same analysis related to the benign browser traffic
showed that only in Internet Explorer-based traffic running
under Windows 7 and 8.1 OSs, HTTP requests with version
1.0 occurred (0.01% and 0.0 %, respectively). The com-
parison of the results for malicious and benign traffic shows a
significant difference in protocol version usage. Therefore, in
the authors’ opinion, this feature is a good candidate for
distinguishing malware from browsers.
5.1.2. Number of Header Fields. Analysis of the number of
headers in the request groups shows that the number of
headers varies between 1 and 11. The results are presented in
a graphical form in Figure 6. Their analysis shows that in
most categories, there were fewer than 8 headers and up to 6
headers in categories such as Clicker, DDoS, Maldoc, Miner,
PUA/Adware, or Spambot. In many categories, 5 headers in a
request is a dominant value.
Most categories have request groups with multiple values
of the header count. These include 4 categories with more
than 40% of request groups (Bruteforce, IP check, Keylogger,
and Malicious download). All of them were analyzed further,
and the results indicate header number ranges similar to those
of the request groups with a single header count value.
The number of headers in browser requests is in the
range from 0 to 24 headers. However, for every browser, the
number of headers was in the range between 0 and 11 in at
least about 99% of requests. Results for this range are
presented in Figure 7, where the percentages of the
number of headers in the browser traffic requests are il-
lustrated. The most common numbers of headers are 7 and 8.
However, the ranges of the number of headers for
malware and browser traffic overlap, but their distributions
are different. As already mentioned, in the benign traffic,
most of the values are close to the two maxima (7 or 8 headers in
a request). For malicious traffic, the majority of requests has
up to 6 headers. From this perspective, the number of
headers in the request can be perceived as a useful feature to
distinguish malware and browser HTTP traffic.
5.1.3. Header Occurrence. The top 10 headers sorted by their
average frequency of occurrences in the benign traffic are
presented in Table 11. The first 7 headers occurred in all
browsers in at least about 90% of the requests. Some of the
well-known headers did not occur in the top 10, for example,
Origin (3.97% of all requests in all browsers), Content-Type
(1.21% of all requests), Cache-Control (1.1 % of all requests),
or Content-Length (0.95% of all requests). Nonstandard
headers, which begin with the prefix X-, were also observed.
Some of them are relatively well known, e.g., x-flash-version, but
others are server platform specific, e.g., X-TeaLeaf-Browser-
Res. One of the headers observed in the benign dataset was
particularly interesting. The “_” (an underscore) header was
present only in two requests sent to unid.go.com on Win-
dows 7 OS by Google Chrome and Microsoft Internet Ex-
plorer 11. This network traffic is associated with the content
delivery networks (CDN) operations. Additionally, some of
the well-known headers were observed written in varying
cases, for example, Authorization and authorization; Content-Type,
Content-type, and content-type; and Accept and
accept. Generally, lower case versions were less frequent.
Table 12 summarizes the presence of particular headers
in the requests of malware traffic. The values present the top
10 headers regarding the percentage of all malware cate-
gories where the header appeared. Percentages were counted
in the request groups in which the header was present in all
requests.
Analysis of unique header names found in malicious
traffic shows that besides well-known headers, their versions
written in lowercase were also present, for example, accept-
Language or Content-type.
Table 9: Networking environments in which HTTP traffic was analyzed.
Browser name Operating system Abbreviation Number of requests
Microsoft Edge Windows 10 Edge Win10 17,912
Google Chrome Windows 7 Chrome Win7 30,621
Mozilla Firefox (Adobe Flash Player installed) Windows 7 Firefox-FP Win7 1 ,705
Mozilla Firefox Windows 7 Firefox Win7 2 ,17
Microsoft Internet Explorer 11 Windows 7 IE11 Win7 30,799
Google Chrome Windows 8.1 Chrome Win8.1 23,967
Mozilla Firefox Windows 8.1 Firefox Win8.1 1 ,153
Microsoft Internet Explorer 11 Windows 8.1 IE11 Win8.1 20,24
Note. Abbreviations introduced here are used in the paper to refer to the specific environments.
Table 8: Basic information about malicious pcap repositories.
Feature CERT.pl MCFP Sum
No. of pcaps in repository 36,268 117 36,385
No. of pcaps with HTTP network traffic 26,042 91 26,133
No. of pcaps with HTTP network traffic containing requests alerted by IDS 22,630 67 22,697
No. of reported IDS alerts 2,133,682 425,441 2,559,123
No. of reported IDS alerts assigned to requests 405,116 238,805 643,921
No. of unique alerted IDS rules 57  139 642
Some of the header names cannot be found in any official
documentation, for example, Filename, Idle-time, Content-Key,
or Server-Key. One header, user-, looks like it was created to
mimic the User-Agent header, but for some reason it remained
unfinished.
It must also be noted that no misspellings of header
names were found in the observed malicious and benign
traffic. The user- header cannot be categorized as a misspelling, but in
some way it proves the observation that sometimes malware
developers do make errors.
In both HTTP traffic datasets, the header names with
alternative case spellings were observed, for example, Content-
Type and Content-type. RFC 7230 does not prohibit such a
usage, stating that header names are case insensitive. However,
the observed traffic demonstrates that upper-cased first
characters are more popular, regardless of the type of the traffic
dataset (benign/malicious).
Table 10: Top 5 malware families in categories grouped by the
number of request groups.
Family name Number of request groups
Backdoor
Htbot 3
GrayBird 2
Dimnie 2
Zeprox 1
Votwup.D 1
Mokes 1
Banker
Ursnif 27
Dreambot 24
Chthonic 12
Emotet 11
Kronos 10
Bruteforce
No-name 6
Pifagor 2
Clicker
KOVTER 6
Zeroaccess 4
Sefnit 2
Miuref/Boaxxe 1
DDoS
DirtJumper 17
MegalodonHTTP 4
Madness 2
MedusaHTTP 1
Downloader
Pony 21
Nemucod 19
SmokeLoader 17
Locky 12
Zbot 11
Downloader/JS
No-name
Cryxos 4
IP check
No-name 2
Keylogger
AgentTesla 3
Keybase 2
KeyLogger.acqh 1
Maldoc
No-name 16
Malicious download
No-name 20
Miner
No-name 11
Adylkuzz 4
1ms0rry 2
Smominru 1
Other
FakeAlert.jh 3
Ratankba 1
Psiphon 1
No-name 1
DustySky 1
Table 10: Continued.
Family name Number of request groups
PUA/Adware
Wizzcaster 3
InstallCapital 3
BubbleDock 3
Sureseeker 2
OfferCast 2
Ransomware
Locky 3
AlphaCrypt
PadCrypt 4
Sage 3
Fatboy 3
RAT
Quasar 2
XPCSpyPro 1
TViewer 1
Teamspy 1
ShinoBot 1
Spambot
Kelihos.F
Necurs 5
XnxxAgent 3
Sality 3
Tofsee 1
Stealer
AZORult 11
Loki 10
FormBook 6
WernikStealer 2
Hawkeye 2
Trojan
Zbot 29
No-name 16
Andromeda 12
Graftor 7
Betabot 6
UA problem
No-name 26
Figure 5: HTTP protocol version (malicious traffic).
Figure 6: The number of headers in a request (malicious traffic).
The occurrence of the lower-cased version of the header
names is low in both malware and browser traffic and therefore
is not distinctive enough to show the general difference between
the malicious and benign traffic.
Nevertheless, it can be more useful for distinction from the
perspective of individual malware families.
Based on the presented analysis, it can be concluded that the presence of certain headers can be used as a feature to distinguish between malicious and benign traffic. The list of such headers should include those indicated as the most popular ones in the browser traffic: Connection, Accept, Accept-Encoding, Accept-Language, and Referer. These headers appear in the analyzed malware traffic less frequently.
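To illustrate how this observation might be operationalized, the following minimal Python sketch checks a parsed request for the absence of these popular browser headers; the dictionary-based request representation and the function name are illustrative assumptions and not part of the tooling used in this study.

# Popular headers that browsers almost always send (see the discussion above).
BROWSER_HEADERS = {"connection", "accept", "accept-encoding", "accept-language", "referer"}

def missing_browser_headers(headers: dict) -> set:
    """Return the popular browser headers that are absent from the request."""
    present = {name.lower() for name in headers}
    return BROWSER_HEADERS - present

# A request carrying only Host and User-Agent lacks all five browser staples.
print(missing_browser_headers({"Host": "example.com", "User-Agent": "Mozilla/5.0"}))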
Figure 7: The number of headers in a request (benign traffic); percentage of request groups per browser by number of headers (0 to 11).
Table 11: Top 10 headers in a request (benign traffic). Values are the percentage of requests containing the given header.
Browser | Host | User-Agent | Connection | Accept | Accept-Encoding | Accept-Language | Referer | Cookie | DNT | Upgrade-Insecure-Requests
Edge Win10 | 100.00 | 99. 9 | 100.00 | 99. 6 | 96. 0 | 95.46 | 91.05 | 4 .64 | 1.07 | 0.00
Chrome Win7 | 100.00 | 100.00 | 100.00 | 99.79 | 99.62 | 99.16 | 95.62 | 55.19 | 0.00 | 6.12
Firefox-FP Win7 | 100.00 | 100.00 | 100.00 | 100.00 | 99.96 | 99. 4 | 93.97 | 4 .30 | 0.00 | 7.35
Firefox Win7 | 100.00 | 100.00 | 100.00 | 99.99 | 99.92 | 9 .12 | 93. 6 | 4 .05 | 0.00 | 5.64
IE11 Win7 | 99.99 | 99.99 | 99.99 | 99.9 | 92.90 | 91.9 | .33 | 43. 7 | .4 | 0.00
Chrome Win8.1 | 100.00 | 100.00 | 99.7 | 99.46 | 99.39 | 97.7 | 93.64 | 52.02 | 0.00 | 6. 1
Firefox Win8.1 | 100.00 | 100.00 | 99. 2 | 99. 3 | 99.7 | 99.77 | 94.01 | 49.62 | 0.00 | 7.31
IE11 Win8.1 | 99.99 | 99.99 | 99.74 | 99.70 | 93.91 | 92.9 | .65 | 44.02 | 0.26 | 0.00
Average | 100.00 | 99.99 | 99.93 | 99. 3 | 97.71 | 96.77 | 92.41 | 4 . 3 | 21.54 | 4.13
Some previous work has already been done on using the header order for malware detection or browser fingerprinting, for example, in the p0f tool (http://lcamtuf.coredump.cx/p0f3/). The idea behind it is to check the order in which the headers occur in the HTTP request and to identify the application which sends it. The authors of this paper believe that this problem has not been fully analyzed, and more research is needed.
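A minimal sketch of the header-order idea is shown below: it extracts the header names from a raw request in their original order and hashes them into a compact identifier, assuming the request is available as raw bytes. The parsing is deliberately simplistic and does not reproduce p0f itself.

import hashlib

def header_order_fingerprint(raw_request: bytes) -> str:
    # Take the header block, skip the request line, and keep the header names in order.
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("latin-1", "replace")
    lines = head.split("\r\n")[1:]
    names = [line.split(":", 1)[0].strip().lower() for line in lines if ":" in line]
    # Hash the ordered, comma-joined names into a compact identifier.
    return hashlib.md5(",".join(names).encode()).hexdigest()

req = (b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: keep-alive\r\n"
       b"Accept: */*\r\nUser-Agent: Mozilla/5.0\r\n\r\n")
print(header_order_fingerprint(req))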
5.1.4. Destination Port. Destination ports of requests in malicious traffic were also investigated. Unsurprisingly, most of the requests were sent to port 80. 7 malware categories sent requests to other ports, but even in these situations, at least  % of the request groups used port 80. Two categories (Banker and Ransomware) sent requests on port 443, which is the registered port for the HTTPS protocol. This behavior can be seen as anomalous regardless of the type of application that sent such requests.
Destination ports of requests in benign traffic were also analyzed. It turned out that every browser sent over 99. % of requests to port 80. However, some other ports were also discovered, e.g., 443, 0 0, 0, and 050.
When comparing the traffic results for malware and browsers, one can see that both categories send requests mainly to port 80. Other destination ports occur, but they are not as frequent. The main difference between the browser and the malware HTTP traffic is that malware uses higher-numbered ports, for example, higher than 10,000 in the Downloader category and higher than 40,000 in the Ransomware category. This difference can be seen only for single-request groups, and it cannot be confirmed as regular. Thus, the analysis of the utilized destination ports is not conclusive for distinguishing between the malicious and benign traffic.
Nevertheless, the usage of some ports for HTTP traffic is improper, for example, port 443, which is registered by IANA for the HTTPS protocol. Such a situation can be anomalous on its own, regardless of the type of network traffic, and usually triggers alerts in network monitoring systems.
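As a rough illustration of how the port-related observations could be checked automatically, the following sketch flags the anomalous situations discussed above; the thresholds and the function name are assumptions made for this example only.

def port_findings(dst_port: int) -> list:
    """Flag destination-port oddities for a plaintext HTTP request."""
    findings = []
    if dst_port != 80:
        findings.append("HTTP request sent to a port other than 80")
    if dst_port == 443:
        findings.append("plaintext HTTP on port 443, which is registered for HTTPS")
    if dst_port >= 10000:
        findings.append("unusually high destination port")
    return findings

print(port_findings(443))
print(port_findings(40123))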
5.2. Header Field Value Features. In this section, different features of the header field values were investigated. The conducted experiments revealed that some of the features initially selected for inspection (see Section 4) were not present at all in the analyzed HTTP requests. This includes the following features: (i) the presence of a space at the beginning of the header field, (ii) the occurrence of a space before the colon, and (iii) the appearance of a space before a semicolon. Considering the above, the results for these 3 features are omitted. Some other features were observed in the analyzed traffic; however, they did not give any significant results that can be utilized for distinguishing between malware and browser traffic because they were rare.
A whitespace character before the CRLF sequence was not present in the browser traffic, but it appeared in the network traffic of 5 malware categories (of which 2 categories exceeded 10% of request groups). The next feature is the presence of a space character before a comma in the header field. Its analysis revealed that it was absent in the browser traffic and present only in about 2.35% of request groups of the Ransomware category. A new line character other than CRLF was not present in malware traffic, but it was observed
Table 12: Top 10 headers present in requests (malicious traffic), sorted by the percentage of all categories where they appeared. Values are the percentage of requests in each category.
Category | Host | User-Agent | Connection | Accept | Accept-Encoding | Cache-Control | Content-Length | Content-Type | Accept-Language | Cookie
Backdoor | 100.00 | 100.00 | 45.45 | 45.45 | 27.27 | 54.55 | 45.45 | 54.55 | 27.27 | 9.09
Banker | 100.00 | 73.60 | 75.20 | 13.60 | 1.60 | 77.60 | 60. 0 | 19.20 | 4. 0 | 4. 0
Bruteforce | 100.00 | 100.00 | 7.50 | 100.00 | 12.50 | 0.00 | 7.50 | 7.50 | 0.00 | 25.00
Clicker | 100.00 | 100.00 | 30.77 | 7.69 | 7.69 | 46.15 | 61.54 | 46.15 | 0.00 | 0.00
DDoS | 100.00 | 3.33 | 7.50 | 0.00 | 0.00 | .33 | 4.17 | 4.17 | 0.00 | 0.00
Downloader | 100.00 | 90.30 | 75.37 | 55.22 | 32. 4 | 44.03 | 36.57 | 30.60 | 20.15 | 0.75
Downloader/JS | 100.00 | 3.33 | 100.00 | 3.33 | 3.33 | 0.00 | 0.00 | 0.00 | .33 | 0.00
IP check | 100.00 | 67. 6 | 75.00 | 32.14 | 2 .57 | 14.29 | 0.00 | 0.00 | 2 .57 | 7.14
Keylogger | 100.00 | 66.67 | 0.00 | 0.00 | 0.00 | 16.67 | 66.67 | 66.67 | 0.00 | 0.00
Maldoc | 100.00 | 100.00 | 1.25 | 1.25 | 1.25 | 6.25 | 0.00 | 0.00 | 25.00 | 0.00
Malicious download | 100.00 | 75.00 | 0.00 | 60.00 | 40.00 | 35.00 | 10.00 | 10.00 | 5.00 | 0.00
Miner | 100.00 | 77.7 | 50.00 | 11.11 | 27.7 | 0.00 | 3 . 9 | 3 . 9 | 0.00 | 0.00
Other | 100.00 | 75.00 | 62.50 | 12.50 | 12.50 | 12.50 | 12.50 | 25.00 | 12.50 | 12.50
PUA/Adware | 100.00 | 0.00 | 66.67 | 23.33 | 13.33 | 43.33 | 40.00 | 33.33 | 0.00 | 3.33
Ransomware | 100.00 | 0.00 | 71.76 | 62.35 | 47.06 | 70.59 | 77.65 | 71.76 | 42.35 | 1.1
RAT | 100.00 | . 9 | 55.56 | 22.22 | 22.22 | 33.33 | 44.44 | 44.44 | 0.00 | 11.11
Spambot | 100.00 | 70.00 | 45.00 | 5.00 | 5.00 | 40.00 | 75.00 | 35.00 | 5.00 | 0.00
Stealer | 100.00 | 57.7 | 6.67 | 35.56 | 13.33 | 11.11 | 66.67 | 6 . 9 | 26.67 | 0.00
Trojan | 100.00 | 90.60 | 74.36 | 47. 6 | 12. 2 | 53. 5 | 52.14 | 35.90 | 9.40 | 2.56
UA problem | 96.15 | 92.31 | 0.77 | 11.54 | 15.3 | 19.23 | 11.54 | 11.54 | 0.00 | 7.69
Note. The header was present in all requests of a particular request group in the malware category.
in the browser traffic of Internet Explorer 11 on Windows 7 and 8.1 in 2 and 13 requests, respectively. Both values are below 0.1% of all requests for both browsers. The next feature which gave limited results is the presence of a double space in the header field. It was not observed in the malicious traffic; with regard to the browser traffic, only Google Chrome (both on Windows 7 and 8.1) did not send requests with a double space in the header field. Mozilla Firefox on Windows 7 presented the highest percentage of such requests (0.21%), but overall, the percentages are low. Finally, nonstandard whitespace characters were not observed in browser requests, and in malicious traffic, they were present only in 1.1 % of Ransomware request groups. Additionally, the analysis of the values of the Accept-Language, Accept-Encoding, and Connection headers did not indicate any features distinguishing browser from malware traffic. Thus, the numerical results are omitted for brevity.
5.2.1. Non-ASCII Characters in the Header Field. Furthermore, the presence of non-ASCII characters in header fields in the malicious traffic was analyzed. This feature was observed in two malware categories: Backdoor (3.57% of request groups) and Ransomware (1.1 % of request groups). Additionally, in 3 request groups of the Ransomware category (an additional 3.53%), the feature was present irregularly.
A non-ASCII character in the header field in benign HTTP traffic was observed for only 3 browsers: Chrome, Firefox, and Internet Explorer running on Windows 7. However, it was only one request per browser, which is less than 0.01% of all requests in the respective sets. It was caused by the presence of Polish characters.
It must also be noted that the presence of a non-ASCII character in the header field can be treated as an anomaly in itself. It occurs, however, sporadically in both malicious and benign traffic. It can be considered an indicator of anomalous traffic, but in the presented form, it cannot be used as a general rule to distinguish malware and browser HTTP traffic.
5.2.2. Host Header. The obtained Host header values in the malicious traffic are presented in Table 13. The domain is present as the main value in most of the malware categories. However, for some categories, the IP address value is also noticeable; these include the Ransomware, Spambot, Clicker, and Miner categories. Some categories also have request groups with multitype values; for example, Maldoc has the highest percentage among all categories, i.e., 31.25%.
The analysis of the actual values in the requests with value types defined as Error in domain and Other was also performed. The results indicate that the domain names contained suffixes such as .bit or .xn--p1ai, additional "." characters (.com.), or were malformed, for example, 7M5 us or 5t9AR us. Also, a malformed IP address and port pair was identified (5.141. .43:13404).
The results of the feature analysis for the benign browser traffic show that in the majority of requests, a domain is present in the Host header value, regardless of the browser and OS used. In all browsers, such a value was present in at least 99. % of requests. Other value types include the IP address (maximum value of 0.07% for the Chrome browser on Windows 7), domain and port (maximum value of 0.1% in the case of the Firefox browser on Windows 8.1), and IP address and port. However, the latter ones are negligible for all browsers.
The comparison of the value types in the Host header shows that values other than the domain name are more frequently spotted in the malware traffic. This means that this feature can be used in some cases to discern malware and browser traffic. However, the feature is strongly related to the infrastructure used by cybercriminals. Intuition and malware analysis experience suggest that attackers do use some nonstandard addresses for C&C servers. The results show that this does not happen as often as might be expected. This research does not analyze the purpose of sending particular requests; nevertheless, some of them are sent by malware to benign addresses, for example, as a connectivity check. This could impact the obtained results.
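The value-type classification used in Table 13 could be approximated with a few simple rules, as in the hedged sketch below; the regular expressions are simplified assumptions and do not reproduce the exact criteria used in the study.

import re

IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
DOMAIN_RE = re.compile(r"^(?!-)[a-z0-9-]{1,63}(\.[a-z0-9-]{1,63})+$", re.I)

def classify_host(value: str) -> str:
    """Roughly classify a Host header value (domain / IP / with port / other)."""
    host, _, port = value.partition(":")
    kind = "IP" if IP_RE.match(host) else "domain" if DOMAIN_RE.match(host) else "other"
    return f"{kind}+port" if port else kind

# Illustrative inputs; the last two mimic the malformed values mentioned above.
for v in ("evil-domain.example", "198.51.100.7", "198.51.100.7:13404", "7M5 us"):
    print(v, "->", classify_host(v))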
5.2.3. User-Agent Header. The analysis of the User-Agent header strings was performed for both traffic types. Some typical values observed in the browser traffic are presented in Table 14.
Table 15 presents the distribution of the User-Agent value types in the browser traffic. The values were analyzed based on their similarity to the standard values (presented in Table 14) or the lack of the User-Agent header. All browsers used standard values in at least 91% of the requests. Internet Explorer on Windows 7 and Windows 8.1 used, in about 1% of the requests, values similar to the standard ones but slightly expanded in some fields, e.g., Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C).
Also, these two browsers exhibited a higher percentage of values not similar to the standard User-Agent strings (7.35% and 6.37%, respectively). Some significant results were also noted for Microsoft Edge, Mozilla Firefox on Windows 7, and Google Chrome on Windows 8.1. For all Microsoft browsers (Internet Explorer 11 on Windows 7 and 8.1 and Edge), the main part of the requests with nonstandard values consists of requests sent by modules responsible for downloading certificate revocation lists [34], reporting Microsoft-CryptoAPI. This mechanism is utilized by browsers to download the current sets of revoked X.509 certificates used in HTTPS communication. Other User-Agent header values were also present, for example, Microsoft BITS or Microsoft-WNS, which can be used by OS mechanisms like Windows Update or Windows Push Notification Services from Windows 8 onward. The usage of the Microsoft BITS User-Agent found in the Chrome browser HTTP traffic (both for Windows 7 and 8.1) can be attributed to the update mechanism of this browser.
In four browsers, requests without the User-Agent string were present. Overall, this applied to less than 0.1% of requests. These requests were sent by system mechanisms or generated as part of web page activity.
In the end, the authors decided to leave such requests
in the dataset as it is not known with certainty which
requests were sent by the browsers and which were not.
The value of the User-Agent header could be misleading, or
assumption about the system service could be erroneous.
Additionally, the number of nonbrowser requests is not
high and thus should not introduce much noise into the
results.
For the User-Agent header, the analysis performed on the malicious HTTP traffic uncovered 621 unique values. These values were further analyzed in order to establish whether they match well-known browser values, as malware developers typically try to mimic the behavior of benign software. The results of this analysis are presented in Figure 8, grouped by the popular browser names, the lack of the User-Agent header, and nonstandard values. If the requests carry many different values from any of these classification groups, they are classified into the Misc group.
Of the 4 browsers indicated in Figure 8, most of the requests include the Internet Explorer or the Firefox User-Agent string. Also, only 4 categories (Backdoor, Bruteforce, Downloader/JS, and Spambot) do not have requests without the User-Agent header. Some categories have a high percentage of User-Agent values other than those 4 standard ones, e.g., Miner, Other, PUA/Adware, RAT, or Stealer.
Table 13: Host header values (malicious traffic). Values are given as % of all request groups in the category.
Category | Domain | IP | IP + port | Domain + port | Error in domain | Other | Multi
Backdoor | 72.73 | 9.09 | 0.00 | 0.00 | 0.00 | 0.00 | 18.18
Banker | 70.40 | 9.60 | 8.00 | 0.00 | 6.40 | 0.00 | 5.60
Bruteforce | 87.50 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 12.50
Clicker | 69.23 | 23.08 | 0.00 | 0.00 | 0.00 | 0.00 | 7.69
DDoS | 100.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Downloader | 76.87 | 6.72 | 2.24 | 0.75 | 2.24 | 0.75 | 10.45
Downloader/JS | 100.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
IP check | 100.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Keylogger | 66.67 | 16.67 | 0.00 | 0.00 | 0.00 | 0.00 | 16.67
Maldoc | 68.75 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 31.25
Malicious download | 80.00 | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00
Miner | 72.22 | 22.22 | 0.00 | 5.56 | 0.00 | 0.00 | 0.00
Other | 75.00 | 0.00 | 25.00 | 0.00 | 0.00 | 0.00 | 0.00
PUA/Adware | 100.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Ransomware | 40.00 | 52.94 | 1.18 | 0.00 | 0.00 | 0.00 | 5.88
RAT | 77.78 | 11.11 | 0.00 | 0.00 | 11.11 | 0.00 | 0.00
Spambot | 40.00 | 60.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Stealer | 68.89 | 13.33 | 4.44 | 0.00 | 0.00 | 0.00 | 13.33
Trojan | 65.81 | 17.09 | 3.42 | 0.85 | 0.00 | 0.00 | 12.82
UA problem | 80.77 | 11.54 | 0.00 | 0.00 | 0.00 | 0.00 | 7.69
Table 14: The standard values of the User-Agent header (benign traffic).
Edge Win10: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063
Chrome Win7: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36
Firefox-FP Win7: Mozilla/5.0 (Windows NT 6.1; rv:51.0) Gecko/20100101 Firefox/51.0
Firefox Win7: Mozilla/5.0 (Windows NT 6.1; rv:51.0) Gecko/20100101 Firefox/51.0
IE11 Win7: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko
Chrome Win8.1: Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Firefox Win8.1: Mozilla/5.0 (Windows NT 6.3; rv:56.0) Gecko/20100101 Firefox/56.0
IE11 Win8.1: Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko
Table 15: The distribution of the User-Agent header value types (benign traffic). Values are the percentage of all requests.
Browser | Main UA | Similar to main UA | No UA | Other
Edge Win10 | 96.81 | 0.00 | 0.11 | 3.08
Chrome Win7 | 99.19 | 0.00 | 0.00 | 0.81
Firefox-FP Win7 | 99.84 | 0.00 | 0.00 | 0.16
Firefox Win7 | 98.12 | 0.00 | 0.00 | 1.88
IE11 Win7 | 91.53 | 1.11 | 0.01 | 7.35
Chrome Win8.1 | 97.91 | 0.00 | 0.00 | 2.09
Firefox Win8.1 | 99.77 | 0.00 | 0.00 | 0.23
IE11 Win8.1 | 91.99 | 1.63 | 0.01 | 6.37
Nonstandard User-Agent header values discovered in the malicious HTTP traffic reveal the software modules/libraries used, such as LuaSocket .0. , AutoIt, or python-requests/ .1 .4. Some others are meaningless values, e.g., W1pbbA(( or EMSFRTCBVD. Others carry system information, presented below as a regular expression or obfuscated for privacy purposes: regexp: ^{[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{10}}$, C:\Users\[user's name]\AppData\Roaming\ v o5g0Ie5itemp.zip. Finally, some of them directly present their name and purpose, e.g., TrickLoader or Botnet by Danij.
It must also be noted that additional examination should be applied to the requests without the User-Agent header. They are rarely found in the browser HTTP traffic, but they are present in the majority of malware categories: from 3. 5% of the request groups of the UA problem category up to 42.22% for the Stealer category. The authors believe that the lack of the User-Agent header can be used to distinguish between malicious and benign HTTP traffic.
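A simple way to operationalize these observations is sketched below: the function reports a missing User-Agent header or a value that does not resemble the standard browser strings from Table 14. The coarse patterns are illustrative assumptions rather than the exact matching used in the study.

import re

# A few coarse patterns for well-known Windows browser UA strings (illustrative only).
KNOWN_UA = [re.compile(p) for p in (
    r"Mozilla/5\.0 \(Windows NT [\d.]+.*\) AppleWebKit/.* Chrome/.* Safari/",
    r"Mozilla/5\.0 \(Windows NT [\d.]+; rv:[\d.]+\) Gecko/\d+ Firefox/",
    r"Mozilla/5\.0 \(Windows NT [\d.]+; Trident/7\.0; rv:11\.0\) like Gecko",
)]

def user_agent_verdict(headers: dict) -> str:
    ua = next((v for k, v in headers.items() if k.lower() == "user-agent"), None)
    if ua is None:
        return "missing User-Agent (anomalous)"
    if any(p.search(ua) for p in KNOWN_UA):
        return "looks like a standard browser UA"
    return "nonstandard User-Agent (inspect further)"

print(user_agent_verdict({"Host": "example.com"}))
print(user_agent_verdict({"User-Agent": "TrickLoader"}))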
5.3. HTTP Request Payload Features. The analysis of the payload data length did not give any significant results for distinguishing between malware and benign traffic. However, the authors believe it should be analyzed with a more specialized approach than the one used in this study, which could yield more specific results.
5.3.1. Payload Entropy. In Table 16, the results of the analysis of the request payload entropy for malicious traffic are presented. As was the case with the payload length, the statistics are based on the values of the payload entropy inside malware categories, but the values are not organized into request groups. The investigation of the same feature in the browser traffic is presented in Table 17.
The comparison of the obtained results demonstrates that many malware families achieved higher levels of payload entropy. When comparing the median, mean, quartile, and maximum values, 11 out of 15 malware categories have higher mean and median values of the payload entropy than the benign HTTP requests. The maximum value of the browser traffic payload entropy is 6.13 bits (in the Chrome browser on Windows 8.1), whereas in the malicious traffic, in 4 categories (i.e., Downloader, PUA/Adware, Spambot, and Trojan), the value almost reached the highest possible value of 8 bits. The minimum values of entropy of 1.0 bits in both malware and browser traffic are caused by requests with a very small payload size (1-2 bytes).
Figure 9 allows visual comparison of malware categories
and browser traffic, and it presents the boxplot diagram of
the corresponding payload entropy. Categories such as
Bruteforce, DDoS, Keylogger, RAT, and UA problem almost
overlap with the benign dataset entropy range when ana-
lyzing their interquartile range. For Ransomware and PUA/
Adware categories, the median value is slightly smaller than
Figure 8: Browser string as reported in the User-Agent header (malicious traffic); percentage of request groups per malware category reporting Chrome, Firefox, IE, Opera, another value, miscellaneous values, or no User-Agent header.
for the benign traffic. Both categories however have many
outliers. Backdoor, Clicker, and Miner categories have me-
dian values higher than those for the browser traffic. They
also have outlying values in the interquartile range of benign
traffic. Stealer and Trojan categories have higher median
values than browser HTTP traffic, but still, the interquartile
Table 17: The payload entropy statistics for HTTP requests (benign traffic) in bits.
Browser Median Mean 1st quartile 3rd quartile Min value Max value
Edge Win10 4. 4 4.74 4.62 4.99 2.73 5.72
Chrome Win7 4. 1 4.71 4.42 5.10 1.00 5.93
Firefox-FP Win7 4.72 4.30 3.24 4.92 1.00 5.76
Firefox Win7 4. 2 4.72 4.43 5.16 1.00 5.75
IE11 Win7 4. 0 4.72 4.43 5.09 2.50 5.94
Chrome Win8.1 4.74 4.39 3.24 4.96 1.00 6.13
Firefox Win8.1 4. 3 4.70 4.53 5.02 1.00 5.7
IE11 Win8.1 4. 2 4.76 4.59 4.9 2.50 5.72
Figure 9: Boxplot diagrams of the payload entropy of malicious and benign HTTP traffic (mean entropy in bits, per browser and malware category).
Table 16: The payload entropy statistics for HTTP requests (malicious traffic) in bits.
Category Median Mean 1st quartile 3rd quartile Min value Max value
Backdoor 5. 6 5. 5 5. 6 5. 6 4.0 5.96
Banker 5.42 6.11 5.42 7.35 1.00 7. 6
Bruteforce 4.27 4.32 4.27 4.35 4.24 7.63
Clicker 5.91 5. 5. 9 5.92 4.29 5.96
DDoS 4.31 4.31 4.31 4.31 4.31 4.31
Downloader 6.15 6.41 5.06 7.76 3. .00
Keylogger 4.96 5.05 4.96 5.14 1. 2 7.63
Miner 5.52 5.47 5.51 5.53 3.60 5.99
PUA/Adware 4.21 4.45 4.21 4.21 4.21 7.95
Ransomware 4.44 4.4 4.39 4.4 3.51 7.53
RAT 4.99 4.97 4.7 5.00 4.45 5. 9
Spambot 7.10 6. 4 6.77 7.16 3. 5 .00
Stealer 6.00 5.24 4.29 6.03 4.11 6.6
Trojan 5. 1 5.39 4.36 5. 2 1.00 7.99
UA problem 4.9 5.04 4. 7 5.13 4.65 5.63
Note. The statistics were counted using all requests in the particular malware category, without being organized into request groups.
range partially overlaps with those of the browser traffic. Categories such as Banker, Downloader, and Spambot visually exhibit a different distribution of values than the other categories, and their median values are higher than those of the browser traffic.
Overall, the obtained results show that for many malware categories, the payload entropy is higher than for the browser HTTP traffic. From this perspective, the value of the payload entropy can be used as a feature to differentiate between malicious and benign traffic. It should also be noted that the above conclusions are made in a general manner and that some particular malware families can exhibit different behaviors.
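For reference, payload entropy as discussed here is the Shannon entropy of the payload bytes, which ranges from 0 to 8 bits; a minimal sketch of its computation is given below (the example payloads are invented for illustration).

import math
import os
from collections import Counter

def payload_entropy(data: bytes) -> float:
    """Shannon entropy of a byte sequence, in bits per byte (0-8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Plain form-encoded data tends to stay around 4 bits; random (encrypted-looking)
# bytes approach the 8-bit maximum.
print(round(payload_entropy(b"id=12345&cmd=ping&ver=1.0"), 2))
print(round(payload_entropy(os.urandom(4096)), 2))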
5.3.2. Non-ASCII Characters in Payload. Figure 10 shows the presence of non-ASCII characters in the payload of malicious HTTP traffic. For the Banker, Downloader, and Spambot categories, more than 50% of request groups contained non-ASCII characters in the payload data. Additionally, non-ASCII characters were also present in the Bruteforce, Keylogger, PUA/Adware, Ransomware, and Trojan categories, but in less than 40% of request groups.
It must be noted that non-ASCII characters are rarely seen in benign browser HTTP request payloads, but they are present in the traffic of all browsers. Only for the Firefox and Chrome browsers on Windows 7 are the numbers higher than 1% (3.40% and 1.26%, respectively). After performing a manual analysis of these requests, it turned out that these were either part of a JSON document with Chinese UTF-encoded characters or part of the data sent to the URL http://sqm.microsoft.com/sqm/vstudio/sqmserver.dll.
To summarize, this feature should be considered useful for identifying malicious HTTP traffic.
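Detecting this feature is straightforward; the sketch below simply checks whether any payload byte falls outside the 7-bit ASCII range (the example payloads are invented).

def has_non_ascii(payload: bytes) -> bool:
    """True if any byte falls outside the 7-bit ASCII range (0x00-0x7F)."""
    return any(b > 0x7F for b in payload)

print(has_non_ascii(b"user=admin&id=1234"))            # False: plain ASCII form data
print(has_non_ascii(bytes([0x02, 0x9C, 0xE7, 0x41])))  # True: binary/obfuscated bytes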
5.3.3. Methods of Requests with Payload. Request methods with payload data in the malware traffic were also analyzed. The majority of malware categories used POST requests to send data. However, some malware categories used requests other than POST. In the case of Clicker and Spambot, GET requests contained payload in 25.00% and 46.67% of request groups, respectively. In the remaining 6 categories, mixed values were present: GET and POST requests could both appear in request groups, and they were also mixed with requests without payload.
An analogous analysis was performed on the benign browser traffic. It turned out that all requests with payload used the POST method.
Therefore, the comparison of the results for both types of traffic leads to the conclusion that in some cases, this feature can be used to distinguish between these two traffic types. Note that RFC 7231 does not prohibit sending payload data in GET requests; however, as can be seen from the obtained results, browsers usually perform such operations using the POST method.
5.3.4. Presence of Referer Header in POST Requests.
Finally, the presence of the Referer header in POST requests of malicious HTTP traffic was investigated.
The obtained results show that most malware categories sent POST requests without the Referer header. However, in the Ransomware category, this header was present in almost 55% of request groups. In only 2 other categories (i.e., Stealer and Bruteforce) was Referer spotted in more than 10% of request groups (27.7 % and 14.29%, respectively).
POST requests constitute a small fraction of all browser requests in the analyzed traffic, i.e., less than 1.5% depending on the browser. For 2 browsers, the Referer header is present in every POST request (Internet Explorer 11 and Firefox with Flash Player, both on Windows 7). For the remaining browsers, this header is missing in at most 3.17% of all POST requests.
Based on the comparison of the obtained results for malware and browser traffic, it can be concluded that the absence of the Referer header in POST requests can be a promising feature to distinguish malicious and benign HTTP traffic.
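A minimal check for this feature, optionally combined with the HTTP/1.0 condition that is later encoded as a Suricata rule (see Figure 11), could look as follows; the function and its inputs are illustrative assumptions, not the study's tooling.

def post_referer_findings(method: str, version: str, headers: dict) -> list:
    """Report a POST request lacking a Referer header, and the HTTP/1.0 combination."""
    names = {name.lower() for name in headers}
    findings = []
    if method.upper() == "POST" and "referer" not in names:
        findings.append("POST without Referer")
        if version == "HTTP/1.0":
            findings.append("combined with HTTP/1.0")
    return findings

print(post_referer_findings("POST", "HTTP/1.0",
                            {"Host": "198.51.100.7", "Content-Length": "64"}))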
5.4. Comparison of Results with the Related Work. In this
section, the obtained results are compared with those re-
ported in the previously discussed papers (see Section 2). In
the remainder of this section, only sources which explored
features of HTTP network traffic are taken into account, as
these can be directly compared to this work.
In two sources (Rossow et al. [11] and Nelson [12]), the number of unique header names in malware traffic was reported. In the first source, it was 144 headers, and in the second one, it was 24, whereas in this analysis, it was 42. Also, as reported by Rossow et al. [11], the User-Agent header was present in 9 .6% of samples. The present paper revealed that, depending on the malware category, it was present in at least 57.7 % of cases, but in many cases the percentage was much higher. Rossow et al. observed the Accept-Language header in 44.3% of samples; in this analysis, this header was hardly present in requests at all. Nelson [12] also analyzed the Accept-Language header along with Content-Type, considering the values of these headers as helpful for identifying malware. In this paper, their values were not analyzed extensively, but it was observed that their presence (or lack thereof) can be used as a distinctive feature to spot malware traffic.
Calzarossa and Massari in [13] analyzed benign HTTP network traffic. The authors observed the HTTP/1.0 version of the protocol in 4% of requests, whereas in the present research, it was not observed at all or was present in less than 0.1% of requests. Also, they reported about 60 unique header names, with between 0 and 14 headers per request. In the present analysis, there were about 90 unique header names, with 0 to 24 headers in a single request. The most popular headers were similar in both analyses, apart from the From header, which was not present in the dataset studied here. This can be explained by the nature of the analyzed network traffic: the From header is popular in requests sent by robot HTTP clients, and such traffic was present in their analysis but not in the traffic analyzed here.
In Section 2, nonacademic sources were also reviewed. Montoro in his presentation [14] described a set of request features which can be used for malware detection. Some
of them were also identified here as distinctive for malware network traffic. These include the lack of popular request headers, the protocol version, and the number of headers in a request. The author observed that malware sometimes does not include the User-Agent header; this was also observed in the present research. Additionally, he identified that malware sometimes sends fewer than 4 headers, while benign applications usually send more than 9. In the present analysis, a similar dichotomy was observed; i.e., malware tends to send fewer headers per request than browsers. However, the numbers were different, as malware tends to send at most 6 headers and browsers between 7 and 9.
The analysis included in the community modules of the Cuckoo sandbox system (https://cuckoosandbox.org/) was also explored in this work. The features of HTTP traffic analyzed by the network_cnc_http module were also identified here, proving their usefulness. The lack of the Referer header in malware POST requests was observed, as well as the lack of the User-Agent header in POST and GET requests. Also, it was identified that the HTTP/1.0 version of the protocol is often seen in malware requests. Regarding the module's feature of an IP address in the Host header value, it was observed that values other than the domain name are hardly seen (less than 1% of requests) in the browser traffic. The IP address is not as popular a Host value as the domain in the malicious dataset; however, it is still more frequent than in the browser traffic. As the value of the Host header depends on the infrastructure of the attacker, it can be used as a potential indicator in malicious traffic identification.
Lewis presented in [17] observations about HTTP headers sent by malware. The present analysis has confirmed the author's finding that malware sometimes uses nonstandard values of User-Agent. However, the experiments performed here did not find frequent typographic errors in header names and values. For example, the only features regarding whitespace characters which gave any results were the space present before CRLF (present in the network traffic of 5 malware categories, with only 2 categories exceeding 10%) and the space present before a comma (2.35% of request groups in the Ransomware category). Additionally, the double space was not present in the malware traffic, but it was observed in the browser traffic (below 0.1% of requests).
The differences in results between this analysis and the reviewed papers can be explained as follows. Firstly, the datasets analyzed in the reviewed sources and in this analysis are not uniform; that is, the uniformity of the represented malware families and samples is not guaranteed. Secondly, the reviewed work is older than this analysis, and some malware families' behavior could have already changed, for example, to further avoid detection. This shows that the analysis of malware network behavior should be performed regularly, especially when observed against the behavior of benign software. In this case, continuous monitoring can significantly increase the chance of detecting evolving threats.
6. Application of the Conducted Research
Until this point, HTTP protocol requests were analyzed in order to identify features which could be helpful in distinguishing between malicious (malware) and benign
Figure 10: Non-ASCII characters in the payload (malicious traffic); percentage of request groups per malware category whose payloads contain only ASCII characters, contain non-ASCII characters, or fall into the other group.
(browser) HTTP traffic. In this section, the obtained results
will be summarized to provide more practical and opera-
tional information and insights.
6.1. Practical Observations. In the previous analyses, a number of features indicated significant differences between the malware and browser traffic. These features are summarized in Table 18. Features marked with an asterisk were proposed by the authors at the beginning of this paper as worth analyzing. Please note that the results for the destination port other than 80 are limited; however, they are analyzed against the values of the Host header.
The values for the number of headers were chosen according to the analyses conducted in Section 5.1.2, which showed that fewer than 4 headers were present in only a small number of browser requests. Also, the analyses indicated that a boundary of 6 headers per request can be chosen as a value that separates the majority of malicious and benign HTTP requests.
High payload entropy is defined as greater than 6.13 bits. This specific value was chosen as the maximum entropy value observed in the browser traffic (Section 5.3.1). Such a definition is also supported from a practical perspective by the popular network tool CyberChef (https://gchq.github.io/CyberChef/#recipe=Entropy()), whose authors state that the entropy of English text usually lies between 3.50 and 5 bits. Also, an analysis of the Zeus botnet by Al-Bataineh and White [35] showed payload entropy higher than 6.5 bits, which is similar to the value proposed in this paper.
The features presented in Table 18 were further analyzed to determine how often their pairs co-occur. The results of this analysis provide some practical observations, which can be used as indicators in manual malware analysis or treated as an entry point for further analyses (a simple scoring sketch combining them follows this list). The most important observations from the co-occurrence analysis are that, in a significant number of malicious HTTP requests:
(i) A low number of headers occurs with the lack of
the User-Agent header
(ii) Requests with the high entropy payload do not
have a domain in the Host header value or the
requests use the POST method without the Referer
header
(iii) When the request is sent to a port other than 80, the User-Agent header value is different from the standard ones
(iv) When the GET request has payload, it also has a
low number of headers or the entropy of payload is
high or the payload contains non-ASCII characters
(v) The POST requests without the Referer header also
have non-ASCII characters in payload
(vi) Requests without the Accept header also lack Ac-
cept-Encoding or Accept-Language headers
(vii) Requests without the Connection header also lack
Accept,Accept-Encoding, or Accept-Language
headers
(viii) When the request is sent to a port other than 80, the Host header value is not a domain
(ix) With 1.0 version of the protocol, POST requests do
not contain the Referer header
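To make these co-occurrence observations concrete, the toy scorer below counts how many of the Table 18 features a single request exhibits; the one-point-per-feature weighting and the request representation are assumptions made only for this illustration, and the thresholds (fewer than 4 headers, entropy above 6.13 bits) follow the values discussed above.

def anomaly_score(req: dict) -> int:
    """Count how many of the anomalous features from Table 18 a request exhibits."""
    names = {h.lower() for h in req.get("headers", {})}
    checks = [
        req.get("version") == "HTTP/1.0",
        len(names) <= 3,
        req.get("payload_entropy", 0.0) > 6.13,
        "user-agent" not in names,
        req.get("dst_port", 80) != 80,
        req.get("method") == "POST" and "referer" not in names,
        req.get("method") == "GET" and req.get("payload_len", 0) > 0,
    ]
    return sum(checks)

suspicious = {"version": "HTTP/1.0", "method": "POST", "dst_port": 8085,
              "headers": {"Host": "198.51.100.7", "Content-Length": "64"},
              "payload_entropy": 7.4, "payload_len": 64}
print(anomaly_score(suspicious))  # 6 of the listed features co-occur in this example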
6.2. Practical Usage Scenarios. HTTP request features presented in previous sections can be practically applied in
multiple scenarios. The main ones are outlined below.
Firstly, some of the features, especially those presented in
Section 6.1, can be used directly to identify suspicious re-
quests. The term suspicious is used intentionally because the
presence or lack of some of these features cannot be treated
as an unambiguous indicator of the request maliciousness.
Nevertheless, multiple usage scenarios can be presented.
One of them is a manual inspection of the HTTP traffic
during malware sample analysis. Also, the observations
presented in this paper can be incorporated as rules in
network security monitoring systems. Examples of such
rules are presented in Figures 11 and 12. Such rules were
used with success in the malware analysis laboratory of
CERT Polska.
Secondly, all presented features can be used to create an
application fingerprinting system. Such a system can create a
unique identifier by extracting and investigating particular
features of the HTTP traffic. The identifier can be attributed
to the particular application and afterwards used as a pattern
to recognize such application’s network traffic. Finger-
printing systems are used for some protocols, for example,
for the TLS with the JA3 system (https://github.com/
salesforce/ja3) or HTTP with p0f (http://lcamtuf.
coredump.cx/p0f3/). The latter is not actively developed;
however, it can be used as an inspiration. The HTTP request
fingerprinting system can be used to identify particular
malware families but potentially can also help to reveal
information about the nature or the purpose of the inspected
requests. For example, it can provide information whether
request was a C&C server beacon or a connectivity check.
Also, in the strictly controlled environments, a list of allowed
application fingerprints can be used, and if a fingerprint
previously not encountered is detected, the system can raise
an alarm. Observations presented in this paper were used to
create a prototype of the HTTP analysis and fingerprinting
module for the Long-Term Sandboxing subsystem in the
Horizon 2020 SISSDEN Project [36]. The system helped in
the observation of malware behavior, for example, by
providing a means for identification of malware operations
such as connecting to the C&C server or connectivity checks.
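As an illustration of the fingerprinting idea, the sketch below hashes a few request features (method, protocol version, and the ordered header names) into a JA3-style identifier; it is only a conceptual example and does not reproduce the SISSDEN module or the JA3 algorithm itself.

import hashlib

def http_fingerprint(method: str, version: str, headers: list) -> str:
    """Build a compact identifier from request features for application matching."""
    header_names = ",".join(name.lower() for name, _ in headers)
    material = "|".join((method.upper(), version, header_names))
    return hashlib.md5(material.encode()).hexdigest()

print(http_fingerprint("GET", "HTTP/1.1",
                       [("Host", "example.com"), ("User-Agent", "Mozilla/5.0"),
                        ("Accept", "*/*")]))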
Thirdly, the presented analyses identified HTTP request features which make it possible to distinguish malicious and benign HTTP traffic. Such features can be conveniently used to
create a malware detection system which utilizes them to
provide information about maliciousness of the HTTP re-
quests. As a result, information on whether infected hosts
are present in the monitored network can be provided.
7. Limitations of the Work
Even after carefully designed and performed analyses, some
limitations of this research were identified. They are dis-
cussed below.
First of all, the presented generalized observations can be
applied only to the analyzed malware samples. Other mal-
ware families or even other samples of the presented families
can behave in a different way. The authors believe, however,
that most of the identified HTTP request features can be
generalized to other malicious software representatives as
many of them capture differences inherent to the general
malware behavior.
Secondly, every work aimed at describing general behavior directly depends on the quality of the analyzed data, especially on how well it represents and identifies actual malware behavior. Much effort was put into providing high-quality and relevant data. Nevertheless, the quality of the malware execution process in the sandbox systems cannot be guaranteed, as these were not designed and maintained by the authors of this paper. It must be noted that some malware families are able to detect that they are being analyzed in a virtual environment, which triggers different behavior or even termination of operations [37, 38]. As such, it could alter the analysis results. Yet, checking for the presence of such antianalysis techniques would be a broad task and hard to perform using only pcap files, without knowledge of the machine-level behavior. In this situation, it was assumed that network traffic alerted by the IDS would represent actual behavior and that the lack of such traffic would result in the termination of the analysis of a particular malware sample.
Thirdly, the analysis environment has an impact on the
obtained results. The authors of this paper used industry-
proven IDS rule sets as the ground truth to conduct the
process of malware request detection and identification of
their family names. The authors believe that it is a high-quality
source of information, but as with every detection system, it is
possible that some HTTP requests were detected mistakenly
or were not detected at all. The detection of such cases would
have required additional detection systems, which would
have, in turn, introduced additional complexity of the system
and would be partially in conflict with the rule sets’ licenses.
Figure 11: An example of a Suricata IDS rule based on the presented observations. The rule detects POST requests using HTTP/1.0 without the Referer header.
Table 18: Features indicating significant differences between malware and browser traffic.
Name of the feature
HTTP/1.0 version of the protocol
0-3 headers
High entropy of the payload
Lack of the User-Agent header
Nonstandard value of the User-Agent header
Non-ASCII characters in payload
Presence of a POST request without the Referer header
Presence of a GET request with payload
Host header value other than a domain
Destination port other than 80
Lack of any of the Accept, Accept-Encoding, Accept-Language, Referer, or Connection headers
Features marked with * (an asterisk) were proposed originally by the authors at the beginning of this paper.
Figure 12: An example of a Zeek network security monitoring analysis module based on the presented observations. The rule detects requests with fewer than 4 headers and without the User-Agent header.
Fourthly, malware request grouping and categorization could potentially introduce data bias. As discussed in Section 4.2, the authors introduced this approach to limit the impact of the different sizes of the malicious request sets. As an alternative approach, reducing the number of requests for some malware families was considered. However, the authors did not want to lose too much information about malicious software behavior contained in the omitted requests. Also, while reducing the number of requests was feasible, adding requests to some families to equalize the numbers was not; thus, such a reduction would also introduce data bias. From this perspective, the authors believe that their approach was more adequate, despite some residual data bias.
Fifthly, the identified features are based on the observed behavioral differences between malware and browser network traffic. When malware is equipped with mimicking mechanisms, that is, its behavior is deliberately changed to imitate a browser, the features will deteriorate with regard to the distinction of malicious and benign traffic, depending on which feature is changed. The identified features have different levels of technical difficulty for introducing mimicking behavior, some of them being deeply connected with the C&C protocol design, as with the payload entropy higher than 6 bits. Changing such behavior would require, for example, abandoning payload obfuscation or using another protocol for data exchange. The authors of this paper believe that in the majority of cases, simple mimicry techniques would be used, such as changing the value of the User-Agent header or adding an additional header, which still could not counter all identified features. In the authors' opinion, even in an extreme scenario of nearly perfect imitation of browser traffic, malware would still manifest some network traffic features which would differentiate it from a web browser, as these two types of software are designed to perform different tasks. However, such scenarios are out of the scope of this study.
Sixthly, in this paper the authors focused on the analysis of differences between malware and browser traffic from the perspective of network monitoring in a sandbox environment. This perspective could be changed to a centralized one, using logs from an actual proxy server as the data source, where the data would contain a more diverse set of HTTP clients. Such a change could have an impact on the results of the analysis.
Finally, malware can use the HTTPS protocol for communication; however, this study did not analyze such cases. As mentioned in the introduction, the HTTP protocol is more popular than HTTPS when used by malware, and thus the authors focused solely on it. Nevertheless, if HTTPS malware traffic is decrypted to the HTTP protocol (for example, in a sandbox environment), it can be analyzed using the identified features. It must be noted that the presented analysis did not utilize such decrypted network traffic; however, the authors believe it should be similar to nonencrypted traffic. In a scenario where HTTPS traffic cannot be decrypted, the findings of this paper cannot be applied, and analysts should refer to techniques designed for that protocol.
8. Conclusion and Future Work
This paper focuses on presenting extensive and systematic
analyses of the HTTP requests for malware- and browser-
generated traffic. Its main aim was to establish the most
promising distinctive features which can be used to identify
malicious requests. Several features have been designed
based on the previous works and own experience from
malware behavior analysis. Datasets of malware and
browser network traffic were analyzed using these features
to identify which can be utilized to distinguish between
malicious and benign HTTP traffic. The obtained results
indicate which features can be generally used to spot
anomalies understood as a deviation from the normal
behavior. It was identified that these features include the HTTP/1.0 protocol version, a number of headers smaller than 4, the lack of the Accept, Accept-Encoding, Accept-Language, Connection, and to some extent Referer headers, a payload entropy higher than 6 bits, the occurrence of non-ASCII characters in the payload, and the absence of the Referer header in POST requests.
A special category of features comprises those connected to the Host and User-Agent headers; because of their purpose, their values change frequently. Host header values other than domain names, such as IP addresses, are more often encountered in the malicious than in the benign HTTP traffic.
However, these values are strictly connected to the network
infrastructure used by criminals and as such should be used
in a controlled manner. In case of the User-Agent header, its
value presents an even more complicated matter. Certainly,
the lack of this header should be treated as an anomaly, but
an analysis of its values is a more demanding task. The results
in this paper indicate that many malware categories use well-
known values, similar to those sent by the browsers. Nev-
ertheless, a significant number of malicious software families
use values which were not recognized as popular. Inter-
estingly, many malware categories use predominantly one
User-Agent header value.
The other analyzed features did not yield any significant results; i.e., these features were not observed in the traffic at all or were too scarce to be treated as deviations from the typical browser traffic. Some of these features could be seen as anomalies on their own, even without comparing them with the browser traffic, because they break RFCs or standardization. Good examples are the lack of a colon in the header field, misspellings of the header name, requests sent to a TCP port other than the one registered for HTTP, non-ASCII-printable characters in the header value, a new line character other than CRLF, repetition of headers (applicable to the majority of them), or nonstandard whitespace characters in the header field (other than the space or horizontal tab).
The results presented in this paper showed that the nonacademic sources reviewed at the beginning of the analysis provide features which are helpful for distinguishing between malware and browser traffic. Only the category of typographic errors, presented by Lewis in [17], did not yield any results, as these errors were very rare or nonexistent in the analyzed datasets.
Some of the features need additional investigation. These include the payload data length and the values of some popular headers such as Accept, Accept-Encoding, Accept-Language, and Connection. The analysis of their values could be correlated with the inspection of the User-Agent value in such a way that the occurrence of a particular value in one header should imply a defined value in the other.
An additional issue is caused by GET requests with payload data. They were not present in the analyzed browser dataset, and they were only rarely seen in the malware dataset. RFC 7230 does not prohibit sending such requests, but experience suggests they should be monitored. Despite the low occurrence of these requests, the authors consider them an anomaly.
In order to search for suspicious requests, features and
anomalies identified in the course of this analysis can be
directly applied to the existing network monitoring systems,
such as IDSs or malware sandboxes. Also, with the use of the
presented results, it would be feasible to create a malware
detection system. Such a system could detect new malware
samples in which presented anomalies appear. Finally, this
work can be utilized to create a fingerprinting system which
can be used as an identification mechanism of similar
malware requests or as a source to create a whitelist of
known applications in the network. The authors plan to
explore these directions in their future work.
Data Availability
A part of the pcap files used in this study originates from the Malware Capture Facility Project (MCFP) and is publicly
available at https://www.stratosphereips.org/datasets-
malware. PCAP files from CERT Polska’s sandbox system
and web browser traffic have not been made publicly
available because of commercial confidentiality and privacy
reasons.
Conflicts of Interest
The authors declare that there are no conflicts of interest
regarding the publication of this article.
Acknowledgments
This research was partially supported by the EU Horizon
2020 program towards the Internet of Radio-Light project
(H2020-ICT 761992).
References
[1] M. Trevisan, D. Giordano, I. Drago, M. Mellia, and
M. Munafo, “Five years at the edge: watching Internet from
the ISP network,” in Proceedings of the 14th International
Conference on emerging Networking Experiments and Tech-
nologies, ACM, Heraklion, Greece, pp. 1–12, December 2018.
[2] P. Richter, N. Chatzis, G. Smaragdakis, A. Feldmann, and
W. Willinger, “Distilling the Internet’s application mix from
packet-sampled traffic,” in Passive and Active Measurement,
J. Mirkovic and Y. Liu, Eds., pp. 179–192, Springer Inter-
national Publishing, Cham, Switzerland, 2015.
[3] S. Miller and P. Smith, Rise of Legitimate Services for Backdoor
Command and Control, Anomali, Tech. Rep., 2017, https://www.
anomali.com/files/anomali-labs-reports/legit-services.pdf.
[4] S. Tkachenko, Stop Windows 10 Spying on You Using Just
Windows Firewall, https://winaero.com/blog/stop-windows-
10-spying-on-you-using-just-windows-firewall/, 2015.
[5] Block Windows Update with Firewall, 201 , https://www.reddit.
com/r/MoneroMining/comments/ l5wpt/block_windows
_update_with_firewall/.
[6] B. Duncan, 2019-03-15-Malspam Pushes Lokibot, http://
malware-traffic-analysis.net/2019/03/15/index2.html, 2019.
[7] P. Srokosz, Analysis of Emotet V4, https://www.cert.pl/en/
news/single/analysis-of-emotet-v4/, 2017.
[8] E. Brumaghin and H. Unterbrink, Picking Apart Remcos Botnet-In-A-Box, https://blog.talosintelligence.com/2018/08/picking-apart-remcos.html, 2018.
[9] R. Joven, New Stealth Worker Campaign Creates a Multi-
Platform Army of Brute Forcers, https://www.fortinet.com/blog/
threat-research/new-stealth-worker-campaign-creates-a-multi-
platform-army-of-bru.html, 2019.
[10] AV-TEST GmbH, AV-TEST Security Report 2018/2019, AV-TEST Institute, Tech. Rep., 2019, https://www.av-test.org/fileadmin/pdf/security_report/AV-TEST_Security_Report_2018-2019.pdf.
[11] C. Rossow, C. J. Dietrich, H. Bos et al., “Sandnet: network
traffic analysis of malicious software,” in Proceedings of the
First Workshop on Building Analysis Datasets and Gathering
Experience Returns for Security, ACM, New York, NY, USA,
pp. 7 – , 2011.
[12] A. Nelson, “Sandnet++-a framework for analysing and vis-
ualising network traffic from malware,” Information Security
Group, Royal Holloway University of London, Tech. Rep.,
2016.
[13] M. C. Calzarossa and L. Massari, "Analysis of header usage patterns of HTTP request messages," in Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014 IEEE 6th International Symposium on Cyberspace Safety and Security, 2014 IEEE 11th International Conference on Embedded Software and Systems (HPCC, CSS, ICESS), pp. 47– 53, IEEE, Paris, France, 2014.
[14] R. Montoro, “HTTP Header Hunter-Looking for malicious
behavior into your HTTP header traffic,” 2011, http://2011.
video.sector.ca/video/397 6962.
[15] “Cuckoo sandbox network CnC HTTP community module,”
2019, https://github.com/cuckoosandbox/community/blob/
master/modules/signatures/network/network_cnc_http.py.
[16] “Cuckoo sandbox multiple user-agents community module,”
https://github.com/cuckoosandbox/community/blob/master/
modules/signatures/windows/multiple_ua.py.
[17] T. Lewis, “HTTP header heuristics for malware detection,”
SANS Institute InfoSec Reading Room, Tech. Rep., 2013.
[18] S. Mizuno, M. Hatada, T. Mori, and S. Goto, "Botdetector: a robust and scalable approach toward detecting malware-infected devices," in Proceedings of the 2017 IEEE International Conference on Communications (ICC), IEEE, Paris, France, pp. 1–7, May 2017.
[19] Z. Li, L. Sun, Q. Yan, W. Srisa-an, and Z. Chen, “Droid-
classifier: efficient adaptive mining of application-layer header
for classifying android malware,” in Proceedings of the In-
ternational Conference on Security and Privacy in Commu-
nication Systems, Springer, Guangzhou, China, pp. 597–616,
October 2016.
[20] N. Kheir, “Behavioral classification and detection of malware
through HTTP user agent anomalies,” Journal of Information
Security and Applications, vol. 18, no. 1, pp. 2–13, 2013.
[21] K. Li, R. Chen, L. Gu, C. Liu, and J. Yin, “A method based on
statistical characteristics for detection malware requests in
network traffic," in Proceedings of the 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC), IEEE, Guangdong, China, pp. 527–532, June 2018.
[22] R. Perdisci, W. Lee, and N. Feamster, “Behavioral clustering of
HTTP-based malware and signature generation using mali-
cious network traces,” in Proceedings of the 7th USENIX
Symposium on Networked Systems Design and Implementa-
tion, p. 14, San Jose, CA, USA, 2010.
[23] M. Mimura and H. Tanaka, “Leaving all proxy server logs to
paragraph vector,” Journal of Information Processing, vol. 26,
pp. 04– 12, 2018.
[24] M. A. Nia, R. E. Atani, B. Fabian, and E. Babulak, “On
detecting unidentified network traffic using pattern-based
random walk,” Security and Communication Networks, vol. 9,
no. 16, pp. 3509–3526, 2016.
[25] R. Fielding, J. Gettys, J. Mogul et al., RFC 2616: Hypertext Transfer Protocol - HTTP/1.1, 1999.
[26] R. Fielding and J. Reschke, RFC 7230: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing, 2014.
[27] RFC 7231: Hypertext Transfer Protocol (HTTP/1.1): Se-
mantics and Content, 2014.
[28] RFC 7232: Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests, 2014.
[29] R. Fielding, Y. Lafon, and J. Reschke, RFC 7233: Hypertext Transfer Protocol (HTTP/1.1): Range Requests, 2014.
[30] R. Fielding, M. Nottingham, and J. Reschke, RFC 7234: Hypertext Transfer Protocol (HTTP/1.1): Caching, 2014.
[31] R. Fielding and J. Reschke, RFC 7235: Hypertext Transfer Protocol (HTTP/1.1): Authentication, 2014.
[32] T. Berners-Lee, R. Fielding, and H. Frystyk, RFC 1945: Hy-
pertext Transfer Protocol–HTTP/1.0, 1996.
[33] C. Linhart, A. Klein, R. Heled, and S. Orrin, HTTP Request
Smuggling, Watchfire Corporation, Tech. Rep., 2005.
[34] D. Cooper, S. Santesson, S. Farrell, S. Boeyen, R. Housley, and W. Polk, RFC 5280: Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile, 2008.
[35] A. Al-Bataineh and G. White, "Analysis and detection of malicious data exfiltration in web traffic," in Proceedings of the 2012 7th International Conference on Malicious and Unwanted Software (MALWARE), IEEE, October 2012.
[36] SISSDEN, Secure Information Sharing Sensor Delivery
Event Network (SISSDEN). Deliverable D5.3: Final Data
Analysis Results, https://sissden.eu/download/SISSDEN-D5.
3-Final_Data_Analysis_Results.pdf, 2019.
[37] R. R. Branco, G. N. Barbosa, and P. D. Neto, Scientific but not
Academical Overview of Malware Anti-Debugging, Anti-Dis-
assembly and Anti-VM Technologies, 2012.
[38] P. Chen, C. Huygens, L. Desmet, and W. Joosen, "Advanced
or not? a comparative study of the use of anti-debugging and
anti-VM techniques in generic and targeted malware,” in
Proceedings of the IFIP International Information Security and
Privacy Conference, Springer, Ghent, Belgium, pp. 323–336,
May 2016.