Hfinger is a tool for fingerprinting HTTP requests of malware. It is based on Tshark and written in Python 3. It is currently at the working prototype stage.

Its main purpose is to provide unique representations (fingerprints) of malware requests to help identify them. Unique means that each fingerprint is only seen in one specific malware family, but a family can have multiple fingerprints. Hfinger represents requests in a form that is shorter than printing the entire request, but still understandable to humans.
Hfinger can be used for manual malware analysis, but also in sandbox systems or SIEMs. The generated fingerprints can be used to group requests, pinpoint requests of a specific malware family, identify different operations of the same family, or discover unknown malicious requests that were missed by other security systems but share a known fingerprint.
An academic paper accompanies the research on the tool, describing, for example, the motivation behind the design choices and an evaluation of the tool against p0f, FATT, and Mercury.
The basic assumption of the project is that HTTP requests of different malware families are more or less unique, so they can be fingerprinted to provide some kind of identification. Hfinger retains information about the structure and values of certain headers to provide a means of further analysis, for example grouping similar requests. This part is still a work in progress.
After analyzing HTTP requests and headers of malware, we found that some parts of the request are the most distinctive. These include:
* Request method
* Protocol version
* Header order
* Values of common headers
* Payload length, entropy, and presence of non-ASCII characters

In addition, some standard characteristics of the request URL are taken into account. All these parts are transformed into a set of features, which are described in detail below.
The above features are translated into a variable-length representation, which is the actual fingerprint. Depending on the reporting mode, different features are used to fingerprint the request. More information about these modes is presented below. The feature selection process will be described in an upcoming academic paper.
Minimum requirements before installation:
* Python >= 3.3,
* Tshark >= 2.2.0.
Hfinger can be installed from PyPI:
pip install hfinger
Hfinger has been tested on Xubuntu 22.04 LTS with the tshark package in version 3.6.2, but should also work with older versions such as 2.6.10 on Xubuntu 18.04 or 3.2.3 on Xubuntu 20.04.
Note that, as with any PoC, you should run Hfinger in an isolated environment, at least in a Python virtual environment. Setting one up is not covered here, but you can try this tutorial.
Once installed, you can call the tool directly from the command line as hfinger, or as a Python module by calling python -m hfinger.
For example:
foo@bar:~$ hfinger -f /tmp/test.pcap
[{"epoch_time": "1614098832.205385000", "ip_src": "127.0.0.1", "ip_dst": "127.0.0.1", "port_src": "53664", "port_dst": "8080", "fingerprint": "2|3|1|php|0.6|PO|1|us-ag,ac,ac-en,ho,co,co-ty,co-le|us-ag:f452d7a9/ac:as-as/ac-en:id/co:Ke-Al/co-ty:te-pl|A|4|1.4"}]
Help can be displayed with the short -h or long --help switch:
usage: hfinger [-h] (-f FILE | -d DIR) [-o output_path] [-m {0,1,2,3,4}] [-v]
               [-l LOGFILE]

Hfinger - fingerprinting malware HTTP requests stored in pcap files

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Read a single pcap file
  -d DIR, --directory DIR
                        Read pcap files from the directory DIR
  -o output_path, --output-path output_path
                        Path to the output directory
  -m {0,1,2,3,4}, --mode {0,1,2,3,4}
                        Fingerprint report mode.
                        0 - similar number of collisions and fingerprints as mode 2, but using fewer features,
                        1 - representation of all designed features, but a little more collisions than modes 0, 2, and 4,
  -v, --verbose
  -l LOGFILE
                        Output logfile in the verbose mode. Implies -v or --verbose switch.

You must provide a path to a pcap file (-f) or a directory containing pcap files (-d). The output is in JSON format. It is printed to standard output or written to the directory provided with -o, using the name of the source file. For example, the output of the command:
hfinger -f example.pcap -o /tmp/pcap
will be saved to:
/tmp/pcap/example.pcap.json
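Because the output is a JSON list of records, it is easy to post-process. The following is a minimal sketch, not part of Hfinger, that groups request destinations by fingerprint. The record keys (ip_dst, port_dst, fingerprint) are taken from the example output shown earlier; the sample values and the helper name group_by_fingerprint are made up for illustration.

```python
import json
from collections import defaultdict

# Made-up sample in the shape of the Hfinger JSON output.
# The fingerprint strings are placeholders, not real fingerprints.
SAMPLE_OUTPUT = """
[
  {"ip_dst": "10.0.0.1", "port_dst": "8080", "fingerprint": "FINGERPRINT_A"},
  {"ip_dst": "10.0.0.2", "port_dst": "80", "fingerprint": "FINGERPRINT_A"},
  {"ip_dst": "10.0.0.3", "port_dst": "80", "fingerprint": "FINGERPRINT_B"}
]
"""


def group_by_fingerprint(records):
    """Map each fingerprint to the list of destinations that produced it."""
    groups = defaultdict(list)
    for record in records:
        destination = f'{record["ip_dst"]}:{record["port_dst"]}'
        groups[record["fingerprint"]].append(destination)
    return dict(groups)


records = json.loads(SAMPLE_OUTPUT)
groups = group_by_fingerprint(records)
for fingerprint, destinations in groups.items():
    print(fingerprint, destinations)
```

With real data, the records would be loaded from the generated file (for example /tmp/pcap/example.pcap.json) using json.load instead of the inline sample.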
The reporting mode switch, -m / --mode, changes the default reporting mode and takes an integer in the range 0-4. The modes differ in the request features they represent or in how values are rounded. The default mode (2) was chosen because it represents all features typically used during request analysis while keeping both the number of collisions and the number of generated fingerprints low. Other modes serve different goals. For example, mode 3 yields fewer generated fingerprints but a higher probability of collisions between malware families. If you are unsure, you do not need to change anything. For more information, see the description of reporting modes below.

Starting with version 0.2.1, Hfinger is less verbose by default. To receive information about encountered non-standard header values, non-ASCII characters in non-payload parts of requests, missing CRLF markers (\r\n\r\n), and other problems in the analyzed requests that are not application errors, use the -v / --verbose switch. In verbose mode, any such problems are printed to the standard error output. You can also use the -l / --log switch (it implies -v / --verbose) to save the log to a defined location. Log data is appended to the log file.
Starting with version 0.2.0, Hfinger supports importing into other Python applications. To use it in your application, just import the hfinger_analyze function from hfinger.analysis and call it with the path to the pcap file and the reporting mode. The returned result is a list of dictionaries containing the fingerprinting results.
For example:
from hfinger.analysis import hfinger_analyze

pcap_path = "SPECIFY_PCAP_PATH_HERE"
reporting_mode = 4
print(hfinger_analyze(pcap_path, reporting_mode))
Starting with version 0.2.1, Hfinger uses the logging module to log information about encountered non-standard header values, non-ASCII characters in non-payload parts of requests, missing CRLF markers (\r\n\r\n), and other problems in the analyzed requests that are not application errors. Hfinger creates its own logger with the name hfinger, but the log information is effectively discarded if the logger is not configured beforehand. If you want to receive this log information, then before calling hfinger_analyze you should configure the hfinger logger, set the log level to logging.INFO, configure a log handler according to your needs, and add it to the logger. More information is available in the hfinger_analyze function docstring.
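The logger setup described above can be sketched with the standard logging module as follows. Only the logger name hfinger and the logging.INFO level come from the description; the helper name configure_hfinger_logging, the choice of a stderr stream handler, and the message format are illustrative assumptions.

```python
import logging
import sys


def configure_hfinger_logging(stream=sys.stderr):
    """Attach a handler to the 'hfinger' logger so its messages are not discarded."""
    logger = logging.getLogger("hfinger")
    logger.setLevel(logging.INFO)

    # Assumption: write to a stream; a logging.FileHandler would work the same way.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger, handler


hfinger_logger, hfinger_handler = configure_hfinger_logging()
```

This would be run once, before the first call to hfinger_analyze, so that messages emitted during the analysis reach the handler.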
The fingerprint is based on features extracted from the request. Which features from the full list are used depends on the selected reporting mode (see the description of reporting modes below). The following image shows the creation of an example fingerprint in the default reporting mode.
Three parts of the request are analyzed to extract information: the URI, the structure of the headers (including the method and protocol version), and the payload. Individual features of the fingerprint are separated using | (pipe). The final fingerprint generated for the POST request in the example is:
2|3|1|php|0.6|PO|1|us-ag,ac,ac-en,ho,co,co-ty,co-le|us-ag:f452d7a9/ac:as-as/ac-en:id/co:Ke-Al/co-ty:te-pl|A|4|1.4
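As a rough illustration of this layout, the sketch below splits the example fingerprint on the pipe separator and labels each part. The field names are labels invented for this example, following the order of features listed for the default mode (2) in this document; they are not identifiers from the Hfinger code.

```python
# Hypothetical labels for the twelve features of the default mode (2),
# in the order they appear in the fingerprint.
MODE_2_FIELDS = [
    "uri_length", "directory_count", "avg_directory_length", "extension",
    "avg_value_length", "method", "protocol_version", "header_order",
    "popular_headers", "non_ascii", "payload_entropy", "payload_length",
]

FINGERPRINT = (
    "2|3|1|php|0.6|PO|1|us-ag,ac,ac-en,ho,co,co-ty,co-le|"
    "us-ag:f452d7a9/ac:as-as/ac-en:id/co:Ke-Al/co-ty:te-pl|A|4|1.4"
)


def split_fingerprint(fingerprint, fields=MODE_2_FIELDS):
    """Split a fingerprint on '|' and pair each part with a field label."""
    parts = fingerprint.split("|")
    if len(parts) != len(fields):
        raise ValueError(f"expected {len(fields)} parts, got {len(parts)}")
    return dict(zip(fields, parts))


features = split_fingerprint(FINGERPRINT)
print(features["method"])                      # encoded request method
print(features["header_order"].split(","))     # encoded header names in order
print(features["popular_headers"].split("/"))  # header:value pairs
```

Fingerprints produced in other reporting modes contain a different number of parts, so they would need their own list of labels.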
The creation of the features is described below in the order they appear in the fingerprint.
First, URI features are extracted:
* URI length, expressed as the base-10 logarithm of the length, rounded to an integer (the URI in the example is 43 characters long, so log10(43)≈2),
* Number of directories (there are 3 directories in the example),
* Average directory length, expressed as the base-10 logarithm of the actual average directory length, rounded to an integer (the three directories in the example have a total length of 20 characters (6 + 6 + 8), so log10(20/3)≈1),
* Extension of the requested file, but only if it is on the list of known extensions in hfinger/configs/extensions.txt,
* Average value length, expressed as the base-10 logarithm of the actual average value length, rounded to one decimal place (both values in the example are 4 characters long, so the average is 4 characters and log10(4)≈0.6).
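The arithmetic behind these URI features can be reproduced with a few lines of Python. This is only a sketch of the calculation using the numbers quoted for the example (a 43-character URI, directories of 6, 6, and 8 characters, two 4-character values); the helper name log_feature and the handling of empty input are assumptions, not Hfinger's actual code.

```python
import math


def log_feature(length, ndigits=None):
    """Base-10 logarithm of a length, rounded to an integer or to ndigits places.

    Assumption: a non-positive length maps to 0; Hfinger may treat it differently.
    """
    if length <= 0:
        return 0
    return round(math.log10(length), ndigits)


# Numbers quoted for the example request.
uri_length = 43
directory_lengths = [6, 6, 8]
value_lengths = [4, 4]

uri_length_feature = log_feature(uri_length)
directory_count = len(directory_lengths)
avg_directory_feature = log_feature(sum(directory_lengths) / len(directory_lengths))
avg_value_feature = log_feature(sum(value_lengths) / len(value_lengths), 1)

uri_part = "|".join(
    str(item)
    for item in (uri_length_feature, directory_count, avg_directory_feature,
                 "php", avg_value_feature)
)
print(uri_part)  # 2|3|1|php|0.6
```

The printed string matches the first five parts of the example fingerprint.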
Second, header structure features are analyzed:
* Request method, encoded as the first two letters of the method (PO),
* Protocol version, encoded as an integer (1 for version 1.1, 0 for version 1.0, 9 for version 0.9),
* Order of the headers,
* Popular headers and their values.
To represent the order of headers in the request, the name of each header is encoded according to the schema in hfinger/configs/headerslow.json. For example, the User-Agent header is encoded as us-ag. The encoded names are separated by a comma (,). If the header name does not start with an uppercase letter (or, for compound headers such as Accept-Encoding, if any of its parts does not start with an uppercase letter), the encoded representation is prefixed with !. If the header name is not on the list of known headers, it is hashed using the FNV1a hash and the hash is used as the encoding.
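The header order encoding can be sketched as below. The small KNOWN_HEADERS table is a stand-in for hfinger/configs/headerslow.json, with codes inferred from the example fingerprint. The 32-bit FNV-1a variant, the zero-padded hex output, hashing the name as written, and applying the ! prefix only to known headers are all assumptions made for this illustration.

```python
def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash (assumed variant)."""
    value = 0x811C9DC5                            # FNV offset basis
    for byte in data:
        value ^= byte
        value = (value * 0x01000193) & 0xFFFFFFFF  # FNV prime, keep 32 bits
    return value


# Stand-in for hfinger/configs/headerslow.json (codes inferred from the example).
KNOWN_HEADERS = {
    "user-agent": "us-ag", "accept": "ac", "accept-encoding": "ac-en",
    "host": "ho", "connection": "co", "content-type": "co-ty",
    "content-length": "co-le",
}


def encode_header_name(name: str) -> str:
    code = KNOWN_HEADERS.get(name.lower())
    if code is None:
        # Unknown header: use the hash as the encoding.
        return format(fnv1a_32(name.encode("utf-8")), "08x")
    # Every dash-separated part must start with an uppercase letter.
    capitalized = all(part[:1].isupper() for part in name.split("-"))
    return code if capitalized else "!" + code


def encode_header_order(names):
    return ",".join(encode_header_name(name) for name in names)


example_order = encode_header_order([
    "User-Agent", "Accept", "Accept-Encoding", "Host",
    "Connection", "Content-Type", "Content-Length",
])
print(example_order)
print(encode_header_name("accept-Encoding"))  # lowercase first part gets the prefix
```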
When analyzing popular headers, the presence of these headers in the request is checked. These headers include:
* Connection
* Accept-Encoding
* Content-Encoding
* Cache-Control
* TE
* Accept-Charset
* Content-Type
* Accept
* Accept-Language
* User-Agent
When a header is found in the request, its value is checked against a table of typical values to create a pair header_name_representation:value_representation. The name of the header is encoded according to the schema in hfinger/configs/headerslow.json (as described earlier), and the value is encoded according to the schema stored in the hfinger/configs directory or in the configs.py file, depending on the header. In the example above, Accept is encoded as ac and its value */* as as-as (asterisk-asterisk), resulting in ac:as-as. These pairs are inserted into the fingerprint in the order they appear in the request, separated by /. If a header value is not found in the encoding table, it is hashed using the FNV1a hash. If the header value consists of multiple values, they are tokenized to provide a list of values separated by a comma (,). For example, Accept: */*, text/* results in ac:as-as,te-as. However, at the current stage of development, if a header value contains a "quality value" tag (q=), the entire value is encoded using its FNV1a hash. Finally, the values of the User-Agent and Accept-Language headers are encoded directly using their FNV1a hashes.
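The value encoding rules can be illustrated with a short sketch. The tables below are tiny stand-ins for the real encoding schemas and contain only the codes mentioned in the text. The 32-bit FNV-1a variant, the hex formatting, whitespace stripping of tokens, and all helper names are assumptions rather than Hfinger's actual implementation.

```python
def fnv1a_hex(text: str) -> str:
    """32-bit FNV-1a hash of the text as 8 hex digits (assumed variant and format)."""
    value = 0x811C9DC5
    for byte in text.encode("utf-8"):
        value ^= byte
        value = (value * 0x01000193) & 0xFFFFFFFF
    return format(value, "08x")


# Stand-ins for the real encoding tables; only codes quoted in the text.
NAME_CODES = {"accept": "ac", "user-agent": "us-ag", "accept-language": "ac-la"}
VALUE_CODES = {"accept": {"*/*": "as-as", "text/*": "te-as"}}
ALWAYS_HASHED = {"user-agent", "accept-language"}


def encode_header_pair(name: str, value: str) -> str:
    key = name.lower()
    name_code = NAME_CODES[key]
    # Values with a quality tag, and selected headers, are hashed as a whole.
    if key in ALWAYS_HASHED or "q=" in value:
        return f"{name_code}:{fnv1a_hex(value)}"
    table = VALUE_CODES.get(key, {})
    tokens = [token.strip() for token in value.split(",")]
    encoded = [table.get(token) or fnv1a_hex(token) for token in tokens]
    return f"{name_code}:{','.join(encoded)}"


def encode_popular_headers(headers):
    """Join pairs with '/' in order of appearance in the request."""
    return "/".join(encode_header_pair(name, value) for name, value in headers)


single = encode_header_pair("Accept", "*/*")
multiple = encode_header_pair("Accept", "*/*, text/*")
with_quality = encode_header_pair("Accept", "text/html;q=0.9, */*;q=0.8")
print(single, multiple, with_quality)
```

Note that ac-la as the code for Accept-Language is a placeholder invented for this sketch; the real code is defined in the header schema file.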
Finally, payload features are extracted:
* Presence of non-ASCII characters, represented by the letter N, or by A otherwise,
* Shannon entropy of the payload, rounded to an integer,
* Payload length, represented as the base-10 logarithm of the actual payload length, rounded to one decimal place.
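A minimal sketch of these three payload features is shown below. It assumes the entropy is computed over bytes and measured in bits (base-2 logarithm), and that the payload is not empty; how Hfinger handles an empty payload is not covered here. The function name payload_features is hypothetical.

```python
import math
from collections import Counter


def payload_features(payload: bytes):
    """Return (non-ASCII flag, rounded entropy, rounded log10 of length)."""
    if not payload:
        raise ValueError("this sketch only handles non-empty payloads")

    # 'N' if any byte is outside the 7-bit ASCII range, 'A' otherwise.
    flag = "N" if any(byte > 127 for byte in payload) else "A"

    # Shannon entropy over byte frequencies, in bits (assumed unit).
    total = len(payload)
    counts = Counter(payload)
    entropy = -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )

    length_feature = round(math.log10(total), 1)
    return flag, round(entropy), length_feature


flag, entropy, length_feature = payload_features(b"abcd")
print(f"{flag}|{entropy}|{length_feature}")  # A|2|0.6
```

Four distinct bytes give an entropy of exactly 2 bits, and log10(4) rounds to 0.6, which is why the sample prints A|2|0.6.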
Hfinger operates in five reporting modes, which differ in the features represented in the fingerprint and thus in the information extracted from requests. These are (using the numbers from the tool configuration):
* Mode 0 - produces a similar number of collisions and fingerprints as mode 2, but uses fewer features,
* Mode 1 - represents all designed features, but produces slightly more collisions than modes 0, 2, and 4,
* Mode 2 - the optimal (default) mode, represents all features typically used during request analysis, but also provides a low number of collisions and generated fingerprints,
* Mode 3 - produces the lowest number of generated fingerprints of all modes, but the highest number of collisions,
* Mode 4 - provides the highest fingerprint entropy, but also produces slightly more fingerprints than modes 0 and 2.
These modes were chosen to optimize Hfinger's ability to uniquely identify malware families against the number of generated fingerprints. Modes 0, 2, and 4 provide a similar number of collisions between malware families; however, mode 4 generates slightly more fingerprints than the other two. Mode 2 represents more request features than mode 0, with a comparable number of generated fingerprints and collisions. Mode 1 is the only mode that represents all designed features, but it almost doubles the number of collisions compared to modes 0, 2, and 4. Mode 3 produces at least two times fewer fingerprints than the other modes, but it introduces about nine times more collisions. All designed features are described above.
The modes consist of the following features (in the order they appear in the fingerprint):
* Mode 0:
  * Number of directories,
  * Average directory length as an integer,
  * Extension of the requested file,
  * Average value length as a float,
  * Order of headers,
  * Popular headers and their values,
  * Payload length as a float.
* Mode 1:
  * URI length as an integer,
  * Number of directories,
  * Average directory length as an integer,
  * Extension of the requested file,
  * Length of variables as an integer,
  * Number of variables,
  * Average value length as an integer,
  * Request method,
  * Protocol version,
  * Order of headers,
  * Popular headers and their values,
  * Presence of non-ASCII characters,
  * Payload entropy as an integer,
  * Payload length as an integer.
* Mode 2:
  * URI length as an integer,
  * Number of directories,
  * Average directory length as an integer,
  * Extension of the requested file,
  * Average value length as a float,
  * Request method,
  * Protocol version,
  * Order of headers,
  * Popular headers and their values,
  * Presence of non-ASCII characters,
  * Payload entropy as an integer,
  * Payload length as a float.
* Mode 3:
  * URI length as an integer,
  * Average directory length as an integer,
  * Extension of the requested file,
  * Average value length as an integer,
  * Order of headers.
* Mode 4:
  * URI length as a float,
  * Number of directories,
  * Average directory length as a float,
  * Extension of the requested file,
  * Length of variables as a float,
  * Average value length as a float,
  * Request method,
  * Protocol version,
  * Order of headers,
  * Popular headers and their values,
  * Presence of non-ASCII characters,
  * Payload entropy as a float,
  * Payload length as a float.