Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers


Building a Browser Fingerprint

In this subsection, we present the different attributes constituting a browser fingerprint. While fingerprinting can be used for security purposes, we focus on attributes used for tracking. We provide more details about fingerprinting attributes used for security at the end of this chapter, as well as in Chapter 5 where we explain how commercial fingerprinters detect crawlers based on their fingerprint. Fingerprint attributes require two properties when used for tracking:
1. Uniqueness. Each attribute need not be unique individually, but their combination (i.e., the browser fingerprint) should be unique in order to distinguish between different browsers. Indeed, if different browsers have the same fingerprint, they cannot be told apart, and therefore cannot be tracked, using browser fingerprinting.
2. Stability. Even when a browser fingerprint is unique, tracking requires the fingerprint to be reasonably stable. For example, consider an extension that randomizes the value of a canvas at each visit: the browser fingerprint remains unique solely because the canvas is unique. Nevertheless, since the canvas keeps changing, it becomes challenging for a fingerprinter to keep track of the fingerprint over time.
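The two properties above can be illustrated with a minimal sketch. It assumes that a fingerprint is represented as a flat dictionary of string attributes; the attribute names and values are purely illustrative and the helper names are hypothetical.

```python
from collections import Counter


def fingerprint_key(attributes):
    """Turn an attribute dict into a hashable, order-independent key."""
    return tuple(sorted(attributes.items()))


def unique_ratio(fingerprints):
    """Share of fingerprints that occur exactly once in the dataset."""
    counts = Counter(fingerprint_key(fp) for fp in fingerprints)
    unique = sum(1 for count in counts.values() if count == 1)
    return unique / len(fingerprints)


def changed_attributes(previous, current):
    """Names of the attributes whose value differs between two visits."""
    names = previous.keys() | current.keys()
    return sorted(n for n in names if previous.get(n) != current.get(n))


# Illustrative data only: two browsers share a fingerprint, two do not.
dataset = [
    {"user_agent": "A", "canvas": "c1"},
    {"user_agent": "A", "canvas": "c1"},
    {"user_agent": "A", "canvas": "c2"},
    {"user_agent": "B", "canvas": "c1"},
]
print(unique_ratio(dataset))  # 0.5

# A randomized canvas keeps the fingerprint unique but breaks stability.
visit_1 = {"user_agent": "A", "canvas": "r1"}
visit_2 = {"user_agent": "A", "canvas": "r2"}
print(changed_attributes(visit_1, visit_2))  # ['canvas']
```

In this sketch, uniqueness is measured over a whole dataset, whereas stability is measured between two visits of the same browser, which is why the two properties have to be evaluated separately.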
We distinguish three main families of attributes constituting a fingerprint: HTTP headers, attributes collected using JavaScript, and attributes collected using Flash. For each family, we present its attributes, explain how they are collected, and provide examples along with information such as their uniqueness.

HTTP Headers

When a browser sends an HTTP request, for example to obtain a page or to transmit data using the XMLHttpRequest API, it attaches headers that provide information to the server receiving the request. The role of these headers is defined in several Requests for Comments (RFCs), in particular RFC 7231 [32], which defines the semantics and the content of the headers. The RFC also explains how some headers leak information about the user or the device, and the risk that this information can be used for fingerprinting (Section 9.7 of the RFC).
We present four different HTTP headers, as well as a fifth attribute, the order of the headers, that leak information about the device and its user and that can therefore be used for fingerprinting.
User-Agent. This header provides information about the device and the software, a browser in our case, sending the request. The semantics and the content of this header are defined in Section 5.5.3 of RFC 7231 [33]. Servers can use it to gather analytics data or for compatibility purposes when an application is only available on certain kinds of devices. The User-Agent header provides several pieces of information useful for fingerprinting, such as the browser and its version, as well as the Operating System (OS). To protect against fingerprinting, the RFC advises developers not to include fine-grained details about the device. Nevertheless, it does not specify any format for the User-Agent header. Thus, as we show in the table presenting examples of user agents, some applications on mobile devices with an embedded browser may indicate sensitive information, such as the name of the carrier.
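As a rough illustration of how header-based attributes can be derived on the server side, the following sketch extracts the User-Agent value and the order of the headers from a request. It assumes that the headers are available as a list of (name, value) pairs in the order they were received; the function name and the sample request are hypothetical.

```python
def header_attributes(raw_headers):
    """Derive fingerprinting attributes from headers given in received order.

    raw_headers is assumed to be a list of (name, value) pairs; header
    names are compared case-insensitively.
    """
    lookup = {name.lower(): value for name, value in raw_headers}
    return {
        # Value of the User-Agent header, or an empty string if absent.
        "user_agent": lookup.get("user-agent", ""),
        # Order in which the browser sent its headers.
        "header_order": ",".join(name.lower() for name, _ in raw_headers),
    }


# Illustrative request only.
request_headers = [
    ("Host", "example.org"),
    ("User-Agent", "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"),
    ("Accept", "*/*"),
]
attributes = header_attributes(request_headers)
print(attributes["header_order"])  # host,user-agent,accept
```

The header order is kept as a separate attribute because it is a property of the request as a whole rather than of any single header.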

JavaScript Attributes

Attributes collected using JavaScript are the main source of entropy for browser fingerprints. In order to help developers adapt their websites to their users' devices (for example, to change the style depending on the size of the screen), browsers expose different APIs that leak information about the device. We present how different JavaScript APIs accessible without any permission, such as the canvas or the audio API, are used by fingerprinters to gather highly unique fingerprinting attributes. We first introduce several attributes that can be accessed using the navigator object, a special object exposed by default in all main browsers, which provides information about the browser and the OS.
navigator.userAgent. The user agent value can also be accessed in JavaScript through the navigator.userAgent property. In normal conditions (i.e., in the absence of any user agent spoofer), this property returns the same value as the user agent contained in the HTTP headers.
navigator.plugins. This attribute returns the list of plugins installed in the browser. For each plugin, it provides its name, the associated filename, a description, as well as the version of the plugin. Due to the deprecation of the Netscape Plug-in API (NPAPI), mostly for security reasons, the entropy of this attribute has decreased over time.
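The two attributes above can be post-processed on the server once a script has reported them. The sketch below assumes that the collected values arrive as plain strings and dictionaries; the helper names and the sample values are hypothetical, and the plugin fields follow the description given above rather than any particular browser implementation.

```python
def user_agent_mismatch(http_user_agent, js_user_agent):
    """True when the header value and navigator.userAgent disagree.

    In normal conditions both values are identical, so a mismatch may
    hint at a user agent spoofer.
    """
    return http_user_agent.strip() != js_user_agent.strip()


def plugins_signature(plugins):
    """Serialize a reported plugin list into a single comparable string.

    Each plugin is assumed to be a dict with the keys "name", "filename",
    "description" and "version". The reported order is preserved.
    """
    return ";".join(
        "|".join([p["name"], p["filename"], p["description"], p["version"]])
        for p in plugins
    )


# Illustrative values only.
print(user_agent_mismatch("UA/1.0", "UA/1.0"))  # False
print(user_agent_mismatch("UA/1.0", "UA/2.0"))  # True
print(plugins_signature([
    {"name": "Viewer", "filename": "viewer.so",
     "description": "Document viewer", "version": "1.2"},
]))  # Viewer|viewer.so|Document viewer|1.2
```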

Studying Browser Fingerprints Diversity

Mayer [2] brought to light the privacy problems that arise from browser diversity and customization. Since there are different OSes, browsers, screen resolutions and plugins, this diversity could be exploited to uniquely identify browsers. At the time the thesis was written in 2009, the situation was even worse due to the widespread use of Java applets and Flash ActionScript programs that had access to even more attributes than JavaScript programs. Over two weeks, Mayer collected fingerprints from 1,328 different browsers, among which 1,278 (96.23%) were unique.
Mayer's work motivated the first large-scale study on browser fingerprinting uniqueness, conducted by Eckersley [3] in collaboration with the Electronic Frontier Foundation (EFF). They created a website, Panopticlick, on which they collected 470,161 fingerprints between 27th January and 15th February 2010. Their results confirm Mayer's initial findings: 83.6% of the browsers had a unique fingerprint. Uniqueness was even higher, 94.2%, for browsers with either Flash or Java activated. Indeed, among Flash and Java users, only 1% of the browsers had an anonymity set larger than two. They showed that the list of plugins and the list of fonts were the two attributes with the most entropy. With this proportion of unique browser fingerprints, they argue that this technique can be used for tracking, in particular as a mechanism to regenerate supercookies or deleted cookies.
To support this claim, they proposed a simple heuristic that aims at linking multiple fingerprints of the same browser. First, they studied the stability of browser fingerprints over time and showed that among the 8,833 users that had accepted a cookie and that had visited the website multiple times, more than 37% displayed at least one change (besides activating or deactivating JavaScript) in their fingerprint. They are aware that this number may be overestimated because of the nature of their website, which tends to make people change their fingerprint on purpose, e.g., by changing the list of languages they prefer or by deactivating a plugin. Nevertheless, they showed that despite these frequent changes, browser fingerprinting could still be used for tracking. Their heuristic made correct predictions 65% of the time and incorrect predictions 0.56% of the time. Otherwise, 35% of the time, it made no prediction.
Laperdrix et al. [27] also created a website, AmIUnique, to study the diversity of fingerprints. Between 2014 and 2015, they collected more than 118,000 fingerprints. In addition to the attributes collected in the study conducted by Eckersley [3], they collected seven new attributes, such as canvas [38] and WebGL fingerprinting or the presence of an ad-blocker. They used the normalized Shannon's entropy to compare their dataset with the Panopticlick dataset. They found similar results, except for the list of plugins and the list of fonts, for which they obtained a lower entropy. This difference can be explained by the decrease of Flash usage, which means that the list of fonts was not collected for all the fingerprints, therefore decreasing its entropy. The difference can also be explained by the rise of mobile usage, since mobile browsers do not support plugins, Flash included. Among the new attributes, they found that canvas was among the five most discriminating attributes, with a normalized entropy close to the entropy of the list of plugins. Among the 118,934 fingerprints they collected, they obtained 8,375 distinct canvas values, among which 5,533 were unique.
They also studied the differences between computer (desktop or laptop) and mobile fingerprints. While 90% of desktop fingerprints were unique, only 81% of mobile fingerprints were unique. This difference was mostly explained by the low entropy of the list of fonts and the list of plugins on mobile. Nevertheless, mobile fingerprints still achieve a high uniqueness because of attributes such as the user agent or the canvas, which are more discriminating on mobile. Indeed, in the case of the user agent, they noticed that some phone manufacturers were adding sensitive information to this header, such as the precise version of the model or the version of the Android firmware. In the case of the canvas, they noticed that the emoji included in it was also a great source of entropy, since its rendering depends on the phone OS version as well as on the phone manufacturer.
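The entropy-based comparison used in these studies can be sketched as follows. We assume here that the normalized Shannon's entropy of an attribute is its Shannon entropy divided by log2(N), where N is the number of fingerprints in the dataset, so that datasets of different sizes become comparable; the function names and the sample values are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the observed values of one attribute."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(values):
    """Entropy divided by its maximum, log2(N), for N observations."""
    if len(values) < 2:
        return 0.0
    return shannon_entropy(values) / math.log2(len(values))


def distinct_and_unique(values):
    """Number of distinct values, and number of values seen exactly once."""
    counts = Counter(values)
    return len(counts), sum(1 for c in counts.values() if c == 1)


# Illustrative canvas values for four fingerprints.
canvas_values = ["c1", "c1", "c2", "c3"]
print(shannon_entropy(canvas_values))      # 1.5
print(normalized_entropy(canvas_values))   # 0.75
print(distinct_and_unique(canvas_values))  # (3, 2)
```

A normalized entropy of 1 means that every fingerprint has a different value for the attribute, while a value of 0 means that all fingerprints share the same value.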


Use of Browser Fingerprinting on the Web

We present multiple large-scale studies that analyzed the use of browser fingerprinting on the web. We present these studies in chronological order to better convey the evolution of fingerprinting use and techniques over time.
The first large-scale studies on browser fingerprinting started in 2013, three years after Mayer [2] and Eckersley [3] brought to light the privacy risk arising from browser customization. Nikiforakis et al. [4] analyzed the code of three popular fingerprinters. They noticed that commercial fingerprinters used more aggressive techniques than those presented by Eckersley [3]. For example, commercial fingerprinters heavily relied on Flash and ActiveX plugins to obtain information not available in JavaScript, such as whether or not the browser is behind a proxy. They noticed that even for simple attributes, such as the user agent or the platform, which can be accessed using navigator.platform, the Flash platform attribute provides more detailed information, such as the exact version of the Linux kernel, which can be used both for tracking and for exploiting vulnerabilities.
They detected that fingerprinters adapted their behavior based on the nature of the browser and the plugins available. For example, when the script detected Internet Explorer, it tried to exploit specific APIs available only on Internet Explorer, such as navigator.systemLanguage. When specific plugins were detected, two of the fingerprinters even tried to invoke them to obtain sensitive information, such as the hard disk identifier, the computer's name, the installation date of Windows, as well as the list of installed system drivers. They also detected a shift in the way fonts were obtained because of the decline of Flash. Thus, while two of the fingerprinters used Flash to obtain the list of available fonts, one of the fingerprinters was using JavaScript [39].
They also crawled the Alexa Top 10K to study the adoption of these three fingerprinting scripts among popular websites. They detected 40 sites (0.4%) using scripts provided by one of the three commercial fingerprinters. They also used Wepawet, an online platform for the detection of web-based threats, to detect whether these scripts were used by less popular websites, and found that 3,804 domains analyzed by Wepawet used one of these scripts.

Table of contents :

I Preface
1 Introduction
1.1 Motivations
1.2 Contributions
1.2.1 Tracking Browser Fingerprint Evolutions
1.2.2 Studying The Privacy Implications of Browser Fingerprinting Countermeasures
1.2.3 Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers
1.3 List of Scientific Publications
1.4 List of Tools and Prototypes
1.5 Outline
II State of the Art
2 State-of-the-art
2.1 Context
2.1.1 Browsers Evolution
2.1.2 Monetizing Content on the Web: Advertising and Tracking
2.2 Browser Fingerprinting
2.2.1 Definition
2.2.2 Building a Browser Fingerprint
2.2.3 Studying Browser Fingerprints Diversity
2.2.4 Use of Browser Fingerprinting on the Web
2.3 Countermeasures Against Fingerprinting
2.3.1 Blocking Fingerprinting Script Execution
2.3.2 Breaking Fingerprint Stability
2.3.3 Breaking the Uniqueness of Browser Fingerprints
2.3.4 Summary of Existing Countermeasures
2.3.5 Limits of Fingerprinting Countermeasures
2.4 Security Applications
2.4.1 Enhancing Web Security Using Browser Fingerprinting
2.4.2 Detecting Bots and Crawlers Without Fingerprinting
2.5 Conclusion
2.5.1 FP-Stalker: Tracking Browser Fingerprint Evolutions
III Contributions
3 Fp-Stalker: Tracking Browser Fingerprint Evolutions
3.1 Browser Fingerprint Evolutions
3.2 Linking Browser Fingerprints
3.2.1 Browser fingerprint linking
3.2.2 Rule-based Linking Algorithm
3.2.3 Hybrid Linking Algorithm
3.3 Empirical Evaluation of Fp-Stalker
3.3.1 Key Performance Metrics
3.3.2 Comparison With Panopticlick’s Linking Algorithm
3.3.3 Dataset Generation Using Fingerprint Collect Frequency
3.3.4 Tracking Duration
3.3.5 Benchmark/Overhead
3.3.6 Threats to Validity
3.3.7 Discussion
3.4 Conclusion
4 Fp-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies
4.1 Investigating Fingerprint Inconsistencies
4.1.1 Uncovering OS Inconsistencies
4.1.2 Uncovering Browser Inconsistencies
4.1.3 Uncovering Device Inconsistencies
4.1.4 Uncovering Canvas Inconsistencies
4.2 Empirical Evaluation
4.2.1 Implementing FP-Scanner
4.2.2 Evaluating FP-Scanner
4.2.3 Benchmarking FP-Scanner
4.3 Discussion
4.3.1 Privacy Implications
4.3.2 Perspectives
4.3.3 Threats to Validity
4.4 Conclusion
5 FP-Crawlers: Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers
5.1 Detecting Crawler Blocking and Fingerprinting Websites
5.1.1 Detecting Websites Blocking Crawlers
5.1.2 Detecting Websites that Use Fingerprinting
5.2 Analyzing Fingerprinting Scripts
5.2.1 Describing our Experimental Dataset
5.2.2 Detecting Crawler-Specific Attributes
5.2.3 Checking Browser Inconsistencies
5.2.4 Checking OS Inconsistencies
5.2.5 Checking Screen Inconsistencies
5.2.6 Other Non-fingerprinting Attributes
5.3 Detecting Crawler Fingerprints
5.3.1 Experimental Protocol
5.3.2 Experimental Results
5.4 Discussion
5.4.1 Limits of Browser Fingerprinting
5.4.2 Threats to Validity
5.4.3 Ethical Considerations
5.5 Conclusion
IV Final Remarks
6 Conclusion
6.1 Contributions
6.1.1 FP-Stalker: Tracking Browser Fingerprint Evolutions
6.1.2 FP-Scanner: The Privacy Implications of Browser Fingerprint Inconsistencies
6.1.3 FP-Crawlers: Evaluating the Resilience of Browser Fingerprinting to Block Adversarial Crawlers
6.2 Future work
6.2.1 Automating Crawler Detection Rules Learning
6.2.2 Investigate New Fingerprinting Attributes
6.2.3 Studying Fingerprinting for Authentication
6.2.4 Developing Web Red Pills
6.3 Future of Browser Fingerprinting
References