Economic Impact and Evolution
A first study of the impact of phishing [apw04], conducted by the Anti-Phishing Working Group, reported a few hundred phishing attacks per month at the end of 2003. Back then, the most targeted industry sector was financial services, and only 24 brands were targeted by phishing.
In ten years, phishing activity has grown beyond expectations despite the effort put into fighting it. The latest report [rsa14] states that around 60,000 phishing attacks are observed every month. This is more than a twofold increase since 2010, when around 200,000 phishing attacks were reported for the whole year. These phishing attacks are mainly web-based: more than 40,000 unique phishing websites were detected and around 60,000 unique phishing emails were reported every month in 2014 [apw14]. This shows that phishing emails and fake websites remain the main vectors of current phishing attacks. Many domain names are registered to direct users to phishing websites: around 10,000 maliciously registered domain names were used for phishing every month in 2014 [apw14]. This is an increase compared to 2012, when only 7,712 maliciously registered domains were used during the whole first half of the year [AR14]. Alternative techniques to direct users to fake websites, such as domain hijacking or DNS cache poisoning, are cumbersome and require technical skills. Registering free domain names for malicious purposes is much easier for phishers, which explains the increasing use of maliciously registered domain names.
As the Web grew, the number of brands targeted by phishing rose. More than ten industry sectors are targeted by phishing nowadays. Targets were mostly ISPs (e.g. AOL) at the beginning, and financial institutions followed a few years later. Now, social networks, online gaming, government and retail websites are targeted as well. As depicted in Figure 1.2, the most targeted industry sector is payment services (e.g. PayPal, Visa, MasterCard), which represents almost 40% of phishing targets. ISPs and financial institutions are still targeted but in lower proportions, 8.42% and 20.2% respectively. Compared to the initial 24 brands targeted by phishing in 2003, around 350 different brands are currently targeted [apw14]. This makes protecting brands with dedicated solutions [eba] difficult, since one would have to install and use too many add-ons to stay safe.
The rise in phishing targets also caused a rise in financial damage. No accurate estimate of the financial loss due to phishing is available. However, in 2007 a Gartner survey [gar07] estimated a direct financial loss of 3.2 billion dollars over the year due to phishing. In 2010, a report [str10] on identity theft, mainly performed through phishing attacks, estimated a loss of 54 billion dollars as a consequence of this theft. More recently, in 2013, the estimated direct loss due to phishing reached a record 5.9 billion dollars over the year [rsa14]. As described in [AR14], the first day of a phishing attack is the most lucrative, so fast takedown of phishing websites is paramount to limit the financial damage. Between 2010 and 2014, the average uptime of phishing websites was more than halved, dropping from 72 hours to 32 hours. More significant is the median uptime of phishing websites, which decreased from 15 hours in 2010 to less than 9 hours in 2014. The development of phishing detection and prevention techniques shortened the time needed to take down fake websites, thus reducing the financial loss caused by each phishing attack. While causing significant losses, phishing is a low-reward activity for phishers [HF08]. Hence, the number of phishing attacks keeps growing to increase the gain, which in turn increases the number of victims and the global financial loss. The availability of easy-to-deploy phishing kits [CKV08] that target both users and victims makes performing phishing attacks an easy task and explains the rise in perpetrated attacks.
Strong Authentication Schemes
When accessing banking services, retail services or social networks on the Internet, for instance, users are expected to enter some personal information (e.g. username, password) to prove their identity to the web service and access their account. This is needed because the account associated with such a service contains other, more sensitive information about the user that must be protected from access by others. This is the authentication process required by any Internet service dealing with personal data. This process has been implemented with flaws that let unauthorized users access this information. The use of weak passwords and their reuse by unsavvy users expose the authentication process to attacks. Moreover, the reverse process, which consists in authenticating a website to a user, has proven weak. Users cannot reliably identify the entity they communicate with, leaving room for phishing attacks.
Hence, some solutions have been proposed to fight phishing by developing strong authentication processes that allow both parties (client and server) to prove their identity without having to reveal sensitive information during the handshake.
The first proposed solutions consisted of browser extensions that help users distinguish content provided by legitimate websites from content provided by illegitimate entities. The goal is to keep users from entering sensitive information in illegitimate fields and thus to prevent credential theft. In [YS02], a browser extension that gives different colors to window boundaries is introduced. Two kinds of windows are presented to users according to the nature of the window content, which makes it possible to distinguish browser-provided status from content provided by web servers. Easily distinguishable windows help users detect malicious content provided by web servers and prevent them from providing sensitive information to illegitimate entities. In [DT05], a similar solution based on customized browser windows is proposed to authenticate a web server. It uses unique abstract images to display a dedicated password window in which users enter their credentials to log in to a given website. The look of the window is different for every user and transaction, and it is generated from a password shared between the user and the server. Hence, users can easily see whether the displayed password window is spoofed, since a spoofed window would not have the look expected for a given legitimate website.
Methods for improving server authentication have also been proposed, as in [TH09]. DNS TXT records are used to store the legitimate entity’s certificate and authenticate it to the client. A client plugin and a server plugin are used. The client plugin authenticates the web server by validating the certificate stored in the DNS TXT record. Both plugins then perform mutual authentication using a one-time password.
One bad habit of Internet users is to use similar weak passwords for different websites. These can be easily guessed by phishers, and one stolen password can give access to several accounts. To cope with this habit, some techniques to strengthen and differentiate passwords have been proposed [RJM+05, YS06, GLLA07]. For instance, Ross et al. introduce a browser extension named PwdHash [RJM+05]. This extension transparently produces a different password for each site a user signs in to, based on a single password. It relies on a hash function that generates a password from the unique user password and a piece of data associated with the website that cannot be spoofed by a phisher: the domain name. It strengthens web password authentication and prevents the user’s password from being disclosed to phishers on fake websites, since a fake website has a different domain name from the one it spoofs.
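The idea of deriving a site-specific password from a single user password and the domain name can be illustrated with a short sketch. This is not the actual PwdHash algorithm: the use of HMAC-SHA256, the base64 encoding, the 16-character truncation and the function name site_password are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac


def site_password(master_password: str, domain: str, length: int = 16) -> str:
    """Derive a per-site password from one master password and a domain name.

    Illustrative only: HMAC-SHA256 and the output encoding are assumptions,
    not the construction used by PwdHash.
    """
    key = master_password.encode("utf-8")
    # Normalise the domain so that "PayPal.com" and "paypal.com" match.
    message = domain.strip().lower().encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    # Turn the raw digest into printable characters and truncate it.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


legitimate = site_password("one master secret", "paypal.com")
spoofed = site_password("one master secret", "paypa1-login.example")
# The spoofing domain receives a different value, useless on the real site.
print(legitimate != spoofed)
```

Because the domain name is part of the hash input, a password typed on a look-alike domain never equals the one registered with the legitimate site, which is the property the paragraph above relies on.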
On the one hand, the methods presented last, which aim at strengthening user authentication, are effective at preventing passwords from being stolen and reused by phishers. Some of the techniques proposing second-factor authentication have been adopted by sensitive services such as e-banking to enforce security. On the other hand, methods for strengthening server authentication, while well founded, did not improve the situation. Unlike strong user authentication, which is imposed by e-services, strong server authentication is left to Internet users. Since most Internet users are unaware of the dangers of phishing and since security is a secondary concern for them, these optional methods have not been adopted. These security solutions [YS02, DT05] are not mandatory, are difficult to understand and, overall, add constraints to users whose primary purpose is surfing the Web. Other techniques [TH09] that are used transparently still have flaws, such as being vulnerable to DNS cache poisoning attacks. Even if some of these techniques were widely implemented, most users do not understand the security indicators that prove the authenticity of the entities they communicate with [HJ08]. This limits the applicability of server authentication techniques.
Phishing is the cybercrime equivalent of pickpocketing. It is easily perpetrated, with low technical skills and at low cost to the attacker, thanks to phishing kits and cheap criminal hosting services, which also explains the increase in the number of phishers.
Web Page Content Analysis
Phishing websites usually try to adopt the look and feel of the popular websites they spoof. Hence, they often exhibit visual similarities with other websites or at least exhibit specific characteristics that researchers tried to identify. Web page content analysis was used to perform this task and identify phishing websites on the fly.
Some techniques focus on the analysis of visual similarities between phishing websites and the legitimate websites being spoofed [MKK08, CDM10, CSDM14]. Medvet et al. [MKK08] propose a technique to extract a signature of a web page that describes its visual composition. The signature considers the text contained in the page, along with its style, color, font family and size as presented in the leaf text nodes of the HTML Document Object Model (DOM) tree. It also considers the images embedded in the page, their position, size and color histograms. The last component of the signature corresponds to the global viewport of the web page as rendered by the user agent. Signatures of legitimate and suspected web pages are compared, and if the similarity is too high, the unknown web page is considered phishing. Visual similarity comparison is a sound idea for detecting spoofing websites, although the authors do not propose any way to first identify the potentially spoofed website, which is needed as the basis of comparison for the presumed phishing web page. Another work [CSDM14] relies on the same principle of visual similarity comparison to provide targeted protection for a limited number of websites. Every original web page to protect from phishing is cached in the system, and when an unknown web page is requested by the user, its rendering is captured as an image file. The Normalized Compression Distance (NCD [CV05]) is computed between the cached web page images and the image of the requested web page. A set of features is obtained and submitted to a classification algorithm to determine whether the web page is a phishing page. This technique addresses the weakness of [MKK08], since it provides a set of web pages to compare suspected phishing pages to. However, it also shows that visual similarity analysis of web pages is limited to the protection of a small set of predefined websites.
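The Normalized Compression Distance mentioned above is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the size of the compressed input. The sketch below is a minimal illustration using zlib as the compressor; the choice of compressor, the function names and the byte strings standing in for rendered page images are assumptions, not the setup of [CSDM14].

```python
import hashlib
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (stand-in for C(.))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical stand-ins for captured page images.
cached_page = b"login to your account " * 50
unrelated_page = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))

# A page compared with itself compresses well together (small distance),
# while unrelated content yields a distance close to 1.
print(ncd(cached_page, cached_page) < ncd(cached_page, unrelated_page))
```

In a system of the kind described, one such distance per cached page would form the feature vector handed to the classifier.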
More general techniques of web page content analysis have been proposed, the best-known example being CANTINA [ZHC07]. CANTINA extracts the terms included in a web page and applies an information retrieval algorithm (TF-IDF [SM83]) to generate a lexical signature of the web page. The signature consists of five terms that serve as the fingerprint of the web page.
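Building a five-term lexical signature with TF-IDF can be sketched as follows. The tokenizer, the smoothed IDF variant, the tie-breaking rule and the toy reference corpus are assumptions chosen for illustration; the source does not specify the exact weighting used by CANTINA.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep alphabetic terms only."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page: str, corpus: list[str], size: int = 5) -> list[str]:
    """Return the `size` terms of `page` with the highest TF-IDF score."""
    term_freq = Counter(tokenize(page))
    documents = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(documents)
    scores = {}
    for term, count in term_freq.items():
        doc_freq = sum(1 for doc in documents if term in doc)
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[term] = count * idf
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:size]


reference_corpus = [
    "the cat sat on the mat",
    "the dog and the bird",
    "the news of the day",
]
page_text = "PayPal login: verify the PayPal account, the sooner the better"
print(lexical_signature(page_text, reference_corpus))
```

Terms that are frequent in the page but rare in the reference corpus, such as a brand name, rank first, whereas common words are pushed down.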
Content Delivery Network
A Content Delivery Network (CDN) [PB07] is an Internet-wide distributed system of servers deployed in multiple data centres, which provides content delivery services. CDNs deliver Internet content to users with high availability and high performance by taking advantage of the abstraction of resource location provided by the DNS.
When a given content is requested through a DNS request, a Content Delivery Network returns tailor-made DNS Resource Records (RRs) that take into account several parameters, such as the location of the requester, the location of the data center where the content is available, or the load of the servers providing the content. DNS requests to CDNs are for domain names of the form user-content-info.cdn.com, where user-content-info encodes information about the user and the requested content in order to determine which servers of the CDN can deliver the content fastest. The IP addresses of the selected content delivery servers are included as A RRs in the DNS reply sent to the requesting user. Resource Records provided by CDNs have a low Time To Live (TTL) to direct users to different servers over time and to distribute the load among the servers. In addition, a low TTL facilitates recovery when a delivery server is down, since a new DNS request must be issued when the TTL expires and a new delivery server can then be selected.
From a DNS point of view, CDN domain names have some distinguishable characteristics:
• RRs with low TTL (from 30 seconds to 10 minutes on average).
• Ever changing A RR for the same domain name with IP addresses belonging to a fixed set.
• Several subdomains, often algorithmically generated, for the same domain name (corresponding to the several contents provided and users served).
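The three characteristics listed above can be turned into simple checks over observed DNS answers. The sketch below is only an illustration: the DnsObservation record, the cdn_indicators function and the thresholds (600 seconds for the TTL, following the upper bound given above, and three subdomains) are assumptions, all observations are assumed to belong to the same parent domain, and the "fixed set" property of the addresses is not tested.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsObservation:
    """One observed DNS answer (hypothetical record layout)."""
    name: str             # queried hostname
    ttl: int              # TTL of the A records, in seconds
    addresses: frozenset  # IPv4 addresses returned in the answer


def cdn_indicators(observations, max_ttl=600, min_subdomains=3):
    """Evaluate the three CDN-like DNS characteristics on a set of answers."""
    answers_per_name = defaultdict(set)
    for obs in observations:
        answers_per_name[obs.name.lower()].add(obs.addresses)
    return {
        # Every answer carries a low TTL.
        "low_ttl": bool(observations) and all(o.ttl <= max_ttl for o in observations),
        # At least one hostname was resolved to different address sets.
        "changing_a_records": any(len(s) > 1 for s in answers_per_name.values()),
        # Several distinct hostnames were seen under the parent domain.
        "many_subdomains": len(answers_per_name) >= min_subdomains,
    }


observed = [
    DnsObservation("a1.cdn.example", 60, frozenset({"192.0.2.1", "192.0.2.2"})),
    DnsObservation("a1.cdn.example", 30, frozenset({"192.0.2.3"})),
    DnsObservation("b7.cdn.example", 120, frozenset({"192.0.2.2"})),
    DnsObservation("c9.cdn.example", 300, frozenset({"192.0.2.1"})),
]
print(cdn_indicators(observed))
```

A real monitoring system would also need a reliable way to group hostnames by registered domain, which is left out here.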
Many Internet content providers, such as Apple or Facebook, pay CDN operators to deliver their content. CDN services provide high availability and high performance in content delivery. In addition, they reduce infrastructure costs for their clients, who do not need to own servers to host and deliver their content themselves.
DNS Misuses and Security Issues
When a service provides benefits to legitimate services, the same benefits can apply to malicious activities as well. This statement holds for the DNS. While it enhances content delivery and increases the availability of legitimate content, it is also used to ensure the availability of malicious content on the Internet. As a popular and common Internet service and network protocol, the DNS is used to convey and hide other kinds of network traffic. In addition, the original design of this service, which dates back to 1987, did not anticipate security weaknesses that have since been exploited for several malicious purposes.
Table of contents:
Part I State of The Art and Background
Chapter 1 Phishing and Protection Techniques
1.1 Phishing: an Online Con Game
1.1.1 Definition and History
1.1.2 Phishing Vectors
1.1.3 Economic Impact and Evolution
1.1.4 Challenges to Fight Phishing
1.2 Phishing Prevention Techniques
1.2.1 Strong Authentication Schemes
1.2.2 Security Toolbars
1.3 Phishing Detection Techniques
1.3.1 Phishing Emails Detection
1.3.2 Web Page Content Analysis
1.3.3 URL Analysis
Chapter 2 Domain Name System Monitoring
2.1 The Domain Name System
2.1.1 Organization and Implementation
2.1.2 DNS Usage
2.1.3 DNS Misuses and Security Issues
2.2 DNS Monitoring
2.2.1 DNS Monitoring Strategies
2.2.2 Performance Evaluation and Anomaly Detection
2.2.3 Malicious Activity Detection
Part II Phishing Domain Names and URLs Detection
Chapter 3 Large Scale Passive DNS Monitoring for Identifying Malicious Domains
3.1 Passive DNS Monitoring Architecture
3.1.1 DNS Data Gathering
3.1.2 Distributed Storage and Processing System
3.2 Data Mining in DNS Space
3.2.1 DNS Features Extraction
3.2.2 Domain Names Clustering
3.3 Experimental Evaluation
3.3.2 Feature Analysis
3.3.3 K-means Clustering Evaluation
Chapter 4 Phishing Domain Name Identification Based on Word Relatedness
4.1 Phishing URL Obfuscation
4.1.1 Obfuscation Techniques
4.1.2 Obfuscation Words Semantic
4.2 Semantic Analysis of Domain Names
4.2.1 Word Extraction
4.2.2 Word Relatedness Computation
4.2.3 Similarity Metrics
4.3 Domain Sets Comparison
4.3.2 Similarity Metrics Evaluation
4.3.3 Domains Set Size and Composition
Chapter 5 Semantic Based Phishing URLs Rating
5.1 Intra-URL Relatedness Analysis
5.1.1 URL Word Extraction
5.1.2 Shortcomings of Word Relatedness Evaluation Tools
5.1.3 Search Engine Query Data
5.1.4 Feature Computation
5.2.1 Distributed Word Relatedness Inference
5.2.2 Bloom Filter for Features Computation
5.3 Phishing URL Detection
5.3.2 Features Analysis
5.3.3 URL Classification
5.3.4 URL Rating
Part III Semantic Based Phishing Domain Names Prediction
Chapter 6 Semantic DNS Probing
6.1 Smart DNS Probing
6.1.1 Hostnames Composition Schemes
6.1.2 System Overview
6.1.3 Smart DNS Brute Forcer
6.2 Semantic Discovery of Subdomains
6.2.1 Similar Names
6.2.2 Incremental Discovery
6.3 DNS Probing Evaluation
6.3.2 Exploration Parameters
6.3.3 Performance Evaluation
Chapter 7 Proactive Discovery of Phishing Domain Names
7.1 Modeling a Phisher’s Language
7.1.1 Domain Names Features
7.1.2 Domain Names Generation Model
7.2 Domain Names Features Evaluation
7.2.2 Features Analysis
7.3 Phishing Domain Names Generation
7.3.1 Types of Generated Domains
7.3.2 Efficiency and Steadiness of Generation
7.3.3 Predictability and Strategy
1 Summary of Contributions
2 Research Perspectives
List of Figures
List of Tables