Current Datasets In Practice


Chapter 3 Dataset

Because fake news has emerged only recently and interest in detecting it is still growing, the resources for researching it are in their infancy. Labeling a dataset requires Subject Matter Experts (SMEs) to annotate each article, and the boundary between fake and real can be fuzzy because such articles are written to deceive readers. The lack of reliable benchmark datasets poses a significant challenge to advancing the research. In this section, we review the labeled datasets currently available and discuss SemEval 2019 Task 4 – Hyperpartisan News Detection.

Current Datasets In Practice

In the subsequent sub-sections, we review the current datasets in practice used in deception detection – fake news detection, clickbait detection, stance detection and hyperpartisan news detection.



In 2014, Vlachos and Riedel[33] proposed using data from fact-checking websites such as POLITIFACT.COM, a Pulitzer Prize-winning website, and FULLFACT.ORG. In 2017, Ferrara et al.[17] created a rumor-debunking dataset containing 300 rumored claims and 2595 associated news articles, manually annotated by journalists. These claims were collected by journalists from websites and Twitter accounts such as @Hoaxalizer. Each claim has sourced news articles, a stance (for, against, observing), an article headline and a veracity label (true or false). Although this dataset pioneered stance detection, it contains relatively few samples.

LIAR Dataset

The LIAR dataset by Wang[34] is a publicly available dataset for fake news detection. It consists of 12,800 manually annotated short statements, made in a variety of contexts, collected from POLITIFACT.COM over a decade. Each statement comes with a detailed analysis report and links to the source documents. The truthfulness ratings use 6 fine-grained labels: pants-fire, false, barely-true, half-true, mostly-true and true. The distribution of articles can be seen in Table 3.1.
The dataset contains statements from speakers affiliated with both the Democratic and Republican parties, as well as social media posts such as Facebook posts. In addition to the speakers' party affiliation, the dataset contains other metadata such as their current job, credit history (counts of their past truthfulness ratings) and home state.

BuzzFeed News Dataset

BuzzFeed published an article, ‘Inside The Partisan Fight For Your News Feed’, which identified 667 websites as partisan news outlets, along with 452 associated Facebook pages. The dataset is available in a GitHub repository.

Fake News Challenge Dataset (FNC-1)

The FNC-1 dataset supports the use of Artificial Intelligence (AI) to determine whether a headline and the body text of an article are related to each other, i.e., stance detection. Each record is a (headline, body, stance) triple, where the stance is one of agrees, disagrees, discusses or unrelated. The dataset extends the work of Ferrara et al.[17] and is available in a GitHub repository.
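To make the record structure concrete, the following is a minimal sketch of a (headline, body, stance) triple in Python. The class name `StanceExample`, the helper `is_related` and the sample text are hypothetical and are not part of the FNC-1 release; only the four stance labels come from the description above.

```python
from dataclasses import dataclass

# The four stance labels named in the text.
STANCES = ("agrees", "disagrees", "discusses", "unrelated")


@dataclass(frozen=True)
class StanceExample:
    """One (headline, body, stance) record; the class name is hypothetical."""
    headline: str
    body: str
    stance: str

    def __post_init__(self):
        # Reject labels outside the four stances listed above.
        if self.stance not in STANCES:
            raise ValueError(f"unknown stance: {self.stance!r}")


def is_related(example: StanceExample) -> bool:
    """A headline/body pair counts as related unless its stance is 'unrelated'."""
    return example.stance != "unrelated"


# Made-up sample text, for illustration only.
example = StanceExample(
    headline="City council approves new park",
    body="The council voted on Tuesday to fund the park.",
    stance="agrees",
)
print(is_related(example))
```

Treating "unrelated" as the only negative case mirrors the relatedness question posed above; the three remaining labels then describe how a related body positions itself toward the headline.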

Kaggle Fake News Dataset

Kaggle’s ‘Getting Real about Fake News’ dataset consists of 12,999 posts scraped from 244 websites. These websites were scraped using a Chrome extension, BS Detector. The samples in the dataset seem too obviously or extremely fake, which differs from the problem at hand – detecting intentionally mis-


SemEval 2019 Task 4 – Hyperpartisan News Detection

We start our work with SemEval 2019 Task 4 – Hyperpartisan News Detection. The task is as follows:
Given a news article text, decide whether it follows a hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person.

SemEval 2019 Task 4 Dataset

Our initial modeling is based on the SemEval 2019 Task 4 dataset because it is large and its articles are labeled both by the credibility of their publishers and by the content of the articles themselves.
The dataset is split into two parts: labeled by publishers, and labeled by articles. The distribution of the data is given in Table 3.2.
The ‘by publishers’ part contains 750,000 articles, split between training and validation. These articles are labeled by the overall bias of the publisher, as assigned by journalists at BuzzFeed or MEDIABIASFACTCHECK.COM. 50% of this part is hyperpartisan while the other half is not. Of the 375,000 hyperpartisan articles, half (187,500) fall on the left of the political spectrum while the other half fall on the right. This part is therefore well balanced, unlike the second part of the dataset, which is labeled by articles.
In the ‘by articles’ dataset, a total of 645 articles were labeled through crowdsourcing on the basis of their content. The label of each article was agreed upon by consensus. Of these, 238 articles (37%) are hyperpartisan and the remaining 407 articles (63%) are not.
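The class balance quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the function name `class_shares` and the dictionary keys are illustrative choices.

```python
def class_shares(counts):
    """Return each label's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts as stated in the text for the two parts of the dataset.
by_publishers = {"hyperpartisan": 375_000, "not_hyperpartisan": 375_000}
by_articles = {"hyperpartisan": 238, "not_hyperpartisan": 407}

for name, counts in (("by publishers", by_publishers), ("by articles", by_articles)):
    shares = class_shares(counts)
    summary = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"{name} ({sum(counts.values())} articles): {summary}")
```

The imbalance in the ‘by articles’ part matters when choosing evaluation metrics, since a classifier that always predicts the majority class would already be right for roughly 63% of those articles.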

Midterm Elections 2018 Dataset

The goal of this study is to use the model developed in the preliminary work on the SemEval 2019 Task 4 dataset to understand the occurrence of hyperpartisanship in web-searched articles related to the 2018 U.S. midterm elections.
We aggregated trending queries related to the query ‘United States midterm election’ using Google Trends. A list of 86 queries was curated, fewer than the initial goal of 100, because the queries became repetitive and new ones stopped appearing. The queries can be seen in Table 3.3.
Once the queries were curated, the Bing web-search API was used to scrape web pages for them. At most 100 web pages were collected per query, depending on the results returned. This resulted in a total of 6616 web articles.

Methods and Results

In this chapter, we discuss the various text features and machine learning algorithms used to classify our data and detect hyperpartisanship.
Section 4.1 describes the standard preprocessing techniques applied to the text before any feature engineering or modeling. The standard text mining process is shown in Figure 4.1.


Preprocessing

Preprocessing is the preliminary step that converts text into a format suitable as input to an algorithm. The steps involved are explained in the subsequent sub-sections and in Figure 4.2.


Tokenization involves splitting paragraphs into sentences and sentences into words. Sentence boundary detection is used to get a list of sentences. We used PunktSentenceTokenizer from NLTK to perform sentence tokenization; it implements the unsupervised multilingual sentence boundary detection of Kiss and Strunk[21]. For word tokenization, white spaces are used as delimiters. Additionally, all words are converted to a single case (either lowercase or uppercase) so that capitalized words are not treated as different from their non-capitalized forms. Abbreviations are the exception: they are kept capitalized, which requires rules to preserve them as is, e.g., to distinguish US (United States) from us (pronoun).
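The word-level part of this step can be sketched in plain Python as follows. This is only an illustration of whitespace tokenization with a case-folding rule that protects abbreviations; the set `PROTECTED_ABBREVIATIONS` and the function name `tokenize_words` are assumptions, and sentence splitting with NLTK is deliberately left out.

```python
# Hypothetical set of abbreviations to protect from lowercasing.
PROTECTED_ABBREVIATIONS = {"US", "UK", "EU"}


def tokenize_words(sentence, protected=PROTECTED_ABBREVIATIONS):
    """Split on whitespace and lowercase every token except protected abbreviations."""
    tokens = []
    for raw in sentence.split():  # whitespace is the only delimiter
        # Ignore surrounding punctuation when checking for an abbreviation.
        core = raw.strip(".,;:!?\"'()")
        if core in protected:
            tokens.append(raw)          # keep the abbreviation's capitalization
        else:
            tokens.append(raw.lower())  # fold everything else to lowercase
    return tokens


print(tokenize_words("The US economy helps us all"))
# → ['the', 'US', 'economy', 'helps', 'us', 'all']
```

A fixed lookup set is the simplest possible rule; it will miss abbreviations that are not listed, so a real pipeline would need a more complete list or a pattern-based check.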

Stop-words Removal

Stopwords are frequently occurring words in a language that connect sentences but carry little meaning on their own. We used NLTK’s stopwords list to filter out stopwords such as ‘the’, ‘a’ and ‘so’ from the corpus. We also removed numbers and punctuation since they are not relevant to our analysis.
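The filtering described above might look like the following sketch. The small `STOPWORDS` set is a stand-in for NLTK's English list so that the example runs without any downloads, and the function name `remove_stopwords` is illustrative.

```python
import string

# Small illustrative stop-word set; a stand-in for NLTK's English list.
STOPWORDS = {"the", "a", "so", "and", "of", "in", "to", "is"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop stop-words, numbers and punctuation from a list of tokens."""
    kept = []
    for token in tokens:
        word = token.strip(string.punctuation)  # trim leading/trailing punctuation
        if not word:
            continue  # the token was punctuation only
        if word.replace(",", "").replace(".", "").isdigit():
            continue  # numbers such as 2018, 1,000 or 3.5
        if word.lower() in stopwords:
            continue  # stop-word
        kept.append(word)
    return kept


print(remove_stopwords(["the", "senate", "passed", "a", "bill", "in", "2018", "."]))
# → ['senate', 'passed', 'bill']
```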


Normalization of text refers to reducing inflected forms of words to their base forms. Stemming chops off the ends of inflected words, whereas lemmatization uses lexical knowledge bases to convert them to their root forms. For example, stemming turns ‘studies’ into ‘studi’ by chopping off ‘es’, while lemmatization returns the base form ‘study’.
We lemmatized our corpus using the WordNet Lemmatizer.
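The difference between the two approaches can be illustrated with a toy example. Neither function below is the algorithm used in this work: `toy_stem` is a naive suffix stripper rather than a real stemmer, and `TOY_LEMMAS` is a tiny hand-written lookup standing in for a lexical knowledge base such as WordNet.

```python
def toy_stem(word):
    """Strip a few common suffixes with no lexical knowledge (not a real stemmer)."""
    for suffix in ("es", "ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Tiny hand-written lookup standing in for a lexical knowledge base.
TOY_LEMMAS = {"studies": "study", "studying": "study", "studied": "study"}


def toy_lemmatize(word):
    """Look the word up in the knowledge base; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)


print(toy_stem("studies"), toy_lemmatize("studies"))
# → studi study
```

The stemmer needs no resources but can produce non-words such as ‘studi’, while the lemmatizer returns valid base forms only for words its knowledge base covers.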


Algorithms

In this work, we explore different features that could contribute to the automated detection of hyperpartisanship in our dataset, and we use the machine learning algorithms described here to test our hypotheses. We used the standard implementations of these algorithms from scikit-learn.

List of Figures 
List of Tables
1 Introduction 
1.1 Motivation
1.2 Contribution
1.3 Problem Statement
1.4 Outline of Thesis
2 Review of Literature 
3 Dataset 
3.1 Current Datasets In Practice
3.2 SemEval 2019 Task 4 – Hyperpartisan News Detection
4 Methods and Results 
4.1 Preprocessing
4.2 Algorithms
4.3 Feature Engineering
4.4 Headlines of articles
4.5 Publishers
4.6 Body of articles
5 Discussion 
6 Conclusions and Future Work 
Hyperpartisanship in Web Searched Articles
