Beyond the bag-of-words model: using dependency graphs

Opinion holder and target identification

Another basic task of opinion mining is the identification of the opinion holder and the opinion target. In other words, we need to know who holds the opinion and what the opinion is about. The purpose of this task is to filter opinions that are relevant to the given topic, since a text can contain several opinions on different subjects. For example, if the text is a movie review and the author writes about the experience of going to the cinema, we want to extract only the opinions about the movie and not those about the cinema. Knowing the opinion holder helps us to estimate demographics or to collect the opinions of a specific person.
The latter is useful for user personalization, i.e. selecting topics that a specific user prefers and avoiding things the user does not like.
The task of identifying the opinion holder and the opinion target is relatively difficult because of coreference chains, i.e. the same entity can be referred to in many ways. For example, a movie can be referred to by its title, by a pronoun (it), or by a noun phrase (movie, motion picture). The title of the movie may also have variations. Coreference resolution is the process of determining whether two expressions in natural language refer to the same entity in the world (Soon et al., 2001); a good coreference resolution system is needed to handle this problem (Stoyanov and Cardie, 2008).
In many cases, however, the task is unnecessary. When working with data from the Web, the opinion holder is often the author of the review or the blog post. The topic is usually also known. Movie reviews are usually located on a page dedicated to a specific movie (Figure 3.1), so we can assume that all the reviews on this page are about the same movie. Nevertheless, for an advanced opinion mining system, we want to make sure that the opinions we extract are related to the movie itself and belong to the review author. For example, a review may contain a quote or a reference to another critic's opinion.
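As a rough illustration of this filtering, the sketch below represents each opinion as a (holder, target, polarity) record and keeps only the opinions that are about the movie and that belong to the review author. The `Opinion` class, the alias set, and the sample records are hypothetical, and the alias lookup is only a naive stand-in for a real coreference resolution system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    holder: str    # who expresses the opinion
    target: str    # surface mention of what the opinion is about
    polarity: str  # "positive" or "negative"


def filter_opinions(opinions, target_aliases, holder):
    """Keep opinions whose target is a known alias and whose holder matches."""
    aliases = {alias.lower() for alias in target_aliases}
    return [
        o for o in opinions
        if o.target.lower() in aliases and o.holder == holder
    ]


# Hypothetical opinions extracted from a single movie review.
opinions = [
    Opinion("author", "the movie", "positive"),
    Opinion("author", "the cinema", "negative"),   # off-topic: about the venue
    Opinion("another critic", "it", "negative"),   # quoted, not the author's own
    Opinion("author", "it", "positive"),
]

# Assumed mentions that refer to the movie (a hand-written coreference chain).
movie_aliases = {"the movie", "it", "the motion picture"}

relevant = filter_opinions(opinions, movie_aliases, holder="author")
```

Here `relevant` keeps the two opinions that the author expresses about the movie and drops both the remark about the cinema and the quoted critic.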

Manual annotation of raw text (the DOXA project annotation)

Manual data annotation is the most straightforward way to compensate for the lack of annotated data, although it is also the most resource-consuming. To construct an annotated corpus, one first needs to collect raw data, then to apply an annotation scheme chosen in advance, and finally to validate the produced annotations (Wiebe and Cardie, 2005; Toprak et al., 2010). The DOXA project is supported by the competitiveness center CAP DIGITAL of the Île-de-France region and aims, among other things, at defining and implementing an OSA semantic model for opinion mining in an industrial context. We have developed the annotation scheme for the manual annotation and performed the evaluation of the participants' systems.

Collecting annotated data

Annotation schemes depend on the task for which they are designed. The minimum annotation requirement for polarity classification is to know the polarity of a text, which can be specified simply as positive or negative. Thanks to such a simple scheme, it is quite easy to collect an annotated corpus automatically from the Web.
Many web resources, such as e-commerce websites, let their users rate products or services (movies, hotels, restaurants, etc.) to help other users decide whether to purchase the reviewed entity. On many such websites, users can also leave a text comment describing their experience with the product or service. Movie fans write reviews about movies they have watched, travellers describe the service of the hotels where they stayed, and restaurant goers write critiques of restaurants. All this information can be easily collected using a simple web crawler, which yields opinionated texts together with their polarity, usually given as a discrete value on a fixed scale (star rating).
In addition to this, it is also possible in most cases to capture the opinion target and the opinion holder. This method, however, has its own issues:
• Rating interpretation: To separate reviews, one needs to decide which reviews to consider positive or negative given the user rating. In general, websites use a star rating system with a 1-5 (or 1-10) scale. In this case, researchers usually consider reviews with 1-2 stars as negative and reviews with 4-5 stars as positive. How to interpret intermediate values (e.g. 3 stars) is always an issue: they can be considered neutral, mixed, or weakly positive/negative opinions.
• Content extraction: Collecting documents from the Web always involves extracting content from an HTML page, as our final target is usually raw text. This means that we need to keep only the part containing the review and disregard the other elements of the page (e.g. navigation, advertising, irrelevant text, etc.). We often need to filter out HTML tags, decode entities, and fix broken character encoding.
• Copyright issues: Website content is often subject to copyright. Thus, before collecting the data, one needs to make sure that doing so does not violate the website's terms of use.
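The first two issues can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a complete crawler: it assumes a 1-5 star scale, returns `None` for the ambiguous 3-star case instead of choosing an interpretation, and uses the standard-library `html.parser` to drop tags and decode entities. The function and class names (`rating_to_polarity`, `TextExtractor`, `html_to_text`) and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


def rating_to_polarity(stars):
    """Map a 1-5 star rating to a polarity label (None for the middle value)."""
    if not 1 <= stars <= 5:
        raise ValueError("expected a rating between 1 and 5")
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None  # 3 stars: neutral, mixed, or weak opinion; left undecided


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; automatically.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the raw text of an HTML fragment without tags or entities."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical review fragment as it might appear inside a crawled page.
sample_html = (
    "<div><script>var x = 1;</script>"
    "<p>Great &amp; moving film.</p><p>Loved it!</p></div>"
)
review_text = html_to_text(sample_html)
label = rating_to_polarity(5)
```

A real pipeline would still need to locate the review element within the full page and to handle encoding problems, which this sketch leaves out.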

WordNet and graph based methods

WordNet is one of the largest lexical resources for the English language and is extensively used in scientific research. According to the description from the project homepage, WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
Figure 5.1 shows a graph representation of WordNet synsets and the relations between them. Many researchers use WordNet for sentiment analysis, as well as other resources based on it, such as WordNet-Affect and SentiWordNet. It has also been localized into other languages (Vossen, 1998; Navigli and Ponzetto, 2010; Vetulani et al., 2009; Fišer and Sagot, 2008).
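To make the graph view concrete, the sketch below stores a handful of synsets as nodes and their relations as edges, then measures how many relation links separate two words with a breadth-first search. The synsets, identifiers, and relations are hand-made toy data for illustration only and are not taken from WordNet; path length is just one simple operation that a graph-based method might build on, not a specific method from the literature.

```python
from collections import deque

# Toy data (NOT real WordNet content): synset id -> set of lemmas.
SYNSETS = {
    "s_good": {"good", "fine"},
    "s_excellent": {"excellent", "superb"},
    "s_bad": {"bad", "poor"},
    "s_terrible": {"terrible", "awful"},
}

# Toy relations between synsets: (source, relation name, destination).
RELATIONS = [
    ("s_good", "similar_to", "s_excellent"),
    ("s_bad", "similar_to", "s_terrible"),
    ("s_good", "antonym", "s_bad"),
]


def build_graph(relations):
    """Build an undirected adjacency map, ignoring the relation names."""
    graph = {}
    for src, _relation, dst in relations:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set()).add(src)
    return graph


def find_synset(word):
    """Return the id of the first synset containing `word`, or None."""
    for synset_id, lemmas in SYNSETS.items():
        if word in lemmas:
            return synset_id
    return None


def path_length(word_a, word_b, graph):
    """Number of relation links between the synsets of two words (BFS)."""
    start, goal = find_synset(word_a), find_synset(word_b)
    if start is None or goal is None:
        return None
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two synsets are not connected


graph = build_graph(RELATIONS)
distance = path_length("excellent", "terrible", graph)
```

In this toy graph, words in the same synset are at distance 0, while "excellent" and "terrible" are three links apart because the only path between them crosses the antonym edge.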

Table of contents:

Contents
List of Figures
List of Tables
1 Introduction
Sentiment Analysis
2 Definitions and terminology
2.1 Opinions, sentiments, and friends
2.2 Summing up
3 Opinion mining and sentiment analysis tasks
3.1 Subjectivity analysis and opinion detection
3.2 Polarity classification
3.3 Opinion holder and target identification
3.4 Opinion summarization
3.5 Irony identification
3.6 Opinion spam identification
4 Polarity classification in detail
4.1 Problem definition
4.2 Issues
4.3 Data
4.4 Evaluation
5 Approaches to polarity classification
5.1 Lexicon based approaches
5.2 Statistical based approaches
Automation and Adaptivity
6 Automatic lexicon construction from microblogs
6.1 Microblogging
6.2 Corpus collection and analysis
6.3 Lexicon construction from Twitter
6.4 Polarity classification
6.5 Conclusions
7 Beyond the bag-of-words model: using dependency graphs
7.1 Motivation
7.2 Related work
7.3 D-grams
7.4 Experiments
7.5 Conclusion
8 Improving weighting schemes for polarity classification
8.1 Data
8.2 Our method
8.3 Experiments and results
8.4 Should a sentiment analysis system be objective?
Applications
9 Disambiguating sentiment ambiguous adjectives in Chinese
9.1 SemEval 2010 task description
9.2 Our approach to sentiment disambiguation
9.3 Experiments and results
9.4 Conclusion
10 Polarity classification of Russian products reviews
10.1 ROMIP 2011 task description
10.2 Our approach to polarity classification
10.3 Experiments and results
10.4 Conclusions
11 Emotion detection in suicide notes
11.1 I2B2 2011 task description
11.2 Related textual analysis of suicide notes
11.3 Our approach to emotion detection
11.4 Experiments and results
11.5 Conclusion
Summary
12 Conclusion
13 Future work
14 Authors’ publications
14.1 International Journals
14.2 Domestic Journals
14.3 International conferences
14.4 Domestic conferences
14.5 Book chapters
14.6 International workshops
14.7 Talks
Bibliography