SENTIMENT.O (SENTIment Mining and ExtractioN of Web 2.0 Data)
Contact Info: Jürgen Broß
A great share of text mining applications such as information extraction, text classification, or sentiment analysis is based on supervised machine learning techniques. However, these generally require large expert-annotated (and therefore costly) training corpora to be created from scratch, specifically for the application at hand.
Obviously, it would be of great advantage if this manual effort could be reduced to a minimum. One way to achieve this goal is not to let experts compile the dataset, but rather to leverage a (potentially already existing) dataset that is constructed and in some way annotated by a large number of non-experts. In general, there are two approaches to exploiting the power of collaborating non-experts when creating an annotated training set for machine learning algorithms:
The first approach is to set up an environment specifically tuned for the sole purpose of collaboratively creating an annotated dataset. Such a dataset therefore contains explicitly labeled data and can be used directly as input to a machine learning environment.
As examples, take the Google Image Labeler [1,2] or the Microsoft PictureThis [3,4] application. Both are online games that encourage participants to label presented images with tags describing their content. Here, the intention is to use the labeled dataset as input for a machine learning process that learns to tag images based on their content. While Google considers the gaming character of the application appealing enough by itself, the Microsoft game additionally allows participants to exchange virtual points for real-world prizes as an incentive. Other examples of labeling games are [5,6].
In contrast, the second approach is to leverage collaboratively generated information that is readily available on the web. Such data has generally not been collected with the intention of serving as input to a machine learning environment and thus typically does not carry any explicit annotations or labels. The idea is then to apply robust heuristics and high-precision extraction techniques to derive accurate annotations, which in turn serve as input to a machine learning procedure.
To make things concrete, consider the following example: The Internet Movie Database (IMDB) is probably the largest and most prominent dataset of movie reviews. It is created in a collaborative fashion: users of the site are free to write reviews on any movie and additionally provide a rating on a 1-10 scale. Doubtless, the users' intention is anything but providing a dataset for machine learning techniques in the context of some web mining application. However, the existing data can easily be leveraged, for example as a training set for a sentiment classification task. Input to a machine learning-based classifier is the text of the review (encoded as a set of appropriate features) together with the rating provided by the user (taken as the label). In the most basic case, such a classifier is trained to classify movie reviews as containing positive or negative sentiment. Systems exist that follow exactly this idea [7,8].
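The rating-as-label idea can be sketched with a minimal Naive Bayes text classifier trained on rating-labeled reviews. The toy data, the rating thresholds, and the bag-of-words features below are illustrative assumptions; a real system of the kind cited above would use a large corpus and richer features:

```python
from collections import Counter
import math

def rating_to_label(rating):
    """Heuristic labeling: map a 1-10 user rating to a polarity label.
    Middling ratings (5-6) are discarded as ambiguous."""
    if rating >= 7:
        return "pos"
    if rating <= 4:
        return "neg"
    return None

def train_nb(reviews):
    """reviews: list of (text, rating). Returns a Naive Bayes model with
    Laplace-smoothed per-class word log-probabilities."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    class_counts = Counter()
    for text, rating in reviews:
        label = rating_to_label(rating)
        if label is None:
            continue  # skip ambiguous ratings
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    model = {}
    for label in ("pos", "neg"):
        total = sum(word_counts[label].values())
        model[label] = {
            "prior": math.log(class_counts[label] / sum(class_counts.values())),
            "loglik": {w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
                       for w in vocab},
            "default": math.log(1 / (total + len(vocab))),  # unseen words
        }
    return model

def classify(model, text):
    """Pick the class with the highest posterior log-probability."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["prior"] + sum(
            params["loglik"].get(w, params["default"]) for w in text.lower().split())
    return max(scores, key=scores.get)

# Hypothetical toy "reviews" with user-provided ratings as the only labels.
reviews = [("a great wonderful film", 9), ("terrible boring mess", 2),
           ("great acting great story", 8), ("boring and terrible", 1)]
model = train_nb(reviews)
```

The essential point is that no human annotator ever labeled these texts: the labels are derived automatically from the ratings the users supplied for their own purposes.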
There are further examples in other domains where an existing dataset is exploited in the described manner. The Kylin project [9], for example, utilizes the Wikipedia dataset. It heuristically combines the structured information contained in so-called "infoboxes" with the associated text of a Wikipedia article to train an information extraction component capable of extracting typed relations between named entities. A further example is the heuristically constructed training set used in the information extraction task described in [10].
The author's strong belief is that the advent of user-generated content (UGC) on the World Wide Web (so-called Web 2.0 content) is an enabling factor for the kind of training corpus construction outlined in the second approach. To support this belief, consider the following advantageous aspects of UGC (compared to traditional web content creation):
The SENTIMENT.O project examines the potential of the second approach (in the following, the "UGC-meta-data approach") to training set generation in the context of sentiment analysis applications.
In the following, a short introduction to the task of sentiment analysis is given first; subsequently, the project is described in more detail.
Sentiment analysis is a recent field of study, borrowing from work in information retrieval, text mining, machine learning, and computational linguistics. Research in this area examines the process of discovering and evaluating expressions of opinion or sentiment on specific objects such as persons, organizations, or topics in large numbers of textual resources.
Fields of application for a sentiment analysis component are manifold. For instance, business intelligence (BI) systems can be augmented to additionally evaluate information from unstructured data sources, e.g. for the analysis of customer feedback data. Another often-cited application is the mining of product reviews. Such a system supports a potential consumer in coping with the plethora of available information: typically, one has to wade through dozens of reviews before one feels able to make an informed product choice. A sentiment analysis system would render this effort almost unnecessary by presenting the user with a compact, informative summary of all opinions expressed in the reviews.
The most common techniques for sentiment analysis can be roughly subdivided into linguistic, machine learning/statistical, and lexicon-based approaches. Hybrid systems exist that combine different approaches.
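To illustrate the lexicon-based family of approaches, a minimal polarity scorer might look as follows. The tiny lexicon, its scores, and the one-token negation window are purely illustrative assumptions, not components of the project:

```python
# Minimal lexicon-based polarity scorer (illustrative lexicon only).
POLARITY_LEXICON = {
    "excellent": 2, "good": 1, "sharp": 1,
    "poor": -1, "bad": -1, "blurry": -2,
}
NEGATORS = {"not", "never", "no"}

def score_sentence(tokens):
    """Sum lexicon polarities over the tokens; flip the sign of a word
    immediately preceded by a negator (a deliberately crude heuristic)."""
    score = 0
    for i, tok in enumerate(tokens):
        polarity = POLARITY_LEXICON.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score
```

For example, `score_sentence("the image quality is not good".split())` yields a negative score because the negator flips the polarity of "good". Real lexicon-based systems use much larger lexica and more careful scope handling, but the core idea is the same.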
Sentiment analysis is chosen as an example of use for the studies conducted in this project for the following reasons:
Within the SENTIMENT.O project the UGC-meta-data approach is primarily examined within the context of fine-grained product review mining (PRM).
Briefly speaking, fine-grained PRM is an information extraction (IE) task. The goal of this task is to extract sentiment expressions and the corresponding product features mentioned in a corpus of product reviews. However, extending the traditional IE paradigm, the goal is also to determine the polarity of the identified sentiment expressions, i.e. whether the opinion is positive or negative. Extractions within this framework can thus be modeled as triples containing the sentiment target (here: the product feature), the sentiment expression, and the sentiment polarity. Depending on the level of detail of the model, sentiment polarity is modeled either as binary (positive vs. negative) or as an integer (representing a rating scale).
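The triple structure just described could be modeled roughly as follows (the field names and the example sentence are illustrative choices, not part of the project's specification):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class SentimentTriple:
    """One fine-grained PRM extraction: what is evaluated, how it is
    evaluated, and with which polarity. Polarity is either binary
    ('positive'/'negative') or an integer rating, depending on the
    granularity of the model."""
    target: str                # product feature, e.g. "image quality"
    expression: str            # sentiment expression, e.g. "razor sharp"
    polarity: Union[str, int]

# Example extraction from the sentence "The image quality is razor sharp."
t = SentimentTriple(target="image quality",
                    expression="razor sharp",
                    polarity="positive")
```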
Some meta-data available in product reviews is depicted in Figure 1 and described in the following:
Figure 1: Some meta-data of a typical product review from epinions.com
As indicated above, product reviews (as an example of user-generated content) are a meta-data-rich domain, and many ideas on how to leverage these annotations within a machine learning context suggest themselves. However, most of this meta-data cannot be used directly and needs to be preprocessed. Take, for example, the idea of using paragraph captions as labels for learning a topic classifier: different authors may use different captions for the same topic, e.g. author A uses the caption "Image Quality" while author B prefers "Picture Quality". Because these "labels" are expressed in unnormalized natural language (no conventions apply), typical issues such as synonymy and polysemy need to be dealt with. In this specific case, a preprocessing step would include a grouping of the most frequently used captions. One of the objectives of the SENTIMENT.O project is to study which preprocessing and cleansing steps are necessary to exploit novice-generated annotations available in web content. Further questions in this context are how much effort the preprocessing requires and how much influence potentially noisy data has.
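A first preprocessing step of the kind described, grouping frequent caption variants under a canonical topic label, could be sketched like this. The hand-made synonym table and the frequency cutoff are assumptions for illustration; a real system would have to learn or curate such a mapping:

```python
from collections import Counter

# Hypothetical mapping from normalized caption variants to canonical topics.
CANONICAL_TOPICS = {
    "image quality": "image quality",
    "picture quality": "image quality",   # synonym of "image quality"
    "battery life": "battery",
    "battery": "battery",
}

def group_captions(captions, min_count=2):
    """Normalize raw paragraph captions, map known variants to a canonical
    topic label, and keep only topics frequent enough to serve as reliable
    training labels for a topic classifier."""
    counts = Counter()
    for caption in captions:
        normalized = caption.strip().lower()
        topic = CANONICAL_TOPICS.get(normalized)
        if topic is not None:
            counts[topic] += 1
    return {topic: n for topic, n in counts.items() if n >= min_count}
```

Here the captions "Image Quality" and "Picture Quality" end up under a single label, while rare or unknown captions are dropped rather than introducing label noise.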
Product review mining has been introduced as an extended information extraction task: triples of sentiment expression, sentiment polarity, and sentiment target (product aspect) need to be identified. One approach is to subdivide this task into several steps: one could separately learn an extractor for product aspects and one for sentiment expressions including their polarity, and then use the two in conjunction to extract the desired triple structure. The SENTIMENT.O project aims at examining which machine learning techniques are best suited for this approach (i.e. robust to noisy data) and how well they perform.

As an alternative, one could follow a more holistic approach. The term holistic in this context means adopting an integrated view of the extraction task instead of subdividing it into different (independent) steps. This may more naturally fit the way a review is authored, as it reflects the interdependence of the words chosen to express sentiment on a specific product aspect. Some previous work in this direction is [11] and [12], where generative models are introduced. Within the SENTIMENT.O project such a holistic approach is to be studied. Focus is laid on the question of how to beneficially integrate the different annotations and text sources available in a review into a single (generative) model. A further question in this context is how to incorporate existing structured information, such as sentiment lexica or product taxonomies, into the model.
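The stepwise variant can be outlined as a small pipeline. Both component extractors are stubbed here with trivial keyword matchers, and the nearest-aspect attachment rule is an illustrative assumption; in the project these components would be learned models:

```python
# Stepwise PRM pipeline sketch: two independent "extractors" whose outputs
# are combined into (target, expression, polarity) triples.
ASPECT_KEYWORDS = {"lens", "battery", "display"}              # stub aspect extractor
SENTIMENT_WORDS = {"great": "positive", "weak": "negative"}   # stub sentiment extractor

def extract_triples(tokens):
    """Run both stub extractors over a tokenized review sentence, then attach
    each sentiment expression to the nearest extracted aspect by token
    distance (a simple stand-in for a learned association step)."""
    aspects = [(i, t) for i, t in enumerate(tokens) if t in ASPECT_KEYWORDS]
    sentiments = [(i, t, SENTIMENT_WORDS[t]) for i, t in enumerate(tokens)
                  if t in SENTIMENT_WORDS]
    triples = []
    for si, word, polarity in sentiments:
        if not aspects:
            continue  # sentiment without any aspect in scope
        _, target = min(aspects, key=lambda a: abs(a[0] - si))
        triples.append((target, word, polarity))
    return triples
```

On a sentence like "great lens but weak battery" this yields one triple per sentiment expression, each paired with the closest aspect. The holistic alternative would instead model aspects, expressions, and polarities jointly rather than chaining such independent steps.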
For more information on this project feel free to contact me via email...
[1] Google Image Labeler: http://images.google.com/imagelabeler/
[2] L. von Ahn and L. Dabbish. Labeling Images with a Computer Game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2004, Vienna, Austria.
[3] Microsoft PictureThis: http://club.live.com/Pages/Games/GameList.aspx?game=Picture_This
[4] P. Bennett, D. Chickering, and A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. In Proceedings of the International World Wide Web Conference (WWW), 2009, Madrid, Spain.
[5] L. von Ahn, R. Liu, and M. Blum. Peekaboom: A Game for Locating Objects in Images. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2006.
[6] E. Law, L. von Ahn, R. Dannenberg, and M. Crawford. Tagatune: A Game for Music and Sound Annotation. In Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR), 2007.
[7] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs Up?: Sentiment Classification Using Machine Learning Techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002.
[8] B. Pang and L. Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proceedings of the ACL, 2004.
[9] F. Wu and D. Weld. Autonomously Semantifying Wikipedia. In Proceedings of CIKM, 2007.
[10] M. Banko and O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. In Proceedings of the ACL, 2008.
[11] I. Titov and R. McDonald. A Joint Model of Text and Aspect Ratings for Sentiment Summarization. In Proceedings of ACL-08: HLT, 2008.
[12] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of the 16th International Conference on World Wide Web (WWW), 2007.