
SENTIment Mining and ExtractioN of Web 2.0 Data




Contact Info: Jürgen Broß


  • Study to what extent annotations, meta-data, or text structure available in user-generated web content can be exploited to automatically extract large training corpora for machine learning algorithms in web mining applications.
  • Analyze and categorize different types of available annotations and meta-data.
  • Examine which preprocessing and cleansing steps are necessary to exploit novice-generated annotations available in web content.
  • Examine which machine learning techniques are best suited for such potentially 'noisy' training sets (and explain why!).
  • As an example of use, conduct the study in the application field of sentiment analysis: Build a software system that is capable of fine-grained extraction and rating of sentiment expressions found in product reviews.



A large share of text mining applications, such as information extraction, text classification, or sentiment analysis, are based on supervised machine learning techniques. However, these generally require large expert-annotated (and therefore costly) training corpora that must be created from scratch, specifically for the application at hand.

It would clearly be a great advantage if this manual effort could be reduced to a minimum. One way to achieve this is to not have experts compile the dataset, but rather to leverage a (potentially existing) dataset that has been constructed and in some way annotated by a large number of non-experts. In general, there are two approaches to exploiting the potential of collaborating non-experts when creating an annotated training set for machine learning algorithms:


A first approach is to set up an environment that is specifically tuned for the sole purpose of collaboratively creating an annotated dataset. Such a dataset therefore contains explicitly labeled data and can directly be used as input to a machine learning environment.

As an example, take the Google Image Labeler [1,2] or the Microsoft PictureThis [3,4] application. Both are online games that encourage participants to label presented images with tags describing their content. Here, the intention is to use the labeled dataset as input for a machine learning process that is able to tag images based on their content. While Google considers the gaming character of the application appealing enough, the Microsoft game also allows players to exchange virtual points for real-world prizes as an incentive. Other examples of labeling games are [5,6].


In contrast, a second approach is to leverage collaboratively generated information that is readily available on the web. This information has generally not been collected with the intention of serving as input to a machine learning environment and thus typically does not exhibit any explicit annotations or labels. The idea is then to apply robust heuristics and high-precision extraction techniques to derive accurate annotations, which serve as input to a machine learning procedure.

To make this concrete, consider the following example: The Internet Movie Database (IMDB) is probably the most prominent and largest dataset of movie reviews. It is created in a collaborative fashion: users of the site are free to write reviews on any movie and additionally provide a rating on a 1-10 scale. Without doubt, the users' intention is anything but to provide a dataset for machine learning techniques in some web mining application. However, the existing data can, for example, easily be leveraged as a training set for a sentiment classification task. The input to a machine learning-based classifier is the text of the review (encoded as a set of appropriate features) and the rating provided by the user (taken as the label). In the most basic case, such a classifier is trained to classify movie reviews as containing positive or negative sentiment. Systems exist that follow exactly this idea [7,8].
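The idea of taking user ratings as labels can be sketched in a few lines of Python. This is only a minimal illustration and not the method of [7,8]: the rating thresholds (`low`, `high`), the whitespace tokenizer, the tiny Naive Bayes classifier, and the toy reviews are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def rating_to_label(rating, low=4, high=7):
    """Map a 1-10 rating to a binary label (thresholds are assumptions)."""
    if rating <= low:
        return "neg"
    if rating >= high:
        return "pos"
    return None  # ambiguous middle ratings are dropped

def tokenize(text):
    # Simplistic bag-of-words features: lowercase whitespace tokens.
    return text.lower().split()

def train_naive_bayes(reviews):
    """Train on (text, rating) pairs; the rating serves as the label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, rating in reviews:
        label = rating_to_label(rating)
        if label is None:
            continue
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # unseen words are ignored
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy data standing in for real reviews with 1-10 ratings.
toy_reviews = [
    ("a wonderful moving film", 9),
    ("great acting and a great story", 8),
    ("boring plot and terrible acting", 2),
    ("a dull terrible film", 3),
    ("it was okay", 5),
]
model = train_naive_bayes(toy_reviews)
print(classify(model, "wonderful story"))      # pos
print(classify(model, "terrible and boring"))  # neg
```

Dropping the middle ratings trades recall for precision: fewer reviews are used, but the remaining labels are less ambiguous.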

There are more examples in other domains where an existing dataset is exploited in the described manner. The Kylin project [9] for example utilizes the Wikipedia dataset. It heuristically combines the structured information contained in so-called "infoboxes" with the associated text of a Wikipedia article to train an information extraction component capable of extracting typed relations between named entities. A further example is the heuristically constructed training set used in the information extraction task described in [10].


The author strongly believes that the advent of user-generated content (UGC) on the World Wide Web (so-called Web 2.0 content) is an enabling factor for the kind of training corpus construction outlined in the second approach. To support this belief, consider the following advantageous aspects of UGC (compared to traditional web content creation):

  • Homogeneity and Structure:
    • A UGC-enabled website typically uses structured interfaces (e.g. HTML forms) to let users create content on the site. This structure is then reflected by the HTML templates that render the content. In effect, UGC is more likely to exhibit some structure at all, and the use of templates promotes homogeneity and thus easy and accurate content extraction.
    • In order to structure the content, UGC-enabled sites encourage their users to provide meta-data about their entries. This meta-data may, for instance, be given in the form of textual tags, mappings to an existing topic (e.g. for a forum entry), one-sentence summaries, or a rating on some scale (e.g. for a product review).
    • Contributors to a UGC-based site often agree on style and structure conventions (the Wikipedia "Manual of Style" being a good example). This enhances homogeneity and therefore promotes the use of extraction heuristics.
  • Accessibility:
    • Web 2.0 platforms typically provide structured access to their content, e.g. by means of web-services, RSS-feeds or ATOM-feeds.
  • Amount of Data:
    • Due to their openness, UGC-enabled websites provide the potential of massive online collaboration. The huge amount of information available through popular sites promotes the use of high precision/low recall heuristics for the extraction of labeled data.


The SENTIMENT.O project aims to examine the potential of the second approach (in the following, the "UGC-meta-data approach") to training set generation in the context of sentiment analysis applications.

In the following, a short introduction to the task of sentiment analysis is given first; subsequently, the project is described in more detail.


Sentiment Analysis:

Sentiment analysis is a recent field of study that borrows from work in the research areas of information retrieval, text mining, machine learning, and computational linguistics. Research in this area examines the process of discovering and evaluating opinion or sentiment expressions about specific objects, such as persons, organizations, or topics, in a large number of textual resources.

Fields of application for a sentiment analysis component are manifold. For instance, business intelligence (BI) systems can be augmented to additionally evaluate information from unstructured data sources, e.g. by analyzing customer feedback data. Another often-cited application is the mining of product reviews. Such a system helps a potential consumer cope with the plethora of available information. Typically, one has to wade through dozens of reviews before feeling able to make an informed product choice. A sentiment analysis system would render this effort almost unnecessary by presenting the user with a compact, informative summary of all opinions expressed in the reviews.


The most common approaches to sentiment analysis can be roughly subdivided into linguistic, machine learning / statistical, and lexicon-based approaches. Hybrid systems that combine different approaches also exist.


Sentiment analysis is chosen as an example of use for the studies conducted in this project for the following reasons:

  • Machine learning techniques play a major role in the automated extraction and analysis of sentiment from textual resources.
  • The so-called Web 2.0 contains a huge amount of "sentiment-charged" data. Sentiments are for instance predominantly expressed in blogs, micro-blogs, reviews or forum posts.
  • Strong motivations for sentiment analysis of Web 2.0 data exist (see above, e.g. BI).
  • Sentiment analysis is a hot topic ;-)


Project Details:

Within the SENTIMENT.O project the UGC-meta-data approach is primarily examined within the context of fine-grained product review mining (PRM).

Briefly speaking, fine-grained PRM is an information extraction task. The goal of this task is to extract sentiment expressions and corresponding product features mentioned in a corpus of product reviews. However, extending the traditional IE paradigm, the goal is also to determine the polarity of the identified sentiment expressions, i.e. whether the opinion is positive or negative. Thus extractions within this framework can be modeled as triples, containing the sentiment target (here: product feature), the sentiment expression and the sentiment polarity. Depending on the level of detail of the model, sentiment polarity is modeled either as binary (positive vs. negative) or as integer (representing a rating scale).
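The triple structure described above could be represented as follows. This is a minimal sketch: the class name `SentimentTriple`, the helper `summarize`, the +1/-1 encoding of binary polarity, and the example extractions are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SentimentTriple:
    target: str      # sentiment target, here a product feature
    expression: str  # the sentiment expression found in the text
    polarity: int    # binary model: +1 or -1 (a rating model could store e.g. 1..5)

def summarize(triples):
    """Count positive and negative extractions per product feature."""
    summary = defaultdict(lambda: {"pos": 0, "neg": 0})
    for t in triples:
        summary[t.target]["pos" if t.polarity > 0 else "neg"] += 1
    return dict(summary)

# Made-up extractions for a camera review.
extractions = [
    SentimentTriple("display", "too small", -1),
    SentimentTriple("display", "bright", +1),
    SentimentTriple("zoom lens", "sharp", +1),
]
print(summarize(extractions))
# {'display': {'pos': 1, 'neg': 1}, 'zoom lens': {'pos': 1, 'neg': 0}}
```

Aggregating triples per target in this way is one simple route to the kind of compact opinion summary mentioned in the introduction to sentiment analysis.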

Some meta-data available in product reviews is depicted in Figure 1 and described in the following:

  • Rating Annotation: Authors summarize their overall opinion by providing a numerical value on some rating scale (here: 1-5 stars). Some review sites allow authors to rate selected aspects of a product (here: ease of use, durability, ...).
    Rating annotations can easily be used as labels in a sentiment classification task, i.e. when the whole review is to be classified as conveying either positive or negative sentiment.
  • Pros / Cons: Authors summarize the main advantages and disadvantages of the reviewed product. Typically this meta-data is provided as a comma-separated list. Authors either simply enumerate the positive (or, respectively, negative) product aspects without further description, or provide more detail by briefly giving the reason for listing an aspect under pros or cons (e.g. "cons: display" vs. "cons: too small display").
    Listings of pros and cons provide useful information that can be exploited to (machine-) learn expressions of positive and negative sentiment with regard to specific product aspects. They can also be utilized as labels in a topic-sentiment-detection task, i.e. predicting which sentiments on which topics are expressed in a review.
  • Summary: In addition to condensing their overall opinion into a numerical value, authors also summarize it in textual form. This part typically contains the author's conclusions.
    This annotation in conjunction with the numerical rating may be used as a hint in learning positive or negative expressions.
  • Captions: A large share of authors structure their reviews by separating the text into several paragraphs, each focusing on a specific aspect of the product. To enhance readability, authors give these paragraphs suitable captions. For example, the camera review shown in Figure 1 contains three paragraphs with the captions "Viewfinder/LCD", "Zoom Lens" and "Optical Image Stabilization (OIS)".
    The combination of paragraph text and associated caption can be utilized for learning a topic classifier.
  • Publication Date: This meta-data can be used to track sentiments over time. Note that it is not considered helpful in any training set generation process.
  • Author: Many review sites offer some meta-data about the author of a review. Besides demographic data (e.g. location, age, interests), the primary intent is to provide trust-oriented information such as the number of reviews written by the author or the share of readers who have voted positively on this author's reviews.
    Note that this meta-data is not considered helpful in any training set generation process, but may be leveraged to filter out unhelpful or spam reviews.


Figure 1: Some meta-data of a typical product review from epinions.com
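As a small illustration of how the pros / cons meta-data described above might be turned into labeled examples, consider the following sketch. The function name `parse_pros_cons`, the label names, and the example field contents are assumptions; real pros / cons fields are noisier than a clean comma-separated list.

```python
def parse_pros_cons(pros, cons):
    """Turn comma-separated pros/cons fields into (phrase, polarity) pairs."""
    labeled = []
    for field, polarity in ((pros, "pos"), (cons, "neg")):
        for item in field.split(","):
            phrase = item.strip().lower()
            if phrase:  # skip empty entries such as a trailing comma
                labeled.append((phrase, polarity))
    return labeled

# Made-up field contents for a camera review.
examples = parse_pros_cons("Great zoom, easy to use", "too small display, battery life,")
for phrase, polarity in examples:
    print(polarity, phrase)
```

Entries such as "battery life" name only an aspect, while entries such as "too small display" also carry a sentiment expression; separating the two would require a further step that is not shown here.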



As indicated above, product reviews (as an example of user-generated content) are a meta-data-rich domain, and thus many ideas arise on how to leverage these annotations within a machine learning context. However, most of this meta-data cannot be used directly and needs to be preprocessed. Take, for example, the idea of using paragraph captions as labels for learning a topic classifier: different authors may use different captions for the same topic, e.g. author A uses the caption "Image Quality" while author B prefers "Picture Quality". Because these "labels" are expressed in unnormalized natural language (there are no conventions), typical issues such as synonymy and polysemy need to be dealt with. In this specific case, a preprocessing step would include grouping the most frequently used captions. One of the objectives of the Sentiment.o project is to study which preprocessing and cleansing steps are necessary to exploit novice-generated annotations available in web content. Further questions in this context are how much effort preprocessing requires and how much influence potentially noisy data has.
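The caption-grouping step mentioned above might look roughly like this. It is a sketch under stated assumptions: the synonym table `SYNONYMS` is hand-written and hypothetical, the frequency threshold `min_count` is arbitrary, and polysemy is not addressed at all.

```python
import re
from collections import Counter

# Hypothetical, hand-written synonym table; a real one would have to be
# built or learned for the product domain at hand.
SYNONYMS = {
    "picture quality": "image quality",
    "photo quality": "image quality",
}

def normalize_caption(caption):
    """Lowercase, strip punctuation, and map known synonyms to one topic."""
    text = re.sub(r"[^a-z0-9 ]+", " ", caption.lower())
    text = " ".join(text.split())
    return SYNONYMS.get(text, text)

def frequent_topics(captions, min_count=2):
    """Keep only normalized captions that occur at least min_count times."""
    counts = Counter(normalize_caption(c) for c in captions)
    return {topic for topic, n in counts.items() if n >= min_count}

# Made-up captions from several reviews.
captions = ["Image Quality", "Picture quality:", "Zoom Lens", "zoom lens", "My two cents"]
print(sorted(frequent_topics(captions)))
# ['image quality', 'zoom lens']
```

Rare captions such as "My two cents" are discarded, which fits the high-precision / low-recall strategy that the large amount of available data makes affordable.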

Product review mining has been introduced as an extended information extraction task: triples of sentiment expressions, sentiment polarity, and sentiment targets (product aspects) need to be identified. One approach is to subdivide this task into several steps. One could separately learn an extractor for product aspects and one for sentiment expressions including their polarity. These are then used in conjunction to extract the desired triple structure. The Sentiment.o project aims to examine which machine learning techniques are best suited (i.e. robust to noisy data) for this approach and how well they perform. As an alternative, one could follow a more holistic approach. The term holistic in this context means adopting an integrated view of the extraction task instead of subdividing it into different (independent) steps. This may fit more naturally the way a review is authored, as it reflects the interdependence of the words chosen to express sentiment on a specific product aspect. Some previous work in this direction is [11] and [12], where generative models are introduced. Such a holistic approach is to be studied within the Sentiment.o project. The focus is on the question of how to beneficially integrate the different annotations and text sources available in a review into a single (generative) model. A further question in this context is how to incorporate existing structured information, such as sentiment lexica or product taxonomies, into the model.
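To illustrate the subdivided (two-step) approach, the following sketch combines two independent extractors into triples. It is not the project's system: the two extractors are plain dictionary lookups standing in for learned components, the word lists `ASPECTS` and `SENTIMENT` are hypothetical, and pairing each sentiment expression with the nearest aspect mention is an assumed heuristic.

```python
# Hypothetical stand-ins for two separately learned extractors.
ASPECTS = {"display", "battery", "lens"}
SENTIMENT = {"great": 1, "sharp": 1, "small": -1, "poor": -1}

def extract_aspects(tokens):
    """Return (position, aspect) for every aspect mention."""
    return [(i, t) for i, t in enumerate(tokens) if t in ASPECTS]

def extract_sentiments(tokens):
    """Return (position, expression, polarity) for every sentiment word."""
    return [(i, t, SENTIMENT[t]) for i, t in enumerate(tokens) if t in SENTIMENT]

def extract_triples(sentence):
    """Pair each sentiment expression with the nearest aspect mention."""
    tokens = sentence.lower().replace(",", " ").replace(".", " ").split()
    aspects = extract_aspects(tokens)
    triples = []
    for pos, expression, polarity in extract_sentiments(tokens):
        if not aspects:
            continue  # no target available in this sentence
        _, target = min(aspects, key=lambda a: abs(a[0] - pos))
        triples.append((target, expression, polarity))
    return triples

print(extract_triples("The display is small, but the lens is sharp."))
# [('display', 'small', -1), ('lens', 'sharp', 1)]
```

Because the two extractors run independently, the pairing heuristic has to reconstruct the link between expression and target afterwards, which is exactly the dependency a holistic model would capture directly.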


For more information on this project feel free to contact me via email...



[1] Google Image Labeler: http://images.google.com/imagelabeler/

[2] Luis von Ahn and Laura Dabbish. Labeling Images with a Computer Game. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2004, Vienna, Austria

[3] Microsoft PictureThis: http://club.live.com/Pages/Games/GameList.aspx?game=Picture_This

[4] P. Bennett, D. Chickering and A. Mityagin. Learning Consensus Opinion: Mining Data from a Labeling Game. Proceedings of the International World Wide Web Conference (WWW), 2009, Madrid, Spain

[5] L. von Ahn, R. Liu, and M. Blum. Peekaboom: A Game for Locating Objects in Images. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2006,

[6] E. Law, L. von Ahn, R. Dannenberg and M. Crawford. Tagatune: A Game for Music and Sound Annotation. In Proceedings of the 8th International Conference on Music Information Retrieval, 2007

[7] B. Pang, L. Lee and S. Vaithyanathan. Thumbs Up?: Sentiment Classification Using Machine Learning Techniques. In Proceedings of the ACL Conference, 2002

[8] B. Pang, L. Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proceedings of the ACL Conference, 2004

[9] F. Wu and D. Weld. Autonomously Semantifying Wikipedia. In Proceedings of CIKM07, 2007

[10] M. Banko and O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. Proceedings of the ACL, 2008

[11] I. Titov and R. McDonald. A Joint Model of Text and Aspect Ratings for Sentiment Summarization. In Proceedings of ACL-08: HLT, 2008

[12] Q. Mei, X. Ling, M. Wondra, H. Su, CX. Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. In Proceedings of the 16th international conference on World Wide Web (WWW), 2007