Semantics Empowered Social Computing
Amit Sheth and Meenakshi Nagarajan
The Social Web, created by user-generated content, and the Semantic Web, a vision of a web of machine-understandable documents and data, are fast converging into a Semantic Social Web. On this Semantic Social Web, also hailed as Web 3.0, principles of knowledge representation in Ontologies and document-level metadata will be used to organize and analyze Social Media content. Vital to the success of the Semantic Social Web is understanding how the Social Web will be enriched by the Semantic Web.
The Social Web of today includes not only data or web pages and links between them, but also people, connections between people, and the connections that people make with data. Popular Web 2.0 technologies such as tagging, blogging and bookmarking sites, review sites, social networking sites, and image and video sharing sites have made it very easy for people to consume, produce and share information. This new class of content, also called user-generated content, is now one of the richest forms of content on the Web.
The Semantic Web is a vision where data is made more meaningful by labeling (marking up, tagging, or annotating) it. This is often achieved by using an agreed-upon reference model such as nomenclatures, dictionaries, taxonomies, folksonomies or ontologies that represent a model of a domain. Annotations to agreed vocabularies make documents and data machine-understandable as well as easier to integrate and analyze. Both machines and humans can make better use of data by utilizing the rich relationships between data formally expressed using annotations. When an ontology, a richer form of modeling, is used, simple to complex rules explicitly stated or inferred from the properties of classes and relationships in the Ontology help in reasoning over annotated data. Today, communities in varied domains such as life sciences, health care, finance, and music have begun to provide ontologies with associated knowledge or instance bases (i.e., populated ontologies) that richly describe their domain. Services that allow the use of populated ontologies for annotation, and smarter applications that exploit annotations and rules over them, have been available since early in the decade and are becoming increasingly common.
While Social Media has been very successful in simplifying the process of content production, consumption and sharing, it has not been sophisticated enough to allow users to add or preserve the semantics behind the content. So, when a user is writing about an object, say a ‘Wii Microphone’, there is no way for the user to refer to a unique agreed-upon identifier for that object. Consequently, when someone is looking for information about a ‘Wii Microphone’, it is not possible on the Web or the Social Web to bring everything we know about the object to one place (see Figure 1). One of the goals of the Semantic Web was to bridge this content gap on the Web.
The goal of Web 3.0 or the Semantic Social Web is to provide a similar capability on the Social Web, except now, we are met with a set of newer challenges owing to the nature of user-generated content.
Using Background Knowledge for Semantic Metadata Creation
User-generated online content (UGC) has been around in one form or another since the earliest days of the Internet. The difference now is that the volume and variety of user-generated content have surpassed those of traditional content, thanks to the ubiquitous availability of high-speed Internet and easy-to-use social software.
An important challenge that Web 3.0 applications will face is the process of creating markups or annotations from user-generated content to common reference models or Ontologies. We believe that this is not going to be the same problem as it was for the traditional content on the Web.
User-generated content on social media has unique characteristics that set it apart from the traditional content we find in news or scientific articles. Because communication on social media serves interpersonal and interactional purposes, UGC is inherently less formal and unmediated. Off-topic discussions are commonplace, making the automatic identification of context harder. Content is often fragmented and does not follow the rules of English grammar, especially content generated by the teen and tween demographic. Some UGC is also terse by nature, such as Twitter posts, leaving minimal cues for automatically identifying what the content is about. In addition, domain- and demographic-specific slang, abbreviations and entity variations (‘skik3’ for ‘SideKick 3’) make the process of identifying what a document is about more challenging.
We posit that the role of Ontologies and knowledge bases in content analysis will be more important than before. They will not only act as common reference models but will also play a key role in inferring semantics behind user-generated content, while supplementing well-known statistical and natural language processing (NLP) techniques.
Semantics reinforced by background knowledge can make it possible to deal with complexities of user generated content.
Here, we use a few examples from online user-generated textual content to illustrate the challenges faced by well-known tasks like named entity identification and how background knowledge can help.
1. Ambiguity in Entity Mentions: Consider this post on a Music Group, “Lily I loved your cheryl tweedy do..heart Amy”. The post is referring to artist Lily Allen’s music track ‘Cheryl Tweedy’. The poster Amy also shares a first name with a popular artist ‘Amy Winehouse’. Assuming that the end goal is to annotate artist and track/album mentions, the task here is to decide whether entities Lily, Cheryl Tweedy and Amy in the post are of interest.
In such cases of ambiguity, a knowledge base with explicit relationships provides context beyond word distributions in a corpus. A domain model such as MusicBrainz, for example, will tell us that ‘Cheryl Tweedy’ is a track by artist ‘Lily Allen’, and that ‘Lily Allen’ and ‘Amy Winehouse’ are different artists from different genres, Pop and Jazz respectively. The lack of additional support for ‘Amy’ from the knowledge base, in spite of the capitalized first letter and the sentence parse assigning a noun tag (see Figure 2), could be taken into consideration before annotating the mention.
2. Identifying Entities: In this post, “Lils smile so rocks”, a knowledge base will tell us that ‘Smile’ is a track by ‘Lily Allen’ (with a high string similarity between Lily and Lils) and is a possible entity of interest. This can be considered strong support in spite of the verb part-of-speech tag for ‘smile’ (see Figure 3) and the lack of first-letter capitalization, which is otherwise a strong cue.
Similarly, in the tweet “Steve says: All Zunes and OneCares must go, at prices permanently slashed!”, it is safe to conclude that ‘Steve’ here is referring to ‘Steve Ballmer’, the CEO of Microsoft, given that a knowledge base mentions Zunes and OneCares as Microsoft products and Steve as its CEO.
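The knowledge-base support described in examples 1 and 2 can be sketched roughly as follows. This is only an illustrative Python sketch, not the system described in this article: the tiny KNOWLEDGE_BASE dictionary, the function names and the 0.7 similarity threshold are all assumptions made for the example, and string similarity is computed with the standard difflib module rather than any MusicBrainz facility. An artist mention is accepted only when a token resembles the artist's first name and one of that artist's tracks is also mentioned in the post.

```python
import re
from difflib import SequenceMatcher

# Hypothetical miniature knowledge base (artist -> known tracks).
# A real system would query a populated ontology such as MusicBrainz.
KNOWLEDGE_BASE = {
    "Lily Allen": ["Smile", "Cheryl Tweedy"],
    "Amy Winehouse": ["Rehab", "Back to Black"],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(a, b):
    """String similarity in [0, 1] based on difflib's matching blocks."""
    return SequenceMatcher(None, a, b).ratio()


def spot_entities(post, threshold=0.7):
    """Return (artist, mentioned_tracks) pairs supported by the knowledge base.

    The threshold is an assumed value chosen for this example.
    """
    tokens = tokenize(post)
    padded = " " + " ".join(tokens) + " "
    results = []
    for artist, tracks in KNOWLEDGE_BASE.items():
        first_name = tokenize(artist)[0]
        # Cue 1: some token looks like the artist's first name (Lils ~ Lily).
        name_hit = any(similarity(tok, first_name) >= threshold for tok in tokens)
        # Cue 2: a track by the same artist appears in the post.
        mentioned = [
            track for track in tracks
            if " " + " ".join(tokenize(track)) + " " in padded
        ]
        if name_hit and mentioned:
            results.append((artist, mentioned))
    return results


print(spot_entities("Lily I loved your cheryl tweedy do..heart Amy"))
print(spot_entities("Lils smile so rocks"))
```

In the first post, ‘Amy’ matches an artist's first name but is dropped because no track by that artist is mentioned; in the second, ‘Lils’ and ‘smile’ reinforce each other despite the missing capitalization.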
3. Off-topic Noise: Given the tendency for users to digress in informal settings, removing off-topic noise is an important task toward understanding what the content is about.
Consider the post from a social network forum, shown in Figure 4, where the user is talking about a project using ‘Sony Vegas Pro 8’ but is digressing to other topics. Keywords ‘Merrill Lynch’, ‘food poisoning’ and ‘eggs’ are clearly off-topic to this context.
Figure 4 A user post from MySpace Computers Forum with off-topic keywords
In addition to association strengths between words (measured using corpus statistics), a knowledge base of computer software (generated from http://computers.shop.ebay.com/Computers-Networking__W0QQ_sacatZ58058, for example) will readily tell us that none of the off-topic keywords are relevant to the discussion. Consulting such rich factual information can often also be less expensive than more rigorous statistical and NLP techniques.
The presence of off-topic noise affects the results of content analysis applications, especially when there is a strong monetary value associated with the content. User activity on Social Networking forums that contains explicit purchase intents is an excellent contender for monetization. Advertisements shown on this medium have high visibility and also higher chances of being clicked, provided they are relevant to the user content. Figure 5 shows an example of the targeted nature of advertisements delivered before and after removing off-topic noise.
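A knowledge-base lookup of this kind might look like the following Python sketch. It is illustrative only: SOFTWARE_KB is a hypothetical hand-written sample rather than data drawn from the eBay listing above, and the rule used here (a keyword is on-topic if all of its tokens occur in a single knowledge-base entry) is an assumed simplification that ignores the corpus-based association strengths a fuller system would combine with it.

```python
import re

# Hypothetical sample of software product names standing in for a
# knowledge base of computer software.
SOFTWARE_KB = ["Sony Vegas Pro 8", "Adobe Premiere Pro", "Final Cut Pro"]


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_on_topic(keywords, kb=SOFTWARE_KB):
    """Partition keywords into (on_topic, off_topic) using the knowledge base."""
    kb_token_sets = [tokens(entry) for entry in kb]
    on_topic, off_topic = [], []
    for keyword in keywords:
        kw_tokens = tokens(keyword)
        # Supported if every token of the keyword occurs in one KB entry.
        supported = bool(kw_tokens) and any(
            kw_tokens <= entry for entry in kb_token_sets
        )
        (on_topic if supported else off_topic).append(keyword)
    return on_topic, off_topic


on_topic, off_topic = split_on_topic(
    ["Sony Vegas Pro 8", "Merrill Lynch", "food poisoning", "eggs"]
)
print(on_topic)
print(off_topic)
```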
Figure 5 Contextual Advertisements showing the importance of eliminating off-topic noise in user-generated content
Annotating user-generated content using common reference models will undoubtedly empower applications that need to present a holistic view of all information available to a user. Advertising programs will be able to use user messages within a network as potential advertisements. Content delivery applications such as Zemanta (http://www.zemanta.com/), which matches keywords in content to provide additional related information, can not only serve content with higher accuracy but also suggest other content via related concepts in the model.
Proof in the Pudding: Using Background Knowledge to Analyze User Comments
In a recent work, we implemented a content-analysis system that mined the popularity of music artists from user comments on MySpace artist pages. We designed two annotators: (a) an artist and music annotator that spots artist, album, track, and other music-related (e.g. labels, tours, shows, concerts) mentions in a user post and (b) a sentiment annotator that detects sentiment expressions and measures their polarities. The artist and music annotator used MusicBrainz, the results of a natural language parse of the comment, and corpus statistics to spot track and artist mentions. The sentiment annotator used natural language parse results and a slang dictionary, UrbanDictionary, to spot sentiment expressions and identify their polarities. For both annotators, the combination of techniques proved to be more useful than using techniques in isolation. Positive and negative sentiments were aggregated for all artists to generate a ranked list of the top X artists ordered by number of positive sentiment comments. By observing trends over time and patterns that stand out among user activity in such online communities, we were also able to forecast what was going to be popular tomorrow.
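The final aggregation step, ranking artists by their number of positive-sentiment comments, can be sketched as below. This is a minimal illustration under stated assumptions, not the implementation used in that work: the input format (artist, polarity) pairs, the function name top_artists, the alphabetical tie-break and the sample comments are all hypothetical, and the annotators that would produce such pairs are not shown.

```python
from collections import Counter


def top_artists(annotated_comments, x):
    """Rank artists by number of positive comments and return the top x.

    annotated_comments is an iterable of (artist, polarity) pairs, where
    polarity is "positive" or "negative" (an assumed annotator output).
    Ties are broken alphabetically by artist name.
    """
    positives = Counter(
        artist
        for artist, polarity in annotated_comments
        if polarity == "positive"
    )
    ranked = sorted(positives.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:x]


# Hypothetical annotated comments, for illustration only.
comments = [
    ("Lily Allen", "positive"),
    ("Amy Winehouse", "positive"),
    ("Lily Allen", "positive"),
    ("Amy Winehouse", "negative"),
]

print(top_artists(comments, 2))
```

Running the same ranking over successive time windows would give the kind of trend data from which popularity forecasts could be made.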
The Winning Combination
With background knowledge, statistical and linguistic techniques each providing different levels and types of support for the analysis of user-generated content, the important questions are what combination of these to use and when. This will in turn depend on the end goal of the application and also on the data it is working with. Blogs, for example, tend to be longer and have sufficient information to assess the meaning behind the content. Analysis of Twitter posts and forum messages, on the other hand, might need help from background knowledge, especially when there is not sufficient support from corpus-based approaches. We believe this will be an important focus of investigation as more Web applications begin to combine domain knowledge with their existing content-analysis frameworks.
Other Issues with Attention Metadata
User-generated textual content such as reviews, posts and discussions is only one example of Attention Metadata, i.e. any information generated as a result of a user’s interaction with content that also signifies interest in or attention to that content. Others include:
- Descriptions, tags, user-placed anchor links
- Page views, access logs
- Star ratings, Diggs
- Images, Audio, Video (Multimedia content) etc.
Today, applications that aggregate user activity typically operate on only one type of attention metadata. They aggregate topical blogs (http://www.sifry.com/alerts/Slide0008.gif), visualize connections between people and content produced within a network (http://www.neuroproductions.be/twitter_friends_network_browser/), aggregate music listens on lastfm (http://lastgraph3.aeracode.org/) etc. Aggregating all known attention metadata for an object is more complicated because we are dealing with multimodal information. In the domain of music, for example, user interest that generates a song listen is not the same as that which generates a video view or a textual comment. In a recent work, we used principles of Voting Theory to aggregate user activity from MySpace and Bebo comments, as well as LastFM listens and YouTube comments, to measure overall artist popularity in the music community. With the need to measure the pulse of a population across all available information sources, we suspect this will be an important area of investigation.
A Newer Breed of Applications
Perhaps the most interesting phenomenon on the Social Web is that people are not only connected to one another by means of a social tie (friends on social networks or referrals on LinkedIn), but are also connected via a piece of information and context. A user links to someone’s blog post, follows someone’s tweet (user-generated posts on Twitter.com), responds to someone’s posting, multiple users tweet from the same location and so on.
User-generated content now comes with a social context which is the network it was generated in. Tapping this machine accessible people-content network empowers a new breed of applications and provides new opportunities for building social-aware systems.
Imagine a scenario where you are looking to get more information about a camera you heard about on the radio, but do not remember the exact model number. However, you remember that your friend was recently talking about his blog post that discussed a review he had read on his favorite gadget discussion forum for the same product. On the Semantic Social Web where all user-generated content is annotated, an intelligent search program would be able to sift through all of your friend’s posts and all gadget forum posts that are annotated with the same camera object and return those pages to you.
Now, consider the following scenario where user-generated content is used along with the people-content network, background domain models and situational cues to support tasks that might not be easily achieved today. Imagine an event-tracker system maintains a knowledge base of music events (such as eventful.com), their time and locations; a knowledge base of artists and their work (such as MusicBrainz); and continually tracks and annotates Twitter messages related to the events. Now imagine a user tweets “Hitting traffic jam. Looks like im missin lilys opening” from his iPhone (that also provides time and location information). Using situational context information and identifying ‘Lily’ in the tweet, the system can associate this message with the event where ‘Lily Allen’ is performing. The application can now alert users who have signed up for the same event and share similar location coordinates with a “watch out for a traffic jam” message.
Conclusion
The role of users in driving the Social Media of today is undeniable.
The wealth of information that is being created spans multiple content types, multiple people networks and multiple people-content interactions. In order to effectively exploit this avalanche of information and build applications that enrich user online experiences, it is important that we bring some level of organization to the otherwise loosely categorized content on the Social Web. We see great potential for a Web where the Social Web meets the Semantic Web; where objects are treated as ‘first class citizens’, making it easier to search, integrate and exploit the multitude of information surrounding them. Applications using this underlying semantic infrastructure will significantly enhance the business potential behind user-generated content as well as enrich user experience associated with the social media.
Citation information: Amit Sheth and Meenakshi Nagarajan, Semantics-Empowered Social Computing. IEEE Internet Computing Jan/Feb 2009, pages 76-80