Personalized Filtering of the Twitter Stream
Online Social Networks have become a popular way to communicate and network in recent times; well-known examples include Facebook, MySpace, Twitter and Google+. Twitter in particular has grown rapidly in recent years, reaching an average of 460,000 new users per day in March 2011. This growth has in turn increased the number of tweets from 65 million to 200 million in the past year. Interested users are therefore facing the problem of information overload. Filtering out uninteresting posts is a necessity and plays a crucial role in handling the information overload problem on Twitter.
On Twitter it is necessary to follow another user in order to receive his/her tweets. The user who receives the tweets is called a follower and the user who generates them is called a followee. However, a follower receives all tweets from his/her followees, including those that are not of interest. Twitter itself provides features such as keyword/hashtag search as a solution to the information overload problem, but these filters are not sufficient to provide fully personalized information for a user. Although Twarql improved the filtering mechanism of Twitter by leveraging Semantic Web technologies, the user still needs to track information by manually selecting or formulating a SPARQL query using Twarql's interface. So far, applications such as TweetTopic and "Post Post" focus on filtering the stream of tweets generated by the people whom the user follows. Instead of limiting the user experience to his/her personal stream, we propose a Semantic Web approach to deliver interesting tweets to the user from the entire public Twitter stream. This helps filter out tweets that the user is not interested in, which in turn reduces the information overload.
Our contributions include (1) automatic generation of user profiles (primarily interests) based on the user's activities on multiple social networks (Twitter, Facebook, LinkedIn). This is achieved by retrieving users' interests, some implicit (by analyzing user-generated content) and some explicit (interests mentioned by the user in his/her social network profile). (2) Collecting tweets from the Twitter stream and mapping (annotating) each tweet to its corresponding topics from Linked Open Data. (3) Delivering the annotated tweets to users with matching interests in (near) real-time.
Our architecture can be separated into three modules: (1) Semantic Filter (SF), (2) Profile Generator (PG) and (3) Semantic Hub (SemHub), as illustrated in Figure 1. In this section we first explain the interaction between the three modules; each one is then explained in detail.
In the above architecture two processes run in parallel: (a) filtering of tweets and (b) subscription to the system. The sequence for each process is represented by a different type of arrow in Figure 1. Subscription to the system is handled by the Semantic Distributor (SD), which comprises both the SemHub and the PG. Once the user requests a subscription (Seq. i in Figure 1), he/she is redirected to the PG (Seq. ii). The PG generates a profile based on the user's activities on multiple social networks (Seq. iii). These profiles are stored in the SemHub's RDF store (Seq. iv) using the PuSH vocabulary. On the other hand, filtering of tweets is performed by annotating tweets from the Twitter stream in the SF. The annotations are then transformed into a representation of the groups (SPARQL queries) of users whose interests correspond to the tweet (Seq. 1). These SPARQL queries are termed Semantic Groups (SG) in this paper. The tweet with its SG is published as an RSS feed update (Seq. 2) and the SemHub is notified (Seq. 3). The SemHub then fetches the update (Seq. 4) and retrieves the list of subscribers whose interests match the group representation of the tweet (Seq. 5). Finally, the tweet is pushed to the matching subscribers (Seq. 6).
The Semantic Filter (Figure 1) primarily performs two functions: (1) representing tweets as RDF and (2) forming groups of users interested in the tweet.
First, information about the tweet is collected in order to represent it in RDF. Twitter provides metadata about the tweet such as author, location, time, "reply-to", etc. via its streaming API. In addition, entities are extracted from the tweet content (content-dependent metadata) using the same technique as Twarql. The extraction technique is dictionary-based, which provides the flexibility to use any dictionary for extraction. In our system the dictionary used to annotate the tweets is a set of concepts from the Linked Open Data (LOD) cloud. The same set is also used to create profiles, as described in Section 2.2. After entity extraction, the tweets are represented in RDF using lightweight vocabularies such as FOAF, SIOC, OPO and MOAT. This transforms an unstructured tweet into a structured representation using popular ontologies. The triples (RDF) of the tweet are temporarily stored in an RDF store.
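As a rough illustration of this step, the sketch below serializes an annotated tweet as Turtle. The exact triple structure and property choices (sioc:content, foaf:maker, moat:taggedWith) are assumptions for illustration; the paper only names the vocabularies (FOAF, SIOC, OPO, MOAT), not the precise mapping.

```python
# Hypothetical sketch: serializing an annotated tweet as RDF (Turtle).
# Property names below are illustrative assumptions, not the system's
# exact vocabulary usage.

def tweet_to_turtle(tweet_id, author, content, entities):
    """Serialize a tweet and its extracted DBpedia entities as Turtle."""
    prefixes = (
        "@prefix sioc: <http://rdfs.org/sioc/ns#> .\n"
        "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
        "@prefix moat: <http://moat-project.org/ns#> .\n"
        "@prefix dbpedia: <http://dbpedia.org/resource/> .\n"
    )
    uri = f"<http://twitter.com/{author}/status/{tweet_id}>"
    lines = [
        f"{uri} a sioc:Post ;",
        f'    sioc:content "{content}" ;',
        f"    foaf:maker <http://twitter.com/{author}> ;",
    ]
    # One annotation triple per extracted LOD concept.
    for entity in entities:
        lines.append(f"    moat:taggedWith dbpedia:{entity} ;")
    lines[-1] = lines[-1].rstrip(" ;") + " ."  # close the final triple
    return prefixes + "\n".join(lines)
```

The resulting Turtle can then be loaded into the temporary RDF store via SPARQL Update, as described later in the implementation section.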
The annotated entities represent the topics of the tweet. These topics act as the key for filtering the subset of users who receive the tweet. Topics are queried from the RDF store and included in the SGs that act as the filter. An SG, once executed at the Semantic Hub, fetches all the users whose interests match the topics of the tweet. If a tweet has multiple topics, the SG is created to fetch the union of users who are interested in at least one topic of the tweet.
User Profile Generator
The extraction and generation of user profiles from social networking websites is composed of two basic parts: (1) data extraction and (2) generation of application-dependent user profiles. After this phase, the remaining steps in our work involve representing the user models using popular ontologies and, finally, aggregating the distributed profiles.
First, in order to collect private data about users on social websites it is necessary to be granted access to the data by the users. Once the authentication step is accomplished, the two most common ways to fetch the profile data are to use an API provided by the system or to parse the Web pages. Once the data is retrieved, the next step is data modeling using standard ontologies. In this case, a possible way to model profile data is to generate RDF-based profiles described using the FOAF vocabulary . We then extend FOAF with the SIOC ontology  to represent more precisely the online accounts of the person on the Social Web. Additional personal information about users' affiliations, education and job experiences can be modeled using the DOAC vocabulary. This allows us to represent the past working experiences of the users and their cultural background. Another important part of a user profile is the user's interests. In Figure 2 we display an example of an interest in "Semantic Web" with a weight of 0.5 on a specific scale (from 0 to 1) using the Weighted Interests Vocabulary (WI) and the Weighting Ontology (WO). Common approaches to computing the weights of the interests are based on the number of occurrences of the entities, their frequency, etc.
The Semantic Distributor module comprises the Semantic Hub  and the Profile Generator. The Semantic Hub (SemHub) is an extension of Google's PubSubHubbub (PuSH) using Semantic Web technologies to provide publisher-controlled real-time notifications. PuSH is a decentralized publish-subscribe protocol which extends Atom and RSS to enable real-time streams. It allows parties that understand it to get near-instant notifications of the content they are subscribed to, as PuSH immediately pushes new data from the publisher to the subscriber(s), whereas traditional RSS readers periodically pull new data. The PuSH ecosystem consists of a few hubs, many publishers, and a large number of subscribers. Hubs enable (1) publishers to offload the task of broadcasting new data to subscribers; and (2) subscribers to avoid constantly polling for new data, as the hub pushes the data updates to the subscribers. In addition, the PuSH protocol is designed to handle all the complexity in the communication, easing the tasks of publishers and subscribers.
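The push interaction pattern described above can be sketched in a few lines. This is only a toy model of the PuSH roles, with in-process callbacks standing in for subscriber callback URLs and HTTP POSTs; the real protocol involves verification of subscription intent and topic URLs.

```python
# Toy sketch of the PuSH interaction pattern: the hub keeps the subscriber
# list, the publisher only pings the hub, and the hub fans the update out
# (push) instead of each subscriber polling the feed (pull).

class Hub:
    def __init__(self):
        self.subscribers = []  # callbacks standing in for callback URLs

    def subscribe(self, callback):
        """A subscriber registers interest in the publisher's feed."""
        self.subscribers.append(callback)

    def publish(self, update):
        """Triggered by the publisher's ping; the hub pushes the new
        content to every subscriber, so none of them need to poll."""
        for callback in self.subscribers:
            callback(update)
```

The Semantic Hub extends this pattern by filtering the subscriber list with a SPARQL query (the Semantic Group) before pushing, instead of broadcasting to everyone.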
The extension from the PuSH protocol to the Semantic Hub is described in . In our work, the SemHub performs the functionality of distributing tweets to interested users according to the Semantic Groups generated by the SF. The SemHub has only one publisher, the SF, as shown in Figure 1, and there can be multiple subscribers. The SemHub, as in our previous work, does not focus on creating a social graph of the publisher; the PG is responsible for storing the subscribers' FOAF profiles in the RDF store accessed by the SemHub.
In this section we provide the implementation details for each module in the architecture. To collect tweets we use the twitter4j Streaming API. Starting with the SF, the entity extraction from tweets is dictionary-based, similar to the extraction technique used in Twarql . This technique was chosen due to the performance requirements of real-time notifications. A set of 3.5 million entities from DBpedia is built as an in-memory representation for time-efficient longest sub-string matching. The in-memory representation is a ternary search tree (trie), and the longest sub-string match using the trie is performed with a time complexity of O(LT), where L is the number of characters and T is the number of tokens in the tweet.
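The following is a minimal sketch of dictionary-based longest-match extraction over tweet tokens, in the spirit of the technique described above. A plain dict-of-dicts trie stands in for the production ternary search tree over 3.5 million DBpedia labels.

```python
# Sketch of dictionary-based longest-match entity extraction. A nested-dict
# trie stands in for the in-memory ternary search tree used by the system.

END = object()  # marker for a complete dictionary entry

def build_trie(labels):
    """Index dictionary labels token by token."""
    trie = {}
    for label in labels:
        node = trie
        for token in label.lower().split():
            node = node.setdefault(token, {})
        node[END] = label
    return trie

def extract_entities(text, trie):
    """Return the longest dictionary matches found in the tweet text."""
    tokens = text.lower().split()
    matches, i = [], 0
    while i < len(tokens):
        node, best, j = trie, None, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best = (node[END], j)  # remember the longest match so far
        if best:
            matches.append(best[0])
            i = best[1]                # skip past the matched span
        else:
            i += 1
    return matches
```

Each token of the tweet is visited at most once per candidate match, which mirrors the O(LT) bound stated above.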
As mentioned in Section 2.1, tweets are transformed into RDF using lightweight vocabularies; see Figure 3 for an example. The RDF is then stored in an RDF store using SPARQL Update via HTTP. For performance reasons it is preferable to have the RDF store on the same server; however, architecturally it can be located anywhere on the Web and accessed via HTTP and the SPARQL Protocol for RDF. Presently, the RDF generated for each tweet is stored in a temporary graph and the topics/concepts of the tweet are queried. These concepts are then used to formulate the SPARQL representation of the group (SG) of users who are interested in the tweet. The RSS feed is updated as per the format specified in  with the SG, and the Semantic Hub is notified. The SG for the tweet in Figure 3 will retrieve all the users who are interested in at least one of the extracted interests (dbpedia:Kim_Kardashian, dbpedia:Kris_Humphries, dbpedia:Hollywood).
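A Semantic Group of this kind could be formulated as a UNION query over the extracted concepts, returning subscribers interested in at least one of them. The sketch below is illustrative only: the wi: prefix and the use of foaf:topic_interest are assumptions, not the exact query shape or vocabulary the system emits.

```python
# Hypothetical sketch of formulating a Semantic Group: a SPARQL SELECT
# with one UNION branch per extracted concept. Property and prefix choices
# here are illustrative assumptions.

def semantic_group(concepts):
    """Build a SPARQL query selecting users interested in any concept."""
    prefixes = (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "PREFIX dbpedia: <http://dbpedia.org/resource/>\n"
    )
    branches = [
        "{ ?user foaf:topic_interest dbpedia:%s }" % c for c in concepts
    ]
    return (prefixes
            + "SELECT DISTINCT ?user WHERE {\n  "
            + "\n  UNION\n  ".join(branches)
            + "\n}")
```

Executed at the Semantic Hub against the graph of subscriber FOAF profiles, such a query yields the set of users to whom the tweet is pushed.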
The Semantic Hub used for our implementation is hosted at http://semantichub.appspot.com. The SemHub executes the SG on the graph that contains the FOAF profiles of subscribers generated by PG. The corresponding tweets are pushed to the resulting users.
The Profile Generator considers three different social networking sites for generating user profiles: Twitter, LinkedIn and Facebook. In order to collect user data from each of these platforms, we developed three different applications. For Twitter and Facebook we implemented similar PHP scripts that make use of the respective query APIs publicly accessible on the Web. For LinkedIn we use an XSLT script that parses the LinkedIn user profile page and generates an XML file containing all the attributes found on the page. The user information collected from Twitter is the publicly available data posted by the user, i.e. his/her latest 500 microblog posts. The technique used for entity recognition in the user's tweets is the same one used in the SF for annotating tweets. The extracted concepts are then ranked and weighted by their frequency of occurrence. A similar approach is described in .
While on Twitter we create profiles with implicitly inferred interests, on LinkedIn and Facebook we collect not only interests that have been explicitly stated by the users, but also their personal details such as contacts, workplace and education. The user's personal data is fetched through the Facebook Graph API, as are the interests (likes), which are then mapped to the related Facebook pages representing the entities. We represent the entities/concepts in which the user is interested using both DBpedia and Facebook resources.
The weights for the interests are calculated in two different ways depending on whether the interest has been implicitly inferred by the entity extraction algorithm (the Twitter case) or explicitly recorded by the user (the LinkedIn and Facebook cases). In the first case, the weight of the interest is calculated by dividing the number of occurrences of the entity in the latest 500 tweets by the total number of entities identified in the same 500 tweets. In the second case, since the interest has been manually set by the user, we assume that the weight for that source (or social networking site) is 1 (on a scale from 0 to 1). In other words, we give the maximum possible value to an interest that has been explicitly set by the user.
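The two weighting rules can be sketched as follows, taking the list of entity occurrences extracted from the user's latest tweets as input (function and variable names are ours, for illustration):

```python
# Sketch of the two weighting rules: implicit interests from Twitter are
# weighted by relative frequency over the extracted entity occurrences;
# explicitly stated interests (LinkedIn, Facebook) get the maximum weight.

from collections import Counter

def implicit_weights(extracted_entities):
    """weight(e) = occurrences of e / total entity occurrences."""
    counts = Counter(extracted_entities)
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}

def explicit_weight(interest):
    """Manually set by the user: maximum value on the [0, 1] scale."""
    return 1.0
```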
Our approach to computing the new weights resulting from the aggregation of the profiles is straightforward. We consider every social website equally relevant, hence we multiply each of the three weights by a constant of 1/3 (approximately 0.33) and then sum the results. In the previously described formula (1), this corresponds to using w_s = 1/3 for every source s.
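The aggregation step above amounts to the following sketch, where each source contributes its weight for an interest scaled by the per-source constant w_s = 1/3 (the dictionary layout is our assumption):

```python
# Sketch of profile aggregation: every source is weighted equally, so each
# per-source interest weight is scaled by w_s = 1/3 and the results summed.

SOURCES = ("twitter", "linkedin", "facebook")

def aggregate(profiles, ws=1 / 3):
    """profiles: {source: {interest: weight}} -> {interest: combined}."""
    combined = {}
    for source in SOURCES:
        for interest, weight in profiles.get(source, {}).items():
            combined[interest] = combined.get(interest, 0.0) + ws * weight
    return combined
```

For example, an interest implicitly weighted 0.6 on Twitter and explicitly set (weight 1) on LinkedIn, but absent from Facebook, aggregates to (0.6 + 1.0) / 3.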
Conclusion and Future Work
In this paper we described an architecture for filtering the public Twitter stream and delivering interesting tweets directly to users according to their multi-domain profiles of interests. We explained how we generate comprehensive user profiles of interests by fetching and aggregating user information from different sources (i.e. Twitter, Facebook and LinkedIn). Then, we detailed how we extract entities and interests from tweets, how we model them using Semantic Web technologies, and how it is possible to automatically create dynamic groups of users related to the extracted interests. According to these user groups, the tweets are then "pushed" to the users using the Semantic Hub architecture.
In the future, we want to extend our work to handle social streams in general (not only Twitter). We also plan to leverage inferencing (category/subcategory relationships) on LOD, rather than just filtering based on concepts. Our extension would further allow users not only to subscribe to concepts from LOD as interests, but also to subscribe to a SPARQL query, as in Twarql. We are also working on providing interesting information and ranking based on the user's social graph.
- M.S. Bernstein, Bongwon Suh, Lichan Hong, Jilin Chen, Sanjay Kairam, and E.H. Chi. Eddi: interactive topic-based browsing of social status streams. In The 23rd annual ACM symposium on User interface software and technology, 2010.
- Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1–22, 2009.
- John Breslin, Uldis Bojars, Alexandre Passant, Sergio Fernández, and Stefan Decker. SIOC: Content Exchange and Semantic Interoperability Between Social Networks. In W3C Workshop on the Future of Social Networking, January 2009.
- Dan Brickley and Libby Miller. FOAF Vocabulary Specification 0.98. Namespace Document 9 August 2010 - Marco Polo Edition. http://xmlns.com/foaf/spec/, 2010.
- Pavan Kapanipathi, Julia Anaya, Amit Sheth, Brett Slatkin, and Alexandre Passant. Privacy-Aware and Scalable Content Dissemination in Distributed Social Networks. In ISWC 2011 - Semantic Web In Use, 2011.
- Pablo N. Mendes, Alexandre Passant, and Pavan Kapanipathi. Twarql: tapping into the wisdom of the crowd. I-SEMANTICS ’10, 2010.
- Pablo N. Mendes, Alexandre Passant, Pavan Kapanipathi, and Amit P. Sheth. Linked Open Social Signals. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2010.
- D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. In ICWSM, 2010.
- Ke Tao, Fabian Abel, Qi Gao, and G.J. Houben. TUMS: Twitter-based User Modeling Service. In Workshop on User Profile Data on the Social Semantic Web (UWeb), ESWC 2011, 2011.