Continuous Semantic Crawling Events

From Knoesis wiki


The need to tap into the "wisdom of the crowd" via social networks in real time has already been demonstrated during critical events such as the Arab Spring and the recently concluded US elections. As Twitter becomes a platform of choice for streaming event-related information in real time, we face several challenges related to filtering, real-time monitoring, and tracking the dynamic evolution of an event. We present a novel approach to continuously track an evolving event on Twitter by leveraging hashtags that are filtered using an evolving background knowledge source (Wikipedia). Our approach (1) collects evolving hashtags by adapting tag co-occurrence information; (2) exploits the semantics of events for selecting hashtags by monitoring and leveraging the corresponding Wikipedia event pages; and (3) filters tweets using hashtags that are determined to be semantically relevant to the event. We evaluated our approach on two recent events: the United States Presidential Elections 2012 and Hurricane Sandy. The results demonstrate that Wikipedia can be leveraged to determine, rank, and evolve a small, high-quality set of event-related hashtags in real time to filter an event-relevant tweet stream.

Hashtag Analysis

We performed a preliminary analysis of hashtags prior to architecting a solution to this problem. The analysis answers two questions:

  • How many hashtags contribute in retrieving the event-related tweets?
  • Can these hashtags be detected automatically?

In order to answer these questions, we utilized the dataset for two events from the Twitris<ref>A. Jadhav, H. Purohit, P. Kapanipathi, P. Ananthram, A. Ranabahu, V. Nguyen, P.N. Mendes, A.G. Smith, M. Cooney, and A. Sheth. Twitris 2.0: Semantically empowered system for understanding perceptions from social data. Semantic Web Challenge, 2010.</ref> system. The two events are (1) Occupy Wall Street (OWS) and (2) Colorado Shooting (CMS). The details of the dataset are provided in the table below.

Dataset for Analysis from Twitris
Event | Tweets    | Hashtags (Distinct)  | Start Date | End Date
CMS   | 122,062   | 192,512 (12,350)     | 7/20/12    | 9/10/12
OWS   | 6,077,378 | 15,963,209 (191,602) | 9/29/11    | 9/20/12
Total | 6,199,440 | 16,155,721           |            |

How many hashtags contribute in retrieving the event-related tweets?

We analyzed the frequency of hashtags in the event-relevant tweets and discovered that the hashtag frequencies follow a power law<ref>G. K. Zipf. Human Behavior and the Principle of Least Effort, 1949.</ref>, as shown in the figure below. Although many hashtags are involved in an event, as shown in the table above, the number of hashtags needed to index the whole dataset is much smaller. In other words, the distinct hashtags, taken in descending order of frequency, that are sufficient as search queries to retrieve the whole dataset (Hashtag Queries) number (1) 7763 for CMS and (2) 21314 for OWS. The majority of the remaining hashtags co-occur with one of these Hashtag Queries. Moreover, less than 1% of these Hashtag Queries make a significant impact in retrieving the tweets: on average, more than 85% of the tweets can be retrieved using the top one percent of the Hashtag Queries. We refer to these hashtags as Impacting Hashtags.

[Figure: Power Law of Hashtag Frequencies]
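The ranking-and-coverage computation described above can be sketched as follows. This is an illustrative reconstruction, not the system's actual code; the tweet representation (a list of hashtag lists) and the `top_fraction` parameter are assumptions for the example.

```python
from collections import Counter

def impacting_hashtags(tweets, top_fraction=0.01):
    """Rank hashtags by frequency, keep the top `top_fraction`
    (the Impacting Hashtags), and report the share of tweets
    that contain at least one of them."""
    counts = Counter(tag for t in tweets for tag in t)
    ranked = [tag for tag, _ in counts.most_common()]
    top = ranked[: max(1, int(len(ranked) * top_fraction))]
    top_set = set(top)
    covered = sum(1 for t in tweets if top_set & set(t))
    return top, covered / len(tweets)

# Toy stream: one dominant hashtag retrieves most tweets.
tweets = [["election2012", "obama"], ["election2012"],
          ["romney", "election2012"], ["rare_tag"]]
top, coverage = impacting_hashtags(tweets, top_fraction=0.25)
```

On the toy stream, the single top hashtag (`election2012`) already retrieves 75% of the tweets, mirroring the power-law observation at small scale.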

Can these hashtags be detected automatically?

We employed the tag co-occurrence technique to analyze the Impacting Hashtags. The tag co-occurrence networks for both events are shown in the figures below. We discovered that the impacting hashtags that are relevant to the event co-occur with at least one other impacting hashtag.


Intuitively, from the figures above, we can note that the hashtags more relevant to the event lie toward the center and are better clustered than the hashtags at the periphery. To formalize this, we utilized the Average Clustering Coefficient (AvgCC)<ref>S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications, volume 8. Cambridge University Press, 1994.</ref> of the co-occurrence networks of hashtags. We determined the AvgCC while incrementing the number of top hashtags in the network by 0.1% at each step. From this analysis we found that the top hashtags are better clustered with each other than with the lower-frequency hashtags added later. Therefore, starting with a popular hashtag for an event, we can find the other popular hashtags more easily than the less frequent ones. The analysis of the AvgCC is shown in the figure below.
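The co-occurrence network and the clustering coefficient used above can be sketched with the standard definitions. This is a minimal stdlib illustration, not the analysis pipeline itself; the input format (each tweet as a list of hashtags) is an assumption.

```python
import itertools
from collections import defaultdict

def build_cooccurrence(tweet_tag_sets):
    """Adjacency map: two hashtags are linked if they co-occur in a tweet."""
    adj = defaultdict(set)
    for tags in tweet_tag_sets:
        for a, b in itertools.combinations(sorted(set(tags)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def avg_clustering(adj, nodes):
    """Average local clustering coefficient over `nodes`, restricted to
    the subgraph induced by `nodes` (nodes with degree < 2 count as 0)."""
    nodes = set(nodes)
    total = 0.0
    for v in nodes:
        nbrs = adj[v] & nodes
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges among v's neighbors (closed triangles through v).
        links = sum(1 for a, b in itertools.combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0

adj = build_cooccurrence([["a", "b", "c"], ["c", "d"]])
```

Here the tightly co-occurring core `{a, b, c}` forms a triangle (AvgCC = 1.0), while adding the peripheral tag `d` lowers the average, matching the pattern reported for the top hashtags.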



By leveraging the hashtag analysis in the previous section, we present a novel approach to detect hashtags in real time to continuously monitor an event. In order to detect semantically relevant hashtags in real time, we need an evolving background knowledge source that is updated with the latest happenings of the event. Therefore, we use Wikipedia as a graph structure that is continuously updated by the crowd as the event changes. The figure below shows the architecture of our approach. We use tag co-occurrence in streaming mode to detect candidate tags that are then further filtered for event relevance.

The whole approach can be explained in two phases: (1) processing the background knowledge (Event Wiki Processor) and (2) determining the semantic similarity of hashtags (Hashtag Analyzer). Once the background knowledge is processed by leveraging the Wikipedia event page, a stream is set up with manually input hashtags as the initial filtering hashtag set. The system then adopts an expand-and-reduce paradigm to find hashtags to add to the filtering hashtag set, as shown in Figure 5. First, we expand our choices of hashtags (candidate tags) by applying the tag co-occurrence technique to the input hashtags in the stream, and then reduce these candidate tags to only the relevant ones by determining their semantic similarity to the event. The semantic similarity is determined by leveraging the background knowledge of the corresponding event on Wikipedia. Finally, the hashtags used for filtering are updated to stream more timely, relevant tweets. As shown in Figure 5, this process tracks the evolving event. The hashtags in the filtering hashtag set are also periodically checked for semantic relevance to the event, to remove hashtags that are outdated and are crawling tweets irrelevant to the event.
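One cycle of the expand-and-reduce paradigm can be sketched as below. The `cooccurring` and `relevance` callables stand in for the streaming co-occurrence collector and the Wikipedia-based semantic scorer described above; their names and the threshold value are illustrative assumptions, not the system's actual interface.

```python
def expand_and_reduce(filter_tags, cooccurring, relevance, threshold=0.3):
    """One iteration of the expand-and-reduce cycle:
    expand the filter set with co-occurring candidate hashtags,
    reduce to the semantically relevant ones, and periodically
    re-check existing tags to drop outdated ones."""
    # Expand: hashtags seen alongside the current filter set in the stream.
    candidates = cooccurring(filter_tags) - filter_tags
    # Reduce: keep candidates scoring above the relevance threshold.
    accepted = {t for t in candidates if relevance(t) >= threshold}
    # Re-check: drop filter tags that have become irrelevant to the event.
    retained = {t for t in filter_tags if relevance(t) >= threshold}
    return retained | accepted

# Stub scorer and stream for illustration.
scores = {"sandy": 0.9, "frankenstorm": 0.6, "nyc": 0.1, "oldtag": 0.05}
updated = expand_and_reduce({"sandy", "oldtag"},
                            lambda tags: {"frankenstorm", "nyc"},
                            lambda t: scores.get(t, 0.0))
```

In the stub run, `frankenstorm` is added, while `oldtag` and the weakly related `nyc` are filtered out, which is exactly the update-and-prune behavior the paragraph describes.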


Processing Wikipedia Event Page

In order to determine the relevance of a hashtag to an event, we generate a weighted list of Wikipedia concepts as background knowledge. The weights capture relatedness to the event and are determined based on the following two propositions:

  1. Mutual importance of the concept to the event
  2. Overlap of the discussion in the concept page to the event page
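The two propositions can be realized in one plausible way, sketched below: mutual importance as a bonus when the concept page links back to the event page, and discussion overlap as Jaccard similarity of the page texts. The page dictionaries, field names, and the exact scoring formula are illustrative assumptions, not the system's actual data model or weighting scheme.

```python
def weight_concepts(event_page, concept_pages):
    """Assign each Wikipedia concept linked from the event page a weight
    combining (1) mutual importance and (2) content overlap."""
    event_words = set(event_page["text"].lower().split())
    weights = {}
    for name, page in concept_pages.items():
        # Proposition 1: a back-link to the event signals mutual importance.
        mutual = 1.0 if event_page["title"] in page["links"] else 0.5
        # Proposition 2: word-set Jaccard overlap of the two page texts.
        words = set(page["text"].lower().split())
        overlap = len(event_words & words) / len(event_words | words)
        weights[name] = mutual * overlap
    # Normalize so the most related concept has weight 1.0.
    top = max(weights.values(), default=0.0) or 1.0
    return {c: w / top for c, w in weights.items()}

event = {"title": "Hurricane Sandy", "text": "storm new york flooding"}
concepts = {
    "New York": {"links": ["Hurricane Sandy"], "text": "new york city flooding"},
    "Weather": {"links": [], "text": "rain storm"},
}
weights = weight_concepts(event, concepts)
```

With the toy pages, "New York" (back-linked and heavily overlapping) outranks the generic "Weather" concept, which is the ordering the weighting is meant to produce.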

Filtering Semantically Relevant Hashtags

Once the candidate tags are detected, we determine their semantic relevance to the event by leveraging the background knowledge (weighted concepts). In order to determine the semantic relevance, we generate another set of weighted concepts that represent the hashtag. These weighted concepts are generated by extracting Wikipedia concepts that co-occur with the hashtag in tweets. Due to the real-time nature of the task, only the latest tweets containing the hashtag are considered, so the representation reflects the hashtag's timely relevance. The weights are the normalized frequencies of the concepts in the tweets.
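The hashtag representation described above reduces to counting spotted concepts and normalizing, as sketched here. The input (one concept list per tweet, as produced by an entity spotter) is an assumption for the example.

```python
from collections import Counter

def hashtag_concept_vector(concepts_per_tweet):
    """Weighted-concept representation of a hashtag: Wikipedia concepts
    spotted in its latest tweets, weighted by normalized frequency."""
    counts = Counter(c for tweet in concepts_per_tweet for c in tweet)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Two recent tweets for a hashtag, already annotated with concepts.
vector = hashtag_concept_vector([["Barack Obama", "Ohio"], ["Barack Obama"]])
```

The weights sum to 1, so vectors for different hashtags are directly comparable regardless of how many tweets were crawled.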

The semantic relevance is determined by computing the similarity between the background knowledge and the weighted concepts that represent the hashtag. We experimented with three similarity measures between the weighted concept sets:

  1. Jaccard Similarity (Symmetric, Set based)
  2. Cosine Similarity (Symmetric, Vector Space Model)
  3. Weighted Subsumption Similarity (Asymmetric, weights based)

The intuition behind using an asymmetric similarity measure is that a hashtag rarely represents the whole event. For example, hashtag guides for the US Elections 2012 on the web mention hashtags such as #Election2012, #TeamObama, #TeamRomney, #Obama2012, #IACaucus, and #NC2012. Most of these hashtags represent a part of the event (a named entity or sub-event) rather than the whole event.
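The three measures over weighted concept dicts can be sketched as follows. Jaccard and cosine are the standard definitions; the weighted subsumption form shown (the fraction of the hashtag's weighted concept mass covered by the event's concepts) is one plausible reading of the asymmetric idea, not necessarily the exact formula used in the system.

```python
import math

def jaccard(event_w, tag_w):
    """Symmetric, set-based: overlap of concept names only."""
    e, t = set(event_w), set(tag_w)
    return len(e & t) / len(e | t) if e | t else 0.0

def cosine(event_w, tag_w):
    """Symmetric, vector-space: dot product of weight vectors."""
    dot = sum(event_w[c] * tag_w[c] for c in set(event_w) & set(tag_w))
    ne = math.sqrt(sum(w * w for w in event_w.values()))
    nt = math.sqrt(sum(w * w for w in tag_w.values()))
    return dot / (ne * nt) if ne and nt else 0.0

def weighted_subsumption(event_w, tag_w):
    """Asymmetric, weight-based: share of the hashtag's weighted
    concept mass subsumed by the event's background knowledge."""
    covered = sum(w for c, w in tag_w.items() if c in event_w)
    total = sum(tag_w.values())
    return covered / total if total else 0.0
```

Note how a hashtag covering only part of the event can still score high under subsumption (its few concepts all appear in the event) while scoring low under the symmetric measures, which penalize the event concepts the hashtag never mentions.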


We evaluated our approach by simulating a real-time process for two recent events. First, we streamed using the most relevant hashtag for each event (#election2012, #sandy). The table below provides the number of tweets streamed and the hashtags that co-occurred with the initial hashtags. The background knowledge was generated for both events using the corresponding Wikipedia event pages<ref>,_2012</ref><ref></ref>.

Event             | Tag           | Tweets | Co-occ Tags (Distinct) | Wiki Entities | Date
US Elections 2012 | #election2012 | 4855   | 12361 (1460)           | 614           | 2/11/2012
Hurricane Sandy   | #sandy        | 4818   | 6592 (837)             | 419           | 2/11/2012

We transform the problem of detecting the semantic relevance of a hashtag to the event into a ranking problem. We use tag co-occurrence as the baseline for our evaluation and compare it with our approach under all three similarity measures. We selected the 25 most frequent hashtags in the tweets streamed for the initial hashtags. For each of these hashtags, we crawled the latest 500 tweets for US Elections and 200 tweets for Hurricane Sandy using the Twitter Search API. Further, in order to extract Wikipedia concepts from the tweets, we used DBpedia Spotlight<ref>P. Mendes, M. Jakob, A. Garcia-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. 2011.</ref> and a Trie extractor<ref>P. Mendes, A. Passant, and P. Kapanipathi. Twarql: Tapping into the wisdom of the crowd. 2010.</ref>.

Manually Assessed Dataset

For evaluation, we manually assessed all the tweets crawled using the top hashtags. The table below gives the details of the tweets that were manually assessed.

Event             | Tags | Tweets (Distinct) | Rel  | Ir-rel | Tweet Entities (Distinct)
US Elections 2012 | 25   | 11504 (10084)     | 7086 | 2998   | 27558 (4255)
Hurricane Sandy   | 25   | 4905 (4850)       | 2691 | 2159   | 10719 (2359)
Total             | 50   | 15409 (14934)     | 9777 |        | 38219

The manually assessed dataset can be obtained via either git or knoesis-svn with

  • username - guest
  • password - guest

The README file provides further information about the dataset.

Evaluation Results

We use two of the most popular ranking evaluation metrics in Information Retrieval.

  1. Mean Average Precision (MAP)
  2. Normalized Discounted Cumulative Gain (NDCG)<ref>K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. SIGIR '00. ACM, 2000.</ref>
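For reference, both metrics follow their standard definitions, sketched below: average precision over a binary relevance list (MAP is its mean across rankings), and NDCG@k over graded relevance normalized by the ideal ordering. The toy relevance lists are illustrative, not taken from the evaluation data.

```python
import math

def average_precision(rels):
    """AP over a binary relevance list in ranked order;
    MAP is the mean of AP across rankings."""
    hits, score = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / i  # precision at each relevant position
    return score / hits if hits else 0.0

def ndcg_at_k(rels, k):
    """NDCG@k over graded relevance in ranked order,
    normalized by the ideal (descending) ordering."""
    def dcg(seq):
        return sum(r / math.log2(i + 2) for i, r in enumerate(seq[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0
```

A ranking that places the most relevant hashtags first yields NDCG@k of 1.0; any inversion in the ordering lowers the score, which is what the tables below measure against the manually assessed benchmark.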

The figure and table below depict the evaluation results. The figure shows the MAP, comparing the tag co-occurrence technique with our approach under all three similarity measures. The benchmark is the ranking derived from the manually assessed dataset.

NDCG    | Subsump(HS) | Cosine(HS) | Jaccard(HS) | Co-occ(HS) | Subsump(USE) | Cosine(USE) | Jaccard(USE) | Co-occ(USE)
NDCG@10 | 0.93        | 0.86       | 0.85        | 0.65       | 0.91         | 0.85        | 0.89         | 0.83
NDCG@25 | 0.97        | 0.93       | 0.92        | 0.89       | 0.98         | 0.95        | 0.97         | 0.94


From the results we find that our approach with the weighted subsumption similarity measure we formalized performs better than the other two similarity measures, and considerably better than the baseline (the tag co-occurrence technique). This confirms our hypothesis.


Implementation details and the code are at [github]. Last updated 2 months ago because of paper deadlines and vacation. (Will be completed soon -- estimated first week of February 2013.)



Thanks to

  • Twitris Team (Dataset for Analysis)


<references />