Hierarchical Interest Graph

From Knoesis wiki
Jump to: navigation, search


Industry and researchers have identified numerous ways to monetize microblogs for personalization and recommendation. A common challenge across these different works is identification of user interests. Although techniques have been developed to address this challenge, a flexible approach that spans multiple levels of granularity in user interests has not been forthcoming.

In this work, we focus on exploiting hierarchical semantics of concepts to infer richer user interests expressed as Hierarchical Interest Graph. To create such graphs, we utilize user's Twitter data to first ground potential user interests to structured background knowledge such as Wikipedia Category Graph. We then use an adaptation of spreading activation theory to assign user interest score (or weights) to each category in the hierarchy. The Hierarchical Interest Graph not only comprises of user's explicitly mentioned interests determined from Twitter, but also their implicit interest categories inferred from the background knowledge source. We demonstrate the effectiveness of our approach through a user study which shows an average of approximately eight of the top ten weighted categories in the graph being relevant to a given user's interests.


The goal of our approach is to construct a Hierarchical Interest Graph (HIG) for a Twitter user. To accomplish this our system performs the following steps:

  1. Hierarchy Pre-processor converts the Wikipedia Category Graph (WCG) into a hierarchy
  2. User Interests Generator spots and scores the Primitive Interests from tweets of a user.
  3. Interest Hierarchy Generator maps the scored Primitive Interests to Wikipedia Hierarchy to infer the HIG of the user.

Step 1 of our approach is performed if and when there are updates to Wikipedia Category Graph. Step 2 and 3 are performed for the set of tweets of each user to generate the corresponding HIG. Below Figure illustrates our approach.



The system is implemented in Java. It is modular to plug and play different ontologies, entity resolution services (Zemanta, Alchemy and Dbpedia Spotlight), spreading activation functions (Bell, Bell Log and Priority Intersect). Our present implementation is on Wikipedia and Zemanta with all the activation functions. Each of these are explained in the below sections.

Hierarchy Preprocessor

In order to generate a Hierarchical Interest Graph it was necessary to transform the Wikipedia Category Graph to a Hierarchy. First, creating the Wikipedia Category Graph was non trivial. In the below section we detail the challenges involved. Further we explain the transformation that comprises of assigning hierarchical levels to each category in the hierarchy and removing the non-hierarchical nodes.

Preparing Wikipedia Category Graph

Totally, Wikipedia Category graph comprises of 884,838 categories with 2,074,173 category-subcategory relationships among them. Representing and creating it as a directed graph was difficult. We started with Dbpedia category system that is freely available for download. Once we constructed a directed graph of the category system, we realized that around 4K categories were disconnected from the graph. We reaffirmed the disconnections by extracting the category system directly from the, then latest Wikipedia dump<ref>http://en.wikipedia.org/wiki/Wikipedia:Database_download</ref> dated April 2013. In order to keep the category graph uptodate, for each of the disconnected categories we used the Wikipedia API11 to get its hierarchy. We found that, approximately 43% of the disconnected categories had connections to the category system, where as the rest 57% were not present in Wikipedia. A justification we can think of is that the disconnected categories were deleted from Wikipedia after the creation of Dbpedia version 3.8 or the dump of Wikipedia. As shown in Table 1 (first row) we ended up with 4.6M articles, 884,838 Categories with 2.074M links among categories.

Wikipedia administration<ref>http://en.wikipedia.org/wiki/Category:Wikipedia_administration</ref> have included a set of categories in the category structure to maintain Wikipedia9. However, these categories are not necessary and are hence excluded from the category structure. The articles are filtered based on the following strings in their labels<ref>S. P. Ponzetto and M. Strube. Deriving a large scale taxonomy from wikipedia. In Proceedings of the 22nd national conference on Artificial intelligence - Volume 2, AAAI’07, pages 1440–1445. AAAI Press, 2007.</ref>: wikipedia, wikiprojects, lists, mediawiki, template, user, portal, categories, articles, pages. To summarize, around 64,362 categories with 151,732 cat-subcategory relationships were removed from the category graph of Wikipedia. The category structure does not contain any of the hidden categories10 assigned by Wikipedia.

Hierarchy Assignment

Transforming to a hierarchy

Wikipedia Category Graph comprises of cycles and hence determining the abstractness of each category in the graph is non trivial. For example, to know whether Category:Baseball is more abstract than Category:Major League Baseball automatically is not easy in the Wikipedia Category Graph. Therefore, we first use the rooted spanning tree algorithm based on the hypothesis that the closest distance from the root to the category is the hierarchical level (abstractness) of the category. The root node from Wikipedia is the Category:Main Topic Classifications. As shown in the Figure, from the root node (node A) we assign the hierarchical levels of each node based on the distance to the root node. Then we remove the non-hierarchical links. The non hierarchical links are those that are from a specific node to an abstract node. For example: Node E with hierarchical level 4 in the figure has a link to Node B with hierarchical level 2, which means that E is a category of B. This contradicts our hypothesis, hence we delete such links. This outcome of this process is a hierarchy with a height of 15 and 802,194 categories with 1,177,558 links among them.

User Interest Generator

In this module we generate the primitive interests of the user and score them to reflect the extent of user's interests. We use the user's tweets and perform entity resolution on them. The entities (Wikipedia articles) extracted from the tweets are considered to be the interests of the user. Since this is not the main task of our work, we use an external web service. Although our work is modular to use any external web services, we use Zemanta because of its precision (58%) and the rate limit the web service provides (10,000 per day). Details of the evaluation of entity resolution web services can be found in work by Derczynski et.al <ref>L. Derczynski, D. Maynard, N. Aswani, and K. Bontcheva. Microblog-genre noise and impact on semantic annotation accuracy. In Proceedings of the 24th ACM Conference on Hypertext and Social Media, HT ’13, pages 21–30, New York, NY, USA, 2013. ACM.</ref> We have also implemented a dictionary based Trie (Ternary Interval Search Tree) extractor that spots Wikipedia entities. The Trie extractor was implemented to overcome the rate limit problem and do a large scale analysis. However due to the low precision (31%) we decided on using Zemanta.

To score the entities, we perform a simple normalized frequency of the entities mentioned by the user. The normalization is done using the maximum occurring entity in the user's tweets. Therefore each entity will have a interest score between 0-1. We term the interests generated from this module as Primitive Interests.

Interest Hierarchy Generation

Given the primitive interests (User Interest Generator) and the Wikipedia Hierarchy (Hierarchy Preprocessor) , in this module we generate the subgraph (hierarchy) of user interests. The hierarchy comprises of categories with appropriate scores that represent the interest level of the user. First the primitive user interests are linked to appropriate categories in the Wikipedia Hierarchy. Further, it is not difficult to infer the categories of the primitive interests to get the hierarchy. However, the propagation of the scores from the primitive interests to select the most appropriate nodes in the hierarchy is a non trivial problem. In order to solve this problem we implemented an adaptation of the Spreading Activation Theory<ref>M. R. Quilian. Semantic Memory. In: M. Minski (ed.). Semantic Information Processing. MIT Press, Cambridge, MA, 1968.</ref>. We use a breadth based spreading activation where the interest score propagates from the higher hierarchical level nodes to lower hierarchical level nodes constrained by an activation function. We constrained the activation based on different parameters that includes:

  • No Constraint: Propagating the scores of primitive interest up the hierarchy
  • The node distribution in the hierarchy: Since few nodes that have many categories in their child-hierarchical-level received higher scores that were propagated to higher. We used the raw distribution of nodes and log scale distribution to experiment.
  • Preferential Path: Since each category had many categories, we use Wiki category order (left-right) to prioritize the most suitable category.
  • Intersect Booster: Boosting categories that were the interesection of user's primitive interests with an intuition that those categories formed the hubs of interests gained better performance.

The combination of these parameters as activation function are experimented and evaluated as explained in Evaluation


The evaluation was done with three variations of spreading activation theory used in our approach. The three activation functions are as follows;

  • Bell: Normalizes the propagated value based on the number of nodes in the next level of specificity.
  • Bell Log: Builds upon Bell by reducing the impact with Log based reduction.
  • Priority Intersect: Builds upon Bell Log with prioritizing the most suitable categories and boosting the entity-intersecting nodes.

In order to evaluate our system we conducted a users study. Attracting participants for a user study was difficult. Although 40 people agreed to participate in the user study, 37 of those responded. We created a set of the top 50 interest categories that were the outcome of the above mentioned variations of our approach. The users marked yes/no/maybe based on their level of interest for the appropriate categories.

User Study Analysis

The user study stats are shown in the below table. Following the stats we discuss some analysis of the user study data.

Table 1: User Study Stats
Users Tweets Entities Distinct Entities Tweets with Entities Interests Categories
Total 37 31,927 29,146 13,150 16,464 111,535
Average 864 787 355 445 3014

Distribution of Resolved Entities

Figure 1: Accuracy of location prediction

As shown in the above table, we were able to resolve around 30K entities using Zemanta<ref>http://developer.zemanta.com/</ref>. The distribution of these extracted entities are shown in the Figure 1. The graph is a log-log scale with x-axis representing the rank of the entity based on its occurrence and the y-axis represents the frequency of its occurrence among the 37 users. Not surprisingly the distribution follows a power law<ref>Adamic, Lada A. "Zipf, power-laws, and pareto-a ranking tutorial." Xerox Palo Alto Research Center, Palo Alto, CA</ref>.

Users-Tweets-Entities-Categories Distribution

Figure 2: Distribution of Entities from User Study

In the Figure 2, we can see the distribution based on users. The x-axis are users ordered based on the number of tweets (ascending). The secondary y-axis is for Categories in the Hierarchy where as the primary y-axis if for the rest of the series in the graph. The graph comprises of the distribution of the following:

  1. Number of Tweets of each user
  2. Tweets comprising atleast one entity per user
  3. Entities per user
  4. Distinct Entities per user
  5. Categories of Interest generated by our best variation (Priority Intersect)

We can derive some insights from the figure related to our approach:

  • The total number of categories generated by the system does not show a proportional increase with that of the number of tweets with a user. However, we can see a co-relation (better to provide formal evidence) between the distinct entities of each user to that of the number of categories generated for the user. This shows that the number of interest categories inferred from Wikipedia Hierarchy (bigger sub-graph) is based on the diversity of topics the user discusses in his/her tweets.

Results and Analysis

Figure 3: Evaluation of the three activation functions using Graded Precision, Mean Average Precision, Mean Reciprocal Rank

We evaluated our system using standard ranking evaluation metrics such as Precision@k, Mean Average Precision, Mean Reciprocal Rank. Details of the metrics can be found in Manning's Information Retrieval book<ref>Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA. </ref>. Employing these evaluation metrics we answered the following questions:

  1. Precision@k (Modified to Graded Precision): How many relevant/irrelevant Hierarchical Interests are retrieved at top X ranks?
  2. Mean Average Precision: How well are the retrieved relevant Hierarchical Interestsranked at top X?
  3. Mean Reciprocal Rank: How early in the ranked Hierarchical Interests can we find a relevant result?

The results for the above question is illustrated in Figure 3. Our approach with Priority Intersect performs the best compared to the others.

NDCG and MAP: Ranking Evaluations Comparisons

Figure 4: Evaluation of the three activation functions using NDCG, MAP-Y (Maybe considered Yes), MAP-N (Maybe considered No)

Ranking of the Hierarchical Interests for the user can be measured using the following two metrics

  1. Mean Average Precision (MAP)
  2. Normalized Discounted Cumulative Gain (NDCG).

NDCG is usually used for graded results, whereas MAP is used for binary results. Although, we have a graded scale (Yes/No/Maybe), it is non-trivial to empirically assign a value to Maybe. Maybe was marked to a category when it seemed to be too broad an interest for a user. Broadness for a user is subjective and in-turn difficult to evaluate. However, we evaluated all the options regarding MAP and NDCG as shown in Figure 4. The options are as follows:

  • MAP-Y: Mean Average Precision considering Maybe as a relevant interest.
  • MAP-N: Mean Average Precision considering Maybe as a irrelevant interest.
  • NDCG: NDCG is a metric for graded precision. We considered the grade scale between 0-1 where Yes:1, Maybe: 0.5, No: 0.

The results do not vary much, with all illustrating the superior performance of Priority Intersect

Statistical Significance Test

Table 2: Statistical Significance tests (t-tests). PI: Priority Intersect, BL: Bell Log, B: Bell
Comparisons @10 @20 @30 @40 @50
PI vs B 5.56 E-06 6.19 E-06 7.10 E-06 1.28 E-07 8.36 E-07
PI vs BL 0.01 0.006 0.014 0.001 0.0006
BL vs B 0.0007 1.14 E-05 3.95 E-06 3.58 E-06 3.04 E-05

We used two tailed paired t-tests to analyze the statistical significance between the different activation functions. T-test was performed for each combination of activation functions using the GradedPrecision_{yes}@k for each users. In T-Test, values lesser than 0.05 (5%) rejects the null-hypothesis. Therefore, the results from Table 2 shows that there is significant difference between the GradedPrecision_{yes}@k obtained using Priority Interesect to that of the other two activation functions.

Comparison of Hierarchical Level Distribution

Figure 5: Average Frequency of Categories based on Hierarchical Levels at Top-50

In this analysis, we examine the hierarchical levels of the top-50 categories generated by each of our approach. As shown in the Figure 5, we can see that bell-log and priority-intersect has a bell curve where the maximum number of categories that has been spotted are between the hierarchical level 4-7. Whereas Twopics<ref>Michelson, Matthew, and Sofus A. Macskassy. "Discovering users' topics of interest on twitter: a first look." Proceedings of the fourth workshop on Analytics for noisy unstructured text data. ACM, 2010.</ref>, a early implementation to generate user interest categories from Wikipedia, ranks most of the abstract categories higher. As we can see, Twopics has ranked the categories of level 2-3 most in top-50. This is similar to our no-weight no-decay approach.

Another aspect to note is that, Wikipedia has around 36 categories at Level-2 (Most abstract categories of interests). Further, Twopics on an average ranks 50% (18/36) of the most abstract Wikipedia categories (level 2) in top-50 as user interests, which is intuitively unlikely.

Entity Resolution - Error Propagation

Entity resolution on microblogs is a non trivial and ongoing research problem. Due to the dependency on entity resolution, we can presume that this affects the overall performance of the system. In order to confirm this error propagation, we performed a simple experiment by plugging in a dictionary based longest sub-string match (LSS) entity extractor. The dictionary comprised of the Wikipedia entities. We evaluated the performance of LSS extractor using the same dataset with which the other extractors were evaluated in <ref>Leon Derczynski, Diana Maynard, Niraj Aswani, and Kalina Bontcheva. Microblog-genre noise and impact on semantic annotation accuracy. HT '13</ref>. The performance of LSS is as shown in Table 3.

We restricted the evaluation to just six users who had already evaluated our system with Zemanta. The LSS was plugged in to our approach with Priority Intersect activation function. The evaluation results did confirm our intuition with a dip in average precision of 26.6% (0.75 to 0.55) compared to Zemanta.

Table 3: Evaluation of Entity Resolution Techniques
Extractors Precision Recall F-Measure
Dbpedia Spotlight 20.1 47.5 28.3
Text Razor 64.6 26.9 38.0
Zemanta 57.7 31.8 41.0
LSS 31.4 21.2 25.2

Limitations due to Wikipedia & Twitter

Wikipedia contains a lot of generic categories such as Category: Living People, Category:Republics, Category:Buzzwords that were present in our ranked Hierarchical Interests. These were marked as a 'No' by users in the evaluation. Although this limitation is partially addressed by weighing the relevance of \textit{Hierarchical Interests} as edge weights using Priority Path Constraint, it still persists in the results. One solution we experimented is by constraining the activation based on the out-degree (number of children) of a Hierarchical Interests. Although, this did help in reducing the number of generic Hierarchical Interests, the overall results were negatively impacted. Addressing this limitation is a task for our future work.

Another type of Wikipedia designed categories are of the form Category:YYYY Births, Category:YYYY Introductions where YYYY is the year. These as Hierarchical Interests were consistently and understandably marked irrelevant by the users. We did not remove these Categories due to the impact on the category structure of Wikipedia for our approach.

Similar to the Wikipedia Structure, the informal language used in tweet text did play a role in ranking some possibly irrelevant Hierarchical Interests. Some examples of which are Category:Text Messaging, Category:SMS, Category:Years etc.

Lack of Content

Our user study involves 7 users with lesser than 100 tweets, of which 5 users had lesser than 50 tweets. If we remove these users, our approach performs significantly better. The Average Precision (37 Users) at rank 10 for Priority Intersect activation function improved from 0.76 to 0.81. Due to low number of tweets, the frequency and hence the scoring of Primitive Interests had no significant impact on prioritizing interests. Therefore, this propagated in generating a HIG with not so accurate results.


Wikipedia Category System

Dataset Used in our approach: In order to generate the Wikipedia Category Graph, we first started with the Dbpedia Category System the can be downloaded from the following links Dbpedia-Category-System, Dbpedia-Instance-Categories with details on Dbpedia 39 downloads. The Dbpedia-Category-System comprises of links between categories and the Dbpedia-Instance-Categories comprises of links between Wikipedia Articles and Categories. Once we found that the Dbpedia-Category-System is incomplete, we used the Wikipedia Dump to re-confirm the disconnected categories. The Wikipedia Dump can be downloaded from this link with details on wikipedia database page. Finally we used the Wikipedia API to complete the Wikipedia Category System. Further, this Wikipedia Category Graph was used to generate the Wikipedia Hierarchy that forms the base of each Hierarchical Interest Graph generated by our approach.

Dataset Generated by the approach: The list of Wikipedia admin categories that are irrelevant in Wikipedia Category System can be downloaded here. The admin categories constituted to around 67K categories. Further, the Wikipedia Hierarchy generated by our approach with priority of each links can be downloaded as a tab separated file. The format is <subcategory><tab><category><tab><priority>.

Entity Resolution Evaluation Dataset

The evaluation of entity resolution, specifically for the Trie extractor was done using the dataset created by Meij et al<ref> Edgar Meij, Wouter Weerkamp, and Maarten de Rijke. 2012. Adding semantics to microblog posts. In Proceedings of the fifth ACM international conference on Web search and data mining (WSDM '12). ACM</ref>.


<references />