Location Prediction of Twitter Users
Location prediction of Twitter users using Wikipedia as a knowledge-base
Existing approaches to predicting the location of a Twitter user can be broadly grouped into two categories:
- Network based solutions
- Content based solutions
Our approach comprises three primary components:
- Knowledge Base Generator extracts local entities for each city from Wikipedia and scores them based on their relevance to the city
- User Profile Generator extracts the Wikipedia entities from the tweets of a user
- Location Predictor uses the output of Knowledge Base Generator and User Profile Generator to predict the location of a user
Knowledge Base Generator
We use the following four measures to score the local entities of a city, with respect to the city:
- Pointwise Mutual Information
In information theory, pointwise mutual information measures the association between two specific outcomes of random variables: how much more often they co-occur than would be expected if they were independent. We use this idea to determine the association between a city and its local entities.
- Betweenness Centrality
We build a directed graph for each city using its internal links. The internal links correspond to the nodes of the graph. Whenever the Wikipedia page of one local entity links to the page of another, we draw an edge from the former to the latter. For example, in the graph of New York City, an edge from Statue of Liberty to Manhattan indicates a link from the Wikipedia page of Statue of Liberty to the Wikipedia page of Manhattan. The betweenness centrality of each node (representing a local entity) gives the importance of that node relative to the other nodes in the graph.
- Semantic Overlap Measures
We use the hyperlink structure of Wikipedia to compute the semantic relatedness of a city and its local entities. We use the following set-based measures to compute the semantic overlap between a city and its local entities:
- Jaccard Index is a symmetric, set-based measure that defines the similarity of two sets in terms of their overlap, normalized by their sizes. We use this measure to find the similarity between a city and its local entities.
- Tversky Index is an asymmetric similarity measure of two sets. The Jaccard Index treats a city and a local entity symmetrically, but a local entity generally represents only a part of the city. We therefore also use the Tversky Index as a unidirectional measure of the similarity of the local entity to the city.
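The four measures above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: the probabilities passed to `pmi` are assumed to be estimated elsewhere, the sets passed to `jaccard` and `tversky` are assumed to be the link sets of the two Wikipedia pages, the Tversky weights `alpha` and `beta` are illustrative defaults rather than values from the source, and `betweenness` uses Brandes' algorithm on an unweighted graph without normalization.

```python
import math
from collections import deque


def pmi(p_joint, p_entity, p_city):
    """Pointwise mutual information (base 2) of an entity/city pair."""
    return math.log2(p_joint / (p_entity * p_city))


def jaccard(a, b):
    """Size of the intersection over size of the union; symmetric in a and b."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tversky(entity_links, city_links, alpha=1.0, beta=0.0):
    """Tversky index of the entity with respect to the city.

    alpha weights links found only on the entity page, beta weights links
    found only on the city page. The defaults are assumptions for this sketch.
    """
    common = len(entity_links & city_links)
    only_entity = len(entity_links - city_links)
    only_city = len(city_links - entity_links)
    denom = common + alpha * only_entity + beta * only_city
    return common / denom if denom else 0.0


def betweenness(graph):
    """Unnormalized betweenness centrality of a directed, unweighted graph.

    graph maps each node to a collection of distinct nodes it links to.
    """
    nodes = set(graph) | {w for targets in graph.values() for w in targets}
    centrality = dict.fromkeys(nodes, 0.0)
    for source in nodes:
        # Breadth-first search that counts shortest paths from source.
        order = []
        preds = {v: [] for v in nodes}
        sigma = dict.fromkeys(nodes, 0)
        dist = dict.fromkeys(nodes, -1)
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    return centrality
```

With these weights, `tversky` ignores links that appear only on the city page, so an entity whose links are fully contained in the city's links scores 1 even when the city page has many more links. Swapping the two arguments generally gives a different value, which is the asymmetry the Jaccard Index lacks.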
User Profile Generator
To use the local entities from our knowledge base to predict a user's location, we need to map the entities in the user's tweets to Wikipedia articles. Linking entities in tweets to Wikipedia articles has been well researched. It involves linking the named entities mentioned in tweets to the corresponding real-world entities in Wikipedia. We use Zemanta for this task. We chose Zemanta because of its relatively superior performance and the rate limit extension (10,000 requests per day) provided for research purposes.
Location Predictor
To predict the location of a user, we compute a score for each city whose local entities overlap with the entities found in the user's tweets. The score is the product of the local entity's score with respect to the city and the frequency of occurrence of the local entity in the user's tweets. We then rank the cities by score in descending order and predict the top k cities for the user.
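The scoring and ranking step can be sketched as follows. This is a hedged illustration only: the function name `predict_cities`, the shape of `knowledge_base` (a dictionary from city to per-entity scores), and the choice to sum the products over all overlapping entities of a city are assumptions made for this example, not details confirmed by the text.

```python
from collections import Counter, defaultdict


def predict_cities(user_entities, knowledge_base, k=5):
    """Return the top-k cities for one user.

    user_entities: entity names found in the user's tweets, with repeats.
    knowledge_base: dict mapping city -> {local entity: score for that city}.
    """
    frequency = Counter(user_entities)
    city_scores = defaultdict(float)
    for city, entity_scores in knowledge_base.items():
        for entity, count in frequency.items():
            if entity in entity_scores:
                # Assumption: products of overlapping entities are summed.
                city_scores[city] += entity_scores[entity] * count
    # Cities with no overlapping entity never enter city_scores.
    ranked = sorted(city_scores.items(), key=lambda item: item[1], reverse=True)
    return [city for city, _ in ranked[:k]]


# Hypothetical scores, for illustration only.
example_kb = {
    "New York City": {"Manhattan": 0.9, "Statue of Liberty": 0.8},
    "San Francisco": {"Golden Gate Bridge": 0.9},
    "Chicago": {"Willis Tower": 0.7},
}
example_entities = ["Manhattan", "Manhattan", "Golden Gate Bridge"]
top_cities = predict_cities(example_entities, example_kb, k=2)
```

Here New York City scores 0.9 × 2 and San Francisco scores 0.9 × 1, so New York City ranks first; Chicago has no overlapping entity and is not ranked.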
We conducted our experiments on the test data set created by Cheng et al. This data set was created in 2010 and contains 5119 active users from the continental United States, with more than 1000 tweets per user. User locations are listed as latitude and longitude coordinates, which are generally more reliable than the location information in Twitter profiles. To create the knowledge base, we used all the cities listed in the 2012 US Census with a population estimate greater than 5000. We extracted the hyperlink structure of Wikipedia using the XML dump. The resulting knowledge base contains 4661 cities and 500714 local entities.
We evaluated our approach using two metrics: Accuracy and Average Error Distance. Accuracy (ACC) is the percentage of users identified within 100 miles of their actual location. Error distance is the distance between the actual location of the user and the location estimated by our algorithm. Average Error Distance (AED) is the mean error distance across all users in the dataset.
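Both metrics follow directly from the per-user error distances. The sketch below assumes that distances are great-circle distances computed with the haversine formula and that "within 100 miles" includes exactly 100 miles; the source does not state either, and the names `haversine_miles` and `evaluate` are illustrative.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def evaluate(actual, predicted, radius_miles=100.0):
    """Return (ACC in percent, AED in miles) for paired lists of coordinates."""
    errors = [haversine_miles(a, p) for a, p in zip(actual, predicted)]
    acc = 100.0 * sum(e <= radius_miles for e in errors) / len(errors)
    aed = sum(errors) / len(errors)
    return acc, aed
```

Calling `evaluate` with a smaller `radius_miles` gives the accuracy at that radius, which is how a curve of accuracy against radius could be produced from the same error distances.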
Table 1 shows the results of our approach, based on ranking of local entities using Pointwise Mutual Information (PMI), Betweenness Centrality (BC), Jaccard Index (JI) and Tversky Index (TI). The results show that the local entities ranked using Tversky Index are the most accurate in predicting the location of a user. Our approach also performs better than two other approaches tested on the same dataset. Cheng et al. <ref>Cheng, Zhiyuan, James Caverlee, and Kyumin Lee. "You are where you tweet: a content-based approach to geo-locating twitter users." Proceedings of the 19th ACM international conference on Information and knowledge management. ACM, 2010.</ref> reported 51% accuracy and an average error distance of 535.564 miles. Chang et al. <ref>Chang, Hau-wen, et al. "@ Phillies Tweeting from Philly? Predicting Twitter User Locations with Spatial Word Usage." Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012). IEEE Computer Society, 2012.</ref> reported 49.9% accuracy and an average error distance of 509.3 miles.
|Method||ACC (%)||AED (in Miles)||ACC@2||ACC@3||ACC@5|
Figure 1 shows the accuracy of our algorithm at different radii (in miles). As shown, we can locate 27% of the users within 10 miles of their actual location.
We also applied our algorithm to users in the 100 most populated cities of the United States. In the test dataset, there are 2172 users from these cities. We were able to locate 54.65% of these users exactly at the city level. Furthermore, we were able to locate 60.63% of these users within 50 miles of their actual location.
Figure 2 shows the local entities of San Francisco scored using Tversky Index.
Source Code is available at