Understanding User's Geographic Context
Much of the work on social media content analysis seeks to find out what people do and what they think. Event detection and tracking focuses on "what people do", while sentiment analysis aims at understanding "what people think". Information about "where" can play an important role in both tasks. Location is one of the basic descriptors of an event: in many scenarios, “where it happens” is no less important than “what happens”, so capturing location information can be crucial for identifying events. The relation between location and sentiment (or emotion/mood) is less obvious than the relation between location and events. Intuitively, people at Disneyland are expected to be happier than people in a hospital. Does this suggest that location can serve as a cue for sentiment analysis? Before digging into events, sentiment, and how location information can help, it is necessary to understand the user’s geographic context.
A user’s geographic context consists of the various pieces of location information about the user. Generally, there are three kinds: the locations mentioned in the content generated by the user, the locations attached to the content as tags, and the location provided in the user’s profile. Take Twitter as an example. The locations mentioned in tweets can be any locations relevant to the user and do not necessarily indicate where the user currently is. The geo-tag of a tweet indicates where the tweet was sent from (i.e., where the user was when the tweet was sent). The location given in the user’s profile usually suggests where the user lives. I am interested in how these different kinds of locations relate to each other, and how they can help to disambiguate or even predict each other.
At this stage, I focus on identifying and disambiguating the locations mentioned in the content. As noted above, the locations mentioned in the content are not necessarily related to where the user is. For example, people might discuss a news event that happened on another continent, or a planned trip to another city. In such cases, to what extent can the other kinds of location information help with the disambiguation? There are also different approaches to location disambiguation (toponym resolution). Without any context, it is still possible to resolve place names using prior knowledge, such as the popularity or population of the candidate locations. Context information can serve as a constraint that narrows down the search space. Two common ways to use context information are: (1) use the relations between the context and the candidate locations, for example, the “is_in” relation can be used to disambiguate “Dayton” given the context “Ohio”; and (2) minimize the average (or sum) of the pairwise distances between the candidate locations. In this work, I investigate the performance of these different approaches.
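The context-free option can be illustrated with a minimal Python sketch. It assumes that each candidate carries a population figure and simply picks the most populous one. The `Candidate` class, the function name, and the population values are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str        # toponym surface form, e.g. "Dayton"
    is_in: str       # containing region (hypothetical field)
    population: int  # illustrative value, not real census data


def resolve_by_population(candidates: List[Candidate]) -> Optional[Candidate]:
    """Context-free baseline: return the most populous candidate, if any."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.population)


# Toy candidates for the ambiguous toponym "Dayton".
daytons = [
    Candidate("Dayton", "Ohio", 140_000),
    Candidate("Dayton", "Tennessee", 7_000),
    Candidate("Dayton", "Nevada", 9_000),
]

best = resolve_by_population(daytons)
print(best.is_in)
```

Such a prior always returns the same answer for a given toponym, which is why context is needed whenever a text refers to one of the smaller places.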
I use LinkedGeoData as the background knowledge for location entity spotting. LinkedGeoData uses the information collected by the OpenStreetMap project and makes it available as an RDF knowledge base according to the Linked Data principles. I use the RelevantNodes dataset, which contains 66 million triples. LinkedGeoData provides a SPARQL endpoint for online access, but I chose to download the dataset and host it on our own Virtuoso server, so the data can now be accessed from our own SPARQL endpoint.
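As a rough sketch of how candidates might be fetched from such an endpoint, the Python snippet below builds (but does not send) a SPARQL request. Several things here are assumptions rather than details from this work: the endpoint URL is a placeholder for a local Virtuoso instance, nodes are assumed to carry `rdfs:label` plus WGS84 `geo:lat` and `geo:long` properties, and labels are matched as plain literals, so language-tagged labels would need a different pattern.

```python
import urllib.parse
import urllib.request

# Hypothetical local endpoint; replace with the actual server address.
ENDPOINT = "http://localhost:8890/sparql"

# Assumed vocabulary: rdfs:label for names, WGS84 geo:lat / geo:long for coordinates.
QUERY_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?node ?lat ?long WHERE {{
  ?node rdfs:label "{label}" ;
        geo:lat ?lat ;
        geo:long ?long .
}}
LIMIT 50
"""


def candidate_query(toponym: str) -> str:
    """Return a SPARQL query listing nodes whose label equals the toponym."""
    safe = toponym.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.format(label=safe)


def build_request(toponym: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Prepare an HTTP GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": candidate_query(toponym)})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


req = build_request("Dayton")
print(candidate_query("Dayton"))
```

Passing `req` to `urllib.request.urlopen` would execute the query against a running endpoint. The sketch stops short of that so it can be read and run without a server.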
Toponyms (place names) are identified using linguistic heuristics (e.g., place names are usually capitalized in text) together with knowledge from LinkedGeoData. The identified toponyms can be ambiguous. For example, “Washington” could be the name of a person or the name of a state. This type of ambiguity is referred to as geo/non-geo ambiguity. In this work, I focus on another type, namely geo/geo ambiguity, in which one toponym could refer to several different geographic locations. For each extracted toponym, I obtain from LinkedGeoData the possible geographic locations it may refer to as candidates. The disambiguation algorithm aims to assign one of these candidate locations to the toponym.
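The spotting step could look roughly like the following Python sketch, which combines the capitalization heuristic with a dictionary lookup. The small `GAZETTEER` dictionary stands in for LinkedGeoData lookups and is entirely hypothetical, as are the function and variable names. The actual heuristics used in this work may differ.

```python
import re
from typing import Dict, List

# Toy gazetteer standing in for LinkedGeoData lookups (hypothetical data).
GAZETTEER: Dict[str, List[str]] = {
    "Dayton": ["Dayton, Ohio", "Dayton, Tennessee"],
    "Ohio": ["Ohio (state)"],
    "New York": ["New York (state)", "New York City"],
}

# One or more capitalized words separated by single spaces.
CAPITALIZED_RUN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")


def spot_toponyms(text: str) -> Dict[str, List[str]]:
    """Map each spotted toponym to its candidate locations."""
    found: Dict[str, List[str]] = {}
    for match in CAPITALIZED_RUN.finditer(text):
        words = match.group().split()
        i = 0
        while i < len(words):
            # Try the longest phrase first so "New York" wins over "York".
            for j in range(len(words), i, -1):
                phrase = " ".join(words[i:j])
                if phrase in GAZETTEER:
                    found[phrase] = GAZETTEER[phrase]
                    i = j
                    break
            else:
                i += 1  # no gazetteer entry starts at this word
    return found


print(spot_toponyms("Driving from Dayton to New York via Ohio tomorrow"))
```

Capitalized words without a gazetteer entry, such as the sentence-initial "Driving", are discarded. Words such as “Washington” still pass this filter, so geo/non-geo ambiguity is not resolved by this step.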
Four different types of context in social media (e.g., Twitter) can be used for the disambiguation: (1) the other place names in the same piece of text, namely the local context; (2) the place names in other pieces of text generated by the same user, namely the global context; (3) the geo-tag attached to the text; and (4) the user location in the profile. Toponyms are extracted from each type of context in the same way as described above. I use the four types separately to investigate their performance. As discussed earlier, there are different approaches to disambiguation, based on prior knowledge (without context), relations, and distances. I apply them in the following steps with each type of context. First, for all the toponyms (the target and the context), I obtain the candidate geo-locations from LinkedGeoData, together with their type, population, is_in, latitude, and longitude. Second, I narrow down the search space according to the is_in relations. If there are is_in relations between some candidates of the target and some candidates of the context (e.g., “Dayton” and “Ohio”), or if some candidates of the target and some candidates of the context share the same is_in relation (e.g., “Dayton” and “Columbus”), only those candidates are kept and the others are removed. Third, for each candidate of the target toponym, I calculate the minimum average of the pairwise distances from it to the candidates of all the context toponyms, and select the candidate that minimizes this value. I set a threshold of 500 miles on the minimum distance, since a match makes little sense if the two locations are too far away from each other. If no solution is obtained after these three steps, the type and population information is used to make the choice, for example by preferring locations of type city or locations with larger populations.
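The cascade described above can be sketched in Python as follows. This is one possible reading of the steps, not the exact implementation: is_in is simplified to a single parent region per place, only the target candidates are filtered in the second step, the distance score is taken to be the average over context toponyms of the distance to the nearest candidate, and the fallback uses population only. All class and function names are hypothetical, coordinates are approximate, and population values are illustrative.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

MAX_MILES = 500.0  # distance threshold mentioned in the text


@dataclass(frozen=True)
class Place:
    name: str
    is_in: str        # simplified to a single parent region (assumption)
    lat: float
    lon: float
    population: int   # illustrative figure, not real data


def miles(a: Place, b: Place) -> float:
    """Great-circle (haversine) distance between two places, in miles."""
    radius = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


def filter_by_is_in(target: List[Place], context: Dict[str, List[Place]]) -> List[Place]:
    """Keep target candidates that lie in a context place or share its parent."""
    ctx = [c for cands in context.values() for c in cands]
    return [t for t in target
            if any(t.is_in == c.name or t.is_in == c.is_in for c in ctx)]


def pick_by_distance(target: List[Place], context: Dict[str, List[Place]]) -> Optional[Place]:
    """Pick the candidate with the smallest average nearest-candidate distance."""
    best, best_score = None, float("inf")
    for t in target:
        nearest = [min(miles(t, c) for c in cands)
                   for cands in context.values() if cands]
        if not nearest:
            continue
        score = sum(nearest) / len(nearest)
        if score < best_score:
            best, best_score = t, score
    # Reject the match when even the best candidate is beyond the threshold.
    return best if best_score <= MAX_MILES else None


def disambiguate(target: List[Place],
                 context: Dict[str, List[Place]]) -> Tuple[Optional[Place], str]:
    """Run the cascade and report which step made the decision."""
    if not target:
        return None, "none"
    related = filter_by_is_in(target, context)
    if len(related) == 1:
        return related[0], "is_in"
    pool = related or target
    choice = pick_by_distance(pool, context)
    if choice is not None:
        return choice, "distance"
    return max(pool, key=lambda p: p.population), "prior"


# Toy candidates with approximate coordinates.
dayton = [
    Place("Dayton", "Ohio", 39.76, -84.19, 140_000),
    Place("Dayton", "Tennessee", 35.49, -85.01, 7_000),
]
ohio = {"Ohio": [Place("Ohio", "United States", 40.4, -82.9, 11_000_000)]}
chattanooga = {"Chattanooga": [Place("Chattanooga", "Hamilton County", 35.05, -85.31, 180_000)]}
seattle = {"Seattle": [Place("Seattle", "King County", 47.61, -122.33, 700_000)]}

for ctx in (ohio, chattanooga, seattle):
    place, method = disambiguate(dayton, ctx)
    print(place.is_in, method)
```

With these toy inputs, the context “Ohio” is resolved by the is_in step, “Chattanooga” by distance (the Tennessee candidate is well within 500 miles), and “Seattle” falls through to the population prior because both candidates are too far away.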
I collected data from Twitter for the experiments, using the Twitter Streaming API in two stages: (1) in the first stage, the most common place names in the U.S. (according to the Wikipedia page) were used as keywords to query Twitter; (2) once 5,000 user IDs had been collected from the first stage, the second stage began, in which these 5,000 users were tracked and their profiles and tweets were obtained. Over a one-week period, a total of 2,187,205 tweets were collected, of which 7.12% (155,705) mention place names and 1.97% (42,988) have geo-tags; 57.36% (2,868) of the users provide location information in their profiles (which might be invalid).
As a preliminary evaluation, a small test set of 100 tweets was manually labeled as the gold standard. Each tweet in the gold standard mentions at least one place name and has at least one type of context available. To assign a geo-location to the place name in each tweet, the annotator had to take all available context information into consideration. To be unambiguous, each geo-location is represented as a node in LinkedGeoData. I ran the disambiguation algorithm on the test set with each type of context separately. The results are summarized in the following figures.
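A minimal Python sketch of how such an evaluation might be tallied is given below, assuming each prediction records the context type, the method that made the decision, and the predicted node. The record layout, the function names, and the placeholder node identifiers are all hypothetical. The toy records are not the experimental data.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Optional


class Prediction(NamedTuple):
    tweet_id: str
    context: str          # e.g. "local", "global", "geo_tag" or "profile"
    method: str           # e.g. "is_in", "distance" or "prior"
    node: Optional[str]   # predicted node identifier (placeholder strings here)


def accuracy_by_context(preds: List[Prediction], gold: Dict[str, str]) -> Dict[str, float]:
    """Fraction of correctly resolved toponyms for each context type."""
    correct, total = Counter(), Counter()
    for p in preds:
        if p.tweet_id not in gold:
            continue  # skip tweets without a gold label
        total[p.context] += 1
        correct[p.context] += p.node == gold[p.tweet_id]
    return {ctx: correct[ctx] / total[ctx] for ctx in total}


def method_share(preds: List[Prediction], context: str) -> Dict[str, float]:
    """Share of decisions made by each method within one context type."""
    methods = Counter(p.method for p in preds if p.context == context)
    n = sum(methods.values())
    return {m: c / n for m, c in methods.items()} if n else {}


# Toy records for illustration only.
gold = {"t1": "node/A", "t2": "node/B"}
preds = [
    Prediction("t1", "local", "is_in", "node/A"),
    Prediction("t2", "local", "distance", "node/C"),
    Prediction("t1", "geo_tag", "distance", "node/A"),
]

print(accuracy_by_context(preds, gold))
print(method_share(preds, "local"))
```

The first function corresponds to a per-context accuracy and the second to the share of tweets resolved by each method within one context type.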
Figure 1 shows the percentage of tweets disambiguated by each method with each type of context. With the local context (place names in the same tweet), more than half of the toponyms were disambiguated by is_in relations, which suggests that is_in relations are very likely to exist among place names in the same tweet. For the global context (place names mentioned in other tweets by the same user) and the geo-tags attached to the tweets, distance contributes most to the disambiguation. This is not difficult to understand: compared with the local context, these other kinds of context tend to be at the same level of the hierarchy as the target (e.g., two different cities, rather than a city and a state). Surprisingly, with the user location as context, the majority of toponyms were disambiguated by prior knowledge (i.e., type and population) rather than by the user location itself. This might suggest that the user location information is of low quality, or that users talk less about what happens around them and more about what happens elsewhere.
Figure 2 shows the accuracy of the results obtained with the different contexts and methods. As the baseline, I applied only prior knowledge about type and population to resolve each toponym, without any context information. The baseline accuracy is 57%. It is again surprising that the baseline performed better than the global context (accuracy 51.32%), so the global context does not appear to be a good disambiguator. The best performance, 75.56%, is achieved with the local context, and it is only 1.14% better than using the geo-tags. This makes sense. Intuitively, place names mentioned close to each other in a text are very likely to be related to each other, and the place names mentioned in a tweet are very likely to be the places where the tweet was sent from. From the method perspective, is_in relations stand out as the most reliable means of disambiguation.
Conclusion and Future Work
In this project, I explored the problem of toponym resolution (location disambiguation) in social media and investigated the effectiveness of different kinds of context and different methods for the disambiguation task. Several findings could be helpful for my future work. According to the experiments, place names mentioned in the same piece of text and the geo-tags attached to the text turn out to be the most useful context, and is_in relations are the most effective way to disambiguate in-text place names. Whichever context and method are used, background knowledge is the foundation of disambiguation. I am currently working on event-centric opinion summarization, and I will further investigate how locations can help with this task.