From Knoesis wiki

Revision as of 16:07, 4 October 2010

Continuous Semantics to Analyze Real-Time Data

Amit Sheth, Christopher Thomas, and Pankaj Mehra • Wright State University

We’ve made significant progress in applying semantics and Semantic Web technologies in a range of domains. A relatively well-understood approach to reaping semantics’ benefits begins with formal modeling of a domain’s concepts and relationships, typically as an ontology. Then, we extract relevant facts — in the form of related entities — from the corpus of background knowledge and use them to populate the ontology. Finally, we apply the ontology to extract semantic metadata or to semantically annotate data in unseen or new corpora. Using annotations yields semantics-enhanced experiences for search, browsing, integration, personalization, advertising, analysis, discovery, situational awareness, and so on.<sup>1</sup>

This typically works well for domains that involve slowly evolving knowledge concentrated among deeply specialized domain experts and that have definable boundaries. A good example is the US National Center for Biomedical Ontologies, which has approximately 200 ontologies used for annotations, improved search, reasoning, and knowledge discovery. Concurrently, major search engines are developing and using large collections of domain-relevant entities as background knowledge, to support semantic or facet search.

However, this approach has difficulties dealing with dynamic domains involved in social, mobile, and sensor webs. Here, we look at how continuous semantics can help us model those domains and analyze the related real-time data.
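The three-step pipeline described above (model the domain, populate it with entities, annotate new text) can be sketched minimally in plain Python. The ontology, instances, and sample sentence below are invented for illustration and are not from the article:

```python
# A toy version of the pipeline: (1) model a domain's concepts and relations,
# (2) populate the model with extracted entity instances, (3) use it to
# annotate unseen text with semantic metadata. All names are hypothetical.

# Step 1: a tiny "ontology": concepts and a relation between them.
ontology = {
    "concepts": {"Earthquake", "Location", "Date"},
    "relations": [("Earthquake", "occurredIn", "Location")],
}

# Step 2: populate it with entity instances drawn from background knowledge.
instances = {
    "Haiti earthquake": "Earthquake",
    "Port-au-Prince": "Location",
}

def annotate(text, instances):
    """Step 3: tag mentions of known entities with their concept type."""
    annotations = []
    for entity, concept in instances.items():
        if entity.lower() in text.lower():
            annotations.append((entity, concept))
    return sorted(annotations)

doc = "Rescue teams reached Port-au-Prince days after the Haiti earthquake."
print(annotate(doc, instances))
# → [('Haiti earthquake', 'Earthquake'), ('Port-au-Prince', 'Location')]
```

In a real system the domain model would be expressed in RDF/OWL and the annotator would be an NLP entity spotter; the dictionary version above only illustrates the flow.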

<span style="font-size:12pt;color:purple">The Challenge of Modeling Dynamic Domains</span>

Increasingly popular social, mobile, and sensor webs exhibit five characteristics. First, they’re spontaneous (arising suddenly). Second, they follow a period of rapid evolution, involving real-time or near real-time data, which requires continuous searching and analysis. Third, they involve many distributed participants with fragmented and opinionated information. Fourth, they accommodate diverse viewpoints involving topical or contentious subjects. Finally, they feature context colored by local knowledge as well as perceptions based on different observations and their sociocultural analysis.

<b>Minimizing the Need for Commitment</b>

The formal modeling of ontologies for such evolving domains or events is infeasible for two reasons. First, we don’t have many starting points (existing ontologies). Second, a diverse set of users or participants will have difficulty committing to the shared worldview we’re attempting to model. Modeling a contentious topic might lead to rejection of the ontology or failure to achieve common conceptualization. On one hand, users often agree on a domain’s concepts and entities, such as the lawmakers involved in drafting a bill, the bill’s topic, an earthquake’s spatial location, and key dates. On the other hand, users often contest the interpretation of how these entities are related, even taxonomically.

So, models that require less commitment are preferable. Models that capture changing conceptualizations and relevant knowledge offer continuous semantics to improve understanding and analysis of dynamic, event-centric activities and situations.

To build domain models for these situations, we must pull background knowledge from trusted, uncontroversial sources. Wikipedia, for instance, has shown that it is possible to collaboratively create factual descriptions of entities and events even for contentious topics such as abortion. Wikipedia articles show information agreed upon by most contributors. Separate discussion pages show how the contributors resolved disagreements to arrive at a factual, unbiased description. Such wide agreement combined with a category structure and link graph makes Wikipedia an attractive candidate for knowledge extraction. That is, we can harvest the wisdom of the crowds, or collective intelligence, to build a folksonomy — an informal domain model.
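As a hedged illustration of harvesting such a category structure, the sketch below walks a toy in-memory article-to-category graph breadth-first to collect broader terms for an informal domain model. The links shown stand in for data one might fetch from Wikipedia's real category graph; titles and depth are illustrative only:

```python
# A toy stand-in for Wikipedia's article->category links. A real harvester
# would query the live category graph instead of this hard-coded dict.
category_links = {
    "2010 Haiti earthquake": ["Earthquakes in Haiti", "2010 natural disasters"],
    "2009 Iranian presidential election": ["Elections in Iran", "2009 elections"],
    "Earthquakes in Haiti": ["Natural disasters in Haiti"],
}

def folksonomy_terms(article, links, depth=2):
    """Walk category links breadth-first, collecting ever-broader terms.

    The collected terms form a lightweight, informal domain model
    (folksonomy) around the seed article.
    """
    seen, frontier = [], [article]
    for _ in range(depth):
        frontier = [c for page in frontier for c in links.get(page, [])]
        seen.extend(frontier)
    return seen

print(folksonomy_terms("2010 Haiti earthquake", category_links))
# → ['Earthquakes in Haiti', '2010 natural disasters', 'Natural disasters in Haiti']
```

This requires no agreement on formal relations among contributors: the category graph already reflects the community's loose consensus, which is exactly the low-commitment model the text argues for.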

<b>Anticipating What We’ll Want to Know</b>

Traditional conceptual modeling is also inadequate for dynamic domains owing to their topicality. News, blogs, and microblog posts deliver descriptions of events in nearly real time. Twitter, for example, delivers information as short “tweets” about events as they unfold. Only a model with social media as its knowledge source will be up-to-date when modeling events that are unfolding in a similar medium. A domain model that doesn’t significantly lag behind the actual events is crucial for accurate classification, which will result in maximum information gain.

The past few years have seen explosive growth in services offering up-to-date and, in many cases, real-time data. Leading the way is Twitter and a variety of social-media services (see http://gnip.com/sources), followed by blogs and traditional news media. We want to be the first to know about change — ideally, before it happens, or at least shortly after. The paradigm for information retrieval is thus, “What will you want to know tomorrow?”

A recent paper showed success in predicting German election results using tweets.<sup>3</sup> However, there is more to elections than just the results. An event or situation can be multifaceted and can be spatially, temporally, and thematically sliced and analyzed. For example, you could time-slice the 2009 Iranian election discussion on Twitter into events surrounding election campaign rallies and protests (starting 12 June), Mahmoud Ahmadinejad’s victory speech (14 June), the decision to recount (16 June), Ayatollah Khamenei’s endorsement of Ahmadinejad’s win (19 June), Neda’s brutal killing (22 June), and so on.

An approach to Web document search that can leverage billions of documents to deliver useful patterns<sup>4</sup> probably won’t be very useful here.
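The time-slicing idea can be illustrated with a small sketch that buckets posts into named event windows. The window dates follow the 2009 Iranian election timeline given in the text; the sample posts and window labels are invented:

```python
from datetime import date

# Event windows from the 2009 Iranian election timeline, as (start, label).
# Each window runs until the next one begins.
windows = [
    (date(2009, 6, 12), "rallies and protests"),
    (date(2009, 6, 14), "victory speech"),
    (date(2009, 6, 16), "decision to recount"),
    (date(2009, 6, 19), "endorsement of the win"),
    (date(2009, 6, 22), "Neda's killing"),
]

def slice_event(ts, windows):
    """Assign a post to the latest window that started on or before it."""
    label = None
    for start, name in windows:
        if ts >= start:
            label = name
    return label

posts = [(date(2009, 6, 15), "crowds in Tehran"), (date(2009, 6, 20), "recount?")]
print([slice_event(ts, windows) for ts, _ in posts])
# → ['victory speech', 'endorsement of the win']
```

Spatial and thematic slicing would work the same way, keyed on location or topic instead of timestamp.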
Our challenge involves extracting signals from thousands of tweets or posts (that is, a small corpus) containing informal text.<sup>5</sup> Furthermore, the discussion focus will often shift frequently, with new knowledge or facts generated along with the events. For example, regarding a natural disaster, the focus could shift from rescue to recovery. So, we’re intrigued by the possibility of dynamic model extraction that can be tied to a situation’s context and can keep up with context shifts (for example, response and rescue to recovery and, later, rehabilitation). We would like to use such an extracted model to organize (search, integrate, analyze, or even reason about) data relating to real-time discourse or relating to dynamic, event-centric activities and situations.

Traditional classification approaches based on corpus learning or user input can only react to domain changes. More recently, however, we find that social-knowledge aggregation sites such as Wikipedia quickly contain descriptions of events, emergent situations, and new concepts. For example, for some recent events such as US Representative Joe Wilson’s “You lie!” outburst, the Mumbai terrorist attack, and the Haiti earthquake, anchor pages with significant details were available in less than an hour to less than a day. Furthermore, these pages continued to evolve as the event or situation unfolded. Technology lets us create snapshots of this evolution. So, if automatic techniques can tap such social knowledge to create a model, we can gain the ability to better understand the more unruly informal text that largely constitutes real-time data.
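A minimal sketch of coping with such context shifts, under the assumption that the classification model is a simple keyword set per phase which harvested social knowledge can update in place (all terms are illustrative, not the article's method):

```python
def classify(post, model):
    """Score a post against each phase's keyword set; return the best phase."""
    scores = {phase: sum(w in post.lower() for w in words)
              for phase, words in model.items()}
    return max(scores, key=scores.get)

# Phase vocabularies for a hypothetical natural-disaster situation.
model = {
    "rescue": {"trapped", "search", "survivors"},
    "recovery": {"rebuild", "shelter", "aid"},
}

print(classify("Search teams found survivors", model))  # → rescue

# As the situation unfolds, newly harvested social knowledge refreshes the
# model in place, so the classifier tracks the shift in focus.
model["recovery"] |= {"reconstruction"}
print(classify("Reconstruction funding approved", model))  # → recovery
```

The point is not the (deliberately crude) scoring but the update step: a model fed continuously from social-knowledge sources can follow the discourse, whereas a model trained once on a fixed corpus cannot.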

<span style="font-size:12pt;color:purple">Continuous Semantics</span>