Event identification in social media using classification-clustering framework
[Thesis]
Alsaedi, Nasser
Cardiff University
2017
Thesis (Ph.D.)
In recent years, there has been increased interest in real-world event detection using publicly accessible data made available through Internet platforms such as Twitter, Facebook and YouTube. In these highly interactive systems the general public are able to post real-time reactions to "real world" events - thereby acting as social sensors of terrestrial activity. Automatically detecting and categorizing events, particularly small-scale incidents, using streamed data is a non-trivial task, due to the heterogeneity, scale and varied quality of the data as well as the presence of noise and irrelevant information. However, it would be of high value to public safety organisations such as local police, who need to respond accordingly. To address these challenges we present an end-to-end integrated event detection framework which comprises five main components: data collection, pre-processing, classification, online clustering and summarization. The integration between classification and clustering enables events to be detected, especially "disruptive events" - incidents that threaten social safety and security, or that could disrupt social order. We present an evaluation of the effectiveness of detecting events using a variety of features derived from Twitter posts, namely temporal, spatial and textual content. We evaluate our framework on large-scale, real-world datasets from Twitter and Flickr. Furthermore, we apply our event detection system to a large corpus of tweets posted during the August 2011 riots in England. We show that our system can perform as well as terrestrial sources, such as police reports, traditional surveillance and emergency calls, and in most cases better than local police intelligence. The framework developed in this thesis provides a scalable, online solution for handling the high volume of social media documents in different languages including English, Arabic, Eastern languages such as Chinese, and many Latin languages.
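As an illustration of how a classification step can feed an online clustering step, the following is a minimal sketch only, not the implementation developed in the thesis. The keyword list `EVENT_TERMS`, the helper names, the bag-of-words cosine similarity and the `threshold` value are all assumptions introduced for this example; the abstract does not specify the classifier, the features or the clustering parameters actually used.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list standing in for a trained classifier (assumption).
EVENT_TERMS = {"fire", "riot", "crash", "flood", "police"}


def tokenize(text):
    """Lowercase a post and keep alphabetic tokens (toy pre-processing)."""
    return re.findall(r"[a-z]+", text.lower())


def is_event_related(tokens):
    """Toy classification step: keep posts that mention any event term."""
    return any(t in EVENT_TERMS for t in tokens)


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def detect_events(posts, threshold=0.3):
    """Single-pass clustering of the posts that pass the classifier.

    Each post joins the most similar existing cluster if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new one.
    """
    clusters = []  # each cluster: {"centroid": Counter, "posts": [str]}
    for post in posts:
        tokens = tokenize(post)
        if not is_event_related(tokens):
            continue  # discarded as not event-related
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            # Summed counts serve as the centroid; cosine ignores scale.
            best["centroid"].update(vec)
            best["posts"].append(post)
        else:
            clusters.append({"centroid": Counter(vec), "posts": [post]})
    return clusters


stream = [
    "Huge fire on Main Street",
    "Fire spreading on Main Street now",
    "Lovely coffee this morning",
    "Car crash on the M4 motorway",
]
for cluster in detect_events(stream):
    print(len(cluster["posts"]), cluster["posts"])
```

Filtering before clustering keeps unrelated posts out of the clusters, and processing each post exactly once is what makes the clustering step suitable for a stream.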
Event detection is also crucial to assuring public safety during real-world events. Decision makers use information from a range of terrestrial and online sources to develop policies and react appropriately to events as they unfold. Because the data are heterogeneous and large in scale, and because some messages are more salient than others for understanding risks to human safety and managing the disruption caused by events, automatic summarization of event-related microblogs is a non-trivial and important problem. In this thesis we tackle the task of automatic summarization of Twitter posts, and present three methods that produce summaries by selecting the most representative posts from real-world tweet-event clusters. To evaluate our approaches, we compare them to state-of-the-art summarization systems and human-generated summaries. Our results show that our proposed methods outperform all the other summarization systems for English and non-English corpora.
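To make the idea of selecting representative posts concrete, the sketch below ranks the posts in a cluster by cosine similarity to the cluster's term-count centroid. This is a generic extractive baseline chosen for illustration, not one of the three methods proposed in the thesis, which this abstract does not describe; the function names `term_vector` and `most_representative` and the parameter `k` are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Term-count vector of a post (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_representative(posts, k=1):
    """Return the k posts closest to the cluster centroid (assumed baseline)."""
    vectors = [term_vector(p) for p in posts]
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)  # summed counts; cosine ignores scale
    ranked = sorted(
        range(len(posts)),
        key=lambda i: cosine(vectors[i], centroid),
        reverse=True,
    )
    return [posts[i] for i in ranked[:k]]


cluster = [
    "Fire on Main Street",
    "Big fire on Main Street near the station",
    "Main Street closed",
]
print(most_representative(cluster, k=1))
```

A centroid-based ranking favours posts whose vocabulary is shared by most of the cluster, which is one simple way to read "most representative"; other criteria, such as recency or author credibility, would need additional signals not modelled here.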