Twitter geo-located clustering and topic analysis, now open source!
2011-09-11 17:59:42 +0100
A year has passed since the beginning of the trial of Flux of MEME, the project I presented during the Working Capital tour. It is now time to analyze what has been learned and show what has been developed, to conclude this R&D phase and deliver the results to Telecom Italia.
The initial idea
It is worth giving a quick description of the context. Twitter is a company formed in 2006 which has received several rounds of funding from venture capitalists over the past few years, leading to today's valuation of $1.2B. Still, during the summer of 2009 the service was not yet as mature and widespread as it may look now. At that time the development of the Twitter API had just started, and it was probably one of the few sources, if not the only one, of geo-referenced data. The whole concept of communication in the form of public gossip, mediated by a channel that accepts 140 characters per message, was appearing in the world of social networks for the first time.
This led to the basic idea of crunching this data stream, which most importantly includes the geographical source, then summarizing the content, so as to analyze the space-time evolution of the concepts described and, ultimately, predict how they could migrate in space and time.
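To make the idea of "space-time evolution" a bit more concrete, here is a minimal Python sketch that counts how often a term shows up per day and per map cell. It is only an illustration, not the code of the project: the tweet fields (`created_at`, `lat`, `lon`, `text`), the function names and the one-degree grid are all assumptions made for the example.

```python
import math
from collections import Counter
from datetime import datetime

def cell_of(lat, lon, size_deg=1.0):
    """Return the integer (row, column) index of the grid cell holding a coordinate."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))

def term_evolution(tweets, term, size_deg=1.0):
    """Count tweets mentioning `term`, keyed by (day, grid cell).

    `tweets` is a list of dicts with hypothetical keys:
    created_at (UTC, "YYYY-MM-DDTHH:MM:SSZ"), lat, lon, text.
    """
    counts = Counter()
    for tweet in tweets:
        if term.lower() not in tweet["text"].lower():
            continue
        created = datetime.strptime(tweet["created_at"], "%Y-%m-%dT%H:%M:%SZ")
        day = created.date().isoformat()
        counts[(day, cell_of(tweet["lat"], tweet["lon"], size_deg))] += 1
    return counts

# Made-up sample data, for illustration only.
sample = [
    {"created_at": "2011-05-20T09:15:00Z", "lat": 45.46, "lon": 9.19, "text": "Earthquake felt in Milan"},
    {"created_at": "2011-05-20T09:20:00Z", "lat": 45.47, "lon": 9.20, "text": "was that an earthquake?"},
    {"created_at": "2011-05-21T08:00:00Z", "lat": 41.90, "lon": 12.50, "text": "earthquake news from Rome"},
    {"created_at": "2011-05-21T08:05:00Z", "lat": 41.90, "lon": 12.50, "text": "good morning"},
]
evolution = term_evolution(sample, "earthquake")
```

Reading the resulting counts day by day shows in which cells the concept appears and where it moves. A prediction step would have to be built on top of a series like this one.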
A practical use
It could make it possible to monitor and curb potentially risky situations (social network analysis proved useful during the recent riots in London, for example) or to define marketing strategies targeted to the local context.
A substantial initial phase of research gave an overview of different aspects: the ability to capture information from Twitter, the structure of the captured data, the ability to obtain geo-located information, the classification of the languages of the tweets, the enrichment of content through the discovery of related information, the possible functions for spatial clustering, the algorithms for topic extraction, the definition of views useful for an operator and, finally, the ability to perform a trend analysis on the extracted information. All of this has resulted in a substantial amount of code, whose outcome is a demonstrator of the validity of the initial theory.
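As a rough illustration of two of the steps listed above, spatial clustering and topic extraction, here is a small self-contained Python sketch. It is not the implementation found in the repository: it assumes a plain DBSCAN over great-circle distances for the clustering, and it replaces LDA with simple word counting as a stand-in for topic extraction. The function names, the `eps_km` and `min_pts` parameters and the tiny stopword list are all hypothetical.

```python
import math
from collections import Counter

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def dbscan(points, eps_km, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force scan, fine for a demo; the point itself is included.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels

STOPWORDS = frozenset({"the", "a", "an", "in", "of", "is", "to"})  # toy list

def top_terms(texts, k=3):
    """Most frequent non-stopword terms: a crude stand-in for LDA topics."""
    words = Counter(w for t in texts for w in t.lower().split()
                    if w not in STOPWORDS)
    return [w for w, _ in words.most_common(k)]
```

With a handful of made-up coordinates around Milan and Zurich plus one isolated point, `dbscan(points, eps_km=20, min_pts=3)` would give two clusters and one noise point. The texts of each cluster can then be passed to `top_terms` to get a first idea of what people in that area are talking about.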
Space-time evolution of the concept "earthquake" in a limited subset of data captured during May 2011
Distribution of groups of tweets by source language over Switzerland and northern Italy
The future of the project
The development done so far has had two important results: firstly, it demonstrated the validity of the initial idea, and secondly, it revealed what the system requires in order to be fully functional. The main problem lies in the architecture implemented for the demonstrator, which at the moment relies on a limited amount of data (for obvious reasons of availability of resources): this immediately showed the need to scale the application environment up to a more complex architecture for distributed computing. The market and/or Telecom Italia will eventually decide whether this second phase of development can be undertaken.
- Source code and documentation - https://github.com/grudelsud/fom/
- Algorithm for spatial clustering - http://en.wikipedia.org/wiki/Cluster_analysis
- Algorithm for topic extraction - http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation