Twitter data mining and clustering results

It has been a while since my last update on the Flux of MEME project, but my team has not been idle: we have worked a lot on both algorithms and architecture, and now it is time to analyse the first results. Thanks to the research grant awarded by @workingcapital, we have developed a first prototype of a data mining, topic extraction and clustering application that uses Twitter as its main data source.

Our software is structured in three main modules:

  1. data acquisition: uses the Twitter Streaming API to fetch content and store it in our database. Data is filtered so that only geo-located tweets are stored, which represent around 1% of all tweets. Where possible, data elements are "enriched" by a web crawler that fetches the content referenced by links in the tweet body
  2. data analysis: performs a two-step iteration on content elements, first creating a set of geo-located clusters with the K-means algorithm, then extracting topics with Latent Dirichlet Allocation (LDA). A lot of work still needs to be done on this module to make the results more meaningful and to analyse cluster correlation and prediction
  3. data visualization: an AJAX-based web front-end for analysing and verifying experimental results
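
To make the acquisition filter and the first analysis step more concrete, here is a minimal sketch in plain Python. It is not our production code: the tweet dictionaries, the `coords` field and the function names (`geo_located`, `kmeans`, `top_terms`) are hypothetical, and the clustering uses simple Euclidean distance on latitude/longitude pairs. The topic step is deliberately replaced by a plain word count per cluster, which only stands in for LDA and is not LDA itself.

```python
import math
import random
from collections import Counter


def geo_located(tweets):
    """Keep only tweets that carry a (lat, lon) pair (hypothetical 'coords' field)."""
    return [t for t in tweets if t.get("coords") is not None]


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's K-means on (lat, lon) tuples.

    Euclidean distance on raw coordinates is an approximation that is
    only reasonable for small, illustrative examples.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k input points as initial centroids
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


def top_terms(tweets, labels, cluster, n=3):
    """Most frequent words longer than 3 characters in one cluster.

    This is a word-count placeholder for the topic extraction step, not LDA.
    """
    counts = Counter()
    for tweet, label in zip(tweets, labels):
        if label == cluster:
            counts.update(w for w in tweet["text"].lower().split() if len(w) > 3)
    return [word for word, _ in counts.most_common(n)]


# Made-up sample data, for illustration only.
tweets = [
    {"text": "Traffic jam near the Duomo", "coords": (45.46, 9.19)},
    {"text": "Duomo square is crowded today", "coords": (45.47, 9.18)},
    {"text": "Sunset over the Colosseum", "coords": (41.89, 12.49)},
    {"text": "Colosseum tickets sold out", "coords": (41.90, 12.50)},
    {"text": "No location on this one", "coords": None},
]

located = geo_located(tweets)
labels, centroids = kmeans([t["coords"] for t in located], k=2)
for cluster in sorted(set(labels)):
    print(cluster, top_terms(located, labels, cluster))
```

On this toy data the tweet without coordinates is dropped, and the remaining four tweets split into two geographic groups, each summarised by its most frequent words. In a real pipeline the word count would be replaced by a proper LDA implementation run on the text of each cluster.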

The algorithms and architecture are still under heavy development, but the first results are really encouraging, and we have planned another 6 months of activity under the current grant. [EDIT] Please see below for a non-technical presentation of the main project features.



We are completely open to discussion, recommendations and funding proposals, so if you are interested in this topic, please feel free to get in touch to learn more about the project.