Twitter data mining and clustering results

It has been a while since my last update on the Flux of MEME project, but my team has not been idle: we have worked a lot on both algorithms and architecture, and now it is time to analyse the first results. Thanks to the research grant awarded by @workingcapital, we developed a first prototype of a data mining, topic extraction and clustering application that uses Twitter as its main data source.

Our software is structured in 3 main modules:

  1. data acquisition: uses the Twitter streaming API to fetch content and store it in our database. Data is filtered so that only geo-located tweets are stored, which represent around 1% of the total volume of tweets. Data elements are "enriched" when possible by a web crawler that fetches the content referenced by links inside the tweet body
  2. data analysis: performs a two-step iteration on content elements, first creating a set of geo-located clusters with the K-means algorithm, then extracting topics with Latent Dirichlet Allocation (LDA). A lot of work still needs to be done on this module to make the results more meaningful and to analyse correlation between clusters and their predictive value
  3. data visualization: an AJAX-based web front-end for analysing and verifying experimental results
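
To make the two-step analysis more concrete, here is a minimal sketch in plain Python. It is not our production code: the tweet layout (a dict with hypothetical `text` and `coords` keys), the function names, and all parameter values are assumptions chosen for illustration. It keeps only geo-located tweets, groups them with a basic K-means on (latitude, longitude), and then runs a small collapsed Gibbs sampler for LDA inside each cluster.

```python
import random
from collections import Counter


def kmeans(points, k, iters=20, seed=0):
    """Basic Lloyd's K-means on (lat, lon) pairs; returns one label per point.

    Plain Euclidean distance on coordinates is a simplification that ignores
    the curvature of the Earth.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels


def lda_gibbs(docs, n_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists.

    Returns one Counter per topic holding the word counts assigned to it.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Random initial topic assignment for every token.
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assign[d]):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def analyse(tweets, k, n_topics, n_words=3):
    """Cluster geo-located tweets, then list the top words per topic per cluster."""
    geo = [t for t in tweets if t.get("coords")]  # drop tweets without a location
    labels = kmeans([t["coords"] for t in geo], k)
    result = {}
    for c in sorted(set(labels)):
        docs = [t["text"].lower().split() for t, lab in zip(geo, labels) if lab == c]
        result[c] = [
            [w for w, n in tw.most_common(n_words) if n > 0]
            for tw in lda_gibbs(docs, n_topics)
        ]
    return result
```

Calling `analyse(tweets, k=2, n_topics=2)` on a list of such dicts returns, for each geographic cluster, a short list of top words per topic. A real pipeline would at least need proper tokenisation, stop-word removal, and a principled way to choose `k` and the number of topics.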

The algorithms and architecture are still under heavy development, but the first results are encouraging, and we have planned another 6 months of work under the current grant. [EDIT] Please see below for a non-technical presentation of the main project features.


We are completely open to discussion, recommendations and funding proposals, so if you are interested in this topic, please feel free to get in touch to learn more about the project.