Six months down the road of development for my 2012 research project with Telecom Italia, codenamed STAMAT (Social Topics and Media Analysis Tool), and we are now getting to the point of sharing the first encouraging results.
After several iterations of research & development, STAMAT looks more and more like an intelligent feed reader, where information is conveyed with several levels of enrichment: starting from the clean text and images published on RSS feeds, we add a layer of knowledge measuring the buzz and related media.
This way the information is not passively ingested by users, but examined across several measures and connected to other related elements, thus enabling an active and effective acquisition of knowledge.
The key technologies involved in achieving this result are named entity extraction, topic analysis and visual similarity.
Our tool is still in private alpha and we are going out in the cold to pitch our product to VCs. In the meanwhile you can check the screenshots posted here, and anyone who wants to stay in touch can subscribe to our amazing signup page on memeflux.co and/or take a 5-minute survey to help us build the next-generation trend analysis tool.
So the little storm has arrived! I am now the proud (and tired) father of little Benjamin, and for this reason I am stuck at home with little or no time for anything but changing nappies and cooking high-protein food for Val. But somehow I found some time to amuse myself with a little piece of Processing and wrote some simple code to visualize tweets on a map of London during the Royal Wedding. Easy enough to foresee, tweets during the day form nice clusters around Buckingham Palace, Westminster Abbey and the super posh hotel where the Middletons stayed.
Here is the result:
To display this data I reused the information stored during the first phase of my Flux of MEME project, fetched from Twitter with the Streaming API via its Java implementation, twitter4j. Processing reads the information as XML directly from the database: a small PHP backend provides the XML descriptor with the locations of all the posts.
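The core of the plotting is just a linear mapping from coordinates to pixels. Here is a minimal Python sketch of that idea, not the actual Processing code; the bounding box around central London and the canvas size are illustrative assumptions:

```python
# Map (lat, lon) pairs onto a fixed map image with a simple linear
# (equirectangular) interpolation. All constants below are illustrative.
LAT_MIN, LAT_MAX = 51.45, 51.55   # rough central-London latitude range
LON_MIN, LON_MAX = -0.22, 0.00    # rough central-London longitude range
WIDTH, HEIGHT = 800, 600          # canvas size in pixels

def to_pixel(lat, lon):
    """Linearly interpolate a coordinate into the canvas."""
    x = (lon - LON_MIN) / (LON_MAX - LON_MIN) * WIDTH
    # Latitude grows northwards but screen y grows downwards, so flip it.
    y = (LAT_MAX - lat) / (LAT_MAX - LAT_MIN) * HEIGHT
    return x, y

# A tweet geotagged near Buckingham Palace (approx. 51.501 N, 0.142 W)
x, y = to_pixel(51.501, -0.142)
```

Each tweet then becomes a single dot drawn at `(x, y)` over the map image, which is what produces the clusters visible in the video.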
So it is now time to present the results obtained during the first year of research and development on the Flux of MEME project, and I was glad to fly to Milan for the presentation at Telecom Italia last Friday the 30th. Thanks-a-mil to Laurent-Walter Goix and Carlo Alberto Licciardi at Telecom for the constant support, reviews and recommendations: it immensely helped to achieve this result. And thanks-two-mils to Giuseppe Serra and Marco Bertini (with the help of Federico Frappi) at the Media Integration and Communication Center for the help provided in the definition and fine-tuning of the algorithms. Looking forward to starting Flux phase 2!
This is a quick keynote highlighting the main elements of this geo-clustering and topic extraction tool, which uses Twitter as its main data source but aims to expand to heterogeneous, context-based data sources.
A year has passed since the beginning of the trial of Flux of MEME, the project I presented during the Working Capital tour, and it is now time to analyze what has been learned and to show what has been developed, concluding this R&D phase and delivering the results to Telecom Italia.
The initial idea
It’s worthwhile giving a quick description of the context. Twitter, founded in 2006, has received several rounds of venture capital funding over the past few years, leading to today's valuation of $1.2B; still, during the summer of 2009 the service was not as mature and widespread as it may look now. At that time the development of the Twitter API had just started, and it was probably one of the few sources, if not the only one, of geo-referenced data. The whole concept of communication in the form of public gossip, mediated by a channel that accepts 140 characters per message, was appearing in the world of social networks for the first time.
This led to the basic idea: crunch this data stream, which crucially includes the geographical source of each message, then summarize the content so as to analyze the space-time evolution of the concepts described and, ultimately, predict how they could migrate in space and time.
A practical use
It could allow you to monitor and curb potentially risky situations (social network analysis proved useful during the recent riots in London, for instance) or to define marketing strategies targeted at the local context.
A substantial initial phase of research provided an overview of the different aspects involved: how to capture information from Twitter, the structure of the captured data, how to obtain geo-located information, classifying the languages of the tweets, enriching content through the discovery of related information, possible functions for spatial clustering, algorithms for topic extraction, the definition of views useful for an operator and, finally, how to perform trend analysis on the extracted information. All of this has resulted in a substantial amount of code, whose outcome is a demonstrator of the validity of the initial theory.
space-time evolution of the concept "earthquake" in a limited subset of data captured during May 2011
distribution of groups of tweets by source language over Switzerland and northern Italy
The future of the project
The development done so far has had two important results: firstly, it demonstrated the validity of the initial idea; secondly, it revealed the requirements the system needs to be fully functional. The main problem lies in the architecture implemented for the demonstrator, which at the moment relies on a limited amount of data (for obvious reasons of resource availability): this immediately proved the necessity of scaling the application up to a more complex architecture for distributed computing. The market and/or Telecom Italia will eventually decide whether this second phase of development can be undertaken.
It has been quite a while since our first "java -jar fom.jar", and the complexity of the project has grown constantly since the beginning, its tiny team facing a new challenge every day and trying to solve it with the limited resources available. Now it is time to deliver the first prototype, draw a line and define the milestones ahead, but we are reasonably confident that there is room for improvement and that our clustering and topic extraction tool can provide good results. Special thanks go to @fedefrappi, who endured an email bombardment over the past few months and never lost his temper. Great job, man!
It has been a while since my last update on the Flux of MEME project, but my team was not idle: we have worked a lot both on algorithms and architecture, and now it is time to analyse the first results. Thanks to the research grant awarded by @workingcapital, a first prototype of data mining, topic extraction and clustering application was developed, using Twitter as its main data source.
Our software is structured in 3 main modules:
data acquisition: uses the Twitter streaming API to fetch content and store it in our database. Data is filtered to keep geo-located references only, which represent around 1% of the total volume of tweets. Data elements are "enriched" when possible, thanks to a web crawler that fetches the content referenced by links inside the tweet bodies
data analysis: performs a 2-step iteration on content elements, first creating a set of geo-located clusters with the K-means algorithm, then extracting topics with Latent Dirichlet allocation (LDA). A lot of work still needs to be done on this module to increase the meaningfulness of the results and to analyse cluster correlation and prediction
data visualization: an AJAX-based web front-end for the analysis and verification of experimental results
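To give an idea of the first step of the analysis module, here is a minimal, self-contained Python sketch of K-means over geo-located points. It is a toy illustration with made-up coordinates and deterministic seeding, not the project's actual code; in the real pipeline each resulting cluster of tweets is then fed to LDA for topic extraction.

```python
def kmeans(points, k, iters=20):
    """Plain K-means on (lat, lon) pairs."""
    # Deterministic init for this sketch: take the first k points as
    # centroids (real implementations use random or k-means++ seeding).
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Squared Euclidean distance in degrees: crude, but a usable
            # proxy at city scale.
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                  (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):
            if c:  # recompute each centroid as the mean of its members
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

# Two obvious groups of geo-tagged tweets: one around Milan, one around Rome.
tweets = [(45.46, 9.19), (45.47, 9.18), (45.45, 9.20),
          (41.90, 12.49), (41.91, 12.50), (41.89, 12.48)]
centroids, clusters = kmeans(tweets, k=2)
```

On this toy data the algorithm converges to one cluster per city; the text of the tweets in each cluster would then become one document collection for the LDA step.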
The algorithms and architecture are still under heavy development, but the first results are really encouraging and we have planned another 6 months of activities under the current grant. [EDIT] please check the non-technical presentation of the main project features below.
It was the last of a sequence: the project was briefed, submitted as an elevator pitch, then checked in a couple of meetings, refined, presented again, and in the end it won. The project will be funded by Telecom Italia @workingcapital and was presented during the last Working Capital event in Bologna, June 9th 2010.
The goal of this project is to create a semantic web application capable of predicting the future through the analysis and clustering of concepts.
Memes flow through social networks and can be paired with their geo-reference information to understand how they move in physical space. The terms used in a single meme describe its semantic domain, and images, when present, can enrich its interpretation.
Now the research begins. Part of the semantic engine for term extraction and expansion is already working. From my personal point of view I think that two outcomes will be really important:
applying the results on mobile devices - the contexts still have to be defined, but mobile networks and entertainment are an immediate application for this project;
data visualization - huge amounts of data are useless if not digested and visualized in a simple and attractive way; at the moment I am exploring Processing, as it appears to be a good desktop environment, and this is the reason why I cited blprnt.com in my presentation.
I was recently involved in the presentation of an ERC Advanced Grant. The list of currently open calls is published on the European Research Council's official website, here.
The project aims to carry out a wide and exhaustive study of the diffusion of Mycenaean culture within the Eastern Mediterranean basin in the late II millennium B.C., supported by proper cross-medial research, in order to build a user-friendly macro-database of all the archaeological evidence from the wide area of interest.
The principal investigator of the proposed grant is Anna Margherita Jasink. Her research activity has addressed a variety of topics, in continuous evolution over time, with a distinctly interdisciplinary approach: her field research has concerned both philological and historical themes connected to the Ancient Aegean and the Eastern Mediterranean/Near Eastern civilisations.
The following investigators are the other members of the team:
Prof. Stefania Mazzoni, professor of Archaeology and Art History of Near-Eastern civilisations at the University of Florence. With her expertise in the Levantine world she will be in charge of coordinating the aspects concerning the presence of Mycenaeans in the Syro-Palestinian area and in Anatolia;
Prof. Gloria Rosati, professor of Egyptology at the University of Florence. With her expertise in the Egyptian world she will be in charge of coordinating the aspects concerning the presence of Mycenaeans in Egypt;
Prof. Giampaolo Graziadio, professor of Aegean civilisations at the University of Pisa. With his expertise in Mycenaean and Cypriot civilisations he will be in charge of coordinating the aspects concerning, on one side, the presence of Mycenaeans in Cyprus and, on the other, the role of Cyprus as an intermediary between the Mycenaeans and the Near-Eastern countries;
Dr. Luca Bombardieri, postdoc researcher at the University of Florence. With his expertise in field archaeology in Cyprus and Syria he will be in charge of coordinating the problematic question of the indirect contacts between the Mycenaean world and faraway countries like Mesopotamia and Jordan.
Eng. Thomas Alisi, postdoc consultant for the University of Florence. With his expertise in interactive media environments and the semantic web he will be in charge of designing the overall software architecture, defining the interaction models, coordinating the development team and delivering the project demonstrators.
The schedule calls for delivering the first results in August, fingers crossed.