Back from #PyData2017 London, where I gave a talk called “Show me the failures!”, showing how the Data Science and Analytics team at Pirelli approaches the problem of designing and implementing data products on the shop floor.
Thank you again to all the crew, you guys have been amazing: venue, food and schedule were absolutely spot on. Bring it on, I am looking forward to coming over for #PyData2018! Below are the slides and an outline of my talk.
The Data Science and Analytics group at Pirelli has to deal with the day-to-day life of factories, which could not be further from the aseptic crunching of data from a keyboard in an office. Our group took the lift, went down to the shop floor and started asking people questions to try and make their life better: it turns out that questions flowed the other way round and the results were startling.
Pirelli has a 140-year-old manufacturing tradition, with 20 factories across 14 countries and headquarters in Milan. Production flows, logistics, machinery and the whole extended value chain have morphed through the decades across a broad range of needs and circumstances.
The creation of a Data Science and Analytics department at the beginning of 2016 has the goal of speeding up change and innovation, starting from areas that are harder to tackle. Some of the most interesting challenges include:
bring data products to the shop floor to increase efficiency while being aware of UX principles
keep two-way communication alive with a wide range of actors, particularly with IT, quality and engineering
encourage active participation by providing accessible analytics tools and an internal Academy training program
activate the virtuous circle of prototyping, feasibility check and production releases for sound product lifecycles
introduce Agile development methodologies in traditional waterfall environments
shape a roadmap with principal stakeholders starting from off-line through live analysis and heading to ahead-of-time predictions
Opening a steady communication channel across groups is progressively eroding barriers between white and blue collars, allowing teams to better understand each other’s requirements and kicking off a broader conversation. One year after releasing the first prototype there is much more on the plate, and groups are now more familiar with the concepts of User Experience, release lifecycle, data exploration and agile development.
In this talk we are going to show the data science team’s approach to prototyping and implementing data products for Pirelli factories, both on the shop floor and in quality / engineering offices. Different needs or, taking a UX approach, different personas lead to different outcomes: from large displays mounted in wide warehouses to detailed descriptions of statistical distributions, from near real-time processing of streams of data coming out of sensors to large computations for statistical models built on millions of rows stored in SQL tables.
The sheer variety of technologies involved in the process is probably the biggest challenge when deploying at production level: aside from standard data processing and machine learning packages, such as Pandas and scikit-learn, our Flask and Django based web infrastructures interact with MsSQL servers, JBoss data virtualisers, a Hadoop cluster and an Oracle data warehouse, responsively adapting their output for different contexts with Angular and React front-ends.
P.S. link to a gist with my notes from the conference here
[This post also appears on Medium] After spending long hours to have a fully working Circle CI / Elastic Beanstalk / Docker integration, I thought it would be useful to put my notes down so I will save some time next time I need to go through a similar setup and avoid spending long nights cursing the gods of the (internet) clouds.
use Circle as CI engine, I am pretty sure that other services (e.g. Travis, Codeship, etc.) would offer equivalent functionalities,
use a private registry to store compiled Docker images: I use Quay.io because it is cheaper than Docker Hub and works just fine for what I need,
deploy a Python Django application, however this is not strictly necessary as most of the steps apply to any kind of web application.
There is plenty of documentation related to EB and a few samples that can be used as good starting points to understand what EB needs to deploy a Docker image to a single-container environment (e.g. here and here), but they did not seem to cover all the aspects of real-world implementations.
The aim of what follows is to provide a quick step by step guide to create and deploy an app to EB and describe some details hidden across different documents on Circle, EB and Docker.
Step 1. Elastic Beanstalk on AWS
Using the AWS Web Console, go to IAM and create a new user, give it full access to EB, take note of key+secret and edit ~/.aws/config on your machine to create a new profile, it should look something like:
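The profile itself did not survive in this copy of the post. As a rough sketch only, assuming the profile is named eb-admin (the name passed to eb init further down) and using placeholder credentials, a section in the format read by the AWS command line tools could be rendered like this; the helper name render_aws_profile is made up for illustration:

```python
import configparser
import io


def render_aws_profile(name, access_key, secret_key, region="eu-west-1"):
    """Render a named profile section in the format used by ~/.aws/config."""
    # Interpolation is disabled so that special characters in keys are kept verbatim.
    config = configparser.ConfigParser(interpolation=None)
    # In ~/.aws/config, named profiles use a "profile <name>" section header.
    config[f"profile {name}"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region": region,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    print(render_aws_profile("eb-admin", "YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY"))
```

The printed text can be appended by hand to ~/.aws/config; the region default is only an assumption matching the eu-west-1 data centre used later in this post.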
Still in the AWS Web Console, go to Elastic Beanstalk, create a new web application and select the following settings:
set it to auto-scale,
do NOT use the RDS that comes with EB, it is more flexible to set up your own RDS and hook it to EB later,
say yes to VPC,
set some other reasonable settings, assuming you have basic knowledge of AWS to handle security groups, key-pairs, etc.
Now, back to your machine, create a python virtual environment (using virtualenvwrapper here) and install AWS command line tools:
# probably not strictly necessary, but will come useful later…
pip install awscli
pip install awsebcli
Time to set up EB on the local machine and hook it to the application created in the AWS Web Console! Go to where your web application lives (in my case where Django’s manage.py sits), then go up one folder and run (assuming we run EB in Amazon’s EU-West data centre):
eb init --region eu-west-1 --profile eb-admin
You will be asked a few questions by the init process:
it should show the EB applications you have created in your AWS account, select the one you want to use from the list, it should match the one just created with AWS Web Console,
all other options should be automagically detected from the online settings.
If the wizard does not work (e.g. because you have weird files that trigger auto detections), go through the init questions by choosing the following:
skip platform auto detection and choose Docker (if a Dockerfile is found, it probably will not ask for platform details)
choose an existing keypair ssh configuration
After this step, you should have a fresh .elasticbeanstalk directory containing a single config.yml file reflecting the app settings you have just created. Add a line to your .gitignore file, as eb init gets a bit too enthusiastic with ignoring files that you might need to share with your team:
Now you need to tweak AWS a bit to allow EB to be able to deploy by reading a configuration file from S3. Go back to IAM in the AWS Web Console, you should find 2 newly created Roles (have a look at this article for further information):
instance role (or EC2 role) is used by EB to deploy and have access to other resources
service role is used by EB to monitor running instances and send notifications
Add READ permissions on S3 to the instance role, so EB can fetch the relevant files when deploying. Finally go to S3: there should be a new bucket with a few EB related files in it. Take note of its name, you will use it later.
Step 2. Docker
Create a private repo on Quay.io: I am assuming that we are going to run some super secret code that we do not want to host on the docker public registry, so I am adding additional information to allow Elastic Beanstalk to authenticate on the private registry.
on Quay.io create a robot account and assign write credentials to the private repo
download the .dockercfg file associated to the robot account and put it BOTH in ~ on your local machine (so you can authenticate on Quay.io from the command line) and in your repository’s root (will be used later by Circle)
Now create your Dockerfile — there are plenty of good examples around, I quite like one from Rehab Studio that runs a simple Flask test app with Gunicorn and NGINX, you can find it here — and be sure it contains the two following instructions:
your-startup-file.sh is basically something that starts the web server, or web application, or any other service that would listen on 80 and produce some sort of output. In my case it is Supervisor running NGINX and Gunicorn.
When you are happy with your docker image, push it to Quay.io with
docker push quay.io/my_account/my_image:latest
Now it is time to integrate an automatic test and build on Circle.
Step 3. Circle CI
Open an account on Circle by authenticating with Github and allow it to connect the Github repo you want to use for continuous integration.
Create a custom circle.yml file in your repository’s root: this file will tell Circle to run tests and deploy your app to EB when committing to release branches, it should look more or less like the following:
This yml file instructs Circle to run tests on the docker image and, if the tests are successful, to deploy when the push goes to the release/stg branch. The interesting aspect of this approach is that deploy.sh can be run (and, more importantly, tested!) locally.
Note! you might need to use private ssh keys to access your repo and build the docker deploy image, this is totally doable by adding keys in the Circle project environment (check this out).
Step 4. Deploy
OK, almost there! Now we need to create the deploy.sh file mentioned in the previous step: this automates the process of building our deploy docker image and putting it somewhere EB can go and fetch it, and we use the AWS command line interface for that. The steps are fairly straightforward:
build docker image and push to Quay.io
create a Dockerrun file for EB (read details below) and push it to S3
tell EB to create a new application version and deploy it
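The three steps above can be sketched as a list of commands. This is not the original deploy.sh, which is not reproduced here: it is a minimal Python sketch that only builds the command lines, and the function name, the S3 key layout and all the sample values are assumptions made for illustration.

```python
import shlex


def deploy_commands(image, tag, bucket, app_name, env_name,
                    dockerrun="Dockerrun.aws.json"):
    """Return the command lines for the three deploy steps, in order."""
    image_ref = f"{image}:{tag}"
    # Hypothetical key layout inside the EB bucket.
    s3_key = f"{app_name}/{tag}-{dockerrun}"
    return [
        # 1. build the docker image and push it to the private registry
        ["docker", "build", "-t", image_ref, "."],
        ["docker", "push", image_ref],
        # 2. upload the Dockerrun file to S3
        ["aws", "s3", "cp", dockerrun, f"s3://{bucket}/{s3_key}"],
        # 3. register a new application version, then deploy it
        ["aws", "elasticbeanstalk", "create-application-version",
         "--application-name", app_name,
         "--version-label", tag,
         "--source-bundle", f"S3Bucket={bucket},S3Key={s3_key}"],
        ["aws", "elasticbeanstalk", "update-environment",
         "--environment-name", env_name,
         "--version-label", tag],
    ]


if __name__ == "__main__":
    # Sample values only; nothing is executed, the commands are just printed.
    for cmd in deploy_commands("quay.io/my_account/my_image", "abc123",
                               "my-eb-bucket", "my-app", "my-app-stg"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands instead of running them keeps the sketch safe to try locally; each list could be handed to subprocess.run once the values are right.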
So what is this mysterious Dockerrun.aws.json? It is a simple descriptor that tells AWS where to pull the Docker image from, which version to pull and which credentials to use. Below is the template file, where the placeholders are replaced with live variables by the deploy.sh script, and dockercfg tells EB where to find credentials for private docker images.
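The template is missing from this copy of the post, so what follows is a sketch based on the single-container (version 1) Dockerrun format as I understand it from the EB documentation, not the original file. The function name and the sample bucket, key and image values are placeholders.

```python
import json


def build_dockerrun(image, tag, bucket, dockercfg_key, container_port=80):
    """Build a single-container Dockerrun.aws.json descriptor as a dict."""
    return {
        "AWSEBDockerrunVersion": "1",
        # Where EB finds the .dockercfg credentials for the private registry.
        "Authentication": {"Bucket": bucket, "Key": dockercfg_key},
        # Which image and version to pull.
        "Image": {"Name": f"{image}:{tag}", "Update": "true"},
        # The port the container listens on.
        "Ports": [{"ContainerPort": str(container_port)}],
    }


if __name__ == "__main__":
    descriptor = build_dockerrun("quay.io/my_account/my_image", "latest",
                                 "my-eb-bucket", "dockercfg")
    print(json.dumps(descriptor, indent=2))
```

Building the descriptor in code rather than with text substitution means the output is always valid JSON, which is one less thing to debug when a deploy fails.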
Step 5. Tweaks, env variables, etc.
Docker on EB needs environment variables to run properly! you can either set them up directly in EB, or run a script that automates the process by using a standard format accepted by the AWS command line interface option update-environment. Here is an example format of AWS options (say it’s stored with name EB-options.txt.template):
which can be processed by substituting local environment variables and sending the result to EB:
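Neither the options file nor the processing script survived in this copy. As an illustration only, here is a sketch that fills a template with values from a mapping (os.environ would work the same way) and checks that the result is valid JSON. The option names DJANGO_SETTINGS_MODULE and DATABASE_URL are invented for the example; the namespace is the one EB uses for application environment variables.

```python
import json
from string import Template

# Hypothetical content of EB-options.txt.template.
OPTIONS_TEMPLATE = """[
  {
    "Namespace": "aws:elasticbeanstalk:application:environment",
    "OptionName": "DJANGO_SETTINGS_MODULE",
    "Value": "$DJANGO_SETTINGS_MODULE"
  },
  {
    "Namespace": "aws:elasticbeanstalk:application:environment",
    "OptionName": "DATABASE_URL",
    "Value": "$DATABASE_URL"
  }
]"""


def render_options(template_text, env):
    """Fill $VARIABLES from env and return the parsed option settings."""
    # substitute() raises KeyError if a variable is missing from env.
    rendered = Template(template_text).substitute(env)
    # Parsing catches values that would break the JSON (e.g. stray quotes).
    return json.loads(rendered)


if __name__ == "__main__":
    sample_env = {
        "DJANGO_SETTINGS_MODULE": "app.settings",
        "DATABASE_URL": "postgres://user@host/db",
    }
    print(json.dumps(render_options(OPTIONS_TEMPLATE, sample_env), indent=2))
```

The rendered JSON can then be saved to a file and passed to aws elasticbeanstalk update-environment through its --option-settings parameter.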
quite a bit of a headache, but Woohoo! when you see it running on EB you feel like you are the god of AWS!
Certain moments happen to everyone: you ask for the bill at the restaurant, put your hand in your pocket and realise you have left cash and card at home. Another scenario: you land in a foreign country and wait for your suitcase until you realise you are the last passenger left at the carousel and your luggage has never shown up. The first instant in which you realise you are outside your usual comfort zone usually translates into a shiver running up your spine, alerting all your senses and raising the desperate question: “e Adesso Come Faccio?”, Italian for “what do I do now?” (ACCF from here on).
During my latest trip to Italy and back I ran into several moments of this kind, the first of which materialised when I was barely 70 miles from home: I get off the bike for the customary photo while waiting for my Eurotunnel shuttle to be called, and I notice that my number plate is gone. A strange feeling of shame and fear comes over me, and I try to cover the back of the bike with my body to hide from any passing interceptor the fact that I am missing a piece that is essential for the rest of the journey across Europe. I chat with a group of bikers heading to Assen for the MotoGP, who nonchalantly advise me: carry on, and that is exactly what I do. While crossing France I start putting a plan together: I call Valentina and ask her to contact BMW to find out whether they can get me a temporary plate, and I mentally prepare a plausible explanation for the not-so-friendly gendarmerie. As it turns out, I reach the Italian border without anyone stopping me or saying a word about it, which is a great relief. Carrying on with the goal of reaching Mantova by the end of the second day, I realise two things:
Italians are very worried that I am travelling without a plate and keep pulling up alongside me to point it out; also,
BMW Italia has no idea how to get me a replacement, so I will have to make do with a home-made solution.
The relevant thought that started to take shape in my mind, however, is unrelated to the circumstance: what matters is to keep a positive approach, analyse the problem and consider a range of possible solutions.
With this attitude and the help of a good hardware store I manage to find everything I need to build a wonderful fake plate, and I am ready to continue my journey.
After a few days of rest in Italy I head back north and another ACCF moment occurs: on the motorway near Bologna a car is travelling at about 110 km/h in front of me when one of its tyres blows out, making it slow down abruptly. The rim scrapes along the ground throwing sparks, the tyre flies off and passes a few metres from me, the driver seems very quick to react and manages to pull over without consequences: his face is very focused and, looking at him, I realise that within a few seconds, once the initial shock has passed, he will start thinking about how to solve the situation with a mix of shame and fear.
The journey continues with several other ACCF moments: crossing Switzerland and Germany is marked by unbearable heat, above 40°C, which forces me to stop frequently and soak myself with water in order to keep going; arriving in Ghent late in the evening on a crowded Saturday, I cannot find a place to sleep, a problem solved at the very last minute with booking.com; news reports about refugees looking for a clandestine passage make crossing at Calais sound impossible, which turns out not to match reality apart from a few checkpoints with gendarmes armed to the teeth; finally, as soon as I touch British soil I am hit by a very powerful storm that welcomes me back home by forcing me down to a maximum speed of 20mph.
As usual I try to take stock of my little adventure and draw some lessons from it: no matter how difficult the situation you find yourself in, it is crucial to keep a positive attitude, reduce moments of panic to a minimum, look at the problem objectively and quickly define the possible strategies to solve it. And I realise, rather obviously, that the same approach applies to any other moment in life (bills, fines, pregnancies, illnesses, random bad luck: you name it! as they say around here) in which, for those brief instants, you feel a shiver run up your spine and ask yourself: “e Adesso, Come Cazzo Faccio?” (and now, what the hell do I do?)
I’ve recently had to develop a web app that shows tweet locations on a map. It’s simple enough to extract a tweet’s location (when present): just check the API Docs for a Tweet object and you’ll find a coordinates field that reportedly:
Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as geoJSON (longitude first, then latitude).
The next step is to visualize it on a flat map with the widely accepted Mercator projection. There are a few useful references on StackOverflow and Wolfram that gave me the hints to write these simple python functions:
where width and height are the size in pixels of the flat projection. The formula works fine, translating the reference from Wolfram to the get_y function was simple enough, but the reason behind some details of the function found on StackOverflow (e.g. multiplying width by 1.5) seemed a bit arbitrary to me and I was too lazy to find the answers.
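The original functions are not reproduced in this copy. As a hedged stand-in, here is a minimal sketch of the textbook Mercator formula with the origin in the top left corner; it is not the StackOverflow variant discussed above (there is no 1.5 factor), and it assumes a map whose width spans the full 360 degrees of longitude.

```python
import math


def get_x(width, lng):
    """Map a longitude in degrees to a horizontal pixel position."""
    return (lng + 180.0) / 360.0 * width


def get_y(width, height, lat):
    """Map a latitude in degrees to a vertical pixel position (0 = top)."""
    lat_rad = math.radians(lat)
    # Mercator y on the unit sphere: ln(tan(pi/4 + lat/2)).
    merc = math.log(math.tan(math.pi / 4.0 + lat_rad / 2.0))
    # Scale by the same factor used for x, then flip so north is up.
    return height / 2.0 - width * merc / (2.0 * math.pi)


if __name__ == "__main__":
    # GeoJSON order is longitude first, then latitude (sample point: London).
    lng, lat = -0.1276, 51.5072
    print(get_x(1024, lng), get_y(1024, 1024, lat))
```

Note that the formula diverges at the poles, so latitudes are usually clipped to roughly 85 degrees north and south before projecting.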
Turns out my PostgreSQL database also has the PostGIS extensions installed, so I decided to put them to use. I found that what we usually just call lng-lat has a formal definition in the WGS84 standard, which maps to PostGIS’ spatial reference id 4326. The Mercator projection, on the other hand, is also a standard transformation, known as EPSG:3785, which maps to PostGIS id 3785 (same id, thank god).
It’s then possible to transform a WGS84 reference to EPSG:3785 by calling PostGIS functions directly in the SQL query:
Nice! Just be aware that transforming lng-lat to EPSG:3785 returns points where the origin of the axes is at the centre of the map, and the boundaries are defined by the standard as -20037508.3428, -19971868.8804, 20037508.3428, 19971868.8804. It’s simple to translate the origin to the top left corner and normalize the size in pixels to obtain the same results as the Python function.
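The translation and normalisation step can be sketched as follows. This is an illustration under assumptions, not the code used in the app: the function name is made up, and the default bounds are simply the four numbers quoted above, stretched over the whole image.

```python
# Bounds quoted in the text, as (min_x, min_y, max_x, max_y).
MERCATOR_BOUNDS = (-20037508.3428, -19971868.8804, 20037508.3428, 19971868.8804)


def mercator_to_pixel(x, y, width, height, bounds=MERCATOR_BOUNDS):
    """Convert projected coordinates to pixels with a top left origin."""
    min_x, min_y, max_x, max_y = bounds
    # Shift the origin from the map centre to the left edge, then scale.
    px = (x - min_x) / (max_x - min_x) * width
    # Projected y grows northwards while pixel y grows downwards, so flip it.
    py = (max_y - y) / (max_y - min_y) * height
    return px, py


if __name__ == "__main__":
    # The projected origin lands in the middle of an 800x600 image.
    print(mercator_to_pixel(0.0, 0.0, 800, 600))
```

Whether the output matches a pure Python projection pixel for pixel depends on both using the same map extent, so treat the bounds as a parameter to tune.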
uh, one last thing I never managed to permanently store in my brain: LONGITUDE is the X on the map, while LATITUDE is the Y. For me it’s easier to remember by visualizing the equivalence X-Y -> LNG-LAT.
I’m moving all my content out of my previous WordPress to host it on GitHub Pages with Jekyll. Everything is still a bit broken, but all the content is here. I don’t intend to spend time setting up 301 redirects for the few posts I have on my old blog, but hopefully the permalink structure should be the same. If you happen to search for anything in particular and cannot find it, please contact me via a Twitter mention @grudelsud
Here are a few links I found quite useful while moving things around and getting acquainted with Jekyll: