My short-term goal is to have simple graphs of some basic metrics over time:
- Facebook page likes and shares, posts, post likes and shares.
- Twitter followers (and following), tweets, favorites, retweets.
- Google+ followers, views, +1’s, shares.
I’m not sure why I want this data, but if I do find a good reason in the future, I’d better start tracking it now!
In this post, I describe how I went from the idea of collecting this data to a deployed data-collection app on Google App Engine, in under 5 hours.
The case for quick & dirty
Yesterday I wasn’t collecting any metrics. Every passing day, I lose more uncollected metrics.
If I can throw together a quick solution that starts collecting metrics today, it has value, even if I throw it away next week for something completely different.
This is similar to what I mentioned in my side-projects workflow post. I referred to quick prototyping for learning there, but it seems appropriate here as well.
The case for using Google App Engine
This is not an introduction to App Engine 🙂 .
I want to get some metrics collection app up and running quickly. I don’t have any code written yet.
What are my basic requirements here?
- Something that can fetch the data I want to collect periodically. So it has to be online, preferably all the time.
- Use a hosted & managed service, so I don’t need to take care of infrastructure myself. This means a home computer with some scheduled jobs is out.
- It needs to support cron-like action scheduling, for periodic fetching.
- It needs to have some kind of datastore / database to collect the metrics.
- It needs to run code in a technology I’m familiar with, so I can do it quickly.
These requirements point to Platform as a Service (PaaS). Basically, at the cost of some limitations, most PaaSes make it easy to rapidly develop and deploy something. They take care of keeping it online, up and running, and all that other operations stuff.
Google App Engine meets the requirements. It also helps that I have some experience working with it 🙂 .
First 3 hours – trying to use Facebook & Twitter APIs
The right way to collect the metrics I want from various social thingies is to use their official APIs.
- Facebook has The Graph API. It is the primary way for apps to read and write to the Facebook social graph.
- Twitter has its REST API. It provides programmatic access to read and write Twitter data.
- Google+ has the Google+ API. This one is read-only, which is sufficient for what I’m trying to do here.
My natural inclination is towards the right way to solve problems. So, naturally, I thought I could use these APIs to get a quick metrics collection app up and running.
After about 3 hours of tinkering with the Facebook and Twitter APIs, I realized it’s not going to be the quickest route for me.
Getting the data I want using the APIs seems simple enough.
- For Facebook, I found Insights. It exposes a lot of metrics, including the ones I want to track.
- For Twitter, the users/show endpoint has the data I want.
The hard part is all the hoop jumping required to be able to use the API! Every service has its own setup process, with steps like “register as a developer” and “create an app”.
Having spent about 3 hours just trying to figure out how to get access to the APIs, I decided to go with a dirtier approach for the v0 prototype.
Quick & dirty approach – fetch public pages like a user and parse them
I figured that the basic metrics I want to collect are public. Anyone can view the public profiles of my social thingies, and see metrics like followers count and likes count.
If an unauthenticated user can see this data, then an unauthenticated App Engine app can “see” it too!
This is exactly what I did, for the initial quick & dirty version. Summary of the solution:
- An App Engine cron task runs a Python function every hour for every social thingie.
- The Python function fetches the public page of the requested social thingie.
- The Python function parses the HTML, extracts the desired metrics, and stores them in the App Engine datastore (NDB).
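The hourly scheduling part is a cron.yaml file. A minimal sketch, with hypothetical handler URLs (the real file is in the repository):

```yaml
cron:
- description: fetch twitter stats
  url: /fetch/twitter
  schedule: every 1 hours
- description: fetch facebook stats
  url: /fetch/facebook
  schedule: every 1 hours
```

App Engine issues an HTTP GET to each `url` on its `schedule`, so “run this function every hour” becomes “route this URL to that function.”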
I published the entire thing as the ostrich-online project on GitHub. Here are some pointers, in case you want to dig into it:
- The cron.yaml file defines the hourly metrics fetching tasks. Behind the scenes, App Engine issues an HTTP GET request to the specified URL on the specified schedule.
- The app.yaml file maps URLs to Python scripts that handle them. In this case, everything is handled by main.py. The app.yaml file also adds the lxml library to the App Engine Python sandbox, so it can be used to parse fetched HTML.
- The main.py file does all the work.
- It maps specific URLs to handler classes.
- All handler classes derive from the SocialStatsFetcherBase class, which implements the get method that is called to handle HTTP GET requests for the routed URL.
- Each of the derived classes defines a couple of properties to customize the handling and processing.
- The models.py file defines NDB models (similar to a database schema) for collected metrics.
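The handler pattern can be sketched like this. This is a simplified, hypothetical version: the real handlers subclass App Engine’s webapp2 request handler, fetch with urlfetch, and store via NDB models; the property names and regex rules below are illustrative, not the project’s actual ones.

```python
import re


class SocialStatsFetcherBase:
    """Base handler: fetch the configured page, extract metrics, store them.
    Subclasses only supply the customization points (URL and parsing rules)."""

    page_url = None       # public profile page to fetch
    metric_patterns = {}  # metric name -> regex with one numeric capture group

    def get(self):
        # Called for HTTP GET requests on the routed URL
        # (which the App Engine cron scheduler issues periodically).
        html = self.fetch(self.page_url)
        metrics = self.parse(html)
        self.store(metrics)
        return metrics

    def fetch(self, url):
        raise NotImplementedError  # App Engine urlfetch in the real app

    def parse(self, html):
        # Apply each ad-hoc rule to the fetched HTML.
        results = {}
        for name, pattern in self.metric_patterns.items():
            match = re.search(pattern, html)
            if match:
                results[name] = int(match.group(1))
        return results

    def store(self, metrics):
        pass  # NDB model put() in the real app


class TwitterStatsFetcher(SocialStatsFetcherBase):
    # Illustrative values only.
    page_url = "https://twitter.com/example"
    metric_patterns = {"followers": r'data-followers="(\d+)"'}
```

The base class owns the fetch–parse–store flow; adding a new social thingie only means adding a small subclass with its URL and parsing rules.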
This, of course, is not the right way to do it! The public pages that I fetch and parse are meant for humans, not for computers. This is the main reason APIs exist. Facebook should be able to change their layout and design without breaking other apps that use its data. APIs can be consistent, stable, and versioned. Web pages for humans don’t have to be.
A little more about parsing the fetched pages
I tried using the same HTML-parsing approach with Facebook, but couldn’t get it to work. Instead of digging into it, I noticed the likes and talking about metrics appear in the meta description tag, so I used regular expressions to extract them.
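The meta-description trick looks roughly like this. The sample markup and the regex are illustrative (the real tag differs and can change at any time), but the idea is the same:

```python
import re

# Illustrative snippet of a page's meta description tag;
# the real Facebook markup differs and can change at any time.
SAMPLE_HTML = (
    '<meta name="description" '
    'content="Example Page. 1,234 likes · 56 talking about this.">'
)


def parse_facebook_meta(html):
    """Pull the likes and talking-about counts out of the meta
    description text, instead of parsing the full DOM."""
    match = re.search(r"([\d,]+) likes[^\d]+([\d,]+) talking about this", html)
    if not match:
        return None
    likes, talking = (int(group.replace(",", "")) for group in match.groups())
    return {"likes": likes, "talking_about": talking}
```

It is just as fragile as DOM parsing, but the description text tends to be simpler than the page layout.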
In all cases, I used Chrome Developer Tools to inspect the DOM of the pages I fetch, and deduce ad-hoc rules to get to the data I want.
This, of course, makes it quite unstable. Minor changes in the way these pages are rendered can break my logic. I actually experienced such issues when developing!
Before deploying my code to App Engine, I used the App Engine Python local development server to test things locally. The development server makes URL fetches directly from the local computer, which can cause issues if the fetched page behaves differently depending on where the request comes from. For instance, I initially got a Hebrew version of Facebook from the development server, and an English version from deployed App Engine. This is the reason I added an Accept-Language header to the URL fetch.
There you have it. My quick & dirty App Engine powered social metrics collection app.
I don’t yet have anything to show beyond the published code, as I haven’t written a front-end yet.
The entire thing took me about 5 hours, from defining the need to a deployed App Engine app. It could have been just 2 hours, had I gone straight to the quick & dirty approach, skipping the API adventure.
Stay tuned for future developments, which will probably include a front-end for the collected data, and a migration to using APIs instead of parsing HTML.
If you are familiar with existing free open-source solutions that do what I’m trying to do here, please let me know! I may keep going with this one, just for fun, but I don’t really have intentions to build a serious analytics platform here…