software engineering

Automating bulk data imports

Parse is a great backend-as-a-service (BaaS) product. It removes much of the hassle involved in backend devops with its web hosting service, SDKs for all the major mobile platforms, and a generous free tier. Parse does have its share of flaws, including various reliability issues (which seem to be getting rarer), and limitations on what you can do (which is reasonable price to pay for working within a sandboxed environment). One such limitation is the lack of APIs to perform bulk data imports. This post introduces my workaround for this limitation (tl;dr: it’s a PhantomJS script).

Update: The script no longer works due to changes to Parse’s website. I won’t be fixing it since I’ve migrated my projects off the platform. If you fix it, let me know and I’ll post a link to the updated script here.

I use Parse for two of my projects: BCRecommender and Price Dingo. In both cases, some of the data is generated outside Parse by a Python backend. Doing all the data processing within Parse is not a viable option, so a solution for importing this data into Parse is required.

My original solution for data import was using the Parse REST API via ParsePy. The problem with this solution is that Parse billing is done on a requests/second basis. The free tier includes 30 requests/second, so importing BCRecommender’s ~million objects takes about nine hours when operating at maximum capacity. However, operating at maximum capacity causes other client requests to be dropped (i.e., real users suffer). Hence, some sort of rate limiting is required, which makes the sync process take even longer.

I thought that using batch requests would speed up the process, but it actually slowed it down! This is because batch requests are billed according to the number of sub-requests, so making even one successful batch request per second with the maximum number of sub-requests (50) causes more requests to be dropped. I implemented some code to retry failed requests, but the whole process was just too brittle.

A few months ago I discovered that Parse supports bulk data import via the web interface (with no API support). This feature comes with the caveat that existing collections can’t be updated: a new collection must be created. This is actually a good thing, as it essentially makes the collections immutable. And immutability makes many things easier.

BCRecommender data gets updated once a month, so I was happy with manually importing the data via the web interface. As a price comparison engine, Price Dingo’s data changes more frequently, so manual updates are out of the question. For Price Dingo to be hosted on Parse, I had to find a way to automate bulk imports. Some people suggest emulating the requests made by the web interface, but this requires relying on hardcoded cookie and CSRF token data, which may change at any time. A more robust solution would be to scriptify the manual actions, but how? PhantomJS, that’s how.

I ended up implementing a PhantomJS script that logs in as the user and uploads a dump to a given collection. This script is available on GitHub Gist. To run it, simply install PhantomJS and run:

$ phantomjs --ssl-protocol any import-parse-class.js <configFile> <dumpFile> <collectionName>

See the script’s source for a detailed explanation of the command-line arguments.

It is worth noting that the script doesn’t do any post-upload verification on the collection. This is done by an extra bit of Python code that verifies that the collection has the expected number of objects, and tries to query the collection sorted by all the keys that are supposed to be indexed (for large collections, it takes Parse a while to index all the fields, which may result in timeouts). Once these conditions are fulfilled, the Parse hosting code is updated to point to the new collection. For security, I added a bot user that has access only to the Parse app that it needs to update. Unlike the root user, this bot user can’t delete the app. As the config file contains the bot’s password, it should be encrypted and stored in a safe place (like the Parse master key).

That’s it! I hope that other people would find this solution useful. Any suggestions/comments/issues are very welcome.

Image source: Parse Blog.

What is data science?

Data science has been a hot term in the past few years. Despite this fact (or perhaps because of it), it still seems like there isn’t a single unifying definition of data science. This post discusses my favourite definition.

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

— Josh Wills (@josh_wills) May 3, 2012

One of my reasons for doing a PhD was wanting to do something more interesting than “vanilla” software engineering. When I was in the final stages of my PhD, I started going to meetups to see what’s changed in the world outside academia. Back then, I defined myself as a “software engineer with a research background”, which didn’t mean much to most people. My first post-PhD job ended up being a data scientist at a small startup. As soon as I changed my LinkedIn title to Data Scientist, many offers started flowing. This is probably the reason why so many people call themselves data scientists these days, often diluting the term to a point where it’s so broad it becomes meaningless. This post presents my preferred data science definitions and my opinions on who should or shouldn’t call themselves a data scientist.

Defining data science

I really like the definition quoted above, of data science as the intersection of software engineering and statistics. Ofer Mendelevitch goes into more detail, drawing a continuum of professions that ranges from software engineer on the left to pure statistician (or machine learning researcher) on the right. data skill continuum This continuum contains two additional roles, which are often confused with data scientists:

  • Data engineer: a software engineer that deals with data plumbing (traditional database setup, Hadoop, Spark and all the rest)
  • Data analyst: a person who digs into data to surface insights, but lacks the skills to do so at scale (e.g., they know how to use Excel, Tableau and SQL but can’t build a web app from scratch)

Data science mixes all these roles. Because of this, there are few true data science positions for people with no work experience. A successful data scientist needs to be able to “become one with the data” by exploring it and applying rigorous statistical analysis (right-hand side of the continuum). But good data scientists also understand what it takes to deploy production systems, and are ready to get their hands dirty by writing code that cleans up the data or performs core system functionality (left-hand side of the continuum). Gaining all these skills takes time. It is still somewhat rare to find people who are true data scientists according to this definition, which is why Ofer Mendelevitch’s post recommends building teams that consist of people with skills from both sides of the continuum.

How is data science different from just science?

Data is everywhere. Extracting knowledge from data is an essential part of any science. Hence, the name data science doesn’t really capture what’s new about the field. The way I see it, the novelty of data science comes from the application of software to model any type of data in a way that generalises across domains. So while a physicist may use software to build models based on data, they won’t become a data scientist until they’ve gone and applied these skills to other fields (as many physicists end up doing). As Kaggle shows, data scientists can work on a wide variety of problems – from biology and physics to marketing, text mining and web search personalisation. It’s often the case in Kaggle competitions that the same people apply similar techniques to very different problems, obtaining results that significantly improve on the state of the art.

However, domain experts such as physicists aren’t going to be made redundant any time soon. Contrary to what Kaggle may have you believe, there is much more to data science than predictive modelling on a well-defined problem. Data scientists typically spend much of their time working with domain experts to define the problem, and chasing down diverse data sources to extract features that enable predictive modelling (also known as “the fun part”). Despite the existence of these less-glamorous aspects of data science, there’s still a lot of fun to be had working in the area. I highly recommend getting into data science to people who enjoy such challenges.

Getting started as a data scientist is actually pretty simple: become a software engineer, become a data analyst, learn how to model data using software (e.g., by participating in Kaggle competitions), and find a job as a data scientist. Obviously, it’s not going to happen overnight. It took me around 10 ten years, and I’m still learning.

Building a recommender system on a shoestring budget (or: BCRecommender part 2 – general system layout)

This is the second part of a series of posts on my BCRecommender – personalised Bandcamp recommendations project.
Check out the first part for the general motivation behind this project.

BCRecommender is a hobby project whose main goal is to help me find music I like on Bandcamp. Its secondary goal is to serve as a testing ground for ideas I have and things I’d like to explore.
One question I’ve been wondering about is: how much money does one need to spend on infrastructure for a simple web-based product before it reaches meaningful traffic?
The answer is: not much at all. It can easily be done for less than $1 per month.
This post discusses my exploration of this question by describing the main components of the BCRecommender system, without getting into the algorithms that drive it (which will be covered in subsequent posts).

The general flow of BCRecommender is fairly simple: crawl publicly-available data from Bandcamp (fan collections and tracks/albums = tralbums), generate recommendations based on this data (static lists of tralbums indexed by fan for personalised recommendations and by tralbum for similarity), and present the recommendations to users in a way that’s easy to browse and explore (since we’re dealing with music it must be playable, which is easy to achieve by embedding Bandcamp’s iframes).

First iteration: Django & AWS

The first iteration of the project was implemented as a Django project. Having never built a Django project from scratch, I figured this would be a good way to learn how it’s done properly. One thing I was keen on learning was using the Django ORM with an SQL database (in the past I’ve worked with Django and MongoDB). This ended up working less smoothly than I expected, perhaps because I’m too used to MongoDB, or because SQL forces you to model your data in unnatural ways, or because I insisted on using SQLite for simplicity. Whatever it was, I quickly started missing MongoDB, despite its flaws.

I chose AWS for hosting because my personal account was under the free tier, and using a micro instance is more than enough for serving a website with no traffic. I considered Google App Engine with its indefinite free tier, but after reading the docs I realised I don’t want to jump through so many hoops to use their system – Google’s free tier was likely to cost too much in pain and time.

While an AWS micro instance is enough for serving the recommendations, it’s not enough for generating them. Rather than paying Amazon for another instance, I figured that using spare capacity on my own laptop (quad-core with 16GB of RAM) would be good enough. So the backend worker for BCRecommender ended up being a local virtual machine using one core and 4GB of RAM.

After some coding I had a nice setup in place:

  • AWS webserver running Django with SQLite as the database layer and a simple frontend, styled with Bootstrap
  • Local backend worker running Celery under Supervisor to collect the data (with errors reported to a dedicated Gmail account), Dropbox for backups, and Django management commands to generate the recommendations
  • Code and issue tracker hosted on Bitbucket (which provides free private repositories)
  • Fabric scripts for deployments to the AWS webserver and the local backend worker (including database sync as one big SQLite file)
  • Local virtual machine for development (provisioned with Vagrant)

This system wasn’t going to scale, but I didn’t care. I just used it to discover new music, and it worked. I didn’t even bother registering a domain name, so it was all running for free.

Second iteration: “Django” backend & Parse

A few months ago, Facebook announced that Parse’s free tier will include 30 requests / second. That’s over 2.5 million requests per day, which is quite a lot – probably enough to run the majority of websites on the internet. It seemed too good to be true, so I had to try it myself.

It took a few hours to convert the Django webserver/frontend code to Parse. This was fairly straightforward, and it had the added advantages of getting rid of some deployment scripts and having a more solid development environment. Parse supplies a command-line tool for deployment that constantly syncs the code to an app that is identical to the production app – much better than the Fabric script I had.

The disadvantages of the move to Parse were having to rewrite some of the backend in JavaScript (= less readable than Python), and a more complex data sync command (no longer just copying a big SQLite file). However, I would definitely use it for other projects because of the generous free tier, the availability of APIs for all major platforms, and the elimination of most operational concerns.

Current iteration: Goodbye Django, hello BCRecommender

With the Django webserver out of the way, there was little use left for Django in the project. It took a few more hours to get rid of it, replacing the management commands with Commandr, and the SQLite database with MongoDB (wrapped with the excellent MongoEngine, which has matured a lot in recent years). MongoDB has become a more natural choice now, since it is the database used by Parse. I expect this setup of a local Python backend and a Parse frontend to work quite well (and remain virtually free) for the foreseeable future.

The only fixed cost I now have comes from registering the domain and managing it with Route 53. This wasn’t required when I was running it only for myself, and I could have just kept it under, but I think it would be useful for other Bandcamp users. I would also like to use it as a training lab to improve my (poor) marketing skills – not having a dedicated domain just looks bad.

In summary, it’s definitely possible to build simple projects and host them for free. It also looks like my approach would scale way beyond the current BCRecommender volume. The next post in this series will cover some of the algorithms and general considerations of building the recommender system.