Automating Parse.com bulk data imports

Parse is a great backend-as-a-service (BaaS) product. It removes much of the hassle involved in backend devops with its web hosting service, SDKs for all the major mobile platforms, and a generous free tier. Parse does have its share of flaws, including various reliability issues (which seem to be getting rarer), and limitations on what you can do (which is a reasonable price to pay for working within a sandboxed environment). One such limitation is the lack of APIs to perform bulk data imports. This post introduces my workaround for this limitation (tl;dr: it’s a PhantomJS script).

Update: The script no longer works due to changes to Parse’s website. I won’t be fixing it since I’ve migrated my projects off the platform. If you fix it, let me know and I’ll post a link to the updated script here.

I use Parse for two of my projects: BCRecommender and Price Dingo. In both cases, some of the data is generated outside Parse by a Python backend. Doing all the data processing within Parse is not a viable option, so a solution for importing this data into Parse is required.

My original solution for data import was using the Parse REST API via ParsePy. The problem with this solution is that Parse billing is done on a requests/second basis. The free tier includes 30 requests/second, so importing BCRecommender’s roughly one million objects takes about nine hours when operating at maximum capacity. However, operating at maximum capacity causes other client requests to be dropped (i.e., real users suffer). Hence, some sort of rate limiting is required, which makes the sync process take even longer.
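As a back-of-the-envelope check on the nine-hour figure, the arithmetic can be sketched as follows. The function name and the `usage_fraction` parameter (the share of the request limit left for imports once rate limiting is applied) are assumptions made for illustration; they are not part of Parse or ParsePy.

```python
def import_hours(num_objects, requests_per_second, usage_fraction=1.0):
    """Estimate the hours needed to import objects at one object per request.

    usage_fraction is a hypothetical knob: the share of the request limit
    that the import is allowed to consume (1.0 means all of it).
    """
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    effective_rate = requests_per_second * usage_fraction
    return num_objects / effective_rate / 3600


# About one million objects at the free tier's 30 requests/second.
print(round(import_hours(1_000_000, 30), 1))  # 9.3
# The same import throttled to half the limit, leaving room for real users.
print(round(import_hours(1_000_000, 30, usage_fraction=0.5), 1))  # 18.5
```

Throttling to half the limit doubles the import time, which is why rate limiting alone was not an attractive fix.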

I thought that using batch requests would speed up the process, but it actually slowed it down! This is because batch requests are billed according to the number of sub-requests. A single batch with the maximum number of sub-requests (50) already exceeds the free tier’s 30 requests/second, so making even one successful batch request per second causes more requests to be dropped. I implemented some code to retry failed requests, but the whole process was just too brittle.
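To see why batching cannot beat the limit, here is a minimal sketch that assumes every sub-request counts as one request against the per-second limit, as described above. The function name is hypothetical and the result is a lower bound that ignores retries and network latency.

```python
import math


def batch_import_seconds(num_objects, batch_size, requests_per_second):
    """Lower bound on import time when each sub-request counts toward the limit."""
    num_batches = math.ceil(num_objects / batch_size)
    # Each batch uses up batch_size requests of the per-second budget, so
    # consecutive batches must be spaced at least this many seconds apart.
    seconds_per_batch = batch_size / requests_per_second
    return num_batches * seconds_per_batch


unbatched = batch_import_seconds(1_000_000, 1, 30)
batched = batch_import_seconds(1_000_000, 50, 30)
# Both come to roughly 33,333 seconds (a little over nine hours).
print(round(unbatched), round(batched))
# One full batch consumes about 1.67 seconds' worth of the request budget.
print(round(50 / 30, 2))
```

Under this assumption batching only changes how the requests are grouped, not how many are billed, so the best case is no faster than sending objects one at a time.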

A few months ago I discovered that Parse supports bulk data import via the web interface (with no API support). This feature comes with the caveat that existing collections can’t be updated: a new collection must be created. This is actually a good thing, as it essentially makes the collections immutable. And immutability makes many things easier.

BCRecommender data gets updated once a month, so I was happy with manually importing the data via the web interface. As a price comparison engine, Price Dingo’s data changes more frequently, so manual updates are out of the question. For Price Dingo to be hosted on Parse, I had to find a way to automate bulk imports. Some people suggest emulating the requests made by the web interface, but this requires relying on hardcoded cookie and CSRF token data, which may change at any time. A more robust solution would be to scriptify the manual actions, but how? PhantomJS, that’s how.

I ended up implementing a PhantomJS script that logs in as the user and uploads a dump to a given collection. This script is available on GitHub Gist. To run it, simply install PhantomJS and run:

$ phantomjs --ssl-protocol any import-parse-class.js <configFile> <dumpFile> <collectionName>

See the script’s source for a detailed explanation of the command-line arguments.
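Since the data is produced by a Python backend, the command above can also be launched from Python. The following is a minimal sketch: the function names are hypothetical, and it assumes that `phantomjs` is on the `PATH` and that the script exits with a non-zero status on failure, which the script’s source would need to confirm.

```python
import subprocess


def build_import_command(config_file, dump_file, collection_name,
                         script="import-parse-class.js"):
    """Build the argument list for the PhantomJS import command shown above."""
    return ["phantomjs", "--ssl-protocol", "any", script,
            config_file, dump_file, collection_name]


def run_import(config_file, dump_file, collection_name):
    """Run the import; raises CalledProcessError on a non-zero exit status."""
    command = build_import_command(config_file, dump_file, collection_name)
    # Passing a list (rather than a shell string) avoids quoting problems
    # with file names that contain spaces.
    subprocess.run(command, check=True)
```

For example, `run_import("config.json", "dump.json", "Products")` would run the same command as the shell line above, with those three placeholder values as arguments.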

It is worth noting that the script doesn’t do any post-upload verification on the collection. This is done by an extra bit of Python code that verifies that the collection has the expected number of objects, and tries to query the collection sorted by each of the keys that are supposed to be indexed (for large collections, it takes Parse a while to index all the fields, which may result in timeouts). Once these conditions are fulfilled, the Parse hosting code is updated to point to the new collection.

For security, I added a bot user that has access only to the Parse app that it needs to update. Unlike the root user, this bot user can’t delete the app. As the config file contains the bot’s password, it should be encrypted and stored in a safe place, just like the Parse master key.
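The post-upload verification described above can be sketched roughly as follows. This is not the original verification code: `count_objects` and `query_sorted_by` are hypothetical callables that would wrap whatever Parse client is in use, and the assumption that a query against a not-yet-indexed key raises `TimeoutError` is purely illustrative.

```python
import time


def verify_collection(count_objects, query_sorted_by, expected_count,
                      indexed_keys, max_attempts=5, wait_seconds=60,
                      sleep=time.sleep):
    """Return True if the collection has the expected size and every key
    in indexed_keys can be used for a sorted query without timing out."""
    if count_objects() != expected_count:
        return False
    for key in indexed_keys:
        for attempt in range(max_attempts):
            try:
                query_sorted_by(key)
                break  # This key is queryable; move on to the next one.
            except TimeoutError:
                # Indexing may still be in progress; wait and try again.
                if attempt + 1 < max_attempts:
                    sleep(wait_seconds)
        else:
            # Every attempt for this key timed out.
            return False
    return True
```

Only when a check like this returns `True` would the hosting code be switched over to the new collection. Passing the two callables in keeps the sketch independent of any particular client library and makes it easy to exercise with fakes.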

That’s it! I hope that other people will find this solution useful. Any suggestions/comments/issues are very welcome.


Image source: Parse Blog.


3 comments

  1. Hi, very nice trick! I’m trying to implement this as we speak. Does this code still work? I get to the collections page, but I don’t think the upload is working. I’m new to PhantomJS.
    Thanks!


    1. Hi Walter! Yeah, the code stopped working when Parse redesigned their website. I never fixed it because I ended up porting my projects away from Parse. If you fix it, let me know and I’ll update this post.
      By the way, you may find it easier to use Selenium (or something similar) as a wrapper around PhantomJS, as it should result in cleaner code. For example, check out Python’s Selenium bindings: http://selenium.googlecode.com/svn/trunk/docs/api/py/index.html

