How to Increase Retention and Revenue in 1,000 Nontrivial Steps

One of the main projects I worked on last year.

Data for Breakfast

Recently, Automattic created a Marketing Data team to support marketing efforts with dedicated data capabilities. As we got started, one important question loomed for me and my teammate Demet Dagdelen: What should we data scientists do as part of this team?

Even though the term data science has been heavily used in the past few years, its meaning still lacks clarity. My current definition for data science is: “a field that deals with description, prediction, and causal inference from data in a manner that is both domain-independent and domain-aware, with the ultimate goal of supporting decisions.” This is a very broad definition that offers a vague direction for what marketing data scientists should do. Indeed, many ideas for data science work were thrown around when the team was formed. Because Demet and I wanted our work to be proactive and influential, we suggested a long-term marketing data science…

View original post 2,068 more words

Introducing pipe, The Automattic Machine Learning Pipeline

One of the main projects I’ve been working on over the past year.

Data for Breakfast


A generalized machine learning pipeline, pipe serves the entire company and helps Automatticians seamlessly build and deploy machine learning models to predict the likelihood that a given event may occur, e.g., installing a plugin, purchasing a plan, or churning.

A team effort, pipe provides general, long-term, and robust solutions to common or important problems our product and marketing teams face. When I first joined Automattic almost exactly three years ago, my tasks were two-fold:

  1. I had the autonomy and freedom to delve deep into topics of my choice, which at the time revolved around uncovering the networks hiding within our communities using network science.
  2. But like most data scientists in the industry, most of my time was spent serving product and marketing teams by providing answers to their data questions which ranged from running simple SQL-like queries to doing more in-depth — but one-off — statistical analyses.

We soon…

View original post 1,024 more words


The long road to a lifestyle business

Almost a year ago, I left my last full-time job and decided to set out on an independent path that includes data science consulting and work on my own projects. The ultimate goal is to generate enough passive income to live comfortably, so that I don’t have to sell my time for money. My five main areas of focus are – in no particular order – personal branding & networking, data science contracting, Bandcamp Recommender, Price Dingo, and marine conservation. This post summarises what I’ve been doing in each of these five areas, including highlights and lowlights. So far, it’s way better than having a “real” job. I hope this post will help others who are on a similar journey (there seem to be more and more of us – I’d love to hear from you).

Personal branding & networking

Finding clients requires considerably more work than finding a full-time job. As with job hunting, the ideal situation is where people come to you for help, rather than you chasing them. To this end, I’ve been networking a lot, giving talks, writing up posts and working on distributing them. It may be harder than getting a full-time job, but it’s also much more interesting.

Highlights: going viral in China, getting a post featured in KDNuggets
Lowlights: not having enough time to write all the things and meet all the people

Data science contracting

My goal with contracting/consulting is to have a steady income stream while working on my own projects. As my projects are small enough to be done only by me (with optional outsourcing to contractors), this means I have infinite runway to pursue them. While this is probably not the best way of building a Silicon Valley-style startup that is going to make the world a better place, many others have applied this approach to building a so-called lifestyle business, which is what I want to achieve.

Early on, I realised that doing full-on consulting would be too time-consuming, as many clients expect full-time availability. In addition, constantly needing to find new clients means that not much time would be left for work on my own projects. What I really wanted was a stable part-time gig. The first one was with GetUp (who reached out to me following a workshop I gave at General Assembly), where I did some work on forecasting engagement and churn. In parallel, I went through the interview process at DuckDuckGo, which included delivering a piece of work to production. DuckDuckGo ended up wanting me to work full-time (like a few other companies), so last month I started a part-time (three days a week) contract at Commonwealth Bank. I joined a team of very strong data scientists – it looks like it’s going to be interesting.

Highlights: seeing my DuckDuckGo work every time I search for a Python package, the work environment at GetUp
Lowlights: chasing leads that never eventuated

Bandcamp Recommender (BCRecommender)

I’ve written several posts about BCRecommender, my Bandcamp music recommendation project. While I’ve always treated it as a side-project, it’s been useful in learning how to get traction for a product. It now has thousands of monthly users, and is still growing. My goal for BCRecommender has changed from the original one of finding music for myself to growing it enough to be a noticeable source of traffic for Bandcamp, thereby helping artists and fans. Doing it in side-project mode can be a bit challenging at times (because I have so many other things to do and a long list of ideas to make the app better), but I’ve been making gradual progress and discovering a lot of great music in the process.

Highlights: every time someone gives me positive feedback, every time I listen to music I found using BCRecommender
Lowlights: dealing with Parse issues and random errors

Price Dingo

The inability to reliably compare prices for many types of products has been bothering me for a while. Unlike general web search, where the main providers rank results by relevance, most Australian price comparison engines still require merchants to pay to even have their products listed. This creates an obvious bias in the results. To address this bias, I created Price Dingo – a user-centric price comparison engine. It serves users with results they can trust by not requiring merchants to pay to have their products listed. Just like general web search engines, the main ranking factor is relevancy to the user. This relevancy is also achieved by implementing Price Dingo as a network of independent sites, each focused on a specific product category, with the first category being scuba diving gear.

Implementing Price Dingo hasn’t been too hard – the main challenge has been finding the time to do it with all the other stuff I’ve been doing. There are still plenty of improvements to be made to the site, but now the main goal is to get enough traction to make ongoing time investment worthwhile. Judging by the experience of Booko’s founder, there is space in the market for niche price comparison sites and apps, so it is just a matter of execution.

Highlights: being able to finally compare dive gear prices, the joys of integrating Algolia
Lowlights: extracting data from messy websites – I’ve seen some horrible things…

Marine conservation

The first thing I did after leaving my last job was go overseas for five weeks, which included a ten-day visit to Israel (rockets!) and three weeks of conservation diving with New Heaven Dive School in Thailand. Back in Sydney, I joined the Underwater Research Group of NSW, a dive club that’s involved in many marine conservation and research activities, including Reef Life Survey (RLS) and underwater cleanups. With URG, I’ve been diving more than before, and for a change, some of my dives actually do good. I’d love to do this kind of stuff full-time, but there’s a lot less money in getting people to do less stuff (i.e., conservation and sustainability) than in getting them to consume more. The compromise for now is that a portion of Price Dingo’s scuba revenue goes to the Australian Marine Conservation Society, and the plan is to expand this to other charities as more categories are added. Update – May 2015: I decided that this compromise isn’t good enough for me, so I shut down Price Dingo to focus on projects that are more aligned with my values.

Highlights: becoming a certified RLS diver, pretty much every dive
Lowlights: cutting my hand open by falling on rocks on the first day of diving in Thailand

The future

So far, I’m pretty happy with this not-having-a-job-doing-my-own-thing business. According to The 1000 Day Rule, I still have a long way to go until I get the lifestyle I want. It may even take longer than 1000 days given my decision to not work full-time on a single profitable project, together with my tendency to take more time off than I would if I had a “real” job. But the beauty of this path is that there are no investors breathing down my neck or the feeling of mental rot that comes with a full-time job, so there’s really no rush and I can just enjoy the ride.

Automating bulk data imports

Parse is a great backend-as-a-service (BaaS) product. It removes much of the hassle involved in backend devops with its web hosting service, SDKs for all the major mobile platforms, and a generous free tier. Parse does have its share of flaws, including various reliability issues (which seem to be getting rarer), and limitations on what you can do (which is a reasonable price to pay for working within a sandboxed environment). One such limitation is the lack of APIs to perform bulk data imports. This post introduces my workaround for this limitation (tl;dr: it’s a PhantomJS script).

Update: The script no longer works due to changes to Parse’s website. I won’t be fixing it since I’ve migrated my projects off the platform. If you fix it, let me know and I’ll post a link to the updated script here.

I use Parse for two of my projects: BCRecommender and Price Dingo. In both cases, some of the data is generated outside Parse by a Python backend. Doing all the data processing within Parse is not a viable option, so a solution for importing this data into Parse is required.

My original solution for data import was using the Parse REST API via ParsePy. The problem with this solution is that Parse billing is done on a requests/second basis. The free tier includes 30 requests/second, so importing BCRecommender’s ~million objects takes about nine hours when operating at maximum capacity. However, operating at maximum capacity causes other client requests to be dropped (i.e., real users suffer). Hence, some sort of rate limiting is required, which makes the sync process take even longer.
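As a rough sanity check on the nine-hour figure, here is a minimal sketch of the arithmetic. It assumes one object per request; the 15 requests/second throttle is a hypothetical value chosen only to show how rate limiting stretches the import.

```python
def import_hours(num_objects: int, requests_per_second: float) -> float:
    """Hours needed to import num_objects, assuming one object per request."""
    return num_objects / requests_per_second / 3600


# ~1 million objects at the free tier's 30 requests/second (figures from the post).
full_speed = import_hours(1_000_000, 30)

# Hypothetical throttle that leaves half the capacity for real users.
half_speed = import_hours(1_000_000, 15)

print(f"{full_speed:.1f} hours at 30 requests/second")
print(f"{half_speed:.1f} hours at 15 requests/second")
```

Halving the request rate doubles the import time, which is why any rate limiting makes the sync noticeably slower.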

I thought that using batch requests would speed up the process, but it actually slowed it down! This is because batch requests are billed according to the number of sub-requests, so making even one successful batch request per second with the maximum number of sub-requests (50) causes more requests to be dropped. I implemented some code to retry failed requests, but the whole process was just too brittle.
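The billing effect can be illustrated with a small sketch. The limit and batch size come from the figures above; the function names are made up for this example and are not part of any Parse SDK.

```python
FREE_TIER_LIMIT = 30  # billed requests per second on the free tier
MAX_BATCH_SIZE = 50   # maximum sub-requests in one batch request


def billed_per_second(batches_per_second: float, batch_size: int) -> float:
    """Each sub-request in a batch counts as one billed request."""
    return batches_per_second * batch_size


def largest_batch_within_limit(limit: int, batches_per_second: int = 1) -> int:
    """Largest batch size that keeps billed requests at or under the limit."""
    return limit // batches_per_second


full_batch = billed_per_second(1, MAX_BATCH_SIZE)
print(full_batch > FREE_TIER_LIMIT)                 # one full batch already exceeds the limit
print(largest_batch_within_limit(FREE_TIER_LIMIT))  # objects per second is still capped at 30
```

Under this billing model, batching does not raise the number of objects that can be imported per second. It only makes each request burstier.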

A few months ago I discovered that Parse supports bulk data import via the web interface (with no API support). This feature comes with the caveat that existing collections can’t be updated: a new collection must be created. This is actually a good thing, as it essentially makes the collections immutable. And immutability makes many things easier.

BCRecommender data gets updated once a month, so I was happy with manually importing the data via the web interface. As a price comparison engine, Price Dingo’s data changes more frequently, so manual updates are out of the question. For Price Dingo to be hosted on Parse, I had to find a way to automate bulk imports. Some people suggest emulating the requests made by the web interface, but this requires relying on hardcoded cookie and CSRF token data, which may change at any time. A more robust solution would be to scriptify the manual actions, but how? PhantomJS, that’s how.

I ended up implementing a PhantomJS script that logs in as the user and uploads a dump to a given collection. This script is available on GitHub Gist. To run it, simply install PhantomJS and run:

$ phantomjs --ssl-protocol any import-parse-class.js <configFile> <dumpFile> <collectionName>

See the script’s source for a detailed explanation of the command-line arguments.
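Since the rest of the pipeline is in Python, the command above could be wrapped in a small helper. This is a minimal sketch, assuming `phantomjs` is on the PATH and the script sits in the working directory. The function names and the example file names are hypothetical.

```python
import subprocess
from typing import List


def build_command(config_file: str, dump_file: str, collection_name: str,
                  script: str = "import-parse-class.js") -> List[str]:
    """Build the same argument list as the shell command shown above."""
    return ["phantomjs", "--ssl-protocol", "any", script,
            config_file, dump_file, collection_name]


def run_import(config_file: str, dump_file: str, collection_name: str) -> None:
    """Run the import and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(config_file, dump_file, collection_name),
                   check=True)


# Hypothetical file and collection names, shown only to illustrate the call.
print(" ".join(build_command("config.json", "dump.json", "Products20150101")))
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.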

It is worth noting that the script doesn’t do any post-upload verification on the collection. This is done by an extra bit of Python code that verifies that the collection has the expected number of objects, and tries to query the collection sorted by all the keys that are supposed to be indexed (for large collections, it takes Parse a while to index all the fields, which may result in timeouts). Once these conditions are fulfilled, the Parse hosting code is updated to point to the new collection. For security, I added a bot user that has access only to the Parse app that it needs to update. Unlike the root user, this bot user can’t delete the app. As the config file contains the bot’s password, it should be encrypted and stored in a safe place (like the Parse master key).
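The verification step described above could look roughly like the following. This is a sketch under stated assumptions, not the actual verification code: `count_objects` and `query_sorted_by` are hypothetical callables standing in for whatever Parse client is used, and the sketch assumes a sorted query raises `TimeoutError` while a field is still being indexed.

```python
import time
from typing import Callable, Iterable


def verify_collection(
    count_objects: Callable[[], int],           # hypothetical: returns the collection size
    query_sorted_by: Callable[[str], object],   # hypothetical: runs a query sorted by a key
    expected_count: int,
    indexed_keys: Iterable[str],
    max_attempts: int = 10,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the count matches and every sorted query succeeds."""
    keys = list(indexed_keys)
    for attempt in range(max_attempts):
        try:
            if count_objects() == expected_count:
                for key in keys:
                    # Assumed to raise TimeoutError until the key is indexed.
                    query_sorted_by(key)
                return True
        except TimeoutError:
            pass  # indexing is probably still in progress, so try again later
        if attempt < max_attempts - 1:
            sleep(wait_seconds)
    return False
```

Only when this check returns `True` would the hosting code be switched to the new collection. A `False` result leaves the previous collection in use.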

That’s it! I hope that other people will find this solution useful. Any suggestions/comments/issues are very welcome.

Image source: Parse Blog.