How to (almost) win Kaggle competitions

Last week, I gave a talk at the Data Science Sydney Meetup group about some of the lessons I learned through almost winning five Kaggle competitions. The core of the talk was ten tips, which I think are worth putting in a post (the original slides are here). Some of these tips were covered in my beginner tips post from a few months ago. Similar advice was also recently published on the Kaggle blog – it’s great to see that my tips are in line with the thoughts of other prolific kagglers.

Tip 1: RTFM

It’s surprising to see how many people miss important details, such as the final date for making their first submission. Before jumping into building models, it’s important to understand the competition timeline, be able to reproduce benchmarks, generate the correct submission format, etc.

Tip 2: Know your measure

A key part of doing well in a competition is understanding how the evaluation measure works. It’s often easy to obtain significant improvements in your score by using an optimisation approach that is suited to the measure. A classic example is optimising the mean absolute error (MAE) versus the mean squared error (MSE). It’s easy to show that, given a set of numbers and no other information, the constant prediction that minimises the MAE is the median, while the constant prediction that minimises the MSE is the mean. Indeed, in the EMC Data Science Hackathon we fell back to the median rather than the mean when there wasn’t enough data, and that ended up working pretty well.
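To make this concrete, here’s a small numpy illustration (not from the original talk) that searches over constant predictions on a skewed sample and confirms that the MAE is minimised near the median and the MSE near the mean:

```python
import numpy as np

# Draw a skewed sample so the mean and median clearly differ.
rng = np.random.default_rng(0)
y = rng.exponential(scale=10, size=1000)

# Evaluate every constant prediction on a grid and compare the minimisers
# with the sample median and mean.
candidates = np.linspace(y.min(), y.max(), 1001)
mae = np.abs(y[:, None] - candidates[None, :]).mean(axis=0)
mse = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)

print('MAE minimiser: %.2f  median: %.2f' % (candidates[mae.argmin()], np.median(y)))
print('MSE minimiser: %.2f  mean: %.2f' % (candidates[mse.argmin()], y.mean()))
```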

Tip 3: Know your data

In Kaggle competitions, overspecialisation (without overfitting) is a good thing. This is unlike academic machine learning papers, where researchers often test their proposed method on many different datasets. This is also unlike more applied work, where you may care about data drifting and whether what you predict actually makes sense. Examples include the Hackathon, where the measurements of pollutants in the air were repeated for consecutive hours (i.e., they weren’t really measured); the multi-label Greek article competition, where I found connected components of labels (an approach that doesn’t generalise well to other datasets); and the Arabic writers competition, where I used histogram kernels to deal with the features we were given. The general lesson is that custom solutions win, and that’s why the world needs data scientists (at least until we are replaced by robots).
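The post doesn’t show how the connected components of labels were found, but the idea can be sketched as follows, assuming a hypothetical list of label sets (one per training document) and using networkx:

```python
import networkx as nx

# Hypothetical input: the set of labels assigned to each training document.
doc_labels = [{'politics', 'economy'}, {'economy', 'markets'}, {'sports'}]

# Labels are nodes; labels that co-occur in a document must end up in the same
# component, so chaining each document's labels is enough for connectivity.
graph = nx.Graph()
for labels in doc_labels:
    labels = sorted(labels)
    graph.add_nodes_from(labels)
    graph.add_edges_from(zip(labels, labels[1:]))

print(list(nx.connected_components(graph)))
# e.g., [{'economy', 'markets', 'politics'}, {'sports'}]
```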

Tip 4: What before how

It’s important to know what you want to model before figuring out how to model it. It seems like many beginners tend to worry too much about which tool to use (Python or R? Logistic regression or SVMs?), when they should be worrying about understanding the data and what useful patterns they want to capture. For example, when we worked on the Yandex search personalisation competition, we spent a lot of time looking at the data and thinking what makes sense for users to be doing. In that case it was easy to come up with ideas, because we all use search engines. But the main message is that to be effective, you have to become one with the data.

Tip 5: Do local validation

This is a point I covered in my Kaggle beginner tips post. Having a local validation environment allows you to move faster and produce more reliable results than when relying on the leaderboard. The main scenarios where you should skip local validation are when the data is too small (a problem I had in the Arabic writers competition), and when you run out of time (towards the end of the competition).
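As a minimal sketch of what a local validation setup can look like (using scikit-learn with stand-in data rather than anything from a specific competition):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; in a real competition this would be the training set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Use the competition's metric (accuracy here is just a placeholder) so that
# local scores are comparable to the leaderboard.
model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('CV accuracy: %.4f +/- %.4f' % (scores.mean(), scores.std()))
```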

Tip 6: Make fewer submissions

In addition to making you look good, making few submissions reduces the likelihood of overfitting the leaderboard, which is a real problem. If your local validation is set up well and is consistent with the leaderboard (which you need to test by making one or two submissions), there’s really no need to make many submissions. Further, if you’re doing well, making submissions erodes your competitive advantage by showing your competitors what scores are obtainable and motivating them to work harder. Just resist the urge to submit, unless you have a really good reason.

Tip 7: Do your research

For any given problem, it’s likely that there are people dedicating their lives to its solution. These people (often academics) have probably published papers, benchmarks and code, which you can learn from. Unlike actually winning, which doesn’t depend only on you, gaining deeper knowledge and understanding is the only sure reward of a competition. This has worked well for me, as I’ve learned something new and applied it successfully in nearly every competition I’ve worked on.

Tip 8: Apply the basics rigorously

While playing with obscure methods can be a lot of fun, it’s often the case that the basics will get you very far. Common algorithms have good implementations in most major languages, so there’s really no reason not to try them. However, note that when you try any method, you must do some minimal tuning of its main parameters (e.g., the number of trees in a random forest or the regularisation strength of a linear model). Running a method without minimal tuning is worse than not running it at all, because you may get a false negative – giving up on a method that actually works very well.
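For illustration, a coarse sweep over a single main parameter often does the job. The sketch below (stand-in data, not from any particular competition) tunes only the regularisation strength of a logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Sweep only the main knob (regularisation strength C) over a coarse log grid.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```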

An example of applying the basics rigorously is in the classic paper In defense of one-vs-all classification, where the authors showed that the simple one-vs-all (OVA) approach to multiclass classification is at least as good as approaches that are much more sophisticated. In their words: “What we find is that although a wide array of more sophisticated methods for multiclass classification exist, experimental evidence of the superiority of these methods over a simple OVA scheme is either lacking or improperly controlled or measured”. If such a failure to perform proper experiments can happen to serious machine learning researchers, it can definitely happen to the average kaggler. Don’t let it happen to you.
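In scikit-learn terms, the OVA scheme the paper defends takes very little code. The toy sketch below (not the paper’s experimental setup) fits one binary classifier per class and predicts the class with the highest score:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary logistic regression per class; the highest-scoring class wins.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))
print('OVA cross-validation accuracy:', cross_val_score(ova, X, y, cv=5).mean())
```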

Tip 9: The forum is your friend

It’s very important to subscribe to the forum to receive notifications on issues with the data or the competition. In addition, it’s worth trying to figure out what your competitors are doing. An extreme example is the recent trend of code sharing during the competition (which I don’t really like) – while it’s not a good idea to rely on such code, it’s important to be aware of its existence. Finally, reading the post-competition summaries on the forum is a valuable way of learning from the winners and improving over time.

Tip 10: Ensemble all the things

Not to be confused with ensemble methods (which are also very important), the idea here is to combine models that were developed independently. In high-profile competitions, it is often the case that teams merge and gain a significant boost from combining their models. This is worth doing even when competing alone, because almost no competition is won by a single model.
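The simplest form of this is a weighted average of predictions from independently developed models. The sketch below uses made-up probabilities, and the weights are only illustrative; in practice they would be chosen using local validation:

```python
import numpy as np

# Hypothetical predicted probabilities from two independently developed models
# on the same test instances (e.g., read from each teammate's submission file).
preds_model_a = np.array([0.91, 0.12, 0.55, 0.03])
preds_model_b = np.array([0.85, 0.30, 0.60, 0.10])

# A simple weighted average; the weights here are illustrative only.
blend = 0.6 * preds_model_a + 0.4 * preds_model_b
print(blend)
```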

Public comments are closed, but I love hearing from readers. Feel free to contact me with your thoughts.

Can you elaborate on what you mean in Tip 5 by stating “The main scenarios where you should skip local validation are when the data is too small …”? What I experienced is that with too few observations, the leaderboard becomes very misleading, so my intuition would be to use more local validation for small datasets, not less.

Good point. What I was referring to are scenarios where local validation is unreliable.

For example, in the Arabic writer identification competition (http://blog.kaggle.com/2012/04/29/on-diffusion-kernels-histograms-and-arabic-writer-identification/), each of the 204 writers had only two training paragraphs (all containing the same text), while the test/leaderboard instances were a third paragraph with different content. I tried many forms of local validation but none of them yielded results that were consistent with the leaderboard, so I ended up relying on the leaderboard score.

Ah, thanks, that clarifies what you meant. The (currently still running) Africa Soil Property contest (https://www.kaggle.com/c/afsis-soil-properties) seems a bit similar. I won’t put much more energy into that contest, but I am curious how it will work out in the end, and what things will have worked for the winners (maybe not much except pure luck).

Could you provide some tips on #3 (‘Know your data’) with respect to best-practice visualisations for gaining insights from data, especially considering the fact that datasets often have a large number of features. Plotting feature vs. label graphs does seem to be helpful, but for a large number of features it will be impractical. So how should one go about data analysis via visualisation?

It really depends on the dataset. For personal use, I don’t worry too much about pretty visualisations. Often just printing some summary statistics works well.

Most text classification problems are hard to visualise. If, for example, you use bag of words (or n-grams) as your feature set, you could just print the top words for each label, or the top words that vary between labels. Another thing to look at would be commonalities between misclassified instances – these could be dependent on the content of the texts or their length.
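As a rough sketch of the ‘top words per label’ idea, using a toy corpus and scikit-learn’s CountVectorizer (all names below are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus; in practice these would be the competition's documents and labels.
texts = ['cheap pills online', 'meeting agenda attached',
         'cheap meds sale', 'lunch meeting today']
labels = np.array(['spam', 'ham', 'spam', 'ham'])

vectoriser = CountVectorizer()
counts = vectoriser.fit_transform(texts)
words = np.array(vectoriser.get_feature_names_out())

# Print the most frequent words for each label.
for label in np.unique(labels):
    label_counts = np.asarray(counts[labels == label].sum(axis=0)).ravel()
    top = words[label_counts.argsort()[::-1][:3]]
    print(label, list(top))
```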

Examples:

  • In the Greek Media Monitoring competition (http://yanirseroussi.com/2014/10/07/greek-media-monitoring-kaggle-competition-my-approach/), I found that ‘Despite being manually annotated, the data isn’t very clean. Issues include identical texts that have different labels, empty articles, and articles with very few words. For example, the training set includes ten “articles” with a single word. Five of these articles have the word 68839, but each of these five was given a different label.’ – this was discovered by just printing some summary statistics and looking at misclassified instances (a sketch of such checks follows this list)
  • Looking into the raw data behind one of the widely-used sentiment analysis datasets, I found an issue that was overlooked by many other people who used the dataset: http://www.cs.cornell.edu/people/pabo/movie-review-data/ (look for the comment with my name – found four years after the original dataset was published)
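Those kinds of checks need very little code. Here’s a hypothetical pandas sketch that flags single-word articles and identical texts with conflicting labels (the column names are made up):

```python
import pandas as pd

# Hypothetical training frame with one row per article.
train = pd.DataFrame({
    'text': ['68839', '68839', 'a longer article text here', 'a longer article text here'],
    'label': [3, 7, 1, 1],
})

# Articles with a single word (or fewer).
word_counts = train['text'].str.split().str.len()
print(train[word_counts <= 1])

# Identical texts that were given more than one distinct label.
labels_per_text = train.groupby('text')['label'].nunique()
print(labels_per_text[labels_per_text > 1])
```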

I hope this helps.

Thanks a lot. So to summarize, the following could be three approaches:

  1. Using summary statistics such as means/standard deviations/variances on the data, looking out for outliers, etc.
  2. Looking at misclassified instances during validation to find some sort of pattern in them
  3. Looking at label-specific raw data

I apologize for the long overdue response, and thanks for these tips. This will surely be useful in my next Kaggle competition.

I’m starting to dive into Kaggle competitions right now, and I’m having trouble with some of the simple practical considerations. For example, what IDE should I be using - IPython Notebooks? Where should I store the data? My personal computer surely doesn’t have 50GB+ space to spare. How long should I wait for a script to run before I deem it as “broken”?

Any advice here would be greatly appreciated!

Thanks for your comment, Derek. These are all good questions, but the answers really depend on the problem you’re working on and what you’re comfortable with.

Personally, I find IPython Notebooks useful for playing around with the data and for documenting/storing throwaway code. However, once you have code that is more complex or code that you want to rerun, it’s better to save it in separate files. For editing these files, I use PyCharm.

Not all problems require a large hard drive. If you are working on a large dataset, you can either purchase an external drive, or hire an instance from a cloud provider like AWS or DigitalOcean. The latter is generally cheaper than AWS, but they don’t offer GPUs. If you are working with a remote server, you can run IPython Notebook on the server and work from your browser.

Regarding waiting for a script, for many models you can first build a simple version to test that everything works (e.g., build a random forest with just a few trees or train a neural network for a few iterations). If everything works well, you can run the full version. If the run time is long, it’s a good idea to take snapshots of the model and monitor performance on a hold-out set to ensure that you’re not wasting time overfitting.
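For example, a hedged sketch of the smoke-test-then-full-run pattern, with stand-in data and a hold-out set for monitoring:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; in practice this would be the competition's training set.
X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, random_state=0)

# Smoke test: a tiny model first, just to confirm the pipeline runs end to end.
small = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_train, y_train)
print('small model hold-out score:', small.score(X_holdout, y_holdout))

# Only then commit to longer runs, checking the hold-out score as the model grows.
for n_trees in (50, 200, 500):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0, n_jobs=-1)
    model.fit(X_train, y_train)
    print(n_trees, 'trees, hold-out score:', model.score(X_holdout, y_holdout))
```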

Thanks for the quick response! I think I’ll be using Sublime Text on EC2 with S3 in the short term, and possibly move on to Amazon ML with Redshift in the future. I’ll probably take snapshots by outputting results to the console or Matplotlib every once in a while, so that’s great advice as well.