Last week, I gave a talk at the Data Science Sydney Meetup group about some of the lessons I learned through almost winning five Kaggle competitions. The core of the talk was ten tips, which I think are worth putting in a post (the original slides are here). Some of these tips were covered in my beginner tips post from a few months ago. Similar advice was also recently published on the Kaggle blog – it’s great to see that my tips are in line with the thoughts of other prolific kagglers.
Tip 1: RTFM
It’s surprising to see how many people miss out on important details, such as remembering the final date to make the first submission. Before jumping into building models, it’s important to understand the competition timeline, be able to reproduce benchmarks, generate the correct submission format, etc.
Tip 2: Know your measure
A key part of doing well in a competition is understanding how the measure works. It’s often easy to obtain significant improvements in your score by using an optimisation approach that is suitable to the measure. A classic example is optimising the mean absolute error (MAE) versus the mean square error (MSE). It’s easy to show that given no other data for a set of numbers, the predictor that minimises the MAE is the median, while the predictor that minimises the MSE is the mean. Indeed, in the EMC Data Science Hackathon we fell back to the median rather than the mean when there wasn’t enough data, and that ended up working pretty well.
Tip 3: Know your data
In Kaggle competitions, overspecialisation (without overfitting) is a good thing. This is unlike academic machine learning papers, where researchers often test their proposed method on many different datasets. This is also unlike more applied work, where you may care about data drifting and whether what you predict actually makes sense. Examples include the Hackathon, where the measures of pollutants in the air were repeated for consecutive hours (i.e., they weren’t really measured); the multi-label Greek article competition, where I found connected components of labels (doesn’t generalise well to other datasets); and the Arabic writers competition, where I used histogram kernels to deal with the features that we were given. The general lesson is that custom solutions win, and that’s why the world needs data scientists (at least until we are replaced by robots).
Tip 4: What before how
It’s important to know what you want to model before figuring out how to model it. It seems like many beginners tend to worry too much about which tool to use (Python or R? Logistic regression or SVMs?), when they should be worrying about understanding the data and what useful patterns they want to capture. For example, when we worked on the Yandex search personalisation competition, we spent a lot of time looking at the data and thinking what makes sense for users to be doing. In that case it was easy to come up with ideas, because we all use search engines. But the main message is that to be effective, you have to become one with the data.
Tip 5: Do local validation
This is a point I covered in my Kaggle beginner tips post. Having a local validation environment allows you to move faster and produce more reliable results than when relying on the leaderboard. The main scenarios when you should skip local validation is when the data is too small (a problem I had in the Arabic writers competition), or when you run out of time (towards the end of the competition).
Tip 6: Make fewer submissions
In addition to making you look good, making few submissions reduces the likelihood of overfitting the leaderboard, which is a real problem. If your local validation is set up well and is consistent with the leaderboard (which you need to test by making one or two submissions), there’s really no need to make many submissions. Further, if you’re doing well, making submissions erodes your competitive advantage by showing your competitors what scores are obtainable and motivating them to work harder. Just resist the urge to submit, unless you have a really good reason.
Tip 7: Do your research
For any given problem, it’s likely that there are people dedicating their lives to its solution. These people (often academics) have probably published papers, benchmarks and code, which you can learn from. Unlike actually winning, which is not only dependent on you, gaining deeper knowledge and understanding is the only sure reward of a competition. This has worked well for me, as I’ve learned something new and applied it successfully in nearly every competition I’ve worked on.
Tip 8: Apply the basics rigorously
While playing with obscure methods can be a lot of fun, it’s often the case that the basics will get you very far. Common algorithms have good implementations in most major languages, so there’s really no reason not to try them. However, note that when you do try any methods, you must do some minimal tuning of the main parameters (e.g., number of trees in a random forest or the regularisation of a linear model). Running a method without minimal tuning is worse than not running it at all, because you may get a false negative – giving up on a method that actually works very well.
An example of applying the basics rigorously is in the classic paper In defense of one-vs-all classification, where the authors showed that the simple one-vs-all (OVA) approach to multiclass classification is at least as good as approaches that are much more sophisticated. In their words: “What we find is that although a wide array of more sophisticated methods for multiclass classification exist, experimental evidence of the superiority of these methods over a simple OVA scheme is either lacking or improperly controlled or measured”. If such a failure to perform proper experiments can happen to serious machine learning researchers, it can definitely happen to the average kaggler. Don’t let it happen to you.
Tip 9: The forum is your friend
It’s very important to subscribe to the forum to receive notifications on issues with the data or the competition. In addition, it’s worth trying to figure out what your competitors are doing. An extreme example is the recent trend of code sharing during the competition (which I don’t really like) – while it’s not a good idea to rely on such code, it’s important to be aware of its existence. Finally, reading the post-competition summaries on the forum is a valuable way of learning from the winners and improving over time.
Tip 10: Ensemble all the things
Not to be confused with ensemble methods (which are also very important), the idea here is to combine models that were developed independently. In high-profile competitions, it is often the case that teams merge and gain a significant boost from combining their models. This is worth doing even when competing alone, because almost no competition is won by a single model.