The hardest parts of data science

Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. This post discusses some examples of these issues and how they can be addressed. The not-so-hard parts Before discussing the hardest parts of data science, it’s worth quickly addressing the two main contenders: model fitting and data collection/cleaning....

November 23, 2015 · Yanir Seroussi

Learning to rank for personalised search (Yandex Search Personalisation – Kaggle Competition Summary – Part 2)

This is the second and last post summarising my team’s solution for the Yandex search personalisation Kaggle competition. See the first post for a summary of the dataset, evaluation approach, and some thoughts about search engine optimisation and privacy. This post discusses the algorithms and features we used. To quickly recap the first post, Yandex released a 16GB dataset of query & click logs. The goal of the competition was to use this data to rerank query results such that the more relevant results appear before less relevant results....

February 11, 2015 · Yanir Seroussi

Is thinking like a search engine possible? (Yandex search personalisation – Kaggle competition summary – part 1)

About a year ago, I participated in the Yandex search personalisation Kaggle competition. I started off as a solo competitor, and then added a few Kaggle newbies to the team as part of a program I was running for the Sydney Data Science Meetup. My team hasn’t done too badly, finishing 9th out of 194 teams. As is usually the case with Kaggle competitions, the most valuable part was the lessons learned from the experience....

January 29, 2015 · Yanir Seroussi

Fitting noise: Forecasting the sale price of bulldozers (Kaggle competition summary)

Messy data, buggy software, but all in all a good learning experience... Early last year, I had some free time on my hands, so I decided to participate in yet another Kaggle competition. Having never done any price forecasting work before, I thought it would be interesting to work on the Blue Book for Bulldozers competition, where the goal was to predict the sale price of auctioned bulldozers. I’ve done alright, finishing 9th out of 476 teams....

November 19, 2014 · Yanir Seroussi

What is data science?

Data science has been a hot term in the past few years. Despite this fact (or perhaps because of it), it still seems like there isn't a single unifying definition of data science. This post discusses my favourite definition. Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. — Josh Wills (@josh_wills) May 3, 2012 One of my reasons for doing a PhD was wanting to do something more interesting than “vanilla” software engineering....

October 23, 2014 · Yanir Seroussi

Greek Media Monitoring Kaggle competition: My approach

A few months ago I participated in the Kaggle Greek Media Monitoring competition. The goal of the competition was doing multilabel classification of texts scanned from Greek print media. Despite not having much time due to travelling and other commitments, I managed to finish 6th (out of 120 teams). This post describes my approach to the problem. Data & evaluation The data consists of articles scanned from Greek print media in May-September 2013....

October 7, 2014 · Yanir Seroussi

How to (almost) win Kaggle competitions

Last week, I gave a talk at the Data Science Sydney Meetup group about some of the lessons I learned through almost winning five Kaggle competitions. The core of the talk was ten tips, which I think are worth putting in a post (the original slides are here). Some of these tips were covered in my beginner tips post from a few months ago. Similar advice was also recently published on the Kaggle blog – it’s great to see that my tips are in line with the thoughts of other prolific kagglers....

August 24, 2014 · Yanir Seroussi

Kaggle competition tips and summaries

Over the years, I’ve participated in a few Kaggle competitions and wrote a bit about my experiences. This page contains pointers to all my posts, and will be updated if/when I participate in more competitions. General advice posts 10 Steps to Success in Kaggle Data Science Competitions (guest post on KDNuggets) How to (almost) win Kaggle competitions Kaggle beginner tips Solution posts Greek Media Monitoring Multilabel Classification [6th/120] – multi-label classification of pre-tokenised texts Personalised Web Search Challenge [9th/194] – reranking web search results in a personalised manner Blue Book for Bulldozers [9th/476] – forecasting auction sale price of bulldozers ICFHR 2012 – Arabic Writer Identification Competition [3rd/42] – classifying handwritten texts by the identity of the writer (Kaggle blog post) EMC Data Science Global Hackathon (Air Quality Prediction) [6th/110] – forecasting levels of air pollutants (Kaggle forum post)

April 5, 2014 · Yanir Seroussi

Kaggle beginner tips

These are few points from an email I sent to members of the Data Science Sydney Meetup. I suppose other Kaggle beginners may find it useful. My first steps when working on a new competition are: Read all the instructions carefully to understand the problem. One important thing to look at is what measure is being optimised. For example, minimising the mean absolute error (MAE) may require a different approach from minimising the mean square error (MSE)....

January 19, 2014 · Yanir Seroussi