What is data science?

Data science has been a hot term in the past few years. Despite this fact (or perhaps because of it), it still seems like there isn't a single unifying definition of data science. This post discusses my favourite definition.

“Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” Josh Wills (@josh_wills), May 3, 2012

One of my reasons for doing a PhD was wanting to do something more interesting than “vanilla” software engineering....

October 23, 2014 · Yanir Seroussi

Greek Media Monitoring Kaggle competition: My approach

A few months ago I participated in the Kaggle Greek Media Monitoring competition. The goal of the competition was multilabel classification of texts scanned from Greek print media. Despite not having much time due to travelling and other commitments, I managed to finish 6th (out of 120 teams). This post describes my approach to the problem.

Data & evaluation

The data consists of articles scanned from Greek print media in May-September 2013....

October 7, 2014 · Yanir Seroussi

Applying the Traction Book’s Bullseye framework to BCRecommender

This is the fourth part of a series of posts on my Bandcamp recommendations (BCRecommender) project. Check out previous posts on the general motivation behind this project, the system's architecture, and the recommendation algorithms. Having used BCRecommender to find music I like, I’m certain that other Bandcamp fans would like it too. It could probably be extended to attract a wider audience of music lovers, but for now, just getting feedback from Bandcamp fans would be enough....

September 24, 2014 · Yanir Seroussi

Bandcamp recommendation and discovery algorithms

This is the third part of a series of posts on my Bandcamp recommendations (BCRecommender) project. Check out the first part for the general motivation behind this project and the second part for the system architecture. The main goal of the BCRecommender project is to help me find music I like. This post discusses the algorithmic approaches I took towards that goal. I’ve kept the descriptions at a fairly high level, without getting too much into the maths, as all recommendation algorithms essentially try to model simple intuition....

September 19, 2014 · Yanir Seroussi

Building a recommender system on a shoestring budget (or: BCRecommender part 2 – general system layout)

This is the second part of a series of posts on my BCRecommender – personalised Bandcamp recommendations project. Check out the first part for the general motivation behind this project. BCRecommender is a hobby project whose main goal is to help me find music I like on Bandcamp. Its secondary goal is to serve as a testing ground for ideas I have and things I’d like to explore. One question I’ve been wondering about is: how much money does one need to spend on infrastructure for a simple web-based product before it reaches meaningful traffic?...

September 7, 2014 · Yanir Seroussi

Building a Bandcamp recommender system (part 1 – motivation)

I’ve been a Bandcamp user for a few years now. I love the fact that they pay out a significant share of the revenue directly to the artists, unlike other services. In addition, even though fans may stream all the music for free and even easily rip it, almost $80M has been paid out to artists through Bandcamp to date (including almost $3M in the last month) – serving as strong evidence that the traditional music industry’s fight against piracy is a waste of resources and time....

August 30, 2014 · Yanir Seroussi

How to (almost) win Kaggle competitions

Last week, I gave a talk at the Data Science Sydney Meetup group about some of the lessons I learned through almost winning five Kaggle competitions. The core of the talk was ten tips, which I think are worth putting in a post (the original slides are here). Some of these tips were covered in my beginner tips post from a few months ago. Similar advice was also recently published on the Kaggle blog – it’s great to see that my tips are in line with the thoughts of other prolific kagglers....

August 24, 2014 · Yanir Seroussi

Data’s hierarchy of needs

One of my favourite blog posts in recent times is The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps. That post comprehensively describes how abstracting all the data produced by LinkedIn’s various components into a single log pipeline greatly simplified their architecture and enabled advanced data-driven applications. Among the various technical details there are some beautifully-articulated business insights. My favourite one defines data’s hierarchy of needs:...

August 17, 2014 · Yanir Seroussi

Kaggle competition tips and summaries

Over the years, I’ve participated in a few Kaggle competitions and written a bit about my experiences. This page contains pointers to all my posts, and will be updated if/when I participate in more competitions.

General advice posts

10 Steps to Success in Kaggle Data Science Competitions (guest post on KDNuggets)
How to (almost) win Kaggle competitions
Kaggle beginner tips

Solution posts

Greek Media Monitoring Multilabel Classification [6th/120] – multi-label classification of pre-tokenised texts
Personalised Web Search Challenge [9th/194] – reranking web search results in a personalised manner
Blue Book for Bulldozers [9th/476] – forecasting auction sale price of bulldozers
ICFHR 2012 – Arabic Writer Identification Competition [3rd/42] – classifying handwritten texts by the identity of the writer (Kaggle blog post)
EMC Data Science Global Hackathon (Air Quality Prediction) [6th/110] – forecasting levels of air pollutants (Kaggle forum post)

April 5, 2014 · Yanir Seroussi

Kaggle beginner tips

These are a few points from an email I sent to members of the Data Science Sydney Meetup. I suppose other Kaggle beginners may find them useful.

My first steps when working on a new competition are:

Read all the instructions carefully to understand the problem. One important thing to look at is what measure is being optimised. For example, minimising the mean absolute error (MAE) may require a different approach from minimising the mean square error (MSE)....
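As a rough illustration of the MAE versus MSE point, here is a minimal sketch on made-up numbers (not taken from the original email). For a single constant prediction, the median minimises MAE while the mean minimises MSE, so one outlier is enough to pull the two best guesses apart. The `targets` values and the helper names `mae` and `mse` are hypothetical.

```python
import statistics


def mae(values, prediction):
    """Mean absolute error of one constant prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)


def mse(values, prediction):
    """Mean square error of one constant prediction."""
    return sum((v - prediction) ** 2 for v in values) / len(values)


# Hypothetical targets with a single large outlier.
targets = [1.0, 2.0, 3.0, 4.0, 100.0]

mean_guess = statistics.mean(targets)      # dragged towards the outlier
median_guess = statistics.median(targets)  # ignores the outlier's size

# The median is the better constant guess under MAE,
# while the mean is the better constant guess under MSE.
print("MAE:", mae(targets, median_guess), "vs", mae(targets, mean_guess))
print("MSE:", mse(targets, mean_guess), "vs", mse(targets, median_guess))
```

The same idea carries over to real models: a model tuned for squared error tends to chase outliers, which can hurt its score when the competition is judged on absolute error.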

January 19, 2014 · Yanir Seroussi