First steps in data science: author-aware sentiment analysis

People often ask me what’s the best way of becoming a data scientist. The way I got there was by first becoming a software engineer and then doing a PhD in what was essentially data science (before it became such a popular term). This post describes my first steps in the field with the goal of helping others who are interested in making the transition from pure software engineering to data science....

May 2, 2015 · Yanir Seroussi

My divestment from fossil fuels

This post covers recent choices I've made to reduce my exposure to fossil fuels, including practical steps that can be taken by Australians and generally applicable lessons. I recently read Naomi Klein’s This Changes Everything, which deeply influenced me. The book describes how the world has been dragging its feet when it comes to reducing carbon emissions, and how we are coming very close to a point where climate change is likely to spin out of control....

April 24, 2015 · Yanir Seroussi

The long road to a lifestyle business

Almost a year ago, I left my last full-time job and decided to set on an independent path that includes data science consulting and work on my own projects. The ultimate goal is not to have to sell my time for money by generating enough passive income to live comfortably. My five main areas of focus are – in no particular order – personal branding & networking, data science contracting, Bandcamp Recommender, Price Dingo, and marine conservation....

March 22, 2015 · Yanir Seroussi

Learning to rank for personalised search (Yandex Search Personalisation – Kaggle Competition Summary – Part 2)

This is the second and last post summarising my team’s solution for the Yandex search personalisation Kaggle competition. See the first post for a summary of the dataset, evaluation approach, and some thoughts about search engine optimisation and privacy. This post discusses the algorithms and features we used. To quickly recap the first post, Yandex released a 16GB dataset of query & click logs. The goal of the competition was to use this data to rerank query results such that the more relevant results appear before less relevant results....

February 11, 2015 · Yanir Seroussi

Is thinking like a search engine possible? (Yandex search personalisation – Kaggle competition summary – part 1)

About a year ago, I participated in the Yandex search personalisation Kaggle competition. I started off as a solo competitor, and then added a few Kaggle newbies to the team as part of a program I was running for the Sydney Data Science Meetup. My team hasn’t done too badly, finishing 9th out of 194 teams. As is usually the case with Kaggle competitions, the most valuable part was the lessons learned from the experience....

January 29, 2015 · Yanir Seroussi

Automating Parse.com bulk data imports

Parse is a great backend-as-a-service (BaaS) product. It removes much of the hassle involved in backend devops with its web hosting service, SDKs for all the major mobile platforms, and a generous free tier. Parse does have its share of flaws, including various reliability issues (which seem to be getting rarer), and limitations on what you can do (which is reasonable price to pay for working within a sandboxed environment). One such limitation is the lack of APIs to perform bulk data imports....

January 15, 2015 · Yanir Seroussi

Stochastic Gradient Boosting: Choosing the Best Number of Iterations

In my summary of the Kaggle bulldozer price forecasting competition, I mentioned that part of my solution was based on stochastic gradient boosting. To reduce runtime, the number of boosting iterations was set by minimising the loss on the out-of-bag (OOB) samples, skipping trees where samples are in-bag. This approach was motivated by a bug in scikit-learn, where the OOB loss estimate was calculated on the in-bag samples, meaning that it always improved (and thus was useless for the purpose of setting the number of iterations)....

December 29, 2014 · Yanir Seroussi

SEO: Mostly about showing up?

In previous posts about getting traction for my Bandcamp recommendations project (BCRecommender), I mentioned search engine optimisation (SEO) as one of the promising traction channels. Unfortunately, early efforts yielded negligible traffic – most new visitors came from referrals from blogs and Twitter. It turns out that the problem was not showing up for the SEO game: most of BCRecommender’s pages were blocked for crawling via robots.txt because I was worried that search engines (=Google) would penalise the website for thin/duplicate content....

December 15, 2014 · Yanir Seroussi

Fitting noise: Forecasting the sale price of bulldozers (Kaggle competition summary)

Messy data, buggy software, but all in all a good learning experience... Early last year, I had some free time on my hands, so I decided to participate in yet another Kaggle competition. Having never done any price forecasting work before, I thought it would be interesting to work on the Blue Book for Bulldozers competition, where the goal was to predict the sale price of auctioned bulldozers. I’ve done alright, finishing 9th out of 476 teams....

November 19, 2014 · Yanir Seroussi

BCRecommender Traction Update

This is the fifth part of a series of posts on my Bandcamp recommendations (BCRecommender) project. Check out previous posts on the general motivation behind this project, the system’s architecture, the recommendation algorithms, and initial traction planning. In a previous post, I discussed my plans to apply the Bullseye framework from the Traction Book to BCRecommender, my Bandcamp recommendations project. In that post, I reviewed the 19 traction channels described in the book, and decided to focus on the three most promising ones: blogger outreach, search engine optimisation (SEO), and content marketing....

November 5, 2014 · Yanir Seroussi