Data science

You don’t need a data scientist (yet)

The hype around big data has caused many organisations to hire data scientists without giving much thought to what these data scientists are going to do and whether they’re actually needed. This is a source of frustration for all parties involved. This post discusses some questions you should ask yourself before deciding to hire your first data scientist.

Q1: Do you know what data scientists do?

Somewhat surprisingly, there are quite a few companies that hire data scientists without having a clear idea of what data scientists actually do. People seem to have a fear of missing out on the big data hype, and think of hiring data scientists as the solution. A common misconception is that a data scientist’s role includes telling you what to do with your data. While this may sometimes happen in practice, the ideal scenario is where the business has problems that can be solved using data science (more on this under Q3 below). If you don’t know what your data scientist is going to do, you probably don’t need one.

So what do data scientists do? When you think about it, adding the word “data” to “science” is a bit redundant, as all science is based on data. Following this logic, anyone who does any kind of data analysis is a data scientist. While this may be true, such a broad definition is not very helpful. As discussed in a previous post, it’s more useful to define data scientists as individuals who combine expertise in statistics and machine learning with strong software engineering skills.

Q2: Do you have enough data available?

It’s not uncommon to see products that suffer from over-engineering and premature investment in advanced analytics capabilities. In the early stages, it’s important to focus on creating a minimum viable product and getting it to market quickly. Data science starts to shine once the product is generating enough data, as most of the power of advanced analytics is in optimising and automating existing processes.

Not having a data scientist in the early stages doesn’t mean the data is being ignored – it just means that it doesn’t require the attention of a full-time data scientist. If your product is at an early stage and you are still concerned, you’re better off hiring a data science consultant for a few days to help lay out the long-term vision for data-driven capabilities. This would be cheaper and less time-consuming than hiring a full-timer. The exception to this rule is when the product itself is built around advanced analytics (e.g., AlchemyAPI or Enlitic). Building such products without data scientists is far from ideal, or just impossible.

Even if your product is mature and generating a lot of data, it doesn’t mean it’s ready for data science. Advanced analytics capabilities are at the top of data’s hierarchy of needs: If your product is buggy, or if your data is scattered everywhere and your platform lacks centralised reporting, you need to first invest in fixing your data plumbing. This is the job of data engineers. Getting data scientists involved when the data is hardly available due to infrastructure issues is likely to lead to frustration. In addition, setting up centralised reporting and dashboarding is likely to give you ideas for problems that data scientists can solve.

Q3: Do you have a specific problem to solve?

If the problem you’re trying to solve is “everyone is doing smart things with data, we should be doing stuff with data too”, you don’t have a specific problem that can be solved by bringing a data scientist on board. Defining the problem often ends up occupying a lot of the data scientist’s time, so you are likely to obtain better results if you have more than just a vague idea around “doing something with data, because Hadoop”. Ideally, you want to optimise an existing process that is currently solved with heuristics, improve an existing model, implement a new data-driven feature, or something along these lines. Common examples include reducing churn, increasing conversions, and replacing manual processes with automated data-driven systems. Again, getting advice from experienced data scientists before committing to hiring one may be your best first step.

Q4: Can you get away with heuristics, intuition, and/or manual processes?

Some data scientists would passionately claim that you must deploy only models that are theoretically justified and well-tested. However, in many cases you can get away with using simple heuristics, intuition, and/or manual processes. These can be orders of magnitude cheaper than building sophisticated predictive models and the infrastructure to support them. For many businesses, there are more pressing needs than doing everything in a theoretically sound way. Despite what many technical people like to think, customers don’t tend to care how things are implemented, as long as their needs are fulfilled.

For example, I spent some time with a client whose product includes a semi-manual part where structured data is extracted from documents. Their process included sending some of the documents to a trained team in the Philippines for manual analysis. The client was interested in replacing that manual work with a machine learning algorithm. As is often the case with machine learning, it was unknown whether the resultant model would be accurate enough to completely replace the manual workers. This generally depends on data quality and the feasibility of solving the problem. Assessing the feasibility would have taken some time and money, so the client decided to park the idea and focus on other areas of their business.

Every business has resource constraints. Situations where the best investment you can make is hiring a full-time data scientist are rarer than what the hype may make you think. It’s often the case that functions that would be the responsibility of a data scientist are adequately performed by existing employees, such as software engineers, business/data analysts, and marketers.

Q5: Are you committed to being data-driven?

I have seen more than one case where data scientists are hired only to be blocked or ignored. This is more prevalent in the corporate world, where managers are often incentivised to prioritise doing things that look good over things that make financial sense. But even if recruitment is done with the best intentions, progress may be blocked by employees who feel threatened because they would be replaced by automated data-driven algorithms. Successful data science projects require support from senior leadership, as discussed by Greta Roberts, Radim Řehůřek, Alec Smith, and many others. Without such support and a strong commitment to making data-driven decisions, everyone is just wasting their time.

Closing thoughts

While data science is currently over-hyped, many organisations still have much to gain from hiring data scientists. I hope that this post has helped you decide whether you need a data scientist right now. If you’re unsure, please don’t hesitate to contact me. And to any data scientists reading this: Be very wary of potential employers who do not have good answers to the above questions. At this point in time you can afford to be picky, at least until the hype is over.

Learning about deep learning through album cover classification

In the past month, I’ve spent some time on my album cover classification project. The goal of this project is for me to learn about deep learning by working on an actual problem. This post covers my progress so far, highlighting lessons that would be useful to others who are getting started with deep learning.

Initial steps summary

The following points were discussed in detail in the previous post on this project.

  • The problem I chose to work on is classifying Bandcamp album covers by genre, using a balanced dataset of 10,000 images from 10 different genres.
  • The experimental code is based on Lasagne, and is available on GitHub.
  • Having set up the environment for running experiments on a GPU, the plan was to get Lasagne’s examples working on my dataset, and then iteratively read tutorials/papers/books, implement ideas, play with parameters, and visualise parts of the network until I’m satisfied with the results.

Preliminary experiments and learning resources

I hit several issues when adapting Lasagne’s example code to my dataset. The key issue is that the example code is based on the MNIST digits dataset. That dataset’s images are 28×28 grayscale, and my dataset’s images are 350×350 RGB. This difference led to the training loss quickly diverging when running the example code without any changes. It turns out that simply lowering the learning rate resolves this issue, though the initial results I got were still not much better than random. In general, it appears that everything works on the MNIST digits dataset, so choosing to work on my own dataset made things more challenging (which is a good thing).
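For concreteness, the fix amounts to passing a much smaller learning rate to the update rule. Here is a minimal Lasagne sketch, with a toy stand-in network and an illustrative learning rate value rather than the one I settled on:

import lasagne
import theano
import theano.tensor as T

input_var = T.tensor4('inputs')
target_var = T.ivector('targets')

# A toy stand-in for the real network: input -> dense softmax over 10 genres.
network = lasagne.layers.InputLayer((None, 3, 350, 350), input_var=input_var)
network = lasagne.layers.DenseLayer(network, num_units=10,
                                    nonlinearity=lasagne.nonlinearities.softmax)

prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()
params = lasagne.layers.get_all_params(network, trainable=True)

# The example's original learning rate made the loss diverge on 350x350 RGB
# inputs; dropping it by an order of magnitude or two got training to converge.
updates = lasagne.updates.nesterov_momentum(loss, params,
                                            learning_rate=0.001, momentum=0.9)
train_fn = theano.function([input_var, target_var], loss, updates=updates)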

The main learning resource I used is the excellent notes for the Stanford course Convolutional Neural Networks for Visual Recognition. The notes are very clear, contain up-to-date information from recent publications, and include many practical tips for successful training of convolutional networks (convnets). In addition, I read some other tutorials and a few papers. These are summarised in a separate page.

The first step after getting the MNIST examples working on my dataset was to extend the code to enable more flexible architectures. My main focus was on vanilla convnets, i.e., networks with several convolutional layers, where each convolutional layer is optionally followed by a max-pooling layer, and the convolutional layers are followed by multiple dense/fully-connected layers and dropout layers. To allow for easy experimentation, the specification of the network can be done from the command line. For example, to train an AlexNet architecture:

$ python manage.py run_experiment --dataset-path /path/to/dataset --model-architecture ConvNet --model-params num_conv_layers=5:num_dense_layers=2:lc0_num_filters=48:lc0_filter_size=11:lc0_stride=4:lc0_mp=True:lm0_pool_size=3:lm0_stride=2:lc1_num_filters=128:lc1_filter_size=5:lc1_mp=True:lm1_pool_size=3:lm1_stride=2:lc2_num_filters=192:lc2_filter_size=3:lc3_num_filters=192:lc3_filter_size=3:lc4_num_filters=128:lc4_filter_size=3:lc4_mp=True:lm4_pool_size=3:lm4_stride=2:ld0_num_units=2048:ld1_num_units=2048

This can obviously be a bit of a mouthful, so common architectures are also defined in the code with parameters that can be overridden. For instance, to train an AlexNet with 64 filters in the first layer instead of 48:

$ python manage.py run_experiment --dataset-path /path/to/dataset --model-architecture AlexNet --model-params lc0_num_filters=64

There are many more command line flags (possibly too many), which make it easy to both tinker with various settings, and also run more rigorous experiments. My initial tinkering with convnets didn’t yield impressive results in terms of predictive accuracy on my dataset. It turned out that this was partly due to the lack of preprocessing – the less exciting but crucial part of any predictive modelling work.

The importance of preprocessing

My initial focus was on getting things to work on the dataset without worrying too much about preprocessing. I hadn’t done any image classification work before, so I had to learn about the right type of preprocessing to use. I kept it pretty simple and applied the following transformations (a rough code sketch follows the list):

  • Downsampling: all images were scaled down to 256×256. I played briefly with other sizes, but decided on this size to make it easy to use models pretrained on ImageNet.
  • Cropping & mirroring: at training time, each image was cropped to random 224×224 slices, while deterministic slices were used at test time. In addition, each crop was mirrored horizontally. In most cases I used ten crops overall. Again, these numbers were chosen for comparability with ImageNet-trained models.
  • Mean subtraction: the training mean of each pixel was subtracted from each instance.
  • Shuffling: probably the most important preprocessing step. Initially I had the instances sorted by their class, as an artifact of the way the dataset was constructed. Due to the relatively small number of instances the network sees in each batch, this meant that in each epoch, the network first fitted on all the instances from class 1, then all the instances from class 2, etc. This led to very poor performance, which was fixed by shuffling the data once at the start of the training procedure (shuffling every epoch could potentially make things even better).
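Here is a rough numpy sketch of these transformations (function and array names are hypothetical; the real code also handles caching, multiple crops per image, and the differences between the training and test pipelines):

import numpy as np

rng = np.random.RandomState(0)

def preprocess(images, labels, mean_image=None, training=True):
    """images: float32 array of shape (N, 3, 256, 256), already downsampled."""
    # Shuffling: break up the class-sorted order so each batch mixes genres.
    order = rng.permutation(len(images))
    images, labels = images[order], labels[order]

    # Mean subtraction: the training mean is reused when preprocessing test data.
    if mean_image is None:
        mean_image = images.mean(axis=0)
    images = images - mean_image

    # Cropping & mirroring: random 224x224 crops (with horizontal flips) at
    # training time, and a deterministic centre crop at test time.
    crops = np.empty((len(images), 3, 224, 224), dtype=images.dtype)
    for i, img in enumerate(images):
        if training:
            top, left = rng.randint(0, 33), rng.randint(0, 33)
        else:
            top = left = 16
        crop = img[:, top:top + 224, left:left + 224]
        if training and rng.rand() < 0.5:
            crop = crop[:, :, ::-1]  # horizontal mirror
        crops[i] = crop
    return crops, labels, mean_image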

Baselines

After building the experimental environment and tinkering for a while, I decided it was time for some more serious experiments. The results of my initial attempts were rather disappointing – only slightly better than a random baseline, which yields an accuracy score of 10%. Therefore, I ran some baselines to get an idea of what’s possible on this dataset.

The first baseline I tried was a random forest with 1,000 trees, which yielded 15.25% accuracy. This baseline was trained directly on the pixel values without any preprocessing other than downsampling. It’s worth noting that the downsampling size didn’t make much of a difference to this baseline (I tried a few values in the range 50×50 to 350×350). This baseline was also not particularly sensitive to whether RGB or grayscale values were used to represent the images.
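For reference, this baseline is just a few lines with scikit-learn (a sketch; X_train/X_test are assumed to hold the downsampled pixel arrays and y_train/y_test the genre labels):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def pixel_baseline(X_train, y_train, X_test, y_test):
    """X_*: arrays of shape (num_images, channels, height, width) of raw pixels."""
    # Flatten each image into a single row of pixel values.
    X_train = X_train.reshape(len(X_train), -1)
    X_test = X_test.reshape(len(X_test), -1)
    forest = RandomForestClassifier(n_estimators=1000, n_jobs=-1, random_state=0)
    forest.fit(X_train, y_train)
    return accuracy_score(y_test, forest.predict(X_test))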

The next experiments were with baselines that utilised pretrained Caffe models. Training a random forest with 1,000 trees on features extracted from the highest fully-connected layer (fc7) in the CaffeNet and VGGNet-19 models yielded accuracies of 16.72% and 16.40% respectively. This was pretty disappointing, as I expected these features to perform much better. The reason may be that album covers are very different from ImageNet images, and the representations in fc7 are too specific to ImageNet. Indeed, when fine-tuning the CaffeNet model (following the procedure outlined here), I got the best accuracy on the dataset: 22.60%. Using Caffe to train the same network from scratch didn’t even get close to this accuracy. However, I didn’t try to tune Caffe’s learning parameters. Instead, I went back to running experiments with my code.

It’s worth noting that the classes identified by the CaffeNet model often have little to do with the actual content of the image. Better baseline results may be obtained by using models that were pretrained on a richer dataset than ImageNet. The following table presents three example covers together with the top-five classes identified by the CaffeNet model for each image. The tags assigned by Clarifai’s API are also presented for comparison. From this example, it looks like Clarifai’s model is more successful at identifying the correct elements than the CaffeNet model, indicating that a baseline that uses the Clarifai tags may yield competitive performance.

October by Wille P (hiphop_rap)
CaffeNet: digital clock, spotlight, jack-o’-lantern, volcano, traffic light
Clarifai: tree, landscape, sunset, desert, sun, sunrise, nature, evening, sky, travel

Demo by Blackrat (metal)
CaffeNet: spider web, barn spider, chain, bubble, fountain
Clarifai: skull, bone, nobody, death, vector, help, horror, medicine, black and white, tattoo

The Kool-Aid Album by Mr. Merge (soul)
CaffeNet: dishrag, paper towel, honeycomb, envelope, chain mail
Clarifai: symbol, nobody, sign, illustration, color, flag, text, stripes, business, character

Training from scratch

My initial experiments were with various convnet architectures, where I manually varied the filter sizes and number of layers to have a reasonable number of parameters and ensure that the model is trainable on a GPU with 4GB of memory. As mentioned, this approach yielded unimpressive results. Following the relative success of the fine-tuned CaffeNet baseline, I decided to run more rigorous experiments on variants of AlexNet (which is very similar to CaffeNet).

Given the large number of hyperparameters that need to be set when training deep convnets, I realised that setting values manually or via grid search is unlikely to yield the best results. To address this, I used hyperopt to search for the best configuration of values. The hyperparameters that were included in the search were the learning method (Nesterov momentum versus Adam with their respective parameters), the learning rate, whether crops are mirrored or not, the number of crops to use (1 or 5), dropout probabilities, the number of hidden units in the fully-connected layers, and the number of filters in each convolutional layer.
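A minimal sketch of what such a search looks like with hyperopt is shown below. The space is simplified, and train_and_evaluate is a hypothetical stand-in for my experiment code, which returns the validation error after a fixed number of epochs:

from hyperopt import Trials, fmin, hp, tpe

space = {
    'update': hp.choice('update', [
        {'method': 'nesterov_momentum',
         'learning_rate': hp.loguniform('nm_learning_rate', -10, -2),
         'momentum': hp.uniform('nm_momentum', 0.5, 0.99)},
        {'method': 'adam',
         'learning_rate': hp.loguniform('adam_learning_rate', -10, -2)},
    ]),
    'mirror_crops': hp.choice('mirror_crops', [False, True]),
    'num_crops': hp.choice('num_crops', [1, 5]),
    'dropout_prob': hp.uniform('dropout_prob', 0.0, 0.7),
    'num_dense_units': hp.choice('num_dense_units', [512, 1024, 2048]),
}

trials = Trials()
best = fmin(
    fn=lambda params: train_and_evaluate(params, num_epochs=10),  # hypothetical: returns validation error
    space=space,
    algo=tpe.suggest,
    max_evals=100,
    trials=trials,
)
print(best)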

Each configuration suggested by hyperopt was trained for 10 epochs, and the promising setups were trained until results stopped improving. The results of the search were rather disappointing, with the best accuracy being 17.19%. However, I learned a lot by finding hyperparameters in this manner – in the past I’ve only used a combination of manual settings with grid search.

There are many possible reasons for why the results are so poor. It could be that there’s just too little data to train a good classifier, which is supported by the inability to beat the fine-tuned results. This is in line with the results obtained by Zeiler and Fergus (2013), who found that convnets pretrained on ImageNet performed much better on the Caltech-101 and Caltech-256 datasets than the same networks trained from scratch. However, it could also be that I just didn’t run enough experiments – I definitely feel like I haven’t explored everything as well as I’d like. In addition, I’m still building my intuition for what works and why. I should work more on visualising the way the network learns to uncover more hidden gotchas in addition to those I’ve already found. Finally, it could be that it’s just too hard to distinguish between covers from the genres I chose for the study.

Ideas for future work

There are many avenues for improving on the work I’ve done so far. The code could definitely be made more robust and better tested, optimised and parallelised. It would be worth investing more in hyperparameter and architecture search, including incorporation of ideas from non-vanilla convnets (e.g., GoogLeNet). This search should be guided by visualisation and a deeper understanding of the trained networks, which may also come from analysing class-level accuracy (certain genres seem to be easier to distinguish than others). In addition, more sophisticated preprocessing may yield improved results.

If the goal were to get the best possible performance on my dataset, I’d invest in establishing the human performance baseline on the dataset by running some tests with Mechanical Turk. My guess is that humans would perform better than the algorithms tested so far due to access to external knowledge. Therefore, incorporating external knowledge in the form of manual features or additional data sources may yield the most substantial performance boosts. For example, text on an album cover may contain important clues about its genre, and models pretrained on style datasets may be more suitable than ImageNet models. In addition, it may be beneficial to use a model to detect multiple elements in images where the universe is not restricted to ImageNet classes. This approach was taken by Alexandre Passant, who used Clarifai’s API to tag and classify doom metal and K-pop album covers. Finally, using several different models in an ensemble is likely to help squeeze a bit more accuracy out of the dataset.

Another direction that may be worth exploring is using image data for recommendation work. The reason I chose to work on this problem was my exposure to album covers through my work on Bandcamp Recommender – a music recommendation system. It is well-known that visual elements influence the way users interact with recommender systems. This is especially true in Bandcamp Recommender’s case, as users see the album covers before they choose to play them. This leads me to conjecture that considering features that describe the album covers when generating recommendations would increase user interaction with the system. However, it’s hard to tell whether it’d increase the overall relevance of the results. You can’t judge an album by its cover. Or can you…?

Conclusion

While I’ve learned a lot from working on this project, there’s still much more to discover. It was especially great to learn some generally-applicable lessons about hyperparameter optimisation and improvements to vanilla gradient descent. Despite the many potential ways of improving performance on my dataset, my next steps in the field would probably include working on problems for which obtaining a good solution is feasible and useful. For example, I have some ideas for applications to marine creature identification.

Feedback and suggestions are always welcome. Please feel free to contact me privately or via the comments section.

Acknowledgement: Thanks to Brian Basham and Diogo Moitinho de Almeida for useful tips and discussions.

Hopping on the deep learning bandwagon

I’ve been meaning to get into deep learning for the last few years. Now the stars have finally aligned: I have the time and motivation to work on a small project that will hopefully improve my understanding of the field. This is the first in a series of posts that will document my progress on this project.

As mentioned in a previous post on getting started as a data scientist, I believe that the best way of becoming proficient at solving data science problems is by getting your hands dirty. Despite being familiar with high-level terminology and having some understanding of how it all works, I don’t have any practical experience applying deep learning. The purpose of this project is to fix this experience gap by working on a real problem.

The problem: Inferring genre from album covers

Deep learning has been very successful at image classification. Therefore, it makes sense to work on an image classification problem for this project. Rather than using an existing dataset, I decided to make things a bit more interesting by building my own dataset. Over the last year, I’ve been running BCRecommender – a recommendation system for Bandcamp music. I’ve noticed that album covers vary by genre, though it’s hard to quantify exactly how they vary. So the question I’ll be trying to answer with this project is how accurately can genre be inferred from Bandcamp album covers?

As the goal of this project is to learn about deep learning rather than make a novel contribution, I didn’t do a comprehensive search to see whether this problem has been addressed before. However, I did find a recent post by Alexandre Passant that describes his use of Clarifai’s API to tag the content of Spotify album covers (identifying elements such as men, night, dark, etc.), and then using these tags to infer the album’s genre. Another related project is Karayev et al.’s Recognizing image style paper, in which the authors classified datasets of images from Flickr and Wikipedia by style and art genre, respectively. In all these cases, the results are pretty good, supporting my intuition that the genre inference task is feasible.

Data collection & splits

As I’ve already been crawling Bandcamp data for BCRecommender, creating the dataset was relatively straightforward. Currently, I have data on about 1.8 million tracks and albums. Bandcamp artists assign multiple tags to each release. To create the dataset, I selected 10 of the top tags: ambient, dubstep, folk, hiphop_rap, jazz, metal, pop, punk, rock, and soul. Then, I randomly selected 10,000 album covers that have exactly one of those tags, with 1,000 albums for each tag/genre. Each cover image size is 350×350. The following image shows a sample of the dataset.

Bandcamp album covers by genre

It is apparent that some genres can be inferred more easily than others, especially when browsing through the full dataset. For example, metal albums tend to be pretty distinct. I doubt that predictive accuracy would be very high, but I think that it can definitely be much better than the random baseline of 10%.
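Going back to the dataset construction, the selection logic boils down to something like this (a sketch; the albums argument and its 'tags' field are hypothetical stand-ins for my crawled Bandcamp data):

import random

GENRES = ['ambient', 'dubstep', 'folk', 'hiphop_rap', 'jazz',
          'metal', 'pop', 'punk', 'rock', 'soul']
ALBUMS_PER_GENRE = 1000

def select_albums(albums):
    """albums: iterable of dicts with a 'tags' field (a set of tag strings)."""
    by_genre = {genre: [] for genre in GENRES}
    for album in albums:
        matching = set(album['tags']) & set(GENRES)
        # Keep only albums that have exactly one of the selected genre tags.
        if len(matching) == 1:
            by_genre[matching.pop()].append(album)
    # Randomly sample a balanced set of 1,000 albums per genre.
    return {genre: random.sample(candidates, ALBUMS_PER_GENRE)
            for genre, candidates in by_genre.items()}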

For training, validation and testing, I decided to use a static stratified 80%/10%/10% split of the dataset. It quickly became apparent that the full dataset is too big for development purposes, making it hard to quickly test the code on my local machine. To address this, I created a local development dataset, using an 80%/10%/10% split of 1,000 images from the full training subset.
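With a recent scikit-learn, generating such splits is straightforward (a sketch; the real code also persists the splits so they remain static across experiments):

from sklearn.model_selection import train_test_split

def stratified_split(X, y, seed=0):
    """Return a stratified 80%/10%/10% train/validation/test split."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)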

The code for downloading the dataset and creating the splits is available from the project repository on GitHub. This repository will include all the code for the project as it evolves. I will try to keep it well-documented enough to be useful for others, though it assumes some familiarity with Python. If you experience any issues running the code or find any bugs, please let me know.

Getting started

One of the things that has stopped me from playing with deep learning in the past is the feeling that there is a bit of a steep learning curve around the tools and methods. A lot of the deep learning libraries out there don’t seem as mature as general machine learning libraries, such as scikit-learn. There are also many more parameters to play with when building deep neural networks than when using linear models or algorithms such as random forests. Further, to enable any kind of meaningful experimentation, using a GPU is essential.

Fortunately, the tools and documentation have matured a lot in recent years. Motivated by Daniel Nouri’s excellent tutorial on detecting facial keypoints with convolutional neural nets, I decided to use the Lasagne package as my starting point. My plan was simple: convert the MNIST example code to work on my dataset locally, set up an AWS machine with a GPU for full-scale experiments, and then play with various network architectures and techniques to improve accuracy and gain a deeper understanding of deep learning.

Initial environment setup

While Lasagne’s MNIST example code is pretty clear – especially once you get your head around the way Theano works – it doesn’t really lend itself to easy experimentation. I addressed this by refactoring the code in several iterations, until I got to the current state, where there’s a simple command-line interface that allows me to experiment with different datasets and architectures. This will probably change and become more complex as I start doing more sophisticated things.

To enable rapid experimentation, I had to set up an AWS machine with a GPU (a g2.2xlarge instance). I wrote some simple deployment code using Fabric, which allows me to set up a machine from scratch, install all the requirements, package the project, and copy it to the remote machine.
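The deployment code amounts to a handful of Fabric tasks along these lines (a simplified sketch with made-up host and file names):

# fabfile.py -- simplified sketch (Fabric 1.x style)
from fabric.api import cd, env, local, put, run, sudo, task

env.hosts = ['ubuntu@my-gpu-instance']  # hypothetical host

@task
def setup():
    """Install system and Python requirements on a fresh machine."""
    sudo('apt-get update && apt-get install -y python-dev python-pip')
    put('requirements.txt', '/tmp/requirements.txt')
    sudo('pip install -r /tmp/requirements.txt')

@task
def deploy():
    """Package the project locally and copy it to the remote machine."""
    local('tar czf project.tar.gz --exclude=.git .')
    put('project.tar.gz', '/home/ubuntu/project.tar.gz')
    with cd('/home/ubuntu'):
        run('tar xzf project.tar.gz')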

Getting the code running on the CPU was trivial, but I hit several issues when running on the GPU. First, the vanilla Ubuntu 14.04 server I used didn’t come with CUDA installed. After trying and failing to get it working by following some tutorials, I ended up going down the easier path of using the AMI supplied by Caffe. This AMI also has the advantage of coming with Caffe installed (unsurprisingly), which I may end up using at some point.

The second issue I encountered was that running my enhanced version of Lasagne’s example code on the full dataset was impossible on the GPU due to memory constraints. The problem was that the example assumes the entire dataset can fit in the GPU’s memory (as discussed here and here). This took a while to resolve, even though the solution is conceptually simple – just copy the dataset to the GPU in chunks rather than attempt to copy it all in one go. Resolving this issue was a good way of getting a better understanding of what the code does, since I ended up rewriting most of the original example code.
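Conceptually, the chunked version looks something like this (a sketch; train_fn stands in for the compiled Theano training function, which slices its minibatches out of the shared variables):

import numpy as np
import theano

CHUNK_SIZE = 4096   # number of instances that comfortably fit in GPU memory
BATCH_SIZE = 128

# Shared variables live on the GPU; only one chunk is resident at a time.
X_shared = theano.shared(np.zeros((CHUNK_SIZE, 3, 224, 224),
                                  dtype=theano.config.floatX))
y_shared = theano.shared(np.zeros(CHUNK_SIZE, dtype=np.int32))

def train_epoch(X_train, y_train, train_fn):
    """Run one epoch, copying the dataset to the GPU chunk by chunk."""
    losses = []
    for start in range(0, len(X_train), CHUNK_SIZE):
        X_chunk = X_train[start:start + CHUNK_SIZE]
        y_chunk = y_train[start:start + CHUNK_SIZE]
        X_shared.set_value(X_chunk)
        y_shared.set_value(y_chunk)
        for batch_index in range(int(np.ceil(len(X_chunk) / float(BATCH_SIZE)))):
            # train_fn indexes into the shared variables by batch number.
            losses.append(train_fn(batch_index))
    return np.mean(losses)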

Next steps

So far, I left the network architecture from the original example mostly untouched, as I was busy collecting the dataset, getting the environment set up, and resolving various issues. One thing I did notice was that the example’s architecture diverges on my dataset, so instead I tested my code using a basic multi-layer perceptron architecture with a single hidden layer. This performs about as well as a random classifier on my dataset, but at least it converges. I also tested the modified code on the MNIST dataset and the results are decent, so now it is time to move forward and actually do some modelling, starting with convolutional neural nets.

The high level plan is to iteratively read tutorials/papers/books, implement ideas, play with parameters, and visualise parts of the network until I’m satisfied with the results. The main goal remains to learn as much as possible and get a good intuition of how things work. I’ll write more about my experiences in subsequent posts. Stay tuned!

Update: The second post in the series is now available.

First steps in data science: author-aware sentiment analysis

People often ask me what’s the best way of becoming a data scientist. The way I got there was by first becoming a software engineer and then doing a PhD in what was essentially data science (before it became such a popular term). This post describes my first steps in the field with the goal of helping others who are interested in making the transition from pure software engineering to data science.

While my first steps were in a PhD program, I don’t think that going through the formal PhD process is necessary if you wish to become a data scientist. Self-motivated individuals can get very far by making use of the abundance of learning resources available online. In fact, one can make progress much faster than in a PhD, because PhD programs have many overheads.

This post is organised as a list of steps. Despite the sequential numbering, many steps can be done in parallel. These steps roughly recount the work I’ve done to publish my first paper, which was co-authored by Ingrid Zukerman and Fabian Bohnert. Most of the technical details are intentionally omitted. Readers who are interested in learning more are invited to read the original paper or chapter 6 in my thesis, which includes more thorough experiments and explanations.

Step one: Find a problem to work on

Even if you know nothing about the machine learning and statistics side of data science, it’s important to find a problem to work on. Ideally it’d be something you find personally interesting, as this helps with motivation. You could use a predefined problem such as a Kaggle competition or one of the UCI datasets. Alternatively, you could collect the data yourself to make things a bit more challenging.

In my case, I was interested in natural language processing and user modelling. My supervisor was given a grant to work on sentiment analysis of opinion polls, which was my first direction of research. This quickly changed to focus on the connection between authors and the way they express their sentiments, with the application of harnessing this connection to improve the accuracy of sentiment analysis algorithms. For the purpose of this research, I collected a dataset of texts by the most prolific IMDb users. The problem was to infer the ratings these users assigned to their own reviews, with the hypothesis that methods that take author identity into account would outperform methods that ignore authorship information.

Step two: Close your knowledge gaps

Whatever problem you choose, you will have some knowledge gaps that require filling. Wikipedia, textbooks, and online courses will be your best guide for foundational areas like machine learning and statistics. Reading academic papers is often required to get a better understanding of recent work on the specific problem you’re trying to solve.

Doing a PhD afforded me the luxury of spending about a month just reading papers. Most of the ~200 papers I read were on sentiment analysis, which gave me a good overview of what’s been done in the field. However, the best thing I’ve done was to stop reading and move on to working on the problem. This is also the best advice I can give: there’s no better way to learn than getting your hands dirty working on a problem.

Step three: Get your hands dirty

With a well-defined problem and the knowledge gaps more-or-less closed, it is time to come up with a plan and implement it. Due to my background in software engineering and some exposure to early collaborative filtering approaches to recommender systems, my plan was very much a part of what Leo Breiman called the algorithmic modelling culture. That is, I was more focused on developing algorithms that work than on modelling the process that generated the data. This approach is arguably more in line with the mindset that software engineers tend to have than with the approach of mathematicians and statisticians.

The plan was quite simple:

  • Reproduce results that showed that rating inference models trained on enough texts by the target author (i.e., the author who wrote the text whose rating we want to predict) outperform models trained on texts by multiple authors
  • Use an approach inspired by collaborative filtering to combine multiple single-author models to infer ratings for texts by the target author, where those models are weighted by similarity to the target author
  • Experiment with multiple similarity measurements under various constraints on the number of texts available by the training and target authors
  • Iterate on these ideas until the results are publishable

The rationale behind this plan was that while different people express their sentiments differently, similar people would express their sentiments similarly (e.g., use of understatements varies by culture). The key motivation was Pang and Lee’s finding that a model trained on a single author is best if we have enough texts by this author.
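In pseudocode-like Python, the combination step looks roughly like this (a sketch; the single-author models and the similarity function are placeholders for what’s described in the paper):

import numpy as np

def predict_rating(text, target_author, author_models, similarity):
    """Combine single-author models, weighted by similarity to the target author.

    author_models: dict mapping author -> model with a .predict(text) method
    similarity: function (author_a, author_b) -> non-negative weight
    """
    weights, predictions = [], []
    for author, model in author_models.items():
        weights.append(similarity(target_author, author))
        predictions.append(model.predict(text))
    weights = np.asarray(weights, dtype=float)
    if weights.sum() == 0:
        return float(np.mean(predictions))  # fall back to an unweighted average
    return float(np.average(predictions, weights=weights))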

The way I implemented the plan was vastly different from how I’d do it today. This was 2009, and using Java with the Weka package for the core modelling seemed like a huge improvement over the C/C++ I was used to. I relied heavily on the university grid to run experiments and wrote a bunch of code to handle experimental logic, including some Perl scripts for post-processing. It ended up being pretty messy, but it worked and I got publishable results. If I were to do the same work today, I’d use Python for everything. IPython Notebook is a great way of keeping track of experimental work, and Python packages like pandas, scikit-learn, gensim, TextBlob, etc. are mature and easy to use for data science applications.

Step four: Publish your results

Having a deadline for publishing results can be stressful, but it has two positive outcomes. First, making your work public allows you to obtain valuable feedback. Second, hard deadlines are great in making you work towards a tangible goal. You can always keep iterating to get infinitesimal improvements, but publication deadlines force you to decide that you’ve done enough.

In my case, the deadline for the UMAP 2010 conference and the promise of a free trip to Hawaii served as excellent motivators. But even if you don’t have the time or energy to get an academic paper published, you should set yourself a deadline to publish something on a blog or a forum, or even as a report to a mentor who can assess your work. Receiving continuous feedback is a key factor in improvement, so release early and release often.

Step five: Improve results or move on

Congratulations! You have published the results of your study. What now? You can either keep working on the same problem – try more approaches, add more data, change the constraints, etc. Or you can move on to work on other problems that interest you.

In my case, I had to go back to iterate on the results of the first paper because of things I learned later. I ended up rerunning all the experiments to make things fit together into a more-or-less coherent story for the thesis (writing a thesis is one of the main overheads that comes with doing a PhD). If I had a choice, I wouldn’t have done that. I would instead have pursued more sensible enhancements to the work presented in the paper, such as using the author as a feature, employing more robust ensemble methods, and testing different base methods than support vector machines. Nonetheless, I still think that the core idea – that the identity of authors should be taken into account in sentiment analysis – is still relevant and viable today. But I’ve taken my own advice and moved on.

Stochastic Gradient Boosting: Choosing the Best Number of Iterations

In my summary of the Kaggle bulldozer price forecasting competition, I mentioned that part of my solution was based on stochastic gradient boosting. To reduce runtime, the number of boosting iterations was set by minimising the loss on the out-of-bag (OOB) samples, skipping trees where samples are in-bag. This approach was motivated by a bug in scikit-learn, where the OOB loss estimate was calculated on the in-bag samples, meaning that it always improved (and thus was useless for the purpose of setting the number of iterations).

The bug in scikit-learn was fixed by porting the solution used in R’s GBM package, where the number of iterations is estimated by maximising the cumulative improvement on the OOB samples across boosting iterations. This approach is known to underestimate the number of required iterations, which means that it’s not very useful in practice. This underestimation may be due to the fact that the GBM estimate is partly based on in-bag samples, as the OOB samples for the Nth iteration are likely to have been in-bag in previous iterations.
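In scikit-learn terms, the fixed GBM-style estimate (SKO) boils down to something like this (a sketch; oob_improvement_ is only populated when subsample is less than one):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def sko_num_iterations(X_train, y_train, max_iterations=1000):
    """GBM-style estimate: the iteration that maximises cumulative OOB improvement."""
    model = GradientBoostingRegressor(n_estimators=max_iterations,
                                      subsample=0.5, random_state=0)
    model.fit(X_train, y_train)
    return int(np.argmax(np.cumsum(model.oob_improvement_))) + 1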

I was curious about how my approach compares to the GBM method. Preliminary results on the toy dataset from scikit-learn’s documentation looked promising:

Gradient Boosting out of bag experiment -- toy dataset

My approach (TSO) beat both 5-fold cross-validation (CV) and the GBM/scikit-learn method (SKO), as TSO obtains its minimum at the closest number of iterations to the test set’s (T) optimal value.

The next step in testing TSO’s viability was to rerun Ridgeway’s experiments from Section 3.3 of the GBM documentation (R code here). I used the same 12 UCI datasets that Ridgeway used, running 5×2 cross-validation on each one. For each dataset, the score was obtained by dividing the mean loss of the best method on the dataset by the loss of each method. Hence, all scores are between 0.0 and 1.0, with the best score being 1.0. The following figure summarises the results on the 12 datasets.

The following table shows the raw data that was used to produce the figure.

Dataset CV SKO TSO
creditrating 0.9962 0.9771 1.0000
breastcancer 1.0000 0.6675 0.4869
mushrooms 0.9588 0.9963 1.0000
abalone 1.0000 0.9754 0.9963
ionosphere 0.9919 1.0000 0.8129
diabetes 1.0000 0.9869 0.9985
autoprices 1.0000 0.9565 0.5839
autompg 1.0000 0.8753 0.9948
bostonhousing 1.0000 0.8299 0.5412
haberman 1.0000 0.9793 0.9266
cpuperformance 0.9934 0.9160 1.0000
adult 1.0000 0.9824 0.9991
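As a reminder of how the numbers in the table were computed, the per-dataset scores are just a ratio of losses (a small sketch; losses maps each method to its mean loss on a dataset):

def relative_scores(losses):
    """losses: dict mapping method name -> mean loss on a dataset (lower is better)."""
    best = min(losses.values())
    # The best method scores 1.0; every other method scores best_loss / its_loss.
    return {method: best / loss for method, loss in losses.items()}

# Example: relative_scores({'CV': 0.50, 'SKO': 0.51, 'TSO': 0.62})
# -> {'CV': 1.0, 'SKO': 0.98, 'TSO': 0.81} (approximately)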

The main finding is that CV remains the most reliable approach. Even when CV is not the best-performing method, it’s not much worse than the best method (this is in line with Ridgeway’s findings). TSO yielded the best results on 3 of the 12 datasets, and beat SKO 7 out of 12 times. However, TSO’s results are the most variable of the three methods: when it fails, it often yields very poor results.

In conclusion, stick to cross-validation for the best results. It’s more computationally intensive than SKO and TSO, but can be parallelised. I still think that there may be a way to avoid cross-validation, perhaps by extending SKO/TSO in more intelligent ways (see some interesting ideas by Eugene Dubossarsky here and here). Any comments/ideas are very welcome.

What is data science?

Data science has been a hot term in the past few years. Despite this fact (or perhaps because of it), it still seems like there isn’t a single unifying definition of data science. This post discusses my favourite definition.

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

— Josh Wills (@josh_wills) May 3, 2012

One of my reasons for doing a PhD was wanting to do something more interesting than “vanilla” software engineering. When I was in the final stages of my PhD, I started going to meetups to see what’s changed in the world outside academia. Back then, I defined myself as a “software engineer with a research background”, which didn’t mean much to most people. My first post-PhD job ended up being a data scientist at a small startup. As soon as I changed my LinkedIn title to Data Scientist, many offers started flowing. This is probably the reason why so many people call themselves data scientists these days, often diluting the term to a point where it’s so broad it becomes meaningless. This post presents my preferred data science definitions and my opinions on who should or shouldn’t call themselves a data scientist.

Defining data science

I really like the definition quoted above, of data science as the intersection of software engineering and statistics. Ofer Mendelevitch goes into more detail, drawing a data skill continuum of professions that ranges from software engineer on the left to pure statistician (or machine learning researcher) on the right. This continuum contains two additional roles, which are often confused with data scientists:

  • Data engineer: a software engineer that deals with data plumbing (traditional database setup, Hadoop, Spark and all the rest)
  • Data analyst: a person who digs into data to surface insights, but lacks the skills to do so at scale (e.g., they know how to use Excel, Tableau and SQL but can’t build a web app from scratch)

Data science mixes all these roles. Because of this, there are few true data science positions for people with no work experience. A successful data scientist needs to be able to “become one with the data” by exploring it and applying rigorous statistical analysis (right-hand side of the continuum). But good data scientists also understand what it takes to deploy production systems, and are ready to get their hands dirty by writing code that cleans up the data or performs core system functionality (left-hand side of the continuum). Gaining all these skills takes time. It is still somewhat rare to find people who are true data scientists according to this definition, which is why Ofer Mendelevitch’s post recommends building teams that consist of people with skills from both sides of the continuum.

How is data science different from just science?

Data is everywhere. Extracting knowledge from data is an essential part of any science. Hence, the name data science doesn’t really capture what’s new about the field. The way I see it, the novelty of data science comes from the application of software to model any type of data in a way that generalises across domains. So while a physicist may use software to build models based on data, they won’t become a data scientist until they’ve gone and applied these skills to other fields (as many physicists end up doing). As Kaggle shows, data scientists can work on a wide variety of problems – from biology and physics to marketing, text mining and web search personalisation. It’s often the case in Kaggle competitions that the same people apply similar techniques to very different problems, obtaining results that significantly improve on the state of the art.

However, domain experts such as physicists aren’t going to be made redundant any time soon. Contrary to what Kaggle may have you believe, there is much more to data science than predictive modelling on a well-defined problem. Data scientists typically spend much of their time working with domain experts to define the problem, and chasing down diverse data sources to extract features that enable predictive modelling (also known as “the fun part”). Despite the existence of these less-glamorous aspects of data science, there’s still a lot of fun to be had working in the area. I highly recommend getting into data science to people who enjoy such challenges.

Getting started as a data scientist is actually pretty simple: become a software engineer, become a data analyst, learn how to model data using software (e.g., by participating in Kaggle competitions), and find a job as a data scientist. Obviously, it’s not going to happen overnight. It took me around ten years, and I’m still learning.

Data’s hierarchy of needs

One of my favourite blog posts in recent times is The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps. That post comprehensively describes how abstracting all the data produced by LinkedIn’s various components into a single log pipeline greatly simplified their architecture and enabled advanced data-driven applications. Among the various technical details there are some beautifully-articulated business insights. My favourite one defines data’s hierarchy of needs:

Effective use of data follows a kind of Maslow’s hierarchy of needs. The base of the pyramid involves capturing all the relevant data, being able to put it together in an applicable processing environment (be that a fancy real-time query system or just text files and python scripts). This data needs to be modeled in a uniform way to make it easy to read and process. Once these basic needs of capturing data in a uniform way are taken care of it is reasonable to work on infrastructure to process this data in various ways—MapReduce, real-time query systems, etc.

It’s worth noting the obvious: without a reliable and complete data flow, a Hadoop cluster is little more than a very expensive and difficult to assemble space heater. Once data and processing are available, one can move concern on to more refined problems of good data models and consistent well understood semantics. Finally, concentration can shift to more sophisticated processing—better visualization, reporting, and algorithmic processing and prediction.

In my experience, most organizations have huge holes in the base of this pyramid—they lack reliable complete data flow—but want to jump directly to advanced data modeling techniques. This is completely backwards. [emphasis mine]

Visually, it looks something like this:

Data’s hierarchy of needs pyramid

In addition, before starting to build a data pipeline, one needs to ensure that the tracked system works as expected. For example, a buggy website is likely to produce weird metrics, which in turn would make the data processing, reporting and predictions unreliable. I completely agree with Jay’s point about needing to get the base of the pyramid right before setting out to do “something with data” (which seems to be the desire of every company nowadays).

The general point is that it’s important to have realistic expectations about what can be obtained by data-driven algorithms and insights. These can only be as good as the underlying data, with the results always depending to a large degree on having a solid infrastructure. Not everything has to be perfect from the start (most things never will be), but some degree of robustness is required to avoid spending too many resources on things that would never work. Trying to apply the latest predictive models without a reliable data infrastructure is like driving a fancy car on broken roads – you’re unlikely to get very far.