Aspiring data scientists and other visitors to this site often ask the same questions. This post is the definitive collection of my answers to such questions (which may evolve over time).
How do I become a data scientist?
It depends on your situation. Before we get into it, have you thought about why you want to become a data scientist?
Hmm… Not really. Why should I become a data scientist?
I can't answer this for you, but it's great to see you asking why. Do you know what data science is? Do you understand what data scientists do?
Sort of. Just so we’re on the same page, what is data science?
No one knows for sure. Here are my thoughts from 2014 on defining data science as the intersection of software engineering and statistics, and a more recent post on defining data science in 2018.
What are the hardest parts of data science?
The hardest parts of data science are problem definition and solution measurement, not model fitting and data cleaning, because counting things is hard.
Thanks, that’s helpful. But what do data scientists actually do?
It varies a lot. This variability makes the job title somewhat useless. You should try to get an idea of what areas of data science interest you. For many people, excitement over the technical aspects wanes with time. And even if you still find the technical aspects exciting, most jobs have boring parts. When considering career changes, think of the non-technical aspects that would keep you engaged.
To answer the question, here are some posts on things I've done: joined Automattic by improving the Elasticsearch language detection plugin, calculated customer lifetime value, analysed A/B test results, built recommender systems (including one for Bandcamp music), competed on Kaggle, and completed a PhD. I've also dabbled in deep learning, marine surveys, causality, and other things that I haven't had the chance to write about.
Cool! Can you provide a general overview of how to become a data scientist?
Yes! Check out Alec Smith's excellent articles.
I’m pretty happy with my current job, but still thinking of becoming a data scientist. What should I do?
Find ways of doing data science within your current role, working overtime if needed. Working on a real problem in a familiar domain is much more valuable than working on toy problems from online courses and platforms like Kaggle (though they're also useful). If you're a data analyst, learn how to program to automate and simplify your analyses. If you're a software engineer, become comfortable with analysing and modelling data. Machine learning doesn't have to be a part of what you choose to do.
I’m pretty busy. What online course should I take to learn about the area?
Calling Bullshit: Data Reasoning for the Digital Age is a good place to start. Deep learning should be pretty low on your list if you don't have much background in the area.
Should I learn Python or R? Keras or TensorFlow? What about <insert name here>?
It doesn't matter. Focus on principles and you'll be fine. The following quote still applies today (to people of all genders).
As to methods, there may be a million and then some, but principles are few. The man who grasps principles can successfully select his own methods. The man who tries methods, ignoring principles, is sure to have trouble.
I want to become a data science freelancer. Can you provide some advice?
As with any freelancing job, expect to spend much of your time on sales and networking. I've only explored the freelancing path briefly, but Radim Řehůřek has published great slides on the topic. If you're thinking of freelancing as a way of gaining financial independence, also consider spending less, earning more, and investing wisely.
Can you recommend an academic data science degree?
Sorry, but I don't know much about those degrees. Boris Gorelik has some interesting thoughts on studying data science.
Will you be my mentor?
Probably not, unless you're hard-working, independent, and doing something I find interesting. Feel free to contact me if you believe we'd both find the relationship beneficial.
Can you help with my project?
Probably not, as I work full-time with Automattic. I barely have time for my side projects, and I'm not looking for more paid work. However, if you think I'd find your project exciting, please do contact me.
What about ethics?
What about them? There isn't a single definition of right and wrong, as morality is multi-dimensional. I believe it's important to question your own choices, and avoid applying data science blindly. For me, this means divesting from harmful industries like fossil fuels and striving to go beyond the creation of greedy robots (among other things).
I’m a manager. When should I hire a data scientist and start using machine learning?
There's a good chance you don't need a data scientist yet, but you should be aware of common pitfalls when trying to be data-driven. It's also worth reading Paras Chopra's post on what you need to know before you board the machine learning train.
Do you want to buy my products or services?
No. If I did, I'd contact you.
I have a question that isn’t answered here or anywhere on the internet, and I think you can help. Can I contact you?
Sure, use the form on this page.