Summarising the work Uri Seroussi and I did to improve Reef Life Survey’s Reef Species of the World app.
Video and summary of a talk I gave at DataEngBytes Brisbane on what I learned from doing data engineering as part of every data science role I had.
Reflections on publishing on this website: Writing publicly to share thoughts and documentation beats chasing views and likes.
Yes, data science projects have suffered from classic software engineering mistakes, but the field is maturing with the rise of new engineering roles.
Exploring the hackability of speed-based coding tests, using CodeSignal’s Industry Coding Framework as a case study.
Bing Chat recently quipped that humans are small language models. Here are some of my thoughts on how we small language models can remain relevant (for now).
My perspective after a week of using ChatGPT: This is a step change in finding distilled information, and it’s only the beginning.
Reviewing the first three chapters of the book Causal Machine Learning by Robert Osazuwa Ness.
Discussing my recent career move into climate tech as a way of doing more to help mitigate dangerous climate change.
Lessons learned building a fish ID web app with fast.ai and Streamlit, in an attempt to reduce my fear of missing out on the latest deep learning developments.
Epidemiologists analyse clinical trials to estimate the intention-to-treat and per-protocol effects. This post applies their strategies to online experiments.
Overview of a talk I gave at a deep learning course, focusing on AI ethics as the need for humans to think about the context and consequences of applying AI.
My reasons for switching from WordPress.com to Hugo on GitHub + Cloudflare, along with a summary of the solution components and migration process.
Back-dated meta-post that gathers my posts on Automattic blogs into a summary of the work I’ve done with the company.
Sharing remote teamwork insights, my climate & sustainability activism, Reef Life Survey publications, and progress on Automattic’s Experimentation Platform.
Going deeper into correct testing of different methods for bootstrap estimation of confidence intervals.
Being a data scientist can sometimes feel like a race against software commodities that replace interesting work. What can one do to remain relevant?
Video of a talk I gave on remote data science work at the Data Science Sydney meetup.
Video and summary of a talk I gave at YOW! Data on bootstrap estimation of confidence intervals.
Bootstrap sampling has been promoted as an easy way of modelling uncertainty to hackers without much statistical knowledge. But things aren’t that simple.
Causal Inference by Miguel Hernán and Jamie Robins is a must-read for anyone interested in the area.
Discussing the pluses and minuses of remote work eighteen months after joining Automattic as a data scientist.
Updating my definition of data science to match changes in the field. It is now broader than before, but its ultimate goal is still to support decisions.
Frequently asked questions by visitors to this site, especially around entering the data science field.
Call for BCRecommender maintainers followed by a decision to shut it down, as I don’t have enough time and Bandcamp now offers recommendations.
I wanted a well-paid data science-y remote job with an established company that offers a good life balance and makes products I care about. I got it eventually.
Web tools I built to visualise Reef Life Survey data and assist citizen scientists in underwater visual census work.
There’s a lot of misleading content on the estimation of customer lifetime value. Here’s what I learned about doing it well.
Video and summary of a talk I gave at the Data Science Sydney meetup, about going beyond the what & how of predictive modelling.
Seven common mistakes to avoid when working with data, such as ignoring uncertainty and confusing observed and unobserved quantities.
It seems like anyone who touches data can call themselves a data scientist, which makes the title useless. The work they do can still be useful, though.
A web tool I built to interpret A/B test results in a Bayesian way, including prior specification, visualisations, and decision rules.
Discussing the need for untested assumptions and temporality in causal inference. Mostly based on Samantha Kleinberg’s Causality, Probability, and Time.
Is artificial/machine intelligence a future threat? I argue that it’s already here, with greedy robots already dominating our lives.
Causality is often overlooked, but it is far more relevant to most data scientists than deep learning.
Insights on data collection and machine learning from spending a month sailing, diving, and counting fish with Reef Life Survey.
Some companies present raw data or information as “insights”. This post surveys some examples, and discusses how they can be turned into real insights.
Defining feasible problems and coming up with reasonable ways of measuring solutions is harder than building accurate models or obtaining clean data.
Migrating BCRecommender from MongoDB to Elasticsearch made it possible to offer a richer search experience to users at a similar cost, among other benefits.
Nutritionism is a special case of misinterpretation and miscommunication of scientific results – something many data scientists encounter in their work.
Giving an overview of the field and common paradigms, and debunking five common myths about recommender systems.
Hiring data scientists prematurely is wasteful and frustrating. Here are some questions to ask before you hire your first data scientist.
Migrating my web apps away from Parse.com due to reliability issues. Self-hosting is a better solution.
Progress on my album cover classification project, highlighting lessons that would be useful to others who are getting started with deep learning.
To become proficient at solving data science problems, you need to get your hands dirty. Here, I used album cover classification to learn about deep learning.
I became a data scientist by doing a PhD, but the same steps can be followed without a formal education program.
Recent choices I’ve made to reduce my exposure to fossil fuels, including practical steps that can be taken by Australians and generally applicable lessons.
An overview of my PhD in data science / artificial intelligence. Thesis title: Text Mining and Rating Prediction with Topical User Models.
Progress since leaving my last full-time job and setting out on an independent path that includes data science consulting and work on my own projects.
My team’s solution to the Yandex Search Personalisation competition (finished 9th out of 194 teams).
Insights on search personalisation and SEO from participating in a Kaggle competition (finished 9th out of 194 teams).
A script for importing data into the Parse backend-as-a-service.
Exploring an approach to choosing the optimal number of iterations in stochastic gradient boosting, following a bug I found in scikit-learn.
Increasing SEO traffic to BCRecommender by adding content and opening up more pages for crawling. It turns out that thin content is better than no content.
Summary of a Kaggle competition to forecast bulldozer sale prices, where I finished 9th out of 476 teams.
Update on BCRecommender traction using three channels: blogger outreach, search engine optimisation, and content marketing.
Data science has been a hot term in the past few years. Still, there isn’t a single definition of the field. This post discusses my favourite definition.
Summary of my approach to the Greek Media Monitoring Kaggle competition, where I finished 6th out of 120 teams.
Ranking 19 channels with the goal of getting traction for BCRecommender.
The recommendation backend for my BCRecommender service for personalised Bandcamp music discovery.
Iterating on my BCRecommender service with the goal of keeping costs low while providing a valuable music recommendation service.
My motivation behind building BCRecommender, a free recommendation & discovery service for Bandcamp music.
Summary of a talk I gave at the Data Science Sydney meetup with ten tips on almost-winning Kaggle competitions.
Discussing the hierarchy of needs proposed by Jay Kreps. Key takeaway: Data-driven algorithms & insights can only be as good as the underlying data.
Pointers to all my Kaggle advice posts and competition summaries.
First post! An email I sent to members of the Data Science Sydney Meetup with tips on how to get started with Kaggle competitions.