Questions to consider when using AI for PDF data extraction

Discussing considerations that arise when attempting to automate the extraction of structured data from PDFs and similar documents.

March 11, 2024

Substance over titles: Your first data hire may be a data scientist

Advice for hiring a startup’s first data person: match skills to business needs, consider contractors, and get help from data people.

February 5, 2024

New decade, new tagline: Data & AI for Impact

Shifting focus to ‘Data & AI for Impact’, with more startup-related content, increased posting frequency, and deeper audience engagement.

January 19, 2024

Positioning is a common problem for data scientists

With the commodification of data scientists, the problem of positioning has become more common: My takeaways from Genevieve Hayes interviewing Jonathan Stark.

December 18, 2023

You don't need a proprietary API for static maps

For many use cases, libraries like cartopy are better than the likes of Mapbox and Google Maps.

November 21, 2023

Lessons from reluctant data engineering

Video and summary of a talk I gave at DataEngBytes Brisbane on what I learned from doing data engineering as part of every data science role I had.

October 25, 2023

Google's Rules of Machine Learning still apply in the age of large language models

Despite the excitement around large language models, building with machine learning remains an engineering problem with established best practices.

September 21, 2023

Was data science a failure mode of software engineering?

Yes, data science projects have suffered from classic software engineering mistakes, but the field is maturing with the rise of new engineering roles.

June 30, 2023

Causal Machine Learning is off to a good start, despite some issues

Reviewing the first three chapters of the book Causal Machine Learning by Robert Osazuwa Ness.

September 12, 2022

The mission matters: Moving to climate tech as a data scientist

Discussing my recent career move into climate tech as a way of doing more to help mitigate dangerous climate change.

June 6, 2022

Building useful machine learning tools keeps getting easier: A fish ID case study

Lessons learned building a fish ID web app with fast.ai and Streamlit, in an attempt to reduce my fear of missing out on the latest deep learning developments.

March 20, 2022

Analysis strategies in online A/B experiments: Intention-to-treat, per-protocol, and other lessons from clinical trials

Epidemiologists analyse clinical trials to estimate the intention-to-treat and per-protocol effects. This post applies their strategies to online experiments.

January 14, 2022

Use your human brain to avoid artificial intelligence disasters

Overview of a talk I gave at a deep learning course, focusing on AI ethics as the need for humans to think about the context and consequences of applying AI.

November 22, 2021

My work with Automattic

Back-dated meta-post that gathers my posts on Automattic blogs into a summary of the work I’ve done with the company.

October 7, 2021

Many is not enough: Counting simulations to bootstrap the right way

Going deeper into correct testing of different methods for bootstrap estimation of confidence intervals.

August 24, 2020

Software commodities are eating interesting data science work

Being a data scientist can sometimes feel like a race against software commodities that replace interesting work. What can one do to remain relevant?

January 11, 2020

A day in the life of a remote data scientist

Video of a talk I gave on remote data science work at the Data Science Sydney meetup.

December 11, 2019

Bootstrapping the right way?

Video and summary of a talk I gave at YOW! Data on bootstrap estimation of confidence intervals.

October 6, 2019

Hackers beware: Bootstrap sampling may be harmful

Bootstrap sampling has been promoted as an easy way of modelling uncertainty to hackers without much statistical knowledge. But things aren’t that simple.

January 7, 2019

The most practical causal inference book I’ve read (is still a draft)

Causal Inference by Miguel Hernán and Jamie Robins is a must-read for anyone interested in the area.

December 24, 2018

Reflections on remote data science work

Discussing the pluses and minuses of remote work eighteen months after joining Automattic as a data scientist.

November 3, 2018

Defining data science in 2018

Updating my definition of data science to match changes in the field. It is now broader than before, but its ultimate goal is still to support decisions.

July 22, 2018

Advice for aspiring data scientists and other FAQs

Questions frequently asked by visitors to this site, especially around entering the data science field.

October 15, 2017

My 10-step path to becoming a remote data scientist with Automattic

I wanted a well-paid data science-y remote job with an established company that offers a good life balance and makes products I care about. I got it eventually.

July 29, 2017

Exploring and visualising Reef Life Survey data

Web tools I built to visualise Reef Life Survey data and assist citizen scientists in underwater visual census work.

June 3, 2017

Customer lifetime value and the proliferation of misinformation on the internet

There’s a lot of misleading content on the estimation of customer lifetime value. Here’s what I learned about doing it well.

January 8, 2017

Ask Why! Finding motives, causes, and purpose in data science

Video and summary of a talk I gave at the Data Science Sydney meetup, about going beyond the what & how of predictive modelling.

September 19, 2016

If you don’t pay attention, data can drive you off a cliff

Seven common mistakes to avoid when working with data, such as ignoring uncertainty and confusing observed and unobserved quantities.

August 21, 2016

Is Data Scientist a useless job title?

It seems like anyone who touches data can call themselves a data scientist, which makes the title useless. The work they do can still be useful, though.

August 4, 2016

Making Bayesian A/B testing more accessible

A web tool I built to interpret A/B test results in a Bayesian way, including prior specification, visualisations, and decision rules.

June 19, 2016

Diving deeper into causality: Pearl, Kleinberg, Hill, and untested assumptions

Discussing the need for untested assumptions and temporality in causal inference. Mostly based on Samantha Kleinberg’s Causality, Probability, and Time.

May 14, 2016

The rise of greedy robots

Is artificial/machine intelligence a future threat? I argue that it’s already here, with greedy robots dominating our lives.

March 20, 2016

Why you should stop worrying about deep learning and deepen your understanding of causality instead

Causality is often overlooked but is of much higher relevance to most data scientists than deep learning.

February 14, 2016

The joys of offline data collection

Insights on data collection and machine learning from spending a month sailing, diving, and counting fish with Reef Life Survey.

January 24, 2016

This holiday season, give me real insights

Some companies present raw data or information as “insights”. This post surveys some examples, and discusses how they can be turned into real insights.

December 8, 2015

The hardest parts of data science

Defining feasible problems and coming up with reasonable ways of measuring solutions is harder than building accurate models or obtaining clean data.

November 23, 2015

Miscommunicating science: Simplistic models, nutritionism, and the art of storytelling

Nutritionism is a special case of misinterpretation and miscommunication of scientific results – something many data scientists encounter in their work.

October 19, 2015

The wonderful world of recommender systems

Giving an overview of the field and common paradigms, and debunking five common myths about recommender systems.

October 2, 2015

You don’t need a data scientist (yet)

Hiring data scientists prematurely is wasteful and frustrating. Here are some questions to ask before you hire your first data scientist.

August 24, 2015

Learning about deep learning through album cover classification

Progress on my album cover classification project, highlighting lessons that would be useful to others who are getting started with deep learning.

July 6, 2015

Hopping on the deep learning bandwagon

To become proficient at solving data science problems, you need to get your hands dirty. Here, I used album cover classification to learn about deep learning.

June 6, 2015

First steps in data science: author-aware sentiment analysis

I became a data scientist by doing a PhD, but the same steps can be followed without a formal education program.

May 2, 2015

My PhD work

An overview of my PhD in data science / artificial intelligence. Thesis title: Text Mining and Rating Prediction with Topical User Models.

March 30, 2015

The long road to a lifestyle business

Progress since leaving my last full-time job and setting out on an independent path that includes data science consulting and work on my own projects.

March 22, 2015

Learning to rank for personalised search (Yandex Search Personalisation – Kaggle Competition Summary – Part 2)

My team’s solution to the Yandex Search Personalisation competition (finished 9th out of 194 teams).

February 11, 2015

Is thinking like a search engine possible? (Yandex search personalisation – Kaggle competition summary – part 1)

Insights on search personalisation and SEO from participating in a Kaggle competition (finished 9th out of 194 teams).

January 29, 2015

Stochastic Gradient Boosting: Choosing the Best Number of Iterations

Exploring an approach to choosing the optimal number of iterations in stochastic gradient boosting, following a bug I found in scikit-learn.

December 29, 2014

Fitting noise: Forecasting the sale price of bulldozers (Kaggle competition summary)

Summary of a Kaggle competition to forecast bulldozer sale price, where I finished 9th out of 476 teams.

November 19, 2014

What is data science?

Data science has been a hot term in the past few years. Still, there isn’t a single definition of the field. This post discusses my favourite definition.

October 23, 2014

Greek Media Monitoring Kaggle competition: My approach

Summary of my approach to the Greek Media Monitoring Kaggle competition, where I finished 6th out of 120 teams.

October 7, 2014

Bandcamp recommendation and discovery algorithms

The recommendation backend for my BCRecommender service for personalised Bandcamp music discovery.

September 19, 2014

How to (almost) win Kaggle competitions

Summary of a talk I gave at the Data Science Sydney meetup with ten tips on almost-winning Kaggle competitions.

August 24, 2014

Data’s hierarchy of needs

Discussing the hierarchy of needs proposed by Jay Kreps. Key takeaway: Data-driven algorithms & insights can only be as good as the underlying data.

August 17, 2014

Kaggle competition tips and summaries

Pointers to all my Kaggle advice posts and competition summaries.

April 5, 2014

Kaggle beginner tips

First post! An email I sent to members of the Data Science Sydney Meetup with tips on how to get started with Kaggle competitions.

January 19, 2014