Browse Posts

Posting into the void – with guardrails

A LinkedIn post with little traction led to a surprising AI collaboration – proof that sharing online can spark unexpected opportunities.

Data moats, stealthy AI, and more: AI Con 2024 notes

Themes from AI Con 2024: data moats, stealthy AI use, Chatty’s UX revolution, and enduring fundamentals.

Don't build AI, build with AI

Building AI is hard and expensive. For most companies, the path to AI success is building with third-party AI interns and cheap AI cogs.

In praise of inconsistency: Ditching weekly posts

On moving away from weekly blog posts in favour of deeper inconsistent articles and LinkedIn engagement.

Data, AI, humans, and climate: Carving a consulting niche

Podcast chat on the reality of Data & AI and my consulting focus: Helping climate & nature tech startups ship data-intensive solutions.

Juggling delivery, admin, and leads: Monthly biz recap

Highlights and lessons from my solo expertise biz, including value pricing, fractional cash flow, and distractions from admin & politics.

AI hype, AI bullshit, and the real deal

My views on separating AI hype and bullshit from the real deal. The general ideas apply to past and future hype waves in tech.

Giving up on the minimum viable data stack

Exploring why universal advice on startup data stacks is challenging, and the importance of context-specific decisions in data infrastructure.

Keep learning: Your career is never truly done

Podcast chat on my career journey from software engineering to data science and independent consulting.

First year lessons from a solo expertise biz in Data & AI

Reflections on building a solo expertise business in Data & AI, focusing on climate tech startups. Lessons learned from the first year of transition.

AI/ML lifecycle models versus real-world mess

The real world of AI/ML doesn’t fit into a neat diagram, so I created another diagram and a maturity heatmap to model the mess.

Your first Data-to-AI hire: Run a lovable process

Video and key points from the second part of a webinar on a startup’s first data hire, covering tips for defining the role and running the process.

Learn about Dataland to avoid expensive hiring mistakes

Video and key points from the first part of a webinar on a startup’s first data hire, covering data & AI definitions and high-level recommendations.

Exploring an AI product idea with the latest ChatGPT, Claude, and Gemini

Asking identical questions about my MagicGrantMaker idea yielded near-identical responses from the top chatbot models.

Stay alert! Security is everyone's responsibility

Questions to assess the security posture of a startup, focusing on basic hygiene and handling of sensitive data.

Is your tech stack ready for data-intensive applications?

Questions to assess the quality of tech stacks and lifecycles, with a focus on artificial intelligence, machine learning, and analytics.

AI ain't gonna save you from bad data

Since we’re far from a utopia where data issues are fully handled by AI, this post presents six questions humans can use to assess data projects.

Startup data health starts with healthy event tracking

Expanding on the startup health check question of tracking Kukuyeva’s five business aspects as wide events.

How to avoid startups with poor development processes

Questions that prospective data specialists and engineers should ask about development processes before accepting a startup role.

Plumbing, Decisions, and Automation: De-hyping Data & AI

Three essential questions to understand where an organisation stands when it comes to Data & AI (with zero hype).

Question startup culture before accepting a data-to-AI role

Eight questions that prospective data-to-AI employees should ask about a startup’s work and data culture.

Probing the People aspects of an early-stage startup

Ten questions that prospective employees should ask about a startup’s team, especially for data-centric roles.

Business questions to ask before taking a startup data role

Fourteen questions that prospective employees should ask about a startup’s business model and product, especially for data-focused roles.

Mentorship and the art of actionable advice

Reflections on what it takes to package expertise and deliver timely, actionable advice outside the context of employee relationships.

Assessing a startup's data-to-AI health

Reviewing the areas that should be assessed to determine a startup’s opportunities and challenges on the data/AI/ML front.

AI does not obviate the need for testing and observability

It’s easy to prototype with AI, but production-grade AI apps require even more thorough testing and observability than traditional software.

My experience as a Data Tech Lead with Work on Climate

The story of how I joined Work on Climate as a volunteer and became its data tech lead, with lessons applied to consulting & fractional work.

Artificial intelligence, automation, and the art of counting fish

Discussing the use of AI to automate underwater marine surveys as an example of the uneven distribution of technological advancement.

Questions to consider when using AI for PDF data extraction

Discussing considerations that arise when attempting to automate the extraction of structured data from PDFs and similar documents.

Two types of startup data problems

Classifying startups as ML-centric or non-ML is a helpful exercise to uncover the data challenges they’re likely to face.

Avoiding AI complexity: First, write no code

Two stories of getting AI functionality to production, which demonstrate the risks inherent in custom development versus starting with a no-code approach.

Building your startup's minimum viable data stack

First post in a series on building a minimum viable data stack for startups, introducing key definitions, components, and considerations.

Nudging ChatGPT to invent books you have no time to read

Getting ChatGPT Plus to elaborate on possible book content and produce a PDF cheatsheet, with the goal of learning about its capabilities.

Substance over titles: Your first data hire may be a data scientist

Advice for hiring a startup’s first data person: match skills to business needs, consider contractors, and get help from data people.

New decade, new tagline: Data & AI for Impact

Shifting focus to ‘Data & AI for Impact’, with more startup-related content, increased posting frequency, and deeper audience engagement.

Supporting volunteer monitoring of marine biodiversity with modern web and data tools

Summarising the work Uri Seroussi and I did to improve Reef Life Survey’s Reef Species of the World app.

Lessons from reluctant data engineering

Video and summary of a talk I gave at DataEngBytes Brisbane on what I learned from doing data engineering as part of every data science role I had.

My rediscovery of quiet writing on the open web

Reflections on publishing on this website: Writing publicly to share thoughts and documentation beats chasing views and likes.

Was data science a failure mode of software engineering?

Yes, data science projects have suffered from classic software engineering mistakes, but the field is maturing with the rise of new engineering roles.

How hackable are automated coding assessments?

Exploring the hackability of speed-based coding tests, using CodeSignal’s Industry Coding Framework as a case study.

Remaining relevant as a small language model

Bing Chat recently quipped that humans are small language models. Here are some of my thoughts on how we small language models can remain relevant (for now).

ChatGPT is transformative AI

My perspective after a week of using ChatGPT: This is a step change in finding distilled information, and it’s only the beginning.

Causal Machine Learning is off to a good start, despite some issues

Reviewing the first three chapters of the book Causal Machine Learning by Robert Osazuwa Ness.

The mission matters: Moving to climate tech as a data scientist

Discussing my recent career move into climate tech as a way of doing more to help mitigate dangerous climate change.

Building useful machine learning tools keeps getting easier: A fish ID case study

Lessons learned building a fish ID web app with fast.ai and Streamlit, in an attempt to reduce my fear of missing out on the latest deep learning developments.

Analysis strategies in online A/B experiments: Intention-to-treat, per-protocol, and other lessons from clinical trials

Epidemiologists analyse clinical trials to estimate the intention-to-treat and per-protocol effects. This post applies their strategies to online experiments.

Use your human brain to avoid artificial intelligence disasters

Overview of a talk I gave at a deep learning course, focusing on AI ethics as the need for humans to think about the context and consequences of applying AI.

Migrating from WordPress.com to Hugo on GitHub + Cloudflare

My reasons for switching from WordPress.com to Hugo on GitHub + Cloudflare, along with a summary of the solution components and migration process.

My work with Automattic

Back-dated meta-post that gathers my posts on Automattic blogs into a summary of the work I’ve done with the company.

Some highlights from 2020

Sharing remote teamwork insights, my climate & sustainability activism, Reef Life Survey publications, and progress on Automattic’s Experimentation Platform.

Many is not enough: Counting simulations to bootstrap the right way

Going deeper into correct testing of different methods for bootstrap estimation of confidence intervals.

Software commodities are eating interesting data science work

Being a data scientist can sometimes feel like a race against software commodities that replace interesting work. What can one do to remain relevant?

A day in the life of a remote data scientist

Video of a talk I gave on remote data science work at the Data Science Sydney meetup.

Bootstrapping the right way?

Video and summary of a talk I gave at YOW! Data on bootstrap estimation of confidence intervals.

Hackers beware: Bootstrap sampling may be harmful

Bootstrap sampling has been promoted to hackers without much statistical knowledge as an easy way of modelling uncertainty. But things aren’t that simple.

The most practical causal inference book I’ve read (is still a draft)

Causal Inference by Miguel Hernán and Jamie Robins is a must-read for anyone interested in the area.

Reflections on remote data science work

Discussing the pluses and minuses of remote work eighteen months after joining Automattic as a data scientist.

Defining data science in 2018

Updating my definition of data science to match changes in the field. It is now broader than before, but its ultimate goal is still to support decisions.

Advice for aspiring data scientists and other FAQs

Questions frequently asked by visitors to this site, especially around entering the data science field.

State of Bandcamp Recommender, Late 2017

Call for BCRecommender maintainers followed by a decision to shut it down, as I don’t have enough time and Bandcamp now offers recommendations.

My 10-step path to becoming a remote data scientist with Automattic

I wanted a well-paid data science-y remote job with an established company that offers a good life balance and makes products I care about. I got it eventually.

Exploring and visualising Reef Life Survey data

Web tools I built to visualise Reef Life Survey data and assist citizen scientists in underwater visual census work.

Customer lifetime value and the proliferation of misinformation on the internet

There’s a lot of misleading content on the estimation of customer lifetime value. Here’s what I learned about doing it well.

Ask Why! Finding motives, causes, and purpose in data science

Video and summary of a talk I gave at the Data Science Sydney meetup, about going beyond the what & how of predictive modelling.

If you don’t pay attention, data can drive you off a cliff

Seven common mistakes to avoid when working with data, such as ignoring uncertainty and confusing observed and unobserved quantities.

Is Data Scientist a useless job title?

It seems like anyone who touches data can call themselves a data scientist, which makes the title useless. The work they do can still be useful, though.

Making Bayesian A/B testing more accessible

A web tool I built to interpret A/B test results in a Bayesian way, including prior specification, visualisations, and decision rules.

Diving deeper into causality: Pearl, Kleinberg, Hill, and untested assumptions

Discussing the need for untested assumptions and temporality in causal inference. Mostly based on Samantha Kleinberg’s Causality, Probability, and Time.

The rise of greedy robots

Is artificial/machine intelligence a future threat? I argue that it’s already here, with greedy robots dominating our lives.

Why you should stop worrying about deep learning and deepen your understanding of causality instead

Causality is often overlooked but is of much higher relevance to most data scientists than deep learning.

The joys of offline data collection

Insights on data collection and machine learning from spending a month sailing, diving, and counting fish with Reef Life Survey.

This holiday season, give me real insights

Some companies present raw data or information as “insights”. This post surveys some examples, and discusses how they can be turned into real insights.

The hardest parts of data science

Defining feasible problems and coming up with reasonable ways of measuring solutions is harder than building accurate models or obtaining clean data.

Migrating a simple web application from MongoDB to Elasticsearch

Migrating BCRecommender from MongoDB to Elasticsearch made it possible to offer a richer search experience to users at a similar cost, among other benefits.

Miscommunicating science: Simplistic models, nutritionism, and the art of storytelling

Nutritionism is a special case of misinterpretation and miscommunication of scientific results – something many data scientists encounter in their work.

The wonderful world of recommender systems

Giving an overview of the field and common paradigms, and debunking five common myths about recommender systems.

You don’t need a data scientist (yet)

Hiring data scientists prematurely is wasteful and frustrating. Here are some questions to ask before you hire your first data scientist.

Goodbye, Parse.com

Migrating my web apps away from Parse.com due to reliability issues. Self-hosting is a better solution.

Learning about deep learning through album cover classification

Progress on my album cover classification project, highlighting lessons that would be useful to others who are getting started with deep learning.

Hopping on the deep learning bandwagon

To become proficient at solving data science problems, you need to get your hands dirty. Here, I used album cover classification to learn about deep learning.

First steps in data science: author-aware sentiment analysis

I became a data scientist by doing a PhD, but the same steps can be followed without a formal education program.

My divestment from fossil fuels

Recent choices I’ve made to reduce my exposure to fossil fuels, including practical steps that can be taken by Australians and generally applicable lessons.

My PhD work

An overview of my PhD in data science / artificial intelligence. Thesis title: Text Mining and Rating Prediction with Topical User Models.

The long road to a lifestyle business

Progress since leaving my last full-time job and setting out on an independent path that includes data science consulting and work on my own projects.

Learning to rank for personalised search (Yandex Search Personalisation – Kaggle Competition Summary – Part 2)

My team’s solution to the Yandex Search Personalisation competition (finished 9th out of 194 teams).

Is thinking like a search engine possible? (Yandex search personalisation – Kaggle competition summary – part 1)

Insights on search personalisation and SEO from participating in a Kaggle competition (finished 9th out of 194 teams).

Automating Parse.com bulk data imports

A script for importing data into the Parse backend-as-a-service.

Stochastic Gradient Boosting: Choosing the Best Number of Iterations

Exploring an approach to choosing the optimal number of iterations in stochastic gradient boosting, following a bug I found in scikit-learn.

SEO: Mostly about showing up?

Increasing SEO traffic to BCRecommender by adding content and opening up more pages for crawling. It turns out that thin content is better than no content.

Fitting noise: Forecasting the sale price of bulldozers (Kaggle competition summary)

Summary of a Kaggle competition to forecast bulldozer sale price, where I finished 9th out of 476 teams.

BCRecommender Traction Update

Update on BCRecommender traction using three channels: blogger outreach, search engine optimisation, and content marketing.

What is data science?

Data science has been a hot term in the past few years. Still, there isn’t a single definition of the field. This post discusses my favourite definition.

Greek Media Monitoring Kaggle competition: My approach

Summary of my approach to the Greek Media Monitoring Kaggle competition, where I finished 6th out of 120 teams.

Applying the Traction Book’s Bullseye framework to BCRecommender

Ranking 19 channels with the goal of getting traction for BCRecommender.

Bandcamp recommendation and discovery algorithms

The recommendation backend for my BCRecommender service for personalised Bandcamp music discovery.

Building a recommender system on a shoestring budget (or: BCRecommender part 2 – general system layout)

Iterating on my BCRecommender service with the goal of keeping costs low while providing a valuable music recommendation service.

Building a Bandcamp recommender system (part 1 – motivation)

My motivation behind building BCRecommender, a free recommendation & discovery service for Bandcamp music.

How to (almost) win Kaggle competitions

Summary of a talk I gave at the Data Science Sydney meetup with ten tips on almost-winning Kaggle competitions.

Data’s hierarchy of needs

Discussing the hierarchy of needs proposed by Jay Kreps. Key takeaway: Data-driven algorithms & insights can only be as good as the underlying data.

Kaggle competition tips and summaries

Pointers to all my Kaggle advice posts and competition summaries.

Kaggle beginner tips

First post! An email I sent to members of the Data Science Sydney Meetup with tips on how to get started with Kaggle competitions.