Questions to consider when using AI for PDF data extraction

Considerations that arise when automating the extraction of structured data from PDFs and similar documents.

March 11, 2024

Two types of startup data problems

Classifying startups as ML-centric or non-ML is a helpful exercise to uncover the data challenges they’re likely to face.

March 4, 2024

Avoiding AI complexity: First, write no code

Two stories of getting AI functionality to production, which demonstrate the risks inherent in custom development versus starting with a no-code approach.

February 26, 2024

Transfer learning applies to energy market bidding

An interesting approach to bidding energy storage assets into energy markets, showing that training on New York data transfers to Queensland.

December 14, 2023

Supporting volunteer monitoring of marine biodiversity with modern web and data tools

Summarising the work Uri Seroussi and I did to improve Reef Life Survey’s Reef Species of the World app.

November 29, 2023

Google's Rules of Machine Learning still apply in the age of large language models

Despite the excitement around large language models, building with machine learning remains an engineering problem with established best practices.

September 21, 2023

ChatGPT is transformative AI

My perspective after a week of using ChatGPT: This is a step change in finding distilled information, and it’s only the beginning.

December 11, 2022

Causal Machine Learning is off to a good start, despite some issues

Reviewing the first three chapters of the book Causal Machine Learning by Robert Osazuwa Ness.

September 12, 2022

Building useful machine learning tools keeps getting easier: A fish ID case study

Lessons learned building a fish ID web app with fast.ai and Streamlit, in an attempt to reduce my fear of missing out on the latest deep learning developments.

March 20, 2022

Use your human brain to avoid artificial intelligence disasters

Overview of a talk I gave at a deep learning course, framing AI ethics as the need for humans to think about the context and consequences of applying AI.

November 22, 2021

My work with Automattic

Back-dated meta-post that gathers my posts on Automattic blogs into a summary of the work I’ve done with the company.

October 7, 2021

Defining data science in 2018

Updating my definition of data science to match changes in the field. It is now broader than before, but its ultimate goal is still to support decisions.

July 22, 2018

Why you should stop worrying about deep learning and deepen your understanding of causality instead

Causality is often overlooked but is of much higher relevance to most data scientists than deep learning.

February 14, 2016

Miscommunicating science: Simplistic models, nutritionism, and the art of storytelling

Nutritionism is a special case of misinterpretation and miscommunication of scientific results – something many data scientists encounter in their work.

October 19, 2015

The wonderful world of recommender systems

Giving an overview of the field and common paradigms, and debunking five common myths about recommender systems.

October 2, 2015

Learning about deep learning through album cover classification

Progress on my album cover classification project, highlighting lessons that would be useful to others who are getting started with deep learning.

July 6, 2015

Hopping on the deep learning bandwagon

To become proficient at solving data science problems, you need to get your hands dirty. Here, I used album cover classification to learn about deep learning.

June 6, 2015

First steps in data science: author-aware sentiment analysis

I became a data scientist by doing a PhD, but the same steps can be followed without a formal education program.

May 2, 2015

My PhD work

An overview of my PhD in data science / artificial intelligence. Thesis title: Text Mining and Rating Prediction with Topical User Models.

March 30, 2015

Learning to rank for personalised search (Yandex Search Personalisation – Kaggle Competition Summary – Part 2)

My team’s solution to the Yandex Search Personalisation competition (finished 9th out of 194 teams).

February 11, 2015

Is thinking like a search engine possible? (Yandex search personalisation – Kaggle competition summary – part 1)

Insights on search personalisation and SEO from participating in a Kaggle competition (finished 9th out of 194 teams).

January 29, 2015

Stochastic Gradient Boosting: Choosing the Best Number of Iterations

Exploring an approach to choosing the optimal number of iterations in stochastic gradient boosting, following a bug I found in scikit-learn.

December 29, 2014