Whitetip shark with an RLS transect

The joys of offline data collection

Many modern data scientists don’t get to experience data collection in the offline world. Recently, I spent a month sailing down the northern Great Barrier Reef, collecting data for the Reef Life Survey project. In addition to being a great diving experience, the trip helped me obtain general insights on data collection and machine learning, which are shared in this article.

The Reef Life Survey project

Reef Life Survey (RLS) is a citizen scientist project, led by a team from the University of Tasmania. The data collected by RLS volunteers is freely available on the RLS website, and has been used for producing various reports and scientific publications. An RLS survey is performed along a 50 metre tape, which is laid at a constant depth following a reef’s contour. After laying the tape, one diver takes photos of the bottom at 2.5 metre intervals along the transect line. These photos are automatically analysed to classify the type of substrate or growth (e.g., hard coral or sand). Divers then complete two swims along each side of the transect. On the first swim (method 1), divers record all the fish species and large swimming animals found in a 5 metre corridor from the line. The second swim (method 2) requires keeping closer to the bottom and looking under ledges and vegetation in a 1 metre corridor from the line, targeting invertebrates and cryptic animals. The RLS manual includes all the details on how surveys are performed.

Performing RLS surveys is not a trivial task. In the tropics, it is not uncommon to record around 100 fish species on method 1. The scientists running the project are very conscious of the importance of obtaining high-quality data, so training to become an RLS volunteer takes considerable effort and dedication. The process generally consists of doing surveys together with an experienced RLS diver, and comparing the data after each dive. Once the trainee’s data matches that of the experienced RLSer, they are considered good enough to perform surveys independently. However, retraining is often required when surveying new ecoregions (e.g., an RLSer trained in Sydney needs further training to survey the Great Barrier Reef).

RLS requires a lot of hard work, but there are many reasons why it’s worth the effort. As someone who cares about marine conservation, I like the fact that RLS dives yield useful data that is used to drive environmental management decisions. As a scuba diver, I enjoy the opportunity to dive places that are rarely dived and the enhanced knowledge of the marine environment – doing surveys makes me notice things that I would otherwise overlook. Finally, as a data scientist, I find the exposure to the work of marine scientists very educational.

Pre-training and thoughts on supervised learning

Doing surveys in the tropics is a completely different story from surveying temperate reefs, due to the substantially higher diversity and abundance of marine creatures. Producing high-quality results requires being able to identify most creatures underwater, while doing the survey. It is possible to write down descriptions and take photos of unidentified species, but doing this for a large number of species is impractical.

Training the neural network in my head to classify tropical fish by species was an interesting experience. The approach that worked best was making flashcards using reveal.js, photos scraped from various sources, and past survey data. As the image below shows, each flashcard consists of a single photo, and pressing the down arrow reveals the name of the creature. With some basic JavaScript, I made the presentation select a different subset of photos on each load. Originally, I tried to learn all the 1000+ species that were previously recorded in the northern Great Barrier Reef, but this proved to be too hard – I realised that a better strategy was needed. The strategy that I chose was to focus on the most frequently-recorded species: I started by memorising the most frequent ones (e.g., those recorded on more than 50% of surveys), and gradually made it more challenging by decreasing the frequency threshold (e.g., to 25% in 5% steps). This proved to be pretty effective – by the time I started diving I could identify about 50-100 species underwater, even though I had mostly been using static images. It’d be interesting to know whether this kind of approach would be effective in training neural networks (or other batch-trained models) in certain scenarios – spend a few epochs training with instances from a subset of the classes, and gradually increase the number of considered classes. This may be effective when errors on certain classes are more important than others, and may yield different results from simply weighting classes or instances. Please let me know if you know of anyone who has experimented with this idea (update: gwern from Reddit pointed me to the paper Curriculum Learning by Bengio et al., which discusses this idea).

RLS flashcard example (Chaetodon lunulatus)

RLS flashcard example (Chaetodon lunulatus)

While repeatedly looking at photos and their labels felt a lot like training an artificial neural network, as a human I have the advantage of being able to easily use information from multiple sources. For example, fish ID books such as Reef Fish Identification: Tropical Pacific provide concise descriptions of the identifying physical features of each fish (see the image below for the book’s entry for Chaetodon lunulatus – the butterflyfish from the flashcard above). Reading those descriptions made me learn more effectively, by helping me focus my attention on the parts that matter for classification. Learning only from static images can be hard when classifying creatures with highly variable colour schemes – using extraneous knowledge about what actually matters when it comes to classification is the way to go in practice. Further, features that are hard to decode from photos – like behaviour and habitat – are sometimes crucial to distinguishing different species. One interesting thought is that while photos can be seen as raw data, natural language descriptions are essentially models. Utilising such models is likely to be of benefit in many areas. For example, being able to tell a classifier what to look for in an image would make training a supervised classifier more similar to the way humans learn. This may be achieved using similar techniques to those used for generating image descriptions, except that the goal would be to use descriptions of the classes to improve classification accuracy.

Fish ID example (Chaetodon lunulatus)

Fish ID example (Chaetodon lunulatus). Source: Reef Fish Identification: Tropical Pacific

Another difference between my learning and supervised machine learning is that if I found a creature hard to identify, I would go and look for more photos or videos of them. Videos were especially valuable, because in practice I rarely had to identify static creatures. This approach may be applicable in situations where labelled data is abundant. Sometimes, using all the labelled data makes model training too slow to be practical. An approach I used in the past to overcome this issue is to randomly sample the data, but it often makes sense to sample in a way that yields the best model, e.g., by sampling more instances from classes that are harder to classify.

One similarity to supervised machine learning that I encountered was the danger of overfitting. Due to the relatively small number of photos and the fact that I had to view each one of them multiple times, I found that in some cases I memorised the entire photo rather than the creature. This was especially the case with low-quality photos or ones that were missing key features. My regularisation approach consisted of trying to memorise the descriptions from the book, and collecting more photos. I wish more algorithms were this self-conscious about overfitting!

Can’t this be automated?

While doing surveys and studying species, I kept asking myself whether the whole thing can be automated. Thanks to deep learning, computers have recently gotten very good at classifying images, sometimes outperforming humans. It seems likely that at some point the survey methodology would be changed to just taking a video of the dive, and letting an algorithm do the hard job of identifying the creatures. Analysis of the bottom photos is automated, so it is reasonable to automate the other survey methods as well. However, there are quite a few challenges that need to be overcome before full automation can be implemented.

If the results of the LifeCLEF 2015 Fish Task are any indication, we are quite far from automating fish identification. The precision of the top methods in that challenge was around 80% for identifying 15 fish species from underwater videos, where the chosen species are quite distinct from each other. In tropical surveys it is not uncommon to record around 100 fish species along the 50 metre transect, with many species being similar to each other. It’s usually the case that it’s not same species on every dive (even at the same site), so replacing humans would require training a highly accurate classifier on thousands of species.

Dealing with high diversity isn’t the only challenge in automating RLS. The appearance of many species varies by gender and age, so the classifier would have to learn all those variations (see image below for an example). Getting good training data can be very challenging, since the labelling process is labour-intensive, and elements like colour and backscatter are highly dependent on dive site conditions and the quality of the camera. Another complication is that RLS data includes size estimates, which can be hard to obtain from videos and photos without knowing how far the camera was from the subject and the type of lens used. In addition, accounting for side information (geolocation, behaviour, depth, etc.) can make a huge difference in accurately identifying species, but it isn’t easy to integrate with some learning models. Finally, it is likely that some species will be missed when videos are taken without any identification done underwater, because RLSers tend to get good photos of species that they know will be hard to identify, even if it means spending more time at one spot or shining strobes under ledges.

Chlorurus sordidus variations

Chlorurus sordidus variations. Source: Tropical Marine Fishes of Australia

Another aspect of automating surveys is completely removing the need for human divers by sending robots down. This is an active research area, and is the only way of surveying deep waters. However, this approach still requires a boat-based crew to deploy the robots. It may also yield different data from RLS for cryptic species, though this depends on the type of robots used. In addition, there’s the issue of cost – RLS relies on volunteer scuba divers who are diving anyway, so the cost of getting RLSers to do surveys is rather low (especially for shore dives near a diver’s home, where there is no cost to RLS). Further, RLS’s mission is “to inspire and engage a global volunteer community to survey reefs using scientific methods and share knowledge about marine ecosystem health”. Engaging the community is a crucial part of RLS because robots do not care about the environment. Humans do.

Small data is valuable

When compared to datasets commonly encountered online, RLS data is small. As the image below shows, fewer than 10,000 surveys have been conducted to date. However, this data is still valuable, as it provides a high-quality snapshot of the state of marine ecosystems in areas that wouldn’t be surveyed if it wasn’t for RLS volunteers. For example, in a recent Nature article, the authors used RLS data to assess the vulnerability of marine fauna to global warming.

RLS surveys by Australian financial year (July-June)

RLS surveys by Australian financial year (July-June). Source: RLS Foundation Annual Report 2015

Each RLS survey requires several hours of work. In addition to performing the survey itself, a lot of work goes into entering the data and verifying its quality. Getting to the survey sites is not always a trivial task, especially for remote sites such as some of those we dived on my recent trip. Spending a month diving the Great Barrier Reef is a good way of appreciating its greatness. As the map shows, the surveys we did covered only the top part of the reef’s 2300 kilometres, and we only sampled a few sites within that part. The Great Barrier Reef is very vast, and it is hard to convey its vastness with just words or a map. You have to be there to understand – it is quite humbling.

In summary, the RLS experience has given me a new appreciation for small data in the offline world. Offline data collection is often expensive and labour-intensive – you need to work hard to produce a few high-quality data points. But the size of your data doesn’t matter (though having more quality data is always good). What really matters is what you do with the data – and the RLS team and their collaborators have been doing quite a lot. The RLS experience also illustrates the importance of domain expertise: I’ve looked at the RLS datasets, but I have no idea what questions are worth asking and answering using those datasets. The RLS project is yet another example of how in science collecting data is time-consuming, and coming up with appropriate research questions is hard. It is a lot of fun, though.

My divestment from fossil fuels

This post covers recent choices I’ve made to reduce my exposure to fossil fuels, including practical steps that can be taken by Australians and generally applicable lessons.

I recently read Naomi Klein’s This Changes Everything, which deeply influenced me. The book describes how the world has been dragging its feet when it comes to reducing carbon emissions, and how we are coming very close to a point where climate change is likely to spin out of control. While many of the facts presented in the book can be very depressing, one ray of light is that it is still not too late to act. There are still things we can do to avoid catastrophic climate change.

One such thing is divestment from fossil fuels. Fossil fuel companies have committed to extracting (and therefore burning) more than what scientists agree is the safe amount of carbon that can be pumped into the atmosphere. While governments have been rather ineffective in stopping this (the current Australian government is even embarrassingly rolling back emission-reduction measures), divesting your money from such companies can help take away the social licence of these companies to do as they please. Further, this may be a smart investment strategy because the world is moving towards renewable energy. Indeed, according to one index, investors who divested from fossil fuels have had higher returns than conventional investors over the last five years.

It’s worth noting that even if you disagree with the scientific consensus that releasing billions of tonnes of greenhouse gases into the atmosphere increases the likelihood of climate change, you should agree that it’d be better to stop breathing all the pollutants that result from burning fossil fuels. Further, the environmental damage that comes with extracting fossil fuels is something worth avoiding. Examples include the Deepwater Horizon oil spill, numerous cases of poisoned water due to fracking, and the potential damage to the Great Barrier Reef due to coal mine expansion. Even climate change deniers would admit that divestment from fossil fuels and a rapid move to clean renewables will prevent such disasters.

The rest of this post describes steps I’ve recently taken towards divesting from fossil fuels. These are mostly relevant to Australians, though other countries may have similar options.


In Australia, we have compulsory superannuation (commonly known as super), meaning that most working Australians have some money invested somewhere. As this money is only available at retirement, investors can afford to optimise for long-term returns. Many super funds allow investors to choose what to invest in, and switching funds is relatively straightforward. My super fund is UniSuper. Last week, I switched my plan from Balanced, which includes investments in coal miners Rio Tinto and BHP Billiton, to 75% Sustainable Balanced, which doesn’t directly invest in fossil fuels, and 25% Global Environment Opportunities, which is focused on companies with a green agenda such as Tesla. This switch was very simple – I wish I had done it earlier. If you’re interested in making a similar switch, check out Superswitch’s guide to fossil-free super options.


While our previous energy retailer (ClickEnergy) isn’t one of the big three retailers who are actively lobbying the government to reduce the renewable energy target for 2020, my partner and I decided to switch to Powershop, as it appears to be the greenest energy retailer in New South Wales. Powershop supports maintaining the renewable energy target in its current form and provides free carbon offsets for all non-renewable energy. In addition, Powershop allows customers to purchase 100% green power from renewables – an option that we choose to take. With the savings from moving to Powershop and the extra payment for green power, our bill is expected to be more or less the same as before. Everyone wins!

Note: If you live in New South Wales or Victoria and generally support what GetUp is doing, you can sign up via the links on this page, and GetUp will be paid a referral fee by Powershop.


There’s been a lot of focus recently on financing provided by the major banks to fossil fuel companies. The problem is that – unlike with super and energy – there aren’t many viable alternatives to the big banks. Reading the statements by smaller banks and credit unions, it is clear that they don’t provide financing to polluters just because they’re too small or not focused on commercial lending. Further, some of the smaller banks invest their money with the bigger banks. If the smaller banks were to become big due to the divestment movement, they may end up financing polluters. Unfortunately, changing your bank doesn’t give you more control over how your chosen financial institute uses your money.

For now, I think it makes sense to push the banks to become fossil free by putting them on notice or participating in demonstrations. With enough pressure, one of the big banks may make a strong statement against lending to polluters, and then it’ll be time to act on the notices. One thing that the big banks care about is customer satisfaction and public image. Sending a strong message about the connection between financing polluters and satisfaction may be enough to make a difference. I’ll be tracking news in this area and will possibly make a switch in the future, depending on how things evolve.


My top transportation choices are cycling and public transport, followed by driving when the former two are highly inconvenient (e.g., when going scuba diving). Every bike ride means less pollution and is a vote against fossil fuels. Further, bike riding is my main form of exercise, so I don’t need to set aside time to go to the gym. Finally, it’s almost free, and it’s also the fastest way of getting to the city from where I live.

Since January, I’ve been allowing people to borrow my car through Car Next Door. This service, which is currently active in Sydney and Melbourne, allows people to hire their neighbours’ cars, thereby reducing the number of cars on the road. They also carbon offset all the rides taken through the service. While making my car available has made using it slightly less convenient (because I need to book it for myself), it’s also saved me money, so far covering the cost of insurance and roadside assistance. With my car sitting idle for 95% of the time before joining Car Next Door, it’s definitely another win-win situation. If you’d like to join Car Next Door as either a borrower or an owner, you can use this link to get $15 credit.

Other areas and next steps

Many of the choices we make every day have the power to reduce energy demand. These choices often make our life better, as seen with the bike riding example above. There’s a lot of material online about these green choices, which I may cover from my angle in another post. In general, I’m planning to be more active in the area of environmentalism. While this may come at the cost of reduced focus on my other activities, I would rather be more a part of the solution than a part of the problem. I’ll update as I go – please subscribe to get notified when updates occur.