Imagining undiscovered species with neural networks

TLDR: Neural networks are cool. I trained a recurrent neural network on a list of around 25000 species names and made it generate its own, then built a Twitter bot that tweets one out every hour. The results are kinda funny.

I’ve always been interested in the application of artificial intelligence techniques to ecology. There’s huge potential and some very low hanging fruit in the use of machine learning to make predictions about species distribution and abundance, as well as a whole list of other things like evolutionary processes and taxonomic classification. Generative neural networks can sometimes also produce some surprising and funny results. I see it as a form of uncanny valley, where the results produced are similar enough to be plausible, but strange enough to cause a moment of cognitive disconnect. Koans in AI generated text. These texts give insight into our own lingual and syntactic abilities. What is normal language and what are the rules by which we produce and recognise it? Why is it so funny when those rules are broken?

I was recently inspired and amused after coming across an article on Janelle Shane’s blog, in which she lists recipe names generated by a recurrent neural network trained on a corpus of about 30 000 real recipe titles. Janelle has also turned her neural network to several other text sources, including Irish folk songs, knock knock jokes, Pokemon names, and full recipes themselves, all with hilarious results. I decided that I would take up the torch (you’ll get the pun in a minute) and have a go at producing a neural network capable of generating plausible species names.

Firstly, I needed to gather a training dataset. In order to broaden the appeal and increase the comedic value of the output, I decided producing common names for each species was important, so that meant I needed a dataset of species that had common names. This was a little hard to find. In the end I cobbled together lists of around 12600 animals and 14700 plants from various online datasources including the Atlas of Living Australia and the Global Biodiversity Information Facility. For each species I included the family name, the species binomial, and a list of common names. I kept the plants and the animals separate. The animals dataset was quite heavy in marine creatures, which is visible in the output – the network generates a lot of eels, fish, and crabs.

So how’d I actually do it? And what is ‘training’ a neural network? There’s a very good explanation of how recurrent neural networks work here, although that may be a little technical for most. The simple explanation (and this is about the extent to which I understand it properly, so feel free to update or add to my knowledge in the comments) is this: The neural network looks at each character in the source text at a time, and it makes a guess about what character will come next based on all the previous characters that it has read. Then it checks that next character, and updates its model according to whether it guessed correctly or not. The network doesn’t know anything about english; it doesn’t know anything about the subject matter; it just sees each character as a vector within a probabilistic space and it builds a model around those probabilities. How do you actually train it? That part is pretty simple, thanks to the great tools that have been built in this space over the last few years.

I used torch-rnn, a recurrent neural network package for the Torch (there’s the pun!) scientific computing framework. I followed the excellent guide here, and while I had to dive into Github issues a couple of times to solve installation glitches and even had to modify the source code to run on my machine, I got it up and going within an hour or so. Training took a while on my GPU-less MacBook Air – around 12 hours each for the animal and plant datasets. At the end of that process I issued commands that asked the neural network to generate sets of species names based on the animal and plant models it had developed. The output of these went into a text file ready for my Twitter bot to tweet. I generated enough to keep the bot going for around a year at one tweet per hour.

The results don’t quite have the comedic value of Janelle Shane’s recipe titles, but biologists might find them amusing, and it is really interesting how the AI has learned many of the rules of species naming – that plant families should end with -aceae, and animals with -dae, and that species names should have a ‘latin’ feel to them. In many cases it even used real family names, and sometimes genus names, I guess because there are few enough of them that it could learn that the whole word was commonly used. It learned that species names should be in two parts, and that common names often include hyphens, possessives, and terms like ‘weed’, or might end with ‘fish’ or ‘rose’. Of course, there are many times it gets those things wrong too – sometimes producing a family-like name in place of a species name, which resulted in the bot tweeting a species-like name in place of the common names, or else just combining things in some unrecognisable fashion.

The bot itself is pretty standard; using the Tweepy library really makes it easy to set up a Twitter bot. It runs on my Raspberry Pi, and is triggered by a cron job every hour.

Follow Undiscovered Species on Twitter to keep up with the names.

Heatmap of Threatened Plant Species of Australia

Using data from the Atlas of Living Australia and tools from Mapbox, I created a heatmap of observations of threatened plant species in Australia.

Methods

Preparing the data

I accessed the ALA’s excellent web services API to get the data on threatened plant species observations. I wrote two python scripts to gather this data; the first got the GUIDs (a unique ID) of each plant species that had a Commonwealth conservation status of Rare, Vulnerable, Endangered, or Critically Endangered. Once I had all those GUIDs (around 4000 of them), I ran the second python script which queried the API for all observation records for each GUID (sorry for the server hit, ALA!). Once I had all those observations, I simply stripped the location coordinates out of them, as I didn’t need to know anything more about them for this project, and then wrote the coordinates to a csv file. The result was 11086 coordinates.

Making the map

I loaded up QGIS and imported the csv of coordinates. Following these instructions, I built a heatmap from the coordinate points using a 100 km radius (meaning that the map shows numbers of other records within 100 km) and the Triweight kernel shape. I used a cell size of 0.1 map units (ie, degrees, since I was running this project in WGS84), which I figured would give good enough spatial resolution while keeping file size reasonable for upload. Many of the records I was working with were generalised to 0.1 degrees anyway, in order to protect the exact locations of the conservation-dependent plant species. In order to get the map onto the web, I used Mapbox’s TileMill software. I exported the heatmap from QGIS, reprojecting the image into Google Mercator projection (900913) to make it display properly in TileMill. From TileMill, I uploaded the map into my Mapbox account – and here it is.

Results and Discussion

The Map

Explanation

The legend doesn’t show up in the embedded map, but you can see it in the full map. Here’s an explanation of what the colours actually mean. The numbers displayed in the legend are increments from 0 – 5.69 (roughly). The values refer to the number of records of threatened plant species observations within a 100 km radius of any spot. In the red sections, there were no other records within 100 km (ie there was only one record). In the blue sections, there were at least five other records within 100 km of the cell. The numbers were calculated based on the estimate cumulative cut of the full extent of the map (which may have been the wrong method to use – see below), using a cut value of 2 – 98%.

Discussion

The pattern shown is that the known biodiversity hotspots tend to be blue coloured. This explains the blue in the south west corner, in the area between Melbourne and Adelaide, in Tasmania, and in central Queensland. That’s not a surprise. But this analysis was based on numbers of records, not necessarily numbers of species, and it was based on threatened species only, so there’s some other factors that affect the appearance of the map apart from biodiversity.

The first is survey effort. Areas closest to populated parts of the continent are, by default, likely to be more frequently visited by ecologists, recording observations of species. This explains the heavy blue colour up the east coast, where most of Australia’s population is concentrated. This factor may also explain the high values around Alice Springs in central Australia, Darwin, and Townsville in Queensland, which are not known as biodiversity hotspots, but where research institutions such as herbaria and universities are based, which allow increased recording of data. Survey effort (or lack thereof) may also explain the lack of many records in some areas that are known biodiversity hotspots – for instance, the Kimberley in WA, and to a lesser extent the Pilbara – which does have a faint orange/yellow colour, and which has been relatively highly surveyed due to environmental consultants carrying out environmental impact assessment surveys for the booming mining industry over the past ten years. For that reason, my expectation was that there would be more records in the Pilbara – but there are only a couple of Commonwealth-listed threatened species from the area, which probably explains it. High biodiversity doesn’t necessarily equate to large numbers of threatened species, either for entirely natural reasons, or because of delays in the reporting of data (which are quite likely to be a major contributing factor).

The final factor that may complicate things is that this study looked at threatened species. Therefore, it is likely that there will be more records in areas that have been highly disturbed, such as urban and agricultural areas, because due to Australia’s high rate of endemism, there are many species that only occur within small geographic areas, and when those areas have been heavily modified, it’s likely that those species will have become threatened. Again though, this theory fails to explain the lack of records in the Pilbara, an area that has been heavily disturbed by mining and grazing.

Limitations

Map colours

I wasn’t sure if the method I used to colour the raster image was the most appropriate. The main problem with this method is it fails to discriminate between pixels with higher values. The maximum value present in the raster was 81.977, which is considerably higher than five, which is the value at which the colours stop changing. This means that, although there are relatively few data points at these high levels (the mean value was 1.263, and standard deviation 4.763), the large range is all squished into that one colour. This could potentially hide areas of unusually high threatened species records.

In order to test this, I recoloured the map using the maximum and minimum values rather than the 2-98% cut, and using the actual, rather than estimated values, which takes a little longer (although the difference is negligible for this map), but results in the true (higher) value for the maximum. I also changed the colour increments from continuous (which defaulted to five colour classes) to incremental, and manually specified ten colour classes. The result looked like this:

Result of recolouring raster image with maximum value and ten colour classes
Result of recolouring raster image with maximum value and ten colour classes

As you can see, this is effective at highlighting the areas of really high observation numbers, however it too has a downside – the vast majority of pixels are now classified at the lowest levels, which means the main body of the variation in the map is invisible.

One solution to this problem is to manually modify the colour breakdown to produce the most visually clear and expressive map. I played around with this, and managed to develop a map which appeared to differentiate between the large number of pixels at the lower end of the spectrum, while allowing those few pixels at the very high end to stand out as highlights. The only downside to this is that it’s a very subjective process. I wanted the map coloration to have a clear mathematical relationship with the data, even if that meant losing a little bit of detail at the top end. For that reason, I stuck with the original method.

Why does Tasmania look so weird?

I don’t know. I think it must be a problem in the process of TileMill creating png map tiles out of the GeoTIFF raster. It’s not present in the raster in QGIS, and it’s not visible at all zoom scales in the TileMill map.