In this notebook, we are going to take a closer look at the data. Let us begin by loading everything in.
import librosa
import pandas as pd
import numpy as np
from IPython.lib.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
from scipy import signal
There are several files that we downloaded.
ls data/
anno.pkl is a pandas.DataFrame containing the annotations. The audio folder contains the audio files.
Additionally, train_tp.csv and train_fp.csv contain bounding boxes around the vocalization on a spectrogram. The intent is that this additional data can augment training (the tp file consisting of true positives, the fp file of false positives).
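To make the bounding-box idea concrete, here is a minimal, self-contained sketch. The column names (`recording_id`, `t_min`, `t_max`, `f_min`, `f_max`) are an assumption about the csv layout, and the rows are made up for illustration — the real files should be inspected with `pd.read_csv` to confirm.

```python
import pandas as pd

# Hypothetical rows mimicking the assumed layout of train_tp.csv:
# each row is a bounding box (time and frequency extent) for one call.
tp = pd.DataFrame({
    'recording_id': ['a', 'a', 'b'],
    'species':      ['sp1', 'sp2', 'sp1'],
    't_min': [12.5, 30.0, 5.0],   # call start, seconds
    't_max': [14.0, 33.5, 6.2],   # call end, seconds
    'f_min': [2000, 500, 2100],   # lower frequency bound, Hz
    'f_max': [4500, 1500, 4400],  # upper frequency bound, Hz
})

# How long is each annotated call?
tp['duration'] = tp['t_max'] - tp['t_min']
print(tp[['recording_id', 'species', 'duration']])
```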
In the most common setup, however, what we want are labels at the per-recording level, listing the species that can be heard in each recording. These live in anno.pkl.
anno = pd.read_pickle('data/anno.pkl')
anno.head()
unique_species = np.unique(anno.species.sum())
There are 24 unique species present in the dataset.
unique_species.shape[0]
Overall, the number of times each species appears in the dataset is roughly balanced. The least common species, margarops_fuscatus, appears in only 34 recordings; the most common, spindalis_portoricensis, appears in 90. Most species appear exactly 50 times.
from collections import Counter
Counter(anno.species.sum())
There are 4727 recordings in total across which these vocalizations occur.
!ls data/audio/*.flac | wc -l
%%time
from pathlib import Path
durations, srs = [], []
for audio_file in Path('data/audio').iterdir():
    x, sr = librosa.load(audio_file, sr=None, mono=False)
    # index the last axis: with mono=False, multichannel audio comes back as (channels, samples)
    durations.append(x.shape[-1] / sr)
    srs.append(sr)
set(durations), set(srs)
All the audio files are exactly 1 minute long and were recorded with a 48 kHz sample rate.
SAMPLE_RATE = 48000
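A small caveat worth spelling out: with `mono=False`, librosa returns a 1-D array for mono files but a `(channels, samples)` array for multichannel ones, so the samples always live on the last axis. A tiny helper makes the duration computation safe either way (synthetic arrays here, not dataset files):

```python
import numpy as np

def duration_seconds(x, sr):
    """Duration of a loaded signal in seconds.

    The last axis is the sample axis for both mono (samples,) and
    multichannel (channels, samples) arrays, so x.shape[-1] is safe
    where x.shape[0] would be wrong for stereo input.
    """
    return x.shape[-1] / sr

mono = np.zeros(48000)          # 1 second of mono audio at 48 kHz
stereo = np.zeros((2, 48000))   # the same second, as stereo

print(duration_seconds(mono, 48000))    # 1.0
print(duration_seconds(stereo, 48000))  # 1.0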
Now, please bear in mind that this is a multilabel classification problem. Many of the examples contain no target species at all.
anno.species.apply(lambda labels: labels == []).mean()
76% of examples contain no target species at all!
Let us look for some interesting files to listen to, to get a better feel for this dataset. We start off by listening to a file containing the ambient sounds, without any of the target species present.
x, sr = librosa.load('data/audio/4071f7aa7.flac', sr=None, mono=False)
freqs, times, Sx = signal.spectrogram(x, fs=sr)
plt.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx+1e-9), cmap='viridis', shading='auto')
plt.ylabel('Frequency [kHz]')
plt.xlabel('Time [s]');
As expected, the tropical soundscape is very busy! We can hear many vocalizations, but not of the species we are after.
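As a sanity check on what `signal.spectrogram` is doing here, a minimal, self-contained sketch with a synthetic tone (the 2 kHz tone is illustrative, not from the dataset). With the default `nperseg=256`, frequency bins are `sr / 256 = 187.5` Hz apart at 48 kHz, which bounds how finely calls can be localized in frequency:

```python
import numpy as np
from scipy import signal

sr = 48000
t = np.arange(sr) / sr                # 1 second of time stamps
tone = np.sin(2 * np.pi * 2000 * t)   # pure 2 kHz tone

# Same call as in the cells above, on the synthetic signal
freqs, times, Sx = signal.spectrogram(tone, fs=sr)

# The strongest frequency bin should land near 2 kHz
peak_freq = freqs[Sx.mean(axis=1).argmax()]
print(peak_freq)
```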
Let's now find a couple of files where there are many species present.
In the following audio file, we can hear the following species:
anno.iloc[3084].species
x, sr = librosa.load('data/audio/c12e0a62b.flac', sr=None, mono=False)
freqs, times, Sx = signal.spectrogram(x, fs=sr)
plt.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx+1e-9), cmap='viridis', shading='auto')
plt.ylabel('Frequency [kHz]')
plt.xlabel('Time [s]');
Let's now listen to a recording where a single one of the target species is present: turdus_plumbeus.
x, sr = librosa.load('data/audio/d59d099b3.flac', sr=None, mono=False)
freqs, times, Sx = signal.spectrogram(x, fs=sr)
plt.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx+1e-9), cmap='viridis', shading='auto')
plt.ylabel('Frequency [kHz]')
plt.xlabel('Time [s]');
Before we go, let's consider the distribution of labels - how many recordings have 0 labels, 1 label, 2 labels, etc.
anno.species.apply(lambda labels: len(labels)).value_counts()
It is an infrequent occurrence for any species to be present, let alone multiple species!
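To make the `value_counts` step concrete, here is the same computation on a toy stand-in for the species column (the lists below are made up, not dataset values):

```python
import pandas as pd

# Toy stand-in for anno.species: one list of species per recording
species = pd.Series([[], [], ['sp1'], ['sp1', 'sp2'], []])

# How many recordings have 0 labels, 1 label, 2 labels, ...
counts = species.apply(len).value_counts().sort_index()
print(counts)   # 3 recordings with 0 labels, 1 with 1, 1 with 2
```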
Using the information contained in the train_tp.csv file, one could construct a database of calls that could be used for a CPP study. One issue with this approach, though, is that we would not be able to disambiguate the identity of the callers - within a species, we wouldn't know whether the same or a different individual is vocalizing.
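Extracting such a call database boils down to converting each bounding box's time bounds into sample indices and slicing the waveform. A minimal sketch on a synthetic signal (the `t_min`/`t_max` values are hypothetical, in the style of a train_tp.csv row):

```python
import numpy as np

sr = 48000
x = np.random.randn(60 * sr)   # stand-in for a 1-minute mono recording

# Hypothetical bounding box from a train_tp-style annotation
t_min, t_max = 12.5, 14.0

# Convert the time bounds to sample indices and cut the call out
start, end = int(t_min * sr), int(t_max * sr)
call = x[start:end]
print(call.shape[0] / sr)   # 1.5 seconds of audio
```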
Another option would be to use the recordings as backgrounds for mixtures, or to train a CPP model in an unsupervised / semi-supervised way.