In this notebook, we are going to take a closer look at the data. Let us begin by loading everything in.
import librosa
import pandas as pd
import numpy as np
from IPython.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
from scipy import signal
anno = pd.read_pickle('data/annotations.pkl.gzip')
anno.head()
The annotations dataframe contains the extracted calls in the call column. All of the calls have been recorded with a sample rate of 44.1 kHz.
SAMPLE_RATE = 44100
In total, there are 10,575 calls in this dataset.
anno.shape
Columns other than the call column contain metadata information as well as labels.
All the metadata from the original annotation file is retained in case it is helpful to refer back to the original file, listen to the surrounding audio, and so on.
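To see exactly which columns are available (the names and dtypes come from the dataframe itself, so this is just a quick inspection step):
anno.dtypes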
Here is the distribution of the call durations.
call_durations = anno.call.apply(lambda x: x.shape[0] / SAMPLE_RATE)
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations);
Nearly 95% of all calls are under one second long, with a couple of very long outliers.
np.mean(call_durations <= 1)
Let's take a closer look at calls that are under 1 second long.
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations[call_durations < 1]);
The histogram reveals that most of the calls (83.5%) are under 0.5 seconds long.
np.mean(call_durations <= 0.5)
Here are four columns that naturally lend themselves to be used as labels:
- caller_sex
- caller_age
- caller_id
- vocal_type

Here are the values these variables can take.
anno['caller_sex'].value_counts().plot(kind='barh', title='Total count by caller_sex')
The vast majority of callers are male, with a very small proportion of individuals who couldn't be identified.
Here is the breakdown of the other column types.
fig, axes = plt.subplots(3,1, figsize=(4, 14))
for label, ax in zip(['caller_age', 'caller_id', 'vocal_type'], axes.flat):
    anno[label].value_counts().plot(kind='barh', title=f'Total count by {label}', ax=ax)
All of these label distributions are imbalanced. The caller_id column in particular seems to have a high number of unique values, so let's investigate further.
anno['caller_id'].value_counts()
There are 90 unique labels in total, including several unknown individuals. This information can be used for modelling but will require further processing.
The downstream task will dictate which data processing steps are applicable here; possible strategies include picking just a subset of classes and merging lower-count classes (for instance, all unknowns could be merged into a single class), as sketched below.
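A minimal sketch of the merging strategy, assuming caller_id is a plain string column (the cut-off of 20 and the merged label name are illustrative choices, not part of the dataset):

counts = anno['caller_id'].value_counts()
rare = counts[counts < 20].index  # illustrative cut-off for "low count"
# keep common classes as-is, collapse everything rare into one bucket
anno['caller_id_merged'] = anno['caller_id'].where(
    ~anno['caller_id'].isin(rare), other='other/unknown'
)
anno['caller_id_merged'].value_counts()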
To get a sense of the variability of the data, let's listen to the longest and shortest calls in the dataset.
Let's start with the longest one, whose positional index we can find with argmax.
anno.call.apply(lambda x: x.size).argmax()
anno.iloc[550]
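To play it back in the notebook, we can hand the raw samples to the Audio widget imported earlier (this assumes the call column holds 1-D sample arrays, as described above):

Audio(anno.iloc[550].call, rate=SAMPLE_RATE)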
And here is the shortest call.
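The lookup follows the same pattern, with argmin in place of argmax:

shortest_idx = anno.call.apply(lambda x: x.size).argmin()
Audio(anno.iloc[shortest_idx].call, rate=SAMPLE_RATE)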
We can also listen to an example of each call type.
from IPython.display import display, HTML
anno = anno.sample(frac=1)
idx_and_vocal_type = [(idx, row.vocal_type) for (idx, row) in anno.groupby('vocal_type').sample(n=1).iterrows()]
for idx, vocal_type in idx_and_vocal_type:
    display(HTML(f'''
        {vocal_type}
        <audio style="display: block"
               controls
               src="assets/{idx}.wav">
            Your browser does not support the
            <code>audio</code> element.
        </audio>
    '''))
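The snippet above assumes each sampled call has already been exported to assets/<idx>.wav. A minimal sketch of that export step using the soundfile library (the choice of library is an assumption; any WAV writer would do):

import soundfile as sf

for idx, _ in idx_and_vocal_type:
    sf.write(f'assets/{idx}.wav', anno.loc[idx].call, SAMPLE_RATE)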
Here is what these calls look like depicted on a spectrogram.
fig, subplots = plt.subplots(5, 4, figsize=(20, 30))
for (idx, row), ax in zip(anno.groupby('vocal_type').sample(n=1).iterrows(), subplots.flat):
    freqs, times, Sx = signal.spectrogram(row.call, fs=SAMPLE_RATE)
    ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx + 1e-9), cmap='viridis', shading='auto')
    ax.set_ylabel('Frequency [kHz]')
    ax.set_xlabel('Time [s]')
    ax.set_title(row.vocal_type)
subplots[-1, -1].axis('off');
This dataset could lend itself very well to a CPP study. It consists of over 10k calls (a good amount of data) and the recordings are very clear.
One source of difficulty could be that some recordings might contain calls from more than a single individual (for instance, the 'whee inhale' call you can listen to above), though I am not sure whether this is the case. If we have a chance, it might be useful to confirm with the biologist how these calls were annotated and whether each call originates from a single individual.