Data Exploration

In this notebook, we are going to take a closer look at the data. Let us begin by loading everything in.

In [64]:
import librosa
import pandas as pd
import numpy as np
from IPython.lib.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
from scipy import signal

anno = pd.read_pickle('data/annotations.pkl.gzip')
In [2]:
anno.head()
Out[2]:
wav_file caller_sex caller_age caller_id vocal_type vocal_onset vocal_offset focal_sample focal_date focal_time focal_male wav_time wav_state call
0 MLG0001_call01 male adult imp exhaled contact grunt 0.978 1.303 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [-0.517395, -0.53131104, -0.532074, -0.5234985...
1 MLG0001_call01 male adult imp exhaled contact grunt 1.668 1.964 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [0.18444824, 0.18887329, 0.1809082, 0.16281128...
2 MLG0001_call01 male adult imp exhaled contact grunt 2.522 2.818 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [0.6281433, 0.62161255, 0.6202698, 0.62142944,...
3 MLG0001_call01 male adult imp exhaled contact grunt 3.204 3.465 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [0.4548645, 0.45187378, 0.45358276, 0.45529175...
4 MLG0001_call01 female adult chu scream (submissive) 3.413 4.449 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [-0.08654785, -0.110443115, -0.1361084, -0.161...

The annotations dataframe contains the extracted calls in the call column. All of the calls were recorded at a sample rate of 44.1 kHz (the SAMPLE_RATE constant below).

In [3]:
SAMPLE_RATE = 44100

There are a total of 10,575 calls in this dataset.

In [4]:
anno.shape
Out[4]:
(10575, 14)

Columns other than the call column contain metadata as well as labels.

All the metadata from the original annotation file is retained in case it is helpful to refer back to the original file, look at the surrounding audio, and so on.
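For example, the surrounding audio could be pulled from the original recording using the vocal_onset and vocal_offset metadata. A minimal sketch, assuming the original recordings live under a hypothetical data/wav/ directory:

In [ ]:
row = anno.iloc[0]
# Hypothetical path layout - adjust to wherever the original wav files live
wav_path = f'data/wav/{row.wav_file}.wav'
# Load the call plus one second of context on either side, at the native rate
context, sr = librosa.load(
    wav_path,
    sr=None,
    offset=max(row.vocal_onset - 1.0, 0),
    duration=(row.vocal_offset - row.vocal_onset) + 2.0,
)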

Here is the distribution of the call durations.

In [5]:
call_durations = anno.call.apply(lambda x: x.shape[0] / SAMPLE_RATE)
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations);

Nearly 95% of all calls are under one second long, with a couple of very long outliers.

In [6]:
np.mean(call_durations <= 1)
Out[6]:
0.9463829787234043
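The tail of the distribution can be inspected directly. A quick sketch listing the five longest durations (nlargest is a standard pandas method):

In [ ]:
# The handful of very long outliers, in seconds
call_durations.nlargest(5)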

Let's take a closer look at calls that are under 1 second long.

In [7]:
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations[call_durations < 1]);

The histogram reveals that most of the calls (83.5%) are under 0.5 seconds long.

In [8]:
np.mean(call_durations <= 0.5)
Out[8]:
0.8347990543735224

Four columns naturally lend themselves to use as labels:

  • caller_sex
  • caller_age
  • caller_id
  • vocal_type

Here are the values these variables can take.

In [9]:
anno['caller_sex'].value_counts().plot(kind='barh', title='Total count by caller_sex')
Out[9]:
<AxesSubplot:title={'center':'Total count by caller_sex'}>

The vast majority of calls come from male callers, with a very small proportion from individuals whose sex could not be identified.

Here is the breakdown for the other label columns.

In [10]:
fig, axes = plt.subplots(3,1, figsize=(4, 14))

for label, ax in zip(['caller_age', 'caller_id', 'vocal_type'], axes.flat):
    anno[label].value_counts().plot(kind='barh', title=f'Total count by {label}', ax=ax)

All of these class distributions are imbalanced.

There is a high number of unique caller_ids - let's investigate further.

In [11]:
anno['caller_id'].value_counts()
Out[11]:
dev         1078
len          936
unknown1     896
die          638
wal          615
            ... 
bim            2
mar            2
unknown4       2
bro            1
aut            1
Name: caller_id, Length: 90, dtype: int64

There are 90 unique caller IDs in total, including several labels for unknown individuals. This information can be used for modelling but will require further processing.

The downstream task will dictate which data processing steps are applicable here - possible strategies include picking just a subset of classes and merging low-count classes (for instance, all unknowns could be merged into a single class, as sketched below).
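Here is a minimal sketch of the merging strategy, assuming the unknown individuals are labelled unknown1, unknown2, and so on:

In [ ]:
# Collapse all unknown* labels into a single 'unknown' class
caller_id_merged = anno['caller_id'].where(
    ~anno['caller_id'].str.startswith('unknown'), 'unknown')
caller_id_merged.value_counts().head()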

To get a sense of the variability of the data, let's listen to the longest and shortest calls in the dataset.

Let's start with the longest one.

In [61]:
anno.call.apply(lambda x: x.size).argmax()
Out[61]:
550
In [62]:
anno.iloc[550]
Out[62]:
wav_file                                           MLG0122_call03
caller_sex                                                   male
caller_age                                                  adult
caller_id                                                     len
vocal_type                                           exhaled moan
vocal_onset                                                 0.505
vocal_offset                                                 5.23
focal_sample                                              MLG0122
focal_date                                    2014-03-31 00:00:00
focal_time                                               08:24:47
focal_male                                                    len
wav_time                                                 08:32:31
wav_state                                                    rest
call            [-0.021240234, -0.020202637, -0.019134521, -0....
Name: 4605, dtype: object
And here is how it sounds.

In [12]:
# Play back the longest call (`Audio` was imported at the top)
Audio(anno.iloc[550].call, rate=SAMPLE_RATE)
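For comparison, here is a minimal sketch that finds and plays the shortest call, mirroring the argmax lookup above:

In [ ]:
# Position of the shortest call, then play it back
shortest_pos = anno.call.apply(lambda x: x.size).argmin()
Audio(anno.iloc[shortest_pos].call, rate=SAMPLE_RATE)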

We can also listen to an example of each call type.

In [53]:
from IPython.display import display, HTML

# Shuffle, then sample one example call per vocal type
anno = anno.sample(frac=1)
idx_and_vocal_type = [
    (idx, row.vocal_type)
    for idx, row in anno.groupby('vocal_type').sample(n=1).iterrows()
]
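The playback cell below references wav files under assets/, which are assumed to have been exported beforehand. A minimal export sketch using the soundfile package (an assumption; any wav writer would do):

In [ ]:
import soundfile as sf

for idx, _ in idx_and_vocal_type:
    # One file per sampled call, named after its dataframe index
    sf.write(f'assets/{idx}.wav', np.asarray(anno.loc[idx].call), SAMPLE_RATE)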
In [57]:
for idx, vocal_type in idx_and_vocal_type:
    # Each sampled call is assumed to exist as assets/<index>.wav (see the export sketch above)
    display(HTML(f'''
        {vocal_type}
        <audio style="display: block"
        controls
        src="assets/{idx}.wav">
            Your browser does not support the
            <code>audio</code> element.
        </audio>
        '''))
baby grunt
copulation call
display call
exhaled contact grunt
exhaled moan
exhaled wobble
fear bark (submissive)
first inhaled part of a vy
infant squeal
inhaled contact grunt
inhaled moan
inhaled wobble
pre-copulation grunt
scream (submissive)
second exhaled part of a vy
threat grunt
unknown
vocal yawn
whee inhale

Here is what these calls look like on spectrograms.

In [29]:
fig, subplots = plt.subplots(5, 4, figsize=(20, 30))

for (idx, row), ax in zip(anno.groupby('vocal_type').sample(n=1).iterrows(), subplots.flat):
    freqs, times, Sx = signal.spectrogram(row.call, fs=SAMPLE_RATE)
    ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx + 1e-9), cmap='viridis', shading='auto')
    ax.set_ylabel('Frequency [kHz]')
    ax.set_xlabel('Time [s]')
    ax.set_title(row.vocal_type)

# 19 call types on a 5x4 grid - hide the unused last axis
subplots[-1, -1].axis('off');
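Since librosa is already imported, the same calls could also be viewed on a log-mel scale, which often suits vocalisation data. A minimal sketch for a single call:

In [ ]:
call = np.asarray(anno.iloc[0].call)
# Log-mel spectrogram of one call
mel = librosa.feature.melspectrogram(y=call, sr=SAMPLE_RATE)
plt.imshow(librosa.power_to_db(mel), aspect='auto', origin='lower')
plt.xlabel('Frame')
plt.ylabel('Mel band');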

CPP suitability analysis

This dataset could lend itself very well to a CPP study: it contains over 10k calls (a good amount of data) and the recordings are very clear.

One source of difficulty could be that some recordings contain calls from more than one individual (for instance, the 'whee inhale' call you can listen to above), though I am not sure whether this is the case. If we have a chance, it would be useful to confirm with the biologists how these calls were annotated and whether each annotated call originates from a single individual.