Data Exploration

In this notebook, we are going to take a closer look at the data. Let us begin by loading everything in.

In [64]:
import librosa
import pandas as pd
import numpy as np
from IPython.lib.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
from scipy import signal

anno = pd.read_pickle('data/annotations.pkl.gzip')
In [2]:
anno.head()
Out[2]:
wav_file caller_sex caller_age caller_id vocal_type vocal_onset vocal_offset focal_sample focal_date focal_time focal_male wav_time wav_state call
0 MLG0001_call01 male adult imp exhaled contact grunt 0.978 1.303 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [-0.517395, -0.53131104, -0.532074, -0.5234985...
1 MLG0001_call01 male adult imp exhaled contact grunt 1.668 1.964 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [0.18444824, 0.18887329, 0.1809082, 0.16281128...
2 MLG0001_call01 male adult imp exhaled contact grunt 2.522 2.818 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [0.6281433, 0.62161255, 0.6202698, 0.62142944,...
3 MLG0001_call01 male adult imp exhaled contact grunt 3.204 3.465 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [0.4548645, 0.45187378, 0.45358276, 0.45529175...
4 MLG0001_call01 female adult chu scream (submissive) 3.413 4.449 MLG0001 2014-01-29 08:43:27 imp 08:52:20 forage [-0.08654785, -0.110443115, -0.1361084, -0.161...

The annotations dataframe contains the extracted calls in the call column. All of the calls were recorded at a sample rate of 44.1 kHz (the SAMPLE_RATE constant below).

In [3]:
SAMPLE_RATE = 44100

There are a total of 10,575 calls in this dataset.

In [4]:
anno.shape
Out[4]:
(10575, 14)

Columns other than the call column contain metadata as well as labels.

All the metadata from the original annotation file is retained in case it is helpful to refer back to the original file, look at the surrounding audio, and so on.
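For example, the surrounding audio could be pulled from the original recording using the vocal_onset and vocal_offset metadata. A minimal sketch, assuming the original recordings live under a hypothetical data/wav/ directory:

In [ ]:
row = anno.iloc[0]
# Hypothetical path layout - adjust to wherever the original wav files live
wav_path = f'data/wav/{row.wav_file}.wav'
# Load the call plus one second of context on either side, at the native rate
context, sr = librosa.load(
    wav_path,
    sr=None,
    offset=max(row.vocal_onset - 1.0, 0),
    duration=(row.vocal_offset - row.vocal_onset) + 2.0,
)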

Here is the distribution of the call durations.

In [5]:
call_durations = anno.call.apply(lambda x: x.shape[0] / SAMPLE_RATE)
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations);

Nearly 95% of all calls are under one second long, with a couple of very long outliers.

In [6]:
np.mean(call_durations <= 1)
Out[6]:
0.9463829787234043
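The tail of the distribution can be inspected directly. A quick sketch listing the five longest durations (nlargest is a standard pandas method):

In [ ]:
# The handful of very long outliers, in seconds
call_durations.nlargest(5)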

Let's take a closer look at calls that are under 1 second long.

In [7]:
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations[call_durations < 1]);

The histogram reveals that most of the calls (83.5%) are under 0.5 seconds long.

In [8]:
np.mean(call_durations <= 0.5)
Out[8]:
0.8347990543735224

Four columns naturally lend themselves to use as labels:

  • caller_sex
  • caller_age
  • caller_id
  • vocal_type

Here are the values these variables can take.

In [9]:
anno['caller_sex'].value_counts().plot(kind='barh', title='Total count by caller_sex')
Out[9]:
<AxesSubplot:title={'center':'Total count by caller_sex'}>

The vast majority of calls come from male callers, with a very small proportion from individuals whose sex could not be identified.

Here is the breakdown for the other label columns.

In [10]:
fig, axes = plt.subplots(3,1, figsize=(4, 14))

for label, ax in zip(['caller_age', 'caller_id', 'vocal_type'], axes.flat):
    anno[label].value_counts().plot(kind='barh', title=f'Total count by {label}', ax=ax)

All of these class distributions are imbalanced.

There is a high number of unique caller_ids - let's investigate further.

In [11]:
anno['caller_id'].value_counts()
Out[11]:
dev         1078
len          936
unknown1     896
die          638
wal          615
            ... 
bim            2
mar            2
unknown4       2
bro            1
aut            1
Name: caller_id, Length: 90, dtype: int64

There are 90 unique caller IDs in total, including several labels for unknown individuals. This information can be used for modelling but will require further processing.

The downstream task will dictate which data processing steps are applicable here - possible strategies include picking just a subset of classes and merging low-count classes (for instance, all unknowns could be merged into a single class, as sketched below).
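Here is a minimal sketch of the merging strategy, assuming the unknown individuals are labelled unknown1, unknown2, and so on:

In [ ]:
# Collapse all unknown* labels into a single 'unknown' class
caller_id_merged = anno['caller_id'].where(
    ~anno['caller_id'].str.startswith('unknown'), 'unknown')
caller_id_merged.value_counts().head()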

To get a sense of the variability of the data, let's listen to the longest and shortest calls in the dataset.

Let's start with the longest one.

In [61]:
anno.call.apply(lambda x: x.size).argmax()
Out[61]:
550
In [62]:
anno.iloc[550]
Out[62]:
wav_file                                           MLG0122_call03
caller_sex                                                   male
caller_age                                                  adult
caller_id                                                     len
vocal_type                                           exhaled moan
vocal_onset                                                 0.505
vocal_offset                                                 5.23
focal_sample                                              MLG0122
focal_date                                    2014-03-31 00:00:00
focal_time                                               08:24:47
focal_male                                                    len
wav_time                                                 08:32:31
wav_state                                                    rest
call            [-0.021240234, -0.020202637, -0.019134521, -0....
Name: 4605, dtype: object
And here is how it sounds.

In [12]:
# Play back the longest call (`Audio` was imported at the top)
Audio(anno.iloc[550].call, rate=SAMPLE_RATE)
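For comparison, here is a minimal sketch that finds and plays the shortest call, mirroring the argmax lookup above:

In [ ]:
# Position of the shortest call, then play it back
shortest_pos = anno.call.apply(lambda x: x.size).argmin()
Audio(anno.iloc[shortest_pos].call, rate=SAMPLE_RATE)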

We can also listen to an example of each call type.

In [53]:
from IPython.display import display, HTML

# Shuffle, then sample one example call per vocal type
anno = anno.sample(frac=1)
idx_and_vocal_type = [
    (idx, row.vocal_type)
    for idx, row in anno.groupby('vocal_type').sample(n=1).iterrows()
]
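The playback cell below references wav files under assets/, which are assumed to have been exported beforehand. A minimal export sketch using the soundfile package (an assumption; any wav writer would do):

In [ ]:
import soundfile as sf

for idx, _ in idx_and_vocal_type:
    # One file per sampled call, named after its dataframe index
    sf.write(f'assets/{idx}.wav', np.asarray(anno.loc[idx].call), SAMPLE_RATE)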
In [57]:
for idx, vocal_type in idx_and_vocal_type:
    # Each sampled call is assumed to exist as assets/<index>.wav (see the export sketch above)
    display(HTML(f'''
        {vocal_type}
        <audio style="display: block"
        controls
        src="assets/{idx}.wav">
            Your browser does not support the
            <code>audio</code> element.
        </audio>
        '''))
baby grunt
copulation call
display call
exhaled contact grunt
exhaled moan
exhaled wobble
fear bark (submissive)
first inhaled part of a vy
infant squeal
inhaled contact grunt
inhaled moan
inhaled wobble
pre-copulation grunt
scream (submissive)
second exhaled part of a vy
threat grunt
unknown
vocal yawn
whee inhale

Here is what these calls look like on spectrograms.

In [29]:
fig, subplots = plt.subplots(5, 4, figsize=(20, 30))

for (idx, row), ax in zip(anno.groupby('vocal_type').sample(n=1).iterrows(), subplots.flat):
    freqs, times, Sx = signal.spectrogram(row.call, fs=SAMPLE_RATE)
    ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx + 1e-9), cmap='viridis', shading='auto')
    ax.set_ylabel('Frequency [kHz]')
    ax.set_xlabel('Time [s]')
    ax.set_title(row.vocal_type)

# 19 call types on a 5x4 grid - hide the unused last axis
subplots[-1, -1].axis('off');
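Since librosa is already imported, the same calls could also be viewed on a log-mel scale, which often suits vocalisation data. A minimal sketch for a single call:

In [ ]:
call = np.asarray(anno.iloc[0].call)
# Log-mel spectrogram of one call
mel = librosa.feature.melspectrogram(y=call, sr=SAMPLE_RATE)
plt.imshow(librosa.power_to_db(mel), aspect='auto', origin='lower')
plt.xlabel('Frame')
plt.ylabel('Mel band');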

CPP suitability analysis

This dataset could lend itself very well to a CPP study: it contains over 10k calls (a good amount of data) and the recordings are very clear.

One source of difficulty could be that some recordings contain calls from more than one individual (for instance, the 'whee inhale' call you can listen to above), though I am not sure whether this is the case. If we have a chance, it would be useful to confirm with the biologists how these calls were annotated and whether each annotated call originates from a single individual.