Data Exploration

In this notebook, we are going to take a closer look at the data. Let us begin by loading everything in.

In [1]:
import librosa
import pandas as pd
import numpy as np
from IPython.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
from scipy import signal

anno = pd.read_pickle('data/annotations.dataframe.pkl.gz')
In [2]:
anno.head()
Out[2]:
channel filename call_duration offset_in_frames duration_in_frames call
0 2 190806140351.wav 0.823104 110469 19754 [9.1552734e-05, 9.1552734e-05, 9.1552734e-05, ...
1 2 190806140351.wav 1.169673 405746 28072 [-9.1552734e-05, -6.1035156e-05, -9.1552734e-0...
2 2 190806140351.wav 1.083031 588216 25992 [0.0016784668, 0.0016479492, 0.0016479492, 0.0...
3 2 190806140351.wav 1.169673 650599 28072 [-0.00024414062, -0.00021362305, -0.0002136230...
4 2 190806140351.wav 1.386280 692186 33270 [0.00061035156, 0.00061035156, 0.00064086914, ...

The annotations dataframe contains the extracted calls in the call column. This dataset does not include any other annotations. All of the calls were recorded at a sample rate of 24 kHz.

In [3]:
SAMPLE_RATE = 24000

There are a total of 17882 calls in this dataset.

In [4]:
anno.shape
Out[4]:
(17882, 6)

The calls are of varying types, and we do not have additional labels for them.

Here is what the distribution of call durations looks like:

In [5]:
call_durations = anno.call.apply(lambda x: x.shape[0] / SAMPLE_RATE)
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations);

Out of the 17882 calls, 16432 (92%) are under two seconds long.

In [6]:
sum(call_durations < 2)
Out[6]:
16432
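
The corresponding fraction can be computed directly from the boolean mask:

In [ ]:
# fraction of calls under two seconds: 16432 / 17882 ≈ 0.92
(call_durations < 2).mean()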

Let us look at the distribution of these shorter calls more closely.

In [7]:
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations[call_durations < 2]);

The vocalizations are extremely varied. Below is a non-exhaustive selection that gives a sense of the richness of this dataset.

The labels are not annotations, but my own qualitative descriptions of the calls.

[audio players: growl-like, bark-like, yawn-like, whistle-like, squeak-like, trumpet-like, elephant-like, parrot-like]
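
Any of these calls can be auditioned straight from the dataframe. A minimal sketch (the row index is an arbitrary choice):

In [ ]:
# play back a single extracted call; index 0 is arbitrary
Audio(anno.call.iloc[0], rate=SAMPLE_RATE)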

There are calls that don't fit clearly into any of the above categories or that are some combination of them.

Here are two examples of unusual calls.

Another challenging aspect of this dataset is that some calls are quite subtle. Below are two such examples.

To get a feel for what a conversation might sound like, below is a 20-second example. A couple of growl-like and squeak-like vocalizations can be heard in succession.
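
A sketch of how such an excerpt could be pulled from one of the source recordings; the path, offset, and channel indexing below are assumptions for illustration:

In [ ]:
# hypothetical path and offset; with mono=False librosa returns
# an array of shape (channels, samples) for multichannel files
y, sr = librosa.load('data/190806140351.wav', sr=SAMPLE_RATE,
                     mono=False, offset=30.0, duration=20.0)
channel = y[1] if y.ndim > 1 else y  # channel 2, assuming 1-indexed channels
Audio(channel, rate=sr)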

Potential issues in working with the dataset

Some calls are recorded against a mechanical background

Here is an example of such a call.

And here are 10 seconds from a recording where this issue exists.

Only a small portion of the dataset is affected but, depending on the downstream tasks, it might be necessary to single out and exclude these calls.
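
If the affected rows were identified, for instance by listening, excluding them is straightforward; the indices below are placeholders:

In [ ]:
# placeholder indices of calls recorded against mechanical background noise
noisy_idx = [0, 1]
anno_clean = anno.drop(index=noisy_idx)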

Some calls might be hard to visualize

Some calls can be clearly heard, but are not easy to visualize on a spectrogram due to being relatively faint.

In [352]:
freqs, times, Sx = signal.spectrogram(anno.call.iloc[1101], fs=SAMPLE_RATE)
f, ax = plt.subplots()
ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx), cmap='viridis')
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');

We can attempt to bring the vocalization to the foreground by adding a small value to the spectrogram before log-scaling the values.

In [354]:
freqs, times, Sx = signal.spectrogram(anno.call.iloc[1101], fs=SAMPLE_RATE)
f, ax = plt.subplots()
ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx+1e-11), cmap='viridis')
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');

Since most of the information is in the lower frequency range, log scaling the frequency axis is also worth considering.

In [478]:
freqs, times, Sx = signal.spectrogram(anno.call.iloc[1101], fs=SAMPLE_RATE)
f, ax = plt.subplots()
ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx + 1e-11), cmap='viridis')
ax.set_yscale('symlog')
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');

Another option might be to reach for the linearly reassigned spectrogram.

In [ ]:
from spectral_hyperresolution.linear_reassignment_pytorch import high_resolution_spectrogram
# this can be installed by running !pip install git+https://github.com/earthspecies/spectral_hyperresolution.git
In [468]:
%%time
q = 1
tdeci = 100
over = 20
noct = 24
minf = 4e-3 # minimum frequency as a fraction of the sample rate; 4e-3 * 24000 = 96 Hz
maxf = 1    # maximum frequency as a fraction of the sample rate

lin_spectrogram = high_resolution_spectrogram(anno.call.iloc[1101].reshape((-1, 1)), q, tdeci, over, noct, minf, maxf, 'cpu')
CPU times: user 38 s, sys: 984 ms, total: 39 s
Wall time: 19.7 s
In [469]:
lin_spectrogram = lin_spectrogram.detach().cpu().numpy().T
In [477]:
freqs_lin = np.linspace(0, 1, num=lin_spectrogram.shape[0]) # dummy values
times_lin = np.linspace(0, anno.call.iloc[1101].shape[0] / SAMPLE_RATE, num=lin_spectrogram.shape[1])

f, ax = plt.subplots()
ax.pcolormesh(times_lin, freqs_lin, 10 * np.log10(lin_spectrogram + 1e-6)[::-1, :], cmap='viridis')
ax.set_ylabel('Frequency [kHz]')
ax.set_yticklabels([])
ax.set_xlabel('Time [s]');

We see more structure, but without further experimentation it is unclear whether this exposes more of the signal or simply visualizes the noise better.

In [485]:
fig, axes = plt.subplots(1, 2, figsize=(10,4))

axes[0].pcolormesh(times, freqs / 1000, 10 * np.log10(Sx + 1e-11), cmap='viridis')
axes[0].set_yscale('symlog')
axes[0].set_ylabel('Frequency [kHz]')
axes[0].set_xlabel('Time [s]')
axes[0].set_title('log spectrogram')

axes[1].pcolormesh(times_lin, freqs_lin, 10 * np.log10(lin_spectrogram + 1e-6)[::-1, :], cmap='viridis')
axes[1].set_ylabel('Frequency [kHz]')
axes[1].set_yticklabels([])
axes[1].set_xlabel('Time [s]');
axes[1].set_title('hyperresolution spectrogram');

CPP suitability analysis

This dataset could prove challenging for cleanly separating individual calls. The contributing factors are:

  • high-intensity background noise in some portion of the calls
  • some calls being extremely faint and short in duration

A factor that could prove advantageous and help ameliorate the issues mentioned above is the size of this dataset: over 17,000 calls opens the route to further pre-processing, a first pass of which is sketched below.
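
A minimal sketch of such a pass, keeping only calls above a duration cutoff and above a low-energy floor; both thresholds are placeholder values, not tuned:

In [ ]:
# placeholder thresholds; in practice these would be tuned by inspection
MIN_DURATION_S = 0.2
rms = anno.call.apply(lambda x: np.sqrt(np.mean(x ** 2)))  # per-call RMS energy
mask = (call_durations > MIN_DURATION_S) & (rms > rms.quantile(0.1))
anno_subset = anno[mask]
anno_subset.shape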