Data Exploration

In this notebook, we are going to take a closer look at the data. Let us begin by loading everything in.

In [4]:
import librosa
import pandas as pd
import numpy as np
from IPython.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
from scipy import signal

# Load the annotations dataframe (one row per extracted call)
anno = pd.read_pickle('data/annotations.dataframe.pkl.gz')
In [5]:
anno.head()
Out[5]:
Selection View Channel Begin Time (s) End Time (s) Low Freq (Hz) High Freq (Hz) Begin Path File Offset (s) Begin Date Time ... Call Grade (E, G, F, P) Center Freq Grade (1 or 0) Peak Freq Grade (1 or 0) Start Freq (Hz) End Freq (Hz) filename call_duration offset_in_frames duration_in_frames call
0 1 Spectrogram 1 1 1888.267046 1889.352991 42.640 317.500 C:\Users\Annie Bartlett\Box\Babylon\5148-Yello... 1888.2670 16:11.3 ... F 1 1 63.308 80.951 Yellow_190805114443.wav 1.085945 45318408 26062 [0.00048828125, 0.0004272461, 0.00036621094, 0...
1 3 Spectrogram 1 1 2296.805479 2298.167697 22.832 315.000 C:\Users\Annie Bartlett\Box\Babylon\5148-Yello... 2296.8055 22:59.8 ... F 1 0 44.627 63.308 Yellow_190805114443.wav 1.362218 55123332 32693 [0.00076293945, 0.00079345703, 0.0008239746, 0...
2 4 Spectrogram 1 1 2525.664576 2526.794546 58.445 923.167 C:\Users\Annie Bartlett\Box\Babylon\5148-Yello... 2525.6646 26:48.7 ... F 1 1 77.828 88.159 Yellow_190805114443.wav 1.129970 60615950 27119 [0.0006713867, 0.0006713867, 0.0007019043, 0.0...
3 5 Spectrogram 1 1 2780.422324 2781.726670 147.194 356.800 C:\Users\Annie Bartlett\Box\Babylon\5148-Yello... 2780.4223 31:03.4 ... F 1 1 164.610 162.662 Yellow_190805114443.wav 1.304346 66730135 31304 [0.0, 0.0, 0.0, 0.0, 0.0, 3.0517578e-05, 3.051...
4 7 Spectrogram 1 1 2925.875763 2927.329177 65.100 1110.158 C:\Users\Annie Bartlett\Box\Babylon\5148-Yello... 2925.8758 33:28.9 ... F 0 1 82.255 97.408 Yellow_190805114443.wav 1.453414 70221019 34881 [0.0005493164, 0.0005493164, 0.0005493164, 0.0...

5 rows × 33 columns

The annotations dataframe contains the extracted calls in the call column. All of the calls were recorded with a sample rate of 24 kHz.

In [6]:
SAMPLE_RATE = 24000

There are a total of 880 calls in this dataset.

In [4]:
anno.shape
Out[4]:
(880, 33)

This dataset contains "whup" and "growl" calls.

Here is what they sound like.
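The rendered audio widgets are not preserved in this export, but a call can be played back directly from the dataframe along these lines (a minimal sketch, assuming the call column holds raw waveform samples at SAMPLE_RATE):

In [ ]:
# Play back one example call (sketch; the original Audio widget output is not preserved here)
Audio(anno.call.iloc[0], rate=SAMPLE_RATE)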

This is the distribution of call durations.

In [8]:
# Duration of each call in seconds: number of samples divided by the sample rate
call_durations = anno.call.apply(lambda x: x.shape[0] / SAMPLE_RATE)
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations);

The calls, being of the same types, have roughly similar durations.
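To put a rough number on this, the duration distribution can be summarized as follows (a sketch; this cell and its output are not part of the original notebook):

In [ ]:
# Summary statistics of call durations in seconds (illustrative check)
call_durations.describe()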

To get a better understanding of the data, let's listen to a couple of the shortest and longest calls and see how they differ from the calls we have already heard.

In [23]:
anno.call.iloc[np.argsort(call_durations)[0]].shape[0] / SAMPLE_RATE
Out[23]:
0.6338333333333334

The shortest call is 0.63 seconds long, and this is what it sounds like:
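A playback cell along these lines would reproduce it (a sketch, mirroring the indexing used in the cell above; the original widget output is not preserved):

In [ ]:
# Listen to the shortest call (sketch)
Audio(anno.call.iloc[np.argsort(call_durations)[0]], rate=SAMPLE_RATE)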

This is the second shortest call.

And now let's listen to the longest call. It is 3.2795 seconds long.

In [40]:
anno.call.iloc[np.argsort(call_durations)[879]].shape[0] / SAMPLE_RATE
Out[40]:
3.2795
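The corresponding playback cell would look like this (a sketch; the original Audio widget output is not preserved):

In [ ]:
# Listen to the longest call (sketch)
Audio(anno.call.iloc[np.argsort(call_durations)[879]], rate=SAMPLE_RATE)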

And the second longest call.

Even the longest and shortest calls still sound the way we would imagine a growl or whup call to sound, based on the other examples we heard!

In other words, this is valid data. The outliers are not malformed recordings, but valid, potentially interesting examples to include in any analysis.

Let's visualize one of the calls as a spectrogram.

In [28]:
# Compute a spectrogram of one example call and plot power in dB
freqs, times, Sx = signal.spectrogram(anno.call.iloc[200], fs=SAMPLE_RATE)
f, ax = plt.subplots()
ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx), cmap='viridis', shading='auto')
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');

CPP suitability analysis

CPP could be run on this data, but how well it would work is uncertain:

  • a small portion of the calls contain pronounced background noise; it might be advantageous to remove these manually (a rough screening heuristic is sketched after this list)
  • many of the calls are long and well-defined, which could be a plus in the context of CPP
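One hypothetical way to shortlist noisy calls for manual review (an illustrative heuristic, not part of the original analysis; the threshold of 3 is an arbitrary placeholder) is to compare each waveform's peak level to its RMS level, since dominant broadband background noise tends to lower this crest factor:

In [ ]:
# Illustrative heuristic (not part of the original analysis): flag calls with a low
# crest factor (peak-to-RMS ratio), which can indicate dominant broadband noise.
crest_factor = anno.call.apply(
    lambda x: np.max(np.abs(x)) / np.sqrt(np.mean(np.square(x)))
)
anno.loc[crest_factor < 3, ['filename', 'call_duration']]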

This dataset is of moderate size. It is likely large enough to run the CPP pipeline on, though more data might be beneficial.