In this notebook, we are going to take a closer look at the data. Let us begin by loading everything in.
import librosa
import pandas as pd
import numpy as np
from IPython.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
from scipy import signal
anno = pd.read_pickle('data/annotations.dataframe.pkl.gz')
anno.head()
The annotations dataframe contains the extracted calls in its call column. All of the calls were recorded at a sample rate of 24 kHz.
SAMPLE_RATE = 24000
There are a total of 880 calls in this dataset.
anno.shape
This dataset contains "whup" and "growl" calls.
Here is what they sound like.
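The playback cells themselves are not reproduced here, but listening to a waveform in a notebook is a one-liner with IPython's Audio widget. The tone below is a synthetic stand-in for a real call, so the example is self-contained:

```python
import numpy as np

SAMPLE_RATE = 24000

# Synthetic stand-in for a call: half a second of a decaying 200 Hz tone.
t = np.linspace(0, 0.5, int(0.5 * SAMPLE_RATE), endpoint=False)
tone = np.sin(2 * np.pi * 200 * t) * np.exp(-4 * t)

# In the notebook this renders an inline audio player; a real call would
# be played the same way, e.g. Audio(anno.call.iloc[0], rate=SAMPLE_RATE).
# from IPython.display import Audio
# Audio(tone, rate=SAMPLE_RATE)
```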
This is the distribution of call durations:
call_durations = anno.call.apply(lambda x: x.shape[0] / SAMPLE_RATE)
plt.title('Call durations in seconds')
plt.xlabel('seconds')
plt.ylabel('count')
plt.hist(call_durations);
Within each call type, the calls have roughly similar durations.
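To make that claim concrete, per-type duration summaries can be computed with a groupby. The snippet below runs on a small synthetic stand-in for the annotations dataframe; the 'call_type' column name is an assumption about how the two call types are labelled, not something shown in this notebook.

```python
import numpy as np
import pandas as pd

SAMPLE_RATE = 24000

# Synthetic stand-in for anno: six fake "calls" of known durations,
# three per (assumed) call type.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'call_type': ['whup'] * 3 + ['growl'] * 3,
    'call': [rng.standard_normal(int(d * SAMPLE_RATE))
             for d in (0.7, 0.8, 0.9, 1.5, 1.6, 1.8)],
})

# Duration in seconds = number of samples / sample rate.
durations = demo.call.apply(lambda x: x.shape[0] / SAMPLE_RATE)
summary = durations.groupby(demo.call_type).agg(['min', 'mean', 'max'])
print(summary)
```

On the real dataframe, the same two lines applied to anno would show whether the two call types occupy distinct duration ranges.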
Just so that we have a better understanding of the data, let's listen to a couple of the shorter and longer calls, to see how they differ from the calls we already listened to.
anno.call.iloc[np.argsort(call_durations)[0]].shape[0] / SAMPLE_RATE
The shortest call is 0.63 seconds long. This is what it sounds like:
This is the second shortest call.
And now let's listen to the longest call. It is about 3.28 seconds long.
anno.call.iloc[np.argsort(call_durations)[-1]].shape[0] / SAMPLE_RATE
And the second longest call.
Even the longest and shortest calls still sound like we would imagine a growl or whup call to sound, based on the other examples we heard!
In other words, this is valid data: the outliers are not malformed, but are valid, potentially interesting examples to include in any analysis.
Let's attempt visualizing the calls on a spectrogram.
freqs, times, Sx = signal.spectrogram(anno.call.iloc[200], fs=SAMPLE_RATE)
f, ax = plt.subplots()
ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx + 1e-12), cmap='viridis', shading='auto')  # small offset avoids log(0)
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]');
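For reference, here is the same spectrogram call on a synthetic chirp, with no dataset required. It shows the shapes scipy.signal.spectrogram returns and that the frequency axis spans 0 to the Nyquist frequency (12 kHz at this sample rate):

```python
import numpy as np
from scipy import signal

SAMPLE_RATE = 24000

# One second of a linear chirp sweeping 100 Hz -> 2 kHz, as a stand-in
# for a call waveform.
t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
chirp = signal.chirp(t, f0=100, t1=1.0, f1=2000)

freqs, times, Sx = signal.spectrogram(chirp, fs=SAMPLE_RATE)

# freqs spans 0 .. fs/2; Sx has shape (n_freqs, n_times).
print(freqs[0], freqs[-1], Sx.shape)
```

With the default window length, the spectrogram trades time resolution for frequency resolution; for short calls it can be worth passing a smaller nperseg.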
CPP could be run on this data, though the results are uncertain. The dataset is moderately sized: likely big enough to run the CPP pipeline, though more data might be beneficial.