In this notebook, we are going to take a closer look at the data.
import librosa
import pandas as pd
import numpy as np
from IPython.display import Audio
from matplotlib import pyplot as plt
import multiprocessing
from scipy import signal
The audio was recorded with a sample rate of 22050 Hz.
audio, SAMPLE_RATE = librosa.core.load('data/190826_001.WAV', mono=False)
SAMPLE_RATE
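Note that librosa resamples to 22050 Hz by default when no sr argument is given; as a sanity check, we could reload a short piece of the file with sr=None to confirm its native rate (a minimal sketch using the same file).
# Reload 1 second without resampling; sr=None keeps the rate stored in the WAV header.
_, native_sr = librosa.load('data/190826_001.WAV', sr=None, mono=False, duration=1.0)
native_sr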
The audio comes from a hydrophone array and was recorded across 4 channels.
audio.shape
(audio.shape[1] / SAMPLE_RATE) / 60
The audio is over 15 minutes long, while the video is just under 7 minutes. The 7th second of the video syncs to the 7th minute, 23rd second of the audio.
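To make the sync easier to work with, we can wrap it in a small helper that maps a video timestamp to the corresponding audio sample index (a minimal sketch based purely on the alignment stated above; video_time_to_audio_sample is an illustrative name).
# 7 s into the video corresponds to 7 min 23 s (443 s) into the audio,
# so the audio starts 436 s before the video.
AUDIO_VIDEO_OFFSET_S = (7 * 60 + 23) - 7

def video_time_to_audio_sample(video_time_s):
    """Map a timestamp in the video (in seconds) to an audio sample index."""
    return int((video_time_s + AUDIO_VIDEO_OFFSET_S) * SAMPLE_RATE)

video_time_to_audio_sample(7) / SAMPLE_RATE / 60  # ~7.38 minutes, i.e. 7 min 23 s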
The video contains 360-degree footage. When played on a computer, you can point the camera in various directions, which makes for an amazing experience.
Despite this capability, we can still load the video with standard Python tools.
import cv2
vidcap = cv2.VideoCapture('data/2019-08-26_rig1_0.mp4')
success, img = vidcap.read()
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # OpenCV returns BGR; convert to RGB for plotting
This frame was captured while the device was still on the deck, before being submerged.
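We are not limited to the first frame: OpenCV can seek by timestamp, so we can also pull a frame from later in the footage (a sketch; the 60-second mark is an arbitrary choice, not a known event time).
# Jump to the 60 s mark and grab that frame.
vidcap.set(cv2.CAP_PROP_POS_MSEC, 60 * 1000)
success, img_later = vidcap.read()
if success:
    plt.imshow(cv2.cvtColor(img_later, cv2.COLOR_BGR2RGB))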
Soon after the array was put in the water, a pod of dolphins arrived.
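Before plotting anything, we can listen to a short excerpt with the Audio widget imported earlier (channel 0 chosen arbitrarily; the 535-540 s window is the same one visualized below).
# Play 5 seconds of the first channel around the dolphin calls.
Audio(audio[0, 535*SAMPLE_RATE:540*SAMPLE_RATE], rate=SAMPLE_RATE)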
Let's visualize their calls.
fig, axes = plt.subplots(2, 2, figsize=(10,10))
for ax, y in zip(axes.flat, audio[:, 535*SAMPLE_RATE:540*SAMPLE_RATE]):
    freqs, times, Sx = signal.spectrogram(y, fs=SAMPLE_RATE)
    ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx + 1e-9), cmap='viridis', shading='auto')
    ax.set_ylabel('Frequency [kHz]')
    ax.set_xlabel('Time [s]');
There is a lot of chatter going on. Let's see if we can spot a time shift between the channels within a single call.
We probably will not be able to observe this effect on a spectrogram even if we zoom in.
fig, axes = plt.subplots(2, 2, figsize=(10,10))
for ax, y in zip(axes.flat, audio[:, 539*SAMPLE_RATE:540*SAMPLE_RATE]):
    freqs, times, Sx = signal.spectrogram(y, fs=SAMPLE_RATE)
    ax.pcolormesh(times, freqs / 1000, 10 * np.log10(Sx + 1e-9), cmap='viridis', shading='auto')
    ax.set_ylabel('Frequency [kHz]')
    ax.set_xlabel('Time [s]');
fig, axes = plt.subplots(2, 2, figsize=(10,10))
for ax, y in zip(axes.flat, audio[:, 539*SAMPLE_RATE:int(539.2*SAMPLE_RATE)]):
    ax.plot(y)
Looking at the waveforms, in particular at the subplots in the upper-right and lower-right corners, we can see the audio shifted by 100-200 samples between channels.
This information, if processed correctly, could allow for localization of the speaker in space.
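To quantify the shift rather than eyeball it, we can cross-correlate a pair of channels over the same snippet (a minimal sketch using scipy's correlate and correlation_lags; the choice of channels 1 and 3, i.e. the upper- and lower-right subplots, is an assumption based on the plots above).
# Estimate the inter-channel delay (in samples) via cross-correlation.
snippet = audio[:, 539*SAMPLE_RATE:int(539.2*SAMPLE_RATE)]
corr = signal.correlate(snippet[1], snippet[3], mode='full')
lags = signal.correlation_lags(len(snippet[1]), len(snippet[3]), mode='full')
lag = lags[np.argmax(corr)]
print(f'estimated shift between channels 1 and 3: {lag} samples '
      f'({1000 * lag / SAMPLE_RATE:.2f} ms)')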
The key to unlocking this dataset for CPP might be its scale (what we have here is just a very small subset of all the available data).
There are probably many paths to performing a CPP study on this dataset. One such path could look as follows.
A human annotator identifies portions of the recordings where only a single dolphin is visible and its vocalizations are heard. Using the recordings from the 4 hydrophones in the array, the location of the speaker is established, as outlined in "Sound Finder: a new software approach for localizing animals recorded with a microphone array". There is a chance this procedure could be automated and applied at scale; regardless, the real goal here would be to obtain a dataset for training a deep neural network to locate the origin of a sound based on the 4-channel information from the hydrophone array.
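For intuition, here is a minimal sketch of the time-difference-of-arrival idea behind Sound Finder, solved with a generic least-squares fit; the hydrophone coordinates and the speed of sound below are placeholders, since the real array geometry is not part of this notebook.
from scipy.optimize import least_squares

SPEED_OF_SOUND = 1500.0  # m/s, a nominal value for sea water

# Placeholder 2D hydrophone layout in metres (not the actual array geometry).
HYDROPHONE_POS = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def localize_tdoa(tdoas, ref=0):
    """Estimate a 2D source position from time differences of arrival.
    tdoas[i] is the arrival-time difference (seconds) between hydrophone i
    and the reference hydrophone."""
    def residuals(xy):
        dists = np.linalg.norm(HYDROPHONE_POS - xy, axis=1)
        return (dists - dists[ref]) / SPEED_OF_SOUND - tdoas
    return least_squares(residuals, x0=HYDROPHONE_POS.mean(axis=0)).x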
Being able to train a model that localizes the origin of vocalizations in space would in itself be valuable (see "Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates" for a discussion of the modelling approach that could be taken). With this information, we could also apply beamforming to improve the quality of the signal.
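As a rough illustration of the simplest variant, delay-and-sum: shift each channel by its estimated delay towards the source and average (the delay values in the example call are placeholders, not measurements).
def delay_and_sum(multichannel, delays_in_samples):
    """Align each channel by its estimated delay (in samples) and average them.
    np.roll wraps around at the edges, which is acceptable for a rough sketch."""
    aligned = [np.roll(ch, -int(d)) for ch, d in zip(multichannel, delays_in_samples)]
    return np.mean(aligned, axis=0)

enhanced = delay_and_sum(audio[:, 535*SAMPLE_RATE:540*SAMPLE_RATE], [0, 120, 60, 180])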
More importantly, with this information we could create synthetic mixtures of individuals vocalizing at the same time and additionally supply the location information. The model could then be trained to disambiguate the signal and predict the source locations at the same time.
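A sketch of what assembling such a training example could look like, assuming we already have two single-caller 4-channel clips and their estimated source positions (the function name and the mixing weight are illustrative):
def make_synthetic_mixture(clip_a, clip_b, loc_a, loc_b, gain_b=0.8):
    """Mix two single-caller clips of shape (4, n_samples) and keep both
    location labels as the training targets."""
    n = min(clip_a.shape[1], clip_b.shape[1])
    mixture = clip_a[:, :n] + gain_b * clip_b[:, :n]
    return mixture, (loc_a, loc_b)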
This is a very fertile area of research, so it is very likely that other formulations in which this dataset could be used for training will emerge.
There is also the possibility of using this dataset solely for inference. A CPP model trained on another dataset could be applied to the data here. It would not be possible to calculate objective performance metrics (due to the lack of ground truth in this inference-only scenario), but a qualitative analysis of the results could still be performed. While this approach lacks scientific rigour, it is often very beneficial to analyze individual results for insights that can inform improvements to the modelling approach.
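A minimal sketch of what inference-only use could look like, assuming a hypothetical pretrained model exposed as a callable that accepts a (4, window) array; no such model ships with this notebook.
def run_inference(model, multichannel, window_s=5.0, hop_s=2.5):
    """Slide the (hypothetical) model over the 4-channel recording and collect its outputs."""
    window = int(window_s * SAMPLE_RATE)
    hop = int(hop_s * SAMPLE_RATE)
    outputs = []
    for start in range(0, multichannel.shape[1] - window + 1, hop):
        outputs.append(model(multichannel[:, start:start + window]))
    return outputs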
Alternatively, assuming a sufficiently large amount of data, this dataset could also be leveraged for a standard CPP analysis. A human annotator could scan the data, identifying calls with a single individual present; these could then be extracted and used to create synthetic mixtures. The issue that arises here, though, is that we could not guarantee that the identity of the dolphin is indeed different between calls (if the video contains recordings of a single caller on two separate occasions, we would not be able to tell whether these are two distinct individuals or the same individual following the boat twice).