Using squeezed wav2vec2 to automatically detect owl calls
https://www.seangoedecke.com/animal-call-audio-recognition/
By gfysfm
refibrillator | 1 comment | 3 weeks ago
There are a couple relative advantages of your approach that I feel are notable though:
The squeezed wav2vec2 (SEW) architecture uses Transformer layers and operates directly on the raw waveform, whereas BirdNET first converts audio to a spectrogram and then applies 2D convolution layers (a ResNet-like backbone).
Because the spectrogram step expands the input representation BirdNET has to process, SEW should be considerably more computationally efficient for a given audio classification task (all else held equal).
Plus, simply taking a pre-trained SEW model and training a linear classifier on its embeddings would almost certainly produce a strong baseline. No GPU would be necessary for that.
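That linear-probe baseline might look like the sketch below. It assumes fixed-length embeddings have already been extracted from a frozen pre-trained SEW model (e.g. by mean-pooling its hidden states over time); the synthetic clusters here are a stand-in for those embeddings, not anything from the original post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_linear_probe(embeddings: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit a linear classifier on frozen audio embeddings (CPU-only)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings, labels)
    return clf

# Synthetic stand-in for pooled SEW embeddings: two separable clusters.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, scale=0.3, size=(100, 64))   # "owl call" clips
neg = rng.normal(loc=-1.0, scale=0.3, size=(100, 64))  # background clips
X = np.vstack([pos, neg])
y = np.array([1] * 100 + [0] * 100)

clf = train_linear_probe(X, y)
accuracy = clf.score(X, y)
```

On real data the probe's quality depends entirely on how well the frozen embeddings separate calls from background, which is exactly what makes it a useful baseline.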
P.S. Minor typo - precision and recall are confused here:
> “precision” (how many of the animal calls it notices) and its “recall” (the rate at which it makes accurate predictions).
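For reference, the corrected definitions: precision is the fraction of flagged detections that are real calls, and recall is the fraction of real calls the detector notices. A minimal sketch with made-up counts:

```python
# Hypothetical detector counts on a labeled test set.
true_positives = 80   # real owl calls correctly flagged
false_positives = 20  # background noise wrongly flagged as calls
false_negatives = 40  # real owl calls the detector missed

# Precision: of everything flagged, how much was a real call?
precision = true_positives / (true_positives + false_positives)  # 0.8

# Recall: of all real calls, how many did the detector notice?
recall = true_positives / (true_positives + false_negatives)     # ~0.667
```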
sdenton4 | 1 comment | 3 weeks ago
The gold standard for input features is a PCEN melspectrogram, largely because it gives useful generalizable features, through compression, normalization, and approximate log scaling of frequency features. Learned frontends tend to overfit training distributions badly - someone finally wrote this up recently, but I'm struggling to find the paper on my phone...
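For the curious, PCEN (per-channel energy normalization) replaces the usual log compression with an adaptive gain followed by root compression. A minimal numpy sketch of the standard formulation, assuming a non-negative mel spectrogram `E` has already been computed (in practice `librosa.pcen` implements this, and the parameter values below are illustrative defaults, not tuned):

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a mel spectrogram.

    E: array of shape (n_mels, n_frames), non-negative energies.
    """
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    # First-order IIR smoother per frequency channel (the "per-channel" part).
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]
    # Adaptive gain control, then root compression.
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

# Toy input: 40 mel bands, 100 frames of random energy.
rng = np.random.default_rng(0)
E = rng.random((40, 100))
P = pcen(E)
```

The smoother `M` tracks slowly varying background energy, so dividing by it normalizes away channel-level gain differences before the root compression, which is one reason the features generalize across recording conditions.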
sdenton4 | 1 comment | 3 weeks ago
Lots more I could say here - I've been working on problems in bioacoustics for years - but for now I'll just leave a link to some work on using bird song embeddings from last year.