Towards Learning Universal Audio Representations

This paper from DeepMind presents a new benchmark (HARES) for evaluating representation learning architectures in the audio domain. It also includes an evaluation of a variety of models trained with several supervised and self-supervised approaches.

While in computer vision and NLP there have been many research efforts on evaluating the representations produced by large deep learning models, in the audio domain the main contributions have been related to the speech processing field (e.g., TTS). The proposed benchmark covers a variety of tasks and enables researchers to evaluate their models for general audio representation learning.

Finally, the authors also propose an ad-hoc architecture, namely SlowFast NFNet-F0, able to reach state-of-the-art performance on the proposed benchmark.

🧮 HARES benchmark:

The benchmark aggregates existing datasets into a total of 12 audio tasks, which can be grouped as follows:

  • Environment: audio tagging, animal sounds, acoustic scenes
  • Speech: keyword spotting, intent classification, language identification, speaker identification
  • Music: instrument identification, pitch estimation and music tagging

As you can see from the list above, they do not include any generative task in the evaluation suite.

The authors evaluate several existing models by training a linear layer on top of the frozen pre-trained model for each task (the only exception is the AudioSet dataset, where the authors train a 1-hidden-layer MLP instead of the linear layer), as sketched below.
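Here is a minimal sketch of this linear-evaluation protocol in PyTorch. It assumes a generic `pretrained_encoder` that maps a batch of spectrograms to fixed-size embeddings; this is an illustration of the protocol, not the authors' code, and the hidden size used for the AudioSet MLP is a placeholder.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen pre-trained encoder + trainable linear classifier."""
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the pre-trained model
            p.requires_grad = False
        self.classifier = nn.Linear(embed_dim, num_classes)  # only this layer is trained

    def forward(self, spectrograms: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # embeddings come from the frozen encoder
            feats = self.encoder(spectrograms)
        return self.classifier(feats)

# For AudioSet, the linear layer is replaced by a 1-hidden-layer MLP, e.g.:
# nn.Sequential(nn.Linear(embed_dim, 512), nn.ReLU(), nn.Linear(512, num_classes))
# (the hidden width 512 is an arbitrary choice for illustration).
```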

☯️ Contrastive Learning recap

The authors propose a comparison between two contrastive learning objectives for training models.

Quick recap of SimCLR's objective:

SimCLR objective (image from: Supervised Contrastive Learning paper)

Each anchor image (a spectrogram, in the audio case) is paired with positive examples created by applying data augmentation to the anchor itself, and with negative examples randomly sampled from the rest of the data collection. In this way, the network is trained with a self-supervised objective (no annotated training data is needed).
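The loss commonly used in SimCLR is the NT-Xent (normalized temperature-scaled cross-entropy) loss. Below is a minimal sketch, assuming `z1` and `z2` are the projected embeddings of two augmented views of the same batch; the temperature value is a placeholder, not a setting from the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """NT-Xent loss over a batch of N anchors with two views each."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), L2-normalized
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))            # drop self-similarity
    # positives: the i-th view in z1 matches the i-th view in z2, and vice versa;
    # every other sample in the batch acts as a negative
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```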

Quick recap of BYOL's (Bootstrap Your Own Latent) objective:

BYOL architecture (image from: BYOL paper)

In this case, the BYOL authors remove the need for negative examples by using a parallel architecture: two networks with different sets of parameters, where the first (online) network encodes the anchor image and the second (target) network encodes its augmented version. The online weights are updated to align the two resulting vector representations, while the target weights track a slow moving average of the online ones.
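A minimal sketch of the BYOL-style update is shown below, assuming hypothetical `online_net` and `target_net` modules with the same architecture but separate weights, plus a small `predictor` head on the online branch; the momentum value is a placeholder.

```python
import torch
import torch.nn.functional as F

def byol_loss(p_online: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between predicted and target embeddings."""
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target, dim=1)
    return 2 - 2 * (p * z).sum(dim=1).mean()

@torch.no_grad()
def ema_update(online_net, target_net, tau: float = 0.99):
    """The target network is a slow exponential moving average of the online one."""
    for p_o, p_t in zip(online_net.parameters(), target_net.parameters()):
        p_t.data.mul_(tau).add_((1.0 - tau) * p_o.data)

# Training step (pseudocode-level):
#   p_online = predictor(online_net(augment(x)))   # view 1 through the online branch
#   z_target = target_net(augment(x)).detach()     # view 2 through the target branch
#   loss = byol_loss(p_online, z_target)           # align the two representations
#   loss.backward(); optimizer.step(); ema_update(online_net, target_net)
```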

🧮 SlowFast NFNet-F0 model:

The authors combine the SlowFast architecture design with the NFNet backbone.

NFNet architecture: this family of models is a modification of the original ResNet architecture for computer vision that removes the need for Batch Normalization. The original paper reaches state-of-the-art results in image classification while training substantially faster (a >7× speedup).

SlowFast architecture: originally proposed for video understanding, it relies on the hypothesis that static areas of a frame change slowly (or not at all), while dynamic areas carry the information relevant for understanding. For this reason, the authors process the input with two different pathways, a slow one and a fast one.

In the proposed architecture, the slow stream has 8 times the channel capacity of the fast stream, and its input spectrogram is strided temporally by a factor of 4, so it operates at a lower temporal resolution. A few more details (though not many) are given in the appendix of the original paper. A rough sketch of the two-pathway idea follows.
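The sketch below only illustrates the two-pathway stem on a log-mel spectrogram of shape (batch, 1, time, freq). The 8× channel ratio and the temporal stride of 4 follow the description above; everything else (kernel sizes, the backbone blocks) is a placeholder, not the actual NFNet-F0 blocks used in the paper.

```python
import torch
import torch.nn as nn

class SlowFastStem(nn.Module):
    """Toy two-pathway stem: slow = wide + temporally strided, fast = narrow + full rate."""
    def __init__(self, fast_channels: int = 8):
        super().__init__()
        slow_channels = 8 * fast_channels   # slow stream: 8x the channel capacity
        # slow pathway: temporal stride 4 -> lower temporal resolution
        self.slow = nn.Conv2d(1, slow_channels, kernel_size=3, stride=(4, 1), padding=1)
        # fast pathway: full temporal resolution, fewer channels
        self.fast = nn.Conv2d(1, fast_channels, kernel_size=3, stride=(1, 1), padding=1)

    def forward(self, spec: torch.Tensor):
        return self.slow(spec), self.fast(spec)   # the real model fuses the two streams later

x = torch.randn(2, 1, 128, 64)                  # dummy (batch, 1, time, freq) spectrogram
slow_feats, fast_feats = SlowFastStem()(x)
print(slow_feats.shape, fast_feats.shape)       # slow: time/4 resolution, fast: full time
```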

🎯 Results:

Results table from the original paper.

The above table reports the results obtained by state-of-the-art models in the audio domain for the HARES benchmark. There are a few relevant takeaways summarized below:

  • Among contrastive learning frameworks, SimCLR shows better performance than BYOL.
  • Spectrogram-based models outperform waveform-based models (Bidir-CPC and Wav2Vec 2.0) overall; Wav2Vec 2.0 obtains good performance only on speech-related tasks.
  • The authors suggest that supervised pretraining biases the models towards slow features rather than local traits.
  • Vision-derived architectures show relatively low scores on speech-related tasks; the gap is nonetheless limited compared with the ~50% performance drop that waveform models suffer on the other tasks.

Additional Info:

📅 Published: 2021-12-01 (v2)

👫 Original paper’s authors: Luyu Wang, Pauline Luc, Yan Wu, Adria Recasens, Lucas Smaira, Andrew Brock, Andrew Jaegle, Jean-Baptiste Alayrac, Sander Dieleman, Joao Carreira, Aaron van den Oord

🔗 Full Paper: https://arxiv.org/abs/2111.12124

Get in contact with us!
