3年前

Zifan Jiang Adrian Soldati Isaac Schamberg Adriano R. Lameira Steven Moran

音響イベント検出の入門

RTX 5090のコンピュートリソースがわずか20時間分 $1 (価値 $7)

概要

本研究では、野外調査中に収集された連続した生音声録音データから、オランウータンなどの大型類人猿の鳴き声を自動的に検出し分類する新たな手法を提案する。本手法は、wav2vec 2.0やLSTM（Long Short-Term Memory）を含む、事前学習済みおよび時系列ニューラルネットワークを活用しており、オランウータン、チンパンジー、ボノボという3つの異なる大型類人猿系統に由来する3つのデータセットを用いて検証を行った。録音データは異なる研究者によって収集され、異なる注釈スキームを含んでいるが、当パイプラインはこれらを前処理し、統一された方法で学習を行う。鳴き声の検出および分類における当手法の結果は高い精度を示した。本手法は他の動物種、より一般的には音響イベント検出タスクに対して汎用可能となることを意図している。将来の研究を促進するため、当パイプラインおよび手法を公開する。

One-sentence Summary

Leveraging wav2vec 2.0 and LSTM networks, this pipeline automatically detects and classifies great ape calls from continuous field recordings by uniformly processing heterogeneous annotation schemes, achieving high accuracy across orangutan, chimpanzee, and bonobo datasets while aiming to generalize to other species and broader sound event detection tasks.

Key Contributions

A unified processing pipeline automatically detects and classifies great ape calls directly from continuous raw audio recordings by combining the wav2vec 2.0 feature extractor with a sequential LSTM network. This architecture standardizes heterogeneous annotation schemes and diverse field recordings into a consistent training framework.
The wav2vec 2.0 model pretrained exclusively on human speech functions as an effective audio feature representation layer for great ape calls without requiring additional fine-tuning. This result demonstrates strong cross-species acoustic generalization for raw waveform processing.
Validation across three distinct great ape lineages (orangutans, chimpanzees, and bonobos) achieves more than 80% frame-level classification accuracy and weighted F1-scores. The complete processing pipeline and trained models are publicly released to facilitate reproducible research and extension to other species.

Introduction

Primatologists rely on manually annotating primate vocalizations from continuous field recordings, a process that is labor-intensive and complicated by dense, noisy forest environments. Automating this workflow would significantly accelerate behavioral and acoustic research. Prior sound event detection and animal classification systems typically require pre-segmented audio clips, struggle with continuous raw recordings, and fail to handle overlapping vocalizations and environmental noise effectively. No existing framework specifically addresses automatic great ape call detection from unprocessed field data. The authors leverage a pretrained wav2vec 2.0 feature extractor paired with an LSTM network to detect and classify great ape calls directly from continuous raw audio. Their pipeline processes heterogeneous field recordings into a unified training format, achieves over 80 percent accuracy across orangutan, chimpanzee, and bonobo datasets, and is publicly released to advance computational primatology.

Dataset

Dataset Composition and Sources: The authors compiled a multi-species vocalization corpus featuring chimpanzee pant-hoots, orangutan long calls, and bonobo high-hoots sourced from field recordings across natural behavioral contexts such as feeding, traveling, resting, and social or environmental responses.
Subset Details: Chimpanzee clips capture isolated pant-hoots without temporal overlap, organized into sequential introduction, build-up, climax, and let-down phases. Orangutan recordings contain complex, multi-pulse long calls with highly variable phase durations and a balanced ratio of call to non-call segments. Bonobo entries are dominated by loud high-hoots, supplemented by shorter, diverse vocalizations that introduce minor class imbalance.
Processing and Feature Extraction: All audio clips are converted to .wav format and resampled to 16 kHz before being segmented into 20 millisecond frames with zero padding applied as needed. The authors extract three distinct frame-level feature sets using torchaudio: raw waveforms, spectrograms, and embeddings inferred from a wav2vec 2.0 base model.
Label Construction, Splitting, and Model Usage: Frame-level labels are constructed by mapping manual annotations to positive class indices for specific call phases, while all unmarked frames default to a zero index representing non-calls. The complete dataset is shuffled and divided into 80 percent training, 10 percent validation, and 10 percent test sets, with three independent splits generated using random seeds 0, 42, and 3407. This pipeline supports the authors' dual modeling approach, enabling a multi-class phase classifier on chimpanzee data and a binary call detection system on orangutan recordings.

Method

The authors leverage a sequence modeling framework to learn a function $f$ mapping input audio frames $I_{1:T}$ to corresponding label sequences $L_{1:T}$ . The overall architecture processes raw audio waveforms through a feature extraction pipeline, where wav2vec 2.0 features are extracted from the waveform and used as input to a sequence modeling component. As shown in the figure below, the framework begins with the raw waveform, which is converted into spectrogram features and fed into the wav2vec 2.0 model to generate high-level representations. These representations are then processed by a sequence modeling function $f^{m}$ , which produces a hidden sequence $H_{1:T}$ capturing inter-frame dependencies. The hidden sequence is subsequently passed through a dense linear layer $f^{d}$ to generate intermediate outputs $D_{1:T}$ , which are then transformed via a Softmax function $f^{s}$ to yield a probability distribution $P_{1:T}$ over the target classes.

The sequence modeling function $f^{m}$ can take several forms. One implementation uses a bidirectional LSTM with a hidden size of 1024, resulting in a hidden dimension of 2048. Alternatively, a Transformer encoder with 8 attention heads and 6 layers is employed, using a hidden dimension of 1024. The authors also incorporate an autoregressive mechanism, where the output of the dense layer at time step $t$ , $D_{t}$ , is concatenated to the input $I_{t+1}$ of the next time step, enabling the model to maintain temporal consistency in predictions. This autoregressive connection is applied within an RNN-based sequence model, which can be configured to operate unidirectionally or, with the addition of two stacked layers, to preserve bidirectionality through a summation operation before the final classification step.

The model is trained using cross-entropy loss between the predicted probability distribution $P_{1:T}$ and the ground-truth labels $L_{1:T}$ , with gradients backpropagated through the network to update the parameters. To address class imbalance, class weights are assigned as the reciprocal of the frequency of each class in the training data. This training strategy enables the model to effectively learn from imbalanced datasets while maintaining high classification accuracy across all target classes.

Experiment

Experiments were conducted using PyTorch on a V100 GPU with training capped at 200 epochs and early stopping based on validation performance. Initial tests on chimpanzee vocalizations validated the substantial benefits of wav2vec 2.0 transfer learning, revealed that Transformers offer no advantage over LSTMs for this limited dataset, and confirmed that autoregressive connections enhance output consistency. The approach was subsequently extended to orangutan and bonobo recordings, where binary classification effectively isolates calls for automated analysis. Finally, successful zero-shot transfer from orangutan to bonobo data demonstrates strong potential for developing a generalized great ape sound event detection model despite class imbalance challenges.

The authors conduct experiments to evaluate different models and features for detecting great ape vocalizations, focusing on performance improvements with pre-trained audio representations and autoregressive modeling. They extend their approach to orangutan and bonobo data, achieving competitive results and demonstrating potential for zero-shot transfer across species. Pre-trained audio features significantly outperform raw waveform and spectrogram inputs in performance. Autoregressive modeling improves consistency and performance compared to non-autoregressive approaches. The model shows promising generalizability to unseen species through zero-shot transfer learning.

The authors conduct experiments on chimpanzee, orangutan, and bonobo data sets to evaluate their model's performance in detecting call types, with results showing consistent behavior across species and indicating the potential for generalizable sound event detection. The experiments involve various model architectures and training strategies, with the model achieving stable output in non-autoregressive settings and demonstrating transferability across species. The model is tested on three great ape data sets with varying numbers of audio clips and call types, showing consistent performance across species. Non-autoregressive models produce stable outputs with minimal variation, as illustrated by small gaps in results. The model trained on one species shows promise when applied to another, suggesting potential for general-purpose sound event detection across great apes.

The authors evaluated multiple model architectures and feature representations across chimpanzee, orangutan, and bonobo datasets to assess their effectiveness in great ape vocalization detection. Comparisons between pre-trained audio representations and raw inputs demonstrated that learned features substantially enhance detection accuracy, while autoregressive modeling yielded more consistent performance than non-autoregressive alternatives. Cross-species testing further revealed that models trained on a single species generalize effectively to unseen apes, highlighting the approach's strong potential for broad, zero-shot sound event detection across great apes.

ソースPDF

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

HyperAI

このノートブックを実行 Discordで議論

3年前

Zifan Jiang Adrian Soldati Isaac Schamberg Adriano R. Lameira Steven Moran

音響イベント検出の入門

RTX 5090のコンピュートリソースがわずか20時間分 $1 (価値 $7)

ノートブックへ移動

概要

One-sentence Summary

Key Contributions

A unified processing pipeline automatically detects and classifies great ape calls directly from continuous raw audio recordings by combining the wav2vec 2.0 feature extractor with a sequential LSTM network. This architecture standardizes heterogeneous annotation schemes and diverse field recordings into a consistent training framework.
The wav2vec 2.0 model pretrained exclusively on human speech functions as an effective audio feature representation layer for great ape calls without requiring additional fine-tuning. This result demonstrates strong cross-species acoustic generalization for raw waveform processing.
Validation across three distinct great ape lineages (orangutans, chimpanzees, and bonobos) achieves more than 80% frame-level classification accuracy and weighted F1-scores. The complete processing pipeline and trained models are publicly released to facilitate reproducible research and extension to other species.

Introduction

Dataset

Dataset Composition and Sources: The authors compiled a multi-species vocalization corpus featuring chimpanzee pant-hoots, orangutan long calls, and bonobo high-hoots sourced from field recordings across natural behavioral contexts such as feeding, traveling, resting, and social or environmental responses.
Subset Details: Chimpanzee clips capture isolated pant-hoots without temporal overlap, organized into sequential introduction, build-up, climax, and let-down phases. Orangutan recordings contain complex, multi-pulse long calls with highly variable phase durations and a balanced ratio of call to non-call segments. Bonobo entries are dominated by loud high-hoots, supplemented by shorter, diverse vocalizations that introduce minor class imbalance.
Processing and Feature Extraction: All audio clips are converted to .wav format and resampled to 16 kHz before being segmented into 20 millisecond frames with zero padding applied as needed. The authors extract three distinct frame-level feature sets using torchaudio: raw waveforms, spectrograms, and embeddings inferred from a wav2vec 2.0 base model.
Label Construction, Splitting, and Model Usage: Frame-level labels are constructed by mapping manual annotations to positive class indices for specific call phases, while all unmarked frames default to a zero index representing non-calls. The complete dataset is shuffled and divided into 80 percent training, 10 percent validation, and 10 percent test sets, with three independent splits generated using random seeds 0, 42, and 3407. This pipeline supports the authors' dual modeling approach, enabling a multi-class phase classifier on chimpanzee data and a binary call detection system on orangutan recordings.

Method

Experiment

ソースPDF

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

Command Palette

ニューラルネットワークを用いた大猿の鳴き声の自動音響イベント検出および分類

Zifan Jiang Adrian Soldati Isaac Schamberg Adriano R. Lameira Steven Moran

音響イベント検出の入門

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

ニューラルネットワークを用いた大猿の鳴き声の自動音響イベント検出および分類

Zifan Jiang Adrian Soldati Isaac Schamberg Adriano R. Lameira Steven Moran

音響イベント検出の入門

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

ニューラルネットワークを用いた大猿の鳴き声の自動音響イベント検出および分類

Zifan Jiang Adrian Soldati Isaac Schamberg Adriano R. Lameira Steven Moran

音響イベント検出の入門

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters