Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor

Submitted as a conference paper to the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024

1Younglo Lee, 1Shukjae Choi, 1Byeong-Yeol Kim, 2Zhong-Qiu Wang, 2Shinji Watanabe

1Conversation Intelligence, 42dot Inc. and 2Language Technologies Institute, Carnegie Mellon University


We present "Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor" (SepTDA), a speech separation model for monaural mixtures of 2 to 5 speakers.

Abstract

We propose a novel speech separation model designed to separate mixtures with an unknown number of speakers. The proposed model stacks 1) a dual-path processing block that can model spectro-temporal patterns, 2) a transformer decoder-based attractor (TDA) calculation module that can deal with an unknown number of speakers, and 3) triple-path processing blocks that can model inter-speaker relations. Given a fixed, small set of learned speaker queries and the mixture embedding produced by the dual-path blocks, TDA infers the relations of these queries and generates an attractor vector for each speaker. The estimated attractors are then combined with the mixture embedding by feature-wise linear modulation (FiLM) conditioning, creating a speaker dimension. The mixture embedding, conditioned with speaker information produced by TDA, is fed to the final triple-path blocks, which augment the dual-path blocks with an additional pathway dedicated to inter-speaker processing. The proposed approach outperforms the previous best reported in the literature, achieving 24.0 and 23.7 dB SI-SDR improvement on WSJ0-2mix and WSJ0-3mix, respectively, with a single model trained to separate 2- and 3-speaker mixtures. The proposed model also exhibits strong performance and generalizability at counting sources and separating mixtures with up to 5 speakers.
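As a rough sketch of the FiLM conditioning step described in the abstract (the function names, shapes, and values below are illustrative, not taken from the paper's implementation): each estimated attractor is assumed to have already been mapped to a per-channel scale (gamma) and shift (beta), which modulate the shared mixture embedding once per speaker, creating the speaker dimension.

```python
def film(embedding, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel.

    embedding: T x D list of frames; gamma, beta: length-D vectors
    (assumed here to come from one speaker's attractor via learned layers).
    """
    return [[gamma[d] * frame[d] + beta[d] for d in range(len(frame))]
            for frame in embedding]

def condition_per_speaker(embedding, gammas, betas):
    """Apply FiLM once per attractor, yielding an S x T x D structure."""
    return [film(embedding, g, b) for g, b in zip(gammas, betas)]

# Toy example: a 2-frame, 2-channel embedding conditioned for one speaker.
conditioned = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[1.0, -1.0])
print(conditioned)  # [[3.0, 0.0], [7.0, 1.0]]
```

In the actual model, a separate processing pathway then attends across the resulting speaker dimension; the sketch only shows how the conditioning itself broadcasts speaker information over the shared embedding.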

This page follows the format of the demo page for "Voice Separation with an Unknown Number of Multiple Speakers" [1].

Samples

Here are some samples from our model for you to listen to:
  • Mixture input - original mixed audio
  • Ground Truth - original separated samples
  • SepTDA2/3 - our proposed model trained on WSJ0-{2,3}mix
  • SepTDA[2-5] - our proposed model trained on WSJ0-{2,3,4,5}mix
  • Gated DPRNN[1] - mixture and separated source samples downloaded from this link
  • TF-GridNet[2] - mixture and separated source samples downloaded from this link
  • SepEDA[3] - mixture and separated source samples downloaded from this link
Note that the number in parentheses is the average SI-SDR improvement (SI-SDRi) over the mixture, in dB.
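For reference, the SI-SDRi numbers quoted below can be computed from the standard metric definition; this is a minimal, self-contained sketch of that definition, not code from the paper:

```python
import math

def si_sdr(est, ref):
    """Scale-invariant signal-to-distortion ratio (dB) of `est` w.r.t. `ref`."""
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [alpha * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    return 10.0 * math.log10(sum(t * t for t in target) / sum(n * n for n in noise))

def si_sdri(est, ref, mix):
    """SI-SDR improvement: the estimate's gain over using the raw mixture."""
    return si_sdr(est, ref) - si_sdr(mix, ref)

# Toy signals: the estimate removes 90% of the interference from the mixture.
ref = [1.0, 2.0, 3.0, 4.0]
interference = [0.5, -1.0, 0.3, -0.2]
mix = [r + i for r, i in zip(ref, interference)]
est = [r + 0.1 * i for r, i in zip(ref, interference)]
print(round(si_sdri(est, ref, mix), 1))  # prints the improvement in dB
```

Because of the optimal-scaling projection, the metric is invariant to any rescaling of the estimate, which is why it is preferred over plain SDR for separation benchmarks.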



WSJ0-2mix separation samples

Separation results of the SepTDA2/3 model trained on WSJ0-{2,3}mix, compared with TF-GridNet[2], SepEDA[3], and Gated DPRNN[1].

  • Mixture input | Ground Truth | SepTDA2/3 (SI-SDRi 19.4 dB) | TF-GridNet (SI-SDRi 19.6 dB)
  • Mixture input | Ground Truth | SepTDA2/3 (SI-SDRi 25.2 dB) | SepEDA (SI-SDRi 22.7 dB)
  • Mixture input | Ground Truth | SepTDA2/3 (SI-SDRi 25.4 dB) | Gated DPRNN (SI-SDRi 20.4 dB)
  • Mixture input | Ground Truth | SepTDA2/3 (SI-SDRi 25.3 dB) | Gated DPRNN (SI-SDRi 19.3 dB)


WSJ0-3mix separation samples

Separation results of the SepTDA2/3 model trained on WSJ0-{2,3}mix, compared with SepEDA[3] and Gated DPRNN[1].

  • Mixture input | Ground Truth | SepTDA2/3 (SI-SDRi 23.4 dB) | SepEDA (SI-SDRi 20.1 dB)
  • Mixture input | Ground Truth | SepTDA2/3 (SI-SDRi 25.1 dB) | Gated DPRNN (SI-SDRi 18.9 dB)
  • Mixture input | Ground Truth | SepTDA2/3 (SI-SDRi 16.6 dB) | Gated DPRNN (SI-SDRi 13.0 dB)


WSJ0-4mix separation samples

Separation results of the SepTDA[2-5] model trained on WSJ0-{2,3,4,5}mix, compared with SepEDA[3] and Gated DPRNN[1].

  • Mixture input | Ground Truth | SepTDA[2-5] (SI-SDRi 23.7 dB) | SepEDA (SI-SDRi 10.7 dB)
  • Mixture input | Ground Truth | SepTDA[2-5] (SI-SDRi 23.6 dB) | Gated DPRNN (SI-SDRi 16.1 dB)
  • Mixture input | Ground Truth | SepTDA[2-5] (SI-SDRi 21.7 dB) | Gated DPRNN (SI-SDRi 17.0 dB)

WSJ0-5mix separation samples

Separation results of the SepTDA[2-5] model trained on WSJ0-{2,3,4,5}mix, compared with SepEDA[3] and Gated DPRNN[1].

  • Mixture input | Ground Truth | SepTDA[2-5] (SI-SDRi 22.5 dB) | SepEDA (SI-SDRi 12.8 dB)
  • Mixture input | Ground Truth | SepTDA[2-5] (SI-SDRi 23.2 dB) | Gated DPRNN (SI-SDRi 15.9 dB)
  • Mixture input | Ground Truth | SepTDA[2-5] (SI-SDRi 23.9 dB) | Gated DPRNN (SI-SDRi 15.2 dB)

References

[1] E. Nachmani, Y. Adi, and L. Wolf, "Voice separation with an unknown number of multiple speakers," in Proc. ICML, 2020.

[2] Z.-Q. Wang, et al., "TF-GridNet: Making time-frequency domain models great again for monaural speaker separation," in Proc. ICASSP, 2023.

[3] S. R. Chetupalli and E. A. P. Habets, "Speech separation for an unknown number of speakers using transformers with encoder-decoder attractors," in Proc. INTERSPEECH, 2022.