UHL1

Stefan Uhlich¹, Franck Giron¹, Michael Enenkl¹, Thomas Kemp¹, Naoya Takahashi², Yuki Mitsufuji²

¹Sony European Technology Center (EuTEC), Stuttgart, Germany
²Sony Corporation, Audio Technology Development Department, Tokyo, Japan

stefan.uhlich (at) eu.sony.com

Additional Info

  • is_blind: no
  • additional_training_data: no

Supplemental Material

  • Code: not available
  • Demos: not available

Method

This submission uses a bi-directional LSTM network as described in [1] with three BLSTM layers, each having 500 cells. For each instrument, a separate network is trained that predicts the target instrument amplitude from the mixture amplitude in the STFT domain (frame size: 4096, hop size: 1024). The raw outputs of the networks are then combined by a multichannel Wiener filter as described in [2], where the power spectral densities and spatial covariance matrices are estimated from the DNN outputs.
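A minimal sketch of such a per-instrument network is given below. The framework (PyTorch), the layer names, the 2049-bin input size (FFT length 4096), the dropout placement, and the ReLU output non-linearity are assumptions for illustration only, not details taken from the submission.

```python
import torch
import torch.nn as nn

class BLSTMSeparator(nn.Module):
    """Sketch of a per-instrument magnitude estimator with 3 BLSTM layers."""

    def __init__(self, n_bins=2049, hidden=500, n_layers=3, dropout=0.3):
        super().__init__()
        # Three bi-directional LSTM layers with 500 cells per direction.
        # The dropout rate here is a tuned hyperparameter (see training notes below).
        self.blstm = nn.LSTM(
            input_size=n_bins,
            hidden_size=hidden,
            num_layers=n_layers,
            bidirectional=True,
            batch_first=True,
            dropout=dropout,
        )
        # Map the concatenated forward/backward states back to magnitude bins.
        self.output = nn.Linear(2 * hidden, n_bins)

    def forward(self, mix_mag):
        # mix_mag: (batch, frames, n_bins) mixture magnitude spectrogram
        h, _ = self.blstm(mix_mag)
        # Non-negative estimate of the target-instrument magnitude
        return torch.relu(self.output(h))
```

For the multichannel Wiener filter step, open-source packages such as norbert implement this kind of EM-based estimation of power spectral densities and spatial covariance matrices; the submission does not state which implementation was used.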

The networks are trained on musdb, where the train set is split into train_train (86 songs) and train_valid (14 songs). The validation set is used for early stopping and hyperparameter selection (LSTM layer dropout rate, regularization strength); a sketch of this split follows below.
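For context, the musdb Python package exposes a built-in 86/14 train/validation split of the 100 training songs. Whether this submission uses that exact split is not stated, so the snippet below is only an illustrative assumption.

```python
import musdb

# Illustrative split of the musdb train set into train_train / train_valid.
# Assumption: the musdb Python package and its built-in 86/14 split; the 14
# validation songs actually used for this submission are not specified.
train_train = musdb.DB(subsets="train", split="train")
train_valid = musdb.DB(subsets="train", split="valid")

print(len(train_train.tracks), len(train_valid.tracks))  # expected: 86 14

# Early stopping and hyperparameter selection (dropout rate, regularization
# strength) would monitor the loss on the 14 validation songs.
```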

References

  • [1] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," Proc. ICASSP, 2017.
  • [2] A. A. Nugraha, A. Liutkus and E. Vincent, "Multichannel music separation with deep neural networks," Proc. EUSIPCO, 2016.