Stefan Uhlich¹, Franck Giron¹, Michael Enenkl¹, Thomas Kemp¹, Naoya Takahashi², Yuki Mitsufuji²
¹Sony European Technology Center (EuTEC), Stuttgart, Germany
²Sony Corporation, Audio Technology Development Department, Tokyo, Japan
stefan.uhlich (at) eu.sony.com
- is_blind: no
- additional_training_data: no
- Code: not available
- Demos: not available
This submission uses a bi-directional LSTM network as described in [1] with three BLSTM layers, each having 500 cells. For each instrument, a separate network is trained to predict the target instrument's amplitude from the mixture amplitude in the STFT domain (frame size: 4096, hop size: 1024). The raw outputs of the networks are then combined by a multichannel Wiener filter as described in [2], where the power spectral densities and spatial covariance matrices are estimated from the DNN outputs.
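The Wiener-filter combination step can be sketched as follows. This is a minimal single-pass sketch, not the submission's implementation: [2] derives an iterative EM variant, and the array shapes, the spatial-covariance estimator, and the regularization constant `eps` below are assumptions for illustration.

```python
import numpy as np

def multichannel_wiener(v, x, eps=1e-10):
    """Single-pass multichannel Wiener filter from DNN magnitude estimates.

    v : (J, F, T)   non-negative source spectrogram estimates (one per instrument)
    x : (F, T, C)   complex mixture STFT with C channels
    Returns (J, F, T, C) complex STFT estimates of the J sources.
    """
    J, F, T = v.shape
    C = x.shape[-1]
    psd = v.astype(np.float64) ** 2 + eps                  # source PSDs v_j(f, t)
    # Spatial covariance R_j(f): PSD-weighted time average of x x^H (an assumed
    # simple estimator; [2] re-estimates it within an EM loop)
    R = np.empty((J, F, C, C), dtype=complex)
    for j in range(J):
        w = psd[j] / psd[j].sum(axis=1, keepdims=True)     # (F, T) weights
        R[j] = np.einsum('ft,ftc,ftd->fcd', w, x, x.conj())
    # Mixture covariance C_x(f, t) = sum_j v_j(f, t) R_j(f)
    Cx = np.einsum('jft,jfcd->ftcd', psd, R) + eps * np.eye(C)
    y = np.einsum('ftcd,ftd->ftc', np.linalg.inv(Cx), x)   # C_x^{-1} x
    # Wiener estimate s_j = v_j R_j C_x^{-1} x
    return np.einsum('jft,jfcd,ftd->jftc', psd, R, y)
```

A useful property of this filter is that it is conservative: the source estimates sum back to the mixture (up to the `eps` regularization), which is why it is a common way to post-process independent per-instrument network outputs.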
The networks are trained on musdb, where we split train into train_train and train_valid with 86 and 14 songs, respectively. The validation set is used for early stopping and hyperparameter selection (LSTM layer dropout rate, regularization strength).
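The split and early-stopping logic described above can be sketched as below; the random seed and the patience value are illustrative assumptions, not values stated in the submission.

```python
import random

def split_train(songs, n_valid=14, seed=42):
    """Shuffle the 100 musdb training songs into train_train / train_valid.

    The seed is a hypothetical choice; the submission does not state how
    the 86/14 split was drawn.
    """
    songs = list(songs)
    random.Random(seed).shuffle(songs)
    return songs[n_valid:], songs[:n_valid]   # 86 training, 14 validation songs

def should_stop(valid_losses, patience=20):
    """Early stopping: stop once the validation loss has not improved
    for `patience` consecutive epochs (patience value assumed)."""
    best = min(range(len(valid_losses)), key=valid_losses.__getitem__)
    return len(valid_losses) - 1 - best >= patience
```

The same validation loss would also drive the hyperparameter selection: train one model per candidate setting (dropout rate, regularization strength) and keep the setting with the lowest train_valid loss at its early-stopping point.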
- [1] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi and Y. Mitsufuji. "Improving music source separation based on deep neural networks through data augmentation and network blending", Proc. ICASSP, 2017.
- [2] A. A. Nugraha, A. Liutkus and E. Vincent. "Multichannel music separation with deep neural networks", Proc. EUSIPCO, 2016.