Proposed Method
A block diagram of our auto-regressive voice conversion model (ARVC) is shown in Figure 1. It consists of three key components: (1) an encoder, (2) a decoder, and (3) waveform synthesis. Specifically, we use frame-level linguistic features, phonetic posteriorgrams (PPGs), as the inputs. The encoder maps the input PPGs into context-dependent representations, and the decoder then predicts acoustic features from the encoder outputs. Finally, the LPCNet vocoder is conditioned on the predicted acoustic features to generate speech.
Compared with conventional seq2seq-based VC, ARVC removes the attention-based duration conversion module, since PPGs already contain duration information; this reduces mispronunciations and improves training stability. Compared with conventional PPG-based VC, ARVC takes the previous-step acoustic features as inputs to produce the next-step outputs via its auto-regressive structure, which yields smoother trajectories and fewer voice errors.
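To make the auto-regressive structure concrete, below is a minimal PyTorch sketch of an encoder and an auto-regressive decoder running over PPG frames. The module names, GRU layers, and feature dimensions are our own illustrative assumptions; the paper's exact architecture, training loss, and LPCNet conditioning details are not reproduced here.

```python
# Minimal sketch of an ARVC-style encoder/decoder (illustrative assumptions only).
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Maps frame-level PPGs to context-dependent representations."""
    def __init__(self, ppg_dim=218, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(ppg_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, ppgs):                      # (B, T, ppg_dim)
        out, _ = self.rnn(ppgs)                   # (B, T, 2 * hidden_dim)
        return out


class ARDecoder(nn.Module):
    """Predicts acoustic features frame by frame, feeding back the previous frame."""
    def __init__(self, enc_dim=512, acoustic_dim=20, hidden_dim=256):
        super().__init__()
        self.rnn_cell = nn.GRUCell(enc_dim + acoustic_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, acoustic_dim)
        self.acoustic_dim = acoustic_dim

    def forward(self, enc_out):                   # (B, T, enc_dim)
        B, T, _ = enc_out.shape
        h = enc_out.new_zeros(B, self.rnn_cell.hidden_size)
        prev = enc_out.new_zeros(B, self.acoustic_dim)    # previous-step acoustic frame
        frames = []
        for t in range(T):                        # PPGs fix the output length: no attention needed
            h = self.rnn_cell(torch.cat([enc_out[:, t], prev], dim=-1), h)
            prev = self.proj(h)
            frames.append(prev)
        return torch.stack(frames, dim=1)         # (B, T, acoustic_dim) -> vocoder features
```

Because the decoder produces exactly one acoustic frame per PPG frame, the duration of the converted speech follows the source directly, which is why no attention-based alignment is required.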

Comparison Systems
Two comparison systems are also implemented to verify the effectiveness of our proposed method.
System I [1]: It employs an SI-ASR and Kullback-Leibler divergence (KLD) based mapping approach to voice conversion without using parallel training data. The acoustic difference between the source and target speakers is equalized with the SI-ASR. KLD is chosen as the distortion measure to find an appropriate mapping from each frame of the source speaker to a frame of the target speaker. Finally, the STRAIGHT vocoder is used to generate the converted waveform.
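As a rough illustration of the frame-mapping idea in System I, the sketch below selects, for each source frame, the target frame with the minimum KL divergence between SI-ASR posteriors. The function names and the use of plain (asymmetric) KLD over raw NumPy posteriorgrams are simplifying assumptions on our part; the original system's exact distortion measure and search strategy may differ.

```python
# Illustrative KLD-based frame selection (assumes SI-ASR posteriorgrams as numpy arrays).
import numpy as np


def kld(p, q, eps=1e-10):
    """Kullback-Leibler divergence between two posterior vectors."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))


def map_frames(source_post, target_post):
    """For each source frame, pick the target frame with minimum KLD distortion.

    source_post: (T_src, D) SI-ASR posteriors of the source utterance
    target_post: (T_tgt, D) SI-ASR posteriors of the target speaker's corpus
    Returns the index of the selected target frame for every source frame.
    """
    mapping = []
    for p in source_post:
        distortions = [kld(p, q) for q in target_post]
        mapping.append(int(np.argmin(distortions)))
    return mapping
```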
System II [2]: It achieved the top rank in both naturalness and similarity in the Voice Conversion Challenge 2018. First, the acoustic features of the source speaker (Mel-cepstral coefficients (MCCs), F0 and band aperiodicities (BAPs)) are converted towards the target speaker using an LSTM-based conversion model. Then, the waveform samples of the converted speech are synthesized by feeding the converted acoustic features into a WaveNet vocoder built for the target speaker. We try our best to reproduce the work in [2]. However, compared with the original system in [2], there are still two major differences: (1) we neither manually correct F0 extraction errors nor remove speech segments with irregular phonation; (2) due to limited training data for VC, Liu et al. [2] train a speaker-dependent WaveNet by adapting a pre-trained multi-speaker model to the target speaker, whereas we have relatively sufficient data and therefore train the WaveNet in System II using only the target speech.
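For reference, a minimal sketch of an LSTM-based conversion model in the spirit of System II is given below. The feature dimensionalities (40 MCCs, 1 F0, 1 V/UV flag, 5 BAPs), layer sizes, and the bidirectional LSTM choice are our assumptions; the WaveNet vocoder itself is omitted.

```python
# Rough sketch of an LSTM-based acoustic conversion model (illustrative assumptions only).
import torch
import torch.nn as nn


class LSTMConversionModel(nn.Module):
    """Maps source-speaker acoustic frames to target-speaker acoustic frames."""
    def __init__(self, feat_dim=47, hidden_dim=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, src_feats):                 # (B, T, feat_dim): MCCs + F0 + V/UV + BAPs
        hidden, _ = self.lstm(src_feats)
        return self.proj(hidden)                  # converted features fed to the WaveNet vocoder
```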
Speech Samples
The voice conversion experiments are conducted on the CMU ARCTIC databases:
Arctic Database
| Conversion (utterance) | Source | I | II | Proposed | Target |
|---|---|---|---|---|---|
| SLT → BDL (arctic_a0018) | (audio) | (audio) | (audio) | (audio) | (audio) |
| RMS → BDL (arctic_a0015) | (audio) | (audio) | (audio) | (audio) | (audio) |
| CLB → SLT (arctic_a0005) | (audio) | (audio) | (audio) | (audio) | (audio) |
| BDL → SLT (arctic_a0009) | (audio) | (audio) | (audio) | (audio) | (audio) |
References
[1] F.-L. Xie, F. K. Soong, and H. Li, "Voice conversion with SI-DNN and KL divergence based mapping without parallel training data," Speech Communication, 2019, pp. 57-67.
[2] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai, "WaveNet Vocoder with Limited Training Data for Voice Conversion," in Proc. Interspeech, 2018, pp. 1983-1987.