Proposed Method

A block diagram of our auto-regressive voice conversion (ARVC) model is shown in Figure 1. It consists of three key components: (1) an encoder; (2) a decoder; (3) waveform synthesis. Specifically, we use frame-level linguistic features, phonetic posteriorgrams (PPGs), as the inputs. The encoder maps the input PPGs into context-dependent representations. The decoder then predicts acoustic features from the encoder outputs. Finally, the LPCNet vocoder is conditioned on the predicted acoustic features for speech generation.

Compared with conventional seq2seq-based VC, ARVC removes the attention-based duration conversion module, since PPGs already contain duration information; this reduces mispronunciations and improves training stability. Compared with conventional PPG-based VC, ARVC takes the previous-step acoustic features as inputs to produce the next-step outputs via its auto-regressive structure, generating smoother feature trajectories and fewer conversion errors.
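
To make the data flow concrete, below is a minimal PyTorch sketch of such an auto-regressive PPG-to-acoustic mapping. The layer types (a bidirectional GRU encoder and a GRU-cell decoder), the feature dimensions, the zero initial frame and the frame-by-frame loop are illustrative assumptions, not the exact ARVC configuration.

    # Minimal sketch of an auto-regressive PPG-to-acoustic model
    # (assumed layer types and dimensions; not the exact ARVC setup).
    import torch
    import torch.nn as nn

    PPG_DIM, ACOUSTIC_DIM, HIDDEN = 218, 20, 256   # hypothetical sizes

    class Encoder(nn.Module):
        """Maps frame-level PPGs to context-dependent representations."""
        def __init__(self):
            super().__init__()
            self.rnn = nn.GRU(PPG_DIM, HIDDEN, batch_first=True, bidirectional=True)

        def forward(self, ppgs):                       # (B, T, PPG_DIM)
            out, _ = self.rnn(ppgs)                    # (B, T, 2 * HIDDEN)
            return out

    class ARDecoder(nn.Module):
        """Predicts acoustic features frame by frame, conditioned on the
        encoder output and the previously predicted acoustic frame."""
        def __init__(self):
            super().__init__()
            self.cell = nn.GRUCell(2 * HIDDEN + ACOUSTIC_DIM, HIDDEN)
            self.proj = nn.Linear(HIDDEN, ACOUSTIC_DIM)

        def forward(self, enc_out):                    # (B, T, 2 * HIDDEN)
            B, T, _ = enc_out.shape
            h = enc_out.new_zeros(B, HIDDEN)
            prev = enc_out.new_zeros(B, ACOUSTIC_DIM)  # previous acoustic frame
            frames = []
            for t in range(T):                         # PPGs fix the duration,
                x = torch.cat([enc_out[:, t], prev], dim=-1)  # so no attention
                h = self.cell(x, h)
                prev = self.proj(h)                    # fed back at the next step
                frames.append(prev)
            return torch.stack(frames, dim=1)          # (B, T, ACOUSTIC_DIM)

    # The predicted frames would then condition an LPCNet vocoder (not shown).
    ppgs = torch.randn(1, 120, PPG_DIM)                # 120 dummy frames
    acoustic = ARDecoder()(Encoder()(ppgs))
    print(acoustic.shape)                              # torch.Size([1, 120, 20])

During training, the previous acoustic frame would typically be the ground-truth frame (teacher forcing); at conversion time the model feeds back its own predictions, as in the loop above.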

Proposed system framework
Block diagram of the ARVC system architecture. It includes an encoder that maps input PPGs to high-level representations and a decoder that predicts acoustic features. Finally, the LPCNet vocoder is conditioned on the predicted acoustic features for speech synthesis.

Comparison Systems

Two comparison systems are also implemented to verify the effectiveness of our proposed method.

System I [1]: It employs a speaker-independent ASR (SI-ASR) and Kullback-Leibler divergence (KLD) based mapping approach to voice conversion without using parallel training data. The acoustic difference between the source and target speakers is equalized with the SI-ASR. KLD is chosen as a distortion measure to find an appropriate mapping from each frame of the source speaker to a frame of the target speaker (a toy sketch of this mapping idea is given after the system descriptions below). Finally, the STRAIGHT vocoder is used to generate the converted waveform.
System II [2]: It achieved the top rank on naturalness and similarity in the Voice Conversion Challenge 2018. First, the acoustic features of the source speaker (including Mel-cepstral coefficients (MCCs), F0 and band aperiodicities (BAPs)) are converted toward the target speaker using an LSTM-based conversion model (a minimal sketch of such a model is also given below). Then, the waveform samples of the converted speech are synthesized by feeding the converted acoustic features into a WaveNet vocoder built for the target speaker. We try our best to reproduce the work in [2]; however, compared with the original system, there are still two major differences: (1) We do not manually correct F0 extraction errors, nor do we remove speech segments with irregular phonation. (2) Due to limited training data for VC, Liu et al. [2] train a speaker-dependent WaveNet by adapting a pre-trained multi-speaker model to the target speaker. In contrast, we have enough data, so the WaveNet in System II is trained only on the target speech.
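
The following is a minimal numpy sketch of the KLD-based frame mapping idea behind System I, written under our own simplifications: a symmetric KL distortion and a brute-force nearest-frame search over SI-ASR posteriors. The exact criterion and search strategy in [1] may differ.

    # Illustrative simplification of KLD-based frame mapping (System I).
    import numpy as np

    def kld(p, q, eps=1e-10):
        """Kullback-Leibler divergence KL(p || q) between posterior vectors."""
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log(p / q)))

    def map_frames(src_post, tgt_post):
        """For each source frame posterior, pick the target frame whose
        posterior is closest under a symmetric KL distortion."""
        mapping = []
        for p in src_post:                                  # (T_src, D)
            d = [kld(p, q) + kld(q, p) for q in tgt_post]   # (T_tgt,)
            mapping.append(int(np.argmin(d)))
        return mapping

    # Dummy posteriors: rows sum to one, as SI-ASR posteriors would.
    rng = np.random.default_rng(0)
    src = rng.random((5, 40));  src /= src.sum(axis=1, keepdims=True)
    tgt = rng.random((50, 40)); tgt /= tgt.sum(axis=1, keepdims=True)
    print(map_frames(src, tgt))   # selected target frame index per source frame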
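
For System II, a minimal PyTorch sketch of an LSTM-based acoustic conversion model is shown below. The feature dimensionality (a single vector concatenating MCCs, F0 and BAPs per frame), the hidden size and the number of layers are assumptions for illustration; the WaveNet vocoder stage is only indicated by a comment.

    # Sketch of an LSTM-based conversion model as in System II (assumed sizes).
    import torch
    import torch.nn as nn

    FEAT_DIM = 41   # hypothetical: e.g. 35 MCCs + 1 log-F0 + 5 BAPs per frame

    class LSTMConverter(nn.Module):
        def __init__(self, hidden=256, layers=2):
            super().__init__()
            self.lstm = nn.LSTM(FEAT_DIM, hidden, num_layers=layers, batch_first=True)
            self.proj = nn.Linear(hidden, FEAT_DIM)

        def forward(self, src_feats):          # (B, T, FEAT_DIM) source features
            out, _ = self.lstm(src_feats)
            return self.proj(out)              # (B, T, FEAT_DIM) converted features

    converted = LSTMConverter()(torch.randn(1, 200, FEAT_DIM))
    # `converted` would then be fed to the target speaker's WaveNet vocoder.
    print(converted.shape)                     # torch.Size([1, 200, 41])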


Speech Samples

The voice conversion experiments are conducted on the CMU-ARCTIC databases:

Arctic Database

Conversion pair (utterance)    Source    I    II    Proposed    Target
SLT->BDL (arctic_a0018)
RMS->BDL (arctic_a0015)
CLB->SLT (arctic_a0005)
BDL->SLT (arctic_a0009)
(Each row provides audio samples of the source speech, System I, System II, the proposed system, and the target speech.)

References

[1] Xie, Feng-Long, Soong, Frank K., and Li, Haifeng, “Voice conversion with SI-DNN and KL divergence based mapping without parallel training data,” Speech Communication, 2019, pp. 57-67.

[2] Liu, Li-Juan, Ling, Zhen-Hua, Jiang, Yuan, Zhou, Ming, and Dai, Li-Rong, “WaveNet Vocoder with Limited Training Data for Voice Conversion,” in Proc. Interspeech, 2018, pp. 1983-1987.