PPG-based Voice Conversion

Phonetic PosteriorGrams (PPGs) have been successfully applied to non-parallel VC [1]. A PPG is a sequence of frame-level linguistic representations obtained from a speaker-independent automatic speech recognition (ASR) system. A PPG-based VC framework mainly consists of two key components: a conversion model and a vocoder. The conversion model converts the PPGs extracted from the source speech into acoustic features of the target speaker, and the vocoder then uses these converted features to synthesize the speech waveform of the target speaker.
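The conversion stage of this pipeline can be sketched as follows. This is only an illustrative PyTorch sketch: the module structure, the PPG dimensionality (218), and the acoustic feature dimensionality (20) are assumptions for illustration, not the configuration used in this work.

```python
# Illustrative sketch (PyTorch) of the PPG-based VC pipeline described above.
# All sizes and module choices are assumptions, not the authors' exact setup.
import torch
import torch.nn as nn

class ConversionModel(nn.Module):
    """Maps frame-level PPGs to acoustic features of the target speaker."""
    def __init__(self, ppg_dim=218, acoustic_dim=20, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(ppg_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, acoustic_dim)

    def forward(self, ppgs):            # ppgs: (batch, frames, ppg_dim)
        h, _ = self.rnn(ppgs)
        return self.proj(h)             # (batch, frames, acoustic_dim)

# Conversion stage: PPGs come from a speaker-independent ASR model (not shown),
# and the converted acoustic features are then passed to a vocoder.
ppgs = torch.randn(1, 300, 218)         # e.g. 300 frames of phonetic posteriors
acoustic = ConversionModel()(ppgs)      # features for the vocoder
```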


Baseline System

The vocoder strongly influences the quality of the converted speech. Prior works use conventional parametric vocoders for VC (such as STRAIGHT and WORLD), but these vocoders limit the achievable speech quality. To address this issue, researchers have turned to the WaveNet vocoder, which directly estimates time-domain waveform samples conditioned on input features and has been shown to improve the quality of the generated speech significantly. However, the WaveNet vocoder runs slower than real time. Recently, an efficient neural vocoder called LPCNet [2] was proposed; it can synthesize speech with close-to-natural quality while running faster than real time on a standard CPU. Therefore, we adopt the LPCNet vocoder for VC. To control the prosody of the generated speech, we use the pitch and voiced/unvoiced (V/UV) features as additional inputs.
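As a rough illustration of this conditioning scheme, the sketch below concatenates the converted acoustic features with a per-frame log-F0 value and a V/UV flag. The feature dimensions and the log-F0 preprocessing are assumptions, not the exact features used by our system or by the LPCNet implementation.

```python
# Illustrative sketch of the vocoder conditioning used in the baseline:
# converted acoustic features concatenated with pitch and a V/UV flag
# per frame. Dimensions and preprocessing are assumptions.
import numpy as np

def build_vocoder_inputs(converted_feats, f0, vuv):
    """converted_feats: (frames, feat_dim) output of the conversion model
    f0:              (frames,) pitch in Hz (0 for unvoiced frames)
    vuv:             (frames,) 1.0 for voiced, 0.0 for unvoiced
    Returns the per-frame conditioning matrix fed to the vocoder."""
    log_f0 = np.where(vuv > 0.5, np.log(np.maximum(f0, 1.0)), 0.0)
    return np.concatenate(
        [converted_feats, log_f0[:, None], vuv[:, None]], axis=1)

frames = 300
cond = build_vocoder_inputs(np.random.randn(frames, 20),
                            np.full(frames, 200.0),
                            np.ones(frames))
print(cond.shape)   # (300, 22)
```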


Baseline system framework
Figure 1. Block diagram of the (a) training and (b) conversion stages of the baseline VC system.

Proposed Method

Despite the good performance of the baseline system, it still has some limitations. Prosody is related to many factors, including intonation, stress, rhythm, and pitch, yet the baseline system relies only on the pitch to control the prosody of the generated speech. We find that it performs poorly in some challenging situations, for example when the source speech is a singing voice.

To overcome these limitations, we utilize a reference encoder [3] to learn a latent prosody representation directly from the input speech. We introduce temporal structure into the reference encoder, which enables fine-grained control of the prosody of the converted speech.

Proposed system framework
Figure 2. Block diagram of the (a) training and (b) conversion stages of the proposed VC system.


The architecture of the reference encoder is shown in Fig. 3. It takes a mel-spectrogram as input and consists of a stack of six 2D convolutional layers. Each layer is composed of 3 × 3 filters with a 1 × 2 stride, SAME padding, and ReLU activation. The numbers of filters in the layers are 32, 32, 64, 64, 128, and 128, respectively. The output of the last convolutional layer is fed to a uni-directional Gated Recurrent Unit (GRU) with one unit and a tanh activation. The outputs of the GRU at every time step form the variable-length prosody embedding.

Reference encoder architecture
Figure 3. The prosody reference encoder module: a six-layer stack of 2D convolutions with ReLU activations, followed by a single-layer GRU with one unit and a tanh activation.
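A minimal PyTorch sketch of this module, following the layer configuration stated above, is given below. The number of mel channels (80) and the exact realization of SAME padding are assumptions, and the GRU size simply follows the text (a single unit), so this is an illustrative sketch rather than a faithful reimplementation of our model.

```python
# A possible realization of the reference encoder in Fig. 3: six 3x3 conv
# layers with stride (1, 2), SAME-style padding, ReLU, channel widths
# 32-32-64-64-128-128, followed by a unidirectional single-layer GRU.
# The mel dimension (80) is an assumption.
import math
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, channels=(32, 32, 64, 64, 128, 128),
                 gru_units=1):
        super().__init__()
        convs, in_ch = [], 1
        for out_ch in channels:
            # kernel 3x3, stride (1, 2): time resolution kept, mel axis halved
            convs += [nn.Conv2d(in_ch, out_ch, kernel_size=3,
                                stride=(1, 2), padding=1),
                      nn.ReLU()]
            in_ch = out_ch
        self.convs = nn.Sequential(*convs)
        freq = n_mels
        for _ in channels:              # each layer halves the mel axis (ceil)
            freq = math.ceil(freq / 2)
        self.gru = nn.GRU(channels[-1] * freq, gru_units, batch_first=True)

    def forward(self, mel):             # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1))        # (B, C, frames, freq')
        x = x.permute(0, 2, 1, 3).flatten(2)    # (B, frames, C * freq')
        out, _ = self.gru(x)                    # GRU uses tanh internally
        return out                              # variable-length prosody embedding

emb = ReferenceEncoder()(torch.randn(2, 250, 80))
print(emb.shape)    # torch.Size([2, 250, 1])
```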

Comparison System

Three comparison systems are also implemented to verify the effectiveness of the proposed method; their configurations are summarized in the sketch after the following list.

Comparison system 1: Derived from the baseline system. With the F0 input removed, no acoustic features of the source speech are used to control the prosody of the converted speech. The WORLD vocoder is used for speech generation.
Comparison system 2: Derived from the baseline system. With the F0 input removed, no acoustic features of the source speech are used to control the prosody of the converted speech. The LPCNet vocoder is used for speech generation.
Comparison system 3: Derived from the baseline system. Besides the pitch and V/UV features, the aperiodicity coefficients are also used to control the prosody. The LPCNet vocoder is used for speech generation.
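For reference, the differences between the evaluated systems can be summarized as the following hypothetical configuration table; it simply mirrors the textual descriptions above and is not an actual configuration file from our experiments.

```python
# Hypothetical summary of prosody inputs and vocoders per system,
# transcribed from the descriptions above (the "Proposed" entry follows
# the Proposed Method section).
SYSTEMS = {
    "Baseline":     {"prosody_inputs": ("pitch", "vuv"),                 "vocoder": "LPCNet"},
    "Comparison 1": {"prosody_inputs": (),                               "vocoder": "WORLD"},
    "Comparison 2": {"prosody_inputs": (),                               "vocoder": "LPCNet"},
    "Comparison 3": {"prosody_inputs": ("pitch", "vuv", "aperiodicity"), "vocoder": "LPCNet"},
    "Proposed":     {"prosody_inputs": ("reference-encoder embedding",), "vocoder": "LPCNet"},
}
```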


Speech Samples

The voice conversion experiments are conducted on our Mandarin corpora recorded by professional speakers:
Training corpus: one female speaker (TS) with 15,000 utterances.
Evaluation corpus: one female speaker (MY) and one male speaker (YYX), each providing 20 samples from audiobook recordings. In addition, 20 songs from a male speaker are chosen for testing (Song).

Song to TS

[Audio samples 1–8: for each sample, recordings from the Source, Target, Baseline, Comparison 1, Comparison 2, Comparison 3, and Proposed systems.]

MY to TS

[Audio samples 1–5: for each sample, recordings from the Source, Target, Baseline, Comparison 1, Comparison 2, Comparison 3, and Proposed systems.]

YYX to TS

[Audio samples 1–5: for each sample, recordings from the Source, Target, Baseline, Comparison 1, Comparison 2, Comparison 3, and Proposed systems.]

References

[1] Lifa Sun, Hao Wang, Shiyin Kang, Kun Li and Helen Meng, “Personalized, Cross-lingual TTS Using Phonetic Posteriorgrams,” in INTERSPEECH, 2016, pp. 322–326.

[2] Jean-Marc Valin and Jan Skoglund, “LPCNet: Improving Neural Speech Synthesis through Linear Prediction,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5891–5895.

[3] R. J. Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark and Rif A. Saurous, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” arXiv preprint, 2018.