This paper presents a method that generates expressive Peking opera singing voice from a music score. Synthesizing expressive opera singing usually requires pitch contours to be extracted as training data, which relies on automatic pitch-extraction techniques and cannot be done through manual labeling. Building on the Duration Informed Attention Network (DurIAN), this paper uses musical notes instead of pitch contours for expressive opera singing synthesis. The proposed method allows human annotation to be combined with automatically extracted features as training data, giving extra flexibility in data collection for Peking opera singing synthesis. Compared with the expressive Peking opera singing voice synthesized by a pitch-contour-based system, the proposed musical-note-based system produces comparable singing with expressiveness in various aspects.
In short: the model generates Peking opera singing voice from note and phoneme sequences, where pitch, dynamics, and timbre are jointly sampled. An LSTM with a mixture density output is used as the duration model, and Lagrange multiplier optimization is performed to better predict phoneme durations under the note duration constraint. (See the paper for details.)
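The constrained duration step can be sketched as follows. Assuming, for illustration, a single Gaussian per phoneme (the paper uses a mixture density output, and the function name here is hypothetical), maximizing the summed log-likelihood subject to the phoneme durations summing to the note duration has a closed-form Lagrange-multiplier solution: each predicted mean is shifted in proportion to its variance, so less confident phonemes absorb more of the mismatch.

```python
import numpy as np

def constrained_durations(mu, sigma, note_dur):
    """Adjust per-phoneme predicted durations to sum to the note duration.

    Maximizes sum_i -((d_i - mu_i)^2 / (2 sigma_i^2)) subject to
    sum_i d_i == note_dur. The Lagrangian yields d_i = mu_i + lam * sigma_i^2
    with lam = (note_dur - sum(mu)) / sum(sigma^2).
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(sigma, dtype=float) ** 2
    lam = (note_dur - mu.sum()) / var.sum()  # Lagrange multiplier
    return mu + lam * var
```

For example, with predicted means (10, 20) frames and standard deviations (1, 2) under a 33-frame note, the 3 extra frames are split 0.6 / 2.4 according to variance rather than evenly.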
A phoneme encoder first encodes contextual phoneme features.
Then an alignment model adds note information as well as a singer identity embedding, and aligns all feature sequences with the output frames.
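The duration-informed alignment in DurIAN-style models amounts to expanding each phoneme-level feature vector to frame level by repeating it for its duration. A minimal sketch (the function name is an assumption, not from the paper):

```python
import numpy as np

def expand_by_duration(phoneme_feats, durations):
    """Repeat each phoneme's feature vector for its duration in frames,
    producing a frame-level sequence aligned with the output spectrogram.

    phoneme_feats: array of shape (num_phonemes, feat_dim)
    durations:     per-phoneme frame counts (annotated in training,
                   predicted by the duration model in inference)
    """
    return np.repeat(np.asarray(phoneme_feats), durations, axis=0)
```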
During training, phoneme durations are given by the dataset annotation; during inference, they are predicted by the duration model.
An auto-regressive decoder is used to generate the spectrogram.
WaveRNN is used to generate the audio signal from the spectrogram, at a sample rate of 24 kHz and a sample depth of 9 bits.
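A 9-bit sample depth means the waveform is quantized to 512 levels; WaveRNN implementations commonly apply mu-law companding before quantization so the levels are spent where the ear is most sensitive. A sketch of that encoding, assuming mu-law is used here (the page does not state it):

```python
import numpy as np

def mulaw_encode(x, bits=9):
    """Mu-law compand x in [-1, 1], then quantize to 2**bits levels."""
    mu = (2 ** bits) - 1  # 511 for 9 bits
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, bits=9):
    """Invert mulaw_encode: map integer levels back to [-1, 1] samples."""
    mu = (2 ** bits) - 1
    y = 2 * q.astype(float) / mu - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```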
Audio Samples
Singing Synthesis
To verify that the proposed method can generate pitch expressiveness from musical notes, it is compared with the original recordings and with an f0-based system, where the fundamental frequency (f0) is used as input instead of notes.
Original
Proposed Method
Original
Proposed Method
Original
Proposed Method
Original
Proposed Method
Original
Proposed Method
Original
Proposed Method
Original
Proposed Method
Score Input
In training, our notes are obtained by a note-transcription algorithm. To show that the model generalizes to score input, we used the published score of each recording for synthesis.
To demonstrate that the proposed duration model gives better duration predictions than other methods, samples generated using the proposed duration model are compared with samples generated using heuristic fitting methods.
Generated using Peking opera scores:
Proposed System
Proposed System
Proposed System
Proposed System
Proposed System
Proposed System
Proposed System
Peking opera-style singing generated from pop song scores:
Proposed System
Proposed System
Singing Conversion
As our model is conditioned on singer identity (an embedding vector for each singer), singer conversion is possible, in particular conversion across gender:
Original-Female
Conversion-Male
Original-Female
Conversion-Male
Original-Male
Conversion-Female
Original-Male
Conversion-Female
Visualizing f0 in Generated Singing
The f0 of the original singing and that of the reconstructed generation are compared, to demonstrate the capability of the proposed system to generate expressive singing.
The f0 of singing generated from Peking opera score input is shown alongside the input score notes, again demonstrating the expressiveness of the generated singing.