Audio samples from "ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis"

Abstract: Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrect tones. To address the issue, we propose the ToneUnit framework, which leverages annotated data with tone labels as CTC supervision to learn tone-aware discrete speech units for Mandarin Chinese speech. Our findings indicate that the discrete units acquired through ToneUnit resolve the "tone shift" issue in synthesized Chinese speech and yield favorable results in English synthesis. Moreover, the experimental results suggest that finite scalar quantization enhances the effectiveness of ToneUnit. Notably, ToneUnit can work effectively even with minimal annotated data.
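
For readers unfamiliar with finite scalar quantization (FSQ), the quantizer mentioned in the abstract, the following is a minimal sketch of the general technique, not the authors' implementation. It bounds each latent dimension with tanh, rounds it to one of a small number of integer levels, and combines the per-dimension levels into a single discrete unit index. The function name `fsq_quantize`, the level counts, and the restriction to odd level counts are assumptions made for illustration only.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize latent vectors z (shape [..., d]) with d per-dimension level counts.

    Illustrative sketch: only odd level counts are handled, so that the
    rounded values are symmetric around zero.
    """
    levels = np.asarray(levels)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch only handles odd level counts")
    half = (levels - 1) / 2                  # e.g. 5 levels -> values in {-2, ..., 2}
    bounded = np.tanh(z) * half              # squash each dimension into [-half, half]
    codes = np.round(bounded).astype(int)    # nearest integer level per dimension
    digits = codes + half.astype(int)        # shift to {0, ..., L-1}
    # Mixed-radix basis: turns the per-dimension digits into one unit index.
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    indices = (digits * basis).sum(axis=-1)
    return codes, indices

# Two 3-dimensional latent frames, 5 levels per dimension (5**3 = 125 possible units).
frames = np.array([[0.0, 0.0, 0.0], [10.0, -10.0, 0.0]])
codes, unit_ids = fsq_quantize(frames, [5, 5, 5])
print(codes)
print(unit_ids)
```

In this sketch the codebook is implicit: its size is the product of the level counts, so no codebook embeddings are learned. A trainable version would also need a straight-through gradient for the rounding step, which is omitted here.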

Chinese Speech Synthesis

	面包价格会跟风上涨吗	确立了无人撼动的行业老大地位	从而把运输机转化为更为复杂的远程侦察机	清华大学通过调取监控录线发现	畜牧业产值占农业总产值比重百分之
Ground Truth
HuBERT + k-means
SPIRAL + k-means
SPIRAL + VQ
SPIRAL + FSQ

English Speech Synthesis

	The combined bands of both the countries played the music and a fine supper was served	I never knew of but one man who could ever please him	To say nothing said montalais so that when mademoiselle de tonnay charente thinks athenais is the only one who knows it	I expect you have been a very good girl andella since you were here last	Mister edison was a leader far ahead of the time
Ground Truth
HuBERT + k-means
SPIRAL + k-means
SPIRAL + VQ
SPIRAL + FSQ