RA3: Phonetically justified parameters for unit-selection and hybrid speech synthesis

The actual experiments in this RA will build on the findings of RA1 and RA2. Nevertheless, our preliminary experiments already indicate that the context of the selected/modelled units is very important [TIH12]. We therefore assume that a well-designed “penalisation matrix” is key to smooth concatenation and proper modelling, because violating context continuity leads to disruptive effects in synthetic speech (RA3a). Such a matrix defines which contexts must be strictly respected during selection/modelling (e.g. the type of labialization, nasality or palatalizing effect) and which contexts can be interchanged. To ensure smooth and imperceptible concatenation in the spectral domain, we would also like to propose phonetically justified parameters (as opposed to the traditionally employed MFCCs) and use them both to describe and to control speech properties during unit modelling and/or selection (RA3b). Our preliminary experiments suggest that spectral tilt could be an effective parameter; it can be expressed by various phonetic indices such as the Kitzing, Hammarberg or Alpha indices, by other filter-bank ratios, or by only the first few MFCC coefficients. Other experiments will focus on the continuity of prominent prosodic parameters (especially F0 and durational patterns) within the synthesized utterance. In addition to a static measurement of parameter continuity around the concatenation point (which by itself cannot prevent unnatural fluctuation of the patterns), we would also like to capture the tendencies of the prosodic patterns at the utterance level and in this way ensure the continuity of the monitored prosodic parameters (RA3c).
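To make the idea of the penalisation matrix concrete, the following minimal sketch shows one possible representation. It is an illustration only: the context classes, the penalty values and the function names (`context_penalty`, `admissible`) are assumptions made for this example, not results of RA1 or RA2. An infinite penalty marks a context that must be strictly respected; a finite penalty marks contexts that may be interchanged at some cost.

```python
import math

# Hypothetical context classes, for illustration only.
CONTEXTS = ("labial", "nasal", "palatal", "neutral")

FORBIDDEN = math.inf  # the required context must be strictly respected

# PENALTY[required][offered]; all values are made up for this sketch.
PENALTY = {
    "labial":  {"labial": 0.0, "nasal": FORBIDDEN, "palatal": FORBIDDEN, "neutral": 0.4},
    "nasal":   {"labial": FORBIDDEN, "nasal": 0.0, "palatal": FORBIDDEN, "neutral": FORBIDDEN},
    "palatal": {"labial": FORBIDDEN, "nasal": FORBIDDEN, "palatal": 0.0, "neutral": 0.6},
    "neutral": {"labial": 0.4, "nasal": FORBIDDEN, "palatal": 0.6, "neutral": 0.0},
}


def context_penalty(required, offered):
    """Penalty for using a unit recorded in context `offered` where `required` is needed."""
    return PENALTY[required][offered]


def admissible(candidates, required):
    """Drop candidates whose context is forbidden and sort the rest by penalty."""
    scored = [(context_penalty(required, c["context"]), c) for c in candidates]
    kept = [(p, c) for p, c in scored if p != FORBIDDEN]
    return [c for _, c in sorted(kept, key=lambda pc: pc[0])]


candidates = [
    {"id": "u1", "context": "nasal"},
    {"id": "u2", "context": "neutral"},
    {"id": "u3", "context": "labial"},
]
ranked = admissible(candidates, "labial")
```

In this sketch the nasal candidate is excluded outright, while the neutral one remains available as a penalised fallback behind the exact match.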
Many positional parameters (the position of a speech unit in a syllable, word or larger supra-word unit such as a phrase) are typically involved in the unit-selection process, causing it to be “overfitted”. In fact, in any real speech corpus all positional parameters can hardly be satisfied at once, so the selection process has to “sacrifice” some parameters for others, and the result is therefore not optimal. In our experiments we would like to revise the parameters (e.g. the position in a syllable seems to be less important in Czech than in some other languages) and, in order to ensure an optimal result, to find a way of weighting all the parameters involved so that they correspond to the phonetic reality of speech perception (RA3d).
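The weighting of positional parameters can be pictured as a weighted mismatch cost. The sketch below is a simplified illustration under stated assumptions: the feature names, the weight values and the functions `target_cost` and `best_candidate` are hypothetical, and the low weight for syllable position merely mirrors the expectation stated above for Czech. The actual weights would have to be derived from perceptual experiments.

```python
# Hypothetical weights; a low value means the parameter may be "sacrificed" first.
WEIGHTS = {"pos_in_syllable": 0.2, "pos_in_word": 1.0, "pos_in_phrase": 2.0}


def target_cost(target, candidate, weights=WEIGHTS):
    """Weighted share of positional parameters on which the candidate differs."""
    total = sum(weights.values())
    mismatch = sum(w for name, w in weights.items() if target[name] != candidate[name])
    return mismatch / total


def best_candidate(target, candidates, weights=WEIGHTS):
    """Pick the candidate with the lowest weighted positional mismatch."""
    return min(candidates, key=lambda c: target_cost(target, c, weights))


target = {"pos_in_syllable": "onset", "pos_in_word": "initial", "pos_in_phrase": "final"}
candidates = [
    # differs only in syllable position (cheap to sacrifice in this sketch)
    {"id": "a", "pos_in_syllable": "coda", "pos_in_word": "initial", "pos_in_phrase": "final"},
    # differs only in phrase position (expensive in this sketch)
    {"id": "b", "pos_in_syllable": "onset", "pos_in_word": "initial", "pos_in_phrase": "medial"},
]
chosen = best_candidate(target, candidates)
```

With these assumed weights the selection prefers the candidate that violates only the syllable position, which is the kind of trade-off the revised weighting is meant to make explicit.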

Activity Objective Workplace 2016 2017 2018 Dissemination
RA3a Context definition and penalization matrix CU x Jimp: 1, Jrec: 1, D: 4
RA3b Phonetically justified parameters (spectral tilt, ...) UWB x x
RA3c Continuity of prosodic patterns UWB x x
RA3d Revision of positional parameters and weighting UWB x