Project management of NTIS P1 Cybernetic Systems and Department of Cybernetics | WiKKY




RA1: Analysis of artifacts in synthetic speech

The main goal of this research area is to carry out a thorough analysis of the problems present in synthetic speech, to catalogue them (RA1a), and to describe them on both the “technical” and the phonetic level. By “technical” we mean mainly the dependence of the disruptive effects on the internal mechanism the synthesizer is based on. For instance, we will examine the correlation between the occurrence of disruptive effects (collected through listening by expert phoneticians and/or listening tests with naive listeners) and either the values of the criterion (penalization) function or HMM outputs. The criterion function may be the total cumulative function or a partial function based on a particular component of the target and join cost functions, such as F0, spectrum, energy, or various positional parameters describing, e.g., the position of a speech unit within a word, phrase, or utterance. Statistical analysis and outlier detection techniques are planned for this purpose (RA1b). From the phonetic point of view, the description will focus on spectral discontinuities, which have been shown to influence listeners in English [MIA06]; it is reasonable to expect a similar effect in Czech. For instance, the typical nasal formants, located on average around 250 and 1000 Hz, and the nasal antiformants (suppressed frequency bands) are not constant across all nasalized vowels and their neighbourhood. Similarly, lip-rounding during articulation lowers the position of formants in sonorous segments. Also, the same vowel will have different spectral characteristics depending on the prominence of the syllable in which it occurs. Thus, if two segments, or two parts of a segment, that come from contexts with different degrees of nasalization, lip-rounding, or prominence meet, a discontinuity in the spectrum with perceptual consequences is likely to occur (RA1c). The knowledge acquired from these analyses will be used in other RAs.
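
The kind of analysis planned for RA1b can be illustrated with a minimal sketch. It is not the project's actual procedure: the data are invented toy values, the names `join_costs` and `artifact_flags` are hypothetical, and a plain Pearson correlation with a z-score threshold stands in for whatever statistical and outlier detection techniques are eventually chosen.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def zscore_outliers(values, threshold):
    """Indices of values lying more than `threshold` std. deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical toy data: one join-cost value per concatenation point and a
# flag saying whether listeners reported a disruptive effect there (1) or not (0).
join_costs = [0.8, 1.1, 0.9, 1.0, 4.5, 1.2, 0.7, 3.9]
artifact_flags = [0, 0, 0, 0, 1, 0, 0, 1]

# With a binary variable, Pearson correlation is the point-biserial correlation.
r = pearson(join_costs, artifact_flags)

# The threshold of 1.0 is an arbitrary choice suited only to this tiny sample.
outliers = zscore_outliers(join_costs, threshold=1.0)

print(f"correlation between join cost and reported artifacts: {r:.2f}")
print("concatenation points with outlying join cost:", outliers)
```

In this toy sample the two concatenation points flagged by listeners are also the two with outlying join cost; real data would of course be noisier and would need the partial cost components examined separately.
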
Within this RA, and building on the previous findings, speech corpora will be cleaned automatically, in the sense of fixing the annotation of the source speech recordings [MAT13]. A classifier/detector will be trained on positive examples (i.e., with an identified misannotation) and/or negative examples (i.e., with correct annotation) and then applied to all source recordings (RA1d).
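
The train-then-apply workflow of RA1d could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the two features per recording, the utterance identifiers, and the nearest-centroid rule, which merely stands in for whichever classifier/detector the project adopts.

```python
from math import dist


def centroid(vectors):
    """Component-wise mean of a list of equally long feature vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def train(positive, negative):
    """Build a nearest-centroid model from labelled examples."""
    return {"misannotated": centroid(positive), "correct": centroid(negative)}


def classify(model, features):
    """Return the label whose centroid is closest to `features`."""
    return min(model, key=lambda label: dist(model[label], features))


# Hypothetical features per example: (boundary shift in ms, acoustic score).
positive = [(42.0, -9.5), (38.0, -8.7), (51.0, -10.2)]  # identified misannotations
negative = [(4.0, -3.1), (6.5, -2.8), (3.0, -3.6)]      # correct annotations

model = train(positive, negative)

# Apply the trained detector to all source recordings (toy corpus).
corpus = {
    "utt001": (5.0, -3.0),
    "utt002": (45.0, -9.0),
    "utt003": (7.0, -3.3),
}
flagged = [utt for utt, feats in corpus.items()
           if classify(model, feats) == "misannotated"]

print("recordings flagged for annotation review:", flagged)
```

A detector of this kind only proposes candidates; whether flagged recordings are corrected automatically or passed to a human annotator is a separate design decision not settled here.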

Activity Objective Workplace 2016 2017 2018 Dissemination
RA1a Analysis and cataloguing of artifacts CU x Jimp: 1, Jneimp: 1, D: 2
RA1b Technical description of artifacts UWB x
RA1c Phonetic description of artifacts UWB x
RA1d Automatic cleaning of speech corpora UWB x x