RA1: Analysis of artifacts in synthetic speech

The main goal of this research area is to analyse thoroughly the problems present in synthetic speech, to catalogue them (RA1a), and to describe them on both the “technical” and the phonetic level. By “technical” we mean mainly the dependence of the disruptive effects on the internal mechanism on which the synthesizer is based. One example is the correlation between the occurrence of disruptive effects (collected through listening by expert phoneticians and/or listening tests with naive listeners) and either the values of the criterial (penalization) function or the HMM outputs. The criterial function may be taken as the total cumulative function or as partial functions based on particular components of the target and join cost functions, such as F0, spectrum, energy, and various positional parameters describing, e.g., the position of a speech unit within a word, phrase, or utterance. Statistical analysis and outlier detection techniques are planned for this purpose (RA1b).
From the phonetic point of view, the description will focus on spectral discontinuities, which have been shown to influence listeners in English [MIA06]; it is reasonable to expect a similar effect in Czech. For instance, the typical nasal formants, located on average at 250 and 1000 Hz, and the nasal antiformants (suppressed frequency bands) are not constant across all nasalized vowels and their neighbourhood. Similarly, lip-rounding during articulation lowers the formants in sonorous segments. Also, the same vowel has different spectral characteristics depending on the prominence of the syllable in which it occurs. Thus, if two segments, or two parts of a segment, taken from contexts with different degrees of nasalization, lip-rounding, or prominence meet, a spectral discontinuity with perceptual consequences is likely to occur (RA1c). The knowledge acquired from these analyses will be used in the other RAs.
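As an illustration of the kind of analysis planned in RA1b, the following minimal sketch relates per-unit join cost values to listener judgements and flags unusually high costs. It is only a sketch under stated assumptions: the data, the function names `point_biserial` and `mad_outliers`, and the choice of a MAD-based modified z-score with a threshold of 3.5 are hypothetical and are not taken from the project itself.

```python
from statistics import mean, median, pstdev


def point_biserial(costs, flagged):
    """Pearson correlation between a continuous cost and a 0/1 artifact label."""
    n = len(costs)
    mean_cost, mean_flag = mean(costs), mean(flagged)
    cov = sum((c - mean_cost) * (f - mean_flag)
              for c, f in zip(costs, flagged)) / n
    return cov / (pstdev(costs) * pstdev(flagged))


def mad_outliers(costs, threshold=3.5):
    """Indices of costs whose modified z-score (MAD-based) exceeds threshold."""
    med = median(costs)
    mad = median([abs(c - med) for c in costs])
    if mad == 0:
        return []  # all deviations collapse; no robust scale available
    return [i for i, c in enumerate(costs)
            if 0.6745 * abs(c - med) / mad > threshold]


# Hypothetical data: join cost at eight concatenation points and whether
# listeners reported a disruptive effect there (1) or not (0).
join_costs = [0.8, 1.1, 0.9, 1.0, 4.2, 1.2, 0.7, 3.9]
artifact_heard = [0, 0, 0, 0, 1, 0, 0, 1]

r = point_biserial(join_costs, artifact_heard)
suspicious = mad_outliers(join_costs)
print(f"correlation = {r:.3f}, outlier indices = {suspicious}")
```

The same two functions could be applied to any partial cost (F0, spectral, energy) in place of the total join cost, which would show which component tracks the perceived artifacts most closely.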
Within this RA, based on the previous findings, automatic cleaning of speech corpora (in the sense of fixing the annotation of the source speech recordings [MAT13]) will be performed by training a classifier/detector on positive examples (i.e., with an identified misannotation) and/or negative examples (i.e., with correct annotation) and subsequently applying it to all source recordings (RA1d).
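The train-then-apply workflow of RA1d can be sketched as follows. This is a deliberately simple stand-in, assuming a nearest-centroid classifier and two hypothetical features per speech unit; the actual classifier/detector, the features, and the unit identifiers are not specified in the project description and are invented here for illustration.

```python
from statistics import mean


def train_centroids(examples):
    """Compute one mean feature vector per label.

    examples: list of (feature_vector, label) pairs, where label 1 marks an
    identified misannotation and label 0 a correct annotation.
    """
    centroids = {}
    for label in {lab for _, lab in examples}:
        rows = [vec for vec, lab in examples if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def classify(centroids, vec):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], vec))


# Hypothetical features: (duration deviation, alignment penalty).
training = [
    ([0.2, 1.0], 0), ([0.5, 1.3], 0), ([0.3, 0.8], 0),  # correct annotation
    ([2.8, 4.1], 1), ([3.4, 3.6], 1),                   # misannotated
]
model = train_centroids(training)

# Apply the trained detector to all (hypothetical) source units.
corpus = {"unit_001": [0.4, 1.1], "unit_002": [3.0, 3.9], "unit_003": [0.1, 0.9]}
suspects = [uid for uid, vec in corpus.items() if classify(model, vec) == 1]
print(suspects)
```

When only one kind of example is available (the "and/or" case above), a one-class detector would be needed instead; the outline of training on labelled units and then scanning the whole corpus stays the same.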

Activity	Objective	Workplace	2016	2017	Dissemination
RA1a	Analysis and cataloguing of artifacts	CU	x		Jimp: 1, Jneimp: 1, D: 2
RA1b	Technical description of artifacts	UWB	x
RA1c	Phonetic description of artifacts	UWB	x
RA1d	Automatic cleaning of speech corpora	UWB	x	x
