MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY
-------------------------------
Thesis for the degree of
MASTER OF SCIENCE
Modeling the prosody
of Vietnamese language for speech
synthesis
Speciality: “Information processing and Communication”
Code:23.04.3898
MẠC ĐĂNG KHOA
Supervisor:
Prof. PHẠM THỊ NGỌC YẾN
Hanoi, 2007
Faculty of Information Technology
International research center of
Multimedia Information, Communication and Application
Master thesis
Mạc Đăng Khoa
Acknowledgment
Many people gave me generous help and inspiration during my time as a master's
student.
First, I would like to express my deep respect and gratitude to my supervisors,
Dr. Eric Castelli and Prof. Phạm Thị Ngọc Yến. Thank you very much for orienting
and guiding my research in the speech processing domain, and for all your useful
advice, your honest criticism and your patience during my master's research.
Special thanks also go to Mrs. Geneviève Caelen-Haumont, PhD students Trần
Đỗ Đạt and Vũ Minh Quang, and all members of MICA’s speech group. I could not
have completed this thesis without your support. Thank you all for your
suggestions and your sincere remarks on the whole of my research.
I would like to thank Ms. Đồn Thị Ngọc Hiền, who guided me in recording the
corpus. I would also like to thank the many MICA members who spent much of their
time recording and testing for my research.
I am grateful to Prof. Nguyễn Trọng Giảng and MICA’s directorate for providing me
with the best possible working conditions during my time at the International
Research Center MICA.
Finally, I owe a great deal to my parents and my sister for their continued support. I
also give very special thanks to my girlfriend for her constant encouragement,
which gave me strength and motivation in my work and in my life.
Abstract
A Text-To-Speech (TTS) system is a computer system that produces speech from
text. In a TTS system, the naturalness of the produced speech depends greatly on
the variation of pitch, duration and energy during speaking. We call the ability to
control this variation the “prosody controlling ability”. A TTS system with good
prosody controlling ability can simulate human speech prosody appropriate to the
speaking context.
In tonal languages such as Vietnamese, the prosody of an utterance results from the
combination of two components: "micro-prosody", corresponding to the tone of each
syllable in a sentence, and "macro-prosody", corresponding to the whole sentence.
The main goal of this thesis is to model the characteristics of Vietnamese prosody
for speech synthesis. It focuses on the influence of macro-prosody on
micro-prosody in three types of sentence: assertive, interrogative and imperative.
The first task is to set up a “prosody corpus” and extract all possible prosody
parameters. Based on the extracted data, we defined seventy-two simple prosody
patterns for Vietnamese syllables in the three types of sentence. These patterns
were then applied to synthesize some simple sentences. Finally, perception
experiments were carried out to evaluate the synthesized sentences. The results
showed that the proposed patterns can be applied successfully to generate the
prosody of simple sentences.
This is our preliminary work on Vietnamese prosody, concerning only the sentence
type and the position of the syllable in the sentence. In the future, we expect to
continue this research with more factors of Vietnamese prosody, improve our
patterns and apply them to a Vietnamese TTS system.
List of Figures
Figure 1-1: Category of methods for predicting syllable duration [6]
Figure 2-1: Example of the contours of six tones, as described in [21]
Figure 2-2: The shape of Tone 1 with female and male voice [18]
Figure 2-3: The shape of Tone 2 with female and male voice [18]
Figure 2-4: The shape of Tone 3 with female and male voice [18]
Figure 2-5: The shape of Tone 4 with female and male voice [18]
Figure 2-6: The shape of Tone 5 with female and male voice [18]
Figure 2-7: The shape of Tone 5b with female and male voice [18]
Figure 2-8: The shape of Tone 6 with female and male voice [18]
Figure 2-9: The shape of Tone 6b with female and male voice [18]
Figure 2-10: Sentence classification by structure [20]
Figure 2-11: The sentences “Lan thích ăn cơm không” in
Figure 2-12: The sentences “Bảo cố gắng tập đi” in
Figure 2-13: The sentences “Tân bỏ đi chứ” in
Figure 2-14: The differences of F0 contour between Assertive and Interrogative
sentence [16]
Figure 3-1: A general function diagram of TTS system [13]
Figure 3-2: Fujisaki model
Figure 3-3: Fujisaki model for tonal language [19]
Figure 3-4: Function diagram of proposal TTS system
Figure 3-5: Prosody generation module
Figure 4-1: Key-syllable segmentation
Figure 4-2: Extracting F0 contour using PRAAT
Figure 4-3: An example of prosody pattern
Figure 5-1: An example of synthesized non-sense phrase
Figure 5-2: Perception test 1
Figure 5-3: An example of synthesized multi-type sentences
Figure 5-4: Interface for Perception test 2
Figure 5-5: Correct recognition rate with 8 tones of last syllable
Figure 5-6: Correct recognition rate (%) with other types of sentences
Figure 5-7: Result comparison of three experiments
List of Tables
Table 1.1: Prosody functions
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Table 1.3: Intonation model classification
Table 2.1: Vietnamese vowels
Table 2.2: Vietnamese consonants
Table 2.3: Arrangement of Vietnamese consonants
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of
each phonetic unit [14]
Table 2.5: The six Vietnamese tones
Table 3.1: Comparison between direct pattern and model pattern
Table 4.1: Prosody corpus structure
Table 4.2: Prosody corpus text information
Table 4.3: Recording information of Prosody corpus
Table 5.1: Confusion matrix (in %) for 8 tones with male voice
Table 5.2: Confusion matrix (in %) for 8 tones with female voice
Table 5.3: Confusion matrix (in %) of sentence types with male voice
Table 5.4: Confusion matrix (in %) of sentence types with female voice
Table 5.5: Test data for Experiment 2
Table 5.6: Confusion matrix (in %) of sentence types (with male voice)
Table 5.7: Confusion matrix (in %) of sentence types (with female voice)
Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)
Table 5.9: Correct recognition rate (%) with other types of sentences
Table 5.10: Result of three experiments
Table of contents
Acknowledgment
Abstract
List of Figures
List of Tables
Table of contents
0 INTRODUCTION
1 PROSODY AND PROSODIC MODEL
1.1. Overview of prosody
1.1.1. The concept of prosody
1.1.2. Major components of prosody
1.1.3. The functions of prosody
1.1.4. Levels of representation of prosodic phenomena
1.2. Prosody modeling
1.2.1. Intonation models
1.2.2. Duration modeling
1.2.3. This thesis work approach
2 VIETNAMESE LANGUAGE AND PROSODY
2.1. Vietnamese language
2.1.1. Vietnamese characteristics
2.1.2. Vietnamese phoneme system
2.1.3. Syllable structure
2.2. Vietnamese prosody
2.2.1. Micro-prosody and tones system in Vietnamese
2.2.2. Macro-prosody and sentence types in Vietnamese
2.2.3. Some special phenomena in Vietnamese prosody
3 TTS SYSTEM AND PROSODY GENERATION
3.1. An overview of TTS system
3.2. Prosody generation
3.2.1. Overview of prosody generation
3.2.2. From text to prosody
3.3. Other researches and our proposal
4 PROSODY PATTERNS EXTRACTION
4.1. Prosody corpus
4.1.1. Objectives
4.1.2. Define the corpus text
4.1.3. Recording
4.1.4. Sentence segmentation
4.2. Analysis and extracting prosody parameters
4.2.1. Segmentation
4.2.2. Extracting prosody parameters of key-syllable
4.3. Proposal the patterns for Vietnamese prosody
4.3.1. Methodology
4.3.2. Prosody patterns
4.3.3. Some visual remarks on extracted patterns
5 EXPERIMENTS AND EVALUATION
5.1. Experiment 1: Tone and non-sense phrase
5.1.1. Objectives
5.1.2. Method and Implementation
5.1.3. Results and discussion
5.2. Experiment 2: Multi-type sentences
5.2.1. Objectives
5.2.2. Method and Implementation
5.2.3. Results and discussion
5.3. Comparison and conclusion
6 CONCLUSION AND PERSPECTIVES
REFERENCES
APPENDIX
A. Text for prosody corpus
B. Datasheet of prosody patterns
Chapter 0: Introduction
Mạc Đăng Khoa
0
Introduction
Speech is the primary means of communication between people. Speech synthesis,
the automatic generation of speech waveforms, has been under development for
several decades. Recent progress has produced synthesizers with very high
intelligibility, but sound quality and naturalness remain a major problem.
Most recent research attempts to improve the naturalness of synthesized speech
so that it approaches human speech.
In Vietnam, there are currently some Vietnamese synthesis systems, such as VnVoice
(developed by the Institute of Information Technology) and HoaSung (developed by
the International Research Center MICA). This research has obtained some
encouraging results. However, before these systems can be released to the market,
the quality of the produced speech has to be improved, especially the naturalness
of its prosody.
Thus, this thesis aims to study the characteristics of Vietnamese prosody in order
to apply them to speech synthesis. This work is carried out at the International
research center of Multimedia Information, Communication and Application (MICA)
and is part of MICA’s project VN-Synthesis.
Together with the research of PhD student Tran Do Dat at MICA, we have already
developed a speech synthesis system using sound sample concatenation
techniques. The first version can now produce sound from a detailed text
description, which consists of:
• The sequence of phonemes composing the utterance: it can be obtained
automatically from the raw text using a “phonetization” module, whose
development is currently underway.
• All information related to voice modulations: mostly the pitch, energy and
duration variations that constitute the intonation or prosody of the uttered
statement. We call it the “prosody description”.
For tonal languages such as Vietnamese, the prosody of speech is composed of two
components, which we call “micro-prosody” and “macro-prosody”:
• Micro-prosody is the variation of pitch, duration and intensity within an
individual word or syllable. In a tonal language, micro-prosody is
essential for distinguishing the syllable’s tone. Thus, the meaning of the
synthesized speech depends greatly on the quality of the micro-prosody.
• Macro-prosody is the application of prosody to a whole phrase or sentence.
It depends on the type of sentence, the speaker's intentions, emotions, etc.
Therefore, the "naturalness" of synthesized speech depends on the ability
to control macro-prosody during the speech synthesis process.
Objectives and Tasks
This thesis is part of MICA's speech synthesis research. Its main goal is to extract
the characteristics of Vietnamese prosody in order to generate the “prosody
description” for speech synthesis.
In this thesis, we focus only on how Vietnamese tones differ across positions in
the sentence and across types of sentence. In other words, we study the
influence of macro-prosody on micro-prosody.
The first task is to set up a corpus for research on Vietnamese prosody. With this
corpus, we extract and analyze the fundamental frequency, duration and
intensity parameters of syllables carrying the eight Vietnamese tones, in three
positions and in three types of sentence.
After that, using these prosody parameters, we define simple prosody patterns
for Vietnamese tones, covering syllables in three types of sentence: assertive,
interrogative and imperative. By applying these patterns to re-synthesize some
simple sentences and running perception experiments, we can examine the
appropriateness of these prosody patterns.
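The figure of seventy-two patterns quoted in the abstract is consistent with one pattern per combination of eight tones, three syllable positions and three sentence types (8 × 3 × 3 = 72). The sketch below enumerates such a pattern inventory under that reading. The tone labels follow the figure captions of Chapter 2; the position names `initial`, `middle` and `final` are assumptions made for illustration, since the text here only speaks of "three positions".

```python
from itertools import product

# Tone labels as they appear in the figure captions of Chapter 2 (eight tones).
TONES = ["1", "2", "3", "4", "5", "5b", "6", "6b"]
# Hypothetical position names: the text only says "three positions".
POSITIONS = ["initial", "middle", "final"]
SENTENCE_TYPES = ["assertive", "interrogative", "imperative"]


def pattern_keys():
    """Return one (tone, position, sentence_type) key per prosody pattern."""
    return list(product(TONES, POSITIONS, SENTENCE_TYPES))


keys = pattern_keys()
print(len(keys))  # 72
print(keys[0])    # ('1', 'initial', 'assertive')
```

Each key could then index a record holding the F0, duration and intensity values extracted for that case.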
Thesis outline
This thesis is structured as follows:
• Chapter 1 starts with Section 1.1, which gives some background on prosody
together with the definitions and terms used in this thesis. Section
1.2 briefly presents prosody modeling and some prosodic models.
• Chapter 2 gives an overview of the Vietnamese language and Vietnamese
prosody.
• Chapter 3 starts with an introduction to Text-to-Speech systems, the
general structure of a TTS system and prosody generation. In the last
section of this chapter, we present some related work and propose a
simple structure for the prosody generation module of a TTS system.
• Chapter 4: Sections 4.1 and 4.2 describe our work on setting up and
analyzing the Vietnamese prosody corpus. In Section 4.3, we propose a set
of prosody patterns for Vietnamese syllables.
• Chapter 5 presents a series of perception experiments for evaluating
our proposed patterns.
• Chapter 6 closes with the conclusions drawn from the work presented in the
thesis and suggestions for further work.
Chapter 1: Prosody and Prosodic model
Mạc Đăng Khoa
1
Prosody and Prosodic model
In this chapter, we give an overview of prosody and explain some terms we use in
this thesis. The concept of modeling prosody and some prosodic models are also
briefly presented after that.
1.1. Overview of prosody
1.1.1. The concept of prosody
There is no exact definition of the term “prosody”. We can use the term
broadly, meaning “a time series of speech-related information that is not
predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to
the phoneme sequence” [1].
Viewed broadly, prosody is a parallel channel for communication, carrying
information that cannot be simply deduced from the lexical channel. All
aspects of prosody are transmitted by muscle motions, and in most of them the
recipient can perceive, fairly directly, the motions of the speaker.
Clearly, under that broad definition, hand gestures and eyebrow and face
motions can be considered prosody, because they carry information that modifies
and can even reverse the meaning of the lexical channel. However, in the domain of
speech processing, we concentrate on the speech aspects of prosody. Thus,
prosody includes “Pitch”, “Duration” and “Stress”. In terms of the speech
signal, prosody is represented by three components: “Fundamental frequency
(F0)”, “Duration” and “Intensity”.
“Prosody” and “Intonation”
The term prosody refers to certain properties of the speech signal such as audible
changes in pitch, loudness, and syllable length. For some authors the set of prosodic
features also includes other aspects related to speech timing such as rhythm and
speech rate. [13]
Some authors use the term intonation as a synonym for prosody; others restrict it
to the tonal (melodic) aspects of prosody. In this thesis, intonation refers to pitch
variation in speech production and is part of prosody [13]. In other words, we have:
Prosody = Intonation + Duration
1.1.2. Major components of prosody
As discussed above, prosody consists of:
• Pitch (Fundamental frequency): Among prosodic events, the most overt are
changes in pitch, which together constitute the pitch contour of the
utterance (the F0 contour of the speech signal). Analyses of sentence-level
pitch contours show that the pitch contour of a longer utterance can be
broken down into a sequence of elementary contours, which can further be
divided into syllabic contours. [13]
• Duration: duration in prosody concerns the length of a sentence,
phrase, word, syllable, voiced part of a syllable, syllabic nucleus, and so on.
The duration of syllables and speech sounds depends on several
(independent or interdependent) factors such as speech rate, rhythm,
phonetic nature, etc. In most cases, the absolute duration of an event is
easily measured. Sometimes, however, the boundary of an event is not
obvious to define.
• Stress (Intensity): stress is a prosodic property that has been described
since the very first work on prosody in phonetics. It was said to be related
to loudness and phonology force. Both these characterizations refer to the
perceptual form of prosody: the syllable carrying stress is prominent with
respect to the surrounding syllables, either due to its loudness or to its
dynamic properties.
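To make the three acoustic components concrete, the following sketch measures F0, duration and intensity on a synthetic signal. It is only an illustration under simplifying assumptions: the "syllable" is a pure 200 Hz sine wave, F0 is estimated with a plain autocorrelation search, and intensity is taken as the RMS amplitude. The function names and parameter values are hypothetical, and this is not the extraction procedure used in the thesis, which relies on PRAAT (see Figure 4-2).

```python
import math


def synth_tone(f0_hz, duration_s, sample_rate=8000, amplitude=0.5):
    """Generate a pure sine wave standing in for a voiced syllable."""
    n = round(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * f0_hz * i / sample_rate)
            for i in range(n)]


def estimate_f0(samples, sample_rate, f_min=75.0, f_max=400.0):
    """Estimate F0 as the lag with the largest autocorrelation sum.

    The sum is left unnormalized, so it slightly favours shorter lags,
    which helps to avoid picking a multiple of the true period here.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def duration_seconds(samples, sample_rate):
    """Duration of the segment in seconds."""
    return len(samples) / sample_rate


def rms_intensity(samples):
    """Root-mean-square amplitude, a simple stand-in for intensity."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


SAMPLE_RATE = 8000
signal = synth_tone(200.0, 0.2, SAMPLE_RATE)
f0 = estimate_f0(signal, SAMPLE_RATE)
duration = duration_seconds(signal, SAMPLE_RATE)
rms = rms_intensity(signal)
print(f"F0 = {f0:.1f} Hz, duration = {duration:.3f} s, RMS = {rms:.3f}")
```

Real speech needs voicing detection, windowing and octave-error handling, which is why a dedicated tool is normally used for F0 extraction.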
1.1.3. The functions of prosody
Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-
linguistic information. Linguistic functions such as stress and tone tend to be
expressed as local excursions of pitch movement. Intonation types and para-
linguistic functions may affect the global pitch setting, in addition to characteristic
local pitch excursion near the edge of the sentence (i.e. boundary tones). [1]
Prosody used to convey lexical meaning: Stress, accentual and tone languages.
• Stress language: English is an example of a stress language. Stress
location is part of the lexical entry of each English word. For example,
"apple" and "orange" both have stress on the first syllable, while
"banana" has stress on the second syllable. When an English word is
spoken in isolation in declarative intonation, f0 typically peaks on the
stressed syllable.
• Accentual language: Japanese is an example of an accentual language. A
word is lexically marked as accented (on a particular syllable) or un-
accented. A simplified description is that pitch rises near the beginning of
an accentual phrase and falls on the accented syllable. For detailed
analysis, see Beckman and Pierrehumbert (1988).
• Tone language: Mandarin and Vietnamese are examples of lexical tone
languages. Each syllable is lexically marked with one lexical tone.
Tones have distinctive pitch contours. Altering the pitch contour may
change the lexical meaning of a word, and perhaps the meaning of a
sentence. For example, in Vietnamese the syllables “ta” (we), “tà” (lap of
dress), “tã” (nappy), “tả” (to describe), “tá” (twelve) and “tạ” (quintal)
all have different meanings.
Prosody used to convey non-lexical information: Intonation type (Question vs.
declarative sentences).
Languages may employ prosody in different ways to differentiate declarative
sentences from questions. A general trend is that questions are associated with
higher pitch somewhere in the sentence, most commonly near the end. This may be
manifested as a final rising contour, or as a higher or expanded pitch range near
the end of the sentence. In English, declarative intonation is marked by a falling
ending while yes-no question intonation is marked by a rising one, as can be seen
on a sentence-final word such as the digit "one" in the English examples of [1].
Russian questions, on the other hand, use strong emphasis on a key word instead of
a rising tail. Chinese questions are manifested by an expanded pitch range near
the end of the sentence, while the speaker preserves the lexical tone shapes. [1]
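The rising versus falling ending described for English can be illustrated with a very simple measurement: the least-squares slope of the last part of an F0 contour. The sketch below is an assumption-laden toy, not a sentence-type classifier. The F0 values are invented, the 30% tail and the 0.5 Hz-per-frame threshold are arbitrary, and the heuristic would not carry over unchanged to Russian, Chinese or Vietnamese, where the text notes that other cues are used.

```python
def final_slope(f0_contour, tail_fraction=0.3):
    """Least-squares slope (Hz per frame) over the last part of an F0 contour."""
    n_tail = max(2, round(len(f0_contour) * tail_fraction))
    tail = f0_contour[-n_tail:]
    xs = range(len(tail))
    mean_x = sum(xs) / len(tail)
    mean_y = sum(tail) / len(tail)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, tail))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


def describe_ending(f0_contour, threshold=0.5):
    """Label the end of a contour as rising, falling or level."""
    slope = final_slope(f0_contour)
    if slope > threshold:
        return "rising"
    if slope < -threshold:
        return "falling"
    return "level"


# Invented F0 values in Hz, one per analysis frame.
declarative = [220, 215, 210, 205, 200, 195, 190, 180, 170, 160]
yes_no_question = [200, 200, 198, 196, 195, 195, 200, 210, 225, 240]

print(describe_ending(declarative))      # falling
print(describe_ending(yes_no_question))  # rising
```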
Prosody used to convey discourse functions: Focus, prominence, discourse
segments, etc.
Topic initialization is typically associated with high pitch. Pitch is typically raised
in the discourse initial section and lowered in the discourse final section. Also, new
information in the discourse structure is typically accented while old information
de-accented. [1]
Prosody used to convey emotion.
Most experiments studying emotional speech study stylized emotion, as delivered
by actors and actresses. In these acted-out emotions, a few categories of emotions
can be reliably identified by listeners, and one can find consistent acoustic
correlates of these categories. For example, excitement is expressed by high pitch
and fast speed, while sadness is expressed by low pitch and slow speed. Hot anger is
characterized by over-articulation, fast, downward pitch movement, and overall
elevated pitch. Cold anger shares many attributes with hot anger, but the pitch range
is set lower.
The study of emotion in natural speech is a lot more complicated. It is generally
recognized that speakers show mixed feelings and ambiguous states of mind, and
the emotions do not fall into clear cut categories.[1]
Table 1.1 summarizes the functions of prosody:
Table 1.1: Prosody functions
Modifying meaning                                   Not modifying meaning
Linguistic              Paralinguistic              Discourse function   Extralinguistic
(lexical information)   (non-lexical information)
- Tone                  Sentence type:              - Focus              - Emotion
- Accent                - Assertive                 - Prominence         - Sex of speaker
                        - Interrogative             - …                  - …
                        - Imperative
In this thesis work, we just focus on studying the functions of prosody which
modify meaning, namely tones and sentence types in Vietnamese prosody.
1.1.4. Levels of representation of prosodic phenomena
As for other properties of the speech signal, prosodic events can be studied at
various levels of representation (see Table 1.2) [13]
• First, the acoustic level: the acoustic manifestation of prosody
(fundamental frequency, amplitude, and duration) can be measured
directly, using specialized hardware or algorithms (such as pitch
determination algorithms).
• Second, the perceptual level represents the prosodic events as heard by
the listener. As for spectral properties of speech sounds, acoustic
characteristics that can be measured are not always perceptible. The
perceptual representation is accessible to the individual listener, but this
mental representation can hardly be measured. Alternatively it can be
computed with a fair amount of precision on the basis of our knowledge
about psychoacoustics.
• Finally, the linguistic level represents the prosody of an utterance as a
sequence of abstract units (signs, symbols), some of which have a
communicative function in speech, while others may just fulfill syntactic
requirements. The linguistic structure of prosody is not some hidden code
that simply can be revealed using some standard procedure.
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Acoustic                     Perceptual   Linguistic
Fundamental frequency (F0)   Pitch        Tone, intonation, aspect of stress
Intensity                    Loudness     Aspect of stress
Duration                     Length       Aspect of stress
Given the different nature of these representations, it is important to keep them
apart. It can be helpful to have the terminology reflect the level of representation.
For instance, measuring loudness is not the same as measuring signal energy. The
perception of loudness is not related exclusively to the amplitude at one point of
the signal; it also depends on the duration of the speech fragment whose loudness
is being measured, and on the loudness of other parts of the signal.
As one moves away from acoustic level towards the perceptual and/or linguistic
levels, the measurement of some given prosodic property will progressively involve
segmentation (for example, into syllables), context (such as relative prominence),
and structural information (the linguistic interpretation of a syllabic tone, for
example, often depends on whether the related syllable is stressed or not, which
requires a prior analysis of the segmental layer).
1.2. Prosody modeling
Prosodic models serve two purposes: On one hand, they can be scientific
hypotheses that explain how we communicate with each other, and what we
communicate. On the other hand, they can be engineered software systems that are
part of a dialog system or speech synthesizer. To a lesser extent - and this is mostly
potential - a prosodic model can be the background for a system to recognize
prosody in human speech.
In general, a prosodic model combines two components: an intonation
model and a duration model. In this section, we give an overview of
some methods for predicting intonation (F0 contours) and duration that have
actually been applied in speech synthesis.
1.2.1. Intonation models
1.2.1.1 Intonation model classification
The primary goal of intonation research is to model natural f0 contours of speech,
preferably in relation to a transcription and a description of the prosodic intent of
the speaker. The starting point of intonation research is the time series of F0. But
the interpretation of the F0 information diverges widely among intonation models.
Table 1.3 shows one way to classify the various intonation
models.
Table 1.3: Intonation model classification
(intonation models classified by the way they describe prosody)
                      Under-specified   -          -           Fully specified
Single component      INTSINT           ToBI, Xu   Tilt, IPO   Olive, Machine learning
Two components        Grønnum           -          Fujisaki    -
Multiple components   -                 -          -           Van Santen
Under-specified or Fully specified
The shape of an accent may be fully-specified (i.e. defined without gaps) or under-
specified (defined by disconnected regions or isolated points). Along another
dimension, f0 values at any given time may be treated as a single component or as
the combination of multiple components.
The advantage of using an under-specified accent shape is that it allows sufficient
distance between specified accent targets to allow a smooth f0 transition, typically
by way of interpolation. The drawback is that it ignores changes of shape between
specified targets.
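The idea of under-specified targets joined by interpolation can be illustrated with a minimal Python sketch. It is only an illustration and not the procedure of any model in Table 1.3: the function name interpolate_f0, the target values and the use of linear interpolation (instead of splines or a smoothing filter) are assumptions made for this example.

```python
from bisect import bisect_right


def interpolate_f0(targets, times):
    """Linearly interpolate F0 (Hz) at `times` from sparse (time, f0) targets.

    `targets` must contain at least one point and no duplicate times
    (an assumption of this sketch).
    """
    targets = sorted(targets)
    xs = [t for t, _ in targets]
    ys = [f for _, f in targets]
    contour = []
    for t in times:
        if t <= xs[0]:
            contour.append(ys[0])        # hold the first target before it
        elif t >= xs[-1]:
            contour.append(ys[-1])       # hold the last target after it
        else:
            i = bisect_right(xs, t)      # xs[i - 1] <= t < xs[i]
            w = (t - xs[i - 1]) / (xs[i] - xs[i - 1])
            contour.append(ys[i - 1] + w * (ys[i] - ys[i - 1]))
    return contour


# Three hypothetical accent targets: (time in s, F0 in Hz)
targets = [(0.0, 120.0), (0.5, 180.0), (1.0, 100.0)]
contour = interpolate_f0(targets, [0.0, 0.25, 0.5, 0.75, 1.0])
```

Everything between two targets is determined by the interpolation rule alone, which is exactly the drawback noted above: any change of shape between the specified targets is lost.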
On the other hand, a system with fully specified accents leaves little room to resolve
conflicting targets. A simple concatenation of fully-specified accents will result in a
pitch curve with unnatural jumps at the concatenation joints. Many systems, such as
Fujisaki (1983, 1988), use filters to smooth out abrupt changes in F0. Alternatively,
van Santen (1997, 2000) requires each accent to begin and end at zero to ensure
smooth connections between accents.
Single component or many components?
Many intonation models treat surface intonation contours as the superposition of a
phrase component and an accent component. Grønnum (1992) and Fujisaki (1983,
1988) are representatives of this view.
A well-defined model that fully specifies accent shape and uses multiple components
is van Santen's model (van Santen and Möbius, 1997, 2000; van Santen et al.,
1998), where accents are represented by densely populated points, providing a
mechanism to describe highly complex accent shapes in detail. We characterize van
Santen's system as having multiple components because, in addition to the phrase
component, each accent in the phrase also adds a phrase-length component that
contributes to the surface f0 contour.
The advantage of multiple components is that it provides a mechanism to separate
individual accents from long-term effects. However, if one allows multiple
components, then one necessarily faces the problem that there is no unique solution
in the decomposition of a single f0 time series into multiple components [1]. Any
such decomposition depends on a model of the speech process, and is only as good
as the underlying model.
In contrast, Liberman and Pierrehumbert (1984) explicitly reject the notion of a
phrase curve and represent intonation contours as a single component. The
advantage of representing f0 information as a single component is that the
representation of accent heights will then be transparent, which lends itself to
convenient automatic labeling. [1]
1.2.1.2 Some prosody models
The following gives an overview of the intonation models in Table 1.3:
• INTSINT (Hirst et al., 2000) is an underspecified intonation system that
defines an accent by a single point. Fitting quadratic spline curves
through these points generates surface f0.
• ToBI: The most widely used under-specified accent shape is represented
by the ToBI model (Beckman and Ayers, 1997; Silverman et al., 1992),
which developed from earlier works such as Pierrehumbert (1980),
Liberman and Pierrehumbert (1984), and Pierrehumbert and Beckman
(1988). Each accent is represented by no more than two points, which
specify abstractly the relative contrast of high (H) and low (L). One goal
of the ToBI system is to specify a minimal set of categorical labels for
intonation. These labels are usually interpreted as phonological
distinction between accent types.
• Xu: Xu et al. (1999) represent Chinese tones with under-specified, static
or dynamic targets. The surface f0 contours are generated with a model
that approaches these targets asymptotically within the domain of a
syllable.
• Tilt (Taylor, 2000; Taylor, 1998) allows more samples than ToBI near
the peak of an accent and leaves the other regions unspecified, hence its
status half way to a fully specified system. Tilt considers all accent types
to be continuous variations of a single class. Surface variations are
accounted for by changes in the continuous parameters.
• IPO (de Pijper, 1983) prepares a piecewise-linear approximation to the
pitch contour. The slope and height of these lines are then associated with
various types of accents.
• Olive: Olive (1975) described a very early fully-specified system,
following work by Levitt and Rabiner (1970). His model stored the surface
pitch vs. time contour as a function of the grammatical structure of the
sentence. The contour was then approximated by polynomial splines
attached to words, to allow for duration variations.
• Machine-learning: Several works using machine-learning techniques
generate densely sampled f0 values, including Chen et al. (1992) and
Malfrère et al. (1998). We classify these works as fully specified systems
even though in some cases the concept of accent may not be clear. Ross
and Ostendorf (1999) described an interesting machine learning system
where a discrete learning system would predict vectors attached to
phonemes and syllables, and these vectors would in turn drive a (learned)
dynamical system to predict f0.
• Fujisaki: Fujisaki’s phonetic intonation model (Fujisaki and Kawai,
1982) was developed from the filter method first
proposed by Öhman (1967). Fujisaki states that intonation contours are
composed of two types of components, the phrase and the accent. The
production process is represented by a glottal oscillation mechanism
which takes phrase and accent information as input and produces a
continuous F0 contour as output. The input to the mechanism is in the
form of impulses, used to produce phrase shapes, and step functions,
which produce accent shapes [10]. The Fujisaki model has been
successfully applied to decompose F0 contours in many languages such as
Japanese, German and Finnish, and in some tonal languages such as Chinese
and Thai. Research on applying the Fujisaki model to
Vietnamese is currently under way [11]. We will return to this model in Chapter
3.
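The superposition of a phrase component and an accent component described for the Fujisaki model can be sketched in Python as follows. The sketch uses the commonly published form of the model, in which the responses to phrase impulses and accent steps are added to a base frequency in the logarithmic F0 domain. The function names, the default constants (alpha = 3.0, beta = 20.0, gamma = 0.9) and the example commands are assumptions chosen for illustration; they are not values taken from this thesis.

```python
import math


def phrase_response(t, alpha=3.0):
    """Response to a phrase impulse: Gp(t) = alpha^2 * t * exp(-alpha * t) for t >= 0."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Response to an accent step, limited to the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrases, accents):
    """F0 in Hz at time t (s).

    fb      : base frequency in Hz
    phrases : list of (onset_time, amplitude) impulses
    accents : list of (onset_time, offset_time, amplitude) steps
    """
    log_f0 = math.log(fb)
    for t0, ap in phrases:
        log_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accents:
        log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return math.exp(log_f0)


# Hypothetical utterance: one phrase command and one accent command
phrases = [(0.0, 0.5)]
accents = [(0.3, 0.6, 0.4)]
contour = [fujisaki_f0(k * 0.01, 100.0, phrases, accents) for k in range(150)]
```

Because the components are simply added, very different sets of commands can produce nearly the same contour, which is the non-uniqueness problem of multi-component decomposition mentioned in Section 1.2.1.1.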
1.2.2. Duration modeling
We now give a general overview of modeling the duration component of prosody.
Common methods to predict duration in speech synthesis differ in the following
aspects: [6]
• Durational unit predicted: the temporal unit predicted by most current
systems is either the phone (phoneme), often referred to as “segment”,
or the syllable. Since phone durations are eventually required for the
acoustic synthesis, all syllable-based models include some kind of
mechanism for calculating segment durations from the syllable
duration. For example, in Barbosa and Bailly’s model, the basic unit is
delimited by the onset of the nuclear vowel and the onset of the following
vowel. They are computed by a sequential network constrained by an
internal clock (basically the speaking rate).
• Predictor factors: Every model uses a particular vector of input features,
which are extracted on the linguistic and phonetic levels. The most commonly
employed factors include:
    - on the syllabic level: the degree of accentuation and the position in
      a higher-level unit, such as the foot or accent group;
    - on the segmental level: the properties of the phone to be
      synthesized and of its neighboring phones;
    - on the phrase level: the location of a segment with respect to a
      minor or major boundary and the position of the phrase in the
      sentence.
• The prediction method: The algorithms used for calculating a numerical
duration value from the vector of input features can be roughly divided
into rule systems and statistical approaches. In Figure 1-1, the statistical
approaches are subdivided
into parametric and non-parametric regression models. Whereas the
structure of a parametric regression model, in terms of how it processes the
input factors, is determined a priori, non-parametric regression models are
developed by unsupervised training and the model structure is determined
automatically (multi-layer perceptrons, CARTs). The main difference
between rule-based and statistical models is that a rule system can be built
on relatively little speech data. The formulation of the rules, however,
requires a high amount of expert knowledge and considerable optimization
effort by trial and error. In contrast, building a statistical model is a
relatively effortless process. Furthermore, the importance of individual
factors can be easily assessed by the way the statistical models prioritize
them.
Figure 1-1: Category of methods for predicting syllable duration [6]
• Pause prediction: Some current approaches incorporate the prediction of
speech pauses as part of the model, others treat pauses strictly separately.
• Speech rate: Many current TTS systems produce different speech rates
by linearly scaling the durations output by the duration model. As the
speech rate not only affects the duration of individual segments, but also
the overall prosodic structure of an utterance, this kind of modification
needs to take place at an earlier step of processing, when the phrasal
structure of an utterance is determined.
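The linear scaling of durations mentioned under "Speech rate" can be sketched in a few lines of Python. The function name scale_durations, the convention that a rate above 1 means faster speech, and the example durations are assumptions made for this illustration.

```python
def scale_durations(durations_ms, rate):
    """Linearly scale predicted segment durations (ms).

    rate > 1 shortens every segment (faster speech),
    rate < 1 lengthens every segment (slower speech).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [d / rate for d in durations_ms]


# Hypothetical output of a duration model for three segments
predicted = [200.0, 150.0, 300.0]
faster = scale_durations(predicted, 1.25)

# Every segment keeps the same share of the utterance after scaling
shares_before = [d / sum(predicted) for d in predicted]
shares_after = [d / sum(faster) for d in faster]
```

The shares before and after scaling are identical, which shows the limitation discussed above: uniform scaling changes the tempo but cannot change the phrasing or the relative timing of the utterance.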
1.2.3. Approach of this thesis
Modeling intonation and duration in prosody is a complex field, related to both
linguistics and acoustics. There are many different methods to predict the
intonation and duration of speech. However, no method has so far been
fully applied to Vietnamese.
Within the scope of this thesis, we use a statistical approach to extract some basic
patterns, which model only some basic cases of Vietnamese prosody.
Our approach to modeling Vietnamese prosody can be summarized as follows:
• Method: statistical approach: calculate the average values of F0, intensity and
duration from a corpus.
• Intonation modeling:
    - Under-specified: using 20 isolated points.
    - Single component.
• Phrase level factors: three types of phrase: assertive, interrogative and
imperative.
• Syllable level factors: tone of the syllable.
• Extra-linguistic factors: male/female voice.
This approach is described in more detail in Chapter 4.
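As a rough illustration of the statistical method listed above, the following Python sketch averages several F0 contours into one 20-point pattern. The time normalization by linear resampling, the function names resample and average_pattern, and the example contours are assumptions of this sketch; the actual extraction procedure is the one described later in the thesis.

```python
def resample(contour, n_points=20):
    """Resample an F0 contour (list of Hz values) to n_points by linear interpolation."""
    if len(contour) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    out = []
    for k in range(n_points):
        pos = k * (len(contour) - 1) / (n_points - 1)   # position in input samples
        i = min(int(pos), len(contour) - 2)
        w = pos - i
        out.append(contour[i] * (1.0 - w) + contour[i + 1] * w)
    return out


def average_pattern(contours, n_points=20):
    """Average several contours of different lengths into one n_points pattern."""
    resampled = [resample(c, n_points) for c in contours]
    return [sum(column) / len(resampled) for column in zip(*resampled)]


# Two hypothetical realizations of the same syllable, with different lengths
pattern = average_pattern([[100.0, 200.0], [200.0, 300.0, 400.0]])
```

Resampling to a fixed number of points makes syllables of different durations comparable, so that the average can be taken point by point.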
Chapter 2: Vietnamese language and prosody
2
Vietnamese language and prosody
The understanding of phonetic and phonological characteristics of a language has an
important role in the studies on speech processing in general and on prosody
analysis in particular. Thus, in this chapter, we give a review of Vietnamese
language and Vietnamese prosody.
2.1. Vietnamese language
2.1.1. Vietnamese characteristics
Vietnamese is an amorphous language and a tonal/musical
language. It has the following characteristics [21]:
1. Vietnamese words are amorphous: they do not change form to show
grammatical categories. By contrast, French distinguishes masculine and feminine
forms (étudiant - étudiante, nouveau - nouvelle) and singular and plural forms
(amie - amies).
2. Vietnamese word structure does not use affixes (prefixes, suffixes, and
infixes); Vietnamese is a non-affix language. By contrast, in
French or in English, the antonym of a word can be formed by adding a prefix such as “im-”,
“ir-” or “un-”: impolite, unreadable, irregular…
3. Vietnamese word structure uses very few morphemes. Vietnamese
has at most twenty thousand syllables with which to create morphemes; thus
Vietnamese does not have the features of inflectional languages.
Its morpheme index (number of morphemes M / number
of words W) is about 1.06 [13], the lowest index among the 5000 languages
of the world [13]. A language whose morpheme index is less than 2 is an
analytic language.
The amorphous feature of Vietnamese is an essential characteristic, which
influences its other characteristics.
4. Vietnamese is a tonal/musical language. It has
six tones, and each tone contributes to forming the morpheme and the
meaning of a word, e.g. ba, bá, bà, bả, bã, bạ; me, mé, mè, mẻ, mẽ, mẹ. The
tones give Vietnamese a musical character and make
sentences rhythmic and melodious.
5. A syllable (isolated word) of Vietnamese in full structure has five
parts: initial sound (consonant), medial sound (semi-vowel), nucleus sound
(vowel or diphthong), final sound (consonant or semi-vowel) and tone. In
a word, the consonant and the vowel play an essential role: they are the core of the
word. They can create one syllable by themselves. Apart from the initial
consonant, the rest of the word is called a final (vần). Vietnamese has 155
basic finals [13].
6. In Vietnamese, syllable boundaries and morpheme boundaries coincide: one
syllable is one morpheme. In French, partir (to leave) has two syllables, par-tir,
and two morphemes, part-ir; vendeur (seller) has two syllables, ven-deur, and
two morphemes, vend-eur. In English, “words” has one syllable and two
morphemes. In Vietnamese, the sentence “Đẹp vô cùng tổ quốc ta ơi!” (Tố
Hữu) has seven morphemes, seven syllables, and five words (three simple
words: đẹp, ta, ơi, and two compound words: vô cùng, tổ quốc). In conclusion,
the basic Vietnamese unit is at once a syllable, a morpheme and a real word.
7. Almost all Vietnamese vocabulary is formed from one or two morphemes, and is
monosyllabic or bisyllabic, sometimes polysyllabic. About 80% of words
are bisyllabic.
8. The difference between the written language and the spoken language in
grammatical and phonetic rules is not large.
9. Over the course of its formation and development, Vietnamese
has borrowed many words from foreign languages. Han (Chinese)
words are the most numerous, followed by French words, and part of them have been
fully assimilated into Vietnamese. For example, đấu tranh, giai cấp,
hoà bình, độc lập, tự do, hạnh phúc are Han words (Chinese words); nhà ga
(gare), xà phòng (savon), cà phê (café) are French words.
2.1.2. Vietnamese phoneme system
Vietnamese phoneme system includes 14 vowels or vowel combinations and 22
consonants.
The Vietnamese vowels include 11 vowels and three diphthongs [21]. All vowels
are voiced sounds.
Table 2.1: Vietnamese vowels.
Transcription   Reading             Letters          Example
/a/             a                   a                a ha
/ă/             ă                   ă                con mắt
/ɤ̆/             ấ                   â                ân cần
/ε/             e                   e                e dè
/e/             ê                   ê                ê chề
/i/             short i & long y    i                ỉ eo, ý chí
/ɔ/             o                   o                co ro
/o/             ô                   ô                hồ đồ
/ɤ/             ơ                   ơ                bơ phờ
/u/             u                   u                tù mù
/ɯ/             ư                   ư                lừ đừ
/ie/            ia, yê              ia, iê, ya, yê   kia kìa, yêu kiều
/uo/            ua                  ua, uô           tua rua, luôn luôn
/ɯɤ/            ưa                  ưa, ươ           lưa thưa, lượt thượt
Vietnamese has 22 consonants [21], listed in Table 2.2.
Table 2.2: Vietnamese consonants.
Transcription   Reading               Letter     Example
/b/             bê and bờ             b          bồng bềnh
/p/             pê and pờ             p          ốp ép
/v/             vê and vờ             v          vẩn vơ
/f/             phờ                   ph         phôi pha
/m/             em-mờ and mờ          m          mơ màng
/d/             đê and đờ             đ          đất đai
/t/             tê and tờ             t          tin tưởng
/t’/            thờ                   th         thơ thẩn
/s/             ich-xì and xờ         x          xa xôi
/z/             dê, giê and dờ        d, gi      duyên dáng, giữ gìn
/n/             en-nờ and nờ          n          nôn nóng
/l/             e-lờ and lờ           l          long lanh
/ʈ/             trờ                   tr         trồng trọt
/ʂ/             ét-sì and sờ          s          sung sướng
/ʐ/             e rờ and rờ           r          róc rách
/c/             chờ                   ch         chông chênh
/ɲ/             nhờ                   nh         nhọc nhằn
/ŋ/             ngờ                   ng, ngh    ngô nghê
/k/             xê, ca, quy and cờ    c, k, q    con cá, kĩu kịt, qua quít
/x/             khờ                   kh         khúc khích
/ɣ/             gê, giê and gờ        g, gh      gồ ghề
/h/             hát and hờ            h          hả hê
Vietnamese consonants can be distinguished by the following features: manner
and place of articulation, stop or fricative, voiced or unvoiced,
aspirated or non-aspirated, nasal or non-nasal, and the position of the consonant in the syllable.
Based on these features, Vietnamese consonants can be arranged as in Table 2.3.
Table 2.3: Arrangement of Vietnamese consonants.
                                         Articulation position
Articulation method                      labial   apical-dental   apical-palatal   laminal   dorsal   glottal
Stop, noise, aspirated                            t’
Stop, noise, non-aspirated, unvoiced     p        t               ʈ                c         k
Stop, noise, non-aspirated, voiced       b        d
Stop, sonant                             m        n                                ɲ         ŋ
Fricative, noise, unvoiced               f        s               ʂ                          x        h
Fricative, noise, voiced                 v        z               ʐ                          ɣ
Fricative, lateral sonant                         l
2.1.3. Syllable structure
Vietnamese grammarians and linguists have long considered the syllable
a fundamental unit of Vietnamese. A syllable in full structure (a tonal syllable) has
five parts: initial sound, medial sound, nucleus sound, final sound and tone (Table 2.4)
[21]. For instance, the syllable “toán” has the following
components: initial sound /t/, medial sound /o/, nucleus sound /a/, final sound /n/,
and tone “sắc” (or rising tone). A syllable must have a nucleus sound; the other
components are optional. A nucleus sound alone can form a syllable, for instance a,
ơ, ê… Apart from the initial sound (called the INITIAL part), the rest of the syllable is
called the FINAL part. A tone is a fundamental frequency variation spreading over
the whole syllable. A tone has the same function as a phoneme: it is always assigned to a
syllable and its influence covers the entire syllable. There are a few constraints: if
a syllable ends with an unvoiced stop consonant /p, t, k/, only the “sắc” and “nặng” tones are
possible; otherwise, in all varieties of Vietnamese, the whole tonal paradigm can
occur.
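The tone of a written syllable and the constraint on stop-final syllables can be checked with a short Python sketch. It relies on the fact that Unicode decomposition (NFD) separates the tone mark from the vowel letter. The function names tone_of and is_valid_tone are hypothetical, the tone numbers follow the ordering used in this thesis (1 ngang, 2 huyền, 3 ngã, 4 hỏi, 5 sắc, 6 nặng), and the spelling of final stops as p, t, c or ch is an orthographic assumption of the sketch.

```python
import unicodedata

# Combining marks produced by NFD decomposition, mapped to tone numbers
TONE_MARKS = {
    "\u0300": 2,  # grave accent  -> huyền
    "\u0303": 3,  # tilde         -> ngã
    "\u0309": 4,  # hook above    -> hỏi
    "\u0301": 5,  # acute accent  -> sắc
    "\u0323": 6,  # dot below     -> nặng
}


def tone_of(syllable):
    """Return the tone number (1 to 6) of an orthographic Vietnamese syllable."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return 1  # no tone mark -> ngang


def is_valid_tone(syllable):
    """Syllables ending in a stop may carry only sắc (5) or nặng (6)."""
    ends_in_stop = syllable.lower().endswith(("p", "t", "c", "ch"))
    return (not ends_in_stop) or tone_of(syllable) in (5, 6)


examples = ["toán", "ba", "bà", "bả", "bã", "bạ", "học", "mát"]
tones = [tone_of(s) for s in examples]
```

Vowel-quality marks such as the circumflex, breve and horn decompose to other combining characters and are ignored by the lookup, so only the five tone marks are detected.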
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each
phonetic unit [14].
TONAL SYLLABLE (6492)
    BASE SYLLABLE (2376)
        Initial (22)
        Final (155)
            Medial (1)   Nucleus (16)   Ending (8)
    TONE (6)
2.2. Vietnamese prosody
As Vietnamese is a tonal language, its prosody is composed of two components, which
we call “micro-prosody” and “macro-prosody”:
• Micro-prosody is the variation of pitch, duration and intensity of an
individual word or syllable. For a tonal language, micro-prosody is very
important for distinguishing the tone of a syllable. Thus, the lexical meaning of
the synthesized sound depends heavily on the quality of the micro-prosody.
• Macro-prosody is the application of prosody to a whole phrase or
sentence. It depends on the type of sentence and on the speaker's intentions or
emotions. Therefore, the "naturalness" of synthesized sentences
depends heavily on the ability to control macro-prosody during the speech synthesis
process.
2.2.1. Micro-prosody and tones system in Vietnamese
In Vietnamese, micro-prosody depends heavily on the tone of the syllable. Each tone
contributes to forming the morpheme and the meaning of a word; it is also a
distinctive signal. The tone has the same function as a phoneme: it is always assigned
to a syllable and its influence covers the entire syllable. The tones give
Vietnamese a musical character and make sentences rhythmic and
melodious.
There are six tones in Vietnamese; they are shown in Table 2.5.
Table 2.5: The six Vietnamese tones.
Tone 1   Tone 2      Tone 3    Tone 4    Tone 5    Tone 6
ngang    huyền ‘\’   ngã ‘~’   hỏi ‘?’   sắc ‘/’   nặng ‘.’
Figure 2-1: Example of the contours of six tones, as described in [21].
• Tone 1 - Level tone (“ngang”): a high tone. At the beginning of the
syllable, it is the highest tone. The steady state of the level contour is
observed consistently. Figure 2-2 shows the shape of tone
1 for male and female voices (the two lines represent the maximum and
minimum F0 values).
Figure 2-2: The shape of Tone 1 with female and male voice [18]
• Tone 2 - Falling tone (“huyền”): the onset of the falling tone is
lower than tone 1, tone 5 and tone 3. The low F0 at the onset
gradually falls toward the end.
Figure 2-3: The shape of Tone 2 with female and male voice [18]
• Tone 3 - Broken tone (“ngã”): the onset is as high as that
of tone 5 and higher than that of the falling tone. The second third of the
contour of this tone is characterized by an abrupt dip caused by
glottalization. In most cases, the bottom of the dip occurs between the
mid-point and the point two-thirds from the onset. A creaky voice is
heard during this dip.
Figure 2-4: The shape of Tone 3 with female and male voice [18]
• Tone 4 - Curve tone (“hỏi”): the onset is the lowest among the six tones.
The low onset falls further gradually until the point two-thirds from the
onset. From this point, the extremely low F0 starts to rise toward the end.
Figure 2-5: The shape of Tone 4 with female and male voice [18]
• Tone 5 - Rising tone (“sắc”): the onset is also high. Starting from high
onset, the F0 gradually rises for the first two thirds of the duration. After
this point, the rise becomes more rapid.
Figure 2-6: The shape of Tone 5 with female and male voice [18]
• Tone 5b: when a tone 5 syllable ends with a stop consonant (t, p, c, k), the onset is higher than
that of tone 5 in other syllables (tone 5a) and the F0 rises rapidly over a short duration. We call
this tone tone 5b.
Figure 2-7: The shape of Tone 5b with female and male voice [18]
• Tone 6 - Drop tone (“nặng”): the onset is usually higher than that of the
falling or curve tone but considerably lower than the tone 1, tone 5 and
tone 3. This tone is characterized by a glottalization at the end and also by
its considerably shorter duration than the other tones: the duration of this
tone is approximately two thirds of that of the other tones. The main body of this
tone is almost level or slightly falling.
Figure 2-8: The shape of Tone 6 with female and male voice [18]
• Tone 6b (tone 6 ending with a stop consonant): the onset is nearly equal to that of
tone 2. The F0 falls toward the end over a short duration.
Figure 2-9: The shape of Tone 6b with female and male voice [18]
These descriptions hold only for the Northern dialect, in particular the Hanoi dialect,
which is the standard dialect of Vietnamese. They differ in the
dialects of the South and the Center of Vietnam. In these regions, there are only 5
tones instead of the 6 of the Hanoi dialect, because tone 3 and tone 4 are pronounced
identically.
In continuous speech, tones seldom reach their target values. They are generally
affected by context (stressed vs. unstressed syllable, influence of neighbouring
tones, tempo…) and by some phenomena of Vietnamese prosody, which
we will discuss later.
2.2.2. Macro-prosody and sentence types in Vietnamese
As mentioned above, macro-prosody depends on the type of sentence and on the speaker's
intentions or emotions. In this thesis, we discuss only the role of sentence types in
Vietnamese prosody.
The classification of Vietnamese sentences
• Classification by purpose: Sentences can be classified based on their
purpose [8]:
    - An assertive sentence or declaration: the most common type,
      commonly makes a statement. Ex: Tôi sẽ về nhà. (I am going
      home.)
    - An interrogative sentence or question: is commonly used to
      request information. Ex: Khi nào anh sẽ làm việc? (When are you
      going to work?)
    - An imperative sentence or command: is ordinarily used to make a
      demand or request. Ex: Mở cửa ra! (Open the door!)
    - An exclamatory sentence or exclamation: is generally a more
      emphatic form of statement. Ex: Ngày hôm nay tuyệt quá! (What a
      wonderful day this is!)
• Classification by structure: Sentences can also be classified based on their
structure (by the number and types of finite clauses), as in Figure 2-10.
Figure 2-10: Sentence classification by structure [20]
Within the scope of this thesis, we have studied only the macro-prosody of assertive,
interrogative and imperative sentences with single structure.
Nguyễn and Boulakia [8] reported the following prosodic characteristics
of three types of sentence (assertive, interrogative and imperative):
• Duration (Tempo). Interrogative sentences (Q) are shorter than
Assertive sentence (S) and this difference is significant. Imperatives (I)
are even shorter, but the differences with Q and S are not significant.
• Intensity: The difference is significant for the S/I pair (assertive versus
imperative), but not for the S/Q and Q/I pairs.
• Fundamental frequency. The F0 mean value of Interrogative sentences
and Imperative utterances is higher than that of Statements, while there is
no difference between Interrogative and Imperative sentences. There is an
obvious difference in the last syllable. The phonologically "level" (high)
tone falls in Statements and is much higher and rising in Questions, while
the mean value and movement is half way between for Imperatives. The
rising tones, rise even more in the case of Interrogative and Imperative
than in Statement sentences. It means that there is an influence of the
intonation on the final-syllable tone of the sentence.
Figure 2-11: The sentence “Lan thích ăn cơm không” in
Assertive (S) and Interrogative (Q) mode [8]
Figure 2-12: The sentence “Bảo cố gắng tập đi” in
Assertive (S) and Imperative (I) mode [8]
Figure 2-13: The sentence “Tân bỏ đi chứ” in
Interrogative (Q) and Imperative (I) mode [8]
In the research of Vu M. Q. et al. [16], they found that the main part of differences in
intonation._.
Tone and sentence type
      Imperative       7   43   50   100
6     Assertive       67   10   23   100
      Interrogative   20   60   20   100
      Imperative      17   30   53   100
6b    Assertive       67   20   13   100
      Interrogative   10   53   37   100
      Imperative      17   23   60   100
Chapter 6: Conclusion and Perspectives
Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)
Tone and sentence type     Perceived type of sentence (%)
Tone   Intended            Assertive   Interrogative   Imperative   Total
1      Assertive           77          20              3            100
       Interrogative       20          60              20           100
       Imperative          10          33              57           100
2      Assertive           75          20              5            100
       Interrogative       23          65              12           100
       Imperative          42          17              42           100
3      Assertive           70          27              3            100
       Interrogative       32          48              20           100
       Imperative          52          23              25           100
4      Assertive           68          20              12           100
       Interrogative       23          72              5            100
       Imperative          25          25              50           100
5      Assertive           60          35              5            100
       Interrogative       32          50              18           100
       Imperative          15          28              57           100
5b     Assertive           48          45              7            100
       Interrogative       12          80              8            100
       Imperative          10          40              50           100
6      Assertive           50          12              38           100
       Interrogative       27          63              10           100
       Imperative          15          25              60           100
6b     Assertive           63          13              23           100
       Interrogative       30          43              27           100
       Imperative          22          15              63           100
[Bar chart. x-axis: tone of last syllable (1, 2, 3, 4, 5, 5b, 6, 6b); y-axis: correct recognition rate (%); series: Assertive, Interrogative, Imperative]
Figure 5-5: Correct recognition rate with 8 tones of last syllable
As can be seen, assertive sentences ending with tones 1, 2, 3 and 4 were correctly
identified in about 70% to 80% of judgments. For tones 5, 5b, 6 and 6b, the
correct recognition rate was fair, about 50% to 60%. With tone 5b, 45% of assertive
sentences were confused with interrogative sentences.
The results were fairly good for interrogative sentences ending with tones 1, 2,
4, 5b and 6 (60% or more correct, and as much as 80% with tone 5b). With the other tones,
the rate was 40% to 50%.
For imperative sentences ending with tones 1, 4, 5, 5b, 6 and 6b, the correct recognition
rate was fair, about 50% to 60%. With tone 2 and tone 3, 42% and 52% of
imperative sentences, respectively, were confused with assertive sentences.
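The way a correct recognition rate is read from a confusion matrix can be illustrated with the values of Table 5.8. The sketch below takes, for each intended sentence type, the percentage of judgments on the diagonal (intended type equals perceived type) and averages it over the eight tones. The data structure and the function name correct_rate are assumptions of this sketch, and the unweighted average over tones is also an assumption, since the exact procedure used for the summary figures is not stated here.

```python
TYPES = ("assertive", "interrogative", "imperative")

# Table 5.8: tone -> intended type -> % perceived as (assertive, interrogative, imperative)
CONFUSION = {
    "1":  {"assertive": (77, 20, 3),  "interrogative": (20, 60, 20), "imperative": (10, 33, 57)},
    "2":  {"assertive": (75, 20, 5),  "interrogative": (23, 65, 12), "imperative": (42, 17, 42)},
    "3":  {"assertive": (70, 27, 3),  "interrogative": (32, 48, 20), "imperative": (52, 23, 25)},
    "4":  {"assertive": (68, 20, 12), "interrogative": (23, 72, 5),  "imperative": (25, 25, 50)},
    "5":  {"assertive": (60, 35, 5),  "interrogative": (32, 50, 18), "imperative": (15, 28, 57)},
    "5b": {"assertive": (48, 45, 7),  "interrogative": (12, 80, 8),  "imperative": (10, 40, 50)},
    "6":  {"assertive": (50, 12, 38), "interrogative": (27, 63, 10), "imperative": (15, 25, 60)},
    "6b": {"assertive": (63, 13, 23), "interrogative": (30, 43, 27), "imperative": (22, 15, 63)},
}


def correct_rate(sentence_type):
    """Average diagonal entry (intended == perceived) over all tones, in %."""
    column = TYPES.index(sentence_type)
    diagonal = [rows[sentence_type][column] for rows in CONFUSION.values()]
    return sum(diagonal) / len(diagonal)


rates = {t: correct_rate(t) for t in TYPES}
```

The resulting averages are close to the combined ("Both") figures reported in Table 5.9.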
In summary, the average correct recognition rates are given in Table 5.9 and Figure
5-6.
Table 5.9: Correct recognition rate (%) with other types of sentences
                 Rate (%)
Sentence type    Male voice   Female voice   Both
Assertive        60           68             64
Interrogative    57           63             60
Imperative       53           48             50
Global           56           60             58
[Bar chart. x-axis: voice (Male, Female, Both); y-axis: correct recognition rate (%); series: Assertive, Interrogative, Imperative]
Figure 5-6: Correct recognition rate (%) with other types of sentences
The correct recognition rate for assertive and interrogative sentences is
better with the female voice than with the male voice. For imperative sentences, however,
the male voice gives a better result. Overall, the average correct recognition rate over both voices is
approximately 64% for assertive sentences, 60% for interrogative sentences and 50% for
imperative sentences.
5.3. Comparison and conclusion
For comparison, we used the result of another perception test, on pseudo-sentences (1),
which was carried out in other research at the MICA center [17]. In that test, the
listener had to choose between the two types “interrogative” and “assertive” for the
pseudo-sentence they listened to.
Table 5.10 and Figure 5-7 compare the results of our two
experiments (with pattern-based synthesized sentences) with those of the other experiment
on pseudo-sentences (which we call Experiment M):
Table 5.10: Result of three experiments
Experiment and     Experiment 1             Experiment 2             Experiment M
voice of synthesis (non-sense sentence)     (multi-type sentence)    (pseudo-sentence)
Sentence type      Male   Female   Both     Male   Female   Both     Male   Female   Both
Assertive          68     55       61       60     68       64       58     69       64
Interrogative      49     50       50       57     63       60       61     74       68
Imperative         36     42       39       53     48       50       -      -        -
Global             51     49       50       56     60       58       60     72       66
[Bar chart. x-axis: Experiment 1, Experiment 2, Experiment M; y-axis: correct recognition rate (%); series: Assertive, Interrogative, Imperative]
Figure 5-7: Result comparison of three experiments
(1) A pseudo-sentence is a re-synthesized sentence without semantic information: it is resynthesized from the vowel /a/
and simulates the prosody of an actual sentence [17].
Overall, the result of Experiment 2 is better than that of the previous experiment, which used
non-sense sentences (50% correct). The global correct recognition rate rises from 50% in
Experiment 1 to 58% in Experiment 2; the gain is about 10 percentage points for interrogative
and imperative sentences, and 3 points for assertive sentences.
This can be explained as follows. In Experiment 1, we used the same tone for all syllables in the
sentence but did not simulate co-articulation phenomena. Therefore, the
synthesized sentences were very different from natural speech. In the test sentences of
Experiment 2, all syllables except the final one carried tone 1 (or non-tone). The
co-articulation phenomenon is minimized, and the sentences are more natural.
Compared with Experiment M, the result of Experiment 2 is the same for assertive
sentences (64%) but worse for interrogative sentences (60% against 68%). The test sentences in Experiment M
were synthesized by simulating the prosody of actual sentences. That is why they
could be more natural than the test sentences in Experiment 2, which used our
proposed prosody patterns. Therefore, the worse result in Experiment 2 is
understandable and acceptable.
In this chapter, we have presented some experiments to evaluate our prosody
patterns. These patterns are currently very simple, based only on the position of the
syllable in the sentence and the type of the containing sentence. However, the results of the
experiments show that our proposed patterns can be applied to predict the prosody of
simple sentences.
Chapter 6: Conclusion and Perspectives
The subject of this thesis is “Modeling the prosody of Vietnamese language for speech
synthesis”. As we all know, building a complete model of Vietnamese prosody is a
large and complex task, which requires much research in linguistics, acoustics and
speech processing. Therefore, within the scope of a master thesis, we have studied
some basic factors of Vietnamese prosody, characterized them and tried to apply
them to speech synthesis.
In chapter 3, we proposed a method and a structure for the prosody generation
module. The simplest way to generate the prosody of a whole sentence is to
concatenate the prosody patterns of its syllables directly. Hence, in chapter 4, we
set up a prosody corpus, analyzed it and proposed 72 prosody patterns for syllables,
corresponding to the initial, middle and final positions of the syllable in three types
of sentence (assertive, interrogative and imperative).
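As an illustration of the concatenation idea, the sketch below selects one pattern key per syllable from its tone, its position and the sentence type. It is a minimal sketch under stated assumptions: the tone inventory is taken to be the eight tone codes that appear in the corpus codes of Appendix A (which gives 8 x 3 x 3 = 72 keys), and the names `PATTERN_KEYS`, `position_of` and `pattern_sequence` are hypothetical.

```python
from itertools import product

# Assumption: the tone inventory is the eight tone codes used in the corpus
# codes of Appendix A; the text here only states the total of 72 patterns.
TONES = ["1", "2", "3", "4", "5", "6", "5b", "6b"]
POSITIONS = ["In", "Mid", "Fi"]          # initial, middle, final syllable
SENTENCE_TYPES = ["As", "Int", "Imp"]    # assertive, interrogative, imperative

PATTERN_KEYS = list(product(TONES, POSITIONS, SENTENCE_TYPES))


def position_of(index, n_syllables):
    """Map a syllable index to its position class within the sentence."""
    if index == 0:
        return "In"
    if index == n_syllables - 1:
        return "Fi"
    return "Mid"


def pattern_sequence(tones, sentence_type):
    """Return the pattern key of each syllable, in sentence order.

    Concatenating the patterns stored under these keys would give the
    prosody description of the whole sentence.
    """
    n = len(tones)
    return [(tone, position_of(i, n), sentence_type) for i, tone in enumerate(tones)]
```

For example, `pattern_sequence(["1", "1", "1", "5"], "Int")` selects an initial, two middle and one final pattern for a four-syllable interrogative sentence. A one-syllable sentence is treated as initial here, which is a simplification of this sketch.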
Although these patterns are not enough to model all cases of Vietnamese prosody,
they can be applied to generate the prosody of simple sentences. This is supported
by the results of the experiments in chapter 5: in the perception tests, the listeners
correctly identified 64% of assertive sentences, 60% of interrogative sentences and
50% of imperative sentences.
These patterns are our first prosody patterns for Vietnamese syllables. They were
extracted from a small prosody corpus and take into account only three factors: the
tone, the position of the syllable and the sentence type. In future work, we expect to
improve these patterns by setting up and analyzing a larger corpus. We will also study
other factors, such as syntactic and lexical-meaning factors or the glottalization and
co-articulation phenomena, in order to integrate them into our patterns. Moreover,
these prosody patterns can be represented not only as absolute values of F0, duration
and intensity but also as a set of parameters of another prosody model (such as the
Fujisaki model). In that way, we will be able to generate the prosody description in
other types of prosodic model and apply it to other types of TTS system.
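For reference, the Fujisaki model mentioned above describes the logarithm of F0 as a base level plus the responses to phrase commands and to accent (or tone) commands. The sketch below follows the usual textbook form of the model and uses only the Python standard library. The default values of `alpha`, `beta` and `gamma` and all command values are illustrative placeholders; they were not fitted to the corpus of this thesis.

```python
import math


def phrase_response(t, alpha=2.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent (tone) control mechanism, Ga(t)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """Return F0 in Hz at time t (seconds).

    fb:          base frequency in Hz
    phrase_cmds: list of (onset_time, amplitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return math.exp(log_f0)
```

Expressing a syllable pattern in this form would mean searching for the command onsets and amplitudes whose generated contour best matches the stored F0 values. That fitting step is not shown here.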
The following is a summary of the work done in this master thesis, its limitations
and the future approaches:
• The work we have done:
  - Proposed a simple method of prosody generation and a structure for the
    prosody generation module in a TTS system
  - Set up a corpus for research on prosody
  - Proposed 72 prosody patterns for Vietnamese syllables
  - Applied the proposed prosody patterns to synthesize some simple sentences
    and evaluated these sentences by perception tests
• The limitations:
  - The corpus is small
  - The proposed patterns for syllables are simple and take into account only
    three factors: the tone, the position of the syllable and the sentence type
  - Intensity control in speech synthesis was done manually and is not very
    accurate
  - The listeners in the perception tests were few (5 males and 5 females)
• Future approaches:
  - Set up a larger corpus, which covers other factors and phenomena of
    Vietnamese prosody
  - Define prosody patterns from that corpus
  - Study other prosodic models and convert these patterns into the parameters
    of those models
  - Develop a complete prosody generation module and integrate it into a TTS
    system
The last words
This work was carried out at the MICA center, and I expect that its results can be
applied to the MICA speech synthesis system.
This is also my first work in the domain of speech processing. With the guidance of
my supervisors and the instruction and support of the other members of MICA’s
speech processing group, my knowledge and skills in speech processing have steadily
improved. That knowledge is the necessary background for my future studies and
research. Once again, thank you all very much!
References
[1]. Chilin Shih , Greg Kochanski, “Prosody and Prosodic Models”,
www.prosodies.org.
[2]. Do T.D., Tran T.H., et al. (1998), “Intonation system - A survey of twenty
languages”, chap. 22, Cambridge University Press.
[3]. Dung Tien Nguyen, Hansjörg Mixdorff, Mai Chi Luong, Huy Hoang Ngo,
Bang Kim Vu (2005), “Fujisaki Model based F0 contours in Vietnamese
TTS”, Eurospeech proceeding
[4]. H. Fujisaki, S. Ohno, C. Wang (1974), “A command-response model for F0
contour generation in multilingual speech synthesis”, Journal of Phonetics,
vol. 2, pp 223-232,
[5]. Mixdorff H. (1998), “Intonation patterns of German - Model-based
quantitative analysis and synthesis of F0 contours”, PhD thesis, TU
Dresden
[6]. Mixdorff H. (2001), “An Integrated Approach to Modeling German
Prosody”, TU Dresden
[7]. Mixdorff H., Nguyen Hung Bach, Hiroya Fujisaki and Mai Chi Luong
(2003), “Quantitative Analysis and Synthesis of Syllabic Tones in
Vietnamese”, Eurospeech proceeding.
[8]. Nguyen T.T.H. and Boulakia G. (1999), "Another look at Vietnamese
intonation", ICPhS'99
[9]. Ninh Khanh Duy (2005), “Characterization of Vietnamese intonation for
questions”, Master Thesis, Hanoi University of Technology,
[10]. Paul Alexander Taylor (1992), “A Phonetic Model of English Intonation”,
PhD thesis, University of Edinburgh,
[11]. Sami Lemmetty (1999), “Review of Speech Synthesis Technology”, MSc
thesis, Helsinki University of Technology
[12]. Thierry Dutoit (1993), “High Quality Text-To-Speech Synthesis of the
French Language”, PhD thesis, Faculté Polytechnique de Mons, TCTS Lab,
Belgium
[13]. Thierry Dutoit (1997), “An Introduction to Text-to-Speech Synthesis”,
Kluwer Academic Publishers.
[14]. Tran D.D., Castelli E., et al. (2005), "Influence of F0 on Vietnamese
syllable perception", Interspeech.
[15]. Tran Do Dat (2003), “Building a large Vietnamese Speech Database”,
Master Thesis, Hanoi University of Technology
[16]. Vu M.Q., Tran D.D., Castelli E. (2006), “Prosody of Interrogative and
Affirmative Sentences in Vietnamese Language: Analysis and Perceptive
Results”
[17]. Vu M.Q., Tran D.D. & Castelli E. (2006), "Intonation des phrases
interrogatives et affirmatives en langue vietnamienne", JEP2006, XXVIes
Journées d’Etude sur la Parole. Manoir de la Vicomté - Dinard, France
[18]. Nguyen Quoc Cuong (2002), “Reconnaissance de la parole en langue
Vietnamienne”, Thèse, INPG, UJF Grenoble, France, Juin 2002
[19]. Bạch Hưng Nguyên, Nguyễn Tiến Dũng (2005), "Mô hình Fujisaki và áp
dụng trong phân tích thanh điệu tiếng Việt"
[20]. Mai Ngọc Chừ, Vũ Đức Nghiệu, Hoàng Trọng Phiến (2005), “Cơ sở ngôn
ngữ học và tiếng Việt”, NXB Giáo dục.
[21]. Nguyễn Hữu Quỳnh (2001). “Ngữ Pháp Tiếng Việt”, Nhà xuất bản từ điển
Bách Khoa, pp.11-86, Hà Nội
Appendix
A. Text for prosody corpus
Code Role Sentences
1_As_In_A A Bên ta theo địch đến tận căn cứ
1_As_Mid_A A Bên địch bị bên ta theo đến tận căn cứ
1_As_Fi_A A Bên địch bị đánh bật khỏi căn cứ bên ta
Context Kết thúc trận đánh, thủ trưởng hỏi:
1_Int_In_A A Này cậu. Bên ta theo địch đến tận căn cứ à?
1_As_In_B B Vâng. Bên ta theo bên địch đến tận căn cứ.
1_Int_Mid_A A Này cậu. Bên địch bị bên ta theo đến tận căn cứ à?
1_As_Mid_B B Đúng ạ. Bên địch bị bên ta theo đến tận căn cứ.
1_Int_Fi_A A Này cậu. Bên địch đã bị đánh bật khỏi căn cứ bên ta?
1_As_Fi_B B Đúng ạ. Bên địch bị đánh bật khỏi căn cứ bên ta.
Context Trong trận đánh, cấp dưới hỏi thủ trưởng
1_Int_In_B B Bên ta theo anh có cần bám theo địch không?
1_Imp_In_A A Bên ta theo địch ngay cho tôi!
1_Int_Mid_B B Địch rút thì bên ta theo có được không anh?
1_Imp_Mid_A A À. Thế thì bên ta theo ngay cho tôi!
Context Trong trận đánh, thủ trưởng ra lệnh:
1_Imp_Fi_A A Gọi ngay quân cứu viện bên ta! Nhanh lên!
Code Role Sentence
2_As_In_A A Trên tà thêu một bông hoa màu đỏ.
2_As_Mid_A A Áo của chị trên tà thêu một bông hoa.
2_As_Fi_A A Áo của chị có thêu một bông hoa trên tà.
Context Một chị đi đặt may áo dài. Thợ may hỏi:
2_Int_In_A A Trên tà thêu gì không hả chị?
2_As_In_B B À... Trên tà thêu một bông hoa màu đỏ.
2_Int_Mid_A A Áo chị trên tà thêu gì không thế?
2_As_Mid_B B À... Cái áo đấy trên tà thêu hoa màu đỏ.
2_Int_Fi_A A Áo chị thêu gì trên tà?
2_As_Fi_B B À. Cái áo đấy thêu một bông hoa đỏ trên tà.
2_Int_In_B B Trên tà thêu gì không hả chị?
2_Imp_In_A A Giống cái kia kìa. Trên tà thêu y như thế cho chị!
2_Int_Mid_B B Áo chị đặt trên tà thêu gì không chị?
2_Imp_Mid_A A Giống cái lần trước ý. Cái áo này trên tà thêu cũng thế nhé em!
2_Int_Fi_B B Áo chị đặt thêu gì trên tà?
2_Imp_Fi_A A Giống lần trước. Cứ thêu cho chị một bông hoa trên tà!
Code Role Sentence
3_As_In_A A Trên tã thêu hình một quả bóng.
3_As_Mid_A A Chị nhìn thấy trên tã thêu hình quả bóng.
3_As_Fi_A A Chị nhìn thấy hình một quả bóng được thêu trên tã.
Context Chồng và vợ nói về cái tã mới của em bé:
3_Int_In_A A Trên tã thêu hình gì thế em?
3_As_In_B B À. Trên tã thêu hình một quả bóng anh ạ.
3_Int_Mid_A A Em có thấy trên tã thêu hình gì không?
3_As_Mid_B B À. Em thấy trên tã thêu hình quả bóng anh ạ.
3_Int_Fi_A A Có cả hình quả bóng trên tã?
3_As_Fi_B B Vâng. Em thấy có hình một quả bóng trên tã.
Context Vợ đang thêu tã cho con, hỏi chồng
3_Int_In_B B Trên tã thêu hình gì đây hả anh?
3_Imp_In_A A Con nó thích bóng. Trên tã thêu hình quả bóng đi!
3_Int_Mid_B B Không biết trên tã thêu cái gì hả anh?
3_Imp_Mid_A A Con có vẻ thích bóng. Theo anh trên tã thêu bóng đi em!
3_Int_Fi_B B Anh ơi! Thế mình thêu gì trên tã?
3_Imp_Fi_A A Con thích bóng. Thế thì em thêu cho con một quả bóng trên tã!
Code Role Sentence
4_As_In_A A Bên tả theo khuynh hướng bảo thủ.
4_As_Mid_A A Đảng chính trị bên tả theo khuynh hướng bảo thủ.
4_As_Fi_A A Bảo thủ là khuynh hướng của đảng chính trị bên tả.
Context Hai người bàn về chính trị
4_Int_In_A A Bên tả theo khuynh hướng nào cậu nhỉ?
4_As_In_B B À. Bên tả theo khuynh hướng bảo thủ.
4_Int_Mid_A A Đảng chính trị bên tả theo khuynh hướng nào thế?
4_As_Mid_B B À. Đảng chính trị bên tả theo khuynh hướng bảo thủ.
4_Int_Fi_A A Anh ủng hộ đảng chính trị bên tả?
4_As_Fi_B B À vâng. Tôi ủng hộ đảng bên tả.
Context Trong buổi tổng duyệt diễu hành. Người chỉ huy hô:
4_Imp_In_A A Bên tả theo ngay sau đội trống! Nhanh lên!
4_Imp_Mid_A A Chú ý. Toàn đội bên tả theo ngay sau đội trống!
4_Imp_Fi_A A Tất cả theo sau đội bên tả! Nhanh lên.
Code Role Sentence
5_As_In_A A Đợt phong cấp lần này, lên tá theo anh cũng chẳng khó khăn gì.
5_As_Mid_A A Anh ta được thăng lên tá theo cách nào không ai biết.
5_As_Fi_A A Cuối cùng thì anh cũng được thăng lên tá.
Context Trong cuộc họp bàn về phong cấp trong một đơn vị quân đội:
5_Int_In_A A Đợt phong cấp lần này, lên tá theo anh có khó lắm không?
5_As_In_B B À. Lên tá theo tôi cũng chẳng khó khăn gì.
Context Thủ trưởng hỏi về một trường hợp vừa được phong lên cấp tá
5_Int_Mid_A A Anh ta lên tá theo quyết định nào thế?
5_As_Mid_B B À. Anh ta được lên tá theo cách nào chẳng ai biết.
5_Int_Fi_A A Sao anh ta lại được lên tá?
5_As_Fi_B B Thưa anh, cũng không ai biết sao anh ta lại được lên tá.
Context Cấp dưới hỏi thủ trưởng
5_Int_In_B B Còn đồng chí Nam, lên tá theo anh có nên không?
5_Imp_In_A A Nên chứ. Lên tá theo mọi người luôn đi!
5_Int_Mid_B B Đồng chí Nam mà phong lên tá theo mọi người có được không?
5_Imp_Mid_A A Được. Phong lên tá theo họ luôn đi!
5_Int_Fi_B B Anh không ủng hộ việc đồng chí đó sẽ được phong lên tá?
5_Imp_Fi_A A Đúng. Các anh đừng có phong anh ta lên tá!
Code Role Sentence
6_As_In_A A Lên tạ theo đúng hướng dẫn là cách tập luyện rất tốt.
6_As_Mid_A A Cách tốt nhất là tập lên tạ theo đúng hướng dẫn.
6_As_Fi_A A Tất cả các vận động viên đều phải tập lên tạ.
Context Vận động viên hỏi huấn luyện viên về tập lên tạ:
6_Int_In_A A Lên tạ theo theo anh có tốt không?
6_As_In_B B Có chứ. Lên tạ theo đúng hướng dẫn là cách tập luyện rất tốt.
6_Int_Mid_A A Theo anh thì tập lên tạ theo hướng dẫn có tốt không?
6_As_Mid_B B Có chứ. Cách tốt nhất là tập lên tạ theo đúng hướng dẫn.
6_Int_Fi_A A Liệu em có phải tập lên tạ?
6_As_Fi_B B Có. Mọi vận động viên đều phải tập lên tạ.
6_Int_In_B B Lên tạ theo cách nào hả anh?
6_Imp_In_A A À. Lên tạ theo đúng hướng dẫn cho tôi!
6_Int_Mid_B B Thế em có phải tập lên tạ bây giờ không?
6_Imp_Mid_A A Có. Cậu tập lên tạ theo tôi ngay!
6_Int_Fi_B B Thế em phải tập cả lên tạ?
6_Imp_Fi_A A Chứ còn gì nữa. Cứ theo hướng dẫn lên tạ!
Code Role Sentence
5b_As_In_A A Trong cuộc thi của nhà nông, bên tát theo cách mới đã giành thắng lợi.
5b_As_Mid_A A Chiến thắng thuộc về bên tát theo cách mới.
5b_As_Fi_A A Cuối cùng, phần thắng đã thuộc về bên tát.
Context Trong một cuộc thi của nhà nông. Hai khán giả hỏi nhau:
5b_Int_In_A A Bên tát theo cách mới là bên nào thế anh?
5b_As_In_B B À. Bên tát theo cách mới là bên mặc áo đỏ.
5b_Int_Mid_A A Bên nào là bên tát theo cách mới hả anh?
5b_As_Mid_B B À. Bên đỏ là bên tát theo cách mới.
5b_Int_Fi_A A Trong lần thi này, liệu phần thắng có thuộc về bên tát?
5b_As_Fi_B B Có. Tôi nghĩ phần thắng sẽ thuộc về bên tát.
Context Hai anh em đang ở dưới ruộng, nói chuyện với nhau:
5b_Int_In_B B Ơ bố mẹ đang tát nước kìa. Lên tát theo bố mẹ không anh?
5b_Imp_In_A A Có. Lên tát theo bố mẹ đi!
5b_Int_Mid_B B Ơ, bố mẹ đang tát nước trên kia kìa. Mình có lên tát theo bố mẹ không?
5b_Imp_Mid_A A Có, dưới này cũng làm xong rồi. Mình lên tát theo bố mẹ đi!
5b_Int_Fi_B B Bố mẹ tát nước trên kia kìa. Hay là anh em mình cũng lên tát?
5b_Imp_Fi_A A Ừ, dưới này cũng xong rồi. Nào anh em mình cùng lên tát!
Code Role Sentence
6b_As_In_A A Trong cuộc thi điêu khắc, bên tạc theo cách mới đã hoàn thành bức tượng sớm hơn.
6b_As_Mid_A A Chiến thắng thuộc về bên tạc theo cách mới.
6b_As_Fi_A A Cuộc thi làm tượng nhanh diễn ra giữa bên đúc và bên tạc.
Context Trong một cuộc thi điêu khắc, hai khán giả trò chuyện với nhau:
6b_Int_In_A A Bên tạc theo cách mới là bên nào thế anh?
6b_As_In_B B À. Bên tạc theo cách mới là bên áo đỏ.
6b_Int_Mid_A A Bên nào là bên tạc theo cách mới thế?
6b_As_Mid_B B À. Bên áo đỏ là bên tạc theo cách mới.
6b_Int_Fi_A A Theo anh, liệu phần thắng có thuộc về bên tạc?
6b_As_Fi_B B Có. Tôi nghĩ phần thắng sẽ thuộc về bên tạc.
Context Trong một xưởng điêu khắc, thợ tạc hỏi người thợ cả:
6b_Int_In_B B Có hai mẫu mới và cũ. Tạc theo mẫu nào hả anh?
6b_Imp_In_A A À. Tạc theo mẫu mới cho tôi!
6b_Int_Mid_B B Anh ơi! Lô tượng bên trên tạc theo mẫu nào hả anh?
6b_Imp_Mid_A A À. Lô bên trên tạc theo mẫu mới đi!
6b_Int_Fi_B B Vụ tạc tượng phật trên núi đã đủ thợ chưa hả anh? Liệu em có phải lên tạc?
6b_Imp_Fi_A A Có. Cả cậu cũng phải lên tạc!
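The corpus codes above follow a regular scheme, which can be split mechanically. The sketch below is one reading of that scheme and not a tool from the thesis: the first field is taken as the sentence group (one group per target syllable), "As", "Int" and "Imp" as the sentence type, "In", "Mid" and "Fi" as the position, and the last field as the speaker role. The function name `parse_corpus_code` is hypothetical.

```python
SENTENCE_TYPES = {"As": "assertive", "Int": "interrogative", "Imp": "imperative"}
POSITIONS = {"In": "initial", "Mid": "middle", "Fi": "final"}


def parse_corpus_code(code):
    """Split a corpus code such as '5b_Int_Mid_A' into its four fields."""
    # Tolerate a space where an underscore is expected.
    parts = code.strip().replace(" ", "_").split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected corpus code: {code!r}")
    group, sentence_type, position, role = parts
    return {
        "group": group,
        "sentence_type": SENTENCE_TYPES[sentence_type],
        "position": POSITIONS[position],
        "role": role,
    }
```

For example, `parse_corpus_code("5b_Int_Mid_A")` yields group "5b", an interrogative sentence, the middle position and role "A". An unknown abbreviation raises `KeyError`, which makes typing errors in the codes visible.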
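B. Datasheet of prosody patterns
For each sentence type and each voice, the datasheet gives the patterns of the
initial part, the middle part and the final part. Each pattern is a series of 20 rows
of three values, F0 (Hz), F0 (semitone) and intensity (dB), followed by the duration
(ms) of the syllable and of its voiced part.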
Assertive sentence, male voice
Columns in each part: F0 (Hz), F0 (semitone), intensity (dB)
    Initial part      |     Middle part      |     Final part
162.65   8.23  70.99 | 180.21   9.54  69.90 | 153.85   7.28  71.28
163.03   8.27  71.53 | 180.42   9.52  70.80 | 153.42   7.23  71.94
162.96   8.25  71.82 | 178.30   9.30  71.26 | 153.01   7.17  72.14
162.76   8.23  71.92 | 177.88   9.27  71.43 | 152.51   7.11  72.12
162.37   8.19  71.87 | 177.40   9.23  71.35 | 152.42   7.09  71.99
162.07   8.15  71.70 | 177.01   9.20  71.07 | 152.38   7.08  71.84
161.80   8.13  71.48 | 176.65   9.16  70.70 | 152.45   7.09  71.66
161.84   8.14  71.25 | 176.48   9.15  70.34 | 152.22   7.06  71.37
161.78   8.13  71.04 | 176.58   9.15  70.04 | 151.97   7.03  71.01
161.42   8.09  70.84 | 176.85   9.16  69.77 | 151.27   6.96  70.61
161.55   8.11  70.62 | 177.00   9.16  69.46 | 150.76   6.90  70.14
161.49   8.10  70.33 | 176.94   9.14  69.05 | 150.17   6.84  69.45
161.27   8.08  69.94 | 176.84   9.11  68.53 | 149.58   6.77  68.36
161.07   8.06  69.40 | 176.89   9.10  67.91 | 149.10   6.73  66.81
160.69   8.03  68.65 | 176.78   9.09  67.15 | 149.37   6.80  65.00
160.42   8.00  67.65 | 176.53   9.07  66.18 | 147.78   6.62  63.27
160.02   7.96  66.34 | 174.97   8.97  65.04 | 146.89   6.52  61.54
159.48   7.91  64.73 | 172.82   8.81  63.64 | 145.75   6.40  59.80
158.63   7.82  62.83 | 171.33   8.68  61.75 | 146.23   6.45  57.75
157.04   7.65  60.75 | 169.78   8.54  59.26 | 146.34   6.47  55.54
Duration (ms), syllable:     149 | 177 | 320
Duration (ms), voiced part:   99 | 121 | 210
Assertive sentence, female voice
Columns in each part: F0 (Hz), F0 (semitone), intensity (dB)
    Initial part      |     Middle part      |     Final part
295.52  18.65  69.11 | 264.90  16.55  70.87 | 240.01  14.65  70.26
292.33  18.47  69.54 | 261.16  16.32  71.71 | 236.88  14.42  72.06
289.19  18.28  69.77 | 258.42  16.14  72.10 | 235.57  14.32  72.57
287.26  18.16  69.89 | 257.24  16.07  72.21 | 235.05  14.28  72.53
285.73  18.07  69.93 | 256.46  16.02  72.14 | 235.11  14.28  72.32
284.65  18.01  69.92 | 255.35  15.95  71.98 | 235.47  14.31  72.10
283.93  17.97  69.85 | 254.52  15.89  71.78 | 235.45  14.32  71.83
283.23  17.92  69.72 | 253.64  15.83  71.53 | 235.21  14.30  71.37
282.77  17.90  69.51 | 253.43  15.82  71.24 | 234.78  14.27  70.83
282.00  17.85  69.22 | 253.37  15.82  70.96 | 233.65  14.20  70.27
281.85  17.84  68.85 | 253.47  15.82  70.71 | 232.86  14.13  69.62
282.13  17.85  68.48 | 254.04  15.86  70.43 | 232.18  14.09  68.81
282.72  17.89  68.13 | 254.42  15.89  70.02 | 231.20  14.02  67.77
283.01  17.90  67.75 | 254.88  15.92  69.41 | 229.98  13.93  66.65
283.29  17.92  67.25 | 255.35  15.94  68.53 | 229.68  13.91  65.43
283.33  17.92  66.54 | 255.78  15.97  67.29 | 229.24  13.88  64.07
283.32  17.92  65.53 | 255.10  15.90  65.62 | 228.43  13.82  62.26
282.34  17.85  64.11 | 254.54  15.85  63.47 | 225.83  13.63  60.17
281.63  17.81  62.21 | 252.85  15.74  60.98 | 224.13  13.51  57.80
278.94  17.65  59.87 | 249.09  15.47  58.81 | 225.79  13.63  55.54
Duration (ms), syllable:     168 | 196 | 361
Duration (ms), voiced part:  100 | 126 | 229
Interrogative sentence, male voice
Columns in each part: F0 (Hz), F0 (semitone), intensity (dB)
    Initial part      |     Middle part      |     Final part
181.86  10.18  73.56 | 172.43   9.23  72.50 | 169.79   8.85  71.48
180.69  10.06  74.16 | 172.60   9.26  73.09 | 168.78   8.78  72.34
179.54   9.93  74.50 | 171.89   9.16  73.25 | 167.25   8.63  72.38
179.11   9.89  74.69 | 170.61   9.00  73.25 | 166.22   8.53  71.95
178.00   9.78  74.75 | 170.23   8.96  73.17 | 165.54   8.47  71.32
177.18   9.69  74.69 | 169.71   8.91  73.00 | 165.42   8.46  70.92
176.87   9.65  74.51 | 169.35   8.87  72.76 | 165.44   8.47  70.64
176.28   9.59  74.25 | 169.06   8.85  72.52 | 165.22   8.45  70.34
175.96   9.55  73.92 | 168.96   8.84  72.29 | 165.00   8.43  70.14
175.81   9.53  73.55 | 168.82   8.82  72.07 | 164.76   8.41  69.99
175.40   9.48  73.12 | 168.66   8.80  71.82 | 164.93   8.42  69.74
175.06   9.45  72.63 | 168.54   8.79  71.52 | 165.48   8.47  69.22
174.67   9.41  72.05 | 168.47   8.78  71.08 | 165.51   8.47  68.35
174.38   9.38  71.34 | 168.31   8.77  70.46 | 166.50   8.56  67.23
174.17   9.36  70.47 | 167.87   8.72  69.66 | 166.36   8.53  66.13
173.86   9.33  69.36 | 167.13   8.66  68.66 | 166.20   8.51  65.13
173.25   9.26  67.95 | 166.43   8.59  67.33 | 166.44   8.52  64.03
171.56   9.10  66.20 | 166.45   8.59  65.51 | 164.40   8.25  62.55
170.71   9.00  64.18 | 165.36   8.47  63.02 | 164.01   8.20  60.67
169.21   8.85  62.07 | 163.67   8.29  59.91 | 164.35   8.24  58.35
Duration (ms), syllable:     132 | 185 | 281
Duration (ms), voiced part:   86 | 127 | 187
Interrogative sentence, female voice
Columns in each part: F0 (Hz), F0 (semitone), intensity (dB)
    Initial part      |     Middle part      |     Final part
312.38  19.63  72.34 | 298.25  18.82  70.57 | 291.53  18.42  70.55
307.61  19.39  73.23 | 294.71  18.63  71.83 | 286.30  18.12  72.38
305.55  19.27  73.80 | 292.32  18.49  72.73 | 283.53  17.95  73.19
302.99  19.14  74.14 | 290.69  18.40  73.36 | 281.86  17.86  73.44
301.12  19.03  74.34 | 289.01  18.29  73.78 | 281.04  17.81  73.45
299.87  18.96  74.42 | 287.49  18.21  74.01 | 280.33  17.77  73.27
298.81  18.90  74.44 | 286.66  18.16  74.09 | 280.58  17.79  73.00
298.09  18.86  74.36 | 285.86  18.11  74.06 | 280.23  17.77  72.61
297.63  18.83  74.20 | 285.43  18.09  73.93 | 280.17  17.76  72.13
297.28  18.81  73.94 | 285.56  18.09  73.72 | 280.29  17.77  71.63
297.09  18.80  73.57 | 285.83  18.11  73.42 | 281.01  17.80  71.08
297.07  18.80  73.10 | 286.29  18.14  73.04 | 282.34  17.88  70.35
297.01  18.80  72.51 | 286.90  18.18  72.59 | 283.69  17.96  69.49
297.24  18.81  71.80 | 287.53  18.22  72.07 | 285.85  18.08  68.61
297.68  18.84  70.93 | 288.06  18.25  71.40 | 288.17  18.21  67.63
298.50  18.88  69.83 | 288.71  18.29  70.46 | 288.25  18.22  66.31
298.50  18.88  68.42 | 289.32  18.33  69.14 | 291.93  18.42  64.50
297.57  18.83  66.62 | 289.54  18.34  67.35 | 291.64  18.40  62.22
296.96  18.79  64.38 | 288.86  18.30  65.07 | 292.02  18.41  59.60
294.52  18.66  61.75 | 286.81  18.18  62.32 | 290.68  18.33  57.14
Duration (ms), syllable:     145 | 168 | 281
Duration (ms), voiced part:   87 | 106 | 199
Imperative sentence, male voice
Columns in each part: F0 (Hz), F0 (semitone), intensity (dB)
    Initial part      |     Middle part      |     Final part
178.07   9.72  74.22 | 166.28   8.52  72.52 | 178.57   9.71  75.21
177.27   9.64  74.47 | 165.50   8.43  73.52 | 178.89   9.77  77.01
176.55   9.58  74.61 | 165.40   8.41  74.14 | 178.16   9.70  77.78
175.76   9.50  74.67 | 164.87   8.35  74.47 | 177.91   9.67  77.91
174.98   9.42  74.69 | 164.13   8.26  74.65 | 177.80   9.66  77.65
174.36   9.35  74.69 | 162.93   8.11  74.74 | 177.57   9.63  77.28
174.08   9.33  74.65 | 161.65   7.95  74.76 | 177.27   9.60  76.86
173.94   9.32  74.55 | 161.41   7.94  74.68 | 177.12   9.59  76.38
173.88   9.32  74.38 | 162.59   8.14  74.53 | 177.00   9.59  75.80
173.76   9.31  74.12 | 162.85   8.20  74.28 | 176.24   9.51  75.05
173.48   9.28  73.77 | 162.68   8.18  73.97 | 174.95   9.39  74.21
173.27   9.26  73.36 | 162.43   8.15  73.59 | 173.77   9.28  73.36
173.16   9.24  72.90 | 162.09   8.11  73.15 | 172.74   9.19  72.41
173.06   9.23  72.36 | 162.02   8.10  72.63 | 171.43   9.07  71.22
172.97   9.23  71.72 | 161.85   8.09  71.97 | 169.59   8.89  69.79
172.81   9.21  70.97 | 161.67   8.07  71.11 | 167.57   8.68  68.28
172.68   9.20  70.08 | 161.44   8.04  69.96 | 166.24   8.55  66.72
172.54   9.18  68.98 | 161.17   8.01  68.41 | 165.10   8.43  65.00
172.31   9.16  67.64 | 160.71   7.96  66.46 | 162.74   8.15  62.88
171.42   9.07  66.00 | 159.67   7.86  64.15 | 162.07   8.04  60.61
Duration (ms), syllable:     129 | 142 | 291
Duration (ms), voiced part:   75 |  96 | 195
Imperative sentence, female voice
Columns in each part: F0 (Hz), F0 (semitone), intensity (dB)
    Initial part      |     Middle part      |     Final part
269.43  16.50  71.84 | 293.55  18.56  71.79 | 311.46  19.61  72.53
266.25  16.34  72.66 | 289.61  18.34  72.91 | 306.69  19.35  74.05
263.43  16.18  73.17 | 287.29  18.21  73.65 | 304.61  19.23  74.48
261.32  16.05  73.44 | 285.88  18.12  74.12 | 303.23  19.16  74.62
259.70  15.95  73.54 | 284.37  18.03  74.35 | 302.40  19.11  74.50
258.46  15.87  73.50 | 283.76  17.99  74.36 | 302.06  19.09  74.19
257.49  15.81  73.37 | 282.70  17.93  74.22 | 301.84  19.08  73.80
256.97  15.78  73.16 | 281.91  17.88  74.01 | 301.71  19.08  73.49
256.61  15.75  72.89 | 281.38  17.84  73.75 | 301.29  19.05  73.20
256.85  15.77  72.54 | 281.21  17.83  73.46 | 300.63  19.01  72.74
257.22  15.79  72.11 | 281.18  17.83  73.15 | 299.39  18.94  72.22
257.56  15.81  71.62 | 281.21  17.82  72.81 | 298.05  18.86  71.69
258.24  15.85  71.05 | 280.92  17.80  72.39 | 297.06  18.81  71.19
258.67  15.88  70.38 | 281.13  17.81  71.83 | 295.21  18.70  70.59
259.16  15.90  69.54 | 281.27  17.82  71.10 | 295.00  18.69  69.79
259.81  15.95  68.46 | 281.49  17.83  70.16 | 293.48  18.60  68.61
259.58  15.93  67.05 | 281.07  17.80  68.92 | 294.19  18.65  66.97
259.13  15.90  65.35 | 281.60  17.83  67.31 | 292.11  18.52  64.72
254.95  15.66  63.37 | 280.33  17.75  65.32 | 285.62  18.12  61.72
250.14  15.36  61.20 | 280.52  17.76  63.02 | 282.03  17.91  58.59
Duration (ms), syllable:     154 | 182 | 317
Duration (ms), voiced part:   92 | 110 | 205
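To use one of these patterns in synthesis, its rows have to be placed in time and, if needed, converted between Hz and semitones. The sketch below shows both steps under two assumptions that are not stated in this appendix: the rows of a pattern are spread evenly over the voiced part, and the semitone scale is relative to a reference frequency `ref_hz` (100 Hz is only a placeholder, so the result need not match the semitone column of the datasheet). The function names are hypothetical.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def timestamp_pattern(f0_points, voiced_ms, onset_ms=0.0):
    """Spread the pattern points evenly over the voiced part.

    Returns (time_ms, f0_hz) pairs with the first point at onset_ms and the
    last point at onset_ms + voiced_ms. Even spacing is an assumption.
    """
    n = len(f0_points)
    if n < 2:
        raise ValueError("need at least two points")
    step = voiced_ms / (n - 1)
    return [(onset_ms + i * step, f0) for i, f0 in enumerate(f0_points)]
```

For example, `timestamp_pattern([162.65, 163.03, 162.96], 99.0)` places three F0 values at 0, 49.5 and 99 ms. A full pattern would pass its 20 F0 values together with the voiced-part duration of the same block.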