MINISTRY OF EDUCATION AND TRAINING 
HANOI UNIVERSITY OF TECHNOLOGY 
------------------------------- 
Thesis for the degree of 
MASTER OF SCIENCE 
Modeling the prosody 
of Vietnamese language for speech 
synthesis 
Speciality: “Information processing and Communication” 
Code:23.04.3898 
MẠC ĐĂNG KHOA 
Supervisor: 
Prof. PHẠM THỊ NGỌC YẾN 
Hanoi, 2007 
Faculty of Information Technology 
International research center of 
Multimedia Information, Communication and Application 
Master thesis 
 Mạc Đăng Khoa 
Acknowledgment 
Many people gave me generous help and inspiration during my time as a master's
student.
First, I would like to express my deep respect and gratitude to my supervisors,
Dr. Eric Castelli and Prof. Phạm Thị Ngọc Yến. Thank you very much for orienting
and guiding my research in the speech processing domain. Thank you for all your
useful advice, your honest criticism and your patience during my master's
research.
Special thanks also go to Mrs. Geneviève Caelen-Haumont, to PhD students Trần
Đỗ Đạt and Vũ Minh Quang, and to all members of MICA’s speech group. I could not
have completed this thesis without your support. Thank you all for your
suggestions and your sincere remarks on every part of my research.
I would like to thank Ms. Đồn Thị Ngọc Hiền, who guided me in recording the
corpus. I would also like to thank the many MICA members who spent much of their
time recording and testing for my research.
I am grateful to Prof. Nguyễn Trọng Giảng and MICA’s directorate for providing me
with the best possible working conditions during my time at the International
Research Center MICA.
Finally, I owe a great deal to my parents and my sister for their continued support. I
also give very special thanks to my girlfriend for her constant encouragement,
which gave me strength and motivation in my work and in my life.
Abstract 
A Text-To-Speech (TTS) system is a computer system that produces speech from
text. In a TTS system, the naturalness of the produced speech depends greatly on
the variation of pitch, duration and energy during speaking. We call the ability to
control this variation the “prosody controlling ability”. A TTS system with good
prosody controlling ability can simulate the prosody of human speech appropriate
to the speaking context.
In tonal languages such as Vietnamese, the prosody of an utterance results from the
combination of two components: "micro-prosody", corresponding to the tone of
each syllable in a sentence, and "macro-prosody", corresponding to the whole
sentence.
The main goal of this thesis is to model the characteristics of Vietnamese prosody
for speech synthesis. It focuses on the influences of macro-prosody on
micro-prosody in three types of sentence: assertive, interrogative and imperative.
The first task was to set up a “prosody corpus” and extract all possible prosody
parameters. Based on the extracted data, we defined seventy-two simple prosody
patterns for Vietnamese syllables in three types of sentence. These patterns were
then applied to synthesize some simple sentences. Finally, perception experiments
were carried out to evaluate the synthesized sentences. The results showed that the
proposed patterns can be applied successfully to generate the prosody of simple
sentences.
This is our preliminary work on Vietnamese prosody, and it considers only the
sentence type and the position of the syllable in the sentence. In the future, we
expect to continue this research with more factors of Vietnamese prosody, improve
our patterns and apply them to a Vietnamese TTS system.
List of Figures 
Figure 1-1: Category of methods for predicting syllable duration [6]
Figure 2-1: Example of the contours of six tones, as described in [21]
Figure 2-2: The shape of Tone 1 with female and male voice [18]
Figure 2-3: The shape of Tone 2 with female and male voice [18]
Figure 2-4: The shape of Tone 3 with female and male voice [18]
Figure 2-5: The shape of Tone 4 with female and male voice [18]
Figure 2-6: The shape of Tone 5 with female and male voice [18]
Figure 2-7: The shape of Tone 5b with female and male voice [18]
Figure 2-8: The shape of Tone 6 with female and male voice [18]
Figure 2-9: The shape of Tone 6b with female and male voice [18]
Figure 2-10: Sentence classification by structure [20]
Figure 2-11: The sentences “Lan thích ăn cơm không” in
Figure 2-12: The sentences “Bảo cố gắng tập đi” in
Figure 2-13: The sentences “Tân bỏ đi chứ” in
Figure 2-14: The differences of F0 contour between Assertive and Interrogative sentence [16]
Figure 3-1: A general function diagram of TTS system [13]
Figure 3-2: Fujisaki model
Figure 3-3: Fujisaki model for tonal language [19]
Figure 3-4: Function diagram of proposal TTS system
Figure 3-5: Prosody generation module
Figure 4-1: Key-syllable segmentation
Figure 4-2: Extracting F0 contour using PRAAT
Figure 4-3: An example of prosody pattern
Figure 5-1: An example of synthesized non-sense phrase
Figure 5-2: Perception test 1
Figure 5-3: An example of synthesized multi-type sentences
Figure 5-4: Interface for Perception test 2
Figure 5-5: Correct recognition rate with 8 tones of last syllable
Figure 5-6: Correct recognition rate (%) with other types of sentences
Figure 5-7: Result comparison of three experiments
List of Tables 
Table 1.1: Prosody functions
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Table 1.3: Intonation model classification
Table 2.1: Vietnamese vowels
Table 2.2: Vietnamese consonants
Table 2.3: Arrangement of Vietnamese consonants
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each phonetic unit [14]
Table 2.5: The six Vietnamese tones
Table 3.1: Comparison between direct pattern and model pattern
Table 4.1: Prosody corpus structure
Table 4.2: Prosody corpus text information
Table 4.3: Recording information of Prosody corpus
Table 5.1: Confusion matrix (in %) for 8 tones with male voice
Table 5.2: Confusion matrix (in %) for 8 tones with female voice
Table 5.3: Confusion matrix (%) of sentence types with male voice
Table 5.4: Confusion matrix (%) of sentence types with female voice
Table 5.5: Test data for Experiment 2
Table 5.6: Confusion matrix (in %) of sentence types (with male voice)
Table 5.7: Confusion matrix (in %) of sentence types (with female voice)
Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female)
Table 5.9: Correct recognition rate (%) with other types of sentences
Table 5.10: Result of three experiments
Table of contents 
Acknowledgment
Abstract
List of Figures
List of Tables
Table of contents
0 INTRODUCTION
1 PROSODY AND PROSODIC MODEL
1.1. Overview of prosody
1.1.1. The concept of prosody
1.1.2. Major components of prosody
1.1.3. The functions of prosody
1.1.4. Levels of representation of prosodic phenomena
1.2. Prosody modeling
1.2.1. Intonation models
1.2.2. Duration modeling
1.2.3. This thesis work approach
2 VIETNAMESE LANGUAGE AND PROSODY
2.1. Vietnamese language
2.1.1. Vietnamese characteristics
2.1.2. Vietnamese phoneme system
2.1.3. Syllable structure
2.2. Vietnamese prosody
2.2.1. Micro-prosody and tones system in Vietnamese
2.2.2. Macro-prosody and sentence types in Vietnamese
2.2.3. Some special phenomena in Vietnamese prosody
3 TTS SYSTEM AND PROSODY GENERATION
3.1. An overview of TTS system
3.2. Prosody generation
3.2.1. Overview of prosody generation
3.2.2. From text to prosody
3.3. Other researches and our proposal
4 PROSODY PATTERNS EXTRACTION
4.1. Prosody corpus
4.1.1. Objectives
4.1.2. Define the corpus text
4.1.3. Recording
4.1.4. Sentence segmentation
4.2. Analysis and extracting prosody parameters
4.2.1. Segmentation
4.2.2. Extracting prosody parameters of key-syllable
4.3. Proposal the patterns for Vietnamese prosody
4.3.1. Methodology
4.3.2. Prosody patterns
4.3.3. Some visual remarks on extracted patterns
5 EXPERIMENTS AND EVALUATION
5.1. Experiment 1: Tone and non-sense phrase
5.1.1. Objectives
5.1.2. Method and Implementation
5.1.3. Results and discussion
5.2. Experiment 2: Multi-type sentences
5.2.1. Objectives
5.2.2. Method and Implementation
5.2.3. Results and discussion
5.3. Comparison and conclusion
6 CONCLUSION AND PERSPECTIVES
REFERENCES
APPENDIX
A. Text for prosody corpus
B. Datasheet of prosody patterns
Chapter 0: Introduction
Speech is the primary means of communication between people. Speech synthesis,
the automatic generation of speech waveforms, has been under development for
several decades. Recent progress has produced synthesizers with very high
intelligibility, but sound quality and naturalness remain a major problem. Most
recent research attempts to improve the naturalness of synthesized sound so that
it approaches human speech.
In Vietnam, there are currently some Vietnamese synthesis systems such as VnVoice
(developed by the Institute of Information Technology) and HoaSung (developed by
the International Research Center MICA). This research has obtained some
encouraging results. However, before these systems can be released to the market,
the quality of the produced speech must be improved, especially the naturalness
of its prosody.
Thus, this thesis aims to study the characteristics of Vietnamese prosody in order
to apply them to speech synthesis. This work was carried out at the International
research center of Multimedia Information, Communication and Application (MICA)
and is part of MICA’s project VN-Synthesis.
Building on the research of PhD student Tran Do Dat at MICA, we have already
developed a speech synthesis system using sound sample concatenation
techniques. The first version can now produce sound from a detailed text
description, which consists of:
• The sequence of phonemes composing the utterance: this can be obtained
automatically from the raw text using a "phonetization” module, whose
development is currently underway.
• All information related to voice modulations: mostly the pitch, energy and
duration variations that constitute the intonation or prosody of the uttered
statement. We call this the “prosody description”.
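As an illustration only, the two parts of this input could be held in a structure like the one sketched below. The class names, field names and numeric values are hypothetical and are not the actual format used by the MICA synthesizer; the sketch simply pairs a phoneme sequence with pitch, energy and duration values for each syllable.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SyllableProsody:
    """Prosody of one syllable (hypothetical structure, not the MICA format)."""
    phonemes: List[str]                    # phoneme sequence of the syllable
    duration_ms: float                     # syllable duration in milliseconds
    f0_points: List[Tuple[float, float]]   # (relative time 0..1, F0 in Hz)
    energy_db: float                       # mean energy level in dB


@dataclass
class ProsodyDescription:
    """Phoneme sequence plus pitch, energy and duration for one utterance."""
    syllables: List[SyllableProsody] = field(default_factory=list)

    def total_duration_ms(self) -> float:
        return sum(s.duration_ms for s in self.syllables)

    def phoneme_sequence(self) -> List[str]:
        return [p for s in self.syllables for p in s.phonemes]


# Placeholder values chosen for illustration, not measured data.
description = ProsodyDescription([
    SyllableProsody(["t", "a"], 250.0, [(0.0, 120.0), (1.0, 118.0)], 65.0),
    SyllableProsody(["d", "i"], 300.0, [(0.0, 115.0), (1.0, 100.0)], 62.0),
])
print(description.phoneme_sequence(), description.total_duration_ms())
```

Keeping the F0 values as a short list of time-normalized points is one possible design choice: it lets the same contour shape be stretched to any syllable duration.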
For tonal languages such as Vietnamese, the prosody of speech is composed of two
components, which we call “micro-prosody” and “macro-prosody”:
• Micro-prosody is the variation of pitch, duration and intensity within an
individual word or syllable. For a tonal language, micro-prosody is
essential for distinguishing the tone of a syllable. Thus, the meaning of the
synthesized sound depends greatly on the quality of the micro-prosody.
• Macro-prosody is the prosody applied to a whole phrase or sentence.
It depends on the type of sentence, the speaker's intentions, emotions, etc.
Therefore, the "naturalness" of synthesized speech depends on the ability
to control macro-prosody during the speech synthesis process.
Objectives and Tasks 
This thesis is part of MICA's speech synthesis research, and its main goal is to
extract the characteristics of Vietnamese prosody in order to generate the “prosody
description” for speech synthesis.
In this thesis, we focus only on the differences between Vietnamese tones in
different positions in the sentence and in different types of sentence. In other
words, we study the influences of macro-prosody on micro-prosody.
The first task is to set up a corpus for research on Vietnamese prosody. With this
corpus, we extract and analyze the fundamental frequency, duration and intensity
parameters of syllables in eight Vietnamese tones, in three positions and in three
types of sentence.
After that, using these prosody parameters, we define simple prosody patterns
for the Vietnamese tones, corresponding to the cases of a syllable in three types of
sentence: assertive, interrogative and imperative. By applying these patterns to
re-synthesize some simple sentences and carrying out perception experiments, we
can examine the appropriateness of these prosody patterns.
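The size of the analysis grid can be checked with a short sketch. The tone labels follow the figure captions of Chapter 2 (Tone 1 to Tone 6b); the position names are an assumption, since the text only states that three positions are considered. The product of the three factors equals the seventy-two patterns mentioned in the abstract, which suggests, but does not by itself confirm, one pattern per combination.

```python
from itertools import product

# Tone labels as used in the figure captions of Chapter 2.
TONES = ["1", "2", "3", "4", "5", "5b", "6", "6b"]
# Hypothetical position names: the text only says "three positions".
POSITIONS = ["initial", "middle", "final"]
SENTENCE_TYPES = ["assertive", "interrogative", "imperative"]

# One key per (tone, position, sentence type) combination.
pattern_keys = list(product(TONES, POSITIONS, SENTENCE_TYPES))
print(len(pattern_keys))  # 72
```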
Thesis outline 
This thesis is structured as follows:
• Chapter 1: Section 1.1 gives some background on prosody, together with
definitions and terms used in this thesis. Section 1.2 briefly presents
prosody modeling and some prosodic models.
• Chapter 2 gives an overview of the Vietnamese language and Vietnamese
prosody.
• Chapter 3 introduces Text-to-Speech (TTS) systems, the general structure
of a TTS system and prosody generation. In the last section of this
chapter, we present some related work and propose a simple structure for
the prosody generation module of a TTS system.
• Chapter 4: Sections 4.1 and 4.2 describe our work on setting up and
analyzing the Vietnamese prosody corpus. In Section 4.3, we propose a
set of prosody patterns for Vietnamese syllables.
• Chapter 5 presents a series of perception experiments for evaluating the
proposed patterns.
• Chapter 6 concludes the thesis and gives suggestions for further work.
Chapter 1: Prosody and Prosodic model
In this chapter, we give an overview of prosody and explain some terms used in
this thesis. We then briefly present the concept of prosody modeling and some
prosodic models.
1.1. Overview of prosody 
1.1.1. The concept of prosody 
There is no exact definition of the term “prosody”. We can use the term
"prosody" broadly, meaning “a time series of speech-related information that is not
predictable from a reasonable window (i.e. word-sized or sentence-sized) applied to
the phoneme sequence” [1].
Viewed broadly, prosody is a parallel channel of communication, carrying
information that cannot simply be deduced from the lexical channel. All
aspects of prosody are transmitted by muscle motions, and in most of them, the
recipient can perceive, fairly directly, the motions of the speaker.
Clearly, under that broad definition, hand gestures and eyebrow and face
motions can be considered prosody, because they carry information that modifies
and can even reverse the meaning of the lexical channel. However, in the domain of
speech processing, we concentrate on the speech aspect of prosody. Thus,
prosody includes “Pitch”, “Duration” and “Stress”. In terms of the speech
signal, prosody is represented by three components: “Fundamental frequency
(F0)”, “Duration” and “Intensity”.
“Prosody” and “Intonation” 
The term prosody refers to certain properties of the speech signal such as audible
changes in pitch, loudness, and syllable length. For some authors, the set of prosodic
features also includes other aspects related to speech timing, such as rhythm and
speech rate [13].
Some authors use the term intonation as a synonym for prosody, while others
restrict it to the tonal (melodic) aspects of prosody. In this thesis, intonation refers
to pitch variation in speech production and is part of prosody [13]. In other words,
we have:
Prosody = Intonation + Duration 
1.1.2. Major components of prosody 
As discussed above, prosody consists of:
• Pitch (fundamental frequency): Among prosodic events, the most overt are
changes in pitch, which together constitute the pitch contour of the
utterance (the F0 contour of the speech signal). Analyses of sentence-level
pitch contours show that the pitch contour of a longer utterance can be
broken down into a sequence of elementary contours, which can be further
divided into syllabic contours [13].
• Duration: in prosody, duration concerns the length of a sentence,
phrase, word, syllable, voiced part of a syllable, syllabic nucleus, and so on.
The duration of syllables and speech sounds depends on several
(dependent or interdependent) factors such as speech rate, rhythm,
phonetic nature, etc. In most cases, the absolute duration of an event is
easily measured. Sometimes, however, the boundary of an event is not
obvious to define.
• Stress (intensity): stress is a prosodic property that has been described
since the very first work on prosody in phonetics. It was said to be related
to loudness and phonological force. Both these characterizations refer to the
perceptual form of prosody: the syllable carrying stress is prominent with
respect to the surrounding syllables, either due to its loudness or to its
dynamic properties.
1.1.3. The functions of prosody 
Prosody, as expressed in pitch, gives clues to many channels of linguistic and para-
linguistic information. Linguistic functions such as stress and tone tend to be 
expressed as local excursions of pitch movement. Intonation types and para-
linguistic functions may affect the global pitch setting, in addition to characteristic 
local pitch excursion near the edge of the sentence (i.e. boundary tones). [1] 
Prosody used to convey lexical meaning: Stress, accentual and tone languages. 
• Stress language: English is an example of a stress language. Stress
location is part of the lexical entry of each English word. For example,
"apple" and "orange" both have stress on the first syllable, while
"banana" has stress on the second syllable. When an English word is
spoken in isolation with declarative intonation, F0 typically peaks on the
stressed syllable.
• Accentual language: Japanese is an example of an accentual language. A
word is lexically marked as accented (on a particular syllable) or
unaccented. A simplified description is that pitch rises near the beginning of
an accentual phrase and falls on the accented syllable. For a detailed
analysis, see Beckman and Pierrehumbert (1988).
• Tone language: Mandarin and Vietnamese are examples of lexical tone
languages. Each syllable is lexically marked with one lexical tone.
Tones have distinctive pitch contours. Altering the pitch contour may
change the lexical meaning of a word, and perhaps the meaning of a
sentence. For example, in Vietnamese the syllables “ta” (we), “tà” (lap of
dress), “tã” (nappy), “tả” (to describe), “tá” (dozen) and “tạ” (quintal) all
have different meanings.
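Because Vietnamese orthography writes the tone of each syllable as a diacritic, the lexical tone of a written syllable can be read directly from the text. The sketch below is one possible way to do this with Unicode decomposition. The function name is hypothetical, and it returns the six orthographic tone names rather than the tone numbering used later in this thesis, which is not assumed here.

```python
import unicodedata

# Combining marks that write the five marked tones in Vietnamese orthography.
TONE_MARKS = {
    "\u0300": "huyền",  # grave accent
    "\u0301": "sắc",    # acute accent
    "\u0303": "ngã",    # tilde
    "\u0309": "hỏi",    # hook above
    "\u0323": "nặng",   # dot below
}


def tone_of(syllable: str) -> str:
    """Return the orthographic tone name of a written Vietnamese syllable."""
    # NFD splits each letter into its base letter and combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"  # no tone mark: level tone


for s in ["ta", "tà", "tã", "tả", "tá", "tạ"]:
    print(s, tone_of(s))
```

Vowel-quality marks such as the circumflex, breve and horn are deliberately left out of the table, so a syllable like “không” is reported as the level tone.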
Prosody used to convey non-lexical information: Intonation type (Question vs. 
declarative sentences). 
Languages may employ prosody in different ways to differentiate declarative
sentences from questions. A general trend is that questions are associated with
higher pitch somewhere in the sentence, most commonly near the end. This may be
manifested as a final rising contour, or as a higher or expanded pitch range near the
end of the sentence. In English, declarative intonation is marked by a falling ending
while yes-no question intonation is marked by a rising one, as shown on the last
digit "one" in the English examples. Russian questions, on the other hand, use
strong emphasis on a key word instead of a rising tail. Chinese questions are
manifested by an expanded pitch range near the end of the sentence, while the
speaker preserves the lexical tone shapes [1].
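The two question strategies described above, a final rising contour and an expanded pitch range that keeps the tone shape, can be illustrated on a sampled F0 contour. This is a minimal sketch under our own assumptions: the function names, the parameters `factor`, `rise_hz` and `tail`, and the contour values are hypothetical, and this is not the method proposed later in this thesis.

```python
from typing import List


def expand_final_range(f0: List[float], factor: float, tail: int) -> List[float]:
    """Expand the pitch range of the last `tail` F0 samples around their mean.

    Scaling around the local mean enlarges the excursions while keeping the
    rise/fall shape of the contour.
    """
    if tail <= 0 or tail > len(f0):
        raise ValueError("tail must be between 1 and len(f0)")
    head, end = f0[:-tail], f0[-tail:]
    mean = sum(end) / len(end)
    return head + [mean + factor * (v - mean) for v in end]


def add_final_rise(f0: List[float], rise_hz: float, tail: int) -> List[float]:
    """Add a linear rise that reaches `rise_hz` at the last sample."""
    if tail <= 0 or tail > len(f0):
        raise ValueError("tail must be between 1 and len(f0)")
    start = len(f0) - tail
    return [
        v + (rise_hz * (i - start + 1) / tail if i >= start else 0.0)
        for i, v in enumerate(f0)
    ]


# Illustrative falling contour in Hz, not measured data.
declarative = [120.0, 118.0, 116.0, 110.0, 100.0, 90.0]
print(expand_final_range(declarative, factor=1.5, tail=3))
print(add_final_rise(declarative, rise_hz=40.0, tail=3))
```

In the first function the final fall from 110 Hz to 90 Hz becomes a fall from 115 Hz to 85 Hz, so the shape is preserved; in the second the same fall is turned into a rise ending at 130 Hz, which changes the shape and would risk altering the perceived tone in a tonal language.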
Prosody used to convey discourse functions: Focus, prominence, discourse 
segments, etc. 
Topic initialization is typically associated with high pitch. Pitch is typically raised 
in the discourse initial section and lowered in the discourse final section. Also, new 
information in the discourse structure is typically accented while old information 
de-accented. [1] 
Prosody used to convey emotion. 
Most experiments on emotional speech study stylized emotion, as delivered
by actors and actresses. In these acted-out emotions, a few categories of emotion
can be reliably identified by listeners, and one can find consistent acoustic
correlates of these categories. For example, excitement is expressed by high pitch
and fast speed, while sadness is expressed by low pitch and slow speed. Hot anger is
characterized by over-articulation, fast, downward pitch movement, and overall
elevated pitch. Cold anger shares many attributes with hot anger, but the pitch range
is set lower.
The study of emotion in natural speech is considerably more complicated. It is
generally recognized that speakers show mixed feelings and ambiguous states of
mind, and the emotions do not fall into clear-cut categories [1].
We have the summary of prosody functions in Table 1.1: 
Table 1.1: Prosody functions
Modifying meaning                                   | Not modifying meaning
Linguistic            | Paralinguistic              | Discourse function | Extra linguistic
(Lexicon information) | (non-lexicon information)   |                    |
- Tone                | Sentence type:              | - Focus            | - Emotion
- Accent              | - Assertive                 | - Prominence       | - Sex of speaker
                      | - Interrogative             | - …                | - …
                      | - Imperative                |                    |
In this thesis work, we just focus on studying the functions of prosody which 
modify meaning, namely tones and sentence types in Vietnamese prosody. 
1.1.4. Levels of representation of prosodic phenomena 
As with other properties of the speech signal, prosodic events can be studied at
various levels of representation (see Table 1.2) [13]:
• First, the acoustic level: the acoustic manifestation of prosody
(fundamental frequency, amplitude, and duration) can be measured
directly, using specialized hardware or algorithms (such as pitch
determination algorithms).
• Second, the perceptual level represents the prosodic events as heard by
the listener. As with the spectral properties of speech sounds, acoustic
characteristics that can be measured are not always perceptible. The
perceptual representation is accessible to the individual listener, but this
mental representation can hardly be measured. Alternatively, it can be
computed with a fair amount of precision on the basis of our knowledge
of psychoacoustics.
• Finally, the linguistic level represents the prosody of an utterance as a
sequence of abstract units (signs, symbols), some of which have a
communicative function in speech, while others may just fulfill syntactic
requirements. The linguistic structure of prosody is not some hidden code
that can simply be revealed using some standard procedure.
Table 1.2: Links between levels of representation of prosodic phenomena [13]
Acoustic                   | Perceptual | Linguistic
Fundamental frequency (F0) | Pitch      | Tone, intonation, aspect of stress
Intensity                  | Loudness   | Aspect of stress
Duration                   | Length     | Aspect of stress
Given the different nature of these representations, it is important to keep them 
apart. It can be helpful to have the terminology reflect the lever of representation. 
For instance, measuring loudness does not equal measuring signal energy. It is 
obvious that the perception of loudness is not exclusively related to the amplitude at 
one point of the signal, but also dependent on the duration of a speech fragment (the 
loudness of which we are measuring), and relative to the loudness of other parts in 
the signal. 
As one moves away from acoustic level towards the perceptual and/or linguistic 
levels, the measurement of some given prosodic property will progressively involve 
segmentation (for example, into syllables), context (such as relative prominence), 
and structural information (the linguistic interpretation of a syllabic tone, for 
example, often depends on whether the related syllable is stressed or not, which 
requires a prior analysis of the segmental layer). 
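As a minimal sketch of what measurement at the acoustic level involves, the following Python fragment computes the three acoustic parameters for one frame: duration, intensity (RMS level in decibels) and F0 (lag of the autocorrelation maximum). The function names, the synthetic 200 Hz test tone and the 75-400 Hz search range are illustrative assumptions, not taken from the thesis. A real pitch determination algorithm would add voicing decisions and smoothing.

```python
import math

def frame_intensity_db(samples):
    """RMS level of a frame in dB relative to full scale (amplitude 1.0)."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")

def estimate_f0_autocorr(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Crude F0 estimate: lag of the autocorrelation maximum within [fmin, fmax]."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic voiced frame (assumed test signal): 200 Hz sine, amplitude 0.5, 50 ms at 8 kHz.
SAMPLE_RATE = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]

duration_s = len(frame) / SAMPLE_RATE              # duration of the frame
intensity_db = frame_intensity_db(frame)           # intensity (acoustic correlate of loudness)
f0_hz = estimate_f0_autocorr(frame, SAMPLE_RATE)   # F0 (acoustic correlate of pitch)
```

These values describe the signal only. As the text notes, perceived loudness or pitch cannot be read directly from them.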
1.2. Prosody modeling 
Prosodic models serve two purposes: On one hand, they can be scientific 
hypotheses that explain how we communicate with each other, and what we 
communicate. On the other hand, they can be engineered software systems that are 
part of a dialog system or speech synthesizer. To a lesser extent - and this is mostly 
potential - a prosodic model can be the background for a system to recognize 
prosody in human speech. 
In general, a prosodic model combines two components: an intonation model and a 
duration model. In this section, we give an overview of some methods for 
predicting intonation (F0 contours) and duration that have actually been applied 
in speech synthesis. 
1.2.1. Intonation models 
1.2.1.1 Intonation model classification 
The primary goal of intonation research is to model natural F0 contours of speech, 
preferably in relation to a transcription and a description of the prosodic intent of 
the speaker. The starting point of intonation research is the time series of F0, but 
the interpretation of the F0 information diverges widely among intonation models. 
Table 1.3 shows one way to classify the various intonation models. 
Table 1.3: Intonation model classification 
Intonation models classified by the way they describe prosody. 
                    | Under-specified | -        | -         | Fully specified 
Single component    | INTSINT         | ToBI, Xu | Tilt, IPO | Olive, Machine learning 
Two components      | Grønnum         | -        | Fujisaki  | - 
Multiple components | -               | -        | -         | Van Santen 
Under-specified or Fully specified 
The shape of an accent may be fully-specified (i.e. defined without gaps) or under-
specified (defined by disconnected regions or isolated points). Along another 
dimension, f0 values at any given time may be treated as a single component or as 
the combination of multiple components. 
The advantage of using an under-specified accent shape is that it allows sufficient 
distance between specified accent targets to allow a smooth f0 transition, typically 
by way of interpolation. The drawback is that it ignores changes of shape between 
specified targets. 
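The interpolation step used by under-specified models can be sketched as follows. This is an illustrative example only: the target values are hypothetical, and linear interpolation is chosen as the simplest case (INTSINT, for instance, uses quadratic splines instead).

```python
def interpolate_targets(targets, t):
    """Linearly interpolate F0 (Hz) at time t (s) between sparse (time, f0) targets."""
    targets = sorted(targets)
    if t <= targets[0][0]:
        return targets[0][1]
    if t >= targets[-1][0]:
        return targets[-1][1]
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)

# Hypothetical accent targets: (time in seconds, F0 in Hz).
accent_targets = [(0.0, 120.0), (0.2, 180.0), (0.5, 100.0)]

# Dense contour obtained from only three specified points.
contour = [interpolate_targets(accent_targets, i * 0.05) for i in range(11)]
```

Everything between two targets is determined by the interpolation rule alone, which is exactly the drawback mentioned above: any change of shape between targets is lost.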
On the other hand, a system with fully specified accents leaves little room to resolve 
conflicting targets. A simple concatenation of fully-specified accents will result in a 
pitch curve with unnatural jumps at the concatenation joints. Many systems, such as 
Fujisaki (1983, 1988), use filters to smooth out abrupt changes in F0. Alternatively, 
van Santen (1997, 2000) requires each accent to begin and end at zero to ensure 
smooth connections between accents. 
Single component or many components? 
Many intonation models treat surface intonation contours as the superposition of a 
phrase component and an accent component. Grønnum (1992) and Fujisaki (1983, 
1988) are representatives of this view. 
A well-defined model that fully specifies accent shape and uses multiple components 
is van Santen's model (van Santen and Möbius, 1997, 2000; van Santen et al., 
1998), where accents are represented by densely populated points, providing a 
mechanism to describe highly complex accent shapes in detail. We characterize van 
Santen's system as having multiple components, because in addition to the phrase 
component, each accent in the phrase also adds a phrase-length component that 
contributes to the surface f0 contour. 
The advantage of multiple components is that it provides a mechanism to separate 
individual accents from long-term effects. However, if one allows multiple 
components, then one necessarily faces the problem that there is no unique solution 
in the decomposition of a single f0 time series into multiple components [1]. Any 
such decomposition depends on a model of the speech process, and is only as good 
as the underlying model. 
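The non-uniqueness of the decomposition can be illustrated with a toy example. The numbers below are hypothetical: two different phrase/accent pairs are constructed so that their superposition gives the same surface contour.

```python
def superpose(phrase, accent):
    """Surface contour as the sample-wise sum of a phrase and an accent component."""
    return [p + a for p, a in zip(phrase, accent)]

# Two hypothetical decompositions (values in semitones relative to a baseline).
phrase_a = [4.0, 3.0, 2.0, 1.0, 0.0]   # steadily declining phrase curve
accent_a = [0.0, 2.0, 3.0, 1.0, 0.0]   # local accent on top of it

phrase_b = [3.0, 3.0, 3.0, 1.0, 0.0]   # flatter phrase curve
accent_b = [1.0, 2.0, 2.0, 1.0, 0.0]   # accent shape adjusted to compensate

surface_a = superpose(phrase_a, accent_a)
surface_b = superpose(phrase_b, accent_b)
same_surface = surface_a == surface_b   # True: the surface alone cannot tell them apart
```

Only additional assumptions about the shape of each component, that is, a model of the speech process, can select one decomposition over the other.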
In contrast, Liberman and Pierrehumbert (1984) explicitly reject the notion of a 
phrase curve and represent intonation contours as a single component. The 
advantage of representing f0 information as a single component is that the 
representation of accent heights will then be transparent, which lends itself to 
convenient automatic labeling. [1] 
1.2.1.2 Some prosody models 
The following gives an overview of the intonation models in Table 1.3: 
• INTSINT (Hirst et al., 2000) is an underspecified intonation system that 
defines an accent by a single point. Fitting quadratic spline curves 
through these points generates surface f0. 
• ToBI: The most widely used under-specified accent shape is represented 
by the ToBI model (Beckman and Ayers, 1997; Silverman et al., 1992), 
which developed from earlier works such as Pierrehumbert (1980), 
Liberman and Pierrehumbert (1984), and Pierrehumbert and Beckman 
(1988). Each accent is represented by no more than two points, which 
specify abstractly the relative contrast of high (H) and low (L). One goal 
of the ToBI system is to specify a minimal set of categorical labels for 
intonation. These labels are usually interpreted as phonological 
distinction between accent types. 
• Xu: Xu et al. (1999) represents Chinese tones with under-specified, static 
or dynamic targets. The surface f0 contours are generated with a model 
that approaches these targets asymptotically within the domain of a 
syllable. 
• Tilt (Taylor, 2000; Taylor, 1998) allows more samples than ToBI near 
the peak of an accent and leaves the other regions unspecified, hence its 
status half way to a fully specified system. Tilt considers all accent types 
to be continuous variations of a single class. Surface variations are 
accounted for by changes in the continuous parameters. 
• IPO (de Pijper, 1983) builds a piecewise-linear approximation to the 
pitch contour. The slope and height of these line segments are then 
associated with various types of accents. 
• Olive (1975) described a very early fully-specified system, following 
work by Levitt and Rabiner (1970). His model stored the surface pitch 
vs. time contour as a function of the grammatical structure of the 
sentence. The contour was then 
approximated by polynomial splines attached to words, to allow for 
duration variations. 
• Machine-learning: Several works using machine-learning techniques 
generate densely sampled f0 values, including Chen et al. (1992) and 
Malfrère et al. (1998). We classify these works as fully specified systems 
even though in some cases the concept of accent may not be clear. Ross 
and Ostendorf (1999) described an interesting machine learning system 
where a discrete learning system would predict vectors attached to 
phonemes and syllables, and these vectors would in turn drive a (learned) 
dynamical system to predict f0. 
• Fujisaki: Fujisaki’s phonetic intonation model (Fujisaki and Kawai, 
1982) was developed from the filter method first proposed by Öhman 
(1967). Fujisaki states that intonation contours are composed of two 
types of components, the phrase and the accent. The production process 
is represented by a glottal oscillation mechanism which takes phrase 
and accent information as input and produces a continuous F0 contour 
as output. The input to the mechanism takes the form of impulses, which 
produce phrase shapes, and step functions, which produce accent shapes 
[10]. The Fujisaki model has been successfully applied to decomposing 
F0 contours in many languages such as Japanese, German and Finnish, and 
in some tonal languages such as Chinese and Thai. Research on applying 
the Fujisaki model to Vietnamese is currently under way [11]. We will 
return to this model in Chapter 3. 
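To make the superposition of phrase and accent components concrete, here is a minimal sketch of the Fujisaki model in its commonly cited form: ln F0(t) = ln Fb + sum of Ap·Gp(t - T0) + sum of Aa·[Ga(t - T1) - Ga(t - T2)], with Gp(t) = α²·t·exp(-αt) and Ga(t) = min[1 - (1 + βt)·exp(-βt), γ] for t ≥ 0. The parameter values (α = 3, β = 20, γ = 0.9), the function names and the command timings are assumptions chosen for illustration, not values from this thesis.

```python
import math

def phrase_response(t, alpha):
    """Impulse response of the phrase mechanism: Gp(t) = alpha^2 * t * exp(-alpha * t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta, gamma):
    """Step response of the accent mechanism, limited by the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0, gamma=0.9):
    """F0 (Hz) at time t (s).

    phrase_cmds: list of (onset_time, magnitude) impulses.
    accent_cmds: list of (onset_time, offset_time, amplitude) steps.
    alpha, beta and gamma are assumed typical values, not fitted ones.
    """
    log_f0 = math.log(fb)
    for t0, ap in phrase_cmds:
        log_f0 += ap * phrase_response(t - t0, alpha)
    for t1, t2, aa in accent_cmds:
        log_f0 += aa * (accent_response(t - t1, beta, gamma)
                        - accent_response(t - t2, beta, gamma))
    return math.exp(log_f0)

# Hypothetical utterance: one phrase command at 0 s and one accent from 0.3 s to 0.6 s.
contour = [fujisaki_f0(0.01 * i, 100.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)])
           for i in range(150)]
```

Because the components add in the logarithmic domain, the resulting contour is continuous and smooth even though the inputs are impulses and steps.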
1.2.2. Duration modeling 
We now give a general overview of modeling the duration component of prosody. 
Common methods to predict duration in speech synthesis differ in the following 
aspects: [6] 
• Durational unit predicted: the temporal unit predicted by most current 
systems is either the phone (phoneme), often referred to as “segment”, 
or the syllable. Since phone durations are eventually required for the 
acoustic synthesis, all syllable-based models include some kind of 
mechanism for calculating segment durations from the syllable 
duration. For example, in Barbosa and Bailly’s model, the basic unit is 
delimited by the onset of the nuclear vowel and the onset of the 
following vowel. The durations of these units are computed by a 
sequential network constrained by an internal clock (basically the 
speaking rate). 
• Predictor factors: every model uses a particular vector of input features, 
which are extracted on the linguistic and phonetic levels. The most commonly 
employed factors include: 
  - on the syllabic level: the degree of accentuation and the position in 
    a higher-level unit, such as the foot or accent group; 
  - on the segmental level: the properties of the phone to be 
    synthesized and of its neighboring phones; 
  - on the phrase level: the location of a segment with respect to a 
    minor or major boundary and the position of the phrase in the 
    sentence. 
• The prediction method: the algorithms used for calculating a numerical 
duration value from the vector of input features can be roughly divided 
into rule systems and statistical approaches. In Figure 1-1, the 
statistical approaches are subdivided 
into parametric and non-parametric regression models. Whereas the 
structure of a parametric regression model, in terms of how it processes the 
input factors, is determined a priori, non-parametric regression models are 
developed by data-driven training and the model structure is determined 
automatically (multi-layer perceptrons, CARTs). The main difference 
between rule-based and statistical models is that a rule system can be built 
on relatively little speech data. The formulation of the rules, however, 
requires a high amount of expert knowledge and considerable optimization 
effort by trial and error. In contrast, statistical approaches are built from a 
larger amount of speech data, but the building process is relatively 
effortless. Furthermore, the importance of individual 
factors can be easily assessed by the way the statistical models prioritize 
them. 
Figure 1-1: Category of methods for predicting syllable duration [6] 
• Pause prediction: some current approaches incorporate the prediction of 
speech pauses as part of the model; others treat pauses strictly separately. 
• Speech rate: many current TTS systems produce different speech rates 
by linearly scaling the durations output by the duration model. As the 
speech rate affects not only the duration of individual segments but also 
the overall prosodic structure of an utterance, this kind of modification 
needs to take place at an earlier step of processing, when the phrasal 
structure of an utterance is determined. 
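The distinction between a rule system and simple linear rate scaling can be sketched as follows. All durations, rule factors and feature names below are hypothetical and serve only to illustrate the mechanism. They are not taken from [6] or from any real synthesizer.

```python
# All numbers below are hypothetical and only illustrate the mechanism.
INHERENT_MS = {"a": 120.0, "t": 60.0, "n": 70.0}   # assumed inherent phone durations

RULES = [
    # (name, predicate on the feature dict, multiplicative factor)
    ("phrase-final lengthening", lambda f: f["phrase_final"], 1.4),
    ("unstressed shortening",    lambda f: not f["stressed"], 0.8),
]

def predict_duration_ms(phone, features):
    """Rule system: start from an inherent duration and apply every matching rule."""
    duration = INHERENT_MS[phone]
    for _name, applies, factor in RULES:
        if applies(features):
            duration *= factor
    return duration

def scale_speech_rate(durations_ms, rate):
    """Linear scaling: rate > 1 means faster speech, hence shorter durations."""
    return [d / rate for d in durations_ms]

features = {"phrase_final": True, "stressed": False}
vowel_ms = predict_duration_ms("a", features)            # 120 * 1.4 * 0.8
fast = scale_speech_rate([vowel_ms, INHERENT_MS["t"]], rate=1.2)
```

Linear scaling changes every segment by the same factor and leaves the phrasing untouched, which is the limitation pointed out in the last item above.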
1.2.3. Approach of this thesis 
Modeling intonation and duration in prosody is a complex field, related to both 
linguistics and acoustics. There are many different methods to predict the 
intonation and duration of speech. However, no existing method is yet fully 
applicable to Vietnamese. 
Within the scope of this thesis, we use a statistical approach to extract some basic 
patterns, modeling only some basic cases of Vietnamese prosody. 
The main features of our approach to modeling Vietnamese prosody are as follows: 
• Method: statistical approach: calculate the average values of F0, intensity and 
duration from a corpus. 
• Intonation modeling: 
  - Under-specified: using 20 isolated points. 
  - Single component. 
• Phrase-level factors: three types of sentence: assertive, interrogative and 
imperative. 
• Syllable-level factors: tone of the syllable. 
• Extra-linguistic factors: male/female voice. 
This approach is described in more detail in Chapter 4. 
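The statistical step listed above (averaging F0 over a corpus, with 20 points per pattern) could look roughly like the sketch below. The resampling method, the grouping key and the corpus entries are assumptions made for illustration. The thesis describes its actual procedure in a later chapter.

```python
from collections import defaultdict

N_POINTS = 20  # number of points per pattern, as stated in the text

def resample(contour, n_points=N_POINTS):
    """Resample a contour to n_points equally spaced values by linear interpolation."""
    if len(contour) == 1:
        return [float(contour[0])] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(contour) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out

def average_patterns(corpus):
    """corpus: iterable of (sentence_type, tone, gender, f0_contour) tuples."""
    groups = defaultdict(list)
    for sentence_type, tone, gender, contour in corpus:
        groups[(sentence_type, tone, gender)].append(resample(contour))
    # Average point by point over all contours that share the same key.
    return {key: [sum(col) / len(col) for col in zip(*items)]
            for key, items in groups.items()}

# Hypothetical corpus entries (F0 in Hz); real contours would come from recordings.
corpus = [
    ("assertive", "1", "female", [220.0, 218.0, 215.0]),
    ("assertive", "1", "female", [230.0, 226.0, 225.0]),
]
patterns = average_patterns(corpus)
```

Resampling to a fixed number of points makes syllables of different durations comparable before they are averaged. The same grouping could be applied to intensity and duration values.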
Chapter 2: Vietnamese language and prosody 
 Mạc Đăng Khoa 
2 
Vietnamese language and prosody 
The understanding of phonetic and phonological characteristics of a language has an 
important role in the studies on speech processing in general and on prosody 
analysis in particular. Thus, in this chapter, we give a review of Vietnamese 
language and Vietnamese prosody. 
2.1. Vietnamese language 
2.1.1. Vietnamese characteristics 
Vietnamese is an amorphous language and a tonal/musical 
language. It has the following characteristics [21]: 
1. Vietnamese words are amorphous: they do not change form to show 
grammatical categories. In French, by contrast, words inflect for gender 
(étudiant - étudiante, nouveau - nouvelle) and for number 
(amie - amies). 
2. Vietnamese word structure does not use affixes (prefixes, suffixes or 
infixes); Vietnamese is a non-affixing language. In French or English, by 
contrast, the antonym of a word can be formed by adding a prefix such as 
“im-”, “ir-” or “un-”: impolite, unreadable, irregular… 
3. Vietnamese word structure uses very few morphemes. Vietnamese 
has at most twenty thousand syllables with which to create morphemes; thus 
Vietnamese does not have the features of inflectional languages. 
The morpheme index of Vietnamese (number of morphemes M / number 
of words W) is about 1.06 [13], the lowest among the 5000 languages 
of the world [13]. A language whose morpheme index is less than 2 is an 
analytic language. 
The amorphous feature of Vietnamese is an essential characteristic, which 
influences the other characteristics of the language. 
4. Vietnamese is a tonal/musical language. It has 
six tones, and each tone contributes to forming the morpheme and the 
meaning of a word, e.g. ba, bá, bà, bả, bã, bạ; me, mé, mè, mẻ, mẽ, mẹ. The 
tones give Vietnamese a musical character and make 
sentences rhythmic and melodious. 
5. A syllable (isolated word) of Vietnamese in its full structure has five 
parts: initial sound (consonant), medial sound (semi-vowel), nucleus sound 
(vowel or diphthong), final sound (consonant or semi-vowel) and tone. In 
a word, the consonant and the vowel play an essential role: they are the core 
of the word. They can create a syllable by themselves. Apart from the initial 
consonant, the rest of a word is called the final (vần). Vietnamese has 155 
basic finals [13]. 
6. In Vietnamese, syllable boundaries and morpheme boundaries coincide: one 
syllable is one morpheme. In French, partir (to leave) has two syllables, par-tir, 
and two morphemes, part-ir; vendeur (seller) has two syllables, ven-deur, and 
two morphemes, vend-eur. In English, “words” has one syllable and two 
morphemes (word-s). In Vietnamese, the sentence “Đẹp vô cùng tổ quốc ta ơi!” (Tố 
Hữu) has seven morphemes, seven syllables and five words (three simple 
words: đẹp, ta, ơi, and two compound words: vô cùng, tổ quốc). In conclusion, 
the basic unit of Vietnamese is at once a syllable, a morpheme and a simple word. 
7. Almost all Vietnamese vocabulary is formed from one or two morphemes and is 
monosyllabic or bisyllabic, sometimes polysyllabic. About 80% of words 
are bisyllabic. 
8. The difference between the written language and the spoken language in 
grammatical rules and phonetic rules is not large. 
9. Over the course of its formation and development, Vietnamese 
has borrowed many words from foreign languages. Han 
words are the most numerous, followed by French words, and some of them have 
been fully assimilated into Vietnamese. For example, đấu tranh, giai cấp, 
hoà bình, độc lập, tự do, hạnh phúc are Han words (Chinese words). Nhà ga 
(gare), xà phòng (savon), cà phê (café) are French words. 
2.1.2. Vietnamese phoneme system 
The Vietnamese phoneme system includes 14 vowels or vowel combinations and 22 
consonants. 
The vowels comprise 11 simple vowels and three diphthongs [21]. All vowels 
are voiced sounds. 
Table 2.1: Vietnamese vowels. 
Transcription | Reading          | Letters        | Example 
/a/           | a                | a              | a ha 
/ă/           | ă                | ă              | con mắt 
/ɤ̆/           | ấ                | â              | ân cần 
/ɛ/           | e                | e              | e dè 
/e/           | ê                | ê              | ê chề 
/i/           | short i & long y | i              | ỉ eo, ý chí 
/ɔ/           | o                | o              | co ro 
/o/           | ô                | ô              | hồ đồ 
/ɤ/           | ơ                | ơ              | bơ phờ 
/u/           | u                | u              | tù mù 
/ɯ/           | ư                | ư              | lừ đừ 
/ie/          | ia, yê           | ia, iê, ya, yê | kia kìa, yêu kiều 
/uo/          | ua               | ua, uô         | tua rua, luôn luôn 
/ɯɤ/          | ưa               | ưa, ươ         | lưa thưa, lượt thượt 
Vietnamese has 22 consonants [21], listed in Table 2.2. 
Table 2.2: Vietnamese consonants. 
Transcription | Reading             | Letter  | Example 
/b/           | bê and bờ           | b       | bồng bềnh 
/p/           | pê and pờ           | p       | ốp ép 
/v/           | vê and vờ           | v       | vẩn vơ 
/f/           | phờ                 | ph      | phơi pha 
/m/           | em-mờ and mờ        | m       | mơ màng 
/d/           | đê and đờ           | đ       | đất đai 
/t/           | tê and tờ           | t       | tin tưởng 
/t’/          | thờ                 | th      | thơ thẩn 
/s/           | ich-xì and xờ       | x       | xa xôi 
/z/           | dê, giê and dờ      | d, gi   | duyên dáng, giữ gìn 
/n/           | en-nờ and nờ        | n       | nôn nóng 
/l/           | e-lờ and lờ         | l       | long lanh 
/ʈ/           | trờ                 | tr      | trồng trọt 
/ʂ/           | ét-sì and sờ        | s       | sung sướng 
/ʐ/           | e rờ and rờ         | r       | róc rách 
/c/           | chờ                 | ch      | chông chênh 
/ɲ/           | nhờ                 | nh      | nhọc nhằn 
/ŋ/           | ngờ                 | ng, ngh | ngô nghê 
/k/           | xê, ca, quy and cờ  | c, k, q | con cá, kĩu kịt, qua quít 
/x/           | khờ                 | kh      | khúc khích 
/ɣ/           | gê, giê and gờ      | g, gh   | gồ ghề 
/h/           | hát and hờ          | h       | hả hê 
Vietnamese consonants can be distinguished by the following features: manner 
and place of articulation, stop or fricative, voiced or unvoiced, 
aspirated or unaspirated, nasal or non-nasal, and the position of the consonant 
in the syllable. Based on these features, Vietnamese consonants can be arranged 
as in Table 2.3. 
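Table 2.3: Arrangement of Vietnamese consonants. 
Articulation method \ position          | labial | apical (dental) | apical (palate) | laminal | dorsal | glottal 
Stop, noise, aspirate                   |        | t’              |                 |         |        | 
Stop, noise, non-aspirate, unvoiced     | p      | t               | ʈ               | c       | k      | 
Stop, noise, non-aspirate, voiced       | b      | d               |                 |         |        | 
Stop, sonant                            | m      | n               |                 | ɲ       | ŋ      | 
Fricative, noise, unvoiced              | f      | s               | ʂ               |         | x      | h 
Fricative, noise, voiced                | v      | z               | ʐ               |         | ɣ      | 
Fricative, sonant (beside)              |        | l               |                 |         |        | 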
2.1.3. Syllable structure 
Vietnamese grammarians and linguists have long considered the syllable 
a fundamental unit of Vietnamese. A syllable in its full structure (a tonal 
syllable) has five parts: initial sound, medial sound, nucleus sound, final 
sound and tone (see Table 2.4) [21]. For instance, the syllable “toán” has the 
following components: initial sound /t/, medial sound /o/, nucleus sound /a/, 
final sound /n/, and tone “sắc” (or rising tone). A syllable must have a 
nucleus sound; the other components are optional. A nucleus sound alone can 
form a syllable, for instance a, ơ, ê… Apart from the initial sound (called the 
INITIAL part), the rest of the syllable is called the FINAL part. A tone is a 
fundamental frequency variation spreading over the whole syllable. A tone has 
the same function as a phoneme: it is always assigned to a syllable and its 
influence covers the entire syllable. There are a few constraints: if 
a syllable ends with one of the unvoiced consonants /p, t, k/, only the “sắc” 
and “nặng” tones are possible; otherwise, in all varieties of Vietnamese, the 
whole tonal paradigm can occur. 
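The five-part syllable structure and the tone constraint described above can be expressed as a small data structure. The sketch below is illustrative only: the class and field names are assumptions, and the constraint check covers just the two rules stated in the text (obligatory nucleus, restricted tones before /p, t, k/).

```python
from dataclasses import dataclass

TONES = ("ngang", "huyền", "ngã", "hỏi", "sắc", "nặng")
STOP_ENDINGS = ("p", "t", "k")        # unvoiced final consonants /p, t, k/ (phonemes, not letters)
STOP_TONES = ("sắc", "nặng")          # the only tones allowed before a stop ending

@dataclass
class Syllable:
    nucleus: str          # obligatory
    tone: str             # one of TONES
    initial: str = ""     # optional
    medial: str = ""      # optional
    ending: str = ""      # optional

    def is_valid(self):
        """Check the two constraints stated in the text."""
        if not self.nucleus or self.tone not in TONES:
            return False
        if self.ending in STOP_ENDINGS and self.tone not in STOP_TONES:
            return False
        return True

    @property
    def final(self):
        """FINAL part: everything except the initial sound."""
        return self.medial + self.nucleus + self.ending

# The example from the text: initial /t/, medial /o/, nucleus /a/, final /n/, tone "sắc".
example = Syllable(initial="t", medial="o", nucleus="a", ending="n", tone=STOP_TONES[0])
```

A bare nucleus such as `Syllable(nucleus="a", tone=TONES[0])` is also valid, since all other parts are optional.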
Table 2.4: The phonological hierarchy of Vietnamese syllables with total numbers of each 
phonetic unit [14]. 
TONAL SYLLABLE (6492) 
  BASE SYLLABLE (2376) 
    Initial (22) 
    Final (155) 
      Medial (1) 
      Nucleus (16) 
      Ending (8) 
  TONE (6) 
2.2. Vietnamese prosody 
Since Vietnamese is a tonal language, its prosody is composed of two components, 
which we call “micro-prosody” and “macro-prosody”: 
• Micro-prosody is the variation of pitch, duration and intensity within an 
individual word or syllable. For a tonal language, micro-prosody is very 
important for distinguishing the tone of a syllable. Thus, the lexical meaning 
of the synthesized sound depends heavily on the quality of the micro-prosody. 
• Macro-prosody is the application of prosody to a whole phrase or 
sentence. It depends on the type of sentence and the speaker's intentions or 
emotions. Therefore, the "naturalness" of synthesized sentences depends 
heavily on the ability to control macro-prosody during the speech synthesis 
process. 
2.2.1. Micro-prosody and tones system in Vietnamese 
In Vietnamese, micro-prosody depends heavily on the tone of the syllable. Each tone 
contributes to forming the morpheme and the meaning of a word; it is also a 
distinctive signal. The tone has the same function as a phoneme: it is always 
assigned to a syllable and its influence covers the entire syllable. The tones 
give Vietnamese a musical character and make sentences rhythmic and 
melodious. 
There are six tones in Vietnamese; they are shown in Table 2.5. 
Table 2.5: The six Vietnamese tones. 
Tone 1 | Tone 2    | Tone 3  | Tone 4  | Tone 5  | Tone 6 
ngang  | huyền ‘\’ | ngã ‘~’ | hỏi ‘?’ | sắc ‘/’ | nặng ‘.’ 
Figure 2-1: Example of the contours of six tones, as described in [21]. 
• Tone 1 - Level tone (“ngang”): a high tone. At the beginning of the 
syllable, it is the highest tone. The steady state of the level contour is 
observed consistently. The figure below shows the shape of tone 
1 for the male and female voices (the two lines show the maximum and 
minimum F0 values). 
Figure 2-2: The shape of Tone 1 with female and male voice [18] 
• Tone 2 - Falling tone (“huyền”): the onset of the falling tone is 
lower than those of tone 1, tone 5 and tone 3. From this low onset, F0 
gradually falls toward the end. 
Figure 2-3: The shape of Tone 2 with female and male voice [18] 
• Tone 3 - Broken tone (“ngã”): the onset is as high as that of 
tone 5 and higher than that of the falling tone. The second third of the 
contour of this tone is characterized by an abrupt dip caused by 
glottalization. In most cases, the bottom of the dip occurs between the 
mid-point and the point two-thirds from the onset. A creaky voice is 
heard during this dip. 
Figure 2-4: The shape of Tone 3 with female and male voice [18] 
• Tone 4 - Curve tone (“hỏi”): the onset is the lowest among the six tones. 
The low onset falls further gradually until the point two-thirds from the 
onset. From this point, the extremely low F0 starts to rise toward the end. 
Figure 2-5: The shape of Tone 4 with female and male voice [18] 
• Tone 5 - Rising tone (“sắc”): the onset is also high. Starting from high 
onset, the F0 gradually rises for the first two thirds of the duration. After 
this point, the rise becomes more rapid. 
Figure 2-6: The shape of Tone 5 with female and male voice [18] 
• Tone 5b (tone 5 ending with a stop consonant: t, p, c, k): the onset is 
higher than that of plain tone 5 (tone 5a) and F0 rises rapidly over a short 
duration. We call this variant tone 5b. 
Figure 2-7: The shape of Tone 5b with female and male voice [18] 
• Tone 6 - Drop tone (“nặng”): the onset is usually higher than that of the 
falling or curve tone but considerably lower than those of tone 1, tone 5 and 
tone 3. This tone is characterized by glottalization at the end and also by 
a considerably shorter duration than the other tones: approximately two 
thirds of their duration. The main body of this 
tone is almost level or slightly falling. 
Figure 2-8: The shape of Tone 6 with female and male voice [18] 
• Tone 6b (tone 6 ending with a stop consonant): the onset is nearly equal 
to that of tone 2. F0 falls toward the end over a short duration. 
Figure 2-9: The shape of Tone 6b with female and male voice [18] 
These descriptions apply only to the Northern dialect, in particular the Hanoi 
dialect, which is the standard dialect of Vietnamese. They differ for the 
dialects of the South and the Center of Vietnam. In these regions, there are only 5 
tones instead of the 6 of the Hanoi dialect, because tone 3 and tone 4 are 
pronounced identically. 
In continuous speech, tones seldom reach their target values. They are generally 
affected by context: stressed vs. unstressed syllable, influence of neighbouring 
tones, tempo… and by the effect of some phenomena in Vietnamese prosody, which 
we will discuss later. 
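Since the tone of a written syllable is marked by a diacritic, the tone classes used above (1 to 6, plus 5b and 6b before a stop) can be derived from the orthography. The following sketch relies on Unicode canonical decomposition. The function name and the label strings are assumptions, and treating a final “ch” as a stop is an assumption added here (the text lists only t, p, c, k).

```python
import unicodedata

# Combining diacritics that mark tone in Vietnamese orthography.
TONE_MARKS = {
    "\u0300": "2",  # grave: huyền
    "\u0303": "3",  # tilde: ngã
    "\u0309": "4",  # hook above: hỏi
    "\u0301": "5",  # acute: sắc
    "\u0323": "6",  # dot below: nặng
}
# Stop endings listed in the text (t, p, c, k); "ch" is added here as an assumption.
STOP_ENDINGS = ("p", "t", "c", "ch", "k")

def tone_label(syllable):
    """Return '1'..'6', or '5b'/'6b' for sắc/nặng syllables ending in a stop."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = "1"  # no tone mark: ngang
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
            break
    # Base letters only; vowel-quality marks (circumflex, breve, horn) are dropped too.
    letters = "".join(c for c in decomposed if not unicodedata.combining(c))
    if tone in ("5", "6") and letters.endswith(STOP_ENDINGS):
        tone += "b"
    return tone

labels = [tone_label(s) for s in ("ba", "bà", "bã", "bả", "bá", "bạ", "tốt", "học")]
```

Such a labelling step only identifies the lexical tone. The actual F0 shape in continuous speech still depends on the contextual effects mentioned above.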
2.2.2. Macro-prosody and sentence types in Vietnamese 
As mentioned above, macro-prosody depends on the type of sentence and on the 
speaker's intentions or emotions. In this thesis, we discuss only the role of 
sentence types in Vietnamese prosody. 
The classification of Vietnamese sentences 
• Classification by purpose: sentences can be classified based on their 
purpose [8]: 
  - An assertive sentence or declaration: the most common type, 
    commonly makes a statement. Ex: Tôi sẽ về nhà. (I am going 
    home.) 
  - An interrogative sentence or question: commonly used to 
    request information. Ex: Khi nào anh sẽ làm việc? (When are you 
    going to work?) 
  - An imperative sentence or command: ordinarily used to make a 
    demand or request. Ex: Mở cửa ra! (Open the door!) 
  - An exclamatory sentence or exclamation: generally a more 
    emphatic form of statement. Ex: Ngày hôm nay tuyệt quá! (What a 
    wonderful day this is!) 
• Classification by structure: sentences can also be classified based on their 
structure (by the number and types of finite clauses), as in the diagram below. 
Figure 2-10: Sentence classification by structure [20] 
Within the scope of this thesis, we study only the macro-prosody of assertive, 
interrogative and imperative sentences with a simple structure. 
In their research, Nguyễn and Boulakia [8] reported the following prosodic 
characteristics of three types of sentence (assertive, interrogative and 
imperative): 
• Duration (tempo): interrogative sentences (Q) are shorter than 
assertive sentences (S), and this difference is significant. Imperatives (I) 
are even shorter, but their differences from Q and S are not significant. 
• Intensity: the difference is significant for the S/I pair (assertive vs. 
imperative), but not for the S/Q and Q/I pairs. 
• Fundamental frequency: the mean F0 of interrogative and 
imperative sentences is higher than that of statements, while there is 
no difference between interrogative and imperative sentences. There is an 
obvious difference in the last syllable. The phonologically "level" (high) 
tone falls in statements and is much higher and rising in questions, while 
its mean value and movement are halfway between the two for imperatives. 
Rising tones rise even more in interrogative and imperative 
sentences than in statements. This means that 
intonation influences the tone of the final syllable of the sentence. 
Figure 2-11: The sentence “Lan thích ăn cơm không” in 
Assertive (S) and Interrogative (Q) mode [8] 
Figure 2-12: The sentences “Bảo cố gắng tập đi” in 
Assertive (S) and Imperative (I) mode [8] 
Figure 2-13: The sentences “Tân bỏ đi chứ” in 
Interrogative (Q) and Imperative (I) mode [8] 
In the research of Vu M. Q. et al. [16], they found that the main part of differences in 
intonation [...] 
Tone  | Intended      | Assertive | Interrogative | Imperative | Total 
[...] | Imperative    | 7         | 43            | 50         | 100 
6     | Assertive     | 67        | 10            | 23         | 100 
6     | Interrogative | 20        | 60            | 20         | 100 
6     | Imperative    | 17        | 30            | 53         | 100 
6b    | Assertive     | 67        | 20            | 13         | 100 
6b    | Interrogative | 10        | 53            | 37         | 100 
6b    | Imperative    | 17        | 23            | 60         | 100 
Chapter 6: Conclusion and Perspectives 
 Mạc Đăng Khoa 
Table 5.8: Confusion matrix (in %) of sentence types (average of Male and Female) 
Rows: tone of the last syllable and intended sentence type. Columns: perceived type of sentence (%). 
Tone | Intended      | Assertive | Interrogative | Imperative | Total 
1    | Assertive     | 77        | 20            | 3          | 100 
1    | Interrogative | 20        | 60            | 20         | 100 
1    | Imperative    | 10        | 33            | 57         | 100 
2    | Assertive     | 75        | 20            | 5          | 100 
2    | Interrogative | 23        | 65            | 12         | 100 
2    | Imperative    | 42        | 17            | 42         | 100 
3    | Assertive     | 70        | 27            | 3          | 100 
3    | Interrogative | 32        | 48            | 20         | 100 
3    | Imperative    | 52        | 23            | 25         | 100 
4    | Assertive     | 68        | 20            | 12         | 100 
4    | Interrogative | 23        | 72            | 5          | 100 
4    | Imperative    | 25        | 25            | 50         | 100 
5    | Assertive     | 60        | 35            | 5          | 100 
5    | Interrogative | 32        | 50            | 18         | 100 
5    | Imperative    | 15        | 28            | 57         | 100 
5b   | Assertive     | 48        | 45            | 7          | 100 
5b   | Interrogative | 12        | 80            | 8          | 100 
5b   | Imperative    | 10        | 40            | 50         | 100 
6    | Assertive     | 50        | 12            | 38         | 100 
6    | Interrogative | 27        | 63            | 10         | 100 
6    | Imperative    | 15        | 25            | 60         | 100 
6b   | Assertive     | 63        | 13            | 23         | 100 
6b   | Interrogative | 30        | 43            | 27         | 100 
6b   | Imperative    | 22        | 15            | 63         | 100 
[Bar chart. x-axis: tone of last syllable (1, 2, 3, 4, 5, 5b, 6, 6b); y-axis: correct recognition rate (%); series: Assertive, Interrogative, Imperative] 
Figure 5-5: Correct recognition rate with 8 tones of last syllable 
As can be seen, assertive sentences ending with tones 1, 2, 3 and 4 were correctly 
identified in about 70% to 80% of judgments. With tones 5, 5b, 6 and 6b, the 
correct recognition rate was fair, about 50% to 60%. With tone 5b, 45% of assertive 
sentences were confused with interrogative sentences. 
The results were fairly good for interrogative sentences ending with tones 1, 2, 
4, 5b and 6 (60% or more correct, and as high as 80% with tone 5b). With the other 
tones, the rate was 40% to 50%. 
For imperative sentences ending with tones 1, 4, 5, 5b, 6 and 6b, the correct 
recognition rate is fair, from 50% to 60%. With tone 2 and tone 3, 42% and 52% of 
imperative sentences, respectively, were confused with assertive sentences. 
In summary, the average correct recognition rates are given in Table 5.9 and 
Figure 5-6. 
Table 5.9: Correct recognition rate (%) with other types of sentences 
Sentence type | Male voice | Female voice | Both 
Assertive     | 60         | 68           | 64 
Interrogative | 57         | 63           | 60 
Imperative    | 53         | 48           | 50 
Global        | 56         | 60           | 58 
[Bar chart. x-axis: voice (Male, Female, Both voices); y-axis: correct recognition rate (%); series: Assertive, Interrogative, Imperative] 
Figure 5-6: Correct recognition rate (%) with other types of sentences 
The correct recognition rate of assertive and interrogative sentences is 
better with the female voice than with the male voice. With imperative 
sentences, however, the male voice gives a better result. Overall, the average 
correct recognition rate over both voices is approximately 64% for assertive 
sentences, 60% for interrogative sentences and 50% for imperative sentences. 
5.3. Comparison and conclusion 
For comparison, we used the result of another perception test, on pseudo-sentences 
(footnote 1), which was carried out in other research at the MICA center [17]. In 
that test, the listener had to choose between two types, “interrogative” and 
“assertive”, for the pseudo-sentence they listened to. 
Table 5.10 and Figure 5-7 compare the results of our two 
experiments (with pattern-based synthesized sentences) with those of the other 
experiment on pseudo-sentences (which we call Experiment M): 
Table 5.10: Result of three experiments (by experiment, voice of synthesis and sentence type) 
Experiment 1: non-sense sentence. Experiment 2: multi-type sentence. Experiment M: pseudo-sentence. 
              | Experiment 1         | Experiment 2         | Experiment M 
Sentence type | Male | Female | Both | Male | Female | Both | Male | Female | Both 
Assertive     | 68   | 55     | 61   | 60   | 68     | 64   | 58   | 69     | 64 
Interrogative | 49   | 50     | 50   | 57   | 63     | 60   | 61   | 74     | 68 
Imperative    | 36   | 42     | 39   | 53   | 48     | 50   | -    | -      | - 
Global        | 51   | 49     | 50   | 56   | 60     | 58   | 60   | 72     | 66 
[Bar chart. x-axis: Experiment 1, Experiment 2, Experiment M; y-axis: correct recognition rate (%); series: Assertive, Interrogative, Imperative] 
Figure 5-7: Result comparison of three experiments 
(1) A pseudo-sentence is a sentence without semantic information, re-synthesized from the vowel /a/ so that it simulates the prosody of an actual sentence [17].
Overall, the result of Experiment 2 is better than that of the previous experiment, which used non-sense sentences (50% correct recognition). The global correct recognition rate rises from 50% to 58%: the gain is about 10 percentage points for interrogative and imperative sentences and 3 points for assertive sentences. This can be explained as follows. In the first experiment, we used the same tone for all syllables in the sentence but did not simulate the co-articulation phenomena. Therefore, the synthesized sentences were very different from natural speech. In the test sentences of Experiment 2, all syllables except the final one were in tone 1 (the level tone, or “non-tone”). The co-articulation phenomenon is thus minimized, and the sentences are more natural.
Compared with Experiment M, the result of Experiment 2 is the same for assertive sentences (64% over both voices) but worse for interrogative sentences (60% against 68%). The test sentences in Experiment M were synthesized by simulating the prosody of actual sentences. That is why they could be more natural than the test sentences in Experiment 2, which used our proposed prosody patterns. Therefore, the worse result of Experiment 2 is understandable and acceptable.
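The comparison above can be checked directly from the “Both” columns of Table 5.10. The following Python sketch only illustrates that arithmetic; the dictionary layout and the function name `gain` are choices made for this example and are not part of the software used in the experiments.

```python
# Correct recognition rates (%) over both voices, copied from Table 5.10.
RATES = {
    "Experiment 1": {"Assertive": 61, "Interrogative": 50, "Imperative": 39, "Global": 50},
    "Experiment 2": {"Assertive": 64, "Interrogative": 60, "Imperative": 50, "Global": 58},
    # Experiment M has no imperative sentences.
    "Experiment M": {"Assertive": 64, "Interrogative": 68, "Global": 66},
}


def gain(new, old):
    """Difference in percentage points for each sentence type present in both experiments."""
    return {k: RATES[new][k] - RATES[old][k] for k in RATES[new] if k in RATES[old]}


if __name__ == "__main__":
    for old in ("Experiment 1", "Experiment M"):
        print(f"Experiment 2 vs {old}: {gain('Experiment 2', old)}")
```

A positive value means that Experiment 2 obtained the higher rate for that sentence type.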
In this chapter, we have presented some experiments to evaluate our prosody patterns. These patterns are currently very simple: they are based only on the position of the syllable in the sentence and the type of the containing sentence. However, the results of the experiments show that our proposed patterns can be applied to predict the prosody of simple sentences.
6 
Conclusion and Perspectives 
The subject of this thesis is “Modeling the prosody of Vietnamese language for speech synthesis”. However, as we all know, finding a complete model of Vietnamese prosody is currently a large and complex problem, which requires much research in linguistics, acoustics and speech processing. Therefore, within the scope of a master thesis, we have studied some basic factors of Vietnamese prosody, characterized them and tried to apply them to speech synthesis.
In chapter 3, we proposed a method and a structure for the prosody generation module. The simplest way to generate the prosody of a whole sentence is to concatenate the prosody patterns of its syllables directly. Hence, in chapter 4, we set up a prosody corpus, analyzed it and proposed 72 prosody patterns for syllables, corresponding to the initial, middle and final positions of the syllable in three types of sentence (assertive, interrogative and imperative).
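As an illustration of how such a pattern inventory might be indexed, the following Python sketch enumerates one key per combination of tone, sentence type and position, and selects the keys to concatenate for a given sentence. It assumes that the 72 patterns correspond to the eight tone labels used in the corpus codes of Appendix A (1 to 6, 5b and 6b) times three sentence types times three positions. This reading of the number 72, the key format and the function names are assumptions made for the example.

```python
from itertools import product

# Labels follow the corpus codes of Appendix A (e.g. "1_As_In", "5b_Imp_Fi").
TONES = ["1", "2", "3", "4", "5", "6", "5b", "6b"]
SENTENCE_TYPES = ["As", "Int", "Imp"]   # assertive, interrogative, imperative
POSITIONS = ["In", "Mid", "Fi"]         # initial, middle, final


def pattern_keys():
    """Return one key per (tone, sentence type, position) combination."""
    return [f"{t}_{s}_{p}" for t, s, p in product(TONES, SENTENCE_TYPES, POSITIONS)]


def position_of(index, n_syllables):
    """Map a syllable index to a position label.

    Assumption: a one-syllable sentence is treated as "initial".
    """
    if index == 0:
        return "In"
    if index == n_syllables - 1:
        return "Fi"
    return "Mid"


def keys_for_sentence(tones, sentence_type):
    """List the pattern keys to concatenate for a sentence, given the tone of each syllable."""
    n = len(tones)
    return [f"{tone}_{sentence_type}_{position_of(i, n)}" for i, tone in enumerate(tones)]


if __name__ == "__main__":
    print(len(pattern_keys()))
    print(keys_for_sentence(["1", "1", "1", "5"], "Int"))
```

For a four-syllable interrogative sentence whose syllables carry tones 1, 1, 1 and 5, the selected keys are 1_Int_In, 1_Int_Mid, 1_Int_Mid and 5_Int_Fi.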
Although these patterns are not enough to model all cases of Vietnamese prosody, they can be applied to generate the prosody of simple sentences. This is supported by the results of the experiments in chapter 5: in the perception tests, the listeners correctly identified 64% of assertive sentences, 60% of interrogative sentences and 50% of imperative sentences.
These patterns are our first prosody patterns for Vietnamese syllables. They were extracted from a small prosody corpus and take into account only three factors: tone, position of the syllable and sentence type. In future work, we expect to improve
these patterns by setting up and analyzing a larger corpus. Additionally, we will study other factors, such as syntactic and lexical-meaning factors or the glottalization and co-articulation phenomena, in order to integrate them into our patterns. Moreover, these prosody patterns can be represented not only as absolute values of F0, duration and intensity but also as a set of parameters of another prosody model (such as the Fujisaki model). In that way, we will be able to generate the prosody description in other types of prosodic model and apply it to other types of TTS system.
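Appendix B lists F0 both in Hz and in semitones. As a minimal sketch of the usual conversion between the two scales, assuming a reference frequency of 100 Hz (the reference is not restated in this chapter, so `REF_HZ` and the function names are assumptions), one could write:

```python
import math

REF_HZ = 100.0  # assumed reference frequency for the semitone scale


def hz_to_semitones(f0_hz, ref_hz=REF_HZ):
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def semitones_to_hz(st, ref_hz=REF_HZ):
    """Convert semitones relative to ref_hz back to a frequency in Hz."""
    return ref_hz * 2.0 ** (st / 12.0)


if __name__ == "__main__":
    print(round(hz_to_semitones(200.0), 2))  # one octave above the reference
```

If the tabulated values are averages over several utterances, converting a tabulated value in Hz with this formula need not reproduce the tabulated semitone value exactly.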
The following is a summary of the work done in this master thesis, its limitations and future approaches:
• The work we have done:
  - Proposed a simple method of prosody generation and a structure for the prosody generation module in a TTS system
  - Set up a corpus for prosody research
  - Proposed 72 prosody patterns for Vietnamese syllables
  - Applied the proposed prosody patterns to synthesize some simple sentences and evaluated these sentences by perception tests
• The limitations:
  - The corpus is small
  - The proposed patterns for syllables are simple and take into account only three factors: tone, position of the syllable and sentence type
  - Intensity control in speech synthesis was done manually and is not very accurate
  - The number of listeners in the perception tests was small (5 males and 5 females)
• Future approaches:
  - Set up a larger corpus, which covers other factors and phenomena in Vietnamese prosody
  - Define prosody patterns from that corpus
  - Study other prosodic models and convert these patterns into the parameters of those models
  - Develop a complete prosody generation module and integrate it into a TTS system
The last words 
This work was carried out at the MICA center, and I expect that its results can be applied in the MICA speech synthesis system.
This work is also my preliminary work in the domain of speech processing. With the guidance of my supervisors and the instruction and support of the others in MICA’s speech processing group, my knowledge and skills in speech processing have improved steadily. That knowledge is the necessary background for my future studies and research. Once again, thank you all very much!
References 
[1]. Chilin Shih , Greg Kochanski, “Prosody and Prosodic Models”, 
www.prosodies.org. 
[2]. Do T.D., Tran T.H., et al. (1998), “Intonation system - A survey of twenty 
languages”, chap. 22, Cambridge University Press. 
[3]. Dung Tien Nguyen, Hansjörg Mixdorff, Mai Chi Luong, Huy Hoang Ngo, Bang Kim Vu (2005), “Fujisaki Model based F0 contours in Vietnamese TTS”, Eurospeech proceeding
[4]. H. Fujisaki, S. Ohno, C. Wang (1974), “A command-response model for F0 
contour generation in multilingual speech synthesis”, Journal of Phonetics, 
vol. 2, pp 223-232, 
[5]. Mixdorff H. (1998), “Intonation patterns of German - Model-based 
quantitative analysis and synthesis of F0 contours”, PhD thesis, TU 
Dresden 
[6]. Mixdorff H. (2001), “An Integrated Approach to Modeling German 
Prosody”, TU Dresden 
[7]. Mixdorff H., Nguyen Hung Bach, Hiroya Fujisaki and Mai Chi Luong 
(2003), “Quantitative Analysis and Synthesis of Syllabic Tones in 
Vietnamese”, Eurospeech proceeding. 
[8]. Nguyen T.T.H. and Boulakia G. (1999), "Another look at Vietnamese 
intonation", ICPhS'99 
[9]. Ninh Khanh Duy (2005), “Characterization of Vietnamese intonation for 
questions”, Master Thesis, Hanoi University of Technology, 
[10]. Paul Alexander Taylor (1992), “A Phonetic Model of English Intonation”, 
PhD thesis, University of Edinburgh, 
[11]. Sami Lemmetty (1999), “Review of Speech Synthesis Technology”, MSc thesis, Helsinki University of Technology
[12]. Thierry Dutoit (1993), “High Quality Text-To-Speech Synthesis of the 
French Language”, PhD thesis, Faculte Polytechnique de Mons, TCTS Lab, 
Belgium 
[13]. Thierry Dutoit (1997), “An Introduction to Text-to-Speech Synthesis”, 
Kluwer Academic Publishers. 
[14]. Tran D.D., Castelli E., et al. (2005), "Influence of F0 on Vietnamese 
syllable perception", Interspeech. 
[15]. Tran Do Dat (2003), “Building a large Vietnamese Speech Database”, 
Master Thesis, Hanoi University of Technology 
[16]. Vu M.Q., Tran D.D., Castelli E. (2006), “Prosody of Interrogative and 
Affirmative Sentences in Vietnamese Language: Analysis and Perceptive 
Results” 
[17]. Vu M.Q., Tran D.D. & Castelli E. (2006), "Intonation des phrases 
interrogatives et affirmatives en langue vietnamienne", JEP2006, XXVIes 
Journées d’Etude sur la Parole. Manoir de la Vicomté - Dinard, France 
[18]. Nguyen Quoc Cuong (2002), “Reconnaissance de la parole en langue Vietnamienne”, Thèse, INPG, UJF Grenoble, France, Juin 2002
[19]. Bạch Hưng Nguyên, Nguyễn Tiến Dũng (2005), "Mô hình Fujisaki và áp dụng trong phân tích thanh điệu tiếng Việt"
[20]. Mai Ngọc Chừ, Vũ Đức Nghiệu, Hoàng Trọng Phiến (2005), “Cơ sở ngôn ngữ học và tiếng Việt”, NXB Giáo dục.
[21]. Nguyễn Hữu Quỳnh (2001). “Ngữ Pháp Tiếng Việt”, Nhà xuất bản từ điển 
Bách Khoa, pp.11-86, Hà Nội 
Appendix 
A. Text for prosody corpus 
Code          Role    Sentence
1_As_In_A     A       Bên ta theo địch đến tận căn cứ
1_As_Mid_A    A       Bên địch bị bên ta theo đến tận căn cứ
1_As_Fi_A     A       Bên địch bị đánh bật khỏi căn cứ bên ta
              Context Kết thúc trận đánh, thủ trưởng hỏi:
1_Int_In_A    A       Này cậu. Bên ta theo địch đến tận căn cứ à?
1_As_In_B     B       Vâng. Bên ta theo bên địch đến tận căn cứ.
1_Int_Mid_A   A       Này cậu. Bên địch bị bên ta theo đến tận căn cứ à?
1_As_Mid_B    B       Đúng ạ. Bên địch bị bên ta theo đến tận căn cứ.
1_Int_Fi_A    A       Này cậu. Bên địch đã bị đánh bật khỏi căn cứ bên ta?
1_As_Fi_B     B       Đúng ạ. Bên địch bị đánh bật khỏi căn cứ bên ta.
              Context Trong trận đánh, cấp dưới hỏi thủ trưởng
1_Int_In_B    B       Bên ta theo anh có cần bám theo địch không?
1_Imp_In_A    A       Bên ta theo địch ngay cho tôi!
1_Int_Mid_B   B       Địch rút thì bên ta theo có được không anh?
1_Imp_Mid_A   A       À. Thế thì bên ta theo ngay cho tôi!
              Context Trong trận đánh, thủ trưởng ra lệnh:
1_Imp_Fi_A    A       Gọi ngay quân cứu viện bên ta! Nhanh lên!

Code          Role    Sentence
2_As_In_A     A       Trên tà thêu một bông hoa màu đỏ.
2_As_Mid_A    A       Áo của chị trên tà thêu một bông hoa.
2_As_Fi_A     A       Áo của chị có thêu một bông hoa trên tà.
              Context Một chị đi đặt may áo dài. Thợ may hỏi:
2_Int_In_A    A       Trên tà thêu gì không hả chị?
2_As_In_B     B       À... Trên tà thêu một bông hoa màu đỏ.
2_Int_Mid_A   A       Áo chị trên tà thêu gì không thế?
2_As_Mid_B    B       À... Cái áo đấy trên tà thêu hoa màu đỏ.
2_Int_Fi_A    A       Áo chị thêu gì trên tà?
2_As_Fi_B     B       À. Cái áo đấy thêu một bông hoa đỏ trên tà.
2_Int_In_B    B       Trên tà thêu gì không hả chị?
2_Imp_In_A    A       Giống cái kia kìa. Trên tà thêu y như thế cho chị!
2_Int_Mid_B   B       Áo chị đặt trên tà thêu gì không chị?
2_Imp_Mid_A   A       Giống cái lần trước ý. Cái áo này trên tà thêu cũng thế nhé em!
2_Int_Fi_B    B       Áo chị đặt thêu gì trên tà?
2_Imp_Fi_A    A       Giống lần trước. Cứ thêu cho chị một bông hoa trên tà!

Code          Role    Sentence
3_As_In_A     A       Trên tã thêu hình một quả bóng.
3_As_Mid_A    A       Chị nhìn thấy trên tã thêu hình quả bóng.
3_As_Fi_A     A       Chị nhìn thấy hình một quả bóng được thêu trên tã.
              Context Chồng và vợ nói về cái tã mới của em bé:
3_Int_In_A    A       Trên tã thêu hình gì thế em?
3_As_In_B     B       À. Trên tã thêu hình một quả bóng anh ạ.
3_Int_Mid_A   A       Em có thấy trên tã thêu hình gì không?
3_As_Mid_B    B       À. Em thấy trên tã thêu hình quả bóng anh ạ.
3_Int_Fi_A    A       Có cả hình quả bóng trên tã?
3_As_Fi_B     B       Vâng. Em thấy có hình một quả bóng trên tã.
              Context Vợ đang thêu tã cho con, hỏi chồng
3_Int_In_B    B       Trên tã thêu hình gì đây hả anh?
3_Imp_In_A    A       Con nó thích bóng. Trên tã thêu hình quả bóng đi!
3_Int_Mid_B   B       Không biết trên tã thêu cái gì hả anh?
3_Imp_Mid_A   A       Con có vẻ thích bóng. Theo anh trên tã thêu bóng đi em!
3_Int_Fi_B    B       Anh ơi! Thế mình thêu gì trên tã?
3_Imp_Fi_A    A       Con thích bóng. Thế thì em thêu cho con một quả bóng trên tã!

Code          Role    Sentence
4_As_In_A     A       Bên tả theo khuynh hướng bảo thủ.
4_As_Mid_A    A       Đảng chính trị bên tả theo khuynh hướng bảo thủ.
4_As_Fi_A     A       Bảo thủ là khuynh hướng của đảng chính trị bên tả.
              Context Hai người bàn về chính trị
4_Int_In_A    A       Bên tả theo khuynh hướng nào cậu nhỉ?
4_As_In_B     B       À. Bên tả theo khuynh hướng bảo thủ.
4_Int_Mid_A   A       Đảng chính trị bên tả theo khuynh hướng nào thế?
4_As_Mid_B    B       À. Đảng chính trị bên tả theo khuynh hướng bảo thủ.
4_Int_Fi_A    A       Anh ủng hộ đảng chính trị bên tả?
4_As_Fi_B     B       À vâng. Tôi ủng hộ đảng bên tả.
              Context Trong buổi tổng duyệt diễu hành. Người chỉ huy hô:
4_Imp_In_A    A       Bên tả theo ngay sau đội trống! Nhanh lên!
4_Imp_Mid_A   A       Chú ý. Toàn đội bên tả theo ngay sau đội trống!
4_Imp_Fi_A    A       Tất cả theo sau đội bên tả! Nhanh lên.

Code          Role    Sentence
5_As_In_A     A       Đợt phong cấp lần này, lên tá theo anh cũng chẳng khó khăn gì.
5_As_Mid_A    A       Anh ta được thăng lên tá theo cách nào không ai biết.
5_As_Fi_A     A       Cuối cùng thì anh cũng được thăng lên tá.
              Context Trong cuộc họp bàn về phong cấp trong một đơn vị quân đội:
5_Int_In_A    A       Đợt phong cấp lần này, lên tá theo anh có khó lắm không?
5_As_In_B     B       À. Lên tá theo tôi cũng chẳng khó khăn gì.
              Context Thủ trưởng hỏi về một trường hợp vừa được phong lên cấp tá
5_Int_Mid_A   A       Anh ta lên tá theo quyết định nào thế?
5_As_Mid_B    B       À. Anh ta được lên tá theo cách nào chẳng ai biết.
5_Int_Fi_A    A       Sao anh ta lại được lên tá?
5_As_Fi_B     B       Thưa anh, cũng không ai biết sao anh ta lại được lên tá.
              Context Cấp dưới hỏi thủ trưởng
5_Int_In_B    B       Còn đồng chí Nam, lên tá theo anh có nên không?
5_Imp_In_A    A       Nên chứ. Lên tá theo mọi người luôn đi!
5_Int_Mid_B   B       Đồng chí Nam mà phong lên tá theo mọi người có được không?
5_Imp_Mid_A   A       Được. Phong lên tá theo họ luôn đi!
5_Int_Fi_B    B       Anh không ủng hộ việc đồng chí đó sẽ được phong lên tá?
5_Imp_Fi_A    A       Đúng. Các anh đừng có phong anh ta lên tá!

Code          Role    Sentence
6_As_In_A     A       Lên tạ theo đúng hướng dẫn là cách tập luyện rất tốt.
6_As_Mid_A    A       Cách tốt nhất là tập lên tạ theo đúng hướng dẫn.
6_As_Fi_A     A       Tất cả các vận động viên đều phải tập lên tạ.
              Context Vận động viên hỏi huấn luyện viên về tập lên tạ:
6_Int_In_A    A       Lên tạ theo theo anh có tốt không?
6_As_In_B     B       Có chứ. Lên tạ theo đúng hướng dẫn là cách tập luyện rất tốt.
6_Int_Mid_A   A       Theo anh thì tập lên tạ theo hướng dẫn có tốt không?
6_As_Mid_B    B       Có chứ. Cách tốt nhất là tập lên tạ theo đúng hướng dẫn.
6_Int_Fi_A    A       Liệu em có phải tập lên tạ?
6_As_Fi_B     B       Có. Mọi vận động viên đều phải tập lên tạ.
6_Int_In_B    B       Lên tạ theo cách nào hả anh?
6_Imp_In_A    A       À. Lên tạ theo đúng hướng dẫn cho tôi!
6_Int_Mid_B   B       Thế em có phải tập lên tạ bây giờ không?
6_Imp_Mid_A   A       Có. Cậu tập lên tạ theo tôi ngay!
6_Int_Fi_B    B       Thế em phải tập cả lên tạ?
6_Imp_Fi_A    A       Chứ còn gì nữa. Cứ theo hướng dẫn lên tạ!

Code          Role    Sentence
5b_As_In_A    A       Trong cuộc thi của nhà nông, bên tát theo cách mới đã giành thắng lợi.
5b_As_Mid_A   A       Chiến thắng thuộc về bên tát theo cách mới.
5b_As_Fi_A    A       Cuối cùng, phần thắng đã thuộc về bên tát.
              Context Trong một cuộc thi của nhà nông. Hai khán giả hỏi nhau:
5b_Int_In_A   A       Bên tát theo cách mới là bên nào thế anh?
5b_As_In_B    B       À. Bên tát theo cách mới là bên mặc áo đỏ.
5b_Int_Mid_A  A       Bên nào là bên tát theo cách mới hả anh?
5b_As_Mid_B   B       À. Bên đỏ là bên tát theo cách mới.
5b_Int_Fi_A   A       Trong lần thi này, liệu phần thắng có thuộc về bên tát?
5b_As_Fi_B    B       Có. Tôi nghĩ phần thắng sẽ thuộc về bên tát.
              Context Hai anh em đang ở dưới ruộng, nói chuyện với nhau:
5b_Int_In_B   B       Ơ bố mẹ đang tát nước kìa. Lên tát theo bố mẹ không anh?
5b_Imp_In_A   A       Có. Lên tát theo bố mẹ đi!
5b_Int_Mid_B  B       Ơ, bố mẹ đang tát nước trên kia kìa. Mình có lên tát theo bố mẹ không?
5b_Imp_Mid_A  A       Có, dưới này cũng làm xong rồi. Mình lên tát theo bố mẹ đi!
5b_Int_Fi_B   B       Bố mẹ tát nước trên kia kìa. Hay là anh em mình cũng lên tát?
5b_Imp_Fi_A   A       Ừ, dưới này cũng xong rồi. Nào anh em mình cùng lên tát!

Code          Role    Sentence
6b_As_In_A    A       Trong cuộc thi điêu khắc, bên tạc theo cách mới đã hoàn thành bức tượng sớm hơn.
6b_As_Mid_A   A       Chiến thắng thuộc về bên tạc theo cách mới.
6b_As_Fi_A    A       Cuộc thi làm tượng nhanh diễn ra giữa bên đúc và bên tạc.
              Context Trong một cuộc thi điêu khắc, hai khán giả trò chuyện với nhau:
6b_Int_In_A   A       Bên tạc theo cách mới là bên nào thế anh?
6b_As_In_B    B       À. Bên tạc theo cách mới là bên áo đỏ.
6b_Int_Mid_A  A       Bên nào là bên tạc theo cách mới thế?
6b_As_Mid_B   B       À. Bên áo đỏ là bên tạc theo cách mới.
6b_Int_Fi_A   A       Theo anh, liệu phần thắng có thuộc về bên tạc?
6b_As_Fi_B    B       Có. Tôi nghĩ phần thắng sẽ thuộc về bên tạc.
              Context Trong một xưởng điêu khắc, thợ tạc hỏi người thợ cả:
6b_Int_In_B   B       Có hai mẫu mới và cũ. Tạc theo mẫu nào hả anh?
6b_Imp_In_A   A       À. Tạc theo mẫu mới cho tôi!
6b_Int_Mid_B  B       Anh ơi! Lô tượng bên trên tạc theo mẫu nào hả anh?
6b_Imp_Mid_A  A       À. Lô bên trên tạc theo mẫu mới đi!
6b_Int_Fi_B   B       Vụ tạc tượng phật trên núi đã đủ thợ chưa hả anh? Liệu em có phải lên tạc?
6b_Imp_Fi_A   A       Có. Cả cậu cũng phải lên tạc!
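B. Datasheet of prosody patterns
Each pattern below is given as 20 rows of F0 (Hz), F0 (semitone) and intensity (dB), followed by the durations of the syllable and of its voiced part. The patterns are listed by sentence type, then by voice, then by position of the syllable in the sentence (initial, middle, final).

Assertive sentence, male voice, initial part
F0 (Hz)  F0 (semitone)  Intensity (dB)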
162.65 8.23 70.99 
163.03 8.27 71.53 
162.96 8.25 71.82 
162.76 8.23 71.92 
162.37 8.19 71.87 
162.07 8.15 71.70 
161.80 8.13 71.48 
161.84 8.14 71.25 
161.78 8.13 71.04 
161.42 8.09 70.84 
161.55 8.11 70.62 
161.49 8.10 70.33 
161.27 8.08 69.94 
161.07 8.06 69.40 
160.69 8.03 68.65 
160.42 8.00 67.65 
160.02 7.96 66.34 
159.48 7.91 64.73 
158.63 7.82 62.83 
157.04 7.65 60.75 
Duration (ms): syllable 149, voiced part 99

Assertive sentence, male voice, middle part
F0 (Hz)  F0 (semitone)  Intensity (dB)
180.21 9.54 69.90 
180.42 9.52 70.80 
178.30 9.30 71.26 
177.88 9.27 71.43 
177.40 9.23 71.35 
177.01 9.20 71.07 
176.65 9.16 70.70 
176.48 9.15 70.34 
176.58 9.15 70.04 
176.85 9.16 69.77 
177.00 9.16 69.46 
176.94 9.14 69.05 
176.84 9.11 68.53 
176.89 9.10 67.91 
176.78 9.09 67.15 
176.53 9.07 66.18 
174.97 8.97 65.04 
172.82 8.81 63.64 
171.33 8.68 61.75 
169.78 8.54 59.26 
Duration (ms): syllable 177, voiced part 121

Assertive sentence, male voice, final part
F0 (Hz)  F0 (semitone)  Intensity (dB)
153.85 7.28 71.28 
153.42 7.23 71.94 
153.01 7.17 72.14 
152.51 7.11 72.12 
152.42 7.09 71.99 
152.38 7.08 71.84 
152.45 7.09 71.66 
152.22 7.06 71.37 
151.97 7.03 71.01 
151.27 6.96 70.61 
150.76 6.90 70.14 
150.17 6.84 69.45 
149.58 6.77 68.36 
149.10 6.73 66.81 
149.37 6.80 65.00 
147.78 6.62 63.27 
146.89 6.52 61.54 
145.75 6.40 59.80 
146.23 6.45 57.75 
146.34 6.47 55.54 
Duration (ms): syllable 320, voiced part 210

Assertive sentence, female voice, initial part
F0 (Hz)  F0 (semitone)  Intensity (dB)
295.52 18.65 69.11 
292.33 18.47 69.54 
289.19 18.28 69.77 
287.26 18.16 69.89 
285.73 18.07 69.93 
284.65 18.01 69.92 
283.93 17.97 69.85 
283.23 17.92 69.72 
282.77 17.90 69.51 
282.00 17.85 69.22 
281.85 17.84 68.85 
282.13 17.85 68.48 
282.72 17.89 68.13 
283.01 17.90 67.75 
283.29 17.92 67.25 
283.33 17.92 66.54 
283.32 17.92 65.53 
282.34 17.85 64.11 
281.63 17.81 62.21 
278.94 17.65 59.87 
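Duration (ms): syllable 168, voiced part (F0) 100

Assertive sentence, female voice, middle part
F0 (Hz)  F0 (semitone)  Intensity (dB)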
264.90 16.55 70.87 
261.16 16.32 71.71 
258.42 16.14 72.10 
257.24 16.07 72.21 
256.46 16.02 72.14 
255.35 15.95 71.98 
254.52 15.89 71.78 
253.64 15.83 71.53 
253.43 15.82 71.24 
253.37 15.82 70.96 
253.47 15.82 70.71 
254.04 15.86 70.43 
254.42 15.89 70.02 
254.88 15.92 69.41 
255.35 15.94 68.53 
255.78 15.97 67.29 
255.10 15.90 65.62 
254.54 15.85 63.47 
252.85 15.74 60.98 
249.09 15.47 58.81 
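Duration (ms): syllable 196, voiced part (F0) 126

Assertive sentence, female voice, final part
F0 (Hz)  F0 (semitone)  Intensity (dB)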
240.01 14.65 70.26 
236.88 14.42 72.06 
235.57 14.32 72.57 
235.05 14.28 72.53 
235.11 14.28 72.32 
235.47 14.31 72.10 
235.45 14.32 71.83 
235.21 14.30 71.37 
234.78 14.27 70.83 
233.65 14.20 70.27 
232.86 14.13 69.62 
232.18 14.09 68.81 
231.20 14.02 67.77 
229.98 13.93 66.65 
229.68 13.91 65.43 
229.24 13.88 64.07 
228.43 13.82 62.26 
225.83 13.63 60.17 
224.13 13.51 57.80 
225.79 13.63 55.54 
Duration (ms): syllable 361, voiced part (F0) 229

Interrogative sentence, male voice, initial part
F0 (Hz)  F0 (semitone)  Intensity (dB)
181.86 10.18 73.56 
180.69 10.06 74.16 
179.54 9.93 74.50 
179.11 9.89 74.69 
178.00 9.78 74.75 
177.18 9.69 74.69 
176.87 9.65 74.51 
176.28 9.59 74.25 
175.96 9.55 73.92 
175.81 9.53 73.55 
175.40 9.48 73.12 
175.06 9.45 72.63 
174.67 9.41 72.05 
174.38 9.38 71.34 
174.17 9.36 70.47 
173.86 9.33 69.36 
173.25 9.26 67.95 
171.56 9.10 66.20 
170.71 9.00 64.18 
169.21 8.85 62.07 
Duration (ms): syllable 132, voiced part 86

Interrogative sentence, male voice, middle part
F0 (Hz)  F0 (semitone)  Intensity (dB)
172.43 9.23 72.50 
172.60 9.26 73.09 
171.89 9.16 73.25 
170.61 9.00 73.25 
170.23 8.96 73.17 
169.71 8.91 73.00 
169.35 8.87 72.76 
169.06 8.85 72.52 
168.96 8.84 72.29 
168.82 8.82 72.07 
168.66 8.80 71.82 
168.54 8.79 71.52 
168.47 8.78 71.08 
168.31 8.77 70.46 
167.87 8.72 69.66 
167.13 8.66 68.66 
166.43 8.59 67.33 
166.45 8.59 65.51 
165.36 8.47 63.02 
163.67 8.29 59.91 
Duration (ms): syllable 185, voiced part 127

Interrogative sentence, male voice, final part
F0 (Hz)  F0 (semitone)  Intensity (dB)
169.79 8.85 71.48 
168.78 8.78 72.34 
167.25 8.63 72.38 
166.22 8.53 71.95 
165.54 8.47 71.32 
165.42 8.46 70.92 
165.44 8.47 70.64 
165.22 8.45 70.34 
165.00 8.43 70.14 
164.76 8.41 69.99 
164.93 8.42 69.74 
165.48 8.47 69.22 
165.51 8.47 68.35 
166.50 8.56 67.23 
166.36 8.53 66.13 
166.20 8.51 65.13 
166.44 8.52 64.03 
164.40 8.25 62.55 
164.01 8.20 60.67 
164.35 8.24 58.35 
Duration (ms): syllable 281, voiced part 187

Interrogative sentence, female voice, initial part
F0 (Hz)  F0 (semitone)  Intensity (dB)
312.38 19.63 72.34 
307.61 19.39 73.23 
305.55 19.27 73.80 
302.99 19.14 74.14 
301.12 19.03 74.34 
299.87 18.96 74.42 
298.81 18.90 74.44 
298.09 18.86 74.36 
297.63 18.83 74.20 
297.28 18.81 73.94 
297.09 18.80 73.57 
297.07 18.80 73.10 
297.01 18.80 72.51 
297.24 18.81 71.80 
297.68 18.84 70.93 
298.50 18.88 69.83 
298.50 18.88 68.42 
297.57 18.83 66.62 
296.96 18.79 64.38 
294.52 18.66 61.75 
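Duration (ms): syllable 145, voiced part (F0) 87

Interrogative sentence, female voice, middle part
F0 (Hz)  F0 (semitone)  Intensity (dB)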
298.25 18.82 70.57 
294.71 18.63 71.83 
292.32 18.49 72.73 
290.69 18.40 73.36 
289.01 18.29 73.78 
287.49 18.21 74.01 
286.66 18.16 74.09 
285.86 18.11 74.06 
285.43 18.09 73.93 
285.56 18.09 73.72 
285.83 18.11 73.42 
286.29 18.14 73.04 
286.90 18.18 72.59 
287.53 18.22 72.07 
288.06 18.25 71.40 
288.71 18.29 70.46 
289.32 18.33 69.14 
289.54 18.34 67.35 
288.86 18.30 65.07 
286.81 18.18 62.32 
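Duration (ms): syllable 168, voiced part (F0) 106

Interrogative sentence, female voice, final part
F0 (Hz)  F0 (semitone)  Intensity (dB)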
291.53 18.42 70.55 
286.30 18.12 72.38 
283.53 17.95 73.19 
281.86 17.86 73.44 
281.04 17.81 73.45 
280.33 17.77 73.27 
280.58 17.79 73.00 
280.23 17.77 72.61 
280.17 17.76 72.13 
280.29 17.77 71.63 
281.01 17.80 71.08 
282.34 17.88 70.35 
283.69 17.96 69.49 
285.85 18.08 68.61 
288.17 18.21 67.63 
288.25 18.22 66.31 
291.93 18.42 64.50 
291.64 18.40 62.22 
292.02 18.41 59.60 
290.68 18.33 57.14 
Duration (ms): syllable 281, voiced part (F0) 199

Imperative sentence, male voice, initial part
F0 (Hz)  F0 (semitone)  Intensity (dB)
178.07 9.72 74.22 
177.27 9.64 74.47 
176.55 9.58 74.61 
175.76 9.50 74.67 
174.98 9.42 74.69 
174.36 9.35 74.69 
174.08 9.33 74.65 
173.94 9.32 74.55 
173.88 9.32 74.38 
173.76 9.31 74.12 
173.48 9.28 73.77 
173.27 9.26 73.36 
173.16 9.24 72.90 
173.06 9.23 72.36 
172.97 9.23 71.72 
172.81 9.21 70.97 
172.68 9.20 70.08 
172.54 9.18 68.98 
172.31 9.16 67.64 
171.42 9.07 66.00 
Duration (ms): syllable 129, voiced part 75

Imperative sentence, male voice, middle part
F0 (Hz)  F0 (semitone)  Intensity (dB)
166.28 8.52 72.52 
165.50 8.43 73.52 
165.40 8.41 74.14 
164.87 8.35 74.47 
164.13 8.26 74.65 
162.93 8.11 74.74 
161.65 7.95 74.76 
161.41 7.94 74.68 
162.59 8.14 74.53 
162.85 8.20 74.28 
162.68 8.18 73.97 
162.43 8.15 73.59 
162.09 8.11 73.15 
162.02 8.10 72.63 
161.85 8.09 71.97 
161.67 8.07 71.11 
161.44 8.04 69.96 
161.17 8.01 68.41 
160.71 7.96 66.46 
159.67 7.86 64.15 
Duration (ms): syllable 142, voiced part 96

Imperative sentence, male voice, final part
F0 (Hz)  F0 (semitone)  Intensity (dB)
178.57 9.71 75.21 
178.89 9.77 77.01 
178.16 9.70 77.78 
177.91 9.67 77.91 
177.80 9.66 77.65 
177.57 9.63 77.28 
177.27 9.60 76.86 
177.12 9.59 76.38 
177.00 9.59 75.80 
176.24 9.51 75.05 
174.95 9.39 74.21 
173.77 9.28 73.36 
172.74 9.19 72.41 
171.43 9.07 71.22 
169.59 8.89 69.79 
167.57 8.68 68.28 
166.24 8.55 66.72 
165.10 8.43 65.00 
162.74 8.15 62.88 
162.07 8.04 60.61 
Duration (ms): syllable 291, voiced part 195

Imperative sentence, female voice, initial part
F0 (Hz)  F0 (semitone)  Intensity (dB)
269.43 16.50 71.84 
266.25 16.34 72.66 
263.43 16.18 73.17 
261.32 16.05 73.44 
259.70 15.95 73.54 
258.46 15.87 73.50 
257.49 15.81 73.37 
256.97 15.78 73.16 
256.61 15.75 72.89 
256.85 15.77 72.54 
257.22 15.79 72.11 
257.56 15.81 71.62 
258.24 15.85 71.05 
258.67 15.88 70.38 
259.16 15.90 69.54 
259.81 15.95 68.46 
259.58 15.93 67.05 
259.13 15.90 65.35 
254.95 15.66 63.37 
250.14 15.36 61.20 
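Duration (ms): syllable 154, voiced part (F0) 92

Imperative sentence, female voice, middle part
F0 (Hz)  F0 (semitone)  Intensity (dB)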
293.55 18.56 71.79 
289.61 18.34 72.91 
287.29 18.21 73.65 
285.88 18.12 74.12 
284.37 18.03 74.35 
283.76 17.99 74.36 
282.70 17.93 74.22 
281.91 17.88 74.01 
281.38 17.84 73.75 
281.21 17.83 73.46 
281.18 17.83 73.15 
281.21 17.82 72.81 
280.92 17.80 72.39 
281.13 17.81 71.83 
281.27 17.82 71.10 
281.49 17.83 70.16 
281.07 17.80 68.92 
281.60 17.83 67.31 
280.33 17.75 65.32 
280.52 17.76 63.02 
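Duration (ms): syllable 182, voiced part (F0) 110

Imperative sentence, female voice, final part
F0 (Hz)  F0 (semitone)  Intensity (dB)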
311.46 19.61 72.53 
306.69 19.35 74.05 
304.61 19.23 74.48 
303.23 19.16 74.62 
302.40 19.11 74.50 
302.06 19.09 74.19 
301.84 19.08 73.80 
301.71 19.08 73.49 
301.29 19.05 73.20 
300.63 19.01 72.74 
299.39 18.94 72.22 
298.05 18.86 71.69 
297.06 18.81 71.19 
295.21 18.70 70.59 
295.00 18.69 69.79 
293.48 18.60 68.61 
294.19 18.65 66.97 
292.11 18.52 64.72 
285.62 18.12 61.72 
282.03 17.91 58.59 
Duration (ms): syllable 317, voiced part (F0) 205
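
To use one of the patterns of this appendix in a synthesizer, its 20 F0 points have to be placed on a time axis. The following Python sketch does this by linear interpolation, under the assumption that the 20 points are equally spaced over the voiced part of the syllable. This spacing, the 10 ms frame step and the function name `resample_contour` are assumptions made for illustration and do not describe the synthesis system used in the thesis. The F0 values and the 99 ms voiced duration are those of the first pattern of this appendix.

```python
# F0 column (Hz) of the first pattern in Appendix B; its voiced part lasts 99 ms.
F0_HZ = [162.65, 163.03, 162.96, 162.76, 162.37, 162.07, 161.80, 161.84, 161.78, 161.42,
         161.55, 161.49, 161.27, 161.07, 160.69, 160.42, 160.02, 159.48, 158.63, 157.04]


def resample_contour(points, voiced_ms, step_ms=10.0):
    """Linearly interpolate equally spaced pattern points onto a fixed time step.

    Returns a list of (time in ms, value) pairs starting at 0 ms.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    spacing = voiced_ms / (len(points) - 1)  # assumed distance between pattern points
    frames = []
    t = 0.0
    while t <= voiced_ms + 1e-9:
        pos = min(t / spacing, len(points) - 1)
        i = min(int(pos), len(points) - 2)   # index of the left neighbour
        frac = pos - i                       # relative position between the two neighbours
        frames.append((t, points[i] + frac * (points[i + 1] - points[i])))
        t += step_ms
    return frames


if __name__ == "__main__":
    for time_ms, f0 in resample_contour(F0_HZ, 99.0):
        print(f"{time_ms:5.1f} ms  {f0:7.2f} Hz")
```

The same function can be applied to the intensity column; the unvoiced remainder of the syllable would need separate handling.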