Chapter 2 Human Speech Production and Perception
2.1 Human Speech Production
Speech signals are composed of a sequence of sounds. These sounds and the transitions between them serve as a symbolic representation of information. The arrangement of these sounds (symbols) is governed by the rules of the language. The study of these rules belongs to linguistics, and the study and classification of the sounds of speech is called phonetics. The purpose of processing speech signals is to enhance or extract information, which helps provide as much knowledge as possible about the signal's structure, i.e., about the way in which information is encoded in the signal.
When air flows from the lungs through the glottis, the vibration of the vocal cords produces pitch harmonics. The rate at which the vocal folds vibrate is the pitch frequency. For example, when the vocal folds oscillate 300 times per second, they are said to produce a pitch of 300 Hz. Other useful features include voiced/unvoiced information, short-term energy, and the zero-crossing rate.
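Short-term energy and zero-crossing rate can be illustrated with a minimal sketch. The sampling rate, frame length, hop size, and the synthetic 300 Hz tone below are assumptions chosen for illustration, and the function names are hypothetical rather than taken from any particular toolkit.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Assumed test signal: a 300 Hz sine sampled at 8 kHz for 0.1 s.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 300 * n / SAMPLE_RATE) for n in range(800)]

# Assumed analysis settings: 30 ms frames (240 samples), 10 ms hop (80 samples).
frames = frame_signal(tone, frame_len=240, hop=80)
energies = [short_term_energy(f) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]
```

For a pure tone the zero-crossing rate is close to twice the tone frequency divided by the sampling rate, which is why a low rate together with high energy is often taken as a hint that a frame is voiced.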
Speech comes so naturally to us that we do not realize how complex a phenomenon it is. When humans speak, air passes from the lungs through the mouth and nasal cavity, and this air stream is restricted and shaped depending on the position of the tongue, teeth, and lips. This produces compressions and expansions of the air: an acoustic wave, a sound. The sounds so formed are usually called phonemes. Phonemes are combined to form words [1].
These speech units are then mapped to a set of lip poses called visemes. Visemes are the visual counterparts of phonemes and can be interpolated to produce smooth facial animation. The shape of the mouth during speech depends not only on the phoneme currently being pronounced, but also on the phonemes that come before and after it. This phenomenon is called co-articulation. Co-articulation can affect up to 10 neighboring phonemes simultaneously.
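The mapping and interpolation idea can be sketched as follows. The phoneme-to-viseme table, the viseme names, and the two-parameter mouth poses are all hypothetical values invented for illustration; a real system would use a much larger table. The sketch blends only between consecutive poses, so it does not model co-articulation across several neighboring phonemes.

```python
# Hypothetical many-to-one phoneme-to-viseme table (illustrative only).
PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip_teeth", "v": "lip_teeth",
    "a": "open", "u": "rounded",
}

# Hypothetical mouth poses: (jaw_opening, lip_rounding), each in [0, 1].
VISEME_POSE = {
    "closed": (0.0, 0.0),
    "lip_teeth": (0.1, 0.0),
    "open": (1.0, 0.0),
    "rounded": (0.3, 1.0),
}

def interpolate(pose_a, pose_b, t):
    """Linear blend of two poses; t=0 gives pose_a, t=1 gives pose_b."""
    return tuple((1 - t) * a + t * b for a, b in zip(pose_a, pose_b))

def animate(phonemes, steps_per_transition=4):
    """Return key poses plus linearly interpolated in-between poses."""
    poses = [VISEME_POSE[PHONEME_TO_VISEME[p]] for p in phonemes]
    frames = [poses[0]]
    for start, end in zip(poses, poses[1:]):
        for k in range(1, steps_per_transition + 1):
            frames.append(interpolate(start, end, k / steps_per_transition))
    return frames

frames = animate(["m", "a", "u"])
```

Because several phonemes share one viseme in this table (for example "p", "b", and "m"), the mapping loses information, which is one reason lip shape alone does not identify a phoneme uniquely.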
Vowels are generally sounds produced with an open vocal tract, whereas consonants are produced with a constriction somewhere in the vocal tract. Vowels are the most sonorant, or intense, and the most audible sounds in speech. Vocal fold vibration is the sound source for vowels. The vocal tract above the glottis acts as an acoustic resonator that shapes the sound made by the vocal folds. The shape of this resonator determines the quality of the vowel: [i] versus [u] versus [a], for example.
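The resonator idea can be made concrete with the simplest textbook approximation, which is not part of the original text: a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / (4L). The tube length of 0.17 m and the speed of sound of 340 m/s are assumed round values, and the function name is hypothetical. Real vowels differ because the tract is not uniform, so this sketch only approximates a neutral, schwa-like configuration.

```python
def uniform_tube_resonances(length_m, speed_of_sound=340.0, count=3):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Uses the quarter-wavelength formula F_n = (2n - 1) * c / (4 * L).
    """
    return [(2 * n - 1) * speed_of_sound / (4 * length_m)
            for n in range(1, count + 1)]

# Assumed tract length of 0.17 m: resonances near 500, 1500, and 2500 Hz.
resonances = uniform_tube_resonances(0.17)

# A shorter assumed tract (0.14 m) shifts every resonance upward.
shorter = uniform_tube_resonances(0.14)
```

Changing the tube length moves all resonances together. Changing the shape of the tract, as the tongue and lips do for [i], [u], and [a], moves them relative to one another, which this uniform model cannot capture.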
The phonological loop consists of two components: the phonological store, which acts as an inner ear (not the physical ear canal), and the articulatory control process, which acts as an inner voice. The phonological store, which is linked to speech perception, holds information in speech-based form (for example, spoken words) for 1-2 seconds. Spoken words enter the store directly. Written words must first be converted into an articulatory (spoken) code before they can enter the phonological store. The articulatory control process, which is linked to speech production, works like an inner voice, rehearsing information from the phonological store.
Introduction: Speech is traditionally thought of as an exclusively auditory percept. However, when the face of the speaker is visible, information contained primarily in the movement of the lips contributes powerfully to our perception of speech. This interaction between the auditory and visual modalities improves our ability to interpret speech accurately, particularly at low signal-to-noise ratios (Bertelson, 2003). This multisensory integration provides a natural and important means of communication. The benefit of integrating audiovisual cues has been well documented in normally hearing individuals, especially in difficult listening conditions, and in listeners with hearing impairment (Sumby & Pollack, 1954). The benefit derived from speech reading can be substantial, allowing unintelligible speech to become comprehensible, and can even exceed the benefit derived from assistive listening devices, counseling, or training, especially for those with hearing impairment (Walden et al., 1981).
The sound of one's voice changes as the rate of vibration changes. As the number of vibrations per second increases, the pitch increases as well, meaning that the voice sounds higher.
- Projection
- Speaking style
2.3.2.2) Nonverbal Expression
Nonverbal expression includes those aspects of communication, such as gestures and facial expressions, that do not involve verbal communication. As defined in The Concise Corsini Encyclopedia of Psychology and Behavioral Science, these techniques involve the conscious and unconscious processes of encoding and decoding.
1.7 Speech analysis
One of the important characteristics of a speech waveform is the time-varying nature of the content of the speech pressure wave. Determining the time-varying parameters of speech is a key area of analysis in speech research. Another key area is the classification of speech waveform segments as voiced or voiceless (mixed excitation is usually considered voiced). As mentioned previously, when speech is voiced, the most important parameter is the fundamental frequency, f0. This section introduces these two areas of analysis and discusses the principles and limitations involved.
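Both tasks can be sketched with a short-time autocorrelation analysis. This is one common approach, not necessarily the method developed in this section. The search range of 60 to 400 Hz, the voicing threshold of 0.3, and the synthetic test signals are assumptions, and the function names are hypothetical.

```python
import math
import random

def autocorrelation(frame, lag):
    """Unnormalized autocorrelation of a frame at the given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0,
                voicing_threshold=0.3):
    """Return an f0 estimate in Hz, or None if the frame looks voiceless."""
    energy = autocorrelation(frame, 0)
    if energy == 0:
        return None  # silent frame
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    # The lag with the strongest self-similarity is the candidate pitch period.
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    # A weak peak relative to the frame energy is treated as voiceless.
    if autocorrelation(frame, best_lag) / energy < voicing_threshold:
        return None
    return sample_rate / best_lag

SAMPLE_RATE = 8000  # assumed sampling rate

# Voiced-like frame: 200 Hz fundamental plus its second harmonic.
voiced = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 400 * n / SAMPLE_RATE)
          for n in range(400)]

# Voiceless-like frame: white Gaussian noise with a fixed seed.
random.seed(0)
unvoiced = [random.gauss(0.0, 1.0) for _ in range(400)]

f0_voiced = estimate_f0(voiced, SAMPLE_RATE)
f0_unvoiced = estimate_f0(unvoiced, SAMPLE_RATE)
```

The frame must span at least two pitch periods for the peak to be reliable, and because speech parameters vary over time the analysis is repeated frame by frame. That trade-off between frame length and time resolution is one of the limitations of this kind of method.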
Phonetics is defined as the scientific description of speech sounds. It deals with the physiological processes involved in the production of sounds, i.e., it is concerned with the ways in which sounds are produced and the points at which they are articulated. Thus, phonetics can be considered a branch of the natural sciences (Daniel, 2011: 1). It is the objective study of the sounds of a language (Nasr, 1997: 5). Speech is produced by the movement of air through the vocal tract, which can be studied in three ways: "Articulatory Phonetics", which is the study of the way speech sounds are produced; "Acoustic Phonetics", which deals with the physical properties of speech sounds; and "Auditory Phonetics", which studies the perceptual