How exactly does human speech transmit multiple layers of communicative meanings through an articulation process? This is the central concern of my research. To address this issue, some fundamental questions need to be answered: What are the kinds of meanings transmitted by speech? What are the encoding mechanisms? What are the decoding mechanisms? Since it is impossible to answer these questions all at once, a realistic strategy is to divide and conquer, that is, to always prioritize the kinds of questions for which other things are relatively well established.
My research priority has been based on the following understanding of the state of the art in speech science:
- With regard to encoding and decoding mechanisms, the static aspects of speech sounds, whether in terms of acoustic patterns or articulatory correlates, are relatively well established. What remain unclear, even to this day, are the basic dynamic mechanisms of speech production and their processing in perception.
- With regard to meanings, lexical meanings are the most easily established; all other meanings are up for grabs.
My early work therefore focused on Mandarin tones in continuous speech. The functional meaning of tone is clear: to distinguish morphemes that are otherwise identical in terms of CV structure. The canonical forms of Mandarin tones had also been well established previously. What my work further established are the basic patterns of contextual tonal variation (Xu, 1993, 1994, 1997, 1998, 2001a). This led to the Target Approximation (TA) model of tone production (Xu & Wang, 2001). The TA model was then applied to the intonation of both Mandarin and English (Xu, 1999; Xu & Xu, 2005). The success of these applications led to further expansion of the approach in a number of new directions.
- The Parallel Encoding and Target Approximation (PENTA) model -- A generalized intonation model with TA at its core and a conceptual scheme that allows parallel encoding of multiple layers of communicative meanings (Xu, 2005).
- The qTA model -- A computational realization of PENTA (Prom-on, Xu & Thipakorn, 2009), now implemented as a Praat script: PENTAtrainer.
- The Time Structure model of the syllable -- A new conceptualization of the syllable as the basic temporal organization structure that assigns time intervals to both segmental and laryngeal units. It offers a drastically different view on issues such as the nature of coarticulation, coarticulation resistance, locus equations, the time intervals of segments and the temporal alignment of segmental and tonal events (Xu & Liu, 2006). It also makes it possible to fully integrate segmental and suprasegmental aspects of speech, treating them as following the same basic articulatory dynamics (Xu, 2007a; Xu & Liu, in press).
- The perceptual learnability of the dynamic output of TA, as demonstrated by unsupervised-learning simulations of tone acquisition using self-organizing maps (SOMs) (Gauthier, Shi & Xu, 2007a, 2007b). These findings also suggest that perceptual learning does not need to involve distinctive features.
- The Near Ceiling Performance (NCP) Hypothesis -- Speech is produced near an overall performance ceiling in terms of articulatory effort (Xu & Sun, 2002; Xu, 2007b; Xu & Wang, 2009; Cheng & Xu, 2009). This view differs from the widely accepted principle of economy of effort (Lindblom, 1990).
- The Bio-informational Dimensions (BID) theory of vocal expression of emotions (Xu, Kelly & Smillie, in press). Emotional and attitudinal meanings are vocally expressed by simultaneously manipulating a number of bio-informational dimensions -- size projection, dynamicity, audibility and association. At least the first two dimensions have been found to be highly relevant for a number of emotions (Chuenwattanapranithi et al., 2008; Xu & Kelly, 2010; Xu, Kelly & Smillie, in press).
- The Single Origin of PFC hypothesis -- The use of post-focus compression (PFC) as a prosodic marker of focus is likely to have a single historical origin, possibly the hypothetical proto-Nostratic language (Xu, Chen & Wang, in press; Xu, 2011).
- Post-focus compression (PFC) is absent in Taiwanese, Taiwan Mandarin and Cantonese (Chen, Wang & Xu, 2009; Wu & Xu, 2010).
- Syllable organization is done through direct adjustment of syllable duration without mediation of stress or prominence (Xu & Wang, 2009).
- Segmental and tonal elements all start about 26-48 ms earlier than conventional segmentation would indicate (Xu & Liu, 2006, 2007).
- The maximum speed of pitch change is often approached in speech, which is likely the source of many observed F0 contour and alignment patterns (Xu & Sun, 2002).
- Post-low bouncing -- F0 after a Low tone bounces back in the subsequent syllables, especially if they carry the neutral tone (Chen & Xu, 2006).
- The neutral tone is not toneless. Rather, it is likely to have a [mid] pitch target accompanied by weak articulatory strength (Chen & Xu, 2006).
- F0 peak delay is closely related to the interaction of tonal targets and articulatory constraints (Xu, 2001a).
- Focus in Mandarin consists of both on-focus pitch range expansion and post-focus
pitch range compression, and final focus is very similar to broad/neutral
focus (Xu, 1999).
- Tonal targets are synchronously implemented with the entire syllable
rather than with only nucleus vowel or syllable rime (Xu, 1998).
- Contextual tonal variations are robustly asymmetrical: Carryover
effects are strong and assimilatory, whereas anticipatory effects are
weak and largely dissimilatory (Xu, 1993, 1994, 1997, 1999).
- Lexical tones in Mandarin that are distorted due to articulatory constraints are still perceptually identifiable. Therefore, no categorical change is likely to have taken place (Xu, 1993, 1994).
- Perceptual compensation for contextual tonal variations is not complete (Xu, 1993, 1994).
- Mandarin tone 3 (the Low tone) sandhi occurs during a short-term memory task, indicating the phonetic nature of working memory (Xu, 1991).
- In a Mandarin syllable with a final nasal, the duration of the nucleus vowel is inversely related to vowel height: the higher the vowel, the shorter the duration. This duration variation is compensated for by the duration of the nasal murmur: the shorter the vowel, the longer the nasal murmur. Thus in a syllable with a low vowel such as /bang/, there is often hardly any nasal murmur, whereas in a syllable with a high vowel such as /bing/, the nasal murmur can be longer than the vowel (Xu, 1986).
- Mandarin final nasals are realized as nasalization on the preceding nucleus vowel with no nasal murmur if the following syllable begins with a vowel or a glide (Xu, 1986).
- In a Mandarin disyllabic word or phrase, the initial consonant position in syllable 2 is much less "consonantal" than the initial consonant position in syllable 1: the consonant is shorter, a voiceless consonant is more likely to become voiced, a stop or affricate is more likely to lose its closure and become a fricative, and a fricative is more likely to lose its frication (Xu, 1986).
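
The target approximation idea behind the TA and qTA models listed above can be illustrated with a short sketch. It assumes the commonly cited qTA formulation, in which F0 within each syllable approaches a linear pitch target m*t + b as a third-order critically damped system, and the final F0 value, velocity and acceleration of one syllable are carried over as the initial state of the next. The function names and all parameter values below are hypothetical, chosen only for illustration; this is not the PENTAtrainer implementation.

```python
import math

def coefficients(state, m, b, lam):
    """Transient coefficients from the initial state (f0, velocity, acceleration)."""
    f0, v0, a0 = state
    c1 = f0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0
    return c1, c2, c3

def qta_state(t, m, b, lam, c):
    """F0, velocity and acceleration at time t after syllable onset."""
    c1, c2, c3 = c
    e = math.exp(-lam * t)
    p = c1 + c2 * t + c3 * t * t      # polynomial part of the transient
    dp = c2 + 2.0 * c3 * t            # its first derivative
    ddp = 2.0 * c3                    # its second derivative
    f = m * t + b + p * e
    v = m + (dp - lam * p) * e
    a = (ddp - 2.0 * lam * dp + lam * lam * p) * e
    return f, v, a

def synthesize(targets, f0_start, step=0.005):
    """targets: list of (slope m, height b, rate lam, duration in s), one per syllable.

    Returns a list of (time, f0) samples. The state at the end of each
    syllable becomes the initial state of the next (carryover).
    """
    state = (f0_start, 0.0, 0.0)      # assumed: start at rest
    contour = []
    offset = 0.0
    for m, b, lam, dur in targets:
        c = coefficients(state, m, b, lam)
        for i in range(int(round(dur / step))):
            t = i * step
            contour.append((offset + t, qta_state(t, m, b, lam, c)[0]))
        state = qta_state(dur, m, b, lam, c)
        offset += dur
    return contour

# Hypothetical example: a high static target followed by a low static target,
# in arbitrary pitch units (e.g. semitones).
targets = [(0.0, 95.0, 40.0, 0.2), (0.0, 85.0, 40.0, 0.2)]
contour = synthesize(targets, f0_start=90.0)
```

In this sketch the contour stays continuous across the syllable boundary, and the first part of the second syllable still reflects the preceding high target. This is one way to picture the strong, assimilatory carryover effects noted in the findings above.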