© 1997 Copyright on this material is held by the authors.
Paradoxically, the increasing popularity of GUIs for sighted users has been accompanied by a recent trend that runs counter to the very purpose of a GUI: the development and use of "displayless" interfaces. These interfaces offer voice access to applications in which a visual display cannot be used, such as telephone-based interactions, or in which the user's hands and eyes are busy with other tasks, such as piloting an aircraft. Although displayless interface technology introduces certain issues specific to spoken language recognition, it also presents underlying challenges similar to those of GUI access technology for users with visual impairments. Each must address the unique problems presented by nonvisual access to data, especially data which is either inherently spatial in nature, e.g., geographical maps, or textual data presented through a visuospatial display metaphor. While methods have been developed for dealing with the spatial presentation of textual data [4], the problem of accessing inherently spatial data without vision remains open. This dissertation research is based on the assumption that accessing spatial data strictly via speech interaction imposes a cognitive burden on the individual, regardless of visual capability. A selective survey of the literature, given in [5], reviews research originating from many viewpoints, including psychology, education, and human-computer interaction, that supports this assumption.
GUI access researchers argue, however, that voice input should be included as an option in providing nonvisual access to graphical and spatial data [2]. While speech alone may not provide optimal access to spatial data, when used with other input and output modalities it offers certain advantages; e.g., voice input frees the hands for other tasks, such as operating a tactile output device.
In addition, the portability of speech interfaces makes them well-suited for certain applications in which access to spatial data is required. One example, Back Seat Driver, is a navigational system developed at the MIT Media Lab [6] for taxi drivers in the Boston area. In response to voice input from the driver, it consults an inertial guidance system and map database to provide directions via synthesized speech. Research described in [7] investigated the use of similar technology for a portable navigational aid for visually impaired travelers in unfamiliar environments. The Soldier's Computer, designed to meet the needs of the modern soldier, offers another example of a portable speech interface providing rapid access to map and directional data in time-critical situations [8].
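The pipeline these systems share (a recognized voice request in, a position fix combined with a map lookup, synthesized speech out) can be sketched as follows. This is a hypothetical illustration only; the function names and stub behavior are invented for the sketch and are not drawn from Back Seat Driver or the other systems cited.

```python
# A minimal, hypothetical sketch of the voice-in / speech-out loop
# described above. None of these names come from Back Seat Driver;
# position_fix(), next_instruction(), and speak() merely stand in for
# an inertial guidance reading, a map-database lookup, and a speech
# synthesizer.

def position_fix():
    """Return the vehicle's current (latitude, longitude) estimate."""
    return (42.3601, -71.0589)  # placeholder coordinates (Boston)

def next_instruction(position, destination):
    """Consult the map database for the next driving instruction."""
    return "In one quarter mile, turn left onto Massachusetts Avenue."

def speak(text):
    """Hand the instruction text to a speech synthesizer."""
    print(f"[synthesized speech] {text}")

def handle_voice_request(destination):
    """Respond to a recognized spoken request for directions."""
    speak(next_instruction(position_fix(), destination))

handle_voice_request("Logan Airport")
```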
Despite their potential benefits, many human factors issues must be resolved if speech interfaces are to enjoy widespread use. These are discussed in [1,9]. One issue in particular, prosodics, is a central focus of this research. Prosodics encompasses the nonverbal aspects of spoken language, such as pauses and intonation, which are useful in both speech synthesis and recognition. Research reviewed in [10] has examined the effects of various psychological and cognitive stresses on the prosodics of human speech, e.g., fundamental frequency (F0), which corresponds to pitch; the length and location of pauses; and speaking rate. This dissertation research examines the possible connection between the cognitive burden produced by nonvisual access to spatial data through spoken language and the impact of this burden on the prosodics of the user's speech. An understanding of how this cognitive burden affects the speaker's prosodics would contribute to the development of more robust interfaces for situations in which this type of access is necessary. Insight gained from an investigation of this issue could be used, for example, to improve algorithms in the recognition component by identifying prosodic patterns, or variations in patterns, that occur in this type of application. As noted in [11], many of the algorithms developed for prosodic pattern detection are limited by their reliance on a narrow set of acoustic cues, primarily F0 features. It is argued in [11] that additional acoustic cues, including pauses and durational features such as speaking rate, should be used for more robust prosodic pattern detection. This argument has particular relevance for displayless access to spatial data, given the additional cognitive load such access places on the user and its potential impact on the prosodics of the user's speech.
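To make these measurements concrete, the sketch below computes rough versions of the three cue classes named above: F0 via a crude autocorrelation pitch estimate, pauses via an energy gate, and a voiced-time ratio as a stand-in for speaking rate. Python and NumPy, along with all frame sizes and thresholds, are illustrative assumptions; this is not the detection method of [11], nor the extraction procedure used in the dissertation.

```python
# An illustrative sketch of the three prosodic measurements named
# above (F0, pauses, speaking rate), using plain NumPy. The frame
# sizes, thresholds, and the autocorrelation pitch tracker are
# simplifying assumptions made for this sketch.
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a waveform into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def estimate_f0(frame, sr, fmin=75.0, fmax=400.0):
    """Crude autocorrelation pitch estimate for one voiced frame (Hz)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def prosodic_features(x, sr, frame_len=1024, hop=512, silence_db=-35.0):
    """Return mean F0, a pause count, and a speaking-rate proxy."""
    frames = frame_signal(x, frame_len, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12
    voiced = 20 * np.log10(rms / rms.max()) > silence_db  # energy gate
    f0 = [estimate_f0(f, sr) for f, v in zip(frames, voiced) if v]
    # Pauses: transitions from a voiced frame into a silent run.
    pauses = int(np.sum(np.diff(voiced.astype(int)) == -1))
    # Speaking-rate proxy: fraction of frames spent voiced.
    return {"mean_f0": float(np.mean(f0)) if f0 else 0.0,
            "pauses": pauses,
            "voiced_ratio": float(voiced.mean())}
```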
Since a central assumption of the research is the difficulty of nonvisual access to spatial data for all users, both those with and without sight loss will be tested. However, since the conditions in the second experiment must differ for the two groups, no formal statistical comparisons can be made between the two groups. Therefore, data will be gathered and labeled separately for each group.
After all sessions are completed, acoustic features for the analysis will be extracted from the recordings, including F0, pauses, and durational features. To determine the significance of differences in the prosodic features produced in the two experiments, statistical tests will be performed comparing the means of the differences in the measurements of each extracted feature. More details on the experimental design and the criteria for the subjects are available in [5].
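As one plausible reading of this analysis plan (the proposal does not name a specific test here), the sketch below applies a paired t-test to each extracted feature, pairing each subject's measurements from the two experiments. The subject data shown are invented for illustration.

```python
# A hedged sketch of per-feature significance testing: a paired t-test
# on within-subject differences between the two experiments. The
# paired t-test is one plausible reading of "comparing the means of
# the differences"; the numbers below are made up.
import numpy as np
from scipy.stats import ttest_rel

features = ["mean_f0", "pause_count", "speaking_rate"]
# Hypothetical measurements: one row per subject, one column per feature.
experiment_1 = np.array([[182.0, 11, 4.1],
                         [201.0,  9, 4.4],
                         [176.0, 14, 3.8],
                         [190.0, 12, 4.0]])
experiment_2 = np.array([[195.0, 16, 3.5],
                         [214.0, 13, 3.9],
                         [181.0, 19, 3.2],
                         [197.0, 15, 3.6]])

for j, name in enumerate(features):
    t, p = ttest_rel(experiment_1[:, j], experiment_2[:, j])
    print(f"{name}: t = {t:.2f}, p = {p:.3f}")
```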
2. Boyd, L.H., W.L. Boyd, J. Berliss, M. Sutton, and G.C. Vanderheiden. The paradox of the graphical user interface: Unprecedented computer power for blind people. Closing the Gap 14 (October):24-25, 60-61, 1992.
3. Vanderheiden, G.C., and D.C. Kunz. Systems 3: An interface to graphic computers for blind users. In Proceedings of the 13th Annual Conference of RESNA, held in Washington, D.C., 20-24 June 1990, 150-200, 1993.
4. Mynatt, E.D., and G. Weber. Nonvisual presentation of graphical user interfaces: contrasting two approaches. In Proceedings of the ACM CHI '94 Conference, held in Boston, MA, 24-28 April 1994, 166-172, 1994.
5. Baca, J. Displayless access to spatial data: Effects on speaker prosodics. Unpublished dissertation proposal, Department of Computer Science, Mississippi State University, 1996.
6. Davis, J.R. and C. Schmandt. The back seat driver: Real time spoken driving instructions. In Vehicle Navigation and Information Systems, 146-150, 1989.
7. Loomis, J.M., R.G. Golledge, R.L. Klatzky, J. Speigle, and J. Tietz. Personal guidance system for the visually impaired. In ASSETS '94, The First Annual ACM Conference on Assistive Technologies, held in Los Angeles, CA, October 31-November 1, 1994, 85-91, 1994.
8. Weinstein, C.J. Applications of voice processing technology. In Voice Communication Between Humans and Machines, National Academy Press, Washington, D.C., 1994.
9. Kamm, C. User interfaces for voice applications. In Voice Communication Between Humans and Machines, National Academy Press, Washington, D.C., 1994.
10. Scherer, K.R. Speech and emotional states. In Speech Evaluation in Psychiatry, 189-220. New York: Grune & Stratton, 1981.
11. Wightman, C.W. and M. Ostendorf. Automatic labeling of prosodic patterns. IEEE Transactions on Speech and Audio Processing 2(4):469-481, 1994.