A Sub-system to Map Natural-language Utterances to Situated Parameterized Dialog Acts 

A Step Towards Continuous Listening for Spoken Interaction

The way an SDS listens to the user affects how the system handles the dialog. Currently deployed commercial systems choose to give the user the initiative to segment the signal. Most of them base that process on a binary state whose value (listening or paused) is set by restrictive methods such as gesture recognition, keyword detection, or button presses. In practice, this means that every turn starts with the user signaling his/her intent to address the system before actually uttering the request. This way of pacing the interaction is inspired by the sequential processing of dialog turns, and it does not reflect the manner in which humans share information in conversation. Nevertheless, the reliability of such methods makes them an appropriate choice for SDSs, avoiding the hurdles of continuous, uncontrolled listening.
When turn-taking control is left to the machine and the capturing device records continuously, we talk about continuous listening. The user interacts freely with the agent, and the agent segments and filters the signal. Continuous listening introduces challenges: noisy signal, speech-free segments, marking the start and end of utterances, distant speech recognition, echo removal, out-of-scope utterances, etc.
This document proposes a method for continuous listening applied to vocal interaction with a mobile companion robot. A head-mounted microphone records continuously in the home environment, and the processing, based on sound classification, parallel processing of speech, alignment scoring, and attention level computation, tries to deal with the inherent issues.
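As a rough illustration of how such a pipeline can be organized, the sketch below chains the stages named above: classification of a captured segment, two parallel recognition hypotheses, an alignment (similarity) score, and an attention-level gate. All names, the word-overlap scorer, the attention update, and the thresholds are invented placeholders for illustration, not the thesis implementation.

    # Self-contained sketch of continuous-listening segment filtering
    # (hypothetical names and thresholds; the real system uses trained
    # sound classifiers and ASR decoders).
    from dataclasses import dataclass

    SIMILARITY_THRESHOLD = 0.6   # assumed value
    ATTENTION_THRESHOLD = 0.5    # assumed value

    @dataclass
    class Segment:
        label: str        # sound-classifier output: 'speech' or 'noise'
        general_hyp: str  # ASR hypothesis under a general language model
        domain_hyp: str   # ASR hypothesis under a domain language model

    def alignment_score(a, b):
        # Crude word-overlap stand-in for the alignment-scoring step.
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / max(len(wa | wb), 1)

    def process(segment, attention):
        if segment.label != 'speech':
            return None, attention                 # drop noise-labeled segments
        score = alignment_score(segment.general_hyp, segment.domain_hyp)
        attention = 0.8 * attention + 0.2 * score  # toy attention-level update
        if score >= SIMILARITY_THRESHOLD and attention >= ATTENTION_THRESHOLD:
            return segment.domain_hyp, attention   # forward to the dialog manager
        return None, attention                     # reject as out-of-scope

    # A speech segment whose two decodings agree is accepted as system-directed.
    utterance, att = process(Segment('speech', 'turn on the light',
                                     'turn on the light'), 0.6)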

A Sub-system to Map Natural-language Utterances to Situated Parameterized Dialog Acts

NLs are languages that arise, unpremeditated, in the brains of human beings; typically, these are the languages humans use to communicate with each other. Unlike computer languages, which are deterministically parsable and obey strict structural rules, NL grammars are flexible and vary from one speaker to another.
In research, NLU groups aim at extracting the meaning from the NL input of a system, so as to build computing machines that 'understand' human language.
The modality for SDSs is speech. Thus, an NLU component's role is to analyse the (segmented) speech signal to extract information that is meaningful for the dialog. A unit of dialog move is called a DA. It is specific to a dialog and a system, i.e. each system defines its own set of DAs and the dynamics between interaction domains. The NLU establishes a mapping between a spoken input and a DA. Note that this mapping may not exist, so an NLU sub-system needs to implement mechanisms to reject and recover from such out-of-scope utterances.
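As a minimal illustration of such a mapping, the sketch below represents a parameterized DA as a plain data structure and returns no act at all for out-of-scope input. The act inventory, cue words, and slot layout are invented for the example, not taken from any particular system.

    # Hypothetical sketch: a parameterized dialog act and a mapping with
    # rejection of out-of-scope utterances.
    from dataclasses import dataclass, field

    @dataclass
    class DialogAct:
        name: str                                  # e.g. 'set_reminder'
        slots: dict = field(default_factory=dict)  # the act's parameters

    # Toy inventory: each DA is triggered by cue words (invented).
    INVENTORY = {
        'greet': {'hello', 'hi'},
        'set_reminder': {'remind', 'reminder'},
    }

    def map_to_da(utterance):
        words = set(utterance.lower().split())
        for name, cues in INVENTORY.items():
            if words & cues:
                return DialogAct(name, {'text': utterance})
        return None  # no mapping exists: reject and trigger recovery

    assert map_to_da('please remind me at noon').name == 'set_reminder'
    assert map_to_da('what a nice day') is None  # out-of-scope utterance
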
Early NLU systems based their interpretation of spoken utterances on the detection of designer-defined keywords or patterns [48, 204]. The scalability of such an approach is very limited, and so is its applicability to different domains. Later, Context-free Grammars (CFGs) and Probabilistic Context-free Grammars (PCFGs) were applied to the understanding process [53, 68, 132, 201, 202, 203], with the aim of building parse trees covering the sequence of recognized words. This path has been taken by many, some of whom additionally proposed methods to infer grammars from data.
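A toy grammar makes the CFG approach concrete. The sketch below parses a recognized word sequence with NLTK's chart parser; the grammar is invented for illustration and is not one of the grammars used by the cited systems.

    # Toy CFG parsing with NLTK (invented grammar, for illustration).
    import nltk

    grammar = nltk.CFG.fromstring("""
        S -> ACTION OBJECT
        ACTION -> 'switch' 'on' | 'switch' 'off'
        OBJECT -> 'the' DEVICE
        DEVICE -> 'light' | 'radio'
    """)
    parser = nltk.ChartParser(grammar)

    # Each complete parse tree covers the sequence of recognized words.
    for tree in parser.parse('switch on the light'.split()):
        print(tree)  # (S (ACTION switch on) (OBJECT the (DEVICE light)))
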
To avoid the burden of extracting static structures from observed data, and in anticipation of the big data era, Chronus [12, 144, 145] based its processing on HMM modeling. The mapping between the words in an utterance and semantic symbols was learned from data.
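To make the HMM idea concrete, the following sketch runs Viterbi decoding over a tiny hand-written model in which the hidden states are semantic symbols and the observations are words. All probabilities are invented; in a Chronus-style system they would be estimated from annotated corpora.

    # Minimal Viterbi decoding of words into semantic symbols (toy numbers).
    states = ['CMD', 'DEVICE']
    start = {'CMD': 0.9, 'DEVICE': 0.1}
    trans = {'CMD': {'CMD': 0.5, 'DEVICE': 0.5},
             'DEVICE': {'CMD': 0.2, 'DEVICE': 0.8}}
    emit = {'CMD': {'switch': 0.5, 'on': 0.4, 'light': 0.1},
            'DEVICE': {'switch': 0.05, 'on': 0.05, 'light': 0.9}}

    def viterbi(words):
        # Each cell stores (best probability, best state path) for a state.
        V = [{s: (start[s] * emit[s].get(words[0], 1e-6), [s]) for s in states}]
        for w in words[1:]:
            V.append({s: max(
                (V[-1][p][0] * trans[p][s] * emit[s].get(w, 1e-6),
                 V[-1][p][1] + [s]) for p in states) for s in states})
        return max(V[-1].values())[1]  # path with the highest final probability

    print(viterbi(['switch', 'on', 'light']))  # -> ['CMD', 'CMD', 'DEVICE']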

Commercial and Research Perspectives

Lately, speech and other forms of NL have been gaining acceptance as a means of interacting with "intelligent" computing systems. Companies increasingly provide us with new application scenarios in which this modality is seen as the best mode of operation. This trend is particularly reflected by recent technology releases such as Apple's Siri, Google's Google Now, Microsoft's Cortana, Samsung's S Voice and LG's Voice Mate. While these products clearly demonstrate the industry's vision of how we should be interacting with our current and future devices, they also highlight some of the great challenges that remain [140]. Indeed, criticisms have been voiced questioning the reliability, the usefulness, the data protection, and the proprietary aspects of these technologies. NL interaction is, despite those recent advances, still not reliable enough to be used by the majority of users, and hence is hardly accepted as an efficient way to communicate with a machine. We face a socio-technological problem where the use of these error-prone technologies may easily lead to unsatisfying user experiences. While for some, talking to a computer may simply convey a great user experience, for others it can offer significant alleviation when interacting with a piece of technology. However, the leap forward taken by ASR has demonstrated how a technology entering the virtuous circle of machine learning (Figure 2.9) may significantly improve its performance.

Table of contents:

List of Figures
List of Tables
List of Acronyms
Abstract
1 Introduction 
1.1 A Modular Open-source Platform for Spoken Dialog Systems
1.2 A Step Towards Continuous Listening for Spoken Interaction
1.3 A Sub-system to Map Natural-language Utterances to Situated Parameterized Dialog Acts
1.4 The Linked-form Filling language: A New Paradigm to Create and Update Task-based Dialog Models
2 A Modular Open-source Platform for Spoken Dialog Systems 
2.1 Introduction
2.1.1 SDS Definition
2.1.2 Deployed Research Systems
2.1.3 Commercial and Research Perspectives
2.2 A New SDS Platform
2.2.1 Desired Characteristics
2.2.2 Architecture
2.2.3 Communication
2.2.4 Grounding
2.3 Interaction Example
2.3.1 Simulated Service Description
2.3.2 One Interaction Turn
2.3.3 Interaction Turns
2.4 Speech Synthesis
2.5 Natural Language Generation
2.6 Speech Recognition
2.6.1 Local Implementation
2.6.2 Web Service
2.6.3 Speech Recognizers Benchmarking
2.7 Real-user Data Collection
2.7.1 Wizard of Oz
2.7.2 WoZ-based Lab trials
2.7.3 System Trials
2.8 Conclusion
3 A Step Towards Continuous Listening for Spoken Interaction 
3.1 Introduction
3.2 Automatic Speech Recognition: An Introduction
3.2.1 Recording Speech
3.2.2 Parameters Extraction
3.2.3 Search Graph
3.2.4 Language Modeling
3.2.5 Lexicon
3.2.6 Acoustic Modeling
3.2.7 Software and Tools
3.3 The Most Common Listening Method
3.4 CompanionAble Project Setup and Task
3.5 Speech Recognition Issues
3.5.1 Acoustic Mismatch
3.5.2 Distant Speaker
3.5.3 Echo
3.5.4 Uncontrolled Background Noise
3.5.5 Controlled Background Noise
3.5.6 Single Input Channel
3.5.7 System Attention
3.6 Primary Continuous Listening System
3.6.1 Architecture
3.6.2 Signal Segmentation
3.6.3 Sound Classification
3.6.4 Speech Recognition
3.6.5 Language Models Interpolation
3.6.6 Similarity Test
3.6.7 Noise-labeled Segments Filter
3.6.8 Attention Level
3.6.9 Acoustic Adaptation
3.7 Evaluation
3.7.1 Expected Improvement Axes
3.7.2 Improving the ASR Reliability
3.7.3 Detecting the Intended Segments
3.7.4 Evaluation Corpus
3.7.5 Noise-free Evaluation
3.7.6 Evaluation in Noisy Conditions
3.8 Another System for Hand-held Devices
3.8.1 Introduction
3.8.2 Listening Context
3.8.3 Architecture
3.8.4 Signal Segmentation
3.8.5 Confidence Scoring
3.8.6 Semantic Appropriateness
3.8.7 Error Recovery
3.8.8 Multiple Hypothesis Testing
3.9 Evaluation
3.9.1 Introduction
3.9.2 First Setup
3.9.3 Observations
3.9.4 User Feedback
3.9.5 Second Setup
3.9.6 Observations
3.9.7 User Feedback
3.10 Conclusion and Future Work
4 A Sub-system to Map Natural-language Utterances to Situated Parameterized Dialog Acts 
4.1 Introduction
4.2 State-of-the-art Methods for NLU
4.2.1 Keywords Spotting
4.2.2 Context-free Grammars
4.2.3 Probabilistic Context-free Grammars (cf. 3.2.4)
4.2.4 Hidden Markov Model
4.2.5 Hidden Vector State Model
4.2.6 Semantic Frame: A Meaning Representation
4.3 Natural Language Understanding: Issues and Challenges
4.3.1 Challenges
4.3.2 Issues
4.4 Platform’s NLU System Overview
4.5 Semantic Parsing
4.5.1 Training
4.5.2 Decoding
4.5.3 Slot Values Clustering
4.6 Semantic Unifier and Reference Resolver
4.7 Context Catcher
4.8 Reference Resolution
4.8.1 Dialog Context References
4.8.2 Extended Dialog History
4.8.3 External References
4.9 Semantic Unification
4.10 Mapping Semantic Frames to Dialog Acts
4.11 Dealing With Multiple Hypotheses
4.12 Evaluation
4.12.1 Corpus and Method
4.12.2 SP Evaluation
4.12.3 NLU Evaluation
4.13 Conclusion
5 The Linked-form Filling language: A New Paradigm to Create and Update Task-based Dialog Models 
5.1 Introduction
5.2 Related Work in Dialog Management
5.2.1 Flow Graphs
5.2.2 Adjacency Pairs
5.2.3 The Information State
5.2.4 Example-based Dialog Modeling
5.2.5 Markov Decision Processes
5.2.6 Partially Observable Markov Decision Processes
5.3 The Task Hierarchy Paradigm
5.3.1 Principles
5.3.2 The ANSI/CEA-2018 Standard
5.3.3 Related Issues
5.4 Disco: A Dialog Management Library
5.4.1 Introduction
5.4.2 Embedding Disco
5.5 Linked-form Filling Language Description
5.5.1 LFF Principles
5.5.2 Syntax
5.6 From Linked-form Filling to ANSI/CEA-2018
5.6.1 Variables and Actions
5.6.2 Forms
5.7 Linked-form Filling Evaluation
5.7.1 Model’s Comparison
5.7.2 Design Comparison
5.7.3 Characteristics Summary
5.7.4 Compared to…
5.8 Conclusion
5.9 Future Work: A Proposal for the Evaluation of Dialog Management Methods
5.9.1 vAssist Field Trials
5.9.2 Switching Dialog Managers
5.9.3 Comparison With a Statistical Dialog Manager
6 Conclusion and Future Work 
6.1 Conclusion
6.2 Future Work
7 Publications 
8 Extended Summary (Résumé long)
8.1 Abstract
8.2 Introduction
8.3 The Spoken Dialog Platform
8.3.1 Introduction
8.3.2 Overview
8.3.3 Component Operation
8.3.4 Configuration
8.4 Continuous Listening and Speech Recognition Robustness
8.4.1 Introduction
8.4.2 Issues
8.4.3 Continuous Listening Method
8.4.4 Evaluation
8.5 Language Understanding Applied to Spoken Dialog
8.5.1 Introduction
8.5.2 Extracting Semantic Concepts from Text
8.5.3 Resolving Local References
8.5.4 Resolving Common References and Situating the Interaction
8.5.5 Unifying Semantic Spaces
8.5.6 Joining the Dialog Manager's Expectations
8.5.7 Conclusion
8.6 Dialog Modeling: Linked-form Filling
8.6.1 Introduction
8.6.2 Task Hierarchies for Dialog Modeling
8.6.3 Linked-Form Filling Principles
8.6.4 Transformation into a Task Hierarchy
8.6.5 Evaluation
8.6.6 Conclusion
8.7 Conclusion and Perspectives
References
