There is little argument that dialog systems are among the fastest-growing language-aware applications these days. Natural language interactions are quickly becoming a vital part of our daily lives by enabling us to access computational resources and communicate in highly intuitive ways.
In Part 1 of this series published here on technology.org, we presented dialog systems from the business perspective. In this part, we will start looking at what keeps present-day R&D people busy in their attempt to deliver truly interactive and efficient dialog systems. You will also be introduced to the general NLP pipeline for building dialog systems.
Dialog Systems from The Researcher’s Perspective
There are a few reasons why you might want to build a dialogue system:
- To provide low-barrier access for users, allowing them to communicate in an intuitive manner with services, resources, and data on the internet. With dialogue systems, there is no need to learn an interface, at least in theory. You can say anything you wish, and the assistant is supposed to support and entertain you as a social companion. If this interaction occurs in a commercial setting, the assistant should provide customer self-service and automated help.
- To meet the challenge of building a computational model of human conversational capability. Being able to talk in a natural way, give appropriate responses, and perceive the partner’s emotional state is one of the top-level mental skills that facilitate social interaction.
- To replicate human conversational performance so that talking to the dialogue system resembles talking to a human. Passing the Turing test, however, is not a necessary condition for a truly effective dialogue system. On the contrary, users may sometimes feel uncomfortable with a dialogue system that is able to deceive them into believing they are talking to a human. One should therefore be careful with this approach and strive for the right balance to address the ethical concerns.
Dialog Systems from The Developer’s Perspective
There are three major approaches to the construction of dialogue systems: rule-based, statistical data-driven, and entirely neural. In rule-based systems, the conversation flow and other facets of the interface are handcrafted using best-practice guidelines that voice user interface designers have proposed over the past decades. These include guidelines on aspects of communication such as:
- how to design effective prompts;
- how to sound natural;
- how to act in a cooperative manner;
- how to offer assistance at any time;
- how to prevent errors; and
- how to recover from errors when they occur.
There are also higher-level guidelines, for instance:
- how to promote engagement and retention;
- how to make the customer experience more special and enjoyable; and
- the use of personas and branding.
Some of these guidelines address linguistic facets of communication, such as retaining the context in multi-turn discourse, asking follow-up questions, and sustaining and varying topics. Others are more related to social competence, such as promoting engagement, exhibiting personality, and conveying and understanding emotion. Finally, there are psychological factors, such as being able to perceive the beliefs and intentions of the dialog partner. All of these factors are crucial for a conversational agent to be productive as well as attractive to the user.
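To make the rule-based approach more tangible, here is a minimal sketch of one handcrafted dialog step that offers help at any time and recovers from errors by reprompting. The pizza-ordering scenario, the prompt texts, and the names `next_turn` and `MAX_RETRIES` are all hypothetical and chosen only for illustration; real voice interfaces involve far more design work than this.

```python
import re

# Hypothetical hand-written rules for a toy ordering bot (illustration only).
PROMPT = "What size pizza would you like: small, medium, or large?"
HELP = "You can say small, medium, or large. Say 'help' at any time."
SIZES = ("small", "medium", "large")
MAX_RETRIES = 2  # assumed limit before handing the user over to a person


def next_turn(user_input, retries):
    """Return (system_reply, new_retry_count, done) for one user turn."""
    words = re.findall(r"[a-z]+", user_input.lower())

    # Rule: offer assistance whenever the user asks for it.
    if "help" in words:
        return HELP, retries, False

    # Rule: accept the answer only if exactly one known size was mentioned.
    chosen = [size for size in SIZES if size in words]
    if len(chosen) == 1:
        return f"Got it, one {chosen[0]} pizza.", 0, True

    # Rule: recover from errors by reprompting, but do not loop forever.
    if retries + 1 >= MAX_RETRIES:
        return "Let me connect you to a person who can help.", retries + 1, True
    return "Sorry, I didn't catch that. " + PROMPT, retries + 1, False
```

Every behavior here is spelled out by hand, which is what makes rule-based systems predictable but also laborious to extend to new situations.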
In the statistical data-driven and entirely neural approaches, conversational strategies are learned from data. Statistical data-driven dialogue systems appeared in the late 1990s, and entirely neural, deep learning-based dialogue systems first surfaced around 2014.
If you are interested in learning more of the technical details of these three approaches, come back for our follow-up articles on Dialog Systems at technology.org!
Dialog Systems NLP Pipeline
A dialog system calls for four kinds of processing along with a database to store previous statements and responses. Each of the four processing stages can involve one or more processing algorithms running in parallel or in series:
- Parsing: converting a statement in natural language to structured numerical data (features) through multiple preprocessing steps including tokenization, part-of-speech tagging, named entity recognition, and vectorization. At this stage, we get the dialog system to “listen” to you by translating human language into computer jargon.
- Analyzing: combining features to evaluate the grammar, sentiment, and semantics of your input. Now the system is interpreting what you have just said.
- Generating: employing templates, language models, and search to produce possible responses of the dialog system. At this stage, we teach the system to speak “human language”.
- Executing: choosing the sequence of responses according to the conversation history and objectives. This is the management stage, where the system has to plan which response should follow the previous ones to maintain a conversation that “makes sense” to you.
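The four stages listed above can be sketched as a chain of small functions. This is a deliberately simplified illustration using only the Python standard library, not how a production system works: the word lists, the response templates, and the function names `parse`, `analyze`, `generate`, `execute`, and `respond` are hypothetical, and a plain list stands in for the database of previous statements and responses.

```python
import re
from collections import Counter

# Hypothetical toy word lists, invented for illustration only.
POSITIVE = {"good", "great", "thanks", "love"}
NEGATIVE = {"bad", "broken", "hate", "wrong"}
QUESTION_WORDS = {"what", "how", "why", "when", "where", "who"}


def parse(statement):
    """Parsing: tokenize the statement and count tokens (numerical features)."""
    tokens = re.findall(r"[a-z']+", statement.lower())
    return {"tokens": tokens, "counts": Counter(tokens)}


def analyze(features):
    """Analyzing: combine features into a crude sentiment score and question flag."""
    counts = features["counts"]
    sentiment = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    tokens = features["tokens"]
    is_question = bool(tokens) and tokens[0] in QUESTION_WORDS
    return {"sentiment": sentiment, "is_question": is_question}


def generate(analysis):
    """Generating: produce candidate responses from templates."""
    candidates = ["Tell me more."]  # generic fallback
    if analysis["is_question"]:
        candidates.append("Let me look that up for you.")
    if analysis["sentiment"] > 0:
        candidates.append("Glad to hear it!")
    if analysis["sentiment"] < 0:
        candidates.append("I'm sorry to hear that.")
    return candidates


def execute(candidates, history):
    """Executing: prefer the most specific candidate not already in the history."""
    unused = [c for c in candidates if c not in history]
    reply = unused[-1] if unused else candidates[0]
    history.append(reply)
    return reply


def respond(statement, history):
    """Run one statement through all four stages, recording both sides."""
    history.append(statement)
    return execute(generate(analyze(parse(statement))), history)
```

Calling `respond` repeatedly with the same `history` list lets the last stage avoid repeating itself, which is the simplest possible use of conversation history when choosing a response.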
Most dialog systems will encompass aspects of all four of the processing stages shown as numbered color boxes in the picture above. However, many applications need only straightforward algorithms for quite a few of these steps. Some dialog systems are better suited to answering factual questions, while others excel at producing extended and rather sophisticated responses that might even convince you that you are chatting with a human. Each of these capabilities calls for a different approach; we hope to reveal those to you in this series on Dialog Systems.
In short, this series is about using machine learning to save you from having to anticipate every scenario people might follow when expressing things in natural language. With each part of the series, you will gradually build a better understanding of the basic building blocks of the NLP pipeline for a dialog system. As you master the tools of NLP, you will ultimately figure out how to set up your own pipeline to support human-like conversations with a machine the way you want.
Finally, don’t worry if you see a lot of terms in the block diagram above that look confusing or simply make no sense to you. We will not even attempt to cover most of these terms in the next part of the series; that would hardly be possible, to be honest. However, if you are persistent, we will explain in plain language, slowly and step by step, what is hidden behind all those color boxes. Just keep reading the series articles at technology.org, and you won’t regret it!
What About the Future?
What capabilities should future dialogue systems have in order to engage in truly human-like dialogues? One way to answer this is to explore the Knowledge Navigator video shown below, which Apple presented in 1987. The video was offered as a vision of what would be achievable someday.
In the video a university professor returns home and turns on his computer to be greeted by his Personal Digital Assistant (PDA). The PDA appears on the screen and informs the professor that he has several messages and some events on his calendar. The professor engages in a dialogue with the PDA to collect data for an upcoming lecture.
Some of the capabilities of the PDA are already available today, for example, reporting on messages, missed calls, and upcoming calendar events. The professor also asks the PDA to display some pictures, for example, to show only universities with geography nodes. Voice search for textual and visual data is also feasible with modern dialogue systems on smartphones and smart speakers with visual displays.
Consider, however, this conversation in which the professor (PROF) asks the PDA for particular data for his lecture [McTear, 2021]:
PDA1: You have a lecture at 4.15 on deforestation in the Amazon rainforest.
PROF1: Let me see the lecture notes from last semester.
PDA retrieves the lecture notes and displays them.
PROF2: No that’s not enough.
PROF3: I need to review more recent literature.
PROF4: Pull up all the new articles I haven’t read yet.
PDA2: Journal articles only?
PDA3: Your friend Jill Gilbert has published an article about deforestation in the Amazon and its effect on rainfall in the sub-Sahara, it also covers drought’s effect on food production in Africa and increasing imports of food.
PROF6: Contact Jill.
PDA4: She’s not available
Some parts of this scenario are achievable today, such as requesting to contact someone (PROF6) and being told that they are not available (PDA4), as well as retrieving a specific type of document, i.e., a journal article, and reading out its description (PDA3). Other parts are more demanding. For instance, since the request to fetch the lecture notes from last semester (PROF1) is only partially specified, it can be interpreted correctly only if the PDA keeps track of the topic of the upcoming lecture (PDA1). The expression “more recent literature” is also tricky, as it requires interpreting “more recent” relative to the lecture notes just retrieved, so that only material dated later is fetched. Lastly, collecting articles that the professor hasn’t read yet calls for a user model that keeps track of what the professor has read on this topic and removes those items from the list of current and new articles.
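The three requirements just described (keeping the topic in context, interpreting “more recent” relative to what was retrieved, and consulting a user model of what has been read) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how any real assistant is built: the `Document` and `DialogState` classes, the function names, and the sample library with its titles and dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Document:
    title: str
    topic: str
    published: date
    kind: str  # "lecture notes" or "journal article"


@dataclass
class DialogState:
    topic: str = ""                                   # context from PDA1
    last_retrieved: list = field(default_factory=list)
    read_titles: set = field(default_factory=set)     # user model


def fetch_notes(state, library):
    """PROF1 names no topic, so the topic is taken from the dialog context."""
    notes = [d for d in library
             if d.kind == "lecture notes" and d.topic == state.topic]
    state.last_retrieved = notes
    return notes


def fetch_more_recent_unread(state, library):
    """PROF3/PROF4: 'more recent' is relative to the notes last retrieved,
    and 'haven't read yet' is resolved against the user model."""
    cutoff = max((d.published for d in state.last_retrieved), default=date.min)
    return [d for d in library
            if d.kind == "journal article"
            and d.topic == state.topic
            and d.published > cutoff
            and d.title not in state.read_titles]


# Hypothetical sample data, invented purely for illustration.
LIBRARY = [
    Document("Deforestation notes", "deforestation", date(2020, 9, 1), "lecture notes"),
    Document("Old survey", "deforestation", date(2019, 5, 1), "journal article"),
    Document("Rainfall effects", "deforestation", date(2021, 2, 1), "journal article"),
    Document("Soil study", "deforestation", date(2021, 3, 1), "journal article"),
    Document("Ocean currents", "oceans", date(2021, 4, 1), "journal article"),
]
```

With the topic set to "deforestation" and "Soil study" marked as already read, calling `fetch_notes` and then `fetch_more_recent_unread` on `LIBRARY` leaves only "Rainfall effects": the older survey fails the date test, the soil study fails the user-model test, and the oceans article fails the topic test.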
Most of these challenges affect the Natural Language Understanding (NLU) component, not so much in terms of grasping literal content as in terms of making sense of discourse phenomena such as underspecified reference. Modern NLU systems demand more explicit input so that the system can find the desired items in its search.
Given these challenges, there is still much work ahead for researchers on the way to the ultimate intelligent dialogue system.
This was the second article in the technology.org series on Dialog Systems, where we presented what developers and researchers should concentrate on most when building valuable conversational AI today as well as in the near future. You also became familiar with the general NLP pipeline used in this process.
Technical details of the pipeline will be the focus of the upcoming parts of the series. Keep checking our technology.org site for the continuation!
Darius Miniotas is a data scientist and technical writer with Neurotechnology in Vilnius, Lithuania. He is also Associate Professor at VILNIUSTECH where he has taught analog and digital signal processing. Darius holds a Ph.D. in Electrical Engineering, but his early research interests focused on multimodal human-machine interactions combining eye gaze, speech, and touch. Currently he is passionate about prosocial and conversational AI. At Neurotechnology, Darius is pursuing research and education projects that attempt to address the remaining challenges of dealing with multimodality in visual dialogues and multiparty interactions with social robots.
Andrew R. Freed. Conversational AI. Manning Publications, 2021.
Rashid Khan and Anik Das. Build Better Chatbots. Apress, 2018.
Hobson Lane, Cole Howard, and Hannes Max Hapke. Natural Language Processing in Action. Manning Publications, 2019.
Michael McTear. Conversational AI. Morgan & Claypool, 2021.
Sumit Raj. Building Chatbots with Python. Apress, 2019.
Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. Practical Natural Language Processing. O’Reilly Media, 2020.