It was the time after Alexa Prize 2017, and our team was looking for new challenges to tackle. One possibility we saw was to apply our knowledge of conversational AI to a practical task. The opportunity appeared unexpectedly. A company located in the same building as our team asked us to create a voice interface for their robotic barista. I accepted the task and dived deep into goal-based dialogues, machine learning and liters of coffee.
The company is called Factorio solutions, and they created the robotic barista. Its central part is a robotic arm made by KUKA, which operates an ordinary coffee maker. The arm takes a coffee cup off the shelf, puts it in the coffee maker, presses the buttons to start it, takes the finished cup and places it on a lovely saucer. The whole process of making the coffee looks very cool. You control the robo-barista through an app on a tablet. The company wanted to take the project further. After they heard about our success in the Alexa Prize, they asked us to build voice control for the robo-barista. I took the lead role in this project.
How to design the dialogue?
The initial problem I faced was designing the conversation. I saw the goal of this project as showcasing conversational capabilities during the ordering process. But as I thought about it more and more, I realized that this is not enough. You usually don't care about a long, engaging conversation about which coffee you would like. You want your coffee, ideally now. This may be obvious to those of you who have built goal-based dialogues. But for me, it was something new. It felt like the Alexa Prize inside out.
These two goals, showcasing conversational capabilities and serving the customer as soon as possible, conflict. The more you emphasize one, the further you get from the other. This project required balance and a robust dialogue design. I initially came up with the following dialogue:

Alquist’s dialogue for robo-barista
The dialogue starts with the initial message "Hello, which coffee would you like?" The formulation of this question is intentional. There are only two logical responses to it: the user either knows which coffee he/she would like, or doesn't. If the user doesn't know, the dialogue continues to the left branch, in which Alquist lists the coffees it can prepare. If the user says the name of a coffee, Alquist confirms it with "So one <coffee>, right?", in which "<coffee>" is replaced by the actual name of the coffee. This confirmation is necessary since today's voice recognition technologies are not 100% accurate, and we don't want to disappoint the user with the wrong coffee (the problems of voice recognition are magnified by the fact that the robo-barista will be showcased at conferences in noisy environments). We send the request to make the coffee to the robotic arm if everything is OK, or we restart the dialogue if the user doesn't agree with the recognized coffee.
How to implement the dialogue?
With the dialogue design prepared, the next step was to select the technology in which I would implement it. I saw two possibilities: one with which I have a lot of experience, and a second that was more challenging but promised better results in the long term.
State automata
The first possibility was to use a finite state automaton. We used finite state automata in last year's Alquist, so I was fairly familiar with this technology and knew its benefits and drawbacks. You represent the dialogue as a graph structure. The nodes are the responses of the bot (the green rectangles in the diagram above), and the edges are rules working with the user's messages (the blue rectangles). You can imagine the rules in their simplest form as:
```python
if message.contains("i don't know") or message.contains("i am not sure"):
    transition_to(state("I can make you {list_of_drinks}"))
if message.contains("<coffee>"):
    transition_to(state("So one <coffee>, right?"))
```
Of course, you don't have to rely only on string matching. You can measure the similarity of TF-IDF or embedding vectors, but the idea is the same. You have to write the graph of the dialogue manually and prepare all the rules. Coming up with all the possible inputs requires a fairly good imagination. And if I learned something from the Alexa Prize, it is the fact that you are never able to come up with every possibility. But this approach has one huge advantage: it doesn't require data (or very little in the case of embeddings or TF-IDF). This is the main reason why this approach is useful. But state automata are not the only approach available.
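To make the similarity-based variant concrete, here is a minimal sketch using scikit-learn's TF-IDF vectorizer and cosine similarity. The transition table, example utterances and threshold are my own illustration, not Alquist's actual code:

```python
# A minimal sketch of similarity-based transition rules; instead of exact
# string matching, we pick the transition whose example utterance is most
# similar to the user's message in TF-IDF space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example utterances attached to each transition of the automaton.
transitions = {
    "I can make you {list_of_drinks}": ["i don't know", "i am not sure"],
    "So one <coffee>, right?": ["make me an espresso", "one lungo please"],
}

# Flatten the table and fit the vectorizer on all example utterances.
examples, targets = [], []
for target, utterances in transitions.items():
    for utterance in utterances:
        examples.append(utterance)
        targets.append(target)
vectorizer = TfidfVectorizer().fit(examples)
example_vectors = vectorizer.transform(examples)

def next_state(message, threshold=0.3):
    """Return the target state of the most similar example,
    or None if nothing is similar enough."""
    similarities = cosine_similarity(vectorizer.transform([message]),
                                     example_vectors)[0]
    best = similarities.argmax()
    return targets[best] if similarities[best] >= threshold else None
```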

Robo-barista
Hybrid code networks
We were looking for ways to improve Alquist after the Alexa Prize. One of the promising approaches was to use more machine learning. One machine learning system we stumbled upon was Hybrid code networks. Hybrid code networks is a system for dialogue management that combines a recurrent neural network with developer-written code for entity tracking and connection to APIs. The goal of the system is to predict the next response based on the history of the conversation and the user's message. The main idea behind combining an RNN with developer code is the fact that humans are bad at designing rules to represent the state of a dialogue. An RNN can learn to represent the state of the conversation easily. But an RNN struggles to learn how to make an API call, which is easily solved by a few lines of code.

Hybrid code networks
The basic flow of the system is:
- We detect the entities in the user's message and save them to memory.
- We replace the entities in the user's message with tokens representing the entity types.
- The user's message is converted into a vector representation (bag-of-words and an average of word embeddings).
- The vector representation is passed into an LSTM, a dense layer and a softmax.
- The output of the softmax is multiplied element-wise by an action mask, a vector of ones and zeros which can forbid some responses by assigning them zero probability. The developer's code creates this vector.
- We select the response with the highest probability.
- We execute the API call and replace the entity tokens with their actual values.
- We can use the result of the API call as a feature for the RNN during the processing of the next user's message.
- We present the response to the user.
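To make the flow concrete, here is a minimal sketch of a single turn in Python. It assumes a trained model with a Keras-style predict method and a featurize function producing the bag-of-words and embedding features; all names are my own illustration, not the code from the paper:

```python
# A minimal sketch of one Hybrid code networks turn for the coffee domain.
import numpy as np

COFFEES = ["espresso", "lungo", "ristretto", "latte macchiato"]
RESPONSES = [
    "Hello, which coffee would you like?",
    "I can make you espresso, lungo, ristretto and latte macchiato.",
    "So one <coffee>, right?",
    "{make_drink <coffee>} Your <coffee> is coming right up!",
]

memory = {}  # entity memory shared with the developer's code

def mask_entities(message):
    """Detect coffee names, save them to memory, replace them with a token."""
    message = message.lower()
    for coffee in COFFEES:
        if coffee in message:
            memory["<coffee>"] = coffee
            message = message.replace(coffee, "<coffee>")
    return message

def action_mask():
    """Developer's code: forbid responses that cannot be executed yet."""
    mask = np.ones(len(RESPONSES))
    if "<coffee>" not in memory:
        mask[2] = 0.0  # cannot confirm an unknown coffee
        mask[3] = 0.0  # cannot order an unknown coffee
    return mask

def respond(model, featurize, message):
    """One turn: features -> RNN -> softmax -> action mask -> response."""
    masked = mask_entities(message)
    features = featurize(masked)           # bag-of-words + avg. embeddings
    probs = model.predict(features)[0]     # LSTM -> dense layer -> softmax
    probs = probs * action_mask()          # zero out forbidden responses
    response = RESPONSES[int(np.argmax(probs))]
    response = response.replace("<coffee>", memory.get("<coffee>", ""))
    if response.startswith("{make_drink"):
        pass  # perform the API call here and strip the token
    return response
```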
You can learn more about Hybrid code networks from the paper written by Jason D. Williams, Kavosh Asadi and Geoffrey Zweig from Microsoft Research and Brown University:
This system seemed promising. I decided to use the robo-barista application as a useful training ground before integrating it into Alquist. One significant problem remained: the system required training data, and we didn't have any. But I came up with a solution.
Where to find training data?
There are some datasets for goal-oriented dialogues. One example is the bAbI dialogue dataset, which contains dialogues about restaurant reservations. The problem was that I didn't want to create a bot for restaurant reservations. I wanted to make a dialogue for ordering coffee, so these data were mostly useless for my domain. Sadly, you will face this problem often. Even if you find data from the same domain, even a slight change of dialogue structure may make them problematic or impossible to use.
This forces you to either create a rule-based system and collect the data from its users, or generate your training data. The advantage of collecting data from users is that they are real data, and real data allow you to train a robust dialogue manager. The disadvantages are that you have to build the rule-based system first, at least some users have to use your system, and from these data the machine learning system learns only to replicate the rules already implemented in the rule-based system (which is a little disappointing when you already have the rule-based system).
I decided to create my own training data. A considerable advantage of this approach is that you can make any training data you want for your bot. This can sound like the ideal approach. However, it has its dangers, so I advise some caution. These data are not real data, and they can lack diversity. I propose a method of creating your training data that minimizes these dangers.
Make training data if you don’t have any!
My method of generating training data for a chatbot consists of the following steps. The first step is to create the graph of the dialogue with nodes of two types. The first type represents the responses of the bot, and the second type represents the messages of users. You start with the initial node. Its type depends on your application: it should be a response node if the bot starts the conversation, and a message node if the user initiates the communication. Alternate the two types of nodes, and branch the dialogue at message nodes only. You can see an example in the picture below. The graph can be in a textual or graphical form; the graphical form proved clearer to me.

Example of dialogue graph (Green nodes are response nodes and blue nodes are message nodes. Only message nodes can branch the dialogue)
The next step is to fill the message nodes with examples of messages the users can say. All messages in a node should have the same meaning. This means that you should write many possible sentences expressing, for example, the happiness of the user:
- I feel great.
- I am happy.
- I’m happy.
- …
You can also use the syntax "I (am|'m) feeling (happy|great)", which we developed for Alquist. I described the syntax in:
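For illustration, such a pattern simply expands into all of its combinations. Here is a small sketch of the expansion (my own reimplementation, not Alquist's actual code):

```python
# A small sketch expanding patterns like "I (am|'m) feeling (happy|great)"
# into every concrete variant.
import itertools
import re

def expand(pattern):
    """Split the pattern into literal parts and (a|b) alternatives,
    then produce every combination."""
    parts = re.split(r"\(([^)]*)\)", pattern)
    options = [part.split("|") if i % 2 else [part]
               for i, part in enumerate(parts)]
    return ["".join(choice) for choice in itertools.product(*options)]

print(expand("I (am|'m) feeling (happy|great)"))
# ['I am feeling happy', 'I am feeling great',
#  "I 'm feeling happy", "I 'm feeling great"]
# (spacing follows the pattern literally)
```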
The final step is to take the graph of the dialogue with filled-in messages and responses and to generate all possible transitions through the graph. This means you have to take all paths from the initial node to the terminal nodes and use all the examples in the message nodes. You can see an example of a dialogue graph and the generated training dialogues below:

Example of dialogue graph
- Bot: Hello, how are you?
- User: I feel great.
- Bot: That's great.

- Bot: Hello, how are you?
- User: I am happy.
- Bot: That's great.

- Bot: Hello, how are you?
- User: I am sad.
- Bot: I am sorry.

- Bot: Hello, how are you?
- User: I feel terrible.
- Bot: I am sorry.

- Bot: Hello, how are you?
- User: How are you?
- Bot: I am great.
This technique generates a considerable amount of data which you can use for training.
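Here is a minimal sketch of the generation step. The graph encoding is my own illustration, not our actual format; it models the rule that response nodes can branch into several message nodes, while each message node leads to a single response:

```python
# A minimal sketch of generating every training dialogue from the graph.

# A response node maps to the message nodes it branches into; each message
# node holds its example utterances and the single response that follows.
graph = {
    "Hello, how are you?": [
        (["I feel great.", "I am happy."], "That's great."),
        (["I am sad.", "I feel terrible."], "I am sorry."),
        (["How are you?"], "I am great."),
    ],
    "That's great.": [],
    "I am sorry.": [],
    "I am great.": [],
}

def generate(response, prefix=()):
    """Walk every path from `response`, emitting one dialogue per
    combination of example messages along the way."""
    prefix = prefix + (("Bot", response),)
    if not graph[response]:              # terminal node: emit the dialogue
        yield list(prefix)
        return
    for messages, next_response in graph[response]:
        for message in messages:         # use every example in the node
            yield from generate(next_response, prefix + (("User", message),))

for dialogue in generate("Hello, how are you?"):
    for speaker, line in dialogue:
        print(f"{speaker}: {line}")
    print()
```

Run on this example graph, the sketch prints exactly the five dialogues listed above.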
How does the robo-barista work?
We can get back to the robo-barista with all the pieces ready. We have a dialogue manager in the form of Hybrid code networks, and we have training data for it. But how does it control the robo-barista?
I selected Amazon Alexa as the platform for the voice interface because I am most familiar with it thanks to the Alexa Prize. The first step in processing the user's message is entity recognition. I try to find names of coffees in every message by simple string matching. If I find one, I save it to the memory and mask it in the message with the <coffee> token. The message is then passed to the trained recurrent neural network.
I have access to the API of the robo-barista. A single POST request containing the type of coffee is the only thing separating you from your coffee being added to the queue. The API call is inserted into one of the responses in the form of the token {make_drink <coffee>}. If this token is detected in the response, the API call is performed with the coffee previously saved in the memory. If there is no coffee in the memory, the response containing the API call is forbidden by the action mask. And that is all. This system was demonstrated for the first time at the KUKA industry day in the building of CIIRC. And it worked! (If you don't count the two following problems.)
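For illustration, the whole API call boils down to something like this; the endpoint URL and payload fields are placeholders, since the real robo-barista API is not public:

```python
# A hedged sketch of queueing a drink; the URL and payload are placeholders.
import requests

def make_drink(coffee):
    """Send one drink order to the robo-barista's queue."""
    response = requests.post("http://robo-barista.example/api/drinks",
                             json={"drink": coffee})
    response.raise_for_status()  # fail loudly if the order was not accepted
```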

(https://www.ciirc.cvut.cz/testbed-for-industry-4-0-and-national-center-for-industry-4-0-opened-its-doors/)
What went wrong?
There were two things about which I am not happy. I expected that the voice recognition system of Amazon Echo would be able to recognize names of coffees like espresso, lungo, ristretto or latte macchiato. Guess what. It wasn't. The combination of a Czech accent, names of coffees and a noisy environment caused problems. Sometimes we had to repeat the order. Latte macchiato was a special case: we were not able to order it at all. Luckily, ordering from the tablet still worked, so customers wanting a latte macchiato were served by the app on the tablet.
There are two steps to solve this problem. First, test the bot using the actual device. I tested the bot mainly on cappuccino and espresso, and with a text interface only, so I didn't know there was any problem. The second step is to use a custom slot type, in which you can define the list of all the coffees you would like to recognize. There is documentation on how to define it for Amazon Alexa, but a similar technique should work for any voice assistant.
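As a sketch, a custom slot type is just an enumeration of values in the skill's interaction model. Shown here as the Python dict you would serialize into the model's JSON; the type name is my own, the value list mirrors our coffee menu:

```python
# A minimal sketch of a custom slot type entry for the Alexa interaction
# model; the type name "COFFEE_TYPE" is illustrative.
coffee_slot_type = {
    "name": "COFFEE_TYPE",
    "values": [
        {"name": {"value": "espresso"}},
        {"name": {"value": "lungo"}},
        {"name": {"value": "ristretto"}},
        {"name": {"value": "latte macchiato"}},
    ],
}
```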
The second problem may seem innocent at first glance. I changed the formulation of the initial message from "Which coffee would you like?" to the more polite "Would you like a coffee?" What changed? The first formulation has two valid answers: "Make me <coffee>." or "I don't know." The second formulation has four valid responses: "Make me <coffee>," "I don't know," "Yes, I would like a coffee" and "No, thanks." My problem was that I didn't add the "yes/no" variants into the dialogue. The bot responded to both of these inputs with "I can make you espresso, lungo…" This is not a disaster, you always get your coffee, but the user experience is suboptimal. Good dialogue design is the key to a satisfied customer.
What next?
I see a possible improvement in the generation of training data. The generated dialogues can be similar to each other; in the worst case, two dialogues can differ in a single word. The question is whether it is useful to have both of them in the training data. And if not, how do we generate training dialogues from the dialogue graph in some smarter way that diversifies the dialogues as much as possible?
The main takeaway from this project for me is that Hybrid code networks and generated training dialogues proved to be useful. We plan to use these two systems heavily in the new Alquist for Alexa Prize 2018.
P.S. Work automation is a hot topic now, especially the question "What will people do if robots do all the repetitive work?" I don't have an answer to this, but I can provide some data from the field of conversational AI. Chatbots are slowly taking over the work of people in customer support. But on the other hand, chatbots created a new need for specialists in dialogue design. So chatbots are taking some work, but also creating new work.