I spent last year working on the chatbot Alquist for the Amazon Alexa Prize 16/17. I would like to share the current state of the field of conversational AI from my practical point of view. We learned all of this the hard way. I am sharing it so that you don’t have to.
Focus on the content, not only on machine learning
It is tempting to try to solve conversation by machine learning alone. Our initial idea was to collect as many message-response pairs as possible and use an information retrieval method to find the pair closest to the user’s message. We decided to use Reddit comments as our source. We hoped that with enough of these pairs we could recreate any conversation. But this approach failed to create interesting dialogues. We thought it was because we didn’t have enough pairs, so our next step was to limit the topic of the dialogue to movies only. This approach did not succeed either. The main problem was that the dialogues were not coherent. The AI jumped from topic to topic or referred to something that was never said. Individual answers could be considered OK, but together they didn’t form a coherent dialogue.
A big problem with this approach is that you have no control over what the AI will say. The AI learns only what is in the data. It will not be able to hold a sensible dialogue about any entity. It will tell you its opinion and then deny it in the next dialogue turn. It is not able to ask for your opinion and react to it in any way. It will not give you any useful information. Dialogue with such an AI is neither beneficial nor funny. It is just ridiculous.
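The retrieval idea described above can be illustrated with a minimal Python sketch. It is not the Alquist code: the post does not say which similarity measure was used, so the word-overlap (Jaccard) score, the function names, and the toy pairs are all assumptions made for illustration.

```python
def tokenize(text):
    """Lowercase a message and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_response(user_message, pairs):
    """Return the response of the stored pair whose message is closest to the input."""
    query = tokenize(user_message)
    _, best_response = max(
        pairs, key=lambda pair: jaccard(query, tokenize(pair[0]))
    )
    return best_response


# Toy data for illustration only; the real source was Reddit comments.
PAIRS = [
    ("what is your favorite movie", "I really like The Matrix."),
    ("do you like pizza", "Pizza is great."),
]

print(retrieve_response("Tell me your favorite movie", PAIRS))
```

Each turn is answered in isolation, with no memory of earlier turns, which is one way to see why the resulting dialogues were not coherent.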
We used machine learning for dialogue management only. Machine learning does not generate responses for us (with one exception). It classifies intents, finds entities, and detects sentiment. The content of the dialogue was handcrafted by us. This way we can ensure that the dialogue is engaging and coherent, and if it is not, we can tweak it.
Seq2Seq model doesn’t work
This is closely related to the previous point, but I consider it important enough to give it a point of its own. The paper A Neural Conversational Model describes how to apply the standard Seq2Seq architecture used for machine translation to dialogues. It contains really astonishing examples of dialogues produced by this model, and we were amazed when we saw them for the first time. However, when we tried to use the model ourselves, it didn’t produce results of this kind. The model’s responses were much shorter and more generic. We used movie subtitles and Reddit comments as training datasets. We spent at least two months trying to make it work without any significant success.
We do use this model in the end. We call it “chitchat”, and it is used only if all other methods of response generation fail. “Chitchat” usually responds to questions like “How are you?” or “What will you do tonight?” We allow only three responses in a row from this model, and then we recommend a new topic to the user. This rule is important; otherwise the user gets stuck in a dull and boring conversation about nothing. “Chitchat” is our only machine-learning-based method that generates answers.
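The three-in-a-row rule can be sketched in a few lines of Python. This is an illustration, not the Alquist implementation: the class name and the two callables (`chitchat_model`, `suggest_topic`) are hypothetical stand-ins for the real model and topic recommender.

```python
MAX_CHITCHAT_IN_A_ROW = 3  # the limit stated in the text


class ResponseSelector:
    """Falls back to chitchat, but at most three times in a row."""

    def __init__(self, chitchat_model, suggest_topic):
        self.chitchat_model = chitchat_model  # hypothetical callable: message -> reply
        self.suggest_topic = suggest_topic    # hypothetical callable: () -> reply
        self.chitchat_streak = 0

    def respond(self, message, scripted_reply=None):
        # A reply from any other method takes priority and resets the streak.
        if scripted_reply is not None:
            self.chitchat_streak = 0
            return scripted_reply
        # Chitchat limit reached: steer the user to a new topic instead.
        if self.chitchat_streak >= MAX_CHITCHAT_IN_A_ROW:
            self.chitchat_streak = 0
            return self.suggest_topic()
        self.chitchat_streak += 1
        return self.chitchat_model(message)


selector = ResponseSelector(
    chitchat_model=lambda message: "I am fine.",
    suggest_topic=lambda: "Would you like to talk about movies?",
)
replies = [selector.respond("How are you?") for _ in range(4)]
print(replies)  # three chitchat replies, then a topic suggestion
```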
Build dialogues with premade pieces
Our final approach was to represent dialogue as state automata. We have an automaton for each topic (sports, movies, music, etc.). Each state of an automaton has an “execute” function and a “transition” function. In the “execute” function the state can, for example, process the user’s input or access some API. The “transition” function then decides which state comes next, based on the result of the “execute” function.
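A minimal Python sketch of such a state might look as follows. Only the “execute” and “transition” functions come from the description above; the class names, the `context` dictionary, and the `step` helper are assumptions made for illustration.

```python
class State:
    """One dialogue state with the two functions described above."""

    def execute(self, user_input, context):
        """Process the input (or call an API) and return a result."""
        raise NotImplementedError

    def transition(self, result):
        """Return the name of the next state, or None to finish."""
        raise NotImplementedError


class AskFavoriteMovie(State):
    def execute(self, user_input, context):
        context["say"] = "What is your favorite movie?"
        return "asked"

    def transition(self, result):
        return "remember_movie"


class RememberMovie(State):
    def execute(self, user_input, context):
        context["movie"] = user_input
        context["say"] = f"I like {user_input} too."
        return "stored"

    def transition(self, result):
        return None


def step(states, state_name, user_input, context):
    """Run one state and return the name of the next one."""
    state = states[state_name]
    result = state.execute(user_input, context)
    return state.transition(result)


STATES = {"ask_movie": AskFavoriteMovie(), "remember_movie": RememberMovie()}
context = {}
next_state = step(STATES, "ask_movie", None, context)
print(context["say"])
next_state = step(STATES, next_state, "The Matrix", context)
print(context["say"])
```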
This simple design of our states allows us to do various things. During development we realized that we were creating many of the same states over and over again, so we decided to prepare the most frequently used states in advance and build dialogues from them. This sped up development and made changes much easier. If we wanted to add yes/no recognition to an automaton, we just used the “YesNo” state. If we wanted to improve how yes/no is recognized in the user’s messages, we had to do it in only one place.
Our premade states are:
- Say a sentence, wait for the user’s message, and transition to the next state
- Say a sentence, don’t wait for the user’s message, and transition to the next state
- Say sentence A on every n-th visit to the state and sentence B otherwise, then transition to the next state (this is useful when you want to remind the user of something, like “Remember that you can say ‘Let’s talk about sports.’ Anyway let’s go back to the books. What is your favorite?”)
- Wait for the user’s message and transition to the next state
- Recognize yes or no (agreement or disagreement) in the user’s last message and transition to one of two next states based on the result
- Recognize an entity in the user’s last message
- Switch to a different automaton
These states made up approximately 60-80% of our automata. The remaining states are rare, so it didn’t pay off to prepare them in advance.
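As an illustration of a premade state, here is a rough Python sketch of a reusable “YesNo” state. The word lists are tiny placeholders and the `unknown_state` fallback is my own addition for the sketch; the real recognizer was more involved.

```python
YES_WORDS = {"yes", "yeah", "yep", "sure", "ok"}  # placeholder vocabulary
NO_WORDS = {"no", "nope", "nah"}                  # placeholder vocabulary


class YesNo:
    """Premade state: route to one of two next states based on agreement."""

    def __init__(self, yes_state, no_state, unknown_state):
        self.yes_state = yes_state
        self.no_state = no_state
        self.unknown_state = unknown_state  # assumption: where to go if unclear

    def execute(self, user_input, context):
        cleaned = user_input.lower().replace(",", " ").replace(".", " ")
        words = set(cleaned.split())
        if words & YES_WORDS:
            return "yes"
        if words & NO_WORDS:
            return "no"
        return "unknown"

    def transition(self, result):
        targets = {"yes": self.yes_state, "no": self.no_state}
        return targets.get(result, self.unknown_state)


# The same class is reused wherever a dialogue needs a yes/no branch.
likes_sports = YesNo("ask_favorite_sport", "suggest_other_topic", "ask_again")
print(likes_sports.transition(likes_sports.execute("Yeah, sure.", {})))
```

Because every automaton shares this one class, improving the recognition inside `execute` improves all dialogues at once.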
Divide dialogues as much as possible for easy maintenance
We created a dialogue about movies. It is huge, around 300 states. Its development took one person on our team at least a month. Maintaining such a dialogue is a nightmare. The dialogue covers favorite movies, TV series, actors, directors, and genres. It also contains some chitchat about movies in general, like where the user usually watches movies and so on. And all of this is mixed together in a single huge automaton with a lot of transitions.
We also developed a dialogue about fashion. It is about favorite clothes, make-up, and hairstyles, and it can, for example, give some tips on how to take photos of a new outfit. All of these parts are divided into smaller state automata, each containing around 20 states (the biggest one has 60 states). The development took around 14 days for a single person. Maintenance is a piece of cake because you only need to debug a single small automaton. My advice is to split topics as much as possible.
The goal isn’t to give as much information as possible but to entertain the user
Our initial approach to dialogue design was to ask the user for his favorite entity (movie, video game, sport…), give him some information about the entity (which genre it belongs to, an actor who starred in it…), ask him some generic question (“Why do you like it?”, “What is your favorite part?”, “Do you play this game often?”…), and then ask him for his next favorite entity. The user was in a conversation loop, and the duration of the conversation increased. However, this works only up to a point. The user does only two rounds and then becomes bored. The main causes are: the structure of the dialogue is the same or very similar every time; we give him information about his favorite entity that he probably already knows (because it is his favorite); and we don’t react to the answers to our generic questions (we replied only with “I see.” or “It is interesting.”).
The first and second problems are quite easy to solve; the third one is challenging. You solve the first one by making more variants of the dialogue. For example, ask the user for his favorite genre and movie, then transition to his favorite director in that genre, or ask him whether he cares about online movie ratings. You will need a really good conversation designer. Such a person is usually not a programmer, so you will also need a way for him to create dialogues with as little programming as possible.
We found that giving the user trivia and fun facts solves the second problem. Don’t give him only raw data; put it into perspective. A great source of fun facts is Reddit (https://www.reddit.com/r/todayilearned or https://www.reddit.com/r/Showerthoughts), which we used a lot. Is the user talking about The Matrix? Search the todayilearned subreddit for the word “matrix” and insert “Did you know? Will Smith turned down the role of Neo in The Matrix (1999). He instead took part in the film “Wild Wild West” which was a huge flop at the box office.” into the conversation. This is much better than the generic “The Matrix was released in 1999.”, isn’t it?
The problem of not reacting to answers is hard to solve, because the user can answer with anything. There is an almost infinite number of possibilities. He can give you a relevant answer that you are not prepared to react to:
– Alquist: “Do you like this movie?”
– User: “Yes, but my girlfriend hates it.”
– Alquist: “I see!”
He can answer you and ask you a question at the same time:
– Alquist: “Do you like this movie?”
– User: “Yes, I like this movie. Why do you ask?”
– Alquist: “I see!”
Or he can say something completely irrelevant:
– Alquist: “Do you like this movie?”
– User: “I had an egg for breakfast.”
– Alquist: “I see!”
One possibility is to ask fewer questions, which I don’t recommend, because users like being asked for their opinion. You can prepare responses to the most common answers, and you can try to detect irrelevant answers. However, how to react correctly to any user input is still an open question.
Make the dialogue independent of the user (sometimes)
Sometimes you want to lead the user to a certain point in the conversation. For example, we wanted to lead the user through our initial chat at the beginning of the conversation. We asked him how his day was or what his name was. If he replied with nonsense, we just continued with something like: “Ok, you don’t have to tell us. It is not so important information after all.” You can lead the user to more interesting parts of the dialogue this way. However, don’t overuse it.
If you ask “Do you have any favorite actor?” don’t expect only a name. Expect “yes” or “no” too
This problem probably appeared because our team doesn’t have any native English speakers. We used the question formulation from the headline and expected that the user would tell us the name of his favorite actor. This happened, but there were also cases in which the user answered “yes”, and we answered “I don’t know this actor.” After we discovered this problem, we added the reaction “Which one?” to all such states. However, you also have to count on the few users who answer “no”. Some form of recommendation is a good idea in such cases to push the dialogue forward.
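The three cases can be sketched in Python as follows. This is only an illustration under simple assumptions: the actor list, the accepted yes/no phrasings, and the wording of the recommendation are hypothetical, and a real system would use a proper entity recognizer.

```python
KNOWN_ACTORS = {"keanu reeves", "will smith"}  # hypothetical entity list
YES_ANSWERS = {"yes", "yeah", "yes i do", "i do"}
NO_ANSWERS = {"no", "nope", "no i don't", "not really"}


def react_to_favorite_actor(answer):
    """React to the answer to "Do you have any favorite actor?"."""
    text = answer.lower().strip(" .!?")
    if text in YES_ANSWERS:
        return "Which one?"
    if text in NO_ANSWERS:
        # Recommend something to push the dialogue forward.
        return "Then let me recommend one. Have you seen anything with Keanu Reeves?"
    for actor in KNOWN_ACTORS:
        if actor in text:
            name = actor.title()
            return f"{name} is great. What is your favorite movie with {name}?"
    return "I don't know this actor."


print(react_to_favorite_actor("Yes."))
print(react_to_favorite_actor("No"))
print(react_to_favorite_actor("I love Keanu Reeves"))
```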
Look into the data for unexpected user inputs and new topic suggestions
This advice is closely connected to the previous one. Look into the data. Examine the whole conversations that users have with the system. How do they answer the questions? Which topics are the most popular? Is everything working as expected? Does your topic classification work correctly for all inputs? Do you recognize all entities?
All of this is very important, so build some way to collect data about conversations. Visualize the data, look into it often, and tweak the system. You will probably have a lot of data. Clustering helps in such cases. We used clustering to group similar user answers. This let us prepare responses to the most common answers, and it also helped us annotate our datasets: we didn’t have to label each user message, only individual clusters.
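To show the idea of grouping similar answers, here is a small Python sketch. It uses a simple greedy clustering by word overlap, which is a stand-in chosen for brevity and not necessarily the algorithm we used; the threshold and the sample answers are made up.

```python
def tokens(text):
    """Lowercase, strip surrounding punctuation, and split into a word set."""
    return set(text.lower().strip(" .!?").split())


def similarity(a, b):
    """Jaccard word overlap between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_answers(answers, threshold=0.5):
    """Greedy clustering: put each answer into the first cluster whose
    first member is similar enough, otherwise start a new cluster."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if similarity(tokens(answer), tokens(cluster[0])) >= threshold:
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    # Largest clusters first: these are the answers worth handling by hand.
    return sorted(clusters, key=len, reverse=True)


ANSWERS = ["yes I do", "I do", "yes I do.", "no", "not really", "no."]
for cluster in cluster_answers(ANSWERS):
    print(len(cluster), cluster)
```

Labeling then happens once per cluster instead of once per message.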
End every response with a question or a suggestion of what to do next
THIS ONE IS SUPER IMPORTANT! Every time you say something to the user, end it with a question or a suggestion of what to do next. This approach has several benefits. It helps keep the conversation going. It also lowers the number of possible user messages, because most of the time the user will do what you suggest. You can also lead him to a conversational topic that he hasn’t tried yet.
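One simple way to enforce this rule is a final check on every outgoing response. The Python sketch below is an assumption about how that could be done, with hypothetical follow-up prompts; it only checks for a trailing question mark, which is a crude test.

```python
import random

FOLLOW_UPS = [
    "What would you like to talk about next?",
    "Remember that you can say 'Let's talk about sports.'",
]  # hypothetical prompts


def with_follow_up(response, rng=random):
    """Append a follow-up unless the response already ends with a question."""
    stripped = response.rstrip()
    if stripped.endswith("?"):
        return response
    return f"{stripped} {rng.choice(FOLLOW_UPS)}"


print(with_follow_up("Do you like it?"))
print(with_follow_up("I see."))
```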
Be ready to jump out of context and return to it
We included a question-answering module and a so-called “single-line answers” module in our system. The question-answering module answered general-knowledge questions, and the “single-line answers” module handled questions for which we had hardcoded answers, like “What is your name?” or “How old are you?” Both modules could be used at any time during the conversation, and it originally looked like this:
– Alquist: “Have you been to Brazil?”
– User: “What is the population of Brazil?”
– Alquist: “It is around 200 000 000.”
– User: “Well ok.”
The original question got lost, so we changed the system to return to it after answering:
– Alquist: “Have you been to Brazil?”
– User: “What is the population of Brazil?”
– Alquist: “It is around 200 000 000. Anyway, I was saying, have you been to Brazil?”
– User: “No, not yet.”
– Alquist: “Brazil is a fascinating country…”
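The return to the interrupted question can be sketched in Python like this. The hardcoded answer table, the function names, and the `continue_dialogue` callback are hypothetical; the point is only that the pending question is remembered and repeated after the side answer.

```python
SINGLE_LINE_ANSWERS = {
    "what is your name": "My name is Alquist.",
}  # hypothetical hardcoded answers


def answer_side_question(message):
    """Return a hardcoded answer, or None if the message is not a known question."""
    return SINGLE_LINE_ANSWERS.get(message.lower().strip(" ?!."))


def respond(message, pending_question, continue_dialogue):
    """Answer a side question and then return to the pending question."""
    side_answer = answer_side_question(message)
    if side_answer is not None:
        return f"{side_answer} Anyway, I was saying, {pending_question}"
    # Not a side question: let the current dialogue handle the message.
    return continue_dialogue(message)


def brazil_dialogue(message):
    return "Brazil is a fascinating country."


print(respond("What is your name?", "have you been to Brazil?", brazil_dialogue))
print(respond("No, not yet.", "have you been to Brazil?", brazil_dialogue))
```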
Make responses as variable as possible
One way to avoid boring the user with the same answers over and over is to add variance to your responses. We load responses from YAML files, in which we use our own syntax. Thanks to it, you can write several responses, each with several variants, on a single line. It looks like this:
book_not_enjoy:
  - "(|Hmm.) I didn't (enjoy|like) (it|that one) (|very) much."
  - "That one didn't impress me (|very) much."
This produces responses such as:
- Hmm. I didn’t enjoy it much.
- I didn’t enjoy that one much.
- I didn’t like that one very much.
- That one didn’t impress me much.
- That one didn’t impress me very much.
This saves time because you don’t have to write out whole sentences; you just write variants of phrases.
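Expanding such a template is straightforward. The Python sketch below is an illustration rather than our actual loader: it assumes that groups are written as `(a|b)`, that an empty option means "nothing", and that groups are never nested.

```python
import itertools
import re

GROUP = re.compile(r"\(([^()]*)\)")  # one non-nested "(a|b|c)" group


def expand(template):
    """Return every sentence that a template like "(a|b) c (|d)" can produce."""
    # re.split with one capturing group alternates literal text (even
    # indexes) and group bodies (odd indexes).
    parts = GROUP.split(template)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    sentences = ("".join(combo) for combo in itertools.product(*choices))
    # Empty options leave extra spaces behind; collapse them.
    return [" ".join(sentence.split()) for sentence in sentences]


variants = expand("(|Hmm.) I didn't (enjoy|like) (it|that one) (|very) much.")
print(len(variants))
print(expand("That one didn't impress me (|very) much."))
```

At run time you would pick one variant at random, for example with `random.choice(variants)`.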
ASR is not perfect
ASR (automatic speech recognition) is not perfect. It has difficulties mainly in noisy environments or when the user is not a native English speaker. This was the case for our whole team. I really struggled with phrases like “Let’s talk about movies.”, which was recognized as “What’s talk about movies.” and classified by the dialogue manager as input for our question answering rather than as the “movies” intent (I had to use the phrase “Tell me about movies”, which worked fine).
ASR errors were responsible for a lot of our problems with intent and entity recognition. We tried to solve this by looking at confidence scores. If the confidence score was not above our threshold, we asked the user to repeat his message (“I didn’t understand you. Can you repeat it?”).
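The check itself is tiny, as this Python sketch shows. The threshold value and the assumption that the score lies between 0 and 1 are mine; the post does not state the real numbers, and `handle_message` is a hypothetical stand-in for the rest of the dialogue manager.

```python
CONFIDENCE_THRESHOLD = 0.6  # hypothetical value, assuming scores in [0, 1]


def handle_asr_result(text, confidence, handle_message):
    """Ask the user to repeat when the ASR confidence is not above the threshold."""
    if confidence <= CONFIDENCE_THRESHOLD:
        return "I didn't understand you. Can you repeat it?"
    return handle_message(text)


print(handle_asr_result("what's talk about movies", 0.3, str.upper))
print(handle_asr_result("tell me about movies", 0.9, str.upper))
```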
We tried A/B testing later in the development. We disabled the confidence-score check for half of the users. Do you know what happened? Ratings for both groups remained almost the same. So either solve ASR errors with some more clever method than ours, or don’t bother with them at all. Amazon is improving ASR all the time, so this problem may not exist in the future.
Filter all responses from any source you don’t control for profanities
Profanities… this was the most common reason why Amazon stopped us. Saying profanities was not our intention, of course. The reason was that we used a lot of text from the internet: Reddit comments for chitchat, fun facts from Reddit, etc.
We used a combination of two ways to filter texts containing profanities. The first was a simple string comparison against a list of profanities. The list combined several profanity lists that we found on the internet, and later we added a list provided by Amazon. It was several hundred phrases long, and I had never heard many of them before.
This worked up to a point (although we had to add many more phrases that we discovered over time). But this approach doesn’t detect hate speech that contains no profanities. One such example is the response “Kill your parents.” So we used machine learning to train a hate-speech detector. We trained it on two hate-speech datasets from Twitter. You will surely be able to find them. This improved detection, but there are still some problems, which we are gradually removing. I recommend spending time on developing a way to check all texts that you didn’t write yourself.
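The list-based filter can be sketched in Python as follows. This is an illustration only: the two blocklist entries are harmless placeholders, and matching whole words with a regular expression is my assumption about how to do the string comparison, chosen so that words merely containing a listed phrase are not blocked.

```python
import re

BLOCKLIST = {"darn", "son of a gun"}  # placeholder phrases, not a real list


def build_pattern(phrases):
    """Compile one case-insensitive pattern matching any phrase as whole words."""
    # Longest phrases first so that multi-word phrases are tried before
    # shorter ones.
    escaped = sorted((re.escape(p) for p in phrases), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


PATTERN = build_pattern(BLOCKLIST)


def is_clean(text):
    """Return True if the text contains no blocklisted phrase."""
    return PATTERN.search(text) is None


print(is_clean("Well, darn it."))      # False: contains a listed word
print(is_clean("I darned my socks."))  # True: no whole-word match
```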
These are my 13 practical recommendations for anyone trying to create their own conversational AI. They are based on the experience that I and our team gathered over one year of competing in the first Amazon Alexa Prize 16/17. I hope that this list will help you create something really clever and human-like, because one of my dreams is to have a truly intelligent conversation with an AI. Good luck, and keep pushing the frontiers!