Introduction
Welcome back to ‘From the Computer to the Clinic’ - a newsletter about computational biology and its contributions to biomedical research.
In this newsletter, we explore how computational biology research can drive clinical progress. By sharing success stories from one disease area or domain of research, we aim to inspire the use of these successful approaches in other diseases and research areas as well.
If you haven’t already, you can subscribe to this newsletter or share it with friends and colleagues.
Part V
This is part V of this series - if you missed parts I-IV, you can find them on the home page of this newsletter.
In 2015, a team of researchers bridging the academic (University of Montreal, Georgia Institute of Technology) and business worlds (Microsoft, Facebook) applied the RNN paradigm described in the previous part of this series to the task of generating conversational dialogue.
The key idea is that you can train a neural network to predict which words should come next, given the opening phrase in a dialogue. The network uses the context of the opening phrase to generate the first word of the response, then uses the opening phrase plus the first word to generate the second word of the response, and so on. The RNN architecture is well suited to this task because it is designed to take in an ordered series of inputs - in this case the words of a sentence, encoded in numeric form (a word embedding vector, see part IV) - and predict what should come next. Each successive word in the sentence is influenced by all the words that came before, because each time a new word flows into the network, the network modifies that word’s embedding vector with the stored information from previous words.
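To make that loop concrete, here is a minimal sketch in Python (using numpy). Everything in it is an illustrative assumption - the tiny vocabulary, the randomly initialized weight matrices, and their sizes - and it is not the 2015 paper’s actual model, but it shows the shape of the idea: each incoming word updates the network’s stored state, that state is used to score every word in the vocabulary, and the chosen word is fed back in to produce the next one.

import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; a real model's vocabulary holds every word seen in training.
vocab = ["yeah", "i'm", "on", "my", "way", "now", "ok", "good", "luck", "!", "<end>"]
word_to_id = {w: i for i, w in enumerate(vocab)}

embed_dim, hidden_dim = 8, 16
E = rng.normal(size=(len(vocab), embed_dim))      # one embedding vector per word
W_xh = rng.normal(size=(embed_dim, hidden_dim))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights (the network's 'memory')
W_hy = rng.normal(size=(hidden_dim, len(vocab)))  # hidden-to-output weights

def read_word(h, word):
    # Update the network's stored state with one incoming word, then score
    # every word in the vocabulary as a candidate next word.
    x = E[word_to_id[word]]
    h = np.tanh(x @ W_xh + h @ W_hh)
    return h, h @ W_hy

# Step 1: feed in the opening phrase, word by word.
h = np.zeros(hidden_dim)
scores = np.zeros(len(vocab))
for word in "yeah i'm on my way now".split():
    h, scores = read_word(h, word)

# Step 2: generate a response; each chosen word is fed back in as the next input.
response = []
while len(response) < 10:
    next_word = vocab[int(np.argmax(scores))]
    if next_word == "<end>":
        break
    response.append(next_word)
    h, scores = read_word(h, next_word)

print(" ".join(response))  # gibberish here, because the weights are random rather than trained

In a trained model, those weight matrices would have been fitted on millions of real conversations, so the scores would favor sensible continuations instead of gibberish.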
Machine learning models require a lot of data to train, and training a dialogue-generating RNN model was only possible because of a major societal change that took off in the early 2000s – the rise of social media. Users of social media applications like Facebook, Twitter, Reddit, and others generate an immense number of real-world conversations, and these conversations are often mined for training data. In the case of this specific model, the training data consisted of 127 million conversational segments from Twitter spanning June – August 2012.
The use of data from social media apps and other websites to train large machine learning models is very common, and controversial. There are concerns that these models will spit out private information about individuals whose data is included in the training set, or incorporate explicit material if the training dataset is not carefully filtered, among other issues. Though it is not something I will emphasize much more in this series, it is an issue worth following and something this field of research has to reckon with.

The RNN approach used in the 2015 paper is more powerful than previous methods because of its ability to keep track of extended conversations – i.e., not just responding to a single message, but to a string of past messages that provide important context. The authors present the example of a three-turn conversation containing a context, message, and response phrase: “Because of your game?” [context] “Yeah I’m on my way now” [message] “Ok good luck!” [response].
Why is the context important? Imagine taking the role of the responder in this conversation, but you are only given the message and not the context. Without the context, you would have no idea where your conversational partner is going, and it is hard to tell what kind of response would be appropriate. ‘Ok good luck!’ makes sense in the context of a game, but if the conversation partner were going to a wedding - especially their own wedding - this would be odd at best, and insulting at worst. ‘See you soon’ carries no risk of insult, but only works if you and your conversational partner are going to be in the same place. In the case of the wedding, hopefully you were invited too. But there are a lot of other cases - for example if the other person is going to work - where this kind of response would not make much sense.
Clearly, context is important, and to take advantage of it, the researchers had to build context into their training data. The Twitter training data was gathered in the form of context-message-response ‘triples’, 127 million of them, just like the example from the above paragraph: “Because of your game?” [context] “Yeah I’m on my way now” [message] “Ok good luck!” [response].
This is what the training data looks like to a human, but to the computer, each word in a conversation triple is represented as an embedding vector (how these vectors are generated is quite complex and beyond the scope of this article, but worth reading into if you’re interested). A given word’s embedding vector - say the one corresponding to the last word in the message - flows through the RNN, and the values in the embedding vector are modified to take into account information from past words in the context and message. In the end, a final output vector is spit out of the network.
This output vector – again, just a list of numbers – determines what the next word in the response should be. This output vector is a bit special, though: each element in the vector corresponds to a different word in the vocabulary of all possible words, where the vocabulary is made up of all the distinct words from all the Twitter messages used in training. The model picks the next word by choosing the element with the highest value.
It is like a secret code. Say that this is the output vector from the neural network:
[0.1, 0.03, 0.76, 0.2, …]
And each number in this list corresponds to a word:
[Apple, Aardvark, and, along, …] #The full length of this list is the number of unique words encountered across all of the training messages. The model can only spit out words that it has seen before.
In this case, the model would spit out ‘and’ as the next word of the response because its corresponding value is highest. But based on whatever comes before in the message and context, this output vector will look different, and a different word will have the highest value.
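In code, picking the word is a single step: find the position of the largest value in the output vector and read off the word stored at that position. (The four-word vocabulary below is just the toy example from above, not a real model’s vocabulary.)

import numpy as np

output_vector = np.array([0.1, 0.03, 0.76, 0.2])
vocabulary = ["Apple", "Aardvark", "and", "along"]

next_word = vocabulary[int(np.argmax(output_vector))]
print(next_word)  # prints 'and', because 0.76 is the largest value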
After training the model on a large set of context-message-response triples (see part III of the series for more detail on how neural networks are trained), you can provide the model with a set of context-message pairs and have it generate responses.
Specifically, the researchers fed in a ‘test set’ of context-message pairs that the model had not seen in training and asked it to generate responses. This ‘test set’ also includes the actual responses that came with these context-message pairs, so the researchers can see how well the model replicates the true responses (in addition to observing whether the responses, close to the true responses or not, sound like sensible human speech).
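Sketched out, this evaluation setup looks something like the snippet below. The generate_response function is only a stand-in for the trained model (its canned output is the example response discussed next), and the ‘true response’ shown here is invented rather than taken from the paper’s data; the point is simply the shape of the comparison between generated and true responses.

def generate_response(context, message):
    # Stand-in for the trained RNN; a real model would generate this word by word.
    return "that's good! nothing much just listening to music and you?"

test_set = [
    {"context": "i know! what have you been up to?",
     "message": "nothing really. just enjoying summer and working. what about you?",
     "true_response": "not much! just working and enjoying the sunshine."},  # invented for illustration
]

for triple in test_set:
    generated = generate_response(triple["context"], triple["message"])
    print("generated:", generated)
    print("true     :", triple["true_response"])
    # The researchers score matches with automatic metrics and human judges;
    # a crude word-overlap count gives the flavor of an automatic comparison.
    shared = set(generated.split()) & set(triple["true_response"].split())
    print("shared words:", sorted(shared))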
You can see some examples of responses in table 5 of the paper. All of the responses sound like a human wrote them, though some are more reasonable than others. Here, for example, is a response that mostly makes sense (though there’s an extra ‘and you?’ at the end that sounds a bit unnatural):
Context: “i know! what have you been up to?”
Message: “nothing really. just enjoying summer and working. what about you?”
Response: “that’s good! nothing much just listening to music and you?” (this is computer generated)
But there are other cases that are much less sensible:
Context: “today i will walk for miles. clear sky clear mind.”
Message: “love walking. it’s too hot to walk this month. i haven’t moved. frozen. heat freeze.”
Response: “it’s too hot for you to get up to honey if you have frozen yogurt to you.” (this is computer generated)
You can imagine why the computer would have a hard time here - the input message is very choppy, and the phrase ‘heat freeze’ is not common in human conversation (maybe this is why the computer has thrown frozen yogurt into its response).
Despite the flaws, it is still quite impressive that a computer can take in messages, convert them to lists of numbers, and spit out a series of codes representing individual words to produce a sentence in response. Even the example of a bad response still looks like a sentence - it is not just a random set of words - and could well have been produced by a very tired (or intoxicated) human.
It’s also important to keep in mind that this paper came out in 2015. Since it was published, the machine learning architectures used for response generation have evolved (from RNNs to long short-term memory networks to transformers – the architecture underlying ChatGPT) and become much more impressive text generators.
[If you want to dig into how the model performed beyond the several examples featured here, the paper has a lot more detail.]
In addition to architecture improvements, the way that researchers have been thinking about this response generation problem (and what information can help a machine learning model learn to generate conversational text), has also evolved. A more recent 2018 paper demonstrates this. The big idea of this paper is that you can strengthen responses by allowing the model to draw from external facts. Or in the researchers’ own words, “the key idea is that we can condition responses not only based on conversation history (Sordoni et al., 2015), but also on external ‘facts’ that are relevant to the current context”. Note that the authors of this paper are directly citing and extending the work of the authors from the 2015 paper we just discussed.
To demonstrate the potential value of external facts, the authors (this is another academic-industry collaboration, this time between the University of Southern California and Microsoft) include several examples of interactions involving restaurants.
Here’s one: “I’m at New Wave Cafe”, and the corresponding response: “Try to get to Dmitri’s for dinner. Their pan fried scallops and shrimp scampi are to die for”.
In order for a model to answer in this way, it needs to understand that New Wave Cafe is a restaurant in a particular location, recognize that Dmitri’s is another restaurant nearby, and know that Dmitri’s has menu items like pan fried scallops and shrimp scampi that are very popular. Earlier models would likely spit out a much simpler response like ‘Enjoy the meal!’. That would be perfectly legitimate, but clearly not as intelligent.

In order to recognize and incorporate so-called ‘named entities’ like New Wave Cafe or Dmitri’s into responses, the authors needed to go beyond conversation data alone. To train their model, the researchers used data from Twitter (23 million conversations), as in the 2015 paper, but also from Foursquare (1.1 million ‘tips’), an app that provides personalized recommendations for local restaurants and other businesses.
The model is designed so that when a message is presented, and that message contains a named entity (like New Wave Cafe), it can draw on the information from the Foursquare data to gather more information about the Cafe (its menu, its ratings, other similar restaurants, etc.) and incorporate this information into the response. You can think of the two training datasets as teaching the model two distinct skills - the Twitter data gives the model the ability to generate natural-sounding human conversation, and the Foursquare data gives the model the ability to speak intelligently about specific people, places, and objects in the world.
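As a rough sketch of that design (not the 2018 paper’s actual architecture - the facts store, the entity detector, and the response function below are all hypothetical stand-ins), the flow looks something like this: spot a named entity in the incoming message, pull the relevant facts, and hand both the message and the facts to the generator.

# Hypothetical mini facts store, standing in for the Foursquare tips data.
facts_db = {
    "New Wave Cafe": [
        "New Wave Cafe is a restaurant.",
        "Dmitri's is a nearby restaurant known for pan fried scallops and shrimp scampi.",
    ],
}

def find_named_entities(message):
    # Toy entity detector: just look for names we already have facts about.
    return [name for name in facts_db if name.lower() in message.lower()]

def generate_grounded_response(message, facts):
    # Stand-in for the trained model, which would encode the message and the
    # retrieved facts together and then generate a response word by word,
    # as in the RNN sketch earlier in this post.
    if facts:
        return "Try to get to Dmitri's for dinner. Their pan fried scallops and shrimp scampi are to die for."
    return "Enjoy the meal!"

message = "I'm at New Wave Cafe"
facts = [fact for entity in find_named_entities(message) for fact in facts_db[entity]]
print(generate_grounded_response(message, facts))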
Now that we have a sense of how AI chatbots work under the hood, the final part of this series will bring us back to Penn Medicine’s maternal health care system and show how the decades of AI innovations outlined in the series have made Penny the postpartum care chatbot possible.
Stay tuned for part VI…