Generative AI is fundamentally changing conversation design, but deploying an AI-enabled chatbot still requires a human-led process of hardcoding flows for different conversational scenarios. We’re not yet at the point of leaving the logic of complex interactions to the LLM; it’s unclear if or when we’ll arrive.
A roadblock to fast, performant chatbot design is the number of side roads and strange turns inherent to human conversations. Humans can navigate those interactions without thinking, but charting the path with the detail a chatbot requires is time-consuming, arduous work.
Luckily, companies already have vast libraries of successful customer conversations in call transcripts, bot logs, email support tickets, and other sources. With the right human-in-the-loop workflow, those examples should give the LLM what it needs to deduce the logic and design the chatbot flow. This article outlines that process, beginning with specific subsets of data and ending with a flow that could feed a generative playbook, a text-to-flow converter, or a chatbot platform that can build flows from natural language.
Validating the Concept: Single-Use Case Testing
To begin, we need to validate whether our goal is feasible for a single kind of customer request. In this example, we’ll test the workflow on conversations that flag a missing package.
We can search the sea of transcripts for missing-package conversations in a couple of ways. The first is semantic search, a data engineering tactic underlying RAG solutions, which surfaces transcripts containing similar phrases. If we add a custom example to the stash ("I didn't receive my package yet"), we can automatically surface semantically similar requests. We'll select a handful of conversations deemed similar and move them to the stash.
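To make the mechanics concrete, here is a minimal sketch of that semantic search step using the sentence-transformers library; the model name, the sample transcripts, and the cut-off of ten matches are illustrative assumptions rather than the exact setup used.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder examples; in practice these come from call transcripts, bot logs,
# and email support tickets.
transcripts = [
    "Hi, my order says delivered but nothing arrived at my door...",
    "I want to change the shipping address on order 4417...",
    "It's been two weeks and my package still hasn't shown up...",
]
query = "I didn't receive my package yet"

# Embed the query and every transcript, then rank transcripts by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
transcript_embs = model.encode(transcripts, convert_to_tensor=True)
scores = util.cos_sim(query_emb, transcript_embs)[0]

# Review the closest matches by hand before moving them to the stash.
ranked = sorted(zip(transcripts, scores.tolist()), key=lambda pair: pair[1], reverse=True)
stash = [text for text, score in ranked[:10]]
```

The ranked matches would still be curated manually before landing in the stash; the search only narrows the field.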
This test data determines the quality of our experiment, so we want to make sure it's accurate and specific. Searching by semantic similarity alone, we risk including conversations that mention a missing package but actually concern a different issue entirely. Rather than read through every transcript individually, we can build a custom prompt to find the true key issue of each conversation we've selected.
We can run this prompt on the ten conversations in the stash to see the summarized key issue for each one. In this example, only six of the ten conversations have 'package missing' as the key issue. We can move those conversations to a new stash and proceed to the next step: engineering a chatbot flow prompt.
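A rough sketch of that filtering step might look like the following, reusing the stash from the previous snippet and using the OpenAI chat API as one possible client; the prompt wording and model name are placeholders rather than the exact prompt we engineered.

```python
from openai import OpenAI

client = OpenAI()  # any LLM client works here; this one is just an example

KEY_ISSUE_PROMPT = (
    "Read the customer support conversation below and state its key issue "
    "in a few words, e.g. 'package missing', 'refund request', 'address change'.\n\n"
    "Conversation:\n{conversation}"
)

def key_issue(conversation: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": KEY_ISSUE_PROMPT.format(conversation=conversation)}],
    )
    return response.choices[0].message.content.strip().lower()

# Keep only the conversations whose summarized key issue really is the missing package.
filtered_stash = [c for c in stash if "package missing" in key_issue(c)]
```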
Generating Chatbot Flows from a Small Data Segment
Having filtered our data through semantic search and a custom summarization prompt, we can start working on the prompt that will generate the chatbot logic. This is where iterative, human-in-the-loop prompt engineering takes over. After many revisions, we've engineered a prompt with sufficient detail to create the flow we need: full of conditional logic, subnodes, specific omissions, and instructions a chatbot could follow.
Running this prompt on a merged stash will take all of the selected conversations into account, compiling their cues to create a single flow. This is the beginning of many revisions: we'll want to ensure the output is properly formatted and that there are no oversights across these six examples.
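As a hedged illustration of what this step could look like in code, the sketch below merges the filtered conversations into a single flow-generation call, reusing the client and filtered_stash variables from the earlier snippets; the prompt shown only hints at the level of detail the real prompt required.

```python
FLOW_PROMPT = """You are a conversation designer. Using only the example conversations
below, write a chatbot flow for handling a missing-package request. Express the flow as
numbered steps with conditional branches (e.g. 'If the tracking number is invalid, ...')
and sub-steps, and leave out any step the examples never require.

Example conversations:
{examples}
"""

def generate_flow(conversations: list[str]) -> str:
    # Merge the whole stash into one prompt so the model compiles cues across examples.
    merged = "\n\n---\n\n".join(conversations)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": FLOW_PROMPT.format(examples=merged)}],
    )
    return response.choices[0].message.content

draft_flow = generate_flow(filtered_stash)
```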
Expanding the Logic
We're now confident that we've generated chatbot flow logic that could sufficiently manage these example conversations. But there are no doubt other missing-package conversations that take different turns.
To find more conversations and test this flow against edge cases, we can return to our ‘key issue’ prompt. Using the pipeline feature, we can summarize the key issue across all raw transcripts simultaneously. We can cluster the output by similarity, search for the keyword ‘not delivered,’ curate the results, and move them to a new stash. Working with simplified summaries, we can now effectively search by semantic similarity to find more conversations on the same topic. In this example, we found 15 additional conversations about a missing package.
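Outside of any particular tool, the same pipeline can be approximated in code: summarize every raw transcript with the key-issue prompt, embed and cluster the short summaries, then keep the clusters that mention the keyword. The sketch below reuses the key_issue helper and embedding model from the earlier snippets; the clustering settings and keyword match are illustrative guesses.

```python
from sklearn.cluster import AgglomerativeClustering

# Summarize every raw transcript (in practice the full corpus, not the small sample above).
summaries = [key_issue(t) for t in transcripts]

# Embed the short summaries and cluster them by similarity.
summary_embs = model.encode(summaries)  # sentence-transformers model from earlier
labels = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0).fit_predict(summary_embs)

# Curate: keep every cluster whose summaries mention the keyword we care about.
relevant_clusters = {label for label, s in zip(labels, summaries) if "not delivered" in s}
new_stash = [t for t, label in zip(transcripts, labels) if label in relevant_clusters]
```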
Let's create a custom prompt to test our generated flow against these new scenarios. We can copy and paste the flow from the previous step and ask the model whether the given instructions would sufficiently handle these new conversations; if not, we'll ask it to list the missing steps. We can run this on each individual conversation in the stash to see a yes-or-no classification for all 15 examples.
Following is a simplified sketch of that prompt (an illustrative reconstruction of its structure rather than the exact wording):
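```
You are reviewing a chatbot flow for handling missing-package requests.

Chatbot flow:
{flow}

Conversation:
{conversation}

Would a chatbot following this flow, step by step, have fully handled the customer's
request in the conversation above? Answer "yes" or "no". If the answer is "no", list
the steps that are missing from the flow.
```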
Our output highlights important gaps in the flow: a step for validating the user's account information, a provision for cases where the order details are unknown. These are just a few of the overlooked steps we'll want to account for.
We can continue to fine-tune this flow until we get nothing but 'yes' from the above validation prompt. Then we can transfer the output to a generative playbook, a text-to-flow converter, or a chatbot platform that lets us express flows in natural language.
Finding and accounting for edge cases is easier, faster, and more effective when we work from real customer conversations. Synthetic data and imagined scenarios are limited by our conscious understanding of customer interactions, and they often lack the nuances of natural language in action. With this workflow, conversation designers can accelerate the backend build of flexible, intuitive chatbots, improving agility and dramatically shortening time to deployment.