Experts at McKinsey estimate that Generative AI will increase customer operations productivity by 38% relative to functional spending, the highest projected return of any business function.
The magnitude of those figures reflects the growing confidence that GenAI will be as good as or better than human agents at performing complex customer service tasks repeatedly and reliably. As with human agents, training, supervision, and performance management will be the pillars of success.
To supervise and manage this otherwise unmanageable volume of agent-led interactions, we’ll need to design a prompt that assesses performance in line with our human and business judgment. Once that prompt is working, we’ll need to apply it at scale to monitor customer experiences in near-real time. This article outlines that process, demonstrating a workflow for scaling agent quality assurance (QA) using HumanFirst and Vertex AI.
Creating Training Data
Use Vertex AI to create task-specific, synthetic data to engineer an agent monitoring prompt.
HumanFirst integrates with the platform of your choice, including Google’s Vertex AI, to help you test and run custom prompts against your data. You can follow this guide or watch this video to set up the integration.
For the sake of this demonstration, we’ll refrain from critiquing real agent-to-customer conversations. Instead, we’ll use HumanFirst to generate realistic training data that will help us engineer a successful QA prompt.
Within HumanFirst, we can quickly create realistic examples of good and bad conversations on common topics that agents see daily. Let’s start with a prompt to generate clean conversations between an agent and a user.
In the ‘Post Process’ tab, changing the Post Process Rule to ‘Conversations’ will give us the output in the format we’re looking for.
Next, we’ll want to feed the prompt a handful of those common topics on which to model the interactions. We can type those example topics directly into the stash using the ‘Add a custom example’ space. I’ve added the following:
When we run this prompt, it will generate a clean conversation for each topic we’ve added to the stash. Since I’ve added four topics, the output will consist of four conversations. If we wanted to generate a lot of training data, we could add a prompt directive to generate 50 conversations for each topic and add more custom topics to the stash.
To create a mirrored dataset of unsuccessful conversations, we can use a similar prompt and add a directive to make the agent do something out of the ordinary or against protocol. We can run that prompt on the same stash of custom topics, and our output will be four conversations that contain an issue we’d want our QA to flag.
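For readers who want to see what this step looks like outside the HumanFirst UI, here is a minimal sketch using the Vertex AI Python SDK and a Gemini model. The project ID, model name, topics, and prompt wording are illustrative assumptions, not the exact prompt used above; the point is the pattern of looping over topics and optionally injecting a protocol violation for the ‘bad’ dataset.

```python
# A minimal sketch of the synthetic-data step using the Vertex AI Python SDK
# (google-cloud-aiplatform) and a Gemini model. The project ID, model name,
# topics, and prompt wording are illustrative assumptions, not the exact
# prompt used in HumanFirst.
import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")  # assumption: your own project
model = GenerativeModel("gemini-1.5-pro")  # assumption: any chat-capable Vertex AI model works

# Illustrative stand-ins for the custom topics added to the stash.
TOPICS = ["refund request", "late delivery", "password reset", "plan downgrade"]

CLEAN_PROMPT = (
    "Write a realistic customer support conversation between an Agent and a User "
    "about: {topic}. The agent should follow standard support protocol and fully "
    "resolve the issue. Format each turn as 'Agent:' or 'User:'."
)

# Extra directive for the mirrored 'bad' dataset: have the agent do something
# out of the ordinary or against protocol that we'd want QA to flag.
BAD_DIRECTIVE = (
    " In this conversation, the agent should break protocol in one clear way, "
    "such as promising an unauthorized refund, sharing another customer's details, "
    "or ending the chat without resolving the issue."
)

def generate_conversation(topic: str, inject_issue: bool = False) -> str:
    """Generate one synthetic agent-user conversation for a given topic."""
    prompt = CLEAN_PROMPT.format(topic=topic) + (BAD_DIRECTIVE if inject_issue else "")
    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(temperature=0.9),  # higher temperature for varied examples
    )
    return response.text

good_conversations = [generate_conversation(t) for t in TOPICS]
bad_conversations = [generate_conversation(t, inject_issue=True) for t in TOPICS]
```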
Once we’re happy with the training data generated for both good and bad conversations, we can group and label the outputs into ‘good’ and ‘bad’ datasets. The result will be two training datasets on relevant topics: one that exemplifies successful interactions and another that models conversations we’d want to find and fix.
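If you were assembling the same labeled datasets outside the platform, a simple layout like the JSONL below would do; the file name and schema are illustrative, and the conversation lists come from the generation sketch above.

```python
# Illustrative only: store the generated conversations with their 'good'/'bad'
# labels so they can be reused when evaluating the QA prompt later.
import json

with open("qa_training_data.jsonl", "w") as f:
    for convo in good_conversations:
        f.write(json.dumps({"conversation": convo, "label": "good"}) + "\n")
    for convo in bad_conversations:
        f.write(json.dumps({"conversation": convo, "label": "bad"}) + "\n")
```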
Assessing Agent Performance
Apply prompts to monitor agents to improve performance and accelerate trust.
This custom training data will help us engineer a prompt that can be fine-tuned and applied at scale to monitor real conversations. First, we can select the example conversations in both of our training datasets and move them to the stash: four good and four bad conversations. We can start with the following prompt:
When we run this against the stash, we’ll see a yes or no classification for every stashed conversation. We can compare each output against its source conversation to make sure the prompt is successfully distinguishing between good and bad.
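As a rough sketch of what that classification looks like as a direct Vertex AI call (reusing the model from the generation sketch above), we can force a single yes/no answer and normalize anything unexpected. The prompt wording here is an illustrative assumption, not the exact prompt shown above.

```python
# Illustrative QA classification prompt; reuses `model` and `GenerationConfig`
# from the generation sketch. Forcing a one-word answer keeps the output easy
# to compare against the source conversation.
QA_PROMPT = (
    "You are a customer service QA reviewer. Read the conversation below and answer "
    "with a single word, 'yes' or 'no': did the agent handle the conversation "
    "correctly and follow protocol?\n\nConversation:\n{conversation}"
)

def classify_conversation(conversation: str) -> str:
    response = model.generate_content(
        QA_PROMPT.format(conversation=conversation),
        generation_config=GenerationConfig(temperature=0.0),  # keep the classification as stable as possible
    )
    answer = response.text.strip().lower()
    # Route anything that isn't a clear yes/no to a 'review' bucket rather than guessing.
    if answer.startswith("yes"):
        return "yes"
    if answer.startswith("no"):
        return "no"
    return "review"
```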
Naturally, this is iterative; we wouldn’t expect an out-of-the-box model to discern performance the same way we do. Instead, prompt engineering allows us to add clear parameters and directives that bring a more human understanding and business-aware context into our analysis.
Let’s add the following directives:
We can run this prompt against the same stashed data and measure its performance against the source conversations to see if it's improved.
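One lightweight way to quantify that check, sticking with the sketches above: run the prompt over the labeled training conversations, count where it agrees with our ‘good’/‘bad’ labels, and read through the disagreements.

```python
# Compare the prompt's yes/no answers with our labels ('good' should map to 'yes',
# 'bad' to 'no') and surface the disagreements for closer reading.
labeled = [(c, "yes") for c in good_conversations] + [(c, "no") for c in bad_conversations]

correct = 0
disagreements = []
for conversation, expected in labeled:
    predicted = classify_conversation(conversation)
    if predicted == expected:
        correct += 1
    else:
        disagreements.append((expected, predicted, conversation[:120]))

print(f"Agreement with our labels: {correct}/{len(labeled)}")
for expected, predicted, snippet in disagreements:
    print(f"expected {expected}, got {predicted}: {snippet}...")
```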
If we’re confident in the prompt’s performance, we can move on to a subset of real data for further refinement. We could return to the ‘unsatisfied_refund’ label we created in our last example and see what the prompt picks up, or move all of the data from the larger ‘refund’ category to the stash and test the prompt on a wider set of conversations that we know will contain a mix of good and bad outcomes. If we see patterned instances in which the model differs from our judgment, we can fine-tune the prompt further. When we’re happy with how the prompt performs, we can run it at scale using the pipeline feature.
Scaling Agent QA
Use a pipeline to apply a monitoring prompt at scale.
Pre-release feature: The pipeline feature is currently available on demand only.
The pipeline feature will allow us to run the fine-tuned prompt against all of the raw conversations we’ve uploaded to the platform. By navigating to the ‘Pipeline’ tab, we can select the prompt we’ve just crafted, choose ‘Unlabeled’ as the input data source, and run the pipeline to extract a good or bad classification, in accordance with our prompt instructions, across all of our conversations at once.
The pipeline output will show us the yes/no classification for each conversation. We can select a single exchange and view the source conversation to double-check the model’s decision-making. Then we can cluster the output and quickly create two groups containing the good and the bad conversations. We’ve successfully scaled our human-like QA across an unlimited number of real customer-agent exchanges.
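Conceptually, the pipeline is doing something like the sketch below: applying the fine-tuned classification prompt to every unlabeled conversation and grouping the results by verdict. This is only an outline of the idea (rate limits and quotas are not handled here); inside HumanFirst, the pipeline takes care of the execution for you.

```python
# Conceptual sketch of the pipeline step: classify every conversation in parallel
# and group the results by verdict. Reuses `classify_conversation` from above.
from concurrent.futures import ThreadPoolExecutor

def run_qa_pipeline(conversations: list[str]) -> dict[str, list[str]]:
    results: dict[str, list[str]] = {"yes": [], "no": [], "review": []}
    with ThreadPoolExecutor(max_workers=8) as pool:
        for conversation, verdict in zip(conversations, pool.map(classify_conversation, conversations)):
            results[verdict].append(conversation)
    return results

# e.g. verdicts = run_qa_pipeline(all_unlabeled_conversations)
```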
Here begins another exploration: we can label the bad conversations and dig into that dataset, grouping the conversations by summarized call drivers and labeling each group. We can use a prompt to identify where the agents deviated from instructions, or have the model suggest better resolution journeys. We can also extract protocols from the good conversations to understand which resolution strategies are proving successful.
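A follow-up prompt along these lines (illustrative wording, reusing the same model as the earlier sketches) could handle that gap analysis on the flagged conversations:

```python
# Illustrative gap-analysis prompt for conversations classified as 'no':
# ask the model to quote where the agent deviated and suggest a better path.
GAP_PROMPT = (
    "You are reviewing a customer service conversation that was flagged as mishandled. "
    "1) Quote the specific agent turns that deviate from standard support protocol. "
    "2) Briefly describe a better resolution path the agent should have followed.\n\n"
    "Conversation:\n{conversation}"
)

def analyze_gap(conversation: str) -> str:
    return model.generate_content(GAP_PROMPT.format(conversation=conversation)).text
```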
In just a few clicks, we’ve effectively extended our contextual, human understanding to assess hundreds of agent conversations. We’ve surfaced protocol gaps and identified new proof points for successful strategies. For companies managing high volumes of customer conversations, this monitoring workflow will help teams identify and address problems sooner, making them more competitive in the intra-industry race to automated, elevated CX.
The next and final article in this series walks through a process of generating self-serve knowledge to address the gaps in the bad conversations. Stay tuned!