Tutorials

Scaling Quality Assurance with HumanFirst and Google Cloud

How to use HumanFirst with Vertex AI to test, improve, and trust agent performance.

Alex Dubois · March 14, 2024 · 4 min read

Experts at McKinsey estimate that Generative AI will increase customer operations productivity by 38% relative to functional spending; of all business functions, customer operations is projected to see the highest return.

The magnitude of that estimate reflects growing confidence that GenAI will perform complex customer service tasks as well as or better than human agents, repeatedly and reliably. As with human agents, training, supervision, and performance management will be the pillars of success.

To supervise and manage a volume of agent-led interactions far too large to review manually, we'll need to design a prompt that can assess performance in alignment with our human and business judgment. Once that prompt is working, we'll need to apply it at scale to monitor customer experiences in near-real time. This article outlines that process, demonstrating a workflow for scaling agent quality assurance (QA) using HumanFirst and Vertex AI.

Creating Training Data

Use Vertex AI to create task-specific, synthetic data to engineer an agent monitoring prompt.

HumanFirst integrates with the platform of your choice, including Google’s Vertex AI, to help you test and run custom prompts against your data. You can follow this guide or watch this video to set up the integration.

For the sake of this demonstration, we’ll refrain from critiquing real agent-to-customer conversations. Instead, we’ll use HumanFirst to generate realistic training data that will help us engineer a successful QA prompt. 

Within HumanFirst, we can quickly create realistic examples of good and bad conversations on common topics that agents see daily. Let’s start with a prompt to generate clean conversations between an agent and a user. 
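
The exact wording of that generation prompt is up to you; as a rough, illustrative starting point (not the precise prompt used here), it might read something like:

```
You are generating synthetic training data. Write a realistic, polite
conversation between a customer support agent and a customer about the
topic provided. The agent should follow standard support protocol and
fully resolve the customer's issue. Alternate speaker turns, prefixed
with "Agent:" and "Customer:".
```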

In the ‘Post Process’ tab, changing the Post Process Rule to ‘Conversations’ will give us the output in the format we’re looking for.

Next, we’ll want to feed the prompt a handful of those common topics on which to model the interactions. We can type those example topics directly into the stash using the ‘Add a custom example’ space. I’ve added the following:
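
The exact topics will depend on your business; for illustration, a stash of four common support drivers might look something like:

- Requesting a refund for a recent order
- Checking the status of a delayed delivery
- Resetting an account password
- Cancelling a subscription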

When we run this prompt, it will generate a clean conversation for each topic we’ve added to the stash. Since I’ve added four topics, the output will consist of four conversations. If we wanted to generate a lot of training data, we could add a prompt directive to generate 50 conversations for each topic and add more custom topics to the stash.
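
Inside HumanFirst, the prompt directive and the stash handle this for you. If you wanted to sanity-check the same idea directly against Vertex AI, a rough Python sketch might look like the following (the project ID, model name, topic list, and per-topic count are all illustrative assumptions, not values taken from the platform):

```python
# Illustrative sketch: bulk synthetic-data generation against Vertex AI.
# Project ID, model name, topics, and counts are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")
model = GenerativeModel("gemini-1.0-pro")

topics = [
    "refund request for a recent order",
    "status of a delayed delivery",
    "account password reset",
    "subscription cancellation",
]

conversations = []
for topic in topics:
    for _ in range(50):  # e.g. 50 synthetic conversations per topic
        prompt = (
            "Write a realistic customer support conversation in which the "
            f"agent follows protocol and resolves the issue. Topic: {topic}."
        )
        conversations.append(model.generate_content(prompt).text)

print(f"Generated {len(conversations)} synthetic conversations.")
```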

To create a mirrored dataset of unsuccessful conversations, we can use a similar prompt and add a directive to make the agent do something out of the ordinary or against protocol. We can run that prompt on the same stash of custom topics, and our output will be four conversations that contain an issue we’d want our QA to flag. 
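
The "out of the ordinary" directive can stay loose; an illustrative addition to the generation prompt might be:

```
In this conversation, the agent should break from protocol in one
noticeable way: for example, promising a refund without verifying the
order, sharing inaccurate policy details, or ending the chat before
the issue is resolved.
```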

Once we're happy with the training data generated for both good and bad conversations, we can group and label the outputs into 'good' and 'bad' datasets. The result is two training datasets on relevant topics: one exemplifying successful interactions, and one modeling the kinds of conversations we'd want to find and fix.

Assessing Agent Performance

Apply prompts to monitor agents to improve performance and accelerate trust.

This custom training data will help us engineer a prompt that can be fine-tuned and applied at scale to monitor real conversations. First, we can select the example conversations in both of our training datasets and move them to the stash: four good and four bad conversations. We can start with the following prompt:
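
The starting prompt can be deliberately simple; an illustrative version (not necessarily the exact wording) might read:

```
You are a quality assurance reviewer. Read the following customer
support conversation and answer with a single word, "yes" or "no":
did the agent handle this conversation successfully?
```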

When we run this against the stash, we'll see a yes or no classification for every stashed conversation. We can compare each output against its source conversation to make sure the prompt is successfully distinguishing between good and bad.

Naturally, this is iterative; we wouldn't expect an out-of-the-box model to discern performance the same way we do. Instead, prompt engineering allows us to add clear parameters and directives to bring a more human understanding and business-aware context into our analysis.

Let’s add the following directives:
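
The directives should encode your own definition of a good conversation; as an illustration, they might look something like this:

```
Answer "no" if any of the following apply:
- The agent shares policy or pricing information that is inaccurate.
- The agent fails to verify the customer's identity or order before
  making changes.
- The agent ends the conversation without resolving the issue or
  offering a clear next step.
- The agent's tone is dismissive or unprofessional.
Otherwise, answer "yes".
```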

We can run this prompt against the same stashed data and measure its performance against the source conversations to see if it's improved. 
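
Measuring "improved" can be as informal as eyeballing the outputs, but because we labeled the stashed examples ourselves, it's also easy to quantify agreement. A small sketch (the labels and predictions below are illustrative values, not real results):

```python
# Rough sketch: how often does the prompt's classification match our
# own judgment on the eight stashed examples? Values are illustrative.
human_labels = ["good", "good", "good", "good", "bad", "bad", "bad", "bad"]
prompt_labels = ["good", "good", "bad", "good", "bad", "bad", "bad", "good"]

matches = sum(h == p for h, p in zip(human_labels, prompt_labels))
print(f"Agreement: {matches}/{len(human_labels)} "
      f"({matches / len(human_labels):.0%})")
```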

If we’re confident in the prompt’s performance, we can start with a subset of real data for further fine-tuning. We could return to the ‘unsatisfied_refund’ label we created in our last example, and see what the prompt picks up. We could move all of the data from the larger ‘refund’ category to the stash and test the prompt on a wider subset of conversations which we know will have a mix of good and bad outcomes. If we see patterned instances in which the model differs from our judgment, we can further fine-tune the prompt. When we like how the prompt performs, we can run it at scale using the pipeline feature.

Scaling Agent QA

Use a pipeline to apply a monitoring prompt at scale.

Pre-release feature: The pipeline feature is currently available on demand only.

The pipeline feature will allow us to run the fine-tuned prompt against all of the raw conversations we’ve uploaded to the platform. By navigating to the ‘Pipeline’ tab, we can select the prompt we’ve just crafted, choose ‘Unlabeled’ as the input data source, and run the pipeline to extract a good or bad classification, in accordance with our prompt instructions, across all of our conversations at once.
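
The pipeline does all of this inside HumanFirst. Conceptually, it amounts to something like the following sketch, where the QA prompt, model name, and load_conversations() helper are illustrative placeholders rather than the platform's actual internals:

```python
# Conceptual sketch of what the pipeline does: apply the fine-tuned QA
# prompt to every raw conversation and bucket the results.
# Project ID, model name, QA prompt, and load_conversations() are
# placeholders for illustration only.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")
model = GenerativeModel("gemini-1.0-pro")

QA_PROMPT = (
    "You are a QA reviewer. Answer 'yes' or 'no': did the agent handle "
    "this conversation successfully?\n\nConversation:\n{conversation}"
)

def load_conversations():
    # Placeholder: in practice, these are the raw conversations uploaded
    # to the platform (or exported from your contact center).
    return ["Agent: Hi, how can I help? Customer: My order never arrived..."]

good, bad = [], []
for conversation in load_conversations():
    answer = model.generate_content(
        QA_PROMPT.format(conversation=conversation)
    ).text.strip().lower()
    (good if answer.startswith("yes") else bad).append(conversation)

print(f"{len(good)} good conversations, {len(bad)} flagged for review.")
```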

The pipeline output will show us the yes/no classification for each conversation. We can select a single exchange and view the source conversation to double-check the model's decision-making. Then we can cluster the output and quickly create two groups containing the good and the bad conversations. We've successfully scaled our human-like QA across an unlimited number of real customer-agent exchanges.

Here begins another exploration; we can label the bad conversations and explore that dataset, grouping them by summarized call drivers and labeling each group. We can use a prompt to identify where the agents differ from the instructions, or have the model suggest better resolution journeys. We can also extract protocols from the good conversations to understand the resolution strategies that are proving successful. 

In just a few clicks, we’ve effectively extended our contextual, human understanding to assess hundreds of agent conversations. We’ve surfaced protocol gaps and identified new proof points for successful strategies. For companies managing high volumes of customer conversations, this monitoring workflow will help teams identify and address problems sooner, making them more competitive in the intra-industry race to automated, elevated CX.

The next and final article in this series walks through a process of generating self-serve knowledge to address the gaps in the bad conversations. Stay tuned!
