November 26, 2020
·
4 MIN READ

A bottom-up approach to NLU



An important aspect of conversation design is understanding your customers’ intents. What are your customers asking? What problems do they have?

To answer these questions, access to real conversational data is critical. Without it, you’re playing a guessing game: you can brainstorm the most common intents with your team, but correctly addressing the long tail specific to your domain is next to impossible.

However, access to conversational data isn’t enough. Without proper tooling, you’ll find yourself manually sifting through conversation transcripts with no idea where to start, when to stop, or which utterances constitute a valid intent and which are noise.

The typical approach to this problem has been to apply unsupervised clustering techniques.

Image taken from IBM Watson (reference)

There are two clear problems with unsupervised clustering as an approach to discovery and training of intents:

  • The first, obvious problem is that clusters often overlap (see image above) and represent similar or identical intents, which requires manual intervention to disambiguate them.
  • A less obvious but more fundamental problem is that unsupervised clustering techniques say nothing about how abstract or specific the intent generated from a given cluster should be.

For example, a cluster with utterances similar to “how can I transfer funds to my checking account?” could be assigned any one of these three labels, ordered from most abstract to most specific:

  1. Has a question
  2. Has a question > about bank account
  3. Has a question > about bank account > transfers
Determining which label to apply is a non-trivial problem, as the right level of abstraction for any given intent depends on whether there is sufficient data to accurately train the intent at that level of abstraction.
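
To make the dependence on data volume concrete, here is a minimal sketch that backs each utterance off to the most specific label that still has enough examples. It assumes every utterance already carries a full candidate label path, which is exactly what is usually missing in practice. The function names, the example utterances (apart from the first) and the `MIN_EXAMPLES` threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical threshold: the minimum number of examples we assume an intent
# needs before it can be trained reliably. Tune this for your own data.
MIN_EXAMPLES = 3


def ancestors(path):
    """Return every prefix of a label path, from most abstract to most specific."""
    return [path[:depth] for depth in range(1, len(path) + 1)]


def choose_labels(labeled, min_examples=MIN_EXAMPLES):
    """Back each utterance off to its most specific label with enough data."""
    # Count how many utterances fall under each label at every level.
    support = Counter()
    for _, path in labeled:
        for prefix in ancestors(path):
            support[prefix] += 1

    chosen = {}
    for utterance, path in labeled:
        # Walk from most specific to most abstract and stop at the first
        # label with enough support; the top-level label is the fallback.
        for prefix in reversed(ancestors(path)):
            if support[prefix] >= min_examples or len(prefix) == 1:
                chosen[utterance] = " > ".join(prefix)
                break
    return chosen


QUESTION = "Has a question"
labeled = [
    ("how can I transfer funds to my checking account?",
     (QUESTION, "about bank account", "transfers")),
    ("can I move money between my accounts?",
     (QUESTION, "about bank account", "transfers")),
    ("what is my account balance?",
     (QUESTION, "about bank account", "balance")),
    ("what are your opening hours?",
     (QUESTION, "about branches")),
]

for utterance, label in choose_labels(labeled).items():
    print(f"{label}: {utterance}")
```

With only two “transfers” examples, the transfer utterances fall back to “Has a question > about bank account”; adding more transfer examples would let the most specific label survive.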

This is a classic chicken-and-egg problem: you need labeled data in order to correctly label your data.

Bottom-up approach to intent discovery & data labeling

Bottom-up labeling applies the tried and tested divide-and-conquer approach to this problem, with great success. Instead of expecting a human or unsupervised algorithm to correctly “predict” what intents and abstractions exist in the data, it provides a simple framework to iteratively discover this information.

The bottom-up “algorithm” is simple:

  • Step 1: Identify a few very high-level intents that can capture most (if not all) of the meaning in your data (in our experience, “has a question” and “has a problem” are great starting points).
  • Step 2: Label your conversation / utterance data, assigning each utterance to one of these high-level intents. The cognitive load at this step is minimal, since the decision boils down to picking one of the existing high-level intents.

The outcome of this step is valuable in itself: it provides high-quality, domain-specific training data for classifying users who “have a question” or “have a problem”.

Image by Author
  • Step 3: For every intent (e.g., “has a question”), identify more specific “sub-intents” that its training examples can fall into (e.g., “has a question > about credit account”, “has a question > about account settings”).
  • Step 4: Re-assign the top-level intents’ training data to the more specific sub-intents you’ve just created.
Image by Author
  • Repeat steps 3 & 4 (i.e., divide and conquer).
Image by Author
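
The steps above can be sketched as a small intent tree. This is only an illustration under stated assumptions, not how any particular tool implements it: the `Intent` class, its `split` and `training_data` methods, and the keyword rules in `label_question` are hypothetical, and the keyword rules merely stand in for a human labeling decision.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    name: str
    examples: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

    def split(self, assign):
        """Steps 3 and 4: move examples into more specific sub-intents.

        `assign` maps an utterance to a sub-intent name, or returns None
        to leave the utterance at the current level.
        """
        remaining = []
        for utterance in self.examples:
            sub_name = assign(utterance)
            if sub_name is None:
                remaining.append(utterance)
            else:
                child = self.children.setdefault(sub_name, Intent(sub_name))
                child.examples.append(utterance)
        self.examples = remaining

    def training_data(self):
        """All examples for this intent, including those of its sub-intents."""
        data = list(self.examples)
        for child in self.children.values():
            data.extend(child.training_data())
        return data


def show(intent, depth=0):
    """Print the intent tree with the number of examples under each node."""
    print("  " * depth + f"{intent.name} ({len(intent.training_data())} examples)")
    for child in intent.children.values():
        show(child, depth + 1)


# Steps 1 and 2: a high-level intent with the utterances labeled into it.
question = Intent("has a question", [
    "how can I transfer funds to my checking account?",
    "what is the limit on my credit account?",
    "how do I change my password?",
    "can you help me?",
])


# Stand-in for the human decision made in step 3 (illustrative keyword rules).
def label_question(utterance):
    if "credit account" in utterance:
        return "about credit account"
    if "password" in utterance:
        return "about account settings"
    return None


question.split(label_question)  # step 4
show(question)
```

Utterances that do not fit any sub-intent yet simply stay on the parent, so the parent intent keeps all of its training data while the sub-intents grow. Calling `split` again on a child repeats steps 3 and 4 one level deeper.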

Every step produces training data for classifiers that can recognize increasingly specific intents: this is one of the major advantages of this approach.
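
To illustrate, here is a minimal sketch that trains one classifier on the top-level labels and a second one on the sub-intents of “has a question”. The article does not prescribe a classifier, so the tiny Naive Bayes model below is an assumption, as are the function names and the invented example utterances.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words, ignoring '?' and '.'."""
    return text.lower().replace("?", " ").replace(".", " ").split()


def train(examples):
    """Fit a tiny multinomial Naive Bayes model on (utterance, intent) pairs."""
    word_counts = defaultdict(Counter)
    intent_counts = Counter()
    for utterance, intent in examples:
        intent_counts[intent] += 1
        word_counts[intent].update(tokenize(utterance))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, intent_counts, vocabulary


def predict(model, utterance):
    """Return the intent with the highest log-probability for the utterance."""
    word_counts, intent_counts, vocabulary = model
    total = sum(intent_counts.values())
    scores = {}
    for intent, count in intent_counts.items():
        # Log prior plus log likelihood with add-one (Laplace) smoothing.
        score = math.log(count / total)
        denominator = sum(word_counts[intent].values()) + len(vocabulary)
        for word in tokenize(utterance):
            score += math.log((word_counts[intent][word] + 1) / denominator)
        scores[intent] = score
    return max(scores, key=scores.get)


# Training data produced by step 2 (top-level intents).
top_level = train([
    ("how can I transfer funds to my checking account?", "has a question"),
    ("what is the limit on my credit account?", "has a question"),
    ("how do I change my password?", "has a question"),
    ("my card is not working", "has a problem"),
    ("the app keeps crashing when I log in", "has a problem"),
    ("I was charged twice for the same payment", "has a problem"),
])

# Training data produced by steps 3 and 4 (sub-intents of "has a question").
question_level = train([
    ("what is the limit on my credit account?", "about credit account"),
    ("how do I pay off my credit account?", "about credit account"),
    ("how do I change my password?", "about account settings"),
    ("where can I update my email address?", "about account settings"),
])

utterance = "can I change my email?"
print(predict(top_level, utterance), ">", predict(question_level, utterance))
```

Each refinement adds a classifier for a narrower decision, and the broader classifiers stay usable while the more specific ones are still short on data.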

What’s the catch?

If this solution to labeling and training data seems too obvious, it’s because it is: divide-and-conquer has been used to break problems down into manageable chunks for a long time; it just hasn’t been made readily available for data labeling and intent discovery use cases yet.

The main reason is tooling and resources: the labeling and refactoring workflows required to make this efficient and manageable at scale are costly to build, and only the more sophisticated companies have done so. These companies can charge customers thousands of dollars to build and train intents from unstructured data.

There are, however, some solutions focused on democratizing this approach. HumanFirst is one of them, and provides one of the first out-of-the-box bottom-up labeling and intent discovery solutions. In our next article, we’ll explore how machine learning and semantic search can accelerate this bottom-up approach. Stay tuned!
