December 17, 2021
·
4 MIN READ

The Importance of High-Quality Training Data

ALEX DUBOIS

In the new AI/NLU paradigm, businesses are realizing that complex algorithms alone will no longer sustain their competitive advantage. The advantage lies in the ability to curate and utilize high-quality training data.

However, the tooling to build, curate, and manage AI training data has not kept pace with the surge of conversational AI/NLU tools.

Yet it’s widely recognized that AI/ML is 95%+ data work.

Andrew Ng, a pioneer of the data-centric movement, has asked why data quality isn’t a machine learning team’s top priority, especially since the majority of machine learning work is data cleaning and preparation.

This was a response to the conventional wisdom among AI practitioners, which held that to improve an AI system you iterate on the model and hold the data fixed. This is known as the model-centric approach.

The data-centric approach turns that on its head: AI practitioners systematically improve the quality of the data while holding the model fixed.

This is why we’re seeing a shift that reflects the importance of a data-centric view.

How do I make the shift from model-centric to data-centric?

To adopt a data-centric approach, you have to prioritize the continuous iteration and improvement of the data over the model.

Let’s take a closer look.

Quality > Quantity

It’s important to prioritize quality over quantity. This axiom has been ingrained in us by every boss, professor, and Marie Kondo enthusiast. Why hoard massive amounts of data you don’t need?

Low-quality, high-volume data will inevitably focus your attention on the wrong things, not to mention the 100+ hours spent tuning your model to overcome it. Without high-quality data, even sophisticated algorithms become impractical.
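
As a minimal sketch of what a quality-first pass over training data can look like, the Python below normalizes utterances, collapses exact duplicates, and flags any utterance that carries more than one label. The function name `audit_examples`, the (text, intent) tuple format, and the sample data are assumptions made for illustration; they are not taken from any specific tool.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def audit_examples(examples):
    """Split (text, intent) pairs into clean examples and label conflicts.

    Returns (clean, conflicts): `clean` keeps one copy of each consistently
    labeled utterance; `conflicts` maps a normalized utterance to the sorted
    list of different intents it was labeled with.
    """
    labels_by_text = defaultdict(set)
    for text, intent in examples:
        labels_by_text[normalize(text)].add(intent)

    clean, conflicts = [], {}
    for text, intents in labels_by_text.items():
        if len(intents) == 1:
            clean.append((text, next(iter(intents))))
        else:
            conflicts[text] = sorted(intents)
    return clean, conflicts


# Hypothetical sample data.
examples = [
    ("I want to cancel my order", "cancel_order"),
    ("i want to  cancel my order", "cancel_order"),  # duplicate after normalizing
    ("Where is my package?", "track_order"),
    ("where is my package?", "delivery_complaint"),  # conflicting label
]
clean, conflicts = audit_examples(examples)
```

Here the four raw examples shrink to one clean example and one conflict to resolve by hand, which is usually a better use of time than tuning a model around the noise.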

Use real-world data

Your system needs to translate into real-world utility. It is important to use historical, customized, high-quality data so your model is tailored to your exact use case. Tailored data also addresses the generalization problem that has made some industries, like healthcare and agriculture, slow to adopt machine learning. Training data isn’t one-size-fits-all: a dataset built for one domain can’t be universally applied to another.

Using ‘state-of-the-art’ generative models to produce synthetic data tends to favor quantity over quality. A model trained on implausible examples is likely to fail when deployed in the real world.

The best training data comes from real conversations that are specific to your users.

Get systematic

Manage your data systematically: use consistent, unambiguous labeling techniques, a streamlined way to discover new intents and increase coverage, and methods to disambiguate overlapping intents.

An unmethodical, top-down approach leads to inaccuracy and leaves the long tail of requests out of scope.
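
One way to make "disambiguate overlapping intents" concrete is to look for utterances in different intents that are nearly the same. The sketch below uses plain word overlap (Jaccard similarity) as a stand-in for semantic similarity; a real pipeline would more likely compare embeddings. The intent names, sample utterances, and the 0.5 threshold are all assumptions for illustration.

```python
from itertools import combinations


def tokens(text: str) -> set:
    """Rough tokenizer: lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,!?") for w in text.lower().split()} - {""}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_pairs(intents, threshold=0.5):
    """Return cross-intent utterance pairs whose word overlap meets the threshold.

    Each result is (intent_a, utterance_a, intent_b, utterance_b, score).
    """
    results = []
    for (name_a, utts_a), (name_b, utts_b) in combinations(intents.items(), 2):
        for ua in utts_a:
            for ub in utts_b:
                score = jaccard(tokens(ua), tokens(ub))
                if score >= threshold:
                    results.append((name_a, ua, name_b, ub, round(score, 2)))
    return results


# Hypothetical intents and utterances.
intents = {
    "cancel_order": ["cancel my order please", "I need to cancel an order"],
    "change_order": ["change my order please", "can I edit my order"],
    "track_order": ["where is my package"],
}
pairs = overlapping_pairs(intents)
```

With this sample data the only flagged pair is "cancel my order please" versus "change my order please". Pairs like this are candidates for clearer labeling guidelines, extra contrasting examples, or merging the intents.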

Use error analysis

Find where performance systematically suffers, and use workflows to fix weak data or fill in what’s missing. For example, adopt workflows with timely feedback loops and revisions. Catching errors early and keeping track of changes ensures you can roll back anything that didn’t work out as expected.
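
A simple starting point for error analysis is to rank intents by error rate and count which intents get mistaken for which. The sketch below assumes you already have (true intent, predicted intent) pairs from a held-out test set; the function name `error_report` and the sample records are hypothetical.

```python
from collections import Counter


def error_report(records):
    """Summarize (true_intent, predicted_intent) pairs.

    Returns (ranked, confusions): per-intent error rates sorted worst first,
    and a Counter of (true, predicted) pairs covering only the mistakes.
    """
    totals, errors, confusions = Counter(), Counter(), Counter()
    for true, predicted in records:
        totals[true] += 1
        if predicted != true:
            errors[true] += 1
            confusions[(true, predicted)] += 1
    rates = {intent: errors[intent] / totals[intent] for intent in totals}
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return ranked, confusions


# Hypothetical evaluation records.
records = [
    ("cancel_order", "cancel_order"),
    ("cancel_order", "change_order"),
    ("cancel_order", "change_order"),
    ("cancel_order", "cancel_order"),
    ("track_order", "track_order"),
    ("track_order", "track_order"),
    ("change_order", "change_order"),
    ("change_order", "cancel_order"),
    ("change_order", "change_order"),
    ("change_order", "change_order"),
]
ranked, confusions = error_report(records)
worst_confusion = confusions.most_common(1)[0]
```

The worst-ranked intents and the most frequent confusion pairs point to where new or corrected training examples are likely to pay off first.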

Continuous Improvement

Systematic improvement of training datasets is one of the most effective ways to improve the performance of a model. You should have a method for answering the following:

  • How do we continuously improve the intents that we’ve deployed?
  • How do we continuously discover new intents?
  • How is this workflow streamlined?
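
One hedged way to approach the questions above is to route low-confidence predictions from production traffic into a review queue, where they can be relabeled or grouped into new intents. The sketch below assumes your model returns a confidence score with each prediction; the function name `triage`, the 0.5 threshold, and the sample predictions are illustrative assumptions.

```python
def triage(predictions, threshold=0.5):
    """Split (utterance, intent, confidence) predictions into two lists.

    `accepted` holds predictions at or above the threshold; `review` holds
    the rest, sorted so the least confident items come first.
    """
    accepted, review = [], []
    for utterance, intent, confidence in predictions:
        target = accepted if confidence >= threshold else review
        target.append((utterance, intent, confidence))
    review.sort(key=lambda item: item[2])
    return accepted, review


# Hypothetical model outputs.
predictions = [
    ("cancel my order", "cancel_order", 0.93),
    ("do you ship to Canada", "track_order", 0.31),
    ("my promo code failed", "change_order", 0.42),
    ("where is my package", "track_order", 0.88),
]
accepted, review = triage(predictions)
```

In this sample, the shipping and promo-code questions land in the review queue. Neither fits an existing intent well, so both are candidates for new intents.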

That said, the best way to make the shift towards data-centricity is to adopt data-centric NLU tooling.

HumanFirst: Data-centric Tooling for NLU

We often get the question,

“Is HumanFirst another chatbot or NLU platform, like Dialogflow or Watson?”

If it’s not obvious by now, the answer is no. We addressed the tooling gap in the market by building a hyper-efficient tool for building and maintaining the data that powers your chatbot or NLU.

We saw teams turning to Excel or building (and maintaining) their own tooling and processes to do this work. Both of these alternatives lead to inefficiencies, frustration, high cost, and a delayed time to market.

So, what are we?

A tool to systematically engineer data used to build AI systems, with the ability to:

  • Explore your unlabeled dataset to inform development priorities and business decisions
  • Discover what to label with clustering and semantic search capabilities
  • Build an intent hierarchy with machine-learning labeling workflows
  • Test your data with real-time updates and revisions
  • Correct issues with remediation workflows, while discovering the long tail of intents along the way
  • Work with natural language data in a fun, intuitive, and useful way

As we move towards the democratization of ML skills and tools, data is becoming the key component and differentiator of modern ML pipelines, and clean, de-noised datasets are what will set one data architecture apart from another. Training data needs to produce models that return accurate predictions, and it needs to scale systematically and sustainably.

Our vision at HumanFirst is to make the entire process of discovering, training, and improving intents from raw natural language data productized and user-intuitive. HumanFirst maintains the most advanced data pipeline and platform to address this gap in the ML/AI tooling ecosystem.

HumanFirst is like Excel for natural language data: a complete productivity suite to transform natural language into business insights and AI training data.
