LLM Drift

September 25, 2023 · 4 min read

A recent study coined the term LLM Drift: definite changes in LLM responses and behaviour over a relatively short period of time. LLM Drift has nothing to do with the inherent unpredictability of LLMs or with minor adjustments to prompt engineering; it involves a fundamental shift in the underlying model's behaviour.
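
Because sampling randomness can mimic drift, comparisons of this kind are typically made with decoding parameters pinned. Here is a minimal sketch using the OpenAI Python SDK (the model name and prompt are illustrative; the study's exact settings may differ):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Greedy decoding (temperature=0) removes sampling randomness, so a
# different answer to the same prompt weeks apart points to the model
# itself having changed, not to luck of the draw.
response = client.chat.completions.create(
    model="gpt-4",  # illustrative; pin the exact model snapshot you test
    messages=[{"role": "user", "content": "Is 17077 a prime number?"}],
    temperature=0,
)
print(response.choices[0].message.content)
```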

The investigation found that the accuracy of GPT-4 and GPT-3.5 responses fluctuated substantially over a four-month period, sometimes in a positive direction and, more worryingly, sometimes in the opposite direction as well.

Notably, the study found significant variation in both GPT-3.5 and GPT-4, including performance degradation on certain tasks.

Our findings highlight the need to continuously monitor LLMs' behaviour over time. — Source

The most notable changes observed during the study were:

  • Chain-Of-Thought effectiveness changed, with GPT-4 becoming less likely to answer questions that call for an opinion.
  • The decrease in opinionated answers may be related to improvements in safety measures.
  • GPT-4 and GPT-3.5 often drift in different directions.
  • Even though LLM behaviour can be improved with fine-tuning and contextual prompt injection (RAG), unexpected behaviour will still be present.
  • The researchers stressed continued testing and benchmarking, since the study's analysis relied largely on shifts in aggregate accuracy as its main metric; fine-grained investigations could reveal additional drift patterns. A minimal monitoring sketch follows this list.
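
As a concrete starting point for such monitoring, here is a minimal sketch (not the study's harness; the model name, helper names, and scoring rule are assumptions) that re-runs a fixed, labelled query set and appends the accuracy to a time-stamped log, so drift shows up as a shift in the logged series:

```python
import csv
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt: str) -> str:
    # Greedy decoding so run-to-run differences reflect the model, not sampling.
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; pin the exact snapshot you monitor
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def benchmark(queries: list[tuple[str, str]], log_path: str = "drift_log.csv") -> float:
    """Score a fixed, labelled query set and append the result to a CSV log.

    Reuse the same `queries` on every run so scores stay comparable over
    time; a sustained shift in the logged series is the drift signal.
    """
    correct = sum(
        expected.lower() in call_llm(prompt).lower()  # crude containment check
        for prompt, expected in queries
    )
    accuracy = correct / len(queries)
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), accuracy])
    return accuracy
```

Scheduled daily or weekly (for example via cron), this gives exactly the longitudinal view the researchers call for.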

The schematic below shows the fluctuation in model accuracy over a period of four months. In some cases the degradation is quite stark, amounting to more than a 60% loss in accuracy. (Source)

The table below shows how Chain-Of-Thought (CoT) effectiveness drifted over time on the prime-number identification task.

Without CoT prompting, both GPT-4 and GPT-3.5 achieved relatively low accuracy.

With CoT prompting, GPT-4 in March achieved a 24.4% accuracy improvement; by June this boost had dropped to -0.1%. It seems GPT-4 lost the ability to capitalise on the CoT prompting technique.

For GPT-3.5, the CoT boost increased from 6.3% in March to 15.8% in June.
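
To make the comparison concrete, here is a minimal sketch of the two prompt styles being contrasted (the number and exact wording are illustrative, not the study's verbatim prompts):

```python
number = 17077  # illustrative; the study tested many such numbers

# Direct prompt: ask for the verdict with no intermediate reasoning.
direct_prompt = f"Is {number} a prime number? Answer [Yes] or [No]."

# Chain-of-Thought prompt: ask the model to reason step by step first.
cot_prompt = (
    f"Is {number} a prime number? Think step by step and then answer "
    "[Yes] or [No]."
)

# The CoT "boost" is the accuracy achieved with cot_prompt minus the
# accuracy achieved with direct_prompt over the same set of numbers.
```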

The datasets used and basic code examples from the study are available on GitHub. I also added an executed notebook which you can view here.

The GitHub repository also holds the datasets and generated content. Each CSV file corresponds to one dataset, and each row holds one query together with the generation from one LLM service.
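
As a quick way to inspect one of these files, a minimal pandas sketch (the filename is a placeholder; check the repository for the actual file and column names):

```python
import pandas as pd

# Load one dataset's generations; one row = one query plus one LLM response.
df = pd.read_csv("dataset_generations.csv")  # placeholder filename

print(df.columns.tolist())  # discover the real column names first
print(df.head())
```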

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language, ranging from LLMs, Chatbots, Voicebots, Build Frameworks, natural language data productivity suites & more.
