A recent study coined the term LLM Drift: definite changes in LLM responses and behaviour over a relatively short period of time.
LLM Drift has nothing to do with the inherent unpredictability of LLMs or with minor amendments to prompt engineering; it involves a fundamental shift in the model's responses over a brief timeframe.
The investigation found that the accuracy of GPT-4 and GPT-3.5 responses fluctuates substantially over a four-month period, sometimes in a positive direction and, worryingly, sometimes in the opposite direction.
Notably, the study shows significant variation in both GPT-3.5 and GPT-4, with performance degrading on certain tasks.
Our findings highlight the need to continuously monitor LLMs' behaviour over time. — Source
The most notable changes observed during the study were:
Chain-of-Thought effectiveness changed, with GPT-4 becoming less likely to answer questions that call for an opinion.
The decrease in opinionated answers may be related to improvements in how safety-sensitive questions are handled.
The drift tendencies of GPT-4 and GPT-3.5 often differ.
Even though LLM improvements can be achieved with fine-tuning and contextual prompt injection (RAG), unexpected behaviour will still be present.
The researchers stressed continued testing and benchmarking, because the study's analysis relied largely on shifts in overall accuracy as the main metric; more fine-grained investigations could reveal additional drift patterns.
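To make continuous monitoring concrete, here is a minimal sketch of the kind of benchmark loop that could be re-run against a model at regular intervals: a fixed evaluation set, an accuracy score, and a timestamped log. The query_llm helper and the example questions are illustrative assumptions, not code from the study.

```python
import datetime
import json

# A small, fixed evaluation set that is re-used for every run, so that
# accuracy shifts reflect model drift rather than changes in the data.
EVAL_SET = [
    {"prompt": "Is 10007 a prime number? Answer yes or no.", "expected": "yes"},
    {"prompt": "Is 10004 a prime number? Answer yes or no.", "expected": "no"},
]

def query_llm(prompt: str) -> str:
    """Placeholder for a call to the LLM service under test (e.g. GPT-4)."""
    raise NotImplementedError

def run_benchmark() -> dict:
    """Run the fixed evaluation set once and return a timestamped accuracy snapshot."""
    correct = 0
    for example in EVAL_SET:
        answer = query_llm(example["prompt"]).strip().lower()
        if example["expected"] in answer:  # naive answer check, for illustration
            correct += 1
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "accuracy": correct / len(EVAL_SET),
    }

if __name__ == "__main__":
    # Append each snapshot to a log file; comparing snapshots taken weeks
    # or months apart surfaces drift in the model's behaviour.
    with open("drift_log.jsonl", "a") as log:
        log.write(json.dumps(run_benchmark()) + "\n")
```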
The schematic below shows the fluctuation in model accuracy over a period of four months. In some cases the degradation is quite stark, amounting to more than a 60% loss in accuracy.
The table below shows how Chain-of-Thought (CoT) effectiveness drifts over time on the prime number task.
Without CoT prompting, both GPT-4 and GPT-3.5 achieved relatively low accuracy.
With CoT prompting, GPT-4 in March achieved a 24.4% accuracy improvement, which dropped to -0.1% in June. It seems GPT-4 lost the ability to benefit from the CoT prompting technique.
For GPT-3.5, the CoT boost increased from 6.3% in March to 15.8% in June.
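To illustrate what the CoT boost measures, the sketch below contrasts a direct prompt with a Chain-of-Thought prompt on the same primality questions and reports the difference in accuracy between the two prompting styles. The prompt wording and the query_llm placeholder are assumptions for illustration, not the study's exact prompts.

```python
# Contrast direct vs. Chain-of-Thought prompting on the primality task.
# query_llm is a placeholder for a call to the LLM service under test.

DIRECT_TEMPLATE = "Is {n} a prime number? Answer with yes or no only."
COT_TEMPLATE = (
    "Is {n} a prime number? Think step by step: check divisibility by "
    "small primes, then finish your answer with yes or no."
)

QUESTIONS = [(10007, "yes"), (10004, "no")]

def query_llm(prompt: str) -> str:
    raise NotImplementedError  # call out to GPT-4 / GPT-3.5 here

def accuracy(template: str) -> float:
    correct = 0
    for n, expected in QUESTIONS:
        answer = query_llm(template.format(n=n)).strip().lower()
        if answer.rstrip(".").endswith(expected):  # naive check of the final verdict
            correct += 1
    return correct / len(QUESTIONS)

# The CoT boost is simply the accuracy gained by adding step-by-step
# reasoning to the prompt; a negative value means CoT is now hurting.
cot_boost = accuracy(COT_TEMPLATE) - accuracy(DIRECT_TEMPLATE)
print(f"CoT boost: {cot_boost:+.1%}")
```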
The datasets used and basic code examples from the study are available on GitHub. I also added an executed notebook which you can view here.
The GitHub repository also holds the datasets and generated content. Each CSV file corresponds to one dataset, with each row holding one query and the generation from one LLM service.
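As a quick way to explore those files, the snippet below loads one CSV with pandas and inspects a single query/generation row. The file name and column layout shown here are assumptions for illustration; check them against the actual files in the repository.

```python
import pandas as pd

# Each CSV holds one dataset: one row per query, with the generation from
# one LLM service. The file name below is a placeholder, not an actual
# file name from the repository.
df = pd.read_csv("prime_march_gpt4.csv")

print(f"{len(df)} rows (queries) in this file")
print(df.columns.tolist())

# Inspect one query and its generated response.
print(df.iloc[0].to_dict())
```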
I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Build Frameworks, natural language data productivity suites & more.