Chat message classification with ML

Matias Casali
April 11, 2023

Context / Introduction

In one of our apps we had a chat that received lots of messages. Our users were having problems due to not being able to answer urgent messages quickly enough, which caused client complaints. To try and mitigate this, we were asked to classify messages into different dynamic categories, each with a different urgency and theme.

Problem and Solution

We decided to solve this problem using machine learning. For this purpose, we had two main constraints: 

  1. We didn’t have enough data to train a model from scratch, or even to fine-tune an existing one. We could gather this data but it would take a long time, and speed was important for this feature. For that reason, we changed our scope and decided to use an already trained model. 
  2. Since we wanted to classify incoming chat messages, the model needed to be fast enough to keep up with the constant message inflow.

The first model we chose was BART MNLI. This is a zero shot classification model: it can classify natural language into categories it has never seen during training, making it excellent for our use case. Under the hood, it is a natural language inference (NLI) model: given two sentences, a “premise” and a “hypothesis”, it determines whether the hypothesis follows from the premise.

Let's say we want to decide whether a sentence is about science. With this NLI model, we use the sentence we want to classify as the premise and “This example is about science” as the hypothesis. The model returns a score for how likely it is that the hypothesis follows from the premise, which in our case roughly corresponds to how likely it is that the sentence is related to science.
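The premise/hypothesis setup described above can be sketched in a few lines. This is a minimal illustration, not our production code: `EntailmentScorer`, `zero_shot_scores`, `best_category` and `keyword_scorer` are hypothetical names, and the keyword scorer is a toy stand-in for a real NLI model so that the example runs without downloading anything.

```python
from typing import Callable, Dict, List

# Hypothesis wording follows the "This example is about science" example.
HYPOTHESIS_TEMPLATE = "This example is about {}."

# Assumed interface: (premise, hypothesis) -> entailment score in [0, 1].
EntailmentScorer = Callable[[str, str], float]


def zero_shot_scores(
    message: str, categories: List[str], entailment: EntailmentScorer
) -> Dict[str, float]:
    """Score each category, using the message as the premise."""
    return {
        category: entailment(message, HYPOTHESIS_TEMPLATE.format(category))
        for category in categories
    }


def best_category(
    message: str, categories: List[str], entailment: EntailmentScorer
) -> str:
    """Return the category whose hypothesis gets the highest score."""
    scores = zero_shot_scores(message, categories, entailment)
    return max(scores, key=scores.get)


def keyword_scorer(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: high score if the topic word appears."""
    topic = hypothesis.rstrip(".").split()[-1].lower()
    return 0.9 if topic in premise.lower() else 0.1


if __name__ == "__main__":
    message = "New science results were published today"
    print(best_category(message, ["sports", "science", "billing"], keyword_scorer))
```

Note that one scorer call is made per category. With a real model, Hugging Face's `zero-shot-classification` pipeline wraps the same idea, and `facebook/bart-large-mnli` is a commonly used checkpoint for it.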

During testing, this model proved to be pretty accurate, but it had a big disadvantage: it was really slow. Since we wanted to classify a message into one of many categories, the model had to evaluate a separate premise/hypothesis pair for each category, so the time per message grew with the number of categories.

We tried using a smaller, faster model to discard the unlikely categories so the main model would only have to evaluate a smaller subset. Performance improved greatly, but not enough to keep up with the incoming messages. Since we were hosting the model in AWS SageMaker, we tested auto-scaling it to compensate. It worked, but ended up being too costly.
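The filtering idea can be sketched as a two-stage pipeline: a cheap scorer ranks all categories, and the accurate (slow) scorer only sees the top few. The function names, the `keep` parameter and the scorer interface are assumptions made for illustration; the actual models we used are not shown here.

```python
from typing import Callable, List

# Assumed interface: (message, category) -> relevance score, higher is better.
Scorer = Callable[[str, str], float]


def shortlist(
    message: str, categories: List[str], cheap_score: Scorer, keep: int
) -> List[str]:
    """Keep the `keep` categories the cheap model considers most likely."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ranked = sorted(categories, key=lambda c: cheap_score(message, c), reverse=True)
    return ranked[:keep]


def classify_two_stage(
    message: str,
    categories: List[str],
    cheap_score: Scorer,
    accurate_score: Scorer,
    keep: int = 3,
) -> str:
    """Run the expensive scorer only on the shortlisted categories."""
    candidates = shortlist(message, categories, cheap_score, keep)
    return max(candidates, key=lambda c: accurate_score(message, c))
```

The expensive scorer is called `keep` times per message instead of once per category. The trade-off is that a category wrongly discarded by the cheap model can never be chosen.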

While looking for other potential solutions, we were also experimenting with the ChatGPT API, so we decided to test how it would handle this categorization problem.

We started off with this naive prompt

It did work, and it was almost 6 times faster than the zero shot model, but it would frequently return categories that weren’t defined, and it could only process one message at a time.

After iterating on it for some time, we arrived at the following prompt, which, at least for the time being, is working correctly. It receives the list of messages and categories, and returns the chosen categories in a JSON array.
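Because the model can return categories that were never defined, it helps to validate the JSON array before using it. The sketch below shows one way to do that. The function name, the one-entry-per-message assumption and the policy of mapping anything unexpected to `None` are our illustrative choices here, not part of the prompt itself.

```python
import json
from typing import List, Optional


def parse_categories(
    raw_reply: str, allowed: List[str], expected_count: int
) -> List[Optional[str]]:
    """Parse a JSON-array reply; entries outside `allowed` become None.

    Assumes one category per input message, in the same order.
    """
    fallback: List[Optional[str]] = [None] * expected_count
    try:
        parsed = json.loads(raw_reply)
    except json.JSONDecodeError:
        return fallback  # reply was not valid JSON at all
    if not isinstance(parsed, list) or len(parsed) != expected_count:
        return fallback  # wrong shape: cannot map entries back to messages
    allowed_set = set(allowed)
    return [
        item if isinstance(item, str) and item in allowed_set else None
        for item in parsed
    ]
```

Messages that come back as `None` can then be retried or left uncategorized, depending on how urgent a wrong label would be.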

Besides the prompt, it’s important to also use an adequate temperature. The model temperature determines how “creative” the model will be when completing your prompt. A temperature of 0 makes the model return the same (or nearly the same) answer for the same input, while a high temperature produces widely varying answers. We found that in this case ~0.3 works best, but the prompt should return good enough results with other temperatures as well.
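To build some intuition for what temperature does, here is a minimal sketch of the textbook formulation: the scores (logits) for the candidate tokens are divided by the temperature before being turned into probabilities. This illustrates the general technique and is not a description of how the ChatGPT API implements it internally; the function name and the example logits are made up.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into sampling probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit case: all probability mass goes to the highest logit (greedy).
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    for t in (0, 0.3, 1.0, 2.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate the probability on the top candidate, which is why classification-style tasks tend to behave more consistently with low values.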


For the time being, ChatGPT is a fast, cheap and flexible solution for many natural language processing tasks, and as described in this note, for some use cases it is even better than traditional narrow models.

As an alternative solution to the performance problem of the zero shot model, it is also possible to distill it into a smaller, faster model that retains a high degree of accuracy. 
