Monitoring of Artificial Intelligence Models: Method, Representation, and Results

All news outlets are undergoing a gradual shift in the way their articles are written. The topics covered, editorial style, vocabulary used, and the depth of content per section are all factors that are likely to evolve over time.

To train an AI model, you need to build a dataset. To do this, we use a news publisher’s article database. This dataset is therefore representative of the publisher’s content only at a specific point in time. As the publisher’s real-world content naturally evolves, a gap opens up between the articles being written and the training data, which can cause the model’s performance to decline.

What are the solutions to this problem? In other words, how can we ensure that a model trained for the press and media remains relevant?

In short: Implementing a monitoring system is essential for ensuring the model’s effectiveness. Consulting an expert helps determine the best approach based on your specific needs. This prevents the use of ineffective methods and promotes efficient and responsible solutions. 🌱


Definition

To ensure that a model continues to perform well, a monitoring system can be implemented. Because a model cannot process raw text directly, this requires converting the articles into usable numerical data by selecting an appropriate data representation method.

Converting text data into data that can be used by AI 

To verify the relevance of a model against constantly evolving data, one can compare the texts used for training with the texts currently produced by the editorial team. 

However, a machine cannot compare texts written in natural language; they must first be converted into a numerical representation. In short, each article is converted into a vector and placed in a vector space. This makes it possible to calculate the distance between two texts and thus determine their degree of similarity.

This mechanism is used to determine the category to which a new article belongs, but it can also be used to flag instances where submitted articles differ too greatly from the training articles.
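The idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the system used in the case study: the word-count vectors, the function names (`vectorize`, `cosine_distance`, `is_out_of_scope`), the sample sentences, and the 0.8 threshold are all assumptions chosen for the example. A production system would more likely rely on the representation already computed by its model.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a sparse word-count vector (word -> count)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two sparse vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # an empty text is treated as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def is_out_of_scope(article, training_articles, threshold=0.8):
    """Flag an article whose nearest training article is farther than threshold.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    vec = vectorize(article)
    nearest = min(cosine_distance(vec, vectorize(t)) for t in training_articles)
    return nearest > threshold


# Hypothetical sample data
training = [
    "The government presented its budget to parliament",
    "The team won the football match last night",
]
on_topic = "Parliament debates the government budget"
off_topic = "Hospitals report rising virus infections"

print(is_out_of_scope(on_topic, training))   # False
print(is_out_of_scope(off_topic, training))  # True
```

The second article shares no vocabulary with the training texts, so its distance to every training article is 1 and it is flagged.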

Present data in a way that is understandable to humans 

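Data can be represented in many different ways. To keep things simple, we will explore three of them:

  • the representation produced by a language model,
  • the TF-IDF representation,
  • classifier scores.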

Each has its own specific characteristics, which makes the choice of data representation a critical decision for achieving the desired result. To understand the implications of this choice, let’s consider a hypothetical use case: France-Actus.

Maintaining the performance of a classification model for the press: the case of France-Actus

To understand how to choose the best representation, let’s take the example of a news outlet, which we’ll call France-Actus. 

France-Actus is a general-interest daily newspaper that covers national news. In 2018, the publication sought a solution to boost its editorial team’s productivity while improving the internal structure of its website. The editor-in-chief, Eddy Torialiste, decided to implement a classification system for its articles.

In practical terms, when a journalist finishes writing an article, they use an artificial intelligence system to automatically categorize it under one or more sections of the website.

This new tool proved invaluable for several years, until an unforeseen event disrupted this collaboration.

Implementing COVID monitoring

Two years after the classification model was implemented, its performance has plummeted. In fact, the categories suggested by the model seem inconsistent, and the editorial team is no longer satisfied with the tool.

And for good reason: it’s 2020, a year marked by the onset of COVID-19. The pandemic dominates the headlines of every media outlet for many weeks.

The result? France-Actus’s editorial line has been affected, and the data fed into the model now differs too greatly from the training data. 

To solve this problem, Eddy Torialiste makes two decisions: 

  • Retrain the classification model on the new data,
  • Monitor the model so that he is automatically alerted if its performance declines.

Eddy Torialiste opts for the simplest solution: tracking changes in the articles by comparing the representations that his classification model already uses. This language-model representation is already computed within his system, since the step is required to categorize the articles.
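As a rough sketch of this kind of monitoring, the snippet below compares the average vector of the training articles with the average vector of recently published ones and raises an alert when they drift apart. The vectors are hand-written toy values standing in for the representation a language model would produce; the function names and the 0.2 threshold are assumptions made for illustration only.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def drift_alert(training_vectors, recent_vectors, threshold=0.2):
    """Return (distance, alert) for the average training vs. recent vectors.

    The threshold is an illustrative value; in practice it would be tuned
    on historical data.
    """
    distance = cosine_distance(
        mean_vector(training_vectors), mean_vector(recent_vectors)
    )
    return distance, distance > threshold


# Toy 3-dimensional vectors (hypothetical, for illustration only)
training_vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
recent_similar = [[0.85, 0.15, 0.0]]
recent_shifted = [[0.1, 0.2, 0.9], [0.0, 0.3, 0.8]]

print(drift_alert(training_vectors, recent_similar))
print(drift_alert(training_vectors, recent_shifted))
```

Because the vectors are reused from the classification step, this check adds very little computation, which matches the "simplest solution" reasoning above.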

A Look at Monitoring: ChatGPT

In 2023, France-Actus faces two new challenges: readers are criticizing the site’s navigation, and its SEO performance is declining.

In an effort to identify the cause, Eddy Torialiste is once again examining the performance of the classification model. However, the metrics show no decline in performance or any significant change in the articles.

After surveying the editorial staff, Eddy Torialiste learns that some of his journalists have begun using ChatGPT, which has led to a significant change in writing style and vocabulary.

Aside from the fact that he would have liked to have been warned, Eddy Torialiste thus learns that the decision to use a language model makes it impossible to detect such a change. Indeed, in order to classify the articles correctly, the language model was trained to ignore changes in style and focus instead on the topics covered.

The best approach in this case would have been to use a TF-IDF representation. This is because the language model ignores form to focus on the subject matter, whereas the TF-IDF method focuses on style and vocabulary.
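To make the contrast concrete, here is a minimal TF-IDF sketch in pure Python. It is an illustration under stated assumptions: the smoothed IDF formula, the choice to give words never seen in the reference corpus the highest IDF, the function names, and the sample sentences are all hypothetical. The point is only that two texts on the same topic end up far apart once the vocabulary changes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def fit_idf(documents):
    """Compute smoothed inverse document frequencies on a reference corpus."""
    n = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(tokenize(doc)))
    return {w: math.log((1 + n) / (1 + df)) + 1 for w, df in doc_freq.items()}


def tfidf(text, idf, n_docs):
    """Weight each word by term frequency times IDF.

    Assumption: a word absent from the reference corpus gets the IDF of a
    word with document frequency 0, so new vocabulary weighs heavily.
    """
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    default = math.log(1 + n_docs) + 1
    return {w: (c / total) * idf.get(w, default) for w, c in counts.items()}


def cosine_distance(u, v):
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return 1.0 - dot / norm if norm else 1.0


# Hypothetical reference articles and two new texts on the same topic
reference = [
    "The mayor opened the new bridge on Monday",
    "The mayor said the bridge cost ten million euros",
]
plain = "The mayor opened the new bridge"
styled = "In a landmark moment, the mayor unveiled the groundbreaking new bridge"

idf = fit_idf(reference)
n_docs = len(reference)
ref_vec = tfidf(reference[0], idf, n_docs)

d_plain = cosine_distance(tfidf(plain, idf, n_docs), ref_vec)
d_styled = cosine_distance(tfidf(styled, idf, n_docs), ref_vec)
print(d_plain, d_styled)  # the restyled text is farther from the reference
```

Both new texts cover the same event, yet the restyled one sits noticeably farther from the reference article, which is the kind of signal a topic-focused representation would tend to smooth away.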

Insufficient monitoring: the Olympics

In 2024, the France-Actus newsroom is bustling with activity as it covers the Paris Olympics. Numerous articles are published every day to report on and analyze all the sporting events.

With just a few days left in the competition, the classification model alerts Eddy Torialiste: the proportion of articles published in the “sports” category has skyrocketed. He doesn’t understand why he wasn’t alerted sooner and would have preferred to be notified in the early days of the competition. 

To achieve this, Eddy Torialiste should have opted for a more specialized representation, such as classifier scores, which track only the proportion of articles in each category but are responsive and lightweight.
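A lightweight check of this kind might look like the sketch below, which compares the share of each predicted category in a recent window with its share in a baseline period. The function names, the sample label counts, and the 0.15 alert margin are assumptions for illustration; they do not come from the case study.

```python
from collections import Counter


def category_shares(labels):
    """Return the share of each category in a list of predicted labels."""
    counts = Counter(labels)
    total = len(labels)
    return {category: count / total for category, count in counts.items()}


def share_alerts(baseline_labels, recent_labels, max_increase=0.15):
    """List categories whose share rose by more than max_increase.

    max_increase is an illustrative margin, to be tuned per publication.
    """
    baseline = category_shares(baseline_labels)
    recent = category_shares(recent_labels)
    return sorted(
        category
        for category, share in recent.items()
        if share - baseline.get(category, 0.0) > max_increase
    )


# Hypothetical predictions: a usual month vs. the first days of a big event
baseline_labels = ["politics"] * 50 + ["sports"] * 20 + ["culture"] * 30
recent_labels = ["politics"] * 30 + ["sports"] * 55 + ["culture"] * 15

print(share_alerts(baseline_labels, recent_labels))  # ['sports']
```

Since it only counts labels the classifier already produces, the check can run on a short window of articles, which is what makes an early alert possible.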

How do you choose the right method? 

The France-Actus case study shows us that the way data is represented has a significant impact on the results obtained from a monitoring system.

One possible solution could be to combine several methods to cover all possibilities. 

Another option could be to use an LLM. While this approach has proven effective in terms of performance, it requires a model that is ten times larger, which means it will be more expensive and consume more energy.

However, there is a better way to combine efficiency and performance. The solution is to consult an expert who can thoroughly understand your needs and address them with a tailored solution.

This focus on efficiency encourages the adoption of responsible practices in the field of artificial intelligence. 

Are you looking for an expert to help you process and analyze your data? We’re here to help!
