Detecting similar content

Written by: Corentin Blanc. Published on: February 08, 2024

The hunt for duplicate content

Knowing how to detect similar content on your website is crucial in SEO. It helps prevent your content from being penalized in the SERPs. Search engines rank web pages according to several criteria. When duplicate content is detected on a page, its ranking is downgraded.

Search engines adopted this criterion to guarantee the originality of content and thus promote a quality user experience. It also spares them from having to decide how to rank two identical pieces of content relative to each other.

How can I automatically detect similar content?

There are three ways to automatically detect similar content: keyword search, semantic search and hybrid search.

While the first two employ radically different procedures, the third brings them together, in an attempt to propose a more complete and precise method.

Keyword search: term weighting

What is a keyword search?

Keyword searching involves using specific words or phrases to identify similar documents. To do this, each document is represented as a vector recording how often each term occurs in it. For example, if you want to find chocolate cake recipes, simply search for all documents containing the keywords "recipe", "cake" and "chocolate".
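
To make the chocolate cake example concrete, here is a minimal Python sketch of an exact keyword filter. The mini-corpus, the document names and the helper functions are hypothetical and serve only as an illustration; a real system would rely on a proper tokenizer and an index.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def contains_all_keywords(text: str, keywords: list[str]) -> bool:
    """Return True when every keyword appears among the document's tokens."""
    tokens = set(tokenize(text))
    return all(keyword.lower() in tokens for keyword in keywords)

# Hypothetical mini-corpus, used only for illustration.
documents = {
    "doc_a": "An easy recipe for a rich chocolate cake",
    "doc_b": "A cocoa sponge with hazelnut spread",
    "doc_c": "Chocolate cake recipe for beginners",
}

matches = [
    name
    for name, text in documents.items()
    if contains_all_keywords(text, ["recipe", "cake", "chocolate"])
]
print(matches)  # doc_b is left out: it never uses the exact keywords
```

Note that the cocoa-based document is missed entirely, which is the kind of blind spot that term weighting alone does not solve.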

How to proceed?

This process is usually carried out using TF-IDF weighting. In a nutshell, it consists of evaluating the relative importance of each word (TF) in a document compared to its presence in a collection of documents (IDF).

Term Frequency

More precisely, Term Frequency is calculated by counting the occurrences of a term and dividing this count by the total number of words in the document. The result is the proportion of that term in the document and, presumably, its importance. Only presumably, because the terms used most often in a document are not always the most important ones, as is the case with determiners.
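
The calculation described above can be sketched in a few lines of Python. The function name and the sample sentence are assumptions made for the example, and tokenization is reduced to a simple whitespace split.

```python
from collections import Counter

def term_frequency(tokens: list[str]) -> dict[str, float]:
    """Proportion of each term among all the tokens of one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical one-sentence document, split on whitespace.
tokens = "the cake is the best chocolate cake".split()
tf = term_frequency(tokens)

# The determiner "the" scores exactly as high as "cake" (2 occurrences out of 7),
# which shows why raw frequency alone is a poor measure of importance.
print(tf["the"], tf["cake"], tf["chocolate"])
```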

Inverse Document Frequency

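IDF weighting, on the other hand, looks at how often a term recurs across a large corpus of documents to determine its importance: the rarer a term is in the corpus, the more weight it carries. To compute it, we divide the number of documents in the corpus by the number of documents containing the term under study; in practice, the logarithm of this ratio is generally used.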
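
As an illustration, the sketch below computes IDF on a tiny corpus and combines it with TF for one document. It assumes the common logarithmic form, that is, the natural logarithm of the ratio described above; other variants exist, and the corpus and function name are hypothetical.

```python
import math

def inverse_document_frequency(corpus: list[list[str]]) -> dict[str, float]:
    """log(number of documents / number of documents containing the term)."""
    n_docs = len(corpus)
    doc_freq: dict[str, int] = {}
    for tokens in corpus:
        for term in set(tokens):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical corpus of three tokenized documents.
corpus = [
    "the chocolate cake recipe".split(),
    "the lemon tart recipe".split(),
    "the history of cocoa".split(),
]
idf = inverse_document_frequency(corpus)

# TF-IDF weights for the first document: term frequency multiplied by IDF.
first_doc = corpus[0]
tf_idf = {
    term: (first_doc.count(term) / len(first_doc)) * idf[term]
    for term in set(first_doc)
}
print(tf_idf)
```

Here "the" appears in every document, so its IDF is log(1) = 0 and its TF-IDF weight vanishes, while "chocolate", present in a single document, receives the highest weight.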

A variant of this method also takes into account document length and the saturation of a term's frequency (each additional occurrence of the same term adds less and less to the score): this is the BM25 formula.
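
For readers who want to see what BM25 looks like in practice, here is a minimal sketch of one widely used formulation (Okapi BM25 with a smoothed IDF). The parameter values k1 = 1.5 and b = 0.75 are conventional defaults chosen as assumptions, not values given in this article, and the corpus is hypothetical.

```python
import math

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Score one tokenized document against a query with Okapi BM25.

    k1 controls how fast repeated occurrences of a term saturate;
    b controls how strongly long documents are penalized.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        freq = doc.count(term)
        length_norm = 1 - b + b * len(doc) / avg_len
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score

# Hypothetical tokenized corpus.
corpus = [
    "chocolate cake recipe".split(),
    "chocolate chocolate chocolate cake recipe with extra chocolate".split(),
    "lemon tart recipe".split(),
]
query = ["chocolate", "cake"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

In this sketch, the document that repeats "chocolate" four times does not earn four times the credit for that term, and its greater length is penalized through b; the lemon tart recipe, which contains none of the query terms, scores zero.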

Limits of keyword searches

The TF-IDF formula does, however, have limitations that you need to be aware of in order to use it well. It matches only the exact words or expressions you search for. Yet writing often calls for synonyms, or for a shift from everyday vocabulary to a technical lexicon, and the formula ignores these nuances of language. In semiological terms, the different signs used to designate the same object are not recognized as equivalent. What's more, it ignores the order in which terms appear in a document.

Semantic search: document vectorization

Semantic search at a glance

Semantic search is more comprehensive than keyword search. Its higher degree of accuracy in similarity detection is enabled by an understanding of the overall meaning of a document. Still using the example of the chocolate cake recipe, semantic search could find more results by including recipes based on cocoa or spreads, without ever using the word "chocolate".

How to detect similar content using semantic search?

This method relies on language models, such as BERT for English and CamemBERT for French. They encode documents as vectors that capture their semantics, as well as the various relationships between words. Comparing these vectors with a similarity score then makes it possible to determine whether two pieces of content are similar. In most cases, cosine similarity is used.
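
Cosine similarity itself is simple to compute once the vectors exist. The sketch below uses hand-written four-dimensional vectors as stand-ins for real embeddings; they are purely hypothetical, since an actual model such as BERT or CamemBERT would produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical embeddings standing in for the output of a language model.
chocolate_cake = [0.9, 0.8, 0.1, 0.0]
cocoa_sponge = [0.8, 0.9, 0.2, 0.1]
tax_form = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(chocolate_cake, cocoa_sponge))  # close to 1: similar
print(cosine_similarity(chocolate_cake, tax_form))      # close to 0: unrelated
```

A score near 1 means the two vectors point in almost the same direction. The threshold above which two documents count as "similar" is a design choice that has to be tuned on your own content.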

Because these models are trained on large corpora of words and documents, semantic search can grasp the context associated with terms, make use of it and cope with unknown words. Above all, it can handle notions of meaning.

The limits of semantic search

However, this method also has its limitations. The vectors produced by these language models are complex and do not correspond to human reasoning, which makes the results difficult to interpret. In addition, semantic search requires more time and computing power.

Hybrid search: the best of both worlds

The hybrid search performs a keyword search, while taking into account the overall meaning of the document. For example, it's possible to find recipes containing the terms "recipe", "cake" and "chocolate", while also taking into account cocoa- or spread-based recipes not containing these keywords.

This technique makes it possible to query a very large corpus, while ultimately proposing only the most specific and precise results that correspond to the search.
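
This article does not specify how the two signals are combined. One common option, shown here purely as an assumption, is to normalize the keyword and semantic scores and blend them with a weight. The weight alpha, the function names and the scores below are all hypothetical.

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale scores to the [0, 1] range (all zeros if they are identical)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(keyword_scores: list[float], semantic_scores: list[float],
                  alpha: float = 0.5) -> list[float]:
    """Weighted sum of normalized scores; alpha is the weight of the keyword side."""
    kw = min_max_normalize(keyword_scores)
    sem = min_max_normalize(semantic_scores)
    return [alpha * k + (1 - alpha) * s for k, s in zip(kw, sem)]

# Hypothetical scores for three documents:
# 0 = chocolate cake recipe, 1 = cocoa sponge recipe, 2 = unrelated page.
keyword_scores = [2.1, 0.0, 0.0]       # only document 0 contains the keywords
semantic_scores = [0.95, 0.90, 0.10]   # documents 0 and 1 are close in meaning

combined = hybrid_scores(keyword_scores, semantic_scores)
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

With these made-up numbers, the exact keyword match ranks first, the cocoa-based recipe still surfaces thanks to its semantic score, and the unrelated page stays last.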
