Intelligent search based on reasoning
A search engine is a service for finding resources that match a query. Depending on the nature and structure of the data, several description and query languages have been proposed. In this article, we review the main techniques in the literature and position our approach with respect to the state of the art.
Structured and semi-structured information retrieval:
Data structuring organises information to facilitate access (tables, tuples, relations, identifiers, etc.). Several representative models exist in the literature, each with specific techniques to minimise the cost in terms of response time. For example, SQL, dedicated to relational tables, is one of the most widely used query languages in computing.
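To make the idea concrete, here is a minimal sketch of relational querying with SQL, using Python's built-in sqlite3 module. The table and data are purely illustrative, not from the article:

```python
import sqlite3

# Illustrative relational table queried with SQL (in-memory database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO person (name, role) VALUES (?, ?)",
    [("Alice", "student"), ("Bob", "teacher")],
)

# A precise query whose syntax follows the table's structure.
rows = conn.execute("SELECT name FROM person WHERE role = 'teacher'").fetchall()
print(rows)  # [('Bob',)]
```

The structure (columns, types, keys) is exactly what lets the engine answer such queries efficiently.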
With the emergence of the Web and its many exchange channels, the need for a data format that eases exchange between different actors became unavoidable. In this context, several languages were invented to solve the interoperability problem; XML and JSON are certainly among the best known. Thanks to their extensibility and expressive power, these two languages have been widely adopted by the Web community as data formats.
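As a quick illustration of interoperability, the same record can be serialised in either format with the Python standard library (the schema here is invented for the example):

```python
import json
import xml.etree.ElementTree as ET

# One record, two interchange formats (illustrative schema).
record = {"name": "Alice", "role": "student"}

as_json = json.dumps(record)

root = ET.Element("person")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)  # {"name": "Alice", "role": "student"}
print(as_xml)   # <person><name>Alice</name><role>student</role></person>
```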
Finally, to cope with the explosion in data volume, NoSQL technologies have emerged to ensure scalability. They rely mainly on intelligent data organisation and on relaxing integrity constraints, which tend to increase response time.
Text search:
The data structure is relevant information that allows precise queries to be formulated, with a syntax defined according to the elements in the database. Unfortunately, this solution is not applicable to long texts. Textual information retrieval therefore adopts solutions based on a completely different principle. First, a filtering step is applied using techniques from natural language processing (NLP): all the constituent elements of the text are analysed morpho-syntactically in order to eliminate irrelevant information and to normalise the selected words (lemmatisation, etc.), which are then indexed by the system. This last step depends on the representation model used (vector, probabilistic, etc.). The relevance of each keyword is computed according to several parameters specific to each model. For example, the well-known TF-IDF method, implemented by the Lucene API, uses word frequency as the main criterion for assigning weights. Despite its proven performance, TF-IDF shows limitations on short texts, where word frequency is not a discriminating criterion.
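The weighting idea can be sketched in a few lines of Python. This is a minimal textbook variant (tf × log(N/df)), not Lucene's exact scoring formula, and the documents are invented:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Simple TF-IDF weight per (document, term).

    tf  = raw term frequency in the document
    idf = log(N / df), where df = number of documents containing the term
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each doc counts a term at most once
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["search", "engine", "query"],
        ["query", "reasoning"],
        ["reasoning", "rules", "query"]]
w = tf_idf(docs)
# "query" appears in every document, so idf = log(3/3) = 0 and its weight is 0:
# it carries no discriminating power, which is exactly the short-text weakness.
```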
Reasoning-based search:
The two approaches presented above only allow querying when the syntax of the query corresponds exactly to that of the data. To go beyond this limitation, deductive databases have appeared, offering the possibility of defining rules that infer (deduce) new facts from the data already present. Thanks to this mechanism, even if the user's query does not explicitly match the data initially available, the system analyses the query and returns all the information satisfying the defined constraints.
As an example, let's assume that our database contains only instances of type Student and Teacher. The two queries Q1 (find all persons) and Q2 (find all persons who teach in a public institution) return no result. On the other hand, after specifying the reasoning rules R1 (all students are persons), R2 (a teacher is a person who teaches in a public institution) and R3 (a university is a public institution), the system returns all results satisfying the constraints defined by Q1 and Q2. Beyond deduction, reasoning also helps optimise computation time: the query is simplified before execution, which avoids costly joins.
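The deduction behind Q1 can be sketched as a small forward-chaining loop over triples. The facts and rule encodings below are illustrative (R2 is simplified to its "a teacher is a person" part), not an actual deductive-database implementation:

```python
# Facts as (subject, predicate, object) triples; names are invented.
facts = {
    ("alice", "type", "Student"),
    ("bob", "type", "Teacher"),
    ("bob", "teaches_in", "uni_lyon"),
    ("uni_lyon", "type", "University"),
}

def infer(facts):
    """Apply R1-R3 repeatedly until no new fact can be derived (fixpoint)."""
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in facts:
            if p == "type" and o == "Student":     # R1: students are persons
                new.add((s, "type", "Person"))
            if p == "type" and o == "Teacher":     # R2 (simplified): teachers are persons
                new.add((s, "type", "Person"))
            if p == "type" and o == "University":  # R3: universities are public institutions
                new.add((s, "type", "PublicInstitution"))
        if not new <= facts:
            facts |= new
            changed = True
    return facts

closed = infer(set(facts))
# Q1 (find all persons) now matches both alice and bob,
# even though "Person" never appeared in the original data.
persons = {s for (s, p, o) in closed if p == "type" and o == "Person"}
```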
Among the languages for implementing deductive systems, two main categories can be distinguished: logic programming and the Semantic Web. Prolog and Datalog are among the best-known languages in the logic programming category. With the emergence of the Web, however, more expressive languages dedicated to ontology development have appeared. The Web Ontology Language (OWL), based on Description Logic, is an example of these languages recommended by the W3C.
Using reasoning for textual data:
In the previous sections, we presented three types of information retrieval. The ideal solution would exploit all possible dimensions to provide accurate, intelligent answers that allow implicit facts to be deduced. To bring this capability to unstructured textual data, however, we must go through an ontology enrichment phase that detects the different entities composing a text and assigns each one its corresponding category in the ontology. A new informational layer, capable of answering intelligent queries, is thus added to the system.
Consider the query Q3 (find African personalities who have visited France). A standard search engine would only return documents in which the query keywords are explicitly present. To overcome this limitation, we build specific ontologies, enriched from external resources such as Wikipedia. Our queries then become intelligent: the system can deduce that the South African president Jacob Zuma attended COP21, an event that took place in Paris, and therefore that Jacob Zuma belongs to the results returned.
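The chain of deductions behind Q3 can be sketched over a handful of triples. The relation names and helper function below are hypothetical, chosen only to illustrate the attended → took-place-in → located-in path:

```python
# Illustrative knowledge fragment, as an ontology enrichment step might produce.
facts = {
    ("Jacob Zuma", "attended", "COP21"),
    ("COP21", "took_place_in", "Paris"),
    ("Paris", "located_in", "France"),
    ("Jacob Zuma", "nationality", "South Africa"),
}

def visited(person, country, facts):
    """Deduce 'person visited country' via attended -> took_place_in -> located_in."""
    for (p, r1, event) in facts:
        if p == person and r1 == "attended":
            for (e, r2, place) in facts:
                if e == event and r2 == "took_place_in":
                    if place == country or (place, "located_in", country) in facts:
                        return True
    return False

print(visited("Jacob Zuma", "France", facts))  # True
```

No document needs to contain the literal phrase "Jacob Zuma visited France" for him to appear in the results; the fact is inferred.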
At ContentSide, we opt for an innovative, multidimensional solution that combines textual information and Web reasoning, based on ontology enrichment from several resources. This challenge is far from trivial: several methods exist in the state of the art, but no generic method works universally across all domains. It is a complex field, drawing on challenges from several research disciplines such as data mining, NLP, the Semantic Web and information retrieval.
Are you interested in this topic?