Which scraping solution to use for online media?
The Internet is the largest source of information and data ever constructed by mankind. Exploiting this enormous collection of data automatically is extremely difficult without prior knowledge of the structure of each site. Automated processes such as data classification, keyword extraction or document search require an efficient scraping tool. In our previous work, we compared the existing scraping tools.
At Press' Innov, we felt the need to develop a more accurate scraping tool to meet the needs of the press industry. Some existing scraping solutions are inaccurate because they let through several kinds of information that degrade the quality of the service provided.
As a reminder, this study only concerns scrapers with no prior knowledge of the structure of the sites to be scraped. They must act completely autonomously, without human intervention. Tools that only work with prior knowledge of the site are "text extractors". We are going to be a bit more ambitious than that!
A new scraping tool
Before any learning takes place, it is essential to understand the nature of the texts to be processed. The intra-page information (e.g. word length, link density, etc.) used by classic scraping tools such as Boilerpipe, DiffBot and Goose is often insufficient. We are increasingly encountering dynamic sites with content of varying lengths, where news menus and advertisements have characteristics close to those of the main content of the page (see the screenshot above). At this point, distinguishing between the two types of information (harmful and relevant) becomes extremely complicated.
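To make the intra-page signals concrete, here is a minimal sketch of how word count and link density could be computed per block of a page. It is an illustration only, not the implementation used by Boilerpipe, DiffBot or Goose: the class BlockFeatureParser, the function block_features and the choice of block-level tags are all assumptions made for this example.

```python
from html.parser import HTMLParser


class BlockFeatureParser(HTMLParser):
    """Collects a word count and a link density for each block of an HTML page."""

    # Assumption: these tags delimit a "block" for the purpose of this sketch.
    BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []       # one feature dict per finished block
        self._words = 0        # words seen in the current block
        self._link_words = 0   # words seen inside <a> in the current block
        self._in_link = 0      # nesting depth of open <a> tags

    def flush(self):
        """Close the current block and store its features if it has any text."""
        if self._words:
            self.blocks.append({
                "words": self._words,
                "link_density": self._link_words / self._words,
            })
        self._words = 0
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        count = len(data.split())
        self._words += count
        if self._in_link:
            self._link_words += count


def block_features(html):
    """Return the list of per-block features for an HTML string."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    parser.flush()  # store any trailing text that was not closed by a block tag
    return parser.blocks
```

A navigation menu typically yields a link density close to 1, while article text stays close to 0. An advertisement written as a short paragraph with few links looks much like article text under these two features, which is the limitation described above.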
Inter-page information, gathered across the pages of each site, is a very promising additional source of signal. Observing how harmful information changes, or stays the same, from one page to another makes it possible to extract patterns that can then be used to predict undesirable information.
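The idea of inter-page information can be illustrated with a deliberately simple heuristic: a block of text that recurs on most pages of the same site is probably a menu, footer or advertisement rather than article content. The sketch below is a stand-in chosen for illustration, not the learning method described in this article. The function names and the min_share threshold are assumptions.

```python
from collections import Counter


def normalize(block):
    """Lower-case a block and collapse whitespace so near-identical copies match."""
    return " ".join(block.lower().split())


def recurring_blocks(pages, min_share=0.6):
    """Return the normalized blocks found on at least min_share of the pages.

    pages: list of pages from one site, each page being a list of text blocks.
    min_share: assumed threshold; a real system would tune or learn it.
    """
    if not pages:
        return set()
    seen = Counter()
    for blocks in pages:
        # Count each block at most once per page.
        seen.update({normalize(block) for block in blocks})
    threshold = min_share * len(pages)
    return {block for block, count in seen.items() if count >= threshold}


def strip_recurring(blocks, recurring):
    """Keep only the blocks of a page that are not flagged as recurring."""
    return [block for block in blocks if normalize(block) not in recurring]
```

Exact matching after normalization only catches blocks that are repeated verbatim. Advertisements or menus that vary slightly from page to page would need fuzzier patterns, which is where learning from large-scale data becomes useful.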
Following this logic, our data scientists developed a new scraping tool built on artificial intelligence techniques. We applied learning techniques to large-scale data from several heterogeneous sites to extract discriminating patterns. We then enhanced the tool with statistical methods and some inference rules.
The accuracy, recall and performance of our scraper were compared with other solutions on the market. Significant improvements have been made for our use cases. As a result of these efforts, Press' Innov has its own scraper.
A parallel real-time architecture
Scalability and distribution are two important keys to the quality of a real-time service. The regular updating of our learning services requires an optimised architecture. To achieve this, our development team has implemented a solution based on parallel computing with intelligent cache management ensuring optimal performance.
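As a rough illustration of combining parallel processing with a cache, the sketch below extracts several pages concurrently while loading the patterns of each site only when they are not already cached. The article does not describe the actual architecture, so everything here is an assumption: site_patterns is a hypothetical placeholder for an expensive lookup, and the standard-library lru_cache stands in for whatever cache management is really used.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1024)
def site_patterns(domain):
    """Hypothetical placeholder for loading the learned patterns of one site.

    In a real service this would be an expensive call (model or database),
    which is why its result is cached per domain.
    """
    return frozenset({f"{domain}:menu", f"{domain}:ads"})


def extract(page):
    """Drop the blocks of a page that match the cached patterns of its site."""
    domain, blocks = page
    patterns = site_patterns(domain)
    return [block for block in blocks if f"{domain}:{block}" not in patterns]


def extract_all(pages, workers=4):
    """Process pages in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, pages))
```

Threads suit this sketch because fetching and lookups are mostly I/O-bound. Two threads may occasionally compute the same uncached entry at the same time, which is harmless here but worth handling explicitly if the lookup is costly.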
We are very satisfied with the result. The industries we serve are looking for this quality. We already have several ideas for innovative solutions based on this technology!