Web Crawler Solution
A number of Ness clients use large volumes of aggregated data to populate and update their inventory. This solution automates much of that process, using the web as a data input.
The Ness Connected Labs team set about designing a tool that uses the internet as one massive database, from which data can be automatically extracted and updated. Automatically ‘scraping’ content from the web in real time, to provide up-to-date information at the required frequency and scale, seemed like an idea worth developing. An automated process should free up developers and hand real control to non-technical content editors.
Ness clients that aggregate data (for example TV listings, financial analysis or restaurant menus) face two common problems:
- Manually updated information going stale (and therefore becoming less valuable) over time
- Huge increases in the volume of ‘long-tail’ content as the scale of these businesses increases.
Could we address both of these challenges?
The Labs team researched the business requirement and looked at how the existing process works with our clients in order to set the parameters for the innovation challenge. They benchmarked ‘information aggregation’ tools and crawlers, as well as plugins for web browsers like the ‘inspector’ on Chrome, an ‘in-context’ dev tool. The team then designed and built a tool that automated most of the process.
We used Scrapy as the web crawler framework, Selenium or ScrapyJS middleware for handling dynamic page content, and Frontera as the crawl frontier implementation.
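To illustrate what a crawl frontier does, here is a minimal sketch in plain Python. It is not Frontera's API and not the Labs team's code; the class name `SimpleFrontier` and its methods are assumptions made for illustration. The idea it shows is the frontier's core job: remembering which URLs have already been scheduled and deciding which one to fetch next.

```python
import heapq
from urllib.parse import urldefrag


class SimpleFrontier:
    """Toy crawl frontier (hypothetical): a de-duplicated priority queue of URLs."""

    def __init__(self):
        self._heap = []      # entries are (priority, insertion order, url)
        self._seen = set()   # URLs that have been scheduled at least once
        self._counter = 0    # tie-breaker that preserves insertion order

    def add(self, url, priority=0):
        """Schedule a URL unless it was seen before. Lower priority pops first."""
        url, _fragment = urldefrag(url)  # page#a and page#b count as one page
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the frontier is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A production frontier would add persistence, per-domain politeness and revisit scheduling, which is the kind of work a dedicated frontier component takes off the crawler itself.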
Once the data was identified, it was stored in a combination of HBase and Elasticsearch, with Kafka and Flume acting as the communication layer between Scrapy and the two data stores.
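The hand-off from the crawler to a messaging layer could look roughly like the sketch below. It follows the shape of a Scrapy item pipeline (a plain class with a `process_item(item, spider)` method), but the class name, the `crawled-items` topic and the `publish` callable are all assumptions made for illustration; `publish` stands in for a real Kafka producer so that the example runs without any external service.

```python
import json


class QueuePublishingPipeline:
    """Scrapy-style item pipeline (hypothetical) that forwards items to a queue."""

    def __init__(self, publish, topic="crawled-items"):
        # `publish` is an assumed callable taking (topic, payload_bytes);
        # in a real deployment it would wrap a message-queue producer.
        self.publish = publish
        self.topic = topic

    def process_item(self, item, spider):
        # Serialise the scraped item to JSON bytes and send it downstream.
        payload = json.dumps(dict(item), sort_keys=True).encode("utf-8")
        self.publish(self.topic, payload)
        # Return the item so that any later pipeline stages still receive it.
        return item
```

Keeping the crawler unaware of the storage layer means consumers on the other side of the queue can write to HBase or Elasticsearch at their own pace, and either store can be changed without touching the spiders.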
The web crawler console enables content editors to point to the source website of their choice anywhere on the web. The information is checked for accuracy, relevance and format and then displayed where it’s needed. If the source of information disappears or changes format, the content editor receives an alert and can make an adjustment (find a better source) through the console.
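The check-and-alert behaviour described above might be sketched as follows. This is an assumption about how such a check could work, not the console's actual logic: the function names, the `required_fields` parameter and the alert wording are all hypothetical. An empty extraction is treated as a sign that the source has disappeared or changed format, and blank fields are reported per record.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    ok: bool
    problems: list = field(default_factory=list)


def check_record(record, required_fields):
    """Flag required fields that are absent or blank in one extracted record."""
    problems = []
    for name in required_fields:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            problems.append(f"missing field: {name}")
    return CheckResult(ok=not problems, problems=problems)


def alerts_for_source(source_url, records, required_fields):
    """Build alert messages for a content editor; an empty list means all is well."""
    if not records:
        # Nothing extracted at all: the page is probably gone or restructured.
        return [f"{source_url}: no records extracted; "
                "the source may have moved or changed format"]
    alerts = []
    for index, record in enumerate(records):
        result = check_record(record, required_fields)
        if not result.ok:
            alerts.append(f"{source_url}: record {index}: "
                          + ", ".join(result.problems))
    return alerts
```

For example, `alerts_for_source("http://example.com/listings", [], ["title"])` yields a single alert suggesting the editor review or replace the source.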
The web crawler can be used in many similar situations in which aggregation of information is required and the process of updating it is inefficient.