GenDE: A CRF-Based Data Extractor

Kayed, Mohammed; Shaalan, Khaled

Date

2020

Authors

Kayed, Mohammed

Shaalan, Khaled

Publisher

River Publishers

Abstract

Website schema detection and data extraction from the Deep Web have been studied extensively. However, few studies have focused on the more challenging tasks of wrapper verification and extractor generation. A wrapper verifier checks whether a new page from a site complies with the detected schema; if it does, the extractor uses the wrapper to obtain instances of the schema types. If the wrapper fails on the new page, a new wrapper/schema is regenerated by calling an unsupervised wrapper induction system. This paper proposes a new data extractor called GenDE, which verifies the site schema and extracts data from Web pages using Conditional Random Fields (CRFs). The problem is solved by breaking an observation sequence (a Web page) down into simpler subsequences that are labeled using a CRF. Moreover, the system solves the problem of automatic data extraction from modern JavaScript sites in which the data/schema are attached (on the client side) in JSON format. The experiments show encouraging results: GenDE outperforms the CSP-based extractor algorithm (95% recall and 96% precision). Moreover, it performs well when tested on the SWDE benchmark dataset (84.91%).
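The verify-then-extract workflow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the names `verify`, `induce`, `extract`, and `process_page` are hypothetical, a "page" is reduced to a list of `label: value` lines, and a "wrapper" is just the expected list of labels. The sketch shows only the control flow (verify, fall back to re-induction, then extract); it does not use CRFs.

```python
from typing import Dict, List, Tuple

Wrapper = List[str]  # toy wrapper: the expected field labels, in order (assumption)
Page = List[str]     # toy page: lines of the form "label: value" (assumption)


def labels_of(page: Page) -> List[str]:
    """Return the label part of every 'label: value' line."""
    return [line.split(":", 1)[0].strip() for line in page]


def verify(wrapper: Wrapper, page: Page) -> bool:
    """Toy verifier: the page complies if its labels match the wrapper exactly."""
    return labels_of(page) == wrapper


def induce(pages: List[Page]) -> Wrapper:
    """Toy stand-in for an unsupervised wrapper induction system."""
    return labels_of(pages[0])


def extract(wrapper: Wrapper, page: Page) -> Dict[str, str]:
    """Pair each wrapper label with the value found on the matching line."""
    values = [line.split(":", 1)[1].strip() for line in page]
    return dict(zip(wrapper, values))


def process_page(wrapper: Wrapper, page: Page) -> Tuple[Wrapper, Dict[str, str]]:
    """Verify the page; regenerate the wrapper on failure; then extract."""
    if not verify(wrapper, page):
        wrapper = induce([page])
    return wrapper, extract(wrapper, page)


wrapper = ["title", "price"]
same_wrapper, record = process_page(wrapper, ["title: Dune", "price: 9.99"])
new_wrapper, new_record = process_page(
    wrapper, ["title: Emma", "author: Austen", "price: 5.00"]
)
```

In this toy run the first page complies, so the wrapper is reused unchanged; the second page has an extra field, so verification fails and a new wrapper is induced before extraction.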

Keywords

Wrapper induction, data extractor, wrapper verifier, sequence labeling, CRFs model, JSON data extraction.

Citation

Kayed, M. and Shaalan, K. (2020) “GenDE: A CRF-Based Data Extractor,” Journal of Web Engineering, 19(3-4), pp. 371–404.

URI

https://bspace.buid.ac.ae/handle/1234/3044

Collections

Professor Khaled Shaalan
