GenDE: A CRF-Based Data Extractor
Date
2020
Publisher
River Publishers
Abstract
Web site schema detection and data extraction from the Deep Web have been
studied extensively. However, few studies have focused on the more challenging
tasks of wrapper verification and extractor generation. A wrapper verifier
checks whether a new page from a site complies with the detected schema; if
it does, the extractor uses the wrapper to get instances of the schema types.
If the wrapper fails on the new page, a new wrapper/schema is regenerated by
calling an unsupervised wrapper induction system. In this paper, a new data
extractor called GenDE is proposed. It verifies the site schema and extracts
data from Web pages using Conditional Random Fields (CRFs). The problem is
solved by breaking down an observation sequence (a Web page) into simpler
subsequences that are labeled using CRFs. Moreover, the system addresses
automatic data extraction from modern JavaScript sites in which data/schema
are attached (on the client side) in JSON format. The experiments show
encouraging results: GenDE outperforms the CSP-based extractor algorithm
(95% recall and 96% precision). It also performs well when tested on the
SWDE benchmark dataset (84.91%).
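
To make the JSON case concrete, the following is a minimal sketch of pulling client-side JSON out of a page and flattening it into path/value pairs. It is not GenDE's algorithm, which the abstract does not detail. It assumes the data sits in `<script>` elements typed `application/json` or `application/ld+json`, and the names `ScriptJSONCollector`, `flatten`, and `extract_embedded_json` are hypothetical.

```python
import json
from html.parser import HTMLParser


class ScriptJSONCollector(HTMLParser):
    """Collects parsed JSON found inside JSON-typed <script> elements.

    Assumption: the page embeds its data in <script type="application/json">
    or <script type="application/ld+json">; other embeddings are ignored.
    """

    JSON_TYPES = {"application/json", "application/ld+json"}

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") in self.JSON_TYPES:
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.payloads.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip script bodies that are not valid JSON


def flatten(value, path=""):
    """Yield (path, leaf_value) pairs for a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{path}[{index}]")
    else:
        yield path, value


def extract_embedded_json(html):
    """Return one flat {path: value} dict per embedded JSON payload."""
    collector = ScriptJSONCollector()
    collector.feed(html)
    collector.close()
    return [dict(flatten(payload)) for payload in collector.payloads]
```

The flattened paths (for example `offers[0].price`) could then serve as candidate attribute names when comparing pages from the same site. That use is an illustration, not a claim about the paper's method.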
Keywords
Wrapper induction, data extractor, wrapper verifier, sequence
labeling, CRFs model, JSON data extraction.
Citation
Kayed, M. and Shalaan, K. (2020) “GenDE: A CRF-Based Data Extractor,” Journal of Web Engineering, 19(3-4), pp. 371–404.