GenDE: A CRF-Based Data Extractor

dc.contributor.author: Kayed, Mohammed
dc.contributor.author: Shaalan, Khaled
dc.date.accessioned: 2025-05-14T14:26:00Z
dc.date.available: 2025-05-14T14:26:00Z
dc.date.issued: 2020
dc.description.abstract: Web site schema detection and data extraction from the Deep Web have been studied extensively. However, few studies have focused on the more challenging tasks of wrapper verification or extractor generation. A wrapper verifier checks whether a new page from a site complies with the detected schema, so that the extractor can use the wrapper to get instances of the schema types. If the wrapper fails on the new page, a new wrapper/schema is regenerated by calling an unsupervised wrapper induction system. In this paper, a new data extractor called GenDE is proposed. It verifies the site schema and extracts data from Web pages using Conditional Random Fields (CRFs). The problem is solved by breaking down an observation sequence (a Web page) into simpler subsequences that are labeled using a CRF. Moreover, the system addresses automatic data extraction from modern JavaScript sites in which data/schema are attached (on the client side) in JSON format. The experiments show encouraging results: the system outperforms the CSP-based extractor algorithm (95% and 96% of recall and precision, respectively). Moreover, it performs well when tested on the SWDE benchmark dataset (84.91%).
dc.identifier.citation: Kayed, M. and Shaalan, K. (2020) “GenDE: A CRF-Based Data Extractor,” Journal of Web Engineering, 19(3-4), pp. 371–404.
dc.identifier.doi: https://doi.org/10.13052/jwe1540-9589.19342
dc.identifier.issn: 1540-9589, 1544-5976
dc.identifier.uri: https://bspace.buid.ac.ae/handle/1234/3044
dc.language.iso: en
dc.publisher: River Publishers
dc.relation.ispartofseries: Journal of Web Engineering v19 n3-4 (2020): 371-404
dc.subject: Wrapper induction, data extractor, wrapper verifier, sequence labeling, CRFs model, JSON data extraction
dc.title: GenDE: A CRF-Based Data Extractor
dc.type: Article