Title: DEiXTo: A web data extraction suite
Author(s): F. Kokkoras, K. Ntonas, N. Bassiliades.
Availability: PDF (Acrobat Reader) file, 4 pages.
Keywords: web data extraction, web scraping, pattern matching.
Appeared in: 6th Balkan Conference on Informatics (BCI 2013), ACM, pp. 9-12, Thessaloniki, Greece, 19-21 Sep 2013.
Abstract: Web data extraction (or web scraping) is the process of collecting unstructured or semi-structured information from the World Wide Web, at different levels of automation. It is an important, valuable and practical approach towards web reuse, while at the same time it can serve the transition of the web to the semantic web by providing the structured data required by the latter. In this paper we present DEiXTo, a web data extraction suite that provides an arsenal of features aimed at designing and deploying well-engineered extraction tasks. We focus on presenting the core pattern matching algorithm and the overall architecture, which allows programming of custom-made solutions for hard extraction tasks. DEiXTo consists of both freeware and open source components.
See also: DEiXTo

This paper has been cited by the following:

1 Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, Robert Baumgartner, Web data extraction, applications and techniques: A survey, Knowledge-Based Systems, Volume 70, November 2014, Pages 301-323, ISSN 0950-7051, http://dx.doi.org/10.1016/j.knosys.2014.07.007.
2 Alireza Aghamohammadi and Ali Eydgahi. "A New Five-Factor Process for Increasing Cybersecurity and Privacy", in: National Cybersecurity Institute Journal, Jane LeClair (Ed.), Vol. 1, No. 1, Excelsior College, USA, 2014.
3 Kei Kanaoka, Yotaro Fujii, and Motomichi Toyama. 2014. Ducky: a data extraction system for various structured web documents. In Proceedings of the 18th International Database Engineering & Applications Symposium (IDEAS '14), Ana Almeida, Jorge Bernardino, and Elsa Ferreira Gomes (Eds.). ACM, New York, NY, USA, 342-347.