Rasel Kabir
ISOLATING INFORMATIVE BLOCKS FROM LARGE WEB PAGES USING HTML TAG PRIORITY ASSIGNMENT BASED APPROACH
Kabir, Rasel; Kabir, Shaily; Amin, Shamiul
Authors
Shaily Kabir
Shamiul Amin
Abstract
Searching the web for useful information is a popular activity, but pages often carry large amounts of irrelevant content, or noise, that makes the useful information hard to extract. Search engines, crawlers, and information agents often fail to separate relevant information from noise, which underlines the importance of efficient search results. Some earlier research locates noisy data only at the edges of the web page, while other work considers the whole page for noisy data detection. In this paper, we propose a simple priority-assignment based approach for differentiating the main content of a page from the noise. In the proposed technique, we first partition the whole page into a number of disjoint blocks using an HTML tag based technique. Next, we determine a priority level for each block based on the priorities of its HTML tags, using an aggregate priority calculation. This assignment gives each block a priority value, which helps rank the overall search results in online searching. Blocks with higher priority are termed informative blocks and are preserved in a database for future use, whereas blocks with lower priority are considered noisy blocks and are not used in further data searching operations. Our experimental results show considerable improvement in noisy block elimination and in online page ranking, with limited searching time, compared to other known approaches. Moreover, the accuracy obtained from our approach by applying the Naive Bayes text classification method is about 90 percent, which is quite high compared to others.
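The pipeline outlined in the abstract (partition the page into disjoint blocks, aggregate a tag-based priority per block, then separate informative from noisy blocks) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the tag priority values, the block-level tags, the aggregation rule, or the cut-off, so `TAG_PRIORITY`, `BLOCK_TAGS`, the simple sum used as the aggregate, and the `threshold` parameter are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical tag priorities; the abstract does not list the actual values.
TAG_PRIORITY = {"h1": 5, "h2": 4, "p": 3, "b": 2, "a": 1}
# Hypothetical set of tags that open a new block.
BLOCK_TAGS = {"div", "table", "td"}


class BlockScorer(HTMLParser):
    """Splits a page into blocks and sums tag priorities inside each block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks, in closing order
        self._open_blocks = []  # stack of blocks still being filled

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._open_blocks.append({"text": [], "priority": 0})
        elif self._open_blocks:
            # Assumed aggregate: add this tag's priority to the innermost block.
            self._open_blocks[-1]["priority"] += TAG_PRIORITY.get(tag, 0)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._open_blocks:
            block = self._open_blocks.pop()
            block["text"] = " ".join(block["text"])
            self.blocks.append(block)

    def handle_data(self, data):
        text = data.strip()
        if text and self._open_blocks:
            # Text belongs to the innermost open block only, so blocks stay disjoint.
            self._open_blocks[-1]["text"].append(text)


def split_blocks(html, threshold=5):
    """Return (informative, noisy) blocks; assumes block tags are properly closed."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    informative = [b for b in parser.blocks if b["priority"] >= threshold]
    noisy = [b for b in parser.blocks if b["priority"] < threshold]
    # Higher-priority blocks come first, mirroring the ranking idea in the abstract.
    informative.sort(key=lambda b: b["priority"], reverse=True)
    return informative, noisy
```

With these assumed values, a `div` holding a heading and a paragraph would score higher than a `div` holding only navigation links, so `split_blocks` would keep the former as informative and set the latter aside as noisy.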
Citation
Kabir, R., Kabir, S., & Amin, S. (2015). ISOLATING INFORMATIVE BLOCKS FROM LARGE WEB PAGES USING HTML TAG PRIORITY ASSIGNMENT BASED APPROACH. Electrical & Computer Engineering: An International Journal (ECIJ), 4(3),
Journal Article Type | Article |
---|---|
Acceptance Date | Sep 25, 2015 |
Online Publication Date | Sep 25, 2015 |
Publication Date | 2015-09 |
Deposit Date | Nov 21, 2023 |
Journal | Electrical & Computer Engineering: An International Journal (ECIJ) |
Print ISSN | 2088-8708 |
Electronic ISSN | 2722-2578 |
Publisher | Institute of Advanced Engineering and Science |
Peer Reviewed | Peer Reviewed |
Volume | 4 |
Issue | 3 |
Public URL | https://keele-repository.worktribe.com/output/643058 |
Publisher URL | https://zenodo.org/records/3596583 |