ISOLATING INFORMATIVE BLOCKS FROM LARGE WEB PAGES USING HTML TAG PRIORITY ASSIGNMENT BASED APPROACH

Kabir, Rasel; Kabir, Shaily; Amin, Shamiul

Abstract

Searching for useful information on the web, a popular activity, often involves large amounts of irrelevant content, or noise, which makes extracting useful information difficult. Indeed, search engines, crawlers and information agents often fail to separate relevant information from noise, which underlines the importance of efficient search results. Some earlier research works locate noisy data only at the edges of the web page, while others consider the whole page for noisy data detection. In this paper, we propose a simple priority-assignment based approach to differentiate the main contents of a page from the noise. In the proposed technique, we first partition the whole page into a number of disjoint blocks using an HTML tag based technique. Next, we determine a priority level for each block based on the priority of its HTML tags, using an aggregate priority calculation. This assignment process gives each block a priority value, which helps rank the overall search results in online searching. In our work, the blocks with higher priority are termed informative blocks and are preserved in a database for future use, whereas lower priority blocks are considered noisy blocks and are not used in further data searching operations. Our experimental results show considerable improvement in noisy block elimination and in online page ranking with limited searching time as compared to other known approaches. Moreover, the accuracy obtained from our approach by applying the Naive Bayes text classification method is about 90 percent, which is high compared to the others.
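
The pipeline outlined in the abstract (partition the page into disjoint blocks, score each block from the priorities of its HTML tags, keep the high-priority blocks) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the tag priorities, the aggregation rule or the cut-off, so `TAG_PRIORITY`, `BLOCK_TAGS`, the use of a plain sum as the aggregate, and the `threshold` value are all assumptions made for illustration. The sketch also assumes well-formed markup in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Hypothetical tag priorities; the abstract does not list the real values.
TAG_PRIORITY = {"h1": 5, "h2": 4, "p": 3, "article": 3, "li": 1,
                "a": -1, "nav": -3, "footer": -3, "script": -5}
# Hypothetical choice of tags that open a block.
BLOCK_TAGS = {"div", "section", "article", "nav", "footer", "header", "table"}
# Tags without a closing tag are ignored so nesting depth stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockSplitter(HTMLParser):
    """Split a page into disjoint blocks: one per outermost BLOCK_TAGS element."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks
        self.current = None   # block being collected, or None between blocks
        self.depth = 0        # nesting depth inside the current block

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.current is None:
            if tag in BLOCK_TAGS:
                self.current = {"tags": [tag], "text": []}
                self.depth = 1
            return
        self.current["tags"].append(tag)
        self.depth += 1

    def handle_endtag(self, tag):
        if self.current is None or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:   # the outermost block element has closed
            self.blocks.append(self.current)
            self.current = None

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.current["text"].append(data.strip())


def block_priority(block):
    """Aggregate priority of a block: here, the sum of its tag priorities (assumed rule)."""
    return sum(TAG_PRIORITY.get(tag, 0) for tag in block["tags"])


def split_informative(html, threshold=3):
    """Return (informative, noisy) lists of (priority, text) pairs."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    scored = [(block_priority(b), " ".join(b["text"])) for b in parser.blocks]
    informative = sorted((s for s in scored if s[0] >= threshold),
                         key=lambda s: s[0], reverse=True)
    noisy = [s for s in scored if s[0] < threshold]
    return informative, noisy
```

Calling `split_informative(page_html)` returns the informative blocks ranked by priority, followed by the blocks treated as noise. In the workflow the abstract describes, only the first list would be stored for later searching.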

Citation

Kabir, R., Kabir, S., & Amin, S. (2015). ISOLATING INFORMATIVE BLOCKS FROM LARGE WEB PAGES USING HTML TAG PRIORITY ASSIGNMENT BASED APPROACH. Electrical & Computer Engineering: An International Journal (ECIJ), 4(3),

Journal Article Type Article
Acceptance Date Sep 25, 2015
Online Publication Date Sep 25, 2015
Publication Date 2015-09
Deposit Date Nov 21, 2023
Journal Electrical & Computer Engineering: An International Journal (ECIJ)
Print ISSN 2088-8708
Electronic ISSN 2722-2578
Publisher Institute of Advanced Engineering and Science
Peer Reviewed Peer Reviewed
Volume 4
Issue 3
Public URL https://keele-repository.worktribe.com/output/643058
Publisher URL https://zenodo.org/records/3596583


Downloadable Citations