Things you may not know about Web Information Extraction

How do directories, especially search-engine-style listings, get the right details about the pages in their aggregated site databases? How do they gather web information from those pages so easily? Do they use web information extraction strategies to find the right information? Definitely, yes!

Web information extraction is the single most fundamental activity that web search engines use to build up their information about existing websites on the World Wide Web. The sheer number of websites needs no emphasis to show how extensive web information extraction has become. As long as websites keep appearing everywhere, search engines will always have a reason to apply web information extraction to them.

Web information extraction is essentially the act of gathering useful information, content or meta tags from websites in order to compile a usable listing for public viewing. Directories are not made only for the purpose of earning; they are also developed to give website viewers a convenient source of web information.

Have you ever wondered how major search engines or directories like Yahoo, Google and GoGuides.org gather short descriptions of all the websites in their directories? Search directories use common crawling or listing strategies to pull the right information from the page sources themselves. Manual data extraction and entry, and robot crawling, are two of the best-known procedures for gathering page details. These methods usually target a site’s meta tags, where the title, description and link information are stored.
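To make that concrete, here is a minimal sketch of how the title and description meta tags could be pulled out of a page source. It is only an illustration, not how any particular search engine actually works; the MetaTagParser class name and the example.com URL are made up for this example, and it relies only on Python’s standard library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class MetaTagParser(HTMLParser):
    """Collects the <title> text and <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "description":
            self.description = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Example usage (example.com is just a placeholder URL):
html = urlopen("https://example.com/").read().decode("utf-8", errors="replace")
parser = MetaTagParser()
parser.feed(html)
print("Title:", parser.title.strip())
print("Description:", parser.description.strip())
```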

In the manual procedure, data entry personnel scan each site for its title and description. The sites are then linked into different categories according to their use and relevance, and checked for the quality of both their content and their visual elements. After this process, the listings are made live for public viewing. Manual directory information extraction is quite tedious, yet it produces highly original data listings.

Human-edited directories like DMOZ recruit volunteer web editors to help develop their listings. These editors often write very original site descriptions, which is why many search engines crawl information from DMOZ itself: the editors have put quality work into every listing.

The automated procedure uses techniques similar to what the manual editors do. The “robot site crawler”, or simply “robot”, searches the page source for given fields such as the meta title and meta description, then builds a hierarchy according to the relevance or purpose of each site. The crawler collates all the information it gathers and displays it in the site directory. Everything is done automatically by software that jumps from site to site, listing everything it finds.
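The sketch below shows roughly what such an automated pass might look like: visit a list of sites, extract their meta information, and sort the results into categories. It assumes the hypothetical MetaTagParser class from the earlier sketch is in scope; the seed URLs and the keyword-based category rules are invented purely for illustration and are far cruder than anything a real crawler would use.

```python
from urllib.request import urlopen

# Reuses the hypothetical MetaTagParser defined in the earlier sketch.

SEED_URLS = [            # placeholder list of sites to visit
    "https://example.com/",
    "https://example.org/",
]

CATEGORY_KEYWORDS = {    # crude, made-up relevance rules
    "Shopping": ("shop", "store", "buy"),
    "News": ("news", "daily", "report"),
}

def categorize(description):
    """Assign a category by keyword match; default to 'General'."""
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return "General"

directory = {}
for url in SEED_URLS:
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    except OSError:
        continue                      # skip unreachable sites
    parser = MetaTagParser()
    parser.feed(html)
    entry = {"url": url, "title": parser.title.strip(),
             "description": parser.description.strip()}
    directory.setdefault(categorize(entry["description"]), []).append(entry)

# Display the collated listing, grouped by category.
for category, entries in directory.items():
    print(category)
    for e in entries:
        print(f"  {e['title']} - {e['url']}")
```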

This strategy may seem very easy, yet it produces generic, repetitive site descriptions, and those descriptions also have a tendency to be erroneous.

Robot crawlers are capable of extracting web information, but not of editing it into its correct form. The robots cannot decide whether the information gathered for each site is relevant, or merely supplementary, to the category the web architects are currently building.

Author: KPO
