Web search engines work by
storing information about many web pages, which they retrieve from the Web itself.
Crawler-based search engines have three major elements. The first is the spider,
also called the crawler. The spider visits a web page, reads it, and then
follows links to other pages within the site [1]. This is what is meant when a
site is said to have been "spidered" or "crawled". The spider returns to the
site on a regular basis, such as every month or two, to look for changes.
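To make the crawling step concrete, the following is a minimal breadth-first spider sketched in Python. It is only illustrative: the seed URL, the 50-page cap and the helper names (LinkExtractor, crawl) are assumptions made for this example, and a production spider would additionally honour robots.txt, throttle its requests and revisit pages on a schedule, as noted above.

# A minimal breadth-first crawler sketch (illustrative only; real spiders
# also honour robots.txt, throttle requests, and revisit pages periodically).
# The seed URL and the 50-page limit are arbitrary choices for this example.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    """Visit pages breadth-first, returning {url: html} for each page fetched."""
    queue, seen, pages = deque([seed_url]), {seed_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip unreachable pages and unsupported link types
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages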
Everything the spider finds goes into the second part of the search engine, the
index. The index, sometimes called the catalogue, is like a giant book
containing a copy of every web page the spider finds. If a web page changes,
this book is updated with the new information. It can sometimes take a while
for new pages or changes found by the spider to be added to the index, so a web
page may have been "spidered" but not yet "indexed". Until it is indexed (added
to the index), it is not available to users searching through the search
engine [1, 2].
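A common way to organise such an index is as an inverted index, which maps each term to the pages that contain it (the passage above describes the catalogue more loosely as a copy of every page). The toy sketch below assumes pages are supplied as a {url: text} mapping and uses a made-up tokenize helper; a real index also stores term positions, frequencies and page metadata so that updates and lookups remain fast.

# A toy inverted index: maps each term to the set of URLs containing it.
# The tokeniser and the `pages` input format ({url: text}) are assumptions
# made for this sketch.
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Return {term: set of urls} built from {url: page text}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


# Example: indexing two tiny "pages".
index = build_index({
    "http://example.org/a": "web search engines crawl the web",
    "http://example.org/b": "an index is like a giant catalogue",
})
print(index["web"])    # {'http://example.org/a'}
print(index["index"])  # {'http://example.org/b'}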
The third element is the ranking algorithm. Search engines use a ranking
algorithm to determine the order in which matching web pages are returned on
the results page [2]. They build their indices mostly from keyword occurrence
and from link popularity and frequency, and answer user queries against these
indices. Using connectivity-based algorithms, they estimate the quality of each
individual page, so that users receive a ranked list of pages for their query.
The working of a search engine can be summarized in three simple steps:
a. Crawling the Web
b. Matching the keywords with the web pages available in the Web repository
c. Providing results for the user's query
1. Brin S. and Page L., "Google Search Engine", http://google.stanford.edu.
2. Brin S. and Page L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proceedings of WWW, 1998.