Web search engines work by
storing information about many web pages, which they retrieve from the Web itself.
Crawler-based search engines have three major elements. The first is the spider,
also called the crawler. The spider visits a web page, reads it, and then
follows links to other pages within the site [1]. This is what is meant when a
site is said to have been "spidered" or "crawled". The spider returns to the
site on a regular basis, such as every month or two, to look for changes.
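To make the crawling step concrete, the following is a minimal breadth-first spider sketched in Python. It is only illustrative: the seed URL, the 50-page cap and the helper names (LinkExtractor, crawl) are assumptions made for this example, and a production spider would additionally honour robots.txt, throttle its requests and revisit pages on a schedule, as noted above.

# A minimal breadth-first crawler sketch (illustrative only; real spiders
# also honour robots.txt, throttle requests, and revisit pages periodically).
# The seed URL and the 50-page limit are arbitrary choices for this example.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    """Visit pages breadth-first, returning {url: html} for each page fetched."""
    queue, seen, pages = deque([seed_url]), {seed_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue  # skip unreachable pages and unsupported link types
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages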
Everything the spider finds goes into the second part of the search engine, the
index. The index, sometimes called the catalogue, is like a giant book
containing a copy of every web page the spider finds. If a web page changes,
this book is updated with the new information. It can sometimes take a while
for new pages or changes found by the spider to be added to the index, so a web
page may have been "spidered" but not yet "indexed". Until it is indexed (added
to the index), it is not available to users searching through the search
engine [1, 2].
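A common way to organise such an index is as an inverted index, which maps each term to the pages that contain it (the passage above describes the catalogue more loosely as a copy of every page). The toy sketch below assumes pages are supplied as a {url: text} mapping and uses a made-up tokenize helper; a real index also stores term positions, frequencies and page metadata so that updates and lookups remain fast.

# A toy inverted index: maps each term to the set of URLs containing it.
# The tokeniser and the `pages` input format ({url: text}) are assumptions
# made for this sketch.
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Return {term: set of urls} built from {url: page text}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


# Example: indexing two tiny "pages".
index = build_index({
    "http://example.org/a": "web search engines crawl the web",
    "http://example.org/b": "an index is like a giant catalogue",
})
print(index["web"])    # {'http://example.org/a'}
print(index["index"])  # {'http://example.org/b'}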
The third element is the ranking algorithm. Search engines use a ranking
algorithm to determine the order in which matching web pages are returned on
the results page [2]. They build their indices mostly from keyword occurrence
and from link popularity and frequency, and answer user queries against these
indices. Using connectivity-based algorithms, they estimate the quality of each
individual page, so that users receive a ranked list of pages for their query.
The working of a search engine can be summarized in three simple steps:
a. Crawling the Web
b. Matching the keywords with the web pages available in the Web repository
c. Providing results for the user's query
1. Brin S. and Page L., "Google Search Engine", http://google.stanford.edu.
2. Brin S. and Page L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proceedings of WWW, 1998.