
Web Search Technology [Lecture Notes: Information Retrieval]



Web search engines work by storing information about many web pages, which they retrieve from the Web itself. Crawler-based search engines have three major elements.

The first is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site [1]. This is what is meant when someone refers to a site being "spidered" or "crawled". The spider returns to the site on a regular basis, such as every month or two, to look for changes.

Everything the spider finds goes into the second element of the search engine, the index. The index, sometimes called the catalogue, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, the book is updated with the new information. It can take a while for new pages or changes found by the spider to be added to the index, so a web page may have been "spidered" but not yet "indexed". Until it is indexed (added to the index), it is not available to those searching with the search engine [2, 1].

The third element is the ranking algorithm, which determines the order in which matching web pages are returned on the results page [2]. The indices are built mostly from keyword occurrence and frequency together with link popularity, and user queries are answered against these indices. Connectivity-based algorithms then measure the quality of each individual page, so that users receive a ranked list of pages for their queries.

The working of a search engine can be summarized in three simple steps (illustrative sketches for each step follow the list):

a. Crawling the Web
b. Matching the query keywords with the web pages available in the Web repository
c. Providing results for the user's query
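
For step (a), here is a minimal breadth-first crawler sketch in Python. It illustrates only the crawl step, not any search engine's actual spider: the function names and page limit are assumptions, and a real crawler would also honor robots.txt, throttle its requests, and revisit pages periodically to look for changes.

```python
# Minimal crawler sketch (illustrative only): breadth-first traversal
# that follows links from a seed page and keeps a copy of each page.
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of all anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    """Breadth-first crawl from a seed URL; returns a {url: html} repository."""
    frontier = [seed]
    repository = {}                      # the stored copy of each page
    while frontier and len(repository) < max_pages:
        url = frontier.pop(0)
        if url in repository:
            continue                     # already spidered
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip unreachable pages
        repository[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:        # follow links to other pages
            absolute, _ = urldefrag(urljoin(url, link))
            frontier.append(absolute)
    return repository
```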
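For step (b), the index can be sketched as an inverted index: a mapping from each keyword to the set of pages that contain it. This is a minimal illustration under assumed names (build_index, match); a production index would also record term positions, frequencies, and page metadata.

```python
# Inverted index sketch (illustrative only): term -> set of pages.
import re
from collections import defaultdict

def build_index(repository):
    """repository: {url: text}. Returns {term: set of urls containing it}."""
    index = defaultdict(set)
    for url, text in repository.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return index

def match(index, query):
    """Return pages containing every keyword of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Example (placeholder data): pages containing both "web" and "search".
# idx = build_index({"u1": "web search engine", "u2": "web crawler"})
# match(idx, "web search")  -> {"u1"}
```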
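For step (c), the matching pages can be ordered by a connectivity-based quality measure such as PageRank, described in [2]. The sketch below is a simplified power-iteration version; the damping factor 0.85 follows the paper, while the toy link graph and fixed iteration count are assumptions for illustration.

```python
# Simplified PageRank sketch (illustrative only): a page's rank is the
# damped sum of rank shares received from the pages that link to it.
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank score}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}             # start with uniform ranks
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:                             # share rank across out-links
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:                                   # dangling page: spread evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Example: "b" is linked to by both "a" and "c", so it ranks highest.
print(pagerank({"a": ["b"], "b": ["c"], "c": ["b"]}))
```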

1. Brin S. and Page L., "Google Search Engine", http://google.stanford.edu.
2. Brin S. and Page L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine", in Proceedings of WWW, 1998.

