Modern Web IR

Evolution of Modern WebIR

In the mid-1990s, the rapid growth of the World Wide Web changed everything. The Web is the largest collection of information ever created by humans, and it changes continuously as new objects are created and old ones are removed. To adapt to this new scenario, a new discipline emerged: Web Information Retrieval [1,2,3]. It reuses some concepts of traditional IR and introduces many new ones. Modern WebIR [4] builds on classical results of information retrieval to develop new models of information access. One report found that 80% of Web surfers discover the new sites they visit through search engines [4,5] such as Ask, Google, MSN or Yahoo.

1.3.1 Types of Modern WebIR
Information retrieval on the Web can be broadly classified into two technologies:

1. Question Answering Systems (QA): In information retrieval, question answering (QA) is the task of automatically answering a question posed in natural language. To find the answer, a QA programme may use either a pre-structured database or a collection of natural language documents (a text corpus such as the World Wide Web or some local collection). Question answering is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval, such as document retrieval. For this reason, natural language search engines are sometimes regarded as the next step beyond current search engines. QA research attempts to deal with a wide range of question types, including fact, list, definition, how, why, hypothetical, semantically constrained, and cross-lingual questions.
· Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies.
· Open-domain question answering deals with questions about nearly everything, and can only rely on general ontology and world knowledge. These systems usually have much more data available for extracting answers to a given query [5,6].
2. Search Engines: The goal of a modern Web Search Engine is to retrieve documents “relevant” to a user query from a given collection. A user query is typically modelled as a set of keywords drawn from a large dictionary of words; a document is typically a web page, a PDF, PostScript or doc file, or any other file that can be parsed into a set of tokens. Global Search Engines serve as de facto Internet portals, while local search engines are embedded in numerous individual websites. Search engines can be broadly grouped into the following categories on the basis of their functionality [6]:

a. Crawler-Based Search Engines: Crawler-based Search Engines use automated software programmes to survey and categorize web pages [7,8]. The programmes that search engines use to access web pages are called ‘spiders’, ‘crawlers’, ‘robots’ or ‘bots’. A spider finds a web page, downloads it and analyses the information presented on it, all without human intervention. The web page is then added to the search engine’s database. When a user performs a search, the search engine checks its database of web pages for the keywords the user searched for and presents a list of links as results [8]. The results are listed in order of how closely each page matches what the user wants to find, as judged by the search engine. Crawler-based Search Engines constantly search the Internet for new web pages and update their databases with these pages. Examples of Crawler-based Search Engines are Google and Ask Jeeves.

b. Directories: A ‘directory’ relies on human editors who decide which category a site belongs to and place it within that category in the directory’s database. The editors check the website thoroughly and rank it, based on the information they find, using a pre-defined set of rules. Two major directories have gained popularity: (i) Yahoo! Directory and (ii) Open Directory.

c. Hybrid Search Engines: A Hybrid Search Engine uses a combination of crawler-based results and directory results. More and more search engines are moving to a hybrid model. Popular Hybrid Search Engines include Yahoo! and Google.
d. Meta Search Engines: Unlike other search engines, Meta Search Engines gather results from several different search engines and combine them into one large listing. Popular Meta Search Engines include ‘MetaCrawler’ and ‘Dogpile’.
e. Specialty Search Engines: Specialty Search Engines have been developed to cater for the demands of niche areas. The promise is of higher relevancy within the given context. Overall, all of these specialization approaches have one goal in common: increased precision with simple queries. There are various dimensions for search specialization. Some of them are listed below:

(i) File format: Specialization is done by the file or MIME type of a document (e.g. Yahoo! Image Search).
(ii) Geography/Language: Major search engines offer country- and language-specific search applications (e.g. google.co.in and google.co.uk).
(iii) Transient Information: Specialization is done by focusing on transient information. Google News and Daypop are examples of search engines which focus on transient information.
(iv) Document Type: Specialization is done by the intent of the document.
(v) User Intent: Specialization is done by looking at the user’s intent (what the user is searching for) at the moment of search. For example, search engines could base specialization on Broder’s [7] classification of queries into informational, navigational and transactional queries.
(vi) Search Context: Specialization is done by applying and leveraging the search context to augment the query.
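
As a rough illustration of dimension (v), the three query classes (informational, navigational, transactional) can be approximated with simple keyword cues. The sketch below is only a toy heuristic: the cue-word sets and the function name `classify_intent` are assumptions made for this example, not part of Broder's classification or of any real search engine.

```python
# Toy heuristic for labelling a query as informational, navigational or
# transactional. The cue lists below are illustrative assumptions only.
TRANSACTIONAL_CUES = {"buy", "download", "order", "price", "cheap", "booking"}
NAVIGATIONAL_CUES = {"homepage", "login", "website", "www"}
DOMAIN_SUFFIXES = (".com", ".org", ".net")


def classify_intent(query: str) -> str:
    """Return a coarse intent label for a keyword query."""
    tokens = query.lower().split()
    # A query that names a site or looks like a URL is treated as navigational.
    if any(t in NAVIGATIONAL_CUES or t.endswith(DOMAIN_SUFFIXES) for t in tokens):
        return "navigational"
    # A query containing an action or purchase cue is treated as transactional.
    if any(t in TRANSACTIONAL_CUES for t in tokens):
        return "transactional"
    # Everything else falls back to the informational class.
    return "informational"


if __name__ == "__main__":
    for q in ["yahoo.com", "buy digital camera", "history of information retrieval"]:
        print(q, "->", classify_intent(q))
```

A production system would more plausibly learn such labels from query logs and click behaviour; the fixed cue lists here only make the three categories concrete.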

There are many specialty search engines dedicated to shopping, e.g., Froogle, Yahoo! Shopping, BizRate and PriceGrabber; some are used to conduct local searches, e.g., NZPages (a New Zealand website directory) and SearchNZ (a New Zealand search engine). Some search engines are used specifically for domain name searches, e.g., iServe and Freeparking. Search engines like Tucows and CNET Download.com are used for freeware and shareware software searches.
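
The crawler-based pipeline described under item (a), where pages are analysed, stored in a database, and then matched against the user's keywords, can be sketched with a small inverted index. This is a minimal sketch under stated assumptions: the pages are already downloaded and passed in as plain text, "closeness" is approximated by raw term counts, and the names `tokenize`, `build_index` and `search` are hypothetical helpers for this example rather than any engine's actual design.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: term -> {url: number of occurrences}.

    `pages` maps a URL to page text that has already been fetched.
    """
    index = defaultdict(dict)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term][url] = index[term].get(url, 0) + 1
    return index


def search(index, query):
    """Return URLs ordered by summed term counts (ties broken by URL)."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for url, count in index.get(term, {}).items():
            scores[url] += count
    return sorted(scores, key=lambda u: (-scores[u], u))


if __name__ == "__main__":
    pages = {
        "a": "web search engines crawl the web",
        "b": "cooking recipes",
        "c": "search tips",
    }
    index = build_index(pages)
    print(search(index, "web search"))
```

Real engines replace the raw counts with far richer ranking signals (link analysis, term weighting, freshness), but the store-then-look-up structure is the same idea the text describes.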


1. Ellis D., “Behavioural Approach to Information Retrieval”, Journal of Documentation, Vol. 46, pp. 191-213, 1989.
2. Pitkow J.E., “Characterizing World Wide Web Ecologies”, Thesis, Georgia Institute of Technology, 1997.
3. Baeza-Yates R. and Ribeiro-Neto B., “Modern Information Retrieval”, Addison-Wesley, 1999.
4. Singhal A. (Google, Inc.), “Modern Information Retrieval: A Brief Overview”, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.
5. Torrey C., Churchill E.F. and McDonald D.W., “Learning How: The Search for Craft Knowledge on the Internet”, CHI 2009, April 3rd, 2009, Boston.
6. Brin S. and Page L., “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, 1998, http://www-db.stanford.edu/pub/papers/google.pdf
7. Belkin N.J., “Helping people find what they don’t know”, Comm. ACM, 43, 8, 58-61, 2000.
8. Brin S. and Page L., Google search engine, http://google.stanford.edu.
9. Anagnostopoulos A., Broder A. and Carmel D., “Sampling search engine results”, In Proceedings of 14th International World Wide Web Conference, pages 245–256, Chiba, Japan, 2005.
 
