Evolution of Modern WebIR
In the mid-1990s, everything changed with the advent of the Web. Web objects form the largest collection of information ever created by humans, and this collection changes continuously as new objects are created and old ones removed. To adapt to this new scenario, a new discipline emerged: Web Information Retrieval [1,2,3]. Modern WebIR [4] exploits some of the classical results of information retrieval while developing innovative models of information access. A recent report showed that 80% of Web surfers discover the new sites they visit through search engines [4,5] (such as Ask, Google, MSN or Yahoo).
1.3.1 Types of Modern WebIR
Information retrieval on the Web can be broadly classified into two technologies:
1. Question Answering Systems (QA): In information retrieval, question answering (QA) is the task of automatically answering a question posed in natural language. To find the answer to a question, a QA programme may use either a pre-structured database or a collection of natural language documents (a text corpus such as the World Wide Web or some local collection). QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval; thus, natural language search engines are sometimes regarded as the next step beyond current search engines. QA research attempts to deal with a wide range of question types, including fact, list, definition, how, why, hypothetical, semantically constrained and cross-lingual questions.
· Closed-domain question answering deals with questions within a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge, frequently formalized in ontologies.
· Open-domain question answering deals with questions about nearly everything, and can rely only on general ontologies and world knowledge. Such systems usually have much more data available from which to extract answers to a given query [5,6].
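The corpus-based QA approach described above can be sketched very simply: match the question's content words against candidate sentences in a document collection and return the best-scoring sentence. The function name, stop-word list and toy corpus below are illustrative assumptions, not any real system's design.

```python
import re

def answer(question, corpus):
    """Naive corpus-based QA: return the sentence sharing the most
    content words with the question (illustrative only)."""
    stop = {"what", "is", "the", "a", "an", "of", "in", "who", "how", "why"}
    q_terms = set(re.findall(r"\w+", question.lower())) - stop
    best, best_score = None, 0
    for doc in corpus:
        # Split each document into candidate answer sentences.
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            terms = set(re.findall(r"\w+", sentence.lower()))
            score = len(q_terms & terms)
            if score > best_score:
                best, best_score = sentence, score
    return best

corpus = ["The Web was created at CERN. Spiders crawl web pages."]
print(answer("Who created the Web?", corpus))
# -> The Web was created at CERN.
```

Real open-domain QA systems replace this word-overlap scoring with much deeper NLP (parsing, answer-type detection, entity extraction), but the retrieve-then-extract shape is the same.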
2. Search Engines: The goal of a modern Web search engine is to retrieve documents “relevant” to a user query from a given collection. Nowadays, a user query is modelled as a set of keywords drawn from a large dictionary of words; a document is typically a web page, PDF, PostScript or doc file, or any other file that can be parsed into a set of tokens. Global search engines serve as de facto Internet portals, while local search engines are embedded in numerous individual websites. Search engines can be broadly categorized as follows, on the basis of their functionality [6]:
a. Crawler-Based Search Engines: Crawler-based search engines use automated software programmes to survey and categorize web pages [7,8]. The programmes the search engines use to access web pages are called ‘spiders’, ‘crawlers’, ‘robots’ or ‘bots’. A spider finds a web page, downloads it and analyses the information it presents, all without human intervention; the page is then added to the search engine’s database. When a user performs a search, the engine checks its database of web pages for the keywords the user searched for and presents a list of result links [8]. The results (a list of suggested links) are listed in order of how ‘close’ each page is, as judged by the engine’s ranking algorithm, to what the user wants to find online. Crawler-based search engines constantly search the Internet for new web pages and update their database of information with these new pages. Examples of crawler-based search engines are Google and AskJeeves.
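The crawl-parse-index-query cycle just described can be sketched with a toy in-memory inverted index. The `pages` dictionary below stands in for downloaded pages, and the boolean-AND `search` function is a deliberate simplification of real ranking; all names and URLs are illustrative.

```python
import re
from collections import defaultdict

# Toy stand-in for pages a spider has already fetched: URL -> page text.
pages = {
    "http://a.example": "web search engines crawl web pages",
    "http://b.example": "directories use human editors",
}

# Indexing step: tokenize each page and record which URLs contain each term.
index = defaultdict(set)  # term -> set of URLs containing it
for url, text in pages.items():
    for token in re.findall(r"\w+", text.lower()):
        index[token].add(url)

def search(query):
    """Return the URLs containing every query keyword (boolean AND)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    results = index[terms[0]].copy()
    for term in terms[1:]:
        results &= index[term]
    return results

print(search("web crawl"))  # only a.example contains both terms
```

A production engine adds ranking (e.g., term weighting and link analysis) on top of this lookup, but the database the spiders build is, at heart, such an inverted index.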
b. Directories: A ‘directory’ uses human editors who decide which category a site belongs to and place websites within specific categories in the directory’s database. The human editors comprehensively check each website and rank it, based on the information they find, using a pre-defined set of rules. Two major directories, (i) the Yahoo! Directory and (ii) the Open Directory, have gained popularity.
c. Hybrid Search Engines: A hybrid search engine uses a combination of both crawler-based results and directory results. More and more search engines these days are moving to a hybrid model. Popular hybrid search engines include Yahoo! and Google.
d. Meta Search Engines: Unlike other search engines, meta search engines gather results from several different search engines and combine them into one large listing. Popular meta search engines include MetaCrawler and Dogpile.
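The combining step a meta search engine performs can be approximated by rank fusion over the per-engine result lists. The sketch below uses reciprocal-rank fusion, one common fusion heuristic (not necessarily what any named engine uses); the engine lists and URLs are hypothetical.

```python
from collections import defaultdict

def merge(rankings, k=60):
    """Reciprocal-rank fusion: combine several ranked result lists into
    one, rewarding URLs that appear near the top of many lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort URLs by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["u1", "u2", "u3"]  # hypothetical results from engine A
engine_b = ["u2", "u3", "u4"]  # hypothetical results from engine B
print(merge([engine_a, engine_b]))
# -> ['u2', 'u3', 'u1', 'u4']
```

Note how u2, ranked well by both engines, beats u1, which only one engine returned; that consensus effect is the main appeal of meta search.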
e. Specialty Search Engines: Specialty search engines have been developed to cater to the demands of niche areas, promising higher relevancy within a given context. All of these specialization approaches share one goal: increased precision with simple queries. There are various dimensions of search specialization, some of which are listed below:
(i) File Format: Specialization is done by the file or MIME type of a document (e.g., Yahoo! Image Search).
(ii) Geography/Language: Major search engines offer language-specific search applications (e.g., google.co.in and google.co.uk).
(iii) Transient Information: Specialization is done by focusing on transient information; Google News and DayPop are examples of search engines with this focus.
(iv) Document Type: Specialization is done by the intent of the document.
(v) User Intent: Specialization is done by looking at the user’s intent (what the user is searching for) at the moment of search. For example, search engines could base specialization on Broder’s [7] classification of queries into informational, navigational and transactional queries.
(vi) Search Context: Specialization is done by applying and
leveraging the search context to augment the query.
There are many specialty search engines dedicated to shopping, e.g., Froogle, Yahoo! Shopping, BizRate and PriceGrabber; some conduct local searches, e.g., NZPages (a New Zealand website directory) and SearchNZ (a New Zealand search engine). Some search engines are used specifically for domain name searches, e.g., iServe and FreeParking. Search engines like Tucows and CNET Download.com are used for freeware- and shareware-related software searches.
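Broder’s informational/navigational/transactional split mentioned under User Intent above is often approximated in practice with simple query-string heuristics. The keyword rules below are illustrative guesses for a sketch of the idea, not Broder’s own method or any engine’s actual classifier.

```python
def classify(query):
    """Crude heuristic query-intent classifier (illustrative only)."""
    q = query.lower()
    # Transactional: the user wants to perform an action, e.g. a purchase.
    if any(w in q for w in ("buy", "download", "price", "order")):
        return "transactional"
    # Navigational: the user wants to reach a particular site.
    if q.endswith((".com", ".org")) or "homepage" in q or "login" in q:
        return "navigational"
    # Informational: the user wants information on a topic.
    return "informational"

for q in ("buy camera", "yahoo.com", "history of the web"):
    print(q, "->", classify(q))
```

A specialty engine could route each class differently, e.g. sending transactional queries to a shopping index and navigational queries straight to a URL lookup.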
1. Ellis D., “A Behavioural Approach to Information Retrieval”, Journal of Documentation, Vol. 46, pp. 191-213, 1989.
2. Pitkow James Edward, “Characterizing World Wide Web Ecologies”, PhD Thesis, Georgia Institute of Technology, 1997.
3. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, “Modern Information Retrieval”, Addison-Wesley, 1999.
4. Singhal Amit (Google, Inc.), “Modern Information Retrieval: A Brief Overview”, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.
5. Torrey C., Churchill E.F., McDonald D.W., “Learning How: The Search for Craft Knowledge on the Internet”, CHI 2009, April 2009, Boston.
6. Sergey Brin, Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, 1998, http://www-db.stanford.edu/pub/papers/google.pdf
7. Belkin N.J., “Helping people find what they don’t know”, Comm. ACM, 43, 8, 58-61, 2000.
8. Sergey Brin and Larry Page, Google search engine, http://google.stanford.edu.
9. Anagnostopoulos A., Broder A. and Carmel D., “Sampling search engine results”, In Proceedings of the 14th International World Wide Web Conference, pages 245-256, Chiba, Japan, 2005.