Lecture
Research in information retrieval began more than thirty years ago. During this time, information retrieval has grown from a highly specialized subject into one of the key areas of computer science. A full introduction to its tasks cannot fit within the framework of this work, so in this chapter we only briefly describe the general context of the research.
Information Retrieval Tasks
The central problem of information retrieval is simple to state: help the user find the information he is interested in. Unfortunately, describing the user's information need is not so simple. Usually the description takes the form of a query, that is, a set of keywords characterizing the need.
The classical task of information retrieval, from which the development of this field began, is the search for documents that satisfy the query within a collection of documents that is static at the time of the search. For example, this problem is solved by most modern reference systems, such as the help system of the Windows operating system.
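The classical task can be illustrated with a minimal sketch. It assumes the simplest possible reading of "satisfies the query": a document matches if it contains every keyword. The function names (`tokenize`, `build_index`, `search`) and the sample help topics are hypothetical and serve only as an illustration; real systems use far more elaborate matching and ranking.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map every word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every keyword of the query."""
    keywords = tokenize(query)
    if not keywords:
        return set()
    result = set(index.get(keywords[0], set()))
    for word in keywords[1:]:
        result &= index.get(word, set())
    return result

# A small static collection (made-up help topics).
documents = {
    "help-1": "How to install a printer driver",
    "help-2": "How to change the desktop background",
    "help-3": "Printer troubleshooting: driver not found",
}
index = build_index(documents)
print(sorted(search(index, "printer driver")))  # ['help-1', 'help-3']
```

Because the collection is static, the word-to-documents mapping is built once and every query is answered from it without rereading the documents.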
However, over thirty years of research, the list of information retrieval tasks has expanded considerably and now includes issues of modeling, classifying and clustering documents, designing search engine architectures and user interfaces, query languages, etc.
In addition to the classical task of information retrieval, in this work we also touch on the following tasks:
- Clustering documents. The purpose of clustering documents is to automatically identify groups of semantically similar documents among a given fixed set of documents. Note that groups are formed only on the basis of pairwise similarity of the descriptions of documents, and no characteristics of these groups are set in advance.
- Document classification. Unlike the clustering task, the goal of this task is to determine for each document one or more of the predefined categories to which the document belongs. A feature of the classification task is the assumption that the set of documents to be classified contains no “garbage”, that is, every document corresponds to at least one of the specified categories. A special case of the classification problem is thematic classification. Here each category is a certain subject, and the purpose of the classification is to determine the subject of the document.
- Filtering documents. As in the classification task, the goal of the filtering task is to partition the set of documents into categories. However, there are only two categories: the documents that satisfy the specified criteria and the documents that do not. One of the most important special cases is thematic filtering of documents, i.e., automatic selection of the documents corresponding to a given topic by eliminating all other documents.
Despite some similarities in the formulations of these tasks, they are nonetheless very different. As a result, the methods successfully used to solve one of these tasks often do not show the best results when used to solve another task.
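The difference between the formulations can be seen in a small sketch. It assumes, purely for illustration, that documents are compared by the Jaccard similarity of their word sets, that clustering is single-link with a fixed threshold, and that a topic is given by a short list of words. None of these choices come from the text above; the function names and sample documents are hypothetical.

```python
import re

def words(text):
    """Return the set of lowercase words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Similarity of two word sets: shared words divided by all words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(documents, threshold):
    """Single-link clustering: documents whose pairwise similarity reaches
    the threshold, directly or through a chain, end up in one group."""
    names = list(documents)
    bags = {name: words(documents[name]) for name in names}
    group_of = {name: {name} for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            similar = jaccard(bags[first], bags[second]) >= threshold
            if similar and group_of[first] is not group_of[second]:
                merged = group_of[first] | group_of[second]
                for name in merged:
                    group_of[name] = merged
    unique = {id(group): group for group in group_of.values()}
    return sorted(sorted(group) for group in unique.values())

def filter_by_topic(documents, topic_words, threshold):
    """Keep the documents that contain a large enough share of topic words."""
    topic = set(topic_words)
    if not topic:
        return []
    return sorted(
        name for name, text in documents.items()
        if len(words(text) & topic) / len(topic) >= threshold
    )

documents = {
    "a": "stock market prices fall",
    "b": "stock prices rise on the market",
    "c": "football team wins the cup",
    "d": "the cup final football match",
}
print(cluster(documents, 0.4))  # [['a', 'b'], ['c', 'd']]
print(filter_by_topic(documents, ["stock", "market", "prices"], 0.6))  # ['a', 'b']
```

Clustering receives no description of the groups and derives them from pairwise similarity alone, whereas filtering is given the topic in advance and only decides, document by document, whether it fits.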
Internet search
The rapid growth of information on the Internet makes searching an indispensable method of access to this information. There are two main forms of Internet search:
- The use of search engines that collect information about (part of) the resources available on the Internet and organize search over this information as over a full-text database. Examples of such systems are AltaVista, Google, Yandex, etc.
- The use of Internet directories in which information about selected Internet resources is classified by topic. Such directories exist not only in electronic form (List or Yahoo!) but are also published in print, for example “Yellow pages of the Internet”.
The nature of the Internet determines a number of important factors that must be taken into account when considering search tasks:
- A huge amount of available information
- A high percentage of short-lived (temporary) information
- Uncontrolled quality of information
- Heterogeneity of information
Heterogeneity covers not only the variety of formats in which information is presented, but also the fact that many different languages and even alphabets are used to represent it.
Search engines
The huge amount of information available on the Internet makes search engines an indispensable tool. The number of existing search engines is in the hundreds and most of them belong to one of two classes:
- Multipurpose systems
- Specialized systems
A specialized search engine covers far fewer resources than any popular multipurpose search engine. However, this fact has a number of positive consequences for specialized systems.
- Information not related to the specialization of this search engine is not included in its index.
- It becomes possible to use more computationally intensive search methods.
- Experts in the relevant field can be involved, and users of the system can recommend resources, which improves the quality and completeness of the collection.
Therefore, for a query in the relevant field, a specialized search engine often satisfies the user's information need faster and better.
At the same time, precisely because such search engines are specialized, choosing a particular system for a given search is quite a difficult task. One proposed solution is to search over manually constructed descriptions of the specialized systems. Such an approach is very laborious and does not always work because of the limitations of manually constructed descriptions. The automatic construction of such descriptions is a subject of current research. Note that in the framework of this work we do not consider systems and search methods that take into account information about the data structure, such as methods of working with weakly structured information.
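The idea of choosing a specialized system by searching over manually written descriptions can be sketched as follows. This is only an illustration under simple assumptions: each description is a short list of keywords, and systems are ranked by the number of query words their description contains. The engine names, descriptions, and the function `rank_engines` are all hypothetical.

```python
import re

def words(text):
    """Return the set of lowercase words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_engines(descriptions, query):
    """Order engines by how many query words their description contains.
    Engines whose description shares no word with the query are dropped."""
    query_words = words(query)
    scores = {
        name: len(words(text) & query_words)
        for name, text in descriptions.items()
    }
    return sorted(
        (name for name, score in scores.items() if score > 0),
        key=lambda name: (-scores[name], name),
    )

# Manually written descriptions of three made-up specialized engines.
descriptions = {
    "med-search": "medicine drugs clinical trials diseases",
    "law-search": "legislation court decisions legal codes",
    "chem-search": "chemistry compounds reactions drugs",
}
print(rank_engines(descriptions, "clinical trials of new drugs"))
# ['med-search', 'chem-search']
print(rank_engines(descriptions, "aspirin dosage"))  # []
```

The second query shows the limitation mentioned above: a relevant system is missed whenever its hand-written description does not happen to contain the words of the query.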
Search engine indexes
The most important difference between Internet search engines and classical information retrieval systems is the need to service every request without actually accessing the resources at the time of the request. Otherwise, the system would have to either keep a fresh local copy of all resources (which is too expensive) or visit the resources while executing the request (which is too slow).
Therefore, in Internet search systems all requests are serviced from the contents of the index, which holds descriptions of the resources known to the search engine. The information about available resources that is used to build the index is usually collected by so-called network robots: programs that, starting from a certain web page, recursively traverse Internet resources, extracting links to new resources from the documents they receive.
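The work of a network robot can be sketched in a few lines. To keep the example self-contained, the sketch does not access the network: pages come from a small in-memory dictionary (`FAKE_WEB`, made up for illustration), and the `fetch` parameter stands in for a real download routine. The traversal uses an explicit queue (breadth-first order) rather than literal recursion, and the "description" stored in the index is simply the set of words on the page. All of these are assumptions of the sketch, not a description of any particular search engine.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags and the text of the page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)

def crawl(fetch, start_url, limit=100):
    """Visit pages reachable from start_url; return url -> set of words."""
    index = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(index) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # resource is unavailable, skip it
            continue
        parser = LinkExtractor()
        parser.feed(html)
        parser.close()
        index[url] = set(" ".join(parser.text).lower().split())
        for link in parser.links:
            if link not in seen:  # never schedule the same resource twice
                seen.add(link)
                queue.append(link)
    return index

# A tiny in-memory "web" used instead of real network access.
FAKE_WEB = {
    "/home": '<p>Welcome</p><a href="/news">News</a><a href="/about">About</a>',
    "/news": '<p>Search engines news</p><a href="/home">Home</a>',
    "/about": '<p>About this site</p><a href="/missing">Broken</a>',
}
index = crawl(FAKE_WEB.get, "/home")
print(sorted(index))  # ['/about', '/home', '/news']
```

Once the index is built, requests can be answered from `index` alone, without visiting the pages again, which is exactly the constraint described above.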