Lecture
1 Data Analysis: Introduction to Information Retrieval
2 Plan of the remaining lectures: 1. Introduction to information retrieval; 2. Normalization and extraction of information from text and images; 3. Ranking of results; 4. Web spiders (crawlers); 5. Types of search engine deception and protection against it.
3 Plan of the lecture: 1. Definition and tasks of IR; 2. Problems of information retrieval; 3. The main components of a search engine. Main goal: to get an idea of the tasks and problems of information retrieval systems.
4 Information Retrieval. Information retrieval (IR) is the process of finding, in a large collection of unstructured material, the items that satisfy an information need. Main objective: to quickly return a correct and up-to-date response to a request from an untrained user.
5 Structured and unstructured data. Which of these data sources are unstructured? Server log files; Wikipedia; a photo gallery; a billing database; a web page; weather statistics; YouTube.
6 Requirements: composing a request requires no special skills; results arrive quickly; results are complete and accurate.
7 Precision and Recall. Precision (accuracy): the proportion of relevant documents among those found. Recall (completeness): the proportion of relevant documents found among all relevant documents. Compare: a student in an exam and a private detective.
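The two measures on this slide (usually called precision and recall in the IR literature) can be computed directly from two sets of document identifiers. The following is a minimal sketch; the document ids and the function name `precision_recall` are illustrative assumptions, not part of the lecture.

```python
def precision_recall(found, relevant):
    """Return (precision, recall) for found vs. relevant document ids."""
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)  # relevant documents that were actually found
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents returned, 5 documents truly relevant.
found = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7"}
print(precision_recall(found, relevant))  # (0.5, 0.4)
```

Returning fewer, safer results tends to raise the first number; returning everything that might match tends to raise the second.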
8 Search Engine Problems: finding information sources; processing information sources; indexing data; storing data; processing requests; ranking.
9 Search for sources: links from other sites; lists of new domains from registrars; scanning the entire IP range on ports 80, 8080, and 443.
10 Processing of sources. Problems: different languages; different encodings; incorrect markup. Good page: UTF-8, on-page SEO. Best page: semantic web. Reality: KOI8-R, Narod.ru, doorway pages, invisible text.
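The encoding problem can be illustrated with a very small heuristic: try strict UTF-8 first and fall back to a legacy encoding such as KOI8-R when the bytes are not valid UTF-8. This is only a sketch under that assumption; the function name `decode_page` is hypothetical, and real systems typically also consult HTTP headers, meta tags, and statistical detectors.

```python
def decode_page(raw, fallback="koi8_r"):
    """Decode page bytes: strict UTF-8 first, then a legacy fallback encoding."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # KOI8-R maps every byte to a character, so this step never fails;
        # it may still produce the wrong text if the guess is wrong.
        return raw.decode(fallback, errors="replace"), fallback

legacy_bytes = "Привет".encode("koi8_r")   # simulated legacy page content
print(decode_page(legacy_bytes))            # ('Привет', 'koi8_r')
```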
11 Index Building. A number of linguistic tasks: identifying words that share a root; synonyms; analysis of the relationships between words in a sentence. A set of mathematical tasks: metrics of a word's importance within a given document and within the collection as a whole.
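One widely used textbook metric of a word's importance in a document relative to the whole collection is TF-IDF. The slide does not name a specific metric, so treating TF-IDF as the example is an assumption; the sketch below uses the plain variant tf × log(N / df) on already tokenized documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights

# Hypothetical three-document collection.
docs = [
    "search engine index".split(),
    "search engine ranking".split(),
    "weather statistics".split(),
]
for w in tf_idf(docs):
    print(w)
```

A term that appears in every document gets weight 0, while a term that is frequent in one document and rare elsewhere gets a high weight.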
12 Ranking. Hundreds of parameters that depend on the query conditions (time, place, language). The metrics that determine the credibility of information sources (Google PR, Yandex CY, Alexa ranking, etc.) are complex, secret mathematical models. Only recommendations for improving a rating are available.
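Although production ranking models are not public, the basic idea behind link-based authority can be shown with the simplified, textbook form of PageRank. The sketch below is an illustration of that published idea only, not a description of what any search engine runs today; the link graph, damping factor, and iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            # A page without outgoing links spreads its score over all pages.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

# Hypothetical graph: a and b both link to c, c links back to a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(scores, key=scores.get, reverse=True))  # ['c', 'a', 'b']
```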
13 General structure of a search engine (diagram labels): Internet pages; crawler; indexer; cache; normalized documents; request; request handler; search engine; ranking; results.
14 Crawler. Compiles the list of pages to be searched. Follows links like a regular user. Respects robots.txt (to protect private data). Sends links to the indexer.
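The crawler's two core steps, extracting links from a page and filtering them through robots.txt, can be sketched with the Python standard library alone. The page content, the robots.txt text, and the agent name `ExampleBot` below are made up for illustration, and no network access is performed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def allowed_links(html, base_url, robots_txt, agent="ExampleBot"):
    """Return the links on the page that robots.txt permits the agent to fetch."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    extractor = LinkExtractor(base_url)
    extractor.feed(html)
    return [url for url in extractor.links if robots.can_fetch(agent, url)]

robots_txt = "User-agent: *\nDisallow: /private/\n"
html = '<a href="/news">News</a> <a href="/private/billing">Billing</a>'
print(allowed_links(html, "http://example.com/", robots_txt))
# ['http://example.com/news']
```

The surviving links would then be queued for fetching and handed on to the indexer.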
15 Indexer. Analyzes the content of a page. Detects the page's layout (modular grid). Compiles an index of the page: the words and their meaning. Converts the page content from a sheet of plain text into a form convenient for searching.
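A common "form convenient for searching" is an inverted index, which maps each word to the documents (and positions) where it occurs. The slide does not specify the data structure, so the inverted index here is an assumption, and the tokenizer is deliberately naive (lowercasing and splitting on word characters, with no stemming).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Naive normalization: lowercase and split on word characters."""
    return re.findall(r"\w+", text.lower())

def build_index(pages):
    """pages: {url: text}. Returns {term: {url: [positions]}}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(url, []).append(pos)
    return index

# Hypothetical two-page collection.
pages = {
    "http://example.com/a": "Search engine search",
    "http://example.com/b": "Engine room",
}
index = build_index(pages)
print(index["search"])          # {'http://example.com/a': [0, 2]}
print(sorted(index["engine"]))  # both URLs
```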
16 Request handler. Corrects typos. Adds synonyms. Analyzes homonyms. Removes stop words. Defines the rules for processing the list of results (AND, OR, NOT). Converts the query to the same form the normalizer produces.
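Several of the request handler steps can be sketched in a few lines: normalization, stop-word removal, typo correction against a known vocabulary, and synonym expansion. The vocabulary, stop-word list, and synonym dictionary below are tiny hypothetical stand-ins, and homonym analysis is left out. The result is a list of OR-groups that are meant to be combined with AND.

```python
import difflib
import re

VOCABULARY = ["search", "engine", "car", "price"]    # hypothetical index vocabulary
STOP_WORDS = {"the", "a", "an", "of", "in", "for"}   # illustrative stop-word list
SYNONYMS = {"car": {"automobile"}}                   # hypothetical synonym dictionary

def fix_typo(term):
    """Replace an unknown term with the closest vocabulary word, if one is close."""
    if term in VOCABULARY:
        return term
    close = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
    return close[0] if close else term

def handle_query(query):
    """Return a list of OR-groups (sets of terms) to be combined with AND."""
    words = re.findall(r"\w+", query.lower())
    terms = [fix_typo(w) for w in words if w not in STOP_WORDS]
    return [{t} | SYNONYMS.get(t, set()) for t in terms]

print(handle_query("The serch for a CAR"))
```

Here "serch" is corrected to "search", the stop words are dropped, and "car" is expanded with its synonym.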
17 The search itself. Checks whether results for this query already exist. Looks up the query in the index. Applies Boolean rules to the results of the subqueries.
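These three steps map naturally onto a cache lookup followed by set operations on posting lists. The sketch below assumes a toy in-memory index of document ids; the names `INDEX`, `ALL_DOCS`, and `search` are hypothetical.

```python
# Hypothetical index: term -> set of document ids containing it.
INDEX = {
    "search": {1, 2, 3},
    "engine": {2, 3},
    "image": {3, 4},
}
ALL_DOCS = {1, 2, 3, 4}
_cache = {}

def search(must=(), any_of=(), must_not=()):
    """Boolean retrieval: AND over `must`, OR over `any_of`, NOT over `must_not`."""
    key = (tuple(sorted(must)), tuple(sorted(any_of)), tuple(sorted(must_not)))
    if key in _cache:                       # results already computed for this query
        return _cache[key]
    result = set(ALL_DOCS)
    for term in must:                       # AND: intersect posting sets
        result &= INDEX.get(term, set())
    if any_of:                              # OR: union of posting sets
        result &= set().union(*(INDEX.get(t, set()) for t in any_of))
    for term in must_not:                   # NOT: subtract posting sets
        result -= INDEX.get(term, set())
    _cache[key] = result
    return result

print(search(must=["search", "engine"], must_not=["image"]))  # {2}
```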
18 Evaluating search quality. If the user returns to the results page after clicking a link, they were not satisfied. If they do not return, raise the authority of the last page they opened. Social networks plus source recommendations.
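The click-based rule on this slide can be sketched as a tiny update function. The session format, the `authority` table, and the size of the boost are all assumptions made for illustration; the slide gives only the rule itself.

```python
from collections import defaultdict

authority = defaultdict(float)  # hypothetical per-page authority score

def record_session(clicks, boost=1.0):
    """clicks: list of (url, returned_to_results) pairs in click order."""
    if not clicks:
        return
    last_url, returned = clicks[-1]
    if not returned:
        # The user stayed on the last page, so treat it as a satisfying result.
        authority[last_url] += boost

record_session([("http://a.example/", True), ("http://b.example/", False)])
record_session([("http://a.example/", True)])
print(dict(authority))  # {'http://b.example/': 1.0}
```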
19 Types of search engines: classic text search engines; reverse image search; reverse music search; specialized search engines (goods, real estate, cars); search results aggregators; voice search.
20 Reverse image search. Compiles an index of images rather than text, compressing each image to the minimum needed for later comparison with the query picture.
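One well-known way to compress an image to a compact comparable signature is an average hash: shrink the image to a tiny grayscale grid, mark each pixel as brighter or darker than the mean, and compare signatures by Hamming distance. The slide does not say which technique is used, so this is an assumed example; the downscaling step is omitted and the 2x2 grids stand in for already reduced images.

```python
def average_hash(pixels):
    """pixels: 2D list of grayscale values of an already downscaled image."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]

def hamming(a, b):
    """Number of positions where two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))

original = [[10, 200], [220, 30]]
brighter = [[30, 220], [240, 50]]    # same picture, uniformly brighter
different = [[200, 10], [30, 220]]   # unrelated picture

print(hamming(average_hash(original), average_hash(brighter)))   # 0
print(hamming(average_hash(original), average_hash(different)))  # 4
```

A small distance suggests the query picture and the indexed image are near-duplicates.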
21 Reverse music search. The same approach applied to music (e.g., Tunatic).
22 Search results aggregator. Searches for an answer to the question, not for a page that contains the answer. It also uses third-party indexes and results.
23 Other specialized search engines. The query input field is slightly modified.
24 Other specialized search engines (examples): Yandex.Market (product search); Google Maps (search for places and addresses); Koders (code search).