Lecture
Even novice users know that there are special systems that "automatically enter text into a computer". From the outside everything looks simple and logical: on the scanned image the system finds fragments in which it "recognizes" letters, then replaces those images with real letters, or rather with their machine codes. This is how the image of a text becomes "real" text that can be edited in a word processor. But how is this achieved?
The company "Bit" developed a special character recognition technology called the "fountain transformation" and, on its basis, a commercial product that has received high marks: the optical recognition system Fine Reader. Today the third version of the product is on the market; it works not only with plain text but also with forms and tables, and the developers are already at work on a fourth version of Fine Reader that will recognize not only printed but also handwritten text.
The fountain transformation rests on the principle of integrity: any perceived object is treated as a whole made up of parts connected by definite relations. A printed page, for example, consists of articles; an article, of a heading and columns; a column, of paragraphs; paragraphs, of lines; lines, of words; and words, of letters. All of these elements are bound together by spatial and linguistic relations. To isolate the whole, one must identify its parts; the parts, in turn, can be interpreted only as parts of the whole. A holistic process of perception can therefore take place only within the framework of a hypothesis about the perceived object, the whole. Once such a hypothesis has been put forward, its parts are singled out and interpreted, and an attempt is made to "assemble" the whole from them in order to verify the initial hypothesis. The perceived object may, of course, itself be interpreted as part of a larger whole: reading a sentence, a person recognizes letters, perceives words, links them into syntactic structures, and grasps the meaning. In a technical system, any recognition decision is ambiguous, so a decision is reached by successively putting forward and testing hypotheses, drawing both on knowledge of the object itself and on the general context.

A holistic description of a class of perceived objects satisfies two conditions: first, every object of the class satisfies the description, and second, no object of any other class does. For example, the class of images of the letter "K" must be described so that every image of "K" falls into it and no image of any other letter does.
Such a description also has the property of displayability, that is, it allows the described objects to be reproduced: the reference letter in an OCR system makes it possible to render the letter visually, the reference word in a speech recognizer makes it possible to pronounce the word, and the description of sentence structure in a parser makes it possible to synthesize a correct sentence. In practical terms, displayability matters enormously because it provides an effective way to control the quality of descriptions. There are two kinds of holistic descriptions: template and structural. In the first, the description is an image in raster or vector form together with a class of admissible transformations (repetition, scaling, and so on). In the second, the description is a graph whose nodes are the constituent elements of the input object and whose arcs are the spatial relations between them; the elements themselves may be complex, that is, have descriptions of their own.
A template description is, of course, easier to implement than a structural one, but it cannot describe objects with a high degree of variability. A template description is therefore suitable only for recognizing printed characters, while a structural description can handle even handwritten ones.
The integrity of perception dictates two important architectural decisions. First, all sources of knowledge should, as far as possible, work simultaneously: it would be wrong, for example, to first recognize the page and only then subject it to dictionary and contextual processing, since that would make feedback from context processing back to recognition impossible. Second, the object under study should be represented and processed as a whole, as far as possible.
The first step of perception is the formation of a hypothesis about the perceived object. The hypothesis can be formed on the basis of an a priori model of the object, the context, and the results of testing previous hypotheses (the "top-down" process), or on the basis of a preliminary analysis of the object ("bottom-up"). The second step is the refinement of perception, that is, hypothesis testing, in which the object is analyzed further within the framework of the hypothesis put forward and the context is brought to bear in full force.
For convenience of perception, the object should be preprocessed without losing essential information about it. Preprocessing usually amounts to transforming the input object into a representation convenient for further work (for example, vectorizing an image), or to generating all plausible segmentations of the input object, from which the correct one is selected by putting forward and testing hypotheses. The process of advancing and testing hypotheses should be explicitly reflected in the program architecture: each hypothesis must be an object that can be evaluated and compared with others. Hypotheses are therefore usually generated one by one, gathered into a list, and sorted by a preliminary score; the context and other additional sources of knowledge are then actively used for the final choice.
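The scheme above (generate hypotheses, score and sort them, let context make the final choice) can be sketched in a few lines. This is a minimal illustration, not the actual Fine Reader architecture; the word hypotheses, scores, and dictionary are invented for the example, with the dictionary standing in for the contextual knowledge source.

```python
# Hypothesis-driven recognition sketch: candidate interpretations are
# scored, sorted, and a contextual source (here a dictionary) picks
# the final answer. All names and scores are illustrative.

def recognize_word(candidates, dictionary):
    """candidates: list of (word, raw_score) hypotheses from the
    character recognizer; dictionary: set of known words acting as
    the contextual knowledge source."""
    # Sort hypotheses by their preliminary score, best first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Contextual check: prefer the best-scoring hypothesis that the
    # dictionary confirms; otherwise fall back to the top raw score.
    for word, score in ranked:
        if word in dictionary:
            return word
    return ranked[0][0]

hypotheses = [("c1ock", 0.93), ("clock", 0.91), ("dock", 0.40)]
print(recognize_word(hypotheses, {"clock", "dock"}))  # clock
```

Note how the feedback the text insists on appears here: the dictionary does not merely post-process a finished answer, it participates in choosing among the recognizer's live hypotheses.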
Today one of the leaders in the field of genetic programming is a group of researchers at Stanford University working under Professor John Koza. Genetic programming has breathed new life into the half-forgotten LISP (List Processing) language, created by John McCarthy's group (the same McCarthy who introduced the term "artificial intelligence" into everyday use) precisely for list processing and functional programming. Incidentally, in the USA this language was and remains one of the most widely used programming languages for artificial intelligence tasks.
Template systems
Such systems convert the image of an individual character to a raster, compare it with every template in their database, and choose the template that differs from the input image in the fewest pixels. Template systems are fairly resistant to image defects and process input quickly, but they reliably recognize only those fonts whose templates they know. If the font being recognized differs even slightly from the reference, template systems can make mistakes even on very high-quality images!
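The idea of "fewest differing pixels" can be shown in miniature. The 3x3 binary "glyphs" below are invented for the example; real systems work with far larger rasters and many transformations, but the matching rule is the same.

```python
# Toy template matching: each glyph is a small binary raster, and the
# input is assigned to the template with the fewest differing pixels.
# The 3x3 "fonts" are invented purely for illustration.

TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
}

def pixel_distance(a, b):
    """Count pixels where the two rasters disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(image):
    return min(TEMPLATES, key=lambda ch: pixel_distance(image, TEMPLATES[ch]))

# An "L" with one noisy pixel still matches the L template...
noisy_L = ((1, 0, 0),
           (1, 0, 1),
           (1, 1, 1))
print(classify(noisy_L))  # L
```

The example also hints at the weakness the text describes: a glyph from an unknown font may land closer to the wrong template, no matter how clean the scan is.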
Structural Systems
In such systems an object is described as a graph whose nodes are the elements of the input object and whose arcs are the spatial relations between them. Systems implementing this approach usually work with vector images. The structural elements are the components of the character's strokes: for the letter "p", for example, a vertical segment and an arc.
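A structural description of "p" as a graph might look like the sketch below. The element names and relation labels are invented for the example; a real system would use a richer vocabulary of elements and fuzzier matching.

```python
# Sketch of a structural description: a character is a graph whose
# nodes are stroke elements and whose edges carry spatial relations.
# Element and relation names here are illustrative only.

letter_p = {
    "nodes": {
        "s1": "vertical_segment",
        "s2": "arc",
    },
    "edges": [
        # the arc sits on the upper right side of the stem
        ("s2", "attached_right_of", "s1"),
        ("s2", "upper_part_of", "s1"),
    ],
}

def matches(description, elements, relations):
    """Crude whole-from-parts test: do the observed elements and
    relations satisfy the structural description?"""
    required = set(description["nodes"].values())
    return required == set(elements) and all(
        edge in relations for edge in description["edges"]
    )

observed = ["vertical_segment", "arc"]
rels = {("s2", "attached_right_of", "s1"), ("s2", "upper_part_of", "s1")}
print(matches(letter_p, observed, rels))  # True
```

This also makes the fragility concrete: if a defect breaks the stem into two segments, the observed element set no longer matches the description, which is exactly the sensitivity discussed next.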
The disadvantages of structural systems include their high sensitivity to image defects that damage the constituent elements; vectorization can introduce further defects of its own. In addition, unlike template and feature systems, no effective automated training procedures have yet been created for them, so the structural descriptions for Fine Reader had to be built by hand.
Feature Systems
Here the averaged image of each symbol is represented as an object in an n-dimensional feature space. An alphabet of features is chosen, and their values are computed for the input image. The resulting n-dimensional vector is compared with the reference vectors, and the image is assigned to the reference it matches best. Feature systems do not satisfy the principle of integrity. A necessary but insufficient condition for the integrity of a class description (in our case, the class of images representing a single character) is that every object of the class, and no object of any other class, satisfies the description. But since a significant part of the information is lost when the features are computed, it is difficult to guarantee that only objects of the class can be assigned to it.
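The nearest-reference rule can be sketched in a couple of lines. The two features here (ink density and aspect ratio) and the reference values are invented for illustration; a real system would use many more features.

```python
# Minimal feature-space classifier: each class is a reference point in
# an n-dimensional feature space, and an input vector is assigned to
# the nearest reference. Features and values are illustrative only.

import math

REFERENCES = {
    "I": (0.30, 3.0),   # little ink, tall and narrow
    "O": (0.55, 1.0),   # more ink, roughly square
}

def classify(features):
    # Euclidean nearest neighbour among the reference vectors.
    return min(
        REFERENCES,
        key=lambda ch: math.dist(features, REFERENCES[ch]),
    )

print(classify((0.50, 1.2)))  # O
```

The loss of integrity is visible here too: any vector whatsoever gets assigned to some class, because the feature summary has discarded everything that might reveal the input is not a character at all.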
Obviously, the problem of recognizing handwritten text is considerably harder than that of printed text. Where printed text presents a limited number of variations of font images (templates), handwriting presents an immeasurably larger number of templates. Further difficulties come from the different proportions of the linear dimensions of the image elements, and so on.
And yet today we can say that the main stages in the development of handwriting recognition technology (for individual characters written by hand) have already been passed. The arsenal of Cognitive Technologies includes technologies for recognizing all the main types of text: stylized digits, typed characters, and hand-printed characters. The hand-printed input technology still has to go through an adaptation stage, after which it will be possible to declare that the toolkit for streaming document input into archives has indeed been fully implemented.
The dynamic development of new computer technologies (networking, client-server technology, and so on) has also left its mark on the electronic document management sector. Where keyboardless input technologies used to be promoted by stressing the advantages of their personal use, today the advantages of collective, rational use of document input and processing technologies come to the fore. Having a single, isolated recognition system is clearly no longer enough. Something must be done with the recognized text files (however well they were recognized): store them in a database, search them, transmit them over a local network, and so on. In short, interaction with an archival or other document management system is required. The recognition system thus turns into a utility for archival and other document management systems.
With the appearance of networked versions of our company's document scanning (the streaming scanning mode of OCR CuneiForm) and recognition (the CuneiForm OCR Server recognition server) systems, some of the advantages of collective use of these technologies have already been realized in organizations of various sizes. For this reason, in our view, it is worth discussing a comprehensive solution by companies to the problem of automating document handling in organizations of the most varied rank. As for Cognitive Technologies, the electronic archive "Евфрат" it offers (the system supports document input via OCR CuneiForm), the new utilities built into OCR CuneiForm'96, and the technologies used in large projects continue the company's line of extending the use of data input systems and developing technologies for automating work with documents.