Lecture
Even novice users know that there are special systems that "automatically enter text into a computer". From the outside everything looks simple and logical: on the scanned image the system finds fragments in which it "recognizes" letters, then replaces those images with real letters, or rather with their machine codes. This is how the image of a text becomes "real" text that can be edited in a word processor. But how is this achieved?
The company "Bit" developed a special character recognition technology called the "fountain transformation" and, on its basis, a commercial product that has received high marks: the optical recognition system Fine Reader. Today the third version of the product is on the market; it works not only with plain text but also with forms and tables, and the developers are already at work on a fourth version of Fine Reader that will recognize not only printed but also handwritten text.
The fountain transformation rests on the principle of integrity: any perceived object is treated as a whole made up of parts connected by definite relations. A printed page, for example, consists of articles; an article, of a heading and columns; a column, of paragraphs; paragraphs, of lines; lines, of words; and words, of letters. All of these elements are bound together by spatial and linguistic relations. To isolate the whole, one must identify its parts; the parts, in turn, can be interpreted only as parts of the whole. A holistic process of perception can therefore take place only within the framework of a hypothesis about the perceived object, the whole. Once such a hypothesis has been put forward, its parts are singled out and interpreted, and an attempt is made to "assemble" the whole from them in order to verify the initial hypothesis. The perceived object may, of course, itself be interpreted as part of a larger whole: reading a sentence, a person recognizes letters, perceives words, links them into syntactic structures, and grasps the meaning. In a technical system, any recognition decision is ambiguous, so a decision is reached by successively putting forward and testing hypotheses, drawing both on knowledge of the object itself and on the general context.

A holistic description of a class of perceived objects satisfies two conditions: first, every object of the class satisfies the description, and second, no object of any other class does. For example, the class of images of the letter "K" must be described so that every image of "K" falls into it and no image of any other letter does.
Such a description also has the property of displayability, that is, it allows the described objects to be reproduced: the reference letter in an OCR system makes it possible to render the letter visually, the reference word in a speech recognizer makes it possible to pronounce the word, and the description of sentence structure in a parser makes it possible to synthesize a correct sentence. In practical terms, displayability matters enormously because it provides an effective way to control the quality of descriptions. There are two kinds of holistic descriptions: template and structural. In the first, the description is an image in raster or vector form together with a class of admissible transformations (repetition, scaling, and so on). In the second, the description is a graph whose nodes are the constituent elements of the input object and whose arcs are the spatial relations between them; the elements themselves may be complex, that is, have descriptions of their own.
A template description is, of course, easier to implement than a structural one, but it cannot describe objects with a high degree of variability. A template description is therefore suitable only for recognizing printed characters, while a structural description can handle even handwritten ones.
The integrity of perception dictates two important architectural decisions. First, all sources of knowledge should, as far as possible, work simultaneously: it would be wrong, for example, to first recognize the page and only then subject it to dictionary and contextual processing, since that would make feedback from context processing back to recognition impossible. Second, the object under study should be represented and processed as a whole, as far as possible.
The first step of perception is the formation of a hypothesis about the perceived object. The hypothesis can be formed on the basis of an a priori model of the object, the context, and the results of testing previous hypotheses (the "top-down" process), or on the basis of a preliminary analysis of the object ("bottom-up"). The second step is the refinement of perception, that is, hypothesis testing, in which the object is analyzed further within the framework of the hypothesis put forward and the context is brought to bear in full force.
For convenience of perception, the object should be preprocessed without losing essential information about it. Preprocessing usually amounts to transforming the input object into a representation convenient for further work (for example, vectorizing an image), or to generating all plausible segmentations of the input object, from which the correct one is selected by putting forward and testing hypotheses. The process of advancing and testing hypotheses should be explicitly reflected in the program architecture: each hypothesis must be an object that can be evaluated and compared with others. Hypotheses are therefore usually generated one by one, gathered into a list, and sorted by a preliminary score; the context and other additional sources of knowledge are then actively used for the final choice.
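The scheme above (generate hypotheses, score and sort them, let context make the final choice) can be sketched in a few lines. This is a minimal illustration, not the actual Fine Reader architecture; the word hypotheses, scores, and dictionary are invented for the example, with the dictionary standing in for the contextual knowledge source.

```python
# Hypothesis-driven recognition sketch: candidate interpretations are
# scored, sorted, and a contextual source (here a dictionary) picks
# the final answer. All names and scores are illustrative.

def recognize_word(candidates, dictionary):
    """candidates: list of (word, raw_score) hypotheses from the
    character recognizer; dictionary: set of known words acting as
    the contextual knowledge source."""
    # Sort hypotheses by their preliminary score, best first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    # Contextual check: prefer the best-scoring hypothesis that the
    # dictionary confirms; otherwise fall back to the top raw score.
    for word, score in ranked:
        if word in dictionary:
            return word
    return ranked[0][0]

hypotheses = [("c1ock", 0.93), ("clock", 0.91), ("dock", 0.40)]
print(recognize_word(hypotheses, {"clock", "dock"}))  # clock
```

Note how the feedback the text insists on appears here: the dictionary does not merely post-process a finished answer, it participates in choosing among the recognizer's live hypotheses.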
Today one of the leaders in the field of genetic programming is a group of researchers at Stanford University working under Professor John Koza. Genetic programming has breathed new life into the half-forgotten LISP (List Processing) language, created by John McCarthy's group (the same McCarthy who introduced the term "artificial intelligence" into everyday use) precisely for list processing and functional programming. Incidentally, in the USA this language was and remains one of the most widely used programming languages for artificial intelligence tasks.
Template systems
Such systems convert the image of an individual character to a raster, compare it with every template in their database, and choose the template that differs from the input image in the fewest pixels. Template systems are fairly resistant to image defects and process input quickly, but they reliably recognize only those fonts whose templates they know. If the font being recognized differs even slightly from the reference, template systems can make mistakes even on very high-quality images!
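The idea of "fewest differing pixels" can be shown in miniature. The 3x3 binary "glyphs" below are invented for the example; real systems work with far larger rasters and many transformations, but the matching rule is the same.

```python
# Toy template matching: each glyph is a small binary raster, and the
# input is assigned to the template with the fewest differing pixels.
# The 3x3 "fonts" are invented purely for illustration.

TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
}

def pixel_distance(a, b):
    """Count pixels where the two rasters disagree."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(image):
    return min(TEMPLATES, key=lambda ch: pixel_distance(image, TEMPLATES[ch]))

# An "L" with one noisy pixel still matches the L template...
noisy_L = ((1, 0, 0),
           (1, 0, 1),
           (1, 1, 1))
print(classify(noisy_L))  # L
```

The example also hints at the weakness the text describes: a glyph from an unknown font may land closer to the wrong template, no matter how clean the scan is.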
Structural Systems
In such systems an object is described as a graph whose nodes are the elements of the input object and whose arcs are the spatial relations between them. Systems implementing this approach usually work with vector images. The structural elements are the components of the character's strokes: for the letter "p", for example, a vertical segment and an arc.
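A structural description of "p" as a graph might look like the sketch below. The element names and relation labels are invented for the example; a real system would use a richer vocabulary of elements and fuzzier matching.

```python
# Sketch of a structural description: a character is a graph whose
# nodes are stroke elements and whose edges carry spatial relations.
# Element and relation names here are illustrative only.

letter_p = {
    "nodes": {
        "s1": "vertical_segment",
        "s2": "arc",
    },
    "edges": [
        # the arc sits on the upper right side of the stem
        ("s2", "attached_right_of", "s1"),
        ("s2", "upper_part_of", "s1"),
    ],
}

def matches(description, elements, relations):
    """Crude whole-from-parts test: do the observed elements and
    relations satisfy the structural description?"""
    required = set(description["nodes"].values())
    return required == set(elements) and all(
        edge in relations for edge in description["edges"]
    )

observed = ["vertical_segment", "arc"]
rels = {("s2", "attached_right_of", "s1"), ("s2", "upper_part_of", "s1")}
print(matches(letter_p, observed, rels))  # True
```

This also makes the fragility concrete: if a defect breaks the stem into two segments, the observed element set no longer matches the description, which is exactly the sensitivity discussed next.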
The disadvantages of structural systems include their high sensitivity to image defects that damage the constituent elements; vectorization can introduce further defects of its own. In addition, unlike template and feature systems, no effective automated training procedures have yet been created for them, so the structural descriptions for Fine Reader had to be built by hand.
Feature Systems
Here the averaged image of each symbol is represented as an object in an n-dimensional feature space. An alphabet of features is chosen, and their values are computed for the input image. The resulting n-dimensional vector is compared with the reference vectors, and the image is assigned to the reference it matches best. Feature systems do not satisfy the principle of integrity. A necessary but insufficient condition for the integrity of a class description (in our case, the class of images representing a single character) is that every object of the class, and no object of any other class, satisfies the description. But since a significant part of the information is lost when the features are computed, it is difficult to guarantee that only objects of the class can be assigned to it.
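The nearest-reference rule can be sketched in a couple of lines. The two features here (ink density and aspect ratio) and the reference values are invented for illustration; a real system would use many more features.

```python
# Minimal feature-space classifier: each class is a reference point in
# an n-dimensional feature space, and an input vector is assigned to
# the nearest reference. Features and values are illustrative only.

import math

REFERENCES = {
    "I": (0.30, 3.0),   # little ink, tall and narrow
    "O": (0.55, 1.0),   # more ink, roughly square
}

def classify(features):
    # Euclidean nearest neighbour among the reference vectors.
    return min(
        REFERENCES,
        key=lambda ch: math.dist(features, REFERENCES[ch]),
    )

print(classify((0.50, 1.2)))  # O
```

The loss of integrity is visible here too: any vector whatsoever gets assigned to some class, because the feature summary has discarded everything that might reveal the input is not a character at all.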
Obviously, the problem of recognizing handwritten text is considerably harder than that of printed text. Where printed text presents a limited number of variations of font images (templates), handwriting presents an immeasurably larger number of templates. Further difficulties come from the different proportions of the linear dimensions of the image elements, and so on.
And yet today we can say that the main stages in the development of handwriting recognition technology (for individual characters written by hand) have already been passed. The arsenal of Cognitive Technologies includes technologies for recognizing all the main types of text: stylized digits, typed characters, and hand-printed characters. The hand-printed input technology still has to go through an adaptation stage, after which it will be possible to declare that the toolkit for streaming document input into archives has indeed been fully implemented.
The dynamic development of new computer technologies (networking, client-server technology, and so on) has also left its mark on the electronic document management sector. Where keyboardless input technologies used to be promoted by stressing the advantages of their personal use, today the advantages of collective, rational use of document input and processing technologies come to the fore. Having a single, isolated recognition system is clearly no longer enough. Something must be done with the recognized text files (however well they were recognized): store them in a database, search them, transmit them over a local network, and so on. In short, interaction with an archival or other document management system is required. The recognition system thus turns into a utility for archival and other document management systems.
With the appearance of networked versions of our company's document scanning (the streaming scanning mode of OCR CuneiForm) and recognition (the CuneiForm OCR Server recognition server) systems, some of the advantages of collective use of these technologies have already been realized in organizations of various sizes. For this reason, in our view, it is worth discussing a comprehensive solution by companies to the problem of automating document handling in organizations of the most varied rank. As for Cognitive Technologies, the electronic archive "Евфрат" it offers (the system supports document input via OCR CuneiForm), the new utilities built into OCR CuneiForm'96, and the technologies used in large projects continue the company's line of extending the use of data input systems and developing technologies for automating work with documents.