
Text analysis

Lecture



The task of information extraction is to process natural-language text in order to extract specified elements. The input to an information extraction system is weakly structured or unstructured natural-language text; the output is a set of completed data structures (exoframes) that allow further automatic or manual processing of the information. Information extraction can be considered a special type of text annotation in which a specific data structure serves as the annotation.

Information extracted from the text is stored in an exoframe, which is a set of target slots. A target slot can contain information about objects (for example, persons, organizations, products), relationships, or events and their attributes; it may also contain a link to the text fragment from which this information was obtained.
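For illustration, an exoframe might be represented in code roughly as follows. This is a minimal Python sketch; the names Slot and Exoframe, the ProductLaunch frame type, and the slot values are invented for the example:

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class Slot:
    """A target slot: a value plus an optional link to the source fragment."""
    value: str
    source_span: Optional[Tuple[int, int]] = None  # (start, end) offsets in the text

@dataclass
class Exoframe:
    """A target frame: a named set of slots filled during extraction."""
    frame_type: str
    slots: dict = field(default_factory=dict)

    def fill(self, name, value, span=None):
        self.slots[name] = Slot(value, span)

frame = Exoframe("ProductLaunch")
frame.fill("product", "XPhone 5", span=(27, 35))
frame.fill("manufacturer", "Acme Corp.")
print(frame)
```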

Relevant information must be defined very precisely for an automatic information extraction system to show good results. A good problem formulation can be considered one for which the consistency of manual information selection by several subject-matter experts (inter-annotator agreement) is high (over 90%). If the key information is implicit or poorly defined, even experts will disagree, and an automatic system cannot be expected to perform well.

For high-quality work in a specific subject area, an automatic system needs considerable knowledge of that area. Each subject area involves extracting data of a different nature and has its own professional vocabulary and style of writing.

Each specific data extraction task provides slots of different types: for events, persons, organizations, dates, and so on. The target frame and the information extraction rules describe the conditions under which an exoframe is created and how its slots are filled.

Consider a typical application of an information extraction system. An array of texts is given, each of which potentially contains a description of a certain object or event of the domain. For example, it may be a selection of news items that mention new products appearing on the market. Another example is a set of home pages of an organization's employees. In addition, a definition of the target information is given (it can be viewed as a list of questions about the subject area). For each text in the array, answers to these questions must be selected as text fragments, based on the definition of the target information. For the news selection, the goal may be to detect the product name, the manufacturer's name, and the date the product appeared on the market; for home pages, the goal may be to detect the owner of the page, his home address, and the department in which he works.
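A heavily simplified sketch of such extraction for the news example could use a single hand-written pattern. The sentence shape, the field names, and the example sentence below are assumptions made for illustration; real systems use large sets of such rules:

```python
import re

# One hand-written pattern for sentences like
# "Acme Corp. released the XPhone 5 on 12.03.2014."
PATTERN = re.compile(
    r"(?P<manufacturer>[A-Z][\w.]+(?: [A-Z][\w.]+)*) released "
    r"the (?P<product>[A-Z][\w ]+?) on (?P<date>\d{2}\.\d{2}\.\d{4})"
)

def extract(text):
    """Return the target fields as a dict, or an empty dict if no match."""
    m = PATTERN.search(text)
    return m.groupdict() if m else {}

print(extract("Acme Corp. released the XPhone 5 on 12.03.2014."))
# {'manufacturer': 'Acme Corp.', 'product': 'XPhone 5', 'date': '12.03.2014'}
```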

Uncertainty arises at all stages of natural language processing and is resolved by various means. A major problem is the construction of dictionaries, thesauri, and ontologies; this work is mostly done manually. Attempts to automate it have used statistical and machine learning methods, but at the moment there appears to be no freely available research describing a comprehensive solution to this problem.

The use of machine learning methods can simplify the setup and development of information extraction systems and ease porting the entire system to a new subject area. We first consider the levels of text analysis in general, and then the possibility and effectiveness of using machine learning for contextual homonymy resolution, parsing, defining semantic classes, constructing information extraction rules, and combining partial results.

The general text processing scheme for extracting information

Information extraction systems use similar methods in many respects. Let us turn to the typical text processing sequence in information extraction tasks, noting along the way the processing steps for which machine learning would be useful. These are, first of all, the stages that require fine-tuning in specific applications.

The source text first undergoes graphematic analysis: words and sentences are identified. At the next stages, compound words are detected that should be treated as one unit (from the point of view of the morphological analyzer). Graphematic analysis usually does not require customization for a subject area, since an implementation of the general graphematic analysis algorithm is suitable for most real-world applications.
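A naive illustration of graphematic analysis might look like the following sketch; real graphematic analyzers handle abbreviations, numbers, and other special cases that this toy regular-expression version ignores:

```python
import re

def graphematic_analysis(text):
    """A naive sketch: split text into sentences, then sentences into tokens."""
    # Sentence boundary: . ! or ? followed by whitespace and a capital letter.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    # Tokens: words (hyphenated compounds kept whole) or punctuation marks.
    return [re.findall(r"\w+(?:-\w+)*|[^\w\s]", s) for s in sentences]

for tokens in graphematic_analysis("The product appeared in March. It sold well."):
    print(tokens)
# ['The', 'product', 'appeared', 'in', 'March', '.']
# ['It', 'sold', 'well', '.']
```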

Morphological analysis usually works at the level of individual (possibly compound) words and returns the morphological attributes of a given word. When the attributes cannot be determined unambiguously, several possible analysis variants are returned. Machine learning is not particularly useful for morphological analysis itself, since there are many high-quality dictionary-based and dictionary-free solutions to this problem that can be used in a wide range of applications.
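The behavior described above can be illustrated with a toy dictionary-based analyzer; the English word forms and the tag notation here are invented for the example:

```python
# A toy morphological dictionary: each surface form maps to all of its
# possible analyses (lemma + attributes), as a real analyzer would return them.
MORPH_DICT = {
    "saw":   [("see", "VERB,past"), ("saw", "NOUN,sg")],
    "flies": [("fly", "VERB,pres,3sg"), ("fly", "NOUN,pl")],
    "time":  [("time", "NOUN,sg"), ("time", "VERB,inf")],
}

def analyze(word):
    """Return every possible analysis; ambiguity is resolved later."""
    return MORPH_DICT.get(word.lower(), [(word.lower(), "UNKNOWN")])

print(analyze("saw"))  # two homonymous variants
```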

The results of morphological analysis are used for micro- and macro-syntactic analysis. Microsyntactic analysis builds a limited set of syntactic links (for example, the selection of noun groups). The task of macrosyntactic analysis is to single out large syntactic units (fragments) and to establish a hierarchy on the set of these fragments. The division into micro- and macrosyntax is conventional; it reflects the fact that for most information extraction problems, shallow (microsyntactic) analysis is sufficient.
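As an illustration of microsyntactic analysis, the following sketch selects noun groups (runs of adjectives followed by nouns) from a part-of-speech-tagged sentence; the tag set and the sentence are assumed for the example:

```python
def noun_groups(tagged):
    """Extract maximal ADJ* NOUN+ runs from a list of (word, tag) pairs."""
    groups, current = [], []
    for word, tag in tagged:
        if tag in ("ADJ", "NOUN"):
            current.append((word, tag))
        else:
            if any(t == "NOUN" for _, t in current):  # a group needs a noun
                groups.append(" ".join(w for w, _ in current))
            current = []
    if any(t == "NOUN" for _, t in current):
        groups.append(" ".join(w for w, _ in current))
    return groups

sentence = [("The", "DET"), ("new", "ADJ"), ("mobile", "ADJ"),
            ("phone", "NOUN"), ("appeared", "VERB"),
            ("last", "ADJ"), ("month", "NOUN")]
print(noun_groups(sentence))  # ['new mobile phone', 'last month']
```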

Experiments show that a linguistic analyzer with rich expressive capabilities makes more errors, because almost every level of analysis is a task that has no strict, let alone formalized, solution. This applies most of all to syntactic analysis. Therefore, in a domain where simple syntactic analysis is enough, a powerful analyzer will only introduce unwanted noise, and performance will fall. At the same time, there are subject areas in which information extraction requires advanced capabilities for representing linguistic information; in such areas, a primitive analyzer cannot provide the linguistic attributes necessary for extracting the target information. This setup is done manually, so this stage of analysis would benefit from machine learning.

Since after morphological analysis each word may have several homonymous interpretations, algorithms that eliminate homonymy can be used to improve the quality and performance of syntactic analysis by reducing the number of morphological variants. The task of removing homonymy is often solved with sets of rules, whose creation is very laborious, since practically applicable rule sets are quite large.

In addition, the rule set has to be modified for each subject area. Homonymy removal is thus another area of text analysis that can be improved with machine learning.
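Rules of this kind might look as follows; this sketch encodes two hypothetical context rules that discard morphological variants incompatible with the previous word's tag:

```python
def disambiguate(prev_tag, variants):
    """Filter homonymous variants using the tag of the previous word."""
    if prev_tag == "DET":
        # After a determiner, prefer nominal readings.
        nominal = [v for v in variants if v[1].startswith("NOUN")]
        if nominal:
            return nominal
    if prev_tag == "PRON":
        # After a pronoun, prefer verbal readings.
        verbal = [v for v in variants if v[1].startswith("VERB")]
        if verbal:
            return verbal
    return variants  # no rule applies: keep all variants

variants = [("see", "VERB,past"), ("saw", "NOUN,sg")]
print(disambiguate("DET", variants))   # [('saw', 'NOUN,sg')]
print(disambiguate("PRON", variants))  # [('see', 'VERB,past')]
```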

Next, semantic classes (composite types) are selected. When composite types are identified, text fragments are marked that are later (for example, when rules are applied) treated as a whole (for example, dates, names, positions). Semantic classes are identified on the basis of thesauri or of rules similar to information extraction rules. Both options are of interest from the point of view of machine learning; the first, unfortunately, is almost impossible to automate, and the second is considered below.

Then the information extraction rules are applied to the text. When the conditions and restrictions described in a rule are met, the functional part of the rule is executed; it builds the target data structures or stores additional information to be used in subsequent phases. Most often, the rules are grouped into phases: the rules of later phases have access to the information generated by the rules of earlier ones. Building and testing sets of information extraction rules, especially for a complex subject area, is a time-consuming task, for which a number of satisfactory machine learning solutions have been proposed.
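The phase mechanism can be illustrated with the following sketch: a first-phase rule marks dates as a semantic class, and a second-phase rule uses those annotations in its condition and builds a target frame in its functional part. All rule contents are invented for the example:

```python
import re

# Phase 1 marks semantic classes; phase 2 rules may use phase 1 results.
def phase1_dates(text, annotations):
    """Mark every DD.MM.YYYY fragment as a DATE annotation."""
    for m in re.finditer(r"\b\d{2}\.\d{2}\.\d{4}\b", text):
        annotations.append(("DATE", m.group(), m.span()))

def phase2_launch(text, annotations, frames):
    # Condition: a DATE annotation exists and the text mentions a release.
    dates = [a for a in annotations if a[0] == "DATE"]
    if dates and "released" in text:
        # Functional part: build a target frame from the stored information.
        frames.append({"event": "ProductLaunch", "date": dates[0][1]})

text = "Acme Corp. released the XPhone 5 on 12.03.2014."
annotations, frames = [], []
phase1_dates(text, annotations)
phase2_launch(text, annotations, frames)
print(frames)  # [{'event': 'ProductLaunch', 'date': '12.03.2014'}]
```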

Target frames can be subjected to additional processing to improve the quality of the system. For this purpose, coreference resolution and the combination of partial results are used.

When coreference is resolved in target frames, objects that are described by different fragments of the text but refer to the same real-world entity are marked in a special way. Studies show that there is no general solution to the coreference problem; however, there are general approaches that work acceptably in many subject areas but require adjustment when moving from one area to another, so machine learning can potentially be used here as well.

Combining partial results consists in finding partially filled target frames and deciding whether the results can be merged. When a merge is possible, several target frames are assembled into one with more complete information than any of the originals. Like a number of the problems listed above, combining partial results has no general solution and requires adjustment to the subject area. The peculiarity of this stage is that there are a number of approaches implementing machine learning algorithms (often statistical ones) or methods close to them, but in addition to setting the algorithm's parameters, each subject area requires choosing an algorithm and creatively fine-tuning it to the specific problem. Algorithms for constructing rules for combining partial results are often similar to algorithms for constructing information extraction rules.
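A minimal sketch of merging partial results might look as follows: two frames are compatible if their shared slots do not conflict, and the merge produces a more complete frame (here frames are plain dictionaries for simplicity):

```python
def can_merge(a, b):
    """Frames are compatible if no slot they share has conflicting values."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())

def merge(a, b):
    """Combine two partially filled frames into a more complete one."""
    return {**a, **b}

f1 = {"product": "XPhone 5", "manufacturer": "Acme Corp."}
f2 = {"product": "XPhone 5", "date": "12.03.2014"}
if can_merge(f1, f2):
    print(merge(f1, f2))
# {'product': 'XPhone 5', 'manufacturer': 'Acme Corp.', 'date': '12.03.2014'}
```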

The quality of morphological analysis can be improved using context analysis, which in most cases eliminates morphological homonymy. The context analysis module can be adapted to an arbitrary subject area. For this, the module's training component is given a set of texts (documents of the target domain). On this set it learns the most characteristic contexts of the words that matter for homonymy and later uses them to resolve homonymous ambiguity.

Contextual analysis does not appear to solve all homonymy problems for Russian. For example, many Russian nouns have the same spelling in the accusative and nominative cases (and the possible context of the lexeme remains practically unchanged); the same goes for proper names. But there are many cases in which contextual analysis eliminates irrelevant homonyms. Foreign analogues demonstrate the high accuracy of morphological processors based on hidden Markov models and rules of a special type. There are implementations both for supervised and for unsupervised training.
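Real systems of this kind are usually based on hidden Markov models, as noted above; the following much-simplified count-based sketch only illustrates the idea of learning characteristic contexts from a tagged corpus and using them to resolve homonymy (the corpus and tags are toy data):

```python
from collections import Counter, defaultdict

# A tagged training corpus (toy): the module learns which tag is most
# characteristic for a word in the context of the previous tag.
corpus = [[("the", "DET"), ("saw", "NOUN"), ("fell", "VERB")],
          [("they", "PRON"), ("saw", "VERB"), ("it", "PRON")]]

context_counts = defaultdict(Counter)
for sentence in corpus:
    prev = "START"
    for word, tag in sentence:
        context_counts[(prev, word)][tag] += 1
        prev = tag

def resolve(prev_tag, word):
    """Pick the tag seen most often for this word after this tag."""
    counts = context_counts.get((prev_tag, word))
    return counts.most_common(1)[0][0] if counts else None

print(resolve("DET", "saw"))   # NOUN
print(resolve("PRON", "saw"))  # VERB
```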

Parsing

Using machine learning in parsing requires careful annotation of large volumes of text, so supervised training is often impractical.

Experiments on building a parser with unsupervised machine learning show that the syntactic structure of natural language is too expressive and complex to build an effective model of it without annotated texts.

From a practical standpoint, the most effective approach to unsupervised syntactic analysis is statistical training, in which syntactic structures are identified without linguistic knowledge; instead, one can count the frequency of joint occurrence of words. A similar approach (for Russian) has been explored, but even there a significant place is occupied by formal grammatical rules rigidly built into the system. Nevertheless, adaptive syntactic analysis can greatly improve the quality of the system. Depending on the tasks we want to solve, it is not always rational to use the full power of the parser. Sometimes it is enough to analyze only those characteristics of the sentence that we need from an applied point of view and that have a lower probability of error during analysis. The work of the parser can then be adapted to the application goals (including by means of machine learning).
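Counting the joint occurrence of words, the core of this statistical approach, can be sketched as follows (the window size and the example sentences are arbitrary):

```python
from collections import Counter

def cooccurrence(sentences, window=3):
    """Count how often two words occur within `window` positions of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + window]:
                counts[tuple(sorted((w, v)))] += 1
    return counts

sents = [["the", "new", "phone", "appeared"],
         ["a", "new", "phone", "was", "released"]]
print(cooccurrence(sents).most_common(3))
# [(('new', 'phone'), 2), ...] -- 'new' and 'phone' co-occur most often
```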

Definition of semantic classes

An important property of an information extraction system is its ability to define the semantic classes of text fragments. A set of semantic classes can include different components, from primitive ones (for example, dates) to named entities and their classes (for example, "Organization", "Person", "Position"). This allows the information extraction rules to operate not on individual words and their interrelations but on entities characteristic of the subject domain.

Machine learning in this context is most likely possible only in a supervised version, since clustering over a set of semantic classes would produce results that are difficult for a person to interpret.
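A supervised version can be sketched as follows, assuming scikit-learn is available; the character n-gram features, the classifier choice, and the training fragments are all illustrative, and predictions on such tiny data should not be taken as representative:

```python
# A supervised sketch: classify candidate fragments into semantic classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

fragments = ["Acme Corp.", "John Smith", "chief engineer",
             "Globex Inc.", "Mary Jones", "sales manager"]
labels = ["Organization", "Person", "Position",
          "Organization", "Person", "Position"]

# Character n-grams capture surface cues such as "Corp." or "-er" endings.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB(),
)
model.fit(fragments, labels)
print(model.predict(["Initech Inc.", "senior manager"]))
```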
