Lecture
The task of extracting information from text is not new: a great deal of work has been done in this direction both by large companies such as Yandex and Google and by independent developers. However, it would be premature to say that the problem has been solved. In this article I want to organize what I know about the subject by briefly reviewing the tools I have recently had to deal with.
Suppose we are given the sentence "The visit of the President to Denmark will give a new breath to the dialogue between the two countries." To extract facts from this sentence, we need to go through the following steps:
At this stage the sentence must be split into separate words (tokens). There should be no problems here.
After the split, morphological information must be obtained for each word (part of speech, gender, case, number, etc.), along with other attributes (for example, whether the word is a personal name or a geographical location). This task requires special dictionaries: as a register of Russian words, you can use Zaliznyak's dictionary or its proprietary derivatives. As a ready-made solution, consider the Mystem utility from Yandex.
If Mystem is used, the original sentence is parsed as follows (the sentence and Mystem's grammatical tags are shown here in English translation):
krestyaninov@localhost # echo "The visit of the President to Denmark will give a new breath to the dialogue of the two countries" | ./mystem -niwg
Visit {visit = S, masc, inan = (nom, sg | acc, sg)}
President {president = S, masc, anim = (gen, sg | acc, sg)}
in {in = PR = | in = S, abbr = (nom, sg | nom, pl | gen, sg | gen, pl | dat, sg | dat, pl | acc, sg | acc, pl | ins, sg | ins, pl | prep, sg | prep, pl)}
Denmark {Denmark = S, geo, fem, inan = acc, sg}
will give {give = V = non-past, sg, indicative, 3rd person, perfective}
new {new = S, neut, inan = (nom, sg | acc, sg) | new = A = (nom, sg, full, neut | acc, sg, full, neut)}
breath {breath = S, neut, inan = (nom, sg | acc, sg)}
dialogue {dialogue = S, masc, inan = dat, sg}
two {two = NUM = (gen | acc, fem, anim | acc, masc, anim | prep)}
countries {country = S, fem, inan = gen, pl}
At this stage, related groups of words in the sentence are identified, for example the SUBJECT-PREDICATE pair "visit - will give". Establishing these relationships lets us resolve the ambiguities left by morphological analysis. For example, the phrase "new breath" makes it clear that "new" is an adjective and not a noun (which morphological analysis alone could not decide).
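The "new breath" example can be sketched in Python as a simple agreement check. This is only an illustration under stated assumptions: the analyses are hand-written tuples rather than real Mystem output, and the names `agrees` and `disambiguate_modifier` are hypothetical.

```python
# Each candidate analysis: (lemma, part_of_speech, gender, number, set_of_cases).
# The analyses below are hand-written for illustration, not real Mystem output.
NEW = [
    ("new", "S", "neut", "sg", {"nom", "acc"}),  # "the new" read as a noun
    ("new", "A", "neut", "sg", {"nom", "acc"}),  # adjective reading
]
BREATH = [("breath", "S", "neut", "sg", {"nom", "acc"})]


def agrees(adj, noun):
    """True if an adjective and a noun share gender, number and at least one case."""
    return adj[2] == noun[2] and adj[3] == noun[3] and bool(adj[4] & noun[4])


def disambiguate_modifier(modifier_analyses, head_analyses):
    """Keep adjective readings of the first word that agree with a noun reading of the second."""
    kept = [
        m for m in modifier_analyses
        if m[1] == "A" and any(h[1] == "S" and agrees(m, h) for h in head_analyses)
    ]
    # Fall back to the full list if no adjective reading agrees.
    return kept or list(modifier_analyses)


result = disambiguate_modifier(NEW, BREATH)
print(len(result), result[0][1])  # prints 1 A
```

A real system would derive such links from a grammar rather than from a hard-coded adjacency rule.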
The task of semantic analysis is to build a full tree of the connections between the words in a sentence. This process has many nuances and is, in general, too complex to describe in this article. More information about semantic analysis can be found here.
Having formed a connection tree of words in a sentence, we can proceed directly to extracting facts. To do this, you can use the following tools:
- Search by reference element: a particular word (for example, "President") is located in the text, and a fact is built around it using the connection tree;
- Search by pattern: data is located with a regular expression (for example, to pick out dates);
- Search by ontology: data is located using predicate rules written in a special language. Example.
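Search by pattern is the easiest of the three to illustrate. The sketch below uses Python's standard `re` module to pick out dates; the date format (English month name, day, four-digit year) and the name `find_dates` are assumptions made for the example.

```python
import re

# Matches dates such as "October 1, 2007" (English month name, day, four-digit year).
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}}),\s+(\d{{4}})\b")


def find_dates(text):
    """Return (month, day, year) tuples for every date found in the text."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in DATE_RE.finditer(text)]


print(find_dates("From October 1, 2007, Jonathan Sparrow is General Director."))
# prints [('October', 1, 2007)]
```

Other date formats would each need their own pattern, which is the usual cost of this approach.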
More information about the fact retrieval can be found at the following addresses:
- Presentation "Yandex.Press-portraits";
- Presentation "Automatic extraction of facts from the text".
Of the existing fact extraction systems that work acceptably, the only one I managed to find is GATE. The system is interesting and promising but, unfortunately, has no native support for Russian. You can try it out with the help of this manual.
Among paid systems, the Russian-developed RCO is worth noting (thanks to rg_software).
Automatic extraction of facts from text: the example of newspaper articles. Tatyana Lando, Ideograph LLC
What is it?
Fact extraction (text mining): automatic extraction of new, previously unknown information from texts in order to build facts.
Examples of facts:
Establishing links between objects
Setting object properties
Setting Parameter Values
Why is this necessary?
Reducing the effort of processing texts in a particular subject area. Popular application areas:
Medicine, biotechnology.
Can be applied to:
Decision support systems
Expert Systems
Knowledge base
Document Management Systems
Example: text
Euroset, the largest retail company in the CIS, announces the appointment of Andrei Rukavishnikov to the post of Vice President for Marketing and Advertising of the company. The turnover of the company “Euroset” in 2006 amounted to 4.62 billion dollars.
Example: Facts
1. relations between objects
Andrei Rukavishnikov is vice president of marketing and advertising for Euroset.
2. properties of objects
Euroset is the largest retail company in the CIS
3. value of parameters
The turnover of the company "Euroset" - 4.62 billion dollars in 2006.
Task statement
Extract facts from newspaper texts.
(Create Fact Database)
At this stage, the tasks are:
Identify proper names:
Andrei Rukavishnikov => person
Euroset => company
Establish connections between them:
vice president of marketing and advertising => position
Existing Projects
Yandex.News, press portraits: http://news.yandex.ru/people/
RCO Fact Extractor: http://rco.ru
Integrum: http://www.integrum.ru/
Why another system?
Existing systems are built with virtually no use of linguistic technologies.
The use of linguistics can
enrich the results
make them better
give the system flexibility and extensibility
Add linguistics!
Definitions
A term is a component of the triple, i.e. a unit relevant to the system; in our case: a person's name, a company name, or a position
Elementary fact: a fully filled-in triple (Person, Company, Position)
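A minimal Python sketch of these two definitions, assuming a fact is stored as a record with three optional fields. The class name `ElementaryFact` and the method `is_complete` are hypothetical and do not come from the system described in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElementaryFact:
    """A (person, company, position) triple; None marks a term not yet found."""
    person: Optional[str] = None
    company: Optional[str] = None
    position: Optional[str] = None

    def is_complete(self) -> bool:
        """An elementary fact counts as complete once all three terms are filled."""
        return None not in (self.person, self.company, self.position)


fact = ElementaryFact(person="Andrei Rukavishnikov", company="Euroset")
print(fact.is_complete())  # prints False: the position is still missing
fact.position = "Vice President for Marketing and Advertising"
print(fact.is_complete())  # prints True
```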
Text processing steps (for any system)
Primary text processing (structuring)
Extracting facts using patterns
Interpretation of results
Stages of our system
Primary text processing
Tokenization
Parsing
Fact extraction
Term Identification
Construction of elementary facts
Interpretation of results
Validation check
Write to database
Primary text processing
Required components
Tokenization
Breakdown of the text into words.
Lemmatization (Normalization)
reduction of the word to the initial (normal) form
Additional components
Partial parsing
Term Identification
Tokenization
- Breaking down the text into words.
Markers:
Punctuation
Spaces
Numbers
Problems:
Hyphenated spellings: Sviaz-Bank
Punctuation and digits inside proper names: the "Sochi-2014" bid committee
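One way to handle both problems is to treat hyphen-joined runs of letters and digits as a single token. The regular expression below is a minimal sketch of that idea using Python's standard `re` module, not the tokenizer used in the system described here; `tokenize` is a hypothetical name.

```python
import re

# A token is a run of letters/digits, optionally continued by hyphen-joined parts,
# so "Sviaz-Bank" and "Sochi-2014" stay whole; any other non-space character
# (punctuation, quotes) becomes a token of its own.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize('Sviaz-Bank backs the "Sochi-2014" bid.'))
# prints ['Sviaz-Bank', 'backs', 'the', '"', 'Sochi-2014', '"', 'bid', '.']
```

The sketch still splits decimal numbers such as 4.62 into three tokens, so a fuller tokenizer would need an extra rule for numbers.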
Lemmatization
Reduction of the word to the initial (normal) form
Main problem:
Morphological ambiguity: the Russian form "директора" (director) is either genitive singular or nominative plural
Solutions:
Statistical (frequency)
Taking syntax into account
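The statistical solution can be sketched as choosing the reading seen most often in an annotated corpus. The counts below are invented purely for illustration, and `most_frequent_reading` is a hypothetical helper.

```python
# Hypothetical counts of how often each reading of the Russian form "директора"
# was seen in some annotated corpus; the numbers are invented for illustration.
READING_COUNTS = {
    "директора": {("gen", "sg"): 70, ("acc", "sg"): 20, ("nom", "pl"): 10},
}


def most_frequent_reading(word_form, counts=READING_COUNTS):
    """Pick the reading with the highest count, or None for an unknown form."""
    readings = counts.get(word_form)
    if not readings:
        return None
    return max(readings, key=readings.get)


print(most_frequent_reading("директора"))  # prints ('gen', 'sg')
```

This choice ignores context entirely, which is why the slides list syntax as the second solution.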
Partial parsing
Partial analysis of the sentence, the establishment of grammatical links between words
Functions:
Removal of morphological ambiguity
Primary Term Identification
Method:
special formalism for describing natural language grammars: AGFL
AGFL
Affix Grammars over a Finite Lattice
Distributed under a free license (GNU GPL)
Its promise for representing natural language in NLP technologies has already been confirmed on material from other European languages
http://www.agfl.cs.ru.nl/links.html (examples)
AGFL
Flexibility and robustness of the system:
works not only with sentences, but also with text "segments"
can process grammatically incorrect or incomplete sentences
resolves ambiguity by combining the features of words
AGFL
A two-level context-free generative formal grammar
Morphology
Syntax
supplemented by a lattice of features with a finite number of values.
Features:
grammatical categories
lexico-grammatical classes of parts of speech
any other formal characteristics needed
AGFL: morphological module
analyzes the main parts of speech (nouns, verbs, adjectives and adverbs)
uses a base lexicon in which the main classification categories of each part of speech are given:
gender and animacy of nouns,
lexico-grammatical class of adjectives,
government patterns of verbs, etc.
a derivation (word-formation) module is used in addition
AGFL: morphological module
Output:
each word form is assigned a part-of-speech tag and a set of values of morphological categories (several sets when forms are homonymous)
Built into the syntax module:
the local syntactic context is taken into account for disambiguation
presence of prepositions
agreement between the values of grammatical categories
Example: the word form "пути" in the construction "в пути" (on the way) receives not 5 noun interpretations but 2: prepositional singular and accusative plural
AGFL: syntax module
frequent phrase constructions,
frequent patterns of simple sentences,
individual complicating constructions in a simple sentence:
coordinated series
participial phrases
adverbial participle phrases
AGFL: an example
Interpretations of the form "директора" (director):
genitive singular, accusative singular, nominative plural
The meeting was attended (pl.) by the directors (nom. pl.) of the largest St. Petersburg companies.
He was appointed to the position (governs the genitive) of director (gen. sg.) of marketing.
Yesterday, the board of shareholders removed (requires the accusative) the director (acc. sg.) for investments from his post.
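The government-based examples above amount to intersecting a word's candidate readings with the case its governing word requires. The Python sketch below shows that idea with a hand-made table; it is not AGFL, and the table contents and the name `readings_under_government` are assumptions.

```python
# Candidate (case, number) readings of "директора", as listed on the slide.
DIRECTOR_READINGS = [("gen", "sg"), ("acc", "sg"), ("nom", "pl")]

# Hypothetical government table: the case each governing word requires of its
# dependent noun. Keys are English glosses of the slide's examples.
REQUIRED_CASE = {
    "position": "gen",  # "the position of director"
    "removed": "acc",   # "removed the director"
}


def readings_under_government(readings, governor, table=REQUIRED_CASE):
    """Keep only the readings whose case matches what the governor requires."""
    required = table.get(governor)
    if required is None:
        return list(readings)  # unknown governor: nothing can be ruled out
    return [r for r in readings if r[0] == required]


print(readings_under_government(DIRECTOR_READINGS, "position"))  # prints [('gen', 'sg')]
print(readings_under_government(DIRECTOR_READINGS, "removed"))   # prints [('acc', 'sg')]
```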
Term Identification
Based on the syntactic dependencies between words, the system decides whether a given construction denotes a single term.
For proper names, punctuation and capital letters are also taken into account.
Term Identification
1. Search for the reference element
Predicates: to appoint
Class markers: Mr., company, position
2. Presence in a dictionary or ontology
3. Patterns / regular expressions
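Class markers can be illustrated with a short Python sketch that looks for a marker word followed by capitalized words. It works on English glosses, covers only two of the marker classes, and relies on capitalization alone, so it is a toy under those assumptions; `CLASS_MARKERS` and `find_marked_terms` are hypothetical names.

```python
import re

# Hypothetical class markers: a marker word followed by capitalized words
# suggests a term of the given class.
CLASS_MARKERS = {"Mr.": "person", "company": "company"}

# A name is a run of capitalized words, e.g. "Best Buy".
NAME = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*"
MARKER_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CLASS_MARKERS) + r")\s+(" + NAME + ")"
)


def find_marked_terms(text):
    """Return (term, class) pairs for capitalized names that follow a class marker."""
    return [(m.group(2), CLASS_MARKERS[m.group(1)]) for m in MARKER_RE.finditer(text)]


print(find_marked_terms("Previously, Mr. Schendell worked at the company Best Buy."))
# prints [('Schendell', 'person'), ('Best Buy', 'company')]
```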
Term Identification: Example
Igor Chupalov has been appointed the new director for finance and management in the Russian division of T-Systems
Matched elements: "director for" + NP (dat); CompanyName; "appointed" + PersonName (Nom)
Construction of elementary facts
In reality: almost inseparable from the previous stage.
Complete elementary fact in one sentence
Special predicate
No predicate
Special marker (time expression, verb of speech)
Construction of elementary facts
Special predicate
Igor Chupalov has been appointed the new director for finance and management in the Russian division of T-Systems
No predicate
From October 1, 2007, Jonathan Sparrow - General Director of Nokia Siemens Networks in Russia
Special marker (time expression, verb of speech)
Euroset President Alexey Chuikin noted: <...>
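As an illustration of the "no predicate" case, the sketch below matches the shape "Person - Position of Company" on the English gloss of the example. The pattern, the reliance on capitalized words, and the name `extract_no_predicate_fact` are all assumptions; the system described in the slides works from grammatical analysis rather than from a surface regular expression.

```python
import re

# A name is a run of capitalized words, e.g. "Nokia Siemens Networks".
NAME = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*"

# "<Person> - <Position> of <Company>", with no verb linking the terms.
NO_PREDICATE_RE = re.compile(
    rf"(?P<person>{NAME})\s+-\s+(?P<position>{NAME})\s+of\s+(?P<company>{NAME})"
)


def extract_no_predicate_fact(sentence):
    """Return a (person, company, position) triple, or None if the pattern does not match."""
    m = NO_PREDICATE_RE.search(sentence)
    if m is None:
        return None
    return (m.group("person"), m.group("company"), m.group("position"))


sentence = ("From October 1, 2007, Jonathan Sparrow - General Director "
            "of Nokia Siemens Networks in Russia")
print(extract_no_predicate_fact(sentence))
# prints ('Jonathan Sparrow', 'Nokia Siemens Networks', 'General Director')
```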
Construction of elementary facts
Difficult situations:
The sentence contains an incomplete fact.
In 1995, he headed the marketing department at Rothmans. (Solution: take the whole paragraph into account)
The sentence contains more than one fact.
Previously, Mr. Schendell worked as vice president of sales, and Mr. Eames as senior vice president of Best Buy. (No solution has been found yet)
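The paragraph-level solution for incomplete facts can be sketched as carrying the most recently mentioned person forward to later sentences. This is a deliberately naive Python illustration: the function name, the dict layout and the person's name are invented, and real pronoun resolution is considerably harder.

```python
def complete_from_paragraph(facts):
    """Fill a missing person in each fact with the most recent person seen earlier.

    `facts` is a list of dicts with keys "person", "company", "position",
    one per sentence in paragraph order; None marks a missing term.
    """
    last_person = None
    completed = []
    for fact in facts:
        fact = dict(fact)  # copy so the input is not modified
        if fact["person"] is None:
            fact["person"] = last_person
        else:
            last_person = fact["person"]
        completed.append(fact)
    return completed


# "J. Doe" and "Example Corp" are invented placeholders; the second entry
# mirrors the slide's sentence "In 1995, he headed the marketing department at Rothmans."
paragraph = [
    {"person": "J. Doe", "company": "Example Corp", "position": "director"},
    {"person": None, "company": "Rothmans", "position": "head of the marketing department"},
]
print(complete_from_paragraph(paragraph)[1]["person"])  # prints J. Doe
```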
Validation check
It is carried out with the help of an ontology.
An ontology is a formalization of a certain area of knowledge using a conceptual scheme:
a hierarchy of concepts (objects) and defined relations between them.
More on this in a week
Validation check
Since January, Donald Ims has been director of Best Buy.
Donald Ims, Best Buy: person or company?
... Best Buy's annual turnover exceeds ...
Ontology: a company has the attribute "turnover" => Best Buy is a company
Pattern "Since January, X is director of the company"
=> Donald Ims is a person
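The reasoning in this example can be sketched in Python with a toy ontology that maps each class to the attributes its instances may have. The classes, attributes and the function `classes_for` are assumptions chosen to mirror the example, not the ontology used by the system.

```python
# A toy ontology: each class lists the attributes its instances may have.
# The classes and attributes are assumptions chosen to mirror the slide's example.
ONTOLOGY = {
    "company": {"turnover", "staff"},
    "person": {"position"},
}


def classes_for(observed_attributes, ontology=ONTOLOGY):
    """Return the classes whose attribute set covers everything observed for an entity."""
    return sorted(
        cls for cls, attrs in ontology.items()
        if observed_attributes and observed_attributes <= attrs
    )


# "Best Buy's annual turnover exceeds ..." gives Best Buy a turnover.
print(classes_for({"turnover"}))  # prints ['company']
# "X is director of the company" gives Donald Ims a position.
print(classes_for({"position"}))  # prints ['person']
```

An entity observed with attributes from both classes gets an empty result, which is one simple way to flag a fact as invalid.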
Write to database
Writing facts to a database (RDF?)
Organization of database search
Diagram labels: position, staff, company, person, turnover
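A minimal sketch of storing and searching facts, using Python's built-in `sqlite3` module. A single relational table is swapped in here for simplicity in place of the RDF store the slide considers; the table layout and the function `people_at` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE facts (person TEXT, company TEXT, position TEXT)")
conn.execute(
    "INSERT INTO facts VALUES (?, ?, ?)",
    ("Andrei Rukavishnikov", "Euroset", "Vice President for Marketing and Advertising"),
)


def people_at(company):
    """Return (person, position) rows recorded for the given company."""
    rows = conn.execute(
        "SELECT person, position FROM facts WHERE company = ? ORDER BY person",
        (company,),
    )
    return rows.fetchall()


print(people_at("Euroset"))
# prints [('Andrei Rukavishnikov', 'Vice President for Marketing and Advertising')]
```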
Used technologies
A special platform, Ideolog, was developed:
It is a logical inference system
It is fully based on the Java platform
It has a classic set of built-in predicates suitable for solving any logical inference problem
Used technologies
Ideolog
has an extension for working with typed feature structures (TFS)
is fully extensible and can be supplemented with modules for solving new problems
has a simple mechanism for adding built-in predicates, data types, etc.
has a convenient, visual graphical environment
Used technologies
Differences from other systems
Using formal grammar:
To remove morphological homonymy
To identify terms
Use of ontology
Not using statistics and machine learning
(planned for further stages)
Advantages
Works on individual texts (no corpus is needed to compile statistics)
The elementary fact is easy to extend, for example with the size of the staff or the location of the company
There is a solution for automatically extending the ontology (in development)
Thank you for your attention!
Useful links
http://ideograph.ru - Ideograph Ltd.
http://www.cs.ru.nl/agfl - AGFL
http://www.w3.org/TR/owl-features - Ontologies and the OWL language
http://people.ischool.berkeley.edu/~hearst/text-mining.html - Marti Hearst's article on text mining
http://filebox.vt.edu/users/wfan/text_mining.html - Collection of links on information retrieval and fact extraction