Lecture
Document classification by subject
In this chapter, we consider the task of thematic classification of documents, that is, the automatic determination of the subject matter of a document from a given set of possible topics.
A distinctive feature of the classification task is the assumption that the set of documents to be classified contains no “garbage”, that is, each document corresponds to one of the specified topics. In recent years this problem has received considerable attention.
Most of the proposed classification methods are based on the classical vector model of information retrieval. Topics, like documents, are described by weighted lists of terms (word forms). Term weights are computed from statistical information about the occurrence of terms in the given document and, possibly, in other documents.
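As an illustration of such weighting, the sketch below computes the common tf-idf scheme (term frequency times inverse document frequency); the function name and the toy documents are illustrative, not taken from the text.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Weight each term t of each document d by tf(t, d) * log(N / df(t)),
    where N is the number of documents and df(t) is the number of
    documents containing t. Documents are lists of terms."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term once per document
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

docs = [["term", "weight", "term"], ["document", "weight"], ["topic", "term"]]
w = tfidf_weights(docs)
```

Terms that occur in every document get weight zero, while terms concentrated in few documents are emphasized.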
In recent years, more complex approaches have attracted increasing attention. The main idea behind many of them is to reduce the dimensionality of the feature space in which documents are classified. The initial feature space is usually the space of terms (word forms), which is compressed based on the analysis of a large collection of documents. Various approaches are used for this analysis: clustering terms by their probability distributions over documents, applying data mining methods to derive classification rules, and so on. Note that despite the improved classification quality, the practical application of such approaches is complicated by their high computational cost, which results in low performance.
One promising approach is latent semantic analysis (LSA), which identifies the structure of semantic relationships between words through statistical analysis of a large collection of documents. This makes it possible to automatically distinguish the semantic shades of the same word depending on the context of its use. Note that identifying the semantic structure with LSA is fully automatic and does not require any manually compiled dictionaries.
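The core of LSA is a truncated singular value decomposition of the term-document matrix; a minimal sketch, assuming a tiny hypothetical matrix and keeping k = 2 latent dimensions:

```python
import numpy as np

# Hypothetical term-document matrix (rows = terms, columns = documents).
A = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Truncated SVD keeps the k largest singular values; documents are then
# compared in the resulting k-dimensional "latent semantic" space.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(doc_vectors[0], doc_vectors[2])
```

Words that co-occur in similar contexts end up close together in the latent space, which is what lets LSA separate different senses and uses of a word without hand-built dictionaries.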
Classification taking into account semantic proximity of words
All classification methods follow the same generalized algorithm, which consists of the following steps:
- building descriptions of all topics;
- building a description of the document in question;
- computing closeness estimates between the topic descriptions and the document description, and selecting the closest topics.
The differences between the methods lie in how these stages are implemented.
Descriptions of topics and documents
The proposed approach is based on the assumption that the subject of a document is determined by its vocabulary. We exclude from consideration so-called stop words, i.e. the most common words that can occur in documents on any subject, such as prepositions, pronouns, etc. We also assume that different syntactic forms of the same word do not affect the overall subject of the document and may therefore be represented by a single base word form (term).
The description of a document consists of all the terms found in it, with the exception of commonly used ones.
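A minimal sketch of building such a description, assuming an illustrative stop-word list and a crude stand-in for reduction to a base word form (a real system would use a morphological dictionary):

```python
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}  # illustrative list

def normalize(word):
    """Crude stand-in for lemmatization: lowercase and strip a plural
    's'. A real system would map each word form to its base form."""
    return word.lower().rstrip("s")

def document_description(text):
    """The description is the set of all terms in the document,
    excluding stop words."""
    tokens = text.split()
    return {normalize(t) for t in tokens if t.lower() not in STOP_WORDS}

desc = document_description("The weights of the terms in documents")
```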
Topics are also represented in the system by sets of terms; however, these sets do not contain all the words used in documents on the topic, but only a small, automatically selected subset of them.
Building descriptions of topics
A topic is specified by a relatively small set of related documents. A description of the topic is automatically constructed as a set of terms based on the analysis of this document set, as well as the sets of documents defining the other topics under consideration. The purpose of the analysis is to identify how this topic differs from the others and to select the terms that best emphasize its distinctive features.
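The text does not fix a specific selection rule, so the sketch below uses one simple contrastive heuristic: score each term by its relative frequency within the topic's documents minus its relative frequency in the documents of all other topics, and keep the top-scoring terms.

```python
from collections import Counter

def topic_description(topic_docs, other_docs, top_n=5):
    """Select terms that distinguish the topic from the others.
    Documents are lists of terms; the scoring rule is an illustrative
    heuristic, not the method from the text."""
    in_topic = Counter(t for doc in topic_docs for t in doc)
    elsewhere = Counter(t for doc in other_docs for t in doc)
    n_in = sum(in_topic.values()) or 1
    n_out = sum(elsewhere.values()) or 1
    score = {t: in_topic[t] / n_in - elsewhere[t] / n_out for t in in_topic}
    return [t for t, _ in sorted(score.items(), key=lambda x: -x[1])[:top_n]]

sports = [["match", "goal", "team"], ["goal", "season"]]
other = [["market", "bank", "season"], ["stock", "market"]]
desc = topic_description(sports, other)
```

Terms frequent in the topic but rare elsewhere rank highest, while terms common to all topics score near zero and drop out of the description.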