Cluster analysis in pattern recognition

Lecture

As previously noted, cluster analysis (self-study, unsupervised learning, taxonomy) is used to automatically generate a list of images from a training set. All objects of this sample are presented to the system without specifying which image they belong to. Such problems are solved, for example, by a person in the process of natural-science cognition of the surrounding world (the classification of plants and animals). This experience should be used to create appropriate algorithms.

The cluster analysis is based on the compactness hypothesis. It is assumed that the training sample in the attribute space consists of a set of bunches (like galaxies in the Universe). The task of the system is to identify and formally describe these clots. The geometric interpretation of the compactness hypothesis is as follows.

Objects belonging to the same taxon are located close to each other compared to objects belonging to different taxa. "Proximity" can be understood more broadly than with geometric interpretation. For example, a pattern describing the relationship of objects of one taxon differs from that in other taxa, as is the case in linguistic methods.

We confine ourselves to considering geometric interpretation. Let us dwell on the FOREL algorithm (Fig. 13).

Cluster analysis in pattern recognition

Fig. 13. Illustration of the FOREL algorithm:
a - the process of moving a formal element over a set
objects (points); b - illustration of the dependence of the results
taxonomy according to the FOREL algorithm from the starting point
move formal element

The hypersphere of radius is under construction Cluster analysis in pattern recognition where corresponds to the hypersphere covering all objects (points) of the training sample. In this case, the number of taxa will be equal to one. Then a hypersphere of radius is built. centered on an arbitrary sampling point. For the set of points that fall inside the hypersphere (formal element), the mean value of the coordinates (the standard) is determined and the center of the hypersphere moves into it. If this movement is significant, then the set of points that fall inside the hypersphere, and, consequently, their mean value of coordinates, will noticeably change. Again we move the center of the hypersphere to this new mean value, etc. until the hypersphere stops or loops. Then all the points that fall inside this hypersphere are excluded from consideration and the process is repeated on the remaining points. This continues until all points are exhausted. Taxonomy result: a set of radius hyperspheres (formal elements) Cluster analysis in pattern recognition with the centers defined as a result of the above procedure. Call it a cycle with formal elements of radius. .

The next cycle uses hyperspheres of radius. Cluster analysis in pattern recognition . The parameter appears here. , determined by the researcher most often by selection in search of a compromise: increase leads to an increase in the rate of convergence of the computational procedure, but it also increases the risk of losing the subtleties of the taxonomic structure of the set of points (objects). It is natural to expect that with a decrease in the hypersphere radius, the number of selected taxa will increase. If in the feature space the training sample consists of Cluster analysis in pattern recognition compact bunches far apart from each other, starting from a certain radius several cycles with radii of formal elements will end with the same number taxa. It is reasonable to associate the presence of such a “shelf” in a sequence of cycles with an objective existence. Cluster analysis in pattern recognition bunches, which correspond to the taxa. The presence of several "shelves" may indicate a hierarchy of taxa. In fig. 14 shows the set of points that have a two-level taxonomic structure.

For all its visibility and interpretability of the results, the FOREL algorithm has a significant drawback: in most cases, the taxonomy results depend on the initial choice of the center of the hypersphere of radius Cluster analysis in pattern recognition . There are various techniques that reduce this dependence, but there is no radical solution to this problem. Details can be found in the recommended literature [6].

Cluster analysis in pattern recognition

Fig. 14. Hierarchical (two-level) taxonomy

An example of the use of human criteria in solving taxonomy problems is the KRAB algorithm. These criteria are worked out on a two-dimensional attribute space in the course of a taxonomy implemented by a person, and are applied in an algorithm that functions with objects of arbitrary dimension.

The factors identified in the "human" taxonomy can be summarized as follows:

- within taxa, objects should be as close as possible to each other (generalized indicator Cluster analysis in pattern recognition );

- taxa should be kept as far apart as possible (generalized indicator Cluster analysis in pattern recognition );

- in taxa, the number of objects should be as equal as possible, that is, their difference in different taxa should be minimized (generalized indicator Cluster analysis in pattern recognition );

- within taxa there should not be large jumps in the density of points, that is, the number of points per unit volume (generalized indicator Cluster analysis in pattern recognition ).

If you can successfully choose the way to measure Cluster analysis in pattern recognition and then a good match between “human” and automatic taxonomy can be achieved. The KRAB algorithm uses the following approach.

All points of the training sample are combined into a graph in which they are vertices. This graph should have a minimum total length of edges connecting all vertices and not contain loops (Fig. 15). Such a graph is called a KNP-graph (KNP is the shortest non-closed path).

Cluster analysis in pattern recognition

Fig. 15. Illustration of the KRAB algorithm

The measure of proximity of objects in one taxon is the average edge length

Cluster analysis in pattern recognition ,

Where Cluster analysis in pattern recognition - the number of objects in the taxon ,

Cluster analysis in pattern recognition - length th edge.

Averaged over all taxa measure of proximity of points Cluster analysis in pattern recognition .

The average length of the edges connecting taxa, Cluster analysis in pattern recognition .

The measure of local heterogeneity is defined as follows.

If the length Cluster analysis in pattern recognition th rib and the length of the shortest edge adjacent to it , the less , the greater the difference in the lengths of the edges, the more reason to believe that along an edge with a length will pass the border between taxa. Generalizing criterion

Cluster analysis in pattern recognition ,

Where Cluster analysis in pattern recognition - total number of points in the training set.

Determine the value Cluster analysis in pattern recognition in the following way:

Cluster analysis in pattern recognition .

It can be shown that with a fixed Cluster analysis in pattern recognition maximum achieved when .

Now it is possible to form an integrated taxonomy quality criterion

Cluster analysis in pattern recognition .

The more Cluster analysis in pattern recognition However, this quality is higher. Thus, in implementing taxonomy, you need to strive to maximize . If you want to form taxa, it is necessary to break off in the CNR graph ribs such that the criterion turned out to be the maximum possible. This is a busting task. With a large number of taxa and learning sample objects, the number of options may be unacceptably large. It is desirable to reduce the computational cost. The following trick is offered as an example. KNP-graph is not built on the set of points of the training set, but on the set of hypersphere centers (taxa), found using the FOREL algorithm. This can dramatically reduce the number of vertices (and therefore the edges) of the graph and make it possible to carry out a complete search of the cliff variants. Cluster analysis in pattern recognition ribs. Of course, this is not guaranteed a global maximum. , especially if we recall the shortcomings inherent in the FOREL algorithm. This method of combining FOREL and KRAB algorithms is named by the authors KRAB 2.