Lecture
Correlation (from the Latin correlatio - correlation, interrelation), or correlation dependence, is a statistical interrelation of two or more random variables (or of quantities that can be regarded as such with some acceptable degree of accuracy). Changes in the values of one or several of these quantities are accompanied by a systematic change in the values of the other quantity or quantities. [1] The mathematical measure of the correlation of two random variables is the correlation ratio [2] or the correlation coefficient [1]. If a change in one random variable does not lead to a regular change in another random variable, but does lead to a change in some other statistical characteristic of that random variable, then this connection is not considered a correlation, although it is statistical [3].
The term "correlation" was first introduced by the French paleontologist Georges Cuvier in the 18th century. He developed the "law of correlation" of parts and organs of living beings, with which it is possible to restore the appearance of a fossil animal having at one's disposal only a part of its remains. In statistics, the word "correlation" was first used by the English biologist and statistician Francis Galton at the end of the 19th century. [4]
Some types of correlation coefficients may be positive or negative: in the first case one can only determine the presence or absence of a connection, in the second, its direction as well. If a strict order relation is assumed to be defined on the values of the variables, then a negative correlation is a correlation in which an increase in one variable is associated with a decrease in the other; the correlation coefficient will then be negative. A positive correlation, under such conditions, is a relationship in which an increase in one variable is associated with an increase in the other variable. It is also possible that there is no statistical relationship at all, for example, for independent random variables.
A significant correlation between two random variables is always evidence of some statistical relationship in the given sample, but this relationship need not be observed in another sample, nor does it imply a causal connection. The tempting simplicity of a correlational study often encourages the researcher to draw false intuitive conclusions about a cause-and-effect relationship between pairs of features, whereas correlation coefficients establish only statistical relationships. For example, looking at fires in a particular city, one can find a very high positive correlation between the damage caused by a fire and the number of firefighters who participated in extinguishing it. It does not follow, however, that "an increase in the number of firefighters leads to an increase in damage", and an attempt to minimize fire damage by eliminating the fire brigades certainly will not succeed. [5] At the same time, the absence of correlation between two quantities does not mean that there is no connection between them. For example, the dependence may be complex and non-linear in nature, which correlation does not reveal.
An important characteristic of the joint distribution of two random variables is the covariance (or correlation moment). Covariance is a joint central moment of the second order. [6] Covariance is defined as the mathematical expectation of the product of the deviations of the random variables [7]:

cov(X, Y) = M[(X − M[X]) (Y − M[Y])],

where M[·] is the mathematical expectation (in the English-language literature the notation E[·] is used).
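As an illustration, here is a minimal Python sketch (assuming NumPy is available; the sample values are made up for the example) that computes the covariance directly from this definition and compares it with NumPy's estimator:

```python
import numpy as np

# Hypothetical sample data, used only to illustrate the definition.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.5, 3.5, 3.0, 5.0])

# Covariance as the mean product of deviations from the means:
# cov(X, Y) = M[(X - M[X]) * (Y - M[Y])]  (population form, divisor n).
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# NumPy's estimator with bias=True uses the same divisor n.
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_numpy)  # the two values coincide
```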
Covariance properties: the covariance of two independent random variables is zero, and the absolute value of the covariance of two random variables does not exceed the product of their standard deviations. Both properties are derived below.
Since X and Y are independent random variables, their deviations X − M[X] and Y − M[Y] are also independent. Using the fact that the mathematical expectation of a product of independent random variables equals the product of the mathematical expectations of the factors, and that the mathematical expectation of a deviation is zero, we have

cov(X, Y) = M[(X − M[X])(Y − M[Y])] = M[X − M[X]] · M[Y − M[Y]] = 0.

We introduce the random variable Z₁ = σ_Y X − σ_X Y (where σ denotes the standard deviation) and find its variance D(Z₁). After completing the calculations we get:

D(Z₁) = 2σ_X² σ_Y² − 2σ_X σ_Y cov(X, Y).

Any variance is non-negative, therefore

2σ_X² σ_Y² − 2σ_X σ_Y cov(X, Y) ≥ 0.

From here

cov(X, Y) ≤ σ_X σ_Y.

Introducing the random variable Z₂ = σ_Y X + σ_X Y similarly, we obtain

cov(X, Y) ≥ −σ_X σ_Y.

Combining these inequalities, we have

−σ_X σ_Y ≤ cov(X, Y) ≤ σ_X σ_Y,

or

|cov(X, Y)| ≤ σ_X σ_Y.

So, the absolute value of the covariance of two random variables does not exceed the product of their standard deviations.
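A quick numeric check of this bound on arbitrary simulated data (a sketch assuming NumPy; the variable names and the generated sample are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)   # arbitrary dependent sample

cov = np.mean((x - x.mean()) * (y - y.mean()))
sigma_x = x.std()   # population standard deviation (divisor n)
sigma_y = y.std()

# The bound proved above: |cov(X, Y)| <= sigma_X * sigma_Y
assert abs(cov) <= sigma_x * sigma_y
print(abs(cov), sigma_x * sigma_y)
```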
To eliminate this shortcoming of the covariance (its magnitude depends on the units in which the random variables are measured), the linear correlation coefficient (or Pearson correlation coefficient) was introduced; it was developed by Karl Pearson, Francis Edgeworth and Raphael Weldon in the 1890s. The correlation coefficient is calculated by the formula [10][8]:

r_XY = cov(X, Y) / (σ_X σ_Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² ),

where x̄, ȳ are the sample means.
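A minimal sketch of this formula in Python (assuming NumPy; the helper name pearson_r and the toy data are illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson linear correlation coefficient computed directly from the formula
    r = sum((x_i - x_mean)(y_i - y_mean)) / sqrt(sum((x_i - x_mean)^2) * sum((y_i - y_mean)^2))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson_r(x, y))          # manual formula
print(np.corrcoef(x, y)[0, 1])  # NumPy gives the same value
```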
The correlation coefficient ranges from minus one to plus one [11] .
Indeed, dividing both sides of the double inequality −σ_X σ_Y ≤ cov(X, Y) ≤ σ_X σ_Y by σ_X σ_Y, we get

−1 ≤ r_XY ≤ 1.
The linear correlation coefficient is related to the regression coefficient by the following relationship: r_XY = b · σ_X / σ_Y, where b is the regression coefficient (of Y on X), σ_X is the standard deviation of the corresponding factor (explanatory) feature, and σ_Y is the standard deviation of the resulting feature [12].
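A small numeric illustration of this relationship (a sketch assuming NumPy and simulated data; the regression slope is obtained with np.polyfit):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 1.5 * x + rng.normal(size=500)

r = np.corrcoef(x, y)[0, 1]
sigma_x, sigma_y = x.std(), y.std()

# Slope of the least-squares regression of Y on X.
b = np.polyfit(x, y, 1)[0]

# The relationship stated above: r = b * sigma_X / sigma_Y.
print(r, b * sigma_x / sigma_y)  # the two numbers agree up to rounding
```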
For a graphical representation of such a relationship, you can use a rectangular coordinate system with axes that correspond to both variables. Each pair of values is marked with a specific symbol. Such a graph is called a “scatterplot”.
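A minimal scatterplot sketch, assuming Matplotlib and NumPy and using simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.6, size=200)

# Each (x_i, y_i) pair is marked with a point; the shape of the cloud hints at
# the direction and strength of the linear relationship.
plt.scatter(x, y, s=12)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatterplot of two correlated variables")
plt.show()
```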
The method of calculating the correlation coefficient depends on the type of scale to which the variables belong. To measure variables on interval and quantitative scales, the Pearson correlation coefficient (product-moment correlation) is used. If at least one of the two variables has an ordinal scale or is not normally distributed, Spearman's rank correlation or Kendall's τ (tau) must be used. When one of the two variables is dichotomous, the point-biserial correlation is used, and if both variables are dichotomous, the four-field correlation. Calculating the correlation coefficient between two non-dichotomous variables is meaningful only when the connection between them is linear (unidirectional).
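For orientation, SciPy provides ready-made functions for several of these coefficients; a short sketch (assuming scipy.stats and artificial data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 0.7 * x + rng.normal(size=100)
group = (rng.random(100) > 0.5).astype(int)   # dichotomous variable

print(stats.pearsonr(x, y))            # interval / quantitative scales
print(stats.spearmanr(x, y))           # at least one ordinal or non-normal variable
print(stats.kendalltau(x, y))          # Kendall's tau, also rank-based
print(stats.pointbiserialr(group, x))  # one dichotomous, one quantitative variable
```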
The Kendall rank correlation coefficient is used to identify the relationship between quantitative or qualitative indicators, provided they can be ranked. The values of the indicator X are arranged in ascending order and assigned ranks. The values of the indicator Y are then ranked, and the Kendall correlation coefficient is calculated:

τ = 2S / (n(n − 1)),

where S = P − Q;

P is the total number of observations, following the current observation, with a larger value of the rank of Y;

Q is the total number of observations, following the current observation, with a smaller value of the rank of Y (equal ranks are not counted!).
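A direct implementation of this counting scheme might look as follows (a Python sketch assuming no tied ranks; the function name and toy data are illustrative):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau via concordant/discordant counting, assuming no tied ranks:
    sort the pairs by x, then P counts later observations with larger y,
    Q counts later observations with smaller y; tau = 2*(P - Q) / (n*(n - 1))."""
    pairs = sorted(zip(x, y))                 # order observations by ascending X
    n = len(pairs)
    P = Q = 0
    for (_, yi), (_, yj) in combinations(pairs, 2):   # i earlier, j later
        if yj > yi:
            P += 1
        elif yj < yi:
            Q += 1                            # equal ranks are not counted
    return 2 * (P - Q) / (n * (n - 1))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(kendall_tau(x, y))   # 0.6 for this toy data
```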
If the studied data contain repeated values (tied ranks), the corrected Kendall correlation coefficient is used in the calculations:

τ = S / √( [½ n(n − 1) − T_x] · [½ n(n − 1) − T_y] ),

where T_x = ½ Σ t_x(t_x − 1), T_y = ½ Σ t_y(t_y − 1), and t_x, t_y are the numbers of tied ranks in the series X and Y, respectively.
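In practice, SciPy's kendalltau already applies a tie correction (it computes the tau-b variant by default), so for data with repeated values one can simply call it; a brief sketch with invented data:

```python
from scipy import stats

# With repeated values (tied ranks) in either series, the tie-corrected
# coefficient should be used; scipy.stats.kendalltau returns the tau-b variant
# (which includes the tie correction) by default.
x = [1, 2, 2, 3, 4, 4, 5]
y = [1, 3, 2, 3, 5, 4, 5]
print(stats.kendalltau(x, y))
```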
The degree of dependence of two random variables (features) X and Y can also be characterized on the basis of rank analysis. Each value of X and Y is assigned a rank. The ranks of the values of X are arranged in the natural order i = 1, 2, ..., n. The rank of Y is written as R_i and corresponds to the rank of the pair (X, Y) for which the rank of X is equal to i. Based on the obtained ranks of X and Y, their differences d_i are calculated and the Spearman correlation coefficient is computed:

ρ = 1 − 6 Σ d_i² / (n(n² − 1)).
The value of the coefficient varies from −1 (the rank sequences are completely opposite) to +1 (the rank sequences completely coincide). A value of zero indicates that the features are independent.
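A sketch of the rank-difference formula in Python (assuming SciPy for ranking and no tied ranks; the function name is illustrative), compared with SciPy's built-in result:

```python
import numpy as np
from scipy import stats

def spearman_rho(x, y):
    """Spearman's coefficient from the formula rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    valid when there are no tied ranks."""
    rx = stats.rankdata(x)
    ry = stats.rankdata(y)
    d = rx - ry
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(spearman_rho(x, y))             # manual formula
rho, p_value = stats.spearmanr(x, y)
print(rho)                            # SciPy gives the same value
```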
The Fechner sign correlation coefficient is based on counting the matches and mismatches of the signs of the deviations of the indicator values from their mean values:

K_F = (C − H) / (C + H),

where

C is the number of pairs in which the signs of the deviations of the values from their means coincide;

H is the number of pairs in which the signs of the deviations of the values from their means do not coincide.
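A possible Python sketch of this sign-based coefficient (the function name and sample values are illustrative; observations whose value equals the mean exactly are not treated specially here):

```python
import numpy as np

def fechner_coefficient(x, y):
    """Fechner sign correlation: compare the signs of deviations from the means.
    C = pairs (x_i, y_i) where the signs coincide, H = pairs where they do not;
    the coefficient is (C - H) / (C + H)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    same = np.sign(x - x.mean()) == np.sign(y - y.mean())
    C = np.sum(same)
    H = np.sum(~same)
    return (C - H) / (C + H)

x = [10, 12, 14, 13, 17, 20]
y = [100, 110, 115, 105, 130, 140]
print(fechner_coefficient(x, y))   # 1.0: all deviation signs coincide
```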
The coefficient of multiple rank correlation (coefficient of concordance) is calculated as

W = 12S / (m²(n³ − n)),

where

n is the number of groups (objects) that are ranked;

m is the number of variables (rankings);

x_ij is the rank of the i-th factor for the j-th unit;

S is the sum, over the ranked objects, of the squared deviations of the rank sums Σ_j x_ij from their mean value.

Significance: the statistic χ² = m(n − 1)W is compared with the tabulated χ² value for (n − 1) degrees of freedom; if the computed value exceeds the tabulated one, the hypothesis of the absence of a relationship is rejected.

If there are related (tied) ranks, a correction term for the ties is introduced into the denominator of W.
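A sketch of the concordance coefficient and its χ² significance check in Python (assuming the rank matrix has one row per ranking and one column per ranked object, with no tied ranks; the data are invented):

```python
import numpy as np
from scipy import stats

def kendall_w(ranks):
    """Coefficient of concordance W for a matrix of ranks
    (m rows = rankings/variables, n columns = ranked objects), no tied ranks.
    W = 12 * S / (m**2 * (n**3 - n)), where S is the sum of squared deviations
    of the column rank sums from their mean."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    S = np.sum((rank_sums - rank_sums.mean()) ** 2)
    return 12 * S / (m ** 2 * (n ** 3 - n))

# Three hypothetical rankings of five objects.
ranks = np.array([
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
])
W = kendall_w(ranks)
m, n = ranks.shape

# Significance check: chi2 = m * (n - 1) * W with (n - 1) degrees of freedom;
# if it exceeds the critical value, the hypothesis of no relationship is rejected.
chi2 = m * (n - 1) * W
print(W, chi2, stats.chi2.ppf(0.95, df=n - 1))
```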
An important property of the linear correlation coefficient is that |r| = 1 implies a linear relationship. Consider random variables X and Y with zero means and variances equal to σ_X² and σ_Y², respectively. Calculate the variance of the random variable aX + bY:

D(aX + bY) = a²σ_X² + b²σ_Y² + 2ab·cov(X, Y) = a²σ_X² + b²σ_Y² + 2ab·r·σ_X σ_Y.

Assuming that the correlation coefficient r = ±1, the previous expression can be rewritten as

D(aX + bY) = a²σ_X² + b²σ_Y² ± 2abσ_X σ_Y = (aσ_X ± bσ_Y)².

Since one can always choose the numbers a and b so that aσ_X ± bσ_Y = 0 (for example, if σ_Y ≠ 0, take an arbitrary a and b = ∓aσ_X/σ_Y), then with these a and b the variance D(aX + bY) = 0 and the mean M(aX + bY) = 0, so aX + bY = 0 almost surely. But this means a linear relationship between X and Y. The proof obviously generalizes to the case of X and Y with nonzero means; in the above calculations it is only necessary to replace X with X − M[X] and Y with Y − M[Y].
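A small numeric illustration (assuming NumPy and simulated data): for an exactly linear relationship the sample correlation coefficient equals ±1, while added noise pulls it strictly inside the interval (−1, 1):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1000)

# An exact linear relationship Y = aX + b ...
y_linear = 2.0 * x + 3.0
print(np.corrcoef(x, y_linear)[0, 1])   # exactly 1 (up to rounding)

# ... versus a noisy one, where |r| < 1.
y_noisy = 2.0 * x + 3.0 + rng.normal(size=1000)
print(np.corrcoef(x, y_noisy)[0, 1])
```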
Correlation analysis is a method of processing statistical data by means of which the closeness of the relationship between two or more variables is measured. Correlation analysis is closely related to regression analysis (the term "correlation and regression analysis", a more general statistical concept, is also often encountered); it determines the need to include particular factors in a multiple regression equation, and also evaluates the resulting regression equation for conformity to the identified relationships (using the coefficient of determination). [1][2]
This method of processing statistical data is very popular in economics and the social sciences (in particular in psychology and sociology), although the scope of application of correlation coefficients is broad: quality control of industrial products, metallography, agrochemistry, hydrobiology, biometrics, and others. In different applied fields, different interval boundaries have been adopted for assessing the closeness and significance of a relationship.
The popularity of the method is due to two factors: correlation coefficients are relatively simple to calculate, and their use requires no special mathematical training. Combined with ease of interpretation, this simplicity of application has led to the coefficient's widespread use in statistical data analysis.
In biology, correlation is the interrelation of traits (it may be positive or negative), arising due to gene linkage or pleiotropy. [15]