Correlation and regression analysis. Linear correlation

Lecture

Content

Methods for studying the relationship of socio-economic phenomena using correlation and regression analysis
Linear correlation

General idea of the correlation analysis

The forms and types of relationships between phenomena are very diverse. Statistics deals only with those that are quantitative in nature and can be studied by quantitative methods. Consider the method of correlation and regression analysis, which is fundamental to the study of relationships between phenomena.

This method has two components: correlation analysis and regression analysis. Correlation analysis is a quantitative method for determining the closeness and direction of the relationship between sample variables. Regression analysis is a quantitative method for determining the form of the mathematical function that describes a causal relationship between variables.

To assess the strength of a relationship, correlation theory uses the Chaddock scale: weak, from 0.1 to 0.3; moderate, from 0.3 to 0.5; noticeable, from 0.5 to 0.7; high, from 0.7 to 0.9; very high (strong), from 0.9 to 1.0. It is used in the examples that follow.
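The scale above can be written as a small lookup, shown here as a minimal sketch. The function name `chaddock` is hypothetical, and two details are assumptions not stated in the text: each interval is treated as closed on the left and open on the right (so 0.3 counts as "moderate"), and values below 0.1 get the label "below scale".

```python
def chaddock(r: float) -> str:
    """Classify the strength of a correlation by its absolute value.

    Assumptions (not from the text): intervals are half-open [lo, hi),
    and |r| < 0.1 is labeled "below scale".
    """
    a = abs(r)
    if a > 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    # (lower bound, label) pairs, checked from the strongest band down
    bands = [
        (0.9, "very high"),
        (0.7, "high"),
        (0.5, "noticeable"),
        (0.3, "moderate"),
        (0.1, "weak"),
    ]
    for lower, label in bands:
        if a >= lower:
            return label
    return "below scale"


for value in (0.20, 0.595, 0.844):
    print(value, chaddock(value))
```

The sign of the coefficient is ignored here because the scale grades closeness only; direction is read from the sign separately.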

Linear correlation

This correlation characterizes a linear relationship in the variations of variables. It can be paired (two correlated variables) or multiple (more than two variables), and direct (positive) or inverse (negative), when the variables vary in the same or in opposite directions, respectively.

If the variables $x$ and $y$ are quantitative and equivalent in their independent observations $k = 1, \dots, n$, where $n$ is the total number of observations, then the most important empirical measures of the closeness of their linear relationship are the sign correlation coefficient of the German psychologist G. T. Fechner (1801-1887) and the paired, pure (partial) and multiple (total) correlation coefficients of the English statistician K. Pearson (1857-1936).

The Fechner pairwise sign correlation coefficient measures the consistency of directions in the individual deviations of the variables $x_k$ and $y_k$ from their means $\bar{x}$ and $\bar{y}$. It equals the ratio of the difference between the numbers of coinciding ($C_k$) and mismatched ($H_k$) pairs of signs in the deviations $x_k - \bar{x}$ and $y_k - \bar{y}$ to the sum of these numbers:

$$K_f = \frac{\sum C_k - \sum H_k}{\sum C_k + \sum H_k} \qquad (1)$$

The value of Kf varies from -1 to +1. Summation in (1) is over the observations $k = 1, \dots, n$; the limits are omitted from the sums for simplicity. If only one of the deviations is zero ($x_k - \bar{x} = 0$ or $y_k - \bar{y} = 0$), the observation is not included in the calculation. If both deviations are zero at once, the case is treated as a coincidence of signs and is counted in $C_k$. Table 12.1 shows the preparation of data for calculation (1).

Table 12.1 Data for the calculation of the Fechner coefficient.

No.	Number of employees, thousand people (x)	Commodity turnover, c.u. (y)	Deviation from the mean		Sign comparison
			of x	of y	coincidence (Ck)	mismatch (Hk)
1	0.2	3.1	+0.0	-0.9	0	1
2	0.1	3.1	-0.1	-0.9	1	0
3	0.4	5.0	+0.2	+1.0	1	0
4	0.2	4.4	+0.0	+0.4	1	0
5	0.1	4.4	-0.1	+0.4	0	1
Total	1.0	20.0	-	-	3	2

By (1) we have Kf = (3 - 2) / (3 + 2) = 0.20. The direction of the relationship between variations in the number of employees and the volume of turnover is positive (direct): the signs of the deviations of x and y from their means coincide in the majority of cases (3 out of 5). The closeness of the relationship on the Chaddock scale is weak.
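The calculation by (1) can be reproduced with a short sketch. The function name `fechner` and the flag `zero_is_positive` are hypothetical. Table 12.1 writes a zero deviation as "+0.0" and counts it in the sign comparison, so the default here treats a single zero deviation as positive; that reading of the table is an assumption. Setting the flag to `False` instead skips observations with exactly one zero deviation, as the rule stated in the text prescribes, and gives a different value on these data. Exact fractions are used so that a deviation of zero is detected reliably.

```python
from fractions import Fraction


def fechner(xs, ys, zero_is_positive=True):
    """Fechner sign correlation coefficient, formula (1).

    zero_is_positive=True  : a single zero deviation counts as "+"
                             (assumed convention of Table 12.1).
    zero_is_positive=False : a pair with exactly one zero deviation
                             is skipped (rule stated in the text).
    Pairs with both deviations zero always count as a coincidence.
    """
    xs = [Fraction(v) for v in xs]
    ys = [Fraction(v) for v in ys]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    coincide = mismatch = 0
    for x, y in zip(xs, ys):
        dx, dy = x - mean_x, y - mean_y
        if dx == 0 and dy == 0:
            coincide += 1
            continue
        if (dx == 0 or dy == 0) and not zero_is_positive:
            continue
        if (dx >= 0) == (dy >= 0):
            coincide += 1
        else:
            mismatch += 1

    if coincide + mismatch == 0:
        raise ValueError("no usable observations")
    return Fraction(coincide - mismatch, coincide + mismatch)


# Data of Table 12.1, passed as strings to keep the arithmetic exact
x = ["0.2", "0.1", "0.4", "0.2", "0.1"]   # number of employees
y = ["3.1", "3.1", "5.0", "4.4", "4.4"]   # commodity turnover

kf = fechner(x, y)
print(float(kf))
```

With the default convention the sketch returns the same Kf as the table; with `zero_is_positive=False` only observations 2, 3 and 5 remain, so the result depends on which convention is adopted.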

The coefficients of paired, pure (partial) and multiple (total) linear Pearson correlation, in contrast to the Fechner coefficient, take into account not only the signs but also the magnitudes of the deviations of the variables. Different methods are used to calculate them. Thus, by the direct counting method on ungrouped data, the Pearson pair correlation coefficient is:

$$r_{xy} = \frac{n\sum x_k y_k - \sum x_k \sum y_k}{\sqrt{\left[n\sum x_k^2 - \left(\sum x_k\right)^2\right]\left[n\sum y_k^2 - \left(\sum y_k\right)^2\right]}} \qquad (2)$$

This coefficient also varies from -1 to +1. When there are several variables, the Pearson multiple (cumulative) linear correlation coefficient is calculated. For three variables x, y, z, with y as the dependent variable, it has the form

$$R = \sqrt{\frac{r_{xy}^2 + r_{yz}^2 - 2\, r_{xy}\, r_{xz}\, r_{yz}}{1 - r_{xz}^2}} \qquad (3)$$

This coefficient varies from 0 to 1. If the effect of z on x and y is eliminated (removed completely or fixed at a constant level), their “general” relationship becomes “pure”, giving the pure (partial) Pearson linear correlation coefficient:

$$r_{xy.z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{\left(1 - r_{xz}^2\right)\left(1 - r_{yz}^2\right)}} \qquad (4)$$

This coefficient varies from -1 to +1. The squares of the correlation coefficients (2) - (4) are called coefficients (indices) of determination: paired, pure (partial) and multiple (total), respectively:

$$d_{xy} = r_{xy}^2, \qquad d_{xy.z} = r_{xy.z}^2, \qquad D = R^2 \qquad (5)$$

Each of the coefficients of determination varies from 0 to 1 and estimates the degree to which variation is determined by the linear relationship of the variables, showing the fraction of the variation of one variable (y) that is due to the variation of the other variable or variables (x and z). The multidimensional case of more than three variables is not considered here.

Following the approach developed by the English statistician R. A. Fisher (1890-1962), the statistical significance of the paired and pure (partial) Pearson correlation coefficients is tested, provided they are normally distributed, using the t-distribution of the English statistician W. S. Gosset (pseudonym "Student"; 1876-1937) with a given significance level $\alpha$ and degrees of freedom $\nu = n - m - 1$, where $m$ is the number of links (factor variables). For the pair coefficient ($m = 1$) the root-mean-square error is $\sigma_r = \sqrt{(1 - r_{xy}^2)/(n - 2)}$, and the actual value of Student's criterion is:

$$t_r = \frac{r_{xy}}{\sigma_r} = \frac{r_{xy}\sqrt{n - 2}}{\sqrt{1 - r_{xy}^2}} \qquad (6)$$

For the pure (partial) correlation coefficient $r_{xy.z}$, (n-3) must be taken instead of (n-2) in the calculation, because in this case m = 2 (two factor variables x and z). For large n > 100, instead of (n-2) or (n-3) in (6) we can take n, at a negligible loss of accuracy.

If tr > ttabl., the correlation coefficient (paired or pure) is statistically significant; if tr ≤ ttabl., it is insignificant.

The significance of the multiple correlation coefficient R is checked by Fisher's F-criterion, by calculating its actual value

$$F_R = \frac{R^2}{1 - R^2} \cdot \frac{n - m - 1}{m} \qquad (7)$$

When FR > Ftabl., the coefficient R is considered significant at the given significance level $\alpha$ and the available degrees of freedom $\nu_1 = m$ and $\nu_2 = n - m - 1$; when FR ≤ Ftabl., it is insignificant.
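The two significance checks can be sketched in a few lines. The function names `t_stat` and `f_stat` are hypothetical. The forms used are the standard ones, $t = r\sqrt{n - m - 1}/\sqrt{1 - r^2}$ and $F = \frac{R^2}{1 - R^2} \cdot \frac{n - m - 1}{m}$, assumed here to correspond to (6) and (7). The inputs are taken from the worked example later in the text: r = 0.595 (the square root of dxy = 0.354), R = 0.844, n = 5, and the tabulated values 3.182 and 19.0. Critical values are passed in by hand, since the text reads them from tables.

```python
from math import sqrt


def t_stat(r: float, n: int, m: int = 1) -> float:
    """Actual value of Student's criterion for a paired (m=1)
    or partial (m=2) correlation coefficient r with n observations."""
    dof = n - m - 1
    if dof <= 0:
        raise ValueError("not enough observations")
    return r * sqrt(dof) / sqrt(1.0 - r * r)


def f_stat(R: float, n: int, m: int) -> float:
    """Actual value of Fisher's F-criterion for a multiple
    correlation coefficient R with m factor variables."""
    dof = n - m - 1
    if dof <= 0:
        raise ValueError("not enough observations")
    return (R * R) / (1.0 - R * R) * dof / m


def is_significant(actual: float, tabulated: float) -> bool:
    """Decision rule: significant only if actual > tabulated."""
    return abs(actual) > tabulated


t_xy = t_stat(0.595, n=5, m=1)
F_R = f_stat(0.844, n=5, m=2)
print(round(t_xy, 3), is_significant(t_xy, 3.182))
print(round(F_R, 3), is_significant(F_R, 19.0))
```

With only five observations neither statistic exceeds its tabulated value in this sketch, which is what one would expect from so small a sample.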

In large populations (n > 100), the normal distribution law (the tabulated Laplace-Sheppard function) is applied directly to assess the significance of all Pearson coefficients, instead of the t and F criteria.

Finally, if the Pearson coefficients do not follow the normal law, Fisher's Z-criterion is used to test their significance; it is not considered here.

A worked example of calculations (2) - (7) is given in Table 12.2, where the initial data of Table 12.1 are supplemented with a third variable z, the total area of the store (in 100 sq. m).

Table 12.2. Data preparation for calculating Pearson correlation coefficients

No.	Indicators
k	xk	yk	zk	xk·yk	xk·zk	yk·zk	xk²	yk²	zk²
1	0.2	3.1	0.1	0.62	0.02	0.31	0.04	9.61	0.01
2	0.1	3.1	0.1	0.31	0.01	0.31	0.01	9.61	0.01
3	0.4	5.0	1.0	2.00	0.40	5.00	0.16	25.00	1.00
4	0.2	4.4	0.2	0.88	0.04	0.88	0.04	19.36	0.04
5	0.1	4.4	0.6	0.44	0.06	2.64	0.01	19.36	0.36
Total	1.0	20.0	2.0	4.25	0.53	9.14	0.26	82.94	1.42

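According to (2) - (5), the Pearson linear correlation coefficients and the corresponding coefficients of determination are:

rxy = 0.595; rxz = 0.677; ryz = 0.844; rxy.z = 0.061; R = 0.844; dxy = 0.354; dxy.z = 0.0037; D = 0.713.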

The relationship of the variables x and y is positive but not close: their pair correlation coefficient is rxy = 0.595 and their pure (partial) coefficient is rxy.z = 0.061, rated on the Chaddock scale as "noticeable" and "weak", respectively.

The coefficients of determination dxy = 0.354 and dxy.z = 0.0037 indicate that the variation of y (turnover) is due to the linear variation of x (the number of employees) by 35.4% in their total relationship, and by only 0.37% in their pure relationship. This is explained by the significant effect on x and y of the third variable z, the total area occupied by the stores. The closeness of its relationship with them is rxz = 0.677 and ryz = 0.844, respectively.

The coefficient of multiple (cumulative) correlation of the three variables shows that the closeness of the linear relationship of x and z with y is R = 0.844, rated on the Chaddock scale as “high”, and the coefficient of multiple determination is D = 0.713, indicating that 71.3% of all variation of y (commodity turnover) is due to the cumulative effect of the variables x and z on it. The remaining 28.7% is due to the effect on y of other factors or to a curvilinear relationship among the variables y, x, z.
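The coefficients discussed above can be recomputed from the "Total" row of Table 12.2, as a minimal sketch. The helper names `pair_r`, `partial_r` and `multiple_r` are hypothetical. The formulas are the standard direct-counting, partial and three-variable multiple correlation forms, assumed here to match (2) - (4). Results computed in floating point may differ in the third decimal place from the figures quoted in the text, and the partial coefficient in particular is sensitive to small changes in rxz.

```python
from math import sqrt

# Totals from Table 12.2 (n = 5 observations)
n = 5
sx, sy, sz = 1.0, 20.0, 2.0          # sums of x, y, z
sxy, sxz, syz = 4.25, 0.53, 9.14     # sums of products
sxx, syy, szz = 0.26, 82.94, 1.42    # sums of squares


def pair_r(n, sa, sb, sab, saa, sbb):
    """Pearson pair coefficient by the direct counting method."""
    num = n * sab - sa * sb
    den = sqrt((n * saa - sa ** 2) * (n * sbb - sb ** 2))
    return num / den


def partial_r(r_xy, r_xz, r_yz):
    """Pure (partial) correlation of x and y with z held fixed."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def multiple_r(r_xy, r_xz, r_yz):
    """Multiple correlation of y with x and z taken together."""
    num = r_xy ** 2 + r_yz ** 2 - 2 * r_xy * r_xz * r_yz
    return sqrt(num / (1 - r_xz ** 2))


r_xy = pair_r(n, sx, sy, sxy, sxx, syy)
r_xz = pair_r(n, sx, sz, sxz, sxx, szz)
r_yz = pair_r(n, sy, sz, syz, syy, szz)
r_xy_z = partial_r(r_xy, r_xz, r_yz)
R = multiple_r(r_xy, r_xz, r_yz)

for name, value in [("r_xy", r_xy), ("r_xz", r_xz), ("r_yz", r_yz),
                    ("r_xy.z", r_xy_z), ("R", R)]:
    # squared value is the matching coefficient of determination
    print(f"{name} = {value:.3f}, squared = {value ** 2:.4f}")
```

Working from the column totals rather than the raw rows mirrors the purpose of Table 12.2, which prepares exactly these sums.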

To assess the significance of the correlation coefficients, take the significance level $\alpha = 0.05$. According to the initial data, we have degrees of freedom $\nu = n - 2 = 3$ for rxy and $\nu = n - 3 = 2$ for rxy.z. From the theoretical table we find, respectively, ttabl.1 = 3.182 and ttabl.2 = 4.303. For the F-criterion we have $\nu_1 = m = 2$ and $\nu_2 = n - m - 1 = 2$, and from the table we find Ftabl. = 19.0. The actual values of each criterion by (6) and (7) are equal to: