### 26.4 Correlation and Regression Analysis

: c = cov (x)
: c = cov (x, y)
: c = cov (…, opt)
: c = cov (…, nanflag)

Compute the covariance matrix.

The covariance between two variable vectors a and b is calculated as:

cov (a,b) = 1/(N-1) * SUM_i (a(i) - mean (a)) * (b(i) - mean (b))

where N is the length of the vectors a and b.

If called with one argument, compute cov (x, x). If x is a vector, this is the scalar variance of x. If x is a matrix, each row of x is treated as an observation and each column as a variable, and the (ij)-th entry of cov (x) is the covariance between the i-th and j-th columns of x. If x has dimensions n x m, the output c will be an m x m square covariance matrix.

If called with two arguments, compute cov (x, y), the covariance between two random variables x and y. x and y must have the same number of elements, and will be treated as vectors with the covariance computed as cov (x(:), y(:)). The output will be a 2 x 2 covariance matrix.
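The one- and two-argument forms can be compared on a small data set. The following is a minimal sketch, assuming a version of Octave that implements the two-argument behavior described above (2 x 2 output); the variable names and data are illustrative only.

```octave
# Observations in rows, variables in columns (3 observations, 2 variables).
x = [1 2; 3 4; 5 9];

# One argument: columns are treated as variables, so the result is 2 x 2.
c = cov (x);

# Manual check of the off-diagonal entry using the definition above.
a = x(:,1);
b = x(:,2);
N = numel (a);
c12 = sum ((a - mean (a)) .* (b - mean (b))) / (N - 1);

# Two arguments: both inputs are treated as vectors, giving a 2 x 2 matrix
# with the variances of a and b on the diagonal.
c2 = cov (a, b);
```

For this data both calls should give the same matrix, with the covariance of the two columns in the off-diagonal entries.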

The optional argument opt determines the type of normalization to use. Valid values are

0 [default]:

Normalize with N-1. This provides the best unbiased estimator of the covariance.

1:

Normalize with N. This provides the second moment around the mean. opt is set to 1 for N = 1.
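The effect of opt can be checked numerically, since the two normalizations differ only by the factor (N-1)/N. A small sketch with made-up data, using the three-argument form so that opt cannot be mistaken for a second data input:

```octave
x = [2; 4; 6; 8];
y = [1; 3; 2; 5];
N = numel (x);

c0 = cov (x, y, 0);   # normalize with N-1 (default)
c1 = cov (x, y, 1);   # normalize with N

# Every entry of c1 should equal the matching entry of c0 scaled by (N-1)/N.
ratio = (N - 1) / N;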

The optional argument nanflag must appear last in the argument list and controls how NaN values are handled by cov. The three valid values are:

includenan [default]:

Leave NaN values in x and y. Output will follow the normal rules for handling NaN values in arithmetic operations.

omitrows:

Rows containing NaN values are trimmed from both x and y prior to calculating the covariance. A NaN in one variable will remove that row from both x and y.

partialrows:

Rows containing NaN values are excluded from x and y independently for each (i, j) covariance calculation. This may result in a different number of observations, N, being used to calculate each element of the covariance matrix.
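The three nanflag values can be compared on two short vectors with a single missing value. This is a sketch under the assumption that the running Octave version supports the nanflag argument as documented here; the data and variable names are invented for illustration.

```octave
x = [1; 2; NaN; 4];
y = [2; 4; 6; 9];

# Default ("includenan"): the NaN in x propagates into every entry that
# involves x.
c_inc = cov (x, y);

# "omitrows": row 3 is removed from both x and y before the calculation,
# so the result should match cov on the three complete rows.
c_omit = cov (x, y, "omitrows");
c_ref  = cov ([1; 2; 4], [2; 4; 9]);

# "partialrows": NaN rows are removed pair by pair, so the x-y entry uses
# the three complete rows, while the y-y entry may use all four values of y.
c_part = cov (x, y, "partialrows");
```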

Compatibility Note: Previous versions of cov treated the rows of x and y as multivariate random variables. This version attempts to maintain full compatibility with MATLAB by treating x and y as two univariate distributions regardless of shape, resulting in a 2x2 output matrix. Code relying on Octave’s previous definition will need to be modified when running this newer version of cov. The previous behavior can be obtained by using the NaN package’s covm function as covm (x, y, "D").

: r = corr (x)
: r = corr (x, y)

Compute matrix of correlation coefficients.

If each row of x and y is an observation and each column is a variable, then the (ij)-th entry of corr (x, y) is the correlation between the i-th variable in x and the j-th variable in y. x and y must have the same number of rows (observations).

corr (x,y) = cov (x,y) / (std (x) * std (y))

If called with one argument, compute corr (x, x), the correlation between the columns of x.
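As a quick illustration of the shapes involved, the sketch below correlates the two columns of a matrix with a single vector and checks one entry against the formula above. The data are made up, and the manual calculation is written out explicitly rather than through cov.

```octave
x = [1 2; 2 1; 3 5; 4 3];   # 4 observations, 2 variables
y = [2; 4; 6; 8];           # y = 2 * x(:,1), so it is perfectly correlated
                            # with the first column of x

# 2 x 1 result: correlation of each column of x with y.
r = corr (x, y);

# Manual check of the second entry: covariance over product of std devs.
a = x(:,2);
N = numel (a);
cov_ay = sum ((a - mean (a)) .* (y - mean (y))) / (N - 1);
r2 = cov_ay / (std (a) * std (y));

# One argument: 2 x 2 matrix of correlations between the columns of x.
rx = corr (x);
```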

: r = corrcoef (x)
: r = corrcoef (x, y)
: r = corrcoef (…, param, value, …)
: [r, p] = corrcoef (…)
: [r, p, lci, hci] = corrcoef (…)

Compute a matrix of correlation coefficients.

x is an array where each column contains a variable and each row is an observation.

If a second input y (of the same size as x) is given then calculate the correlation coefficients between x and y.

param, value are optional pairs of parameters and values which modify the calculation. Valid options are:

"alpha"

Significance level used for the bounds of the confidence interval, lci and hci. Default is 0.05, i.e., a 95% confidence interval.

"rows"

Determine processing of NaN values. Acceptable values are "all", "complete", and "pairwise". Default is "all". With "complete", only the rows without NaN values are considered. With "pairwise", the selection of NaN-free rows is made for each pair of variables.

Output r is a matrix of Pearson’s product moment correlation coefficients for each pair of variables.

Output p is a matrix of pair-wise p-values testing for the null hypothesis of a correlation coefficient of zero.

Outputs lci and hci are matrices containing, respectively, the lower and upper bounds of the confidence interval of each correlation coefficient (95% with the default "alpha" of 0.05).
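A short usage sketch of the four-output form follows. The data are invented and nearly linear, so the off-diagonal correlation should be close to 1 with a small p-value; "alpha" is set to 0.01 here only to show the parameter/value syntax.

```octave
x = [1; 2; 3; 4; 5; 6];
y = [2.1; 3.9; 6.2; 7.8; 10.1; 12.2];   # roughly y = 2 * x

# Request the coefficients, p-values, and 99% confidence bounds.
[r, p, lci, hci] = corrcoef (x, y, "alpha", 0.01);

# r is 2 x 2; the off-diagonal entry is the correlation between x and y,
# and it should lie between the matching entries of lci and hci.
r_xy = r(1,2);
```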

: rho = spearman (x)
: rho = spearman (x, y)

Compute Spearman’s rank correlation coefficient rho.

For two data vectors x and y, Spearman’s rho is the correlation coefficient of the ranks of x and y.

If x and y are drawn from independent distributions, rho has zero mean and variance 1 / (N - 1), where N is the length of the x and y vectors, and is asymptotically normally distributed.

spearman (x) is equivalent to spearman (x, x).
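Because Spearman’s rho depends only on ranks, a monotone but nonlinear relationship gives rho = 1 even though the ordinary correlation is below 1. A minimal sketch with made-up data, also checking the rank-based definition using the ranks function:

```octave
x = [1; 2; 3; 4; 5];
y = x .^ 3;               # monotone increasing, but not linear in x

rho = spearman (x, y);    # rank correlation
r   = corr (x, y);        # ordinary (Pearson) correlation, for comparison

# The same value computed directly from the ranks of x and y.
rho_manual = corr (ranks (x), ranks (y));
```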

: tau = kendall (x)
: tau = kendall (x, y)

Compute Kendall’s tau.

For two data vectors x, y of common length N, Kendall’s tau is the correlation of the signs of all rank differences of x and y; i.e., if both x and y have distinct entries, then

tau = 1/(N (N-1)) * SUM_{i,j} sign (q(i) - q(j)) * sign (r(i) - r(j))

in which the q(i) and r(i) are the ranks of x and y, respectively.

If x and y are drawn from independent distributions, Kendall’s tau is asymptotically normal with mean 0 and variance (2 * (2N+5)) / (9 * N * (N-1)).

kendall (x) is equivalent to kendall (x, x).
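The formula for tau can be evaluated directly and compared with kendall. The sketch below uses invented data with distinct entries, in which 8 of the 10 pairs are concordant and 2 are discordant, so tau should be (8 - 2) / 10 = 0.6.

```octave
x = [1; 2; 3; 4; 5];
y = [2; 1; 4; 3; 5];

tau = kendall (x, y);

# Direct evaluation of the double sum over all pairs (i, j).
q = ranks (x);
r = ranks (y);
N = numel (x);
# q - q' broadcasts a column against a row, giving the N x N matrix of
# rank differences q(i) - q(j); likewise for r.
s = sum (sum (sign (q - q') .* sign (r - r')));
tau_manual = s / (N * (N - 1));
```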