Basic Statistical Functions (GNU Octave (version 11.1.0))


26.3 Basic Statistical Functions

Octave supports various helpful statistical functions. Many are useful as initial steps to prepare a data set for further analysis. Others provide different measures from those of the basic descriptive statistics.

y = center (x)

y = center (x, dim)

y = center (x, vecdim)

y = center (x, "all")

y = center (…, nanflag)

Center data by subtracting its mean.

If x is a vector, then center (x) computes the centered data by subtracting the mean of x from each element of x.

If x is a matrix, then center (x) centers each column of x by subtracting that column's mean from its elements.

If x is an array, then center (x) centers the data along the first non-singleton dimension of x.

The data in x must be numeric. The size of y is equal to the size of x.

The optional input dim specifies the dimension to operate on and must be a positive integer. Specifying any singleton dimension of x, including any dimension exceeding ndims (x), subtracts each element from itself, so y is zero wherever x is finite.

Specifying multiple dimensions with input vecdim, a vector of non-repeating dimensions, will operate along the array slice defined by vecdim. If vecdim indexes all dimensions of x, then it is equivalent to the option "all". Any dimension in vecdim greater than ndims (x) is ignored.

Specifying the dimension as "all" will cause center to subtract the mean of all elements of x from each element. The centered values equal those of center (x(:)), with y keeping the size of x.

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation of the mean using any of the previously specified input argument combinations. The default value for nanflag is "includenan", which keeps NaN values in the calculation; any NaN value along the operating dimensions will then result in all corresponding elements of y being NaN. To exclude NaN values, set the value of nanflag to "omitnan".

Programming Note: center has obvious application for normalizing statistical data. It is also useful for improving the precision of general numerical calculations. Whenever there is a large value that is common to a batch of data, the mean can be subtracted off, the calculation performed, and then the mean added back to obtain the final answer.
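Both uses described in the note can be sketched as follows. The sample matrix and the common offset of 1e9 are made-up values chosen for illustration, and the naive variance formula is shown only as a contrast, not as something Octave does internally.

```octave
# Center each column of a small matrix (sample data, not from the manual).
x = [1 2; 3 4; 5 6];
y = center (x);          # subtracts the column means 3 and 4
disp (y)                 # rows: -2 -2 / 0 0 / 2 2

# Center along rows instead by passing dim = 2.
yr = center (x, 2);      # every row becomes [-0.5 0.5]

# Precision: population variance of data sharing a large common offset.
v = 1e9 + [4 7 13 16];
naive    = mean (v .^ 2) - mean (v) ^ 2;   # prone to cancellation error
centered = mean (center (v) .^ 2);         # 22.5
```

Because the centered values are small integers, their squares are represented exactly, whereas the squares in the naive formula are of order 1e18 and lose the low-order digits that carry the answer.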

See also: zscore.

z = zscore (x)

z = zscore (x, opt)

z = zscore (x, opt, dim)

z = zscore (x, opt, vecdim)

z = zscore (x, opt, "all")

z = zscore (…, nanflag)

[z, mu, sigma] = zscore (…)

Compute the z-score of x.

For a vector x, the z-score is calculated by subtracting the mean and dividing by its standard deviation. If the standard deviation is zero, then divide by 1 instead.

If x is a vector, then zscore (x) returns the z-score of the elements in x.

If x is a matrix, then zscore (x) returns a matrix of the same size as x in which each column contains the z-scores of the corresponding column of x.

If x is an array, then zscore (x) computes the z-score along the first non-singleton dimension of x.

The optional parameter opt determines the normalization to use when computing the standard deviation and has the same definition as the corresponding parameter for std.

Specifying the dimension as "all" will cause zscore to operate on all elements of x, and is equivalent to zscore (x(:)).

The optional variable nanflag specifies whether to include or exclude NaN values from the calculation using any of the previously specified input argument combinations. The default value for nanflag is "includenan" which keeps NaN values in the calculation. To exclude NaN values, set the value of nanflag to "omitnan". The output will still contain NaN values at the same locations as in x.

The optional outputs mu and sigma contain the mean and standard deviation.
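As a small illustration of the outputs and of the zero-deviation rule, the following sketch uses made-up data whose mean is 5 and whose population standard deviation is 2. It relies on opt = 1 selecting normalization by N, as it does for std.

```octave
x = [2 4 4 4 5 5 7 9];            # sample data, mean 5

# opt = 1 normalizes the standard deviation by N rather than N-1.
[z, mu, sigma] = zscore (x, 1);
# mu = 5, sigma = 2, z = [-1.5 -0.5 -0.5 -0.5 0 0 1 2]

# Constant data has zero standard deviation, so zscore divides by 1.
zc = zscore ([3 3 3]);            # [0 0 0] rather than NaN
```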

See also: mean, std, center.

z = normalize (x)

z = normalize (x, dim)

z = normalize (…, method)

z = normalize (…, method, option)

z = normalize (…, scale, scaleoption, center, centeroption)

[z, c, s] = normalize (…)

Return a normalization of the data in x using one of several available scaling and centering methods.

normalize by default will return the zscore of x, defined as the number of standard deviations each element is from the mean of x. This is equivalent to centering at the mean of the data and scaling by the standard deviation. x must be a numeric array of double or single floating point numbers.

The returned value z will have the same size as x. The optional return variables c and s are the centering and scaling factors used in the normalization such that:

  z = (x - c) ./ s

If x is a vector, normalize will operate on the data in x.

If x is a matrix, normalize will operate independently on each column in x.

If x is an N-dimensional array, normalize will operate along the first non-singleton dimension of x.

If the optional second argument dim is given, operate along this dimension.
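The relation between z, c, and s can be checked directly. This is a minimal sketch with a made-up two-column matrix; the variable names d and err are introduced here only for the check.

```octave
x = [1 2 3 4 5; 10 20 30 40 50]';   # 5-by-2, made-up data

[z, c, s] = normalize (x);          # default: z-score of each column
# c holds the column means [3 30], s the column standard deviations.

d   = z - (x - c) ./ s;             # broadcasting applies c and s per column
err = max (abs (d(:)));             # zero up to rounding

# Operate along rows instead of columns.
zr = normalize (x, 2);
```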

normalize ignores NaN values in x, similar to the behavior of the omitnan option in std, mean, and median.

The optional inputs method and option can be used to specify the type of normalization performed on x. Note that only the scale and center options may be specified together using any of the methods defined below. Valid normalization methods are:

zscore

(Default) Normalizes the elements in x to the scaled distance from a central value. Valid Options:

std: (Default) Data is centered at mean (x) and scaled by the standard deviation.
robust: Data is centered at median (x) and scaled by the median absolute deviation.

norm

x is scaled by its general vector norm (the p-norm), with option being the exponent p that determines the vector norm type. The scaling factor is:

  [sum (abs (x) .^ p)] ^ (1/p)

p can be any positive scalar, specific values being:

p = 1: x is normalized by sum (abs (x)).
p = 2: (Default) x is normalized by the Euclidean norm, or vector magnitude, of the elements.
p = Inf: x is normalized by max (abs (x)).

scale

x is scaled by a factor determined by option, which can be a numeric scalar or one of the following:

std: (Default) x is scaled by its standard deviation.
mad: x is scaled by its median absolute deviation.
first: x is scaled by its first element.
iqr: x is scaled by its interquartile range.

range

x is scaled to fit the range specified by option, a two-element row vector. The default range is [0, 1].

center

x is shifted by an amount determined by option, which can be a numeric scalar or one of the following:

mean: (Default) x is shifted by mean (x).
median: x is shifted by median (x).

medianiqr

x is shifted by median (x) and scaled by the interquartile range.
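A few of the methods above, applied to the same made-up vector, may make the differences easier to see. The variable names are illustrative only, and the expected values in the comments follow from the definitions given in the list.

```octave
x = [1 2 3 4 5];                               # made-up sample data

z_range  = normalize (x, "range");             # [0 0.25 0.5 0.75 1]
z_norm1  = normalize (x, "norm", 1);           # x / sum (abs (x)) = x / 15
z_inf    = normalize (x, "norm", Inf);         # x / max (abs (x)) = x / 5
z_first  = normalize (x, "scale", "first");    # x / x(1), unchanged here
z_median = normalize (x, "center", "median");  # x - 3

# scale and center may be combined; this pair reproduces the default.
z_both = normalize (x, "scale", "std", "center", "mean");
```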

Known MATLAB incompatibilities:

The option DataVariables is only available when input x is a table class, which is not yet implemented in core Octave. See the datatypes and tablicious Octave Packages for an available overloaded method.

See also: zscore, iqr, norm, rescale, std, median, mean, mad.

n = histc (x, edges)

n = histc (x, edges, dim)

[n, idx] = histc (…)

Compute histogram counts.

When x is a vector, the function counts the number of elements of x that fall in the histogram bins defined by edges, which must be a vector of monotonically increasing values. n(k) contains the number of elements in x for which edges(k) <= x < edges(k+1). The final element of n contains the number of elements of x exactly equal to the last element of edges.

When x is an N-dimensional array, the computation is carried out along dimension dim. If not specified, dim defaults to the first non-singleton dimension.

When a second output argument is requested an index matrix is also returned. The idx matrix has the same size as x. Each element of idx contains the index of the histogram bin in which the corresponding element of x was counted.
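The bin rules can be traced on a short vector. The data and edges below are made up, and every element lies within the edges so that each one is assigned to a bin; n2 is a name introduced here to show that the counts can be rebuilt from idx.

```octave
x     = [1 2 2 3 5 5 5];     # made-up data
edges = [1 3 5];             # bins: [1,3), [3,5), and exactly 5

[n, idx] = histc (x, edges);
# n   = [3 1 3]
# idx = [1 1 1 2 3 3 3]

# The counts are the number of times each bin index occurs in idx.
n2 = accumarray (idx(:), 1, [numel(edges), 1]).';
```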

See also: hist.

The unique function, documented separately, is also often useful for statistics.

c = nchoosek (n, k)

c = nchoosek (set, k)

Compute the binomial coefficient of n and k, or list all possible combinations of a set of items.

If n is a scalar then calculate the binomial coefficient of n and k which is defined as

 /   \
 | n |    n (n-1) (n-2) ... (n-k+1)       n!
 |   |  = ------------------------- =  ---------
 | k |               k!                k! (n-k)!
 \   /

This is the number of combinations of n items taken in groups of size k.

If the first argument is a vector, set, then generate all combinations of the elements of set, taken k at a time, with one row per combination. The result c has k columns and nchoosek (length (set), k) rows.

For example:

How many ways can three items be grouped into pairs?

nchoosek (3, 2)
   ⇒  3

What are the possible pairs?

nchoosek (1:3, 2)
   ⇒   1   2
       1   3
       2   3

Programming Note: When calculating the binomial coefficient, nchoosek works only for non-negative integer arguments. Use bincoeff for non-integer and negative scalar arguments, or for computing many binomial coefficients at once with vector inputs for n or k.
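The link between the two calling forms, and the vectorized alternative mentioned in the note, can be sketched as follows. The item values are made up, and items is a name chosen here to avoid shadowing the built-in set function.

```octave
items = [10 20 30 40];        # made-up items
c = nchoosek (items, 2);      # one combination per row
size (c)                      # 6 2, because nchoosek (4, 2) is 6

# Many coefficients at once: one row of Pascal's triangle.
b = bincoeff (5, 0:5);        # 1 5 10 10 5 1
```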

See also: bincoeff, perms.

P = perms (v)

P = perms (v, "unique")

Generate all permutations of vector v with one row per permutation.

Results are returned in reverse lexicographic order if v is in ascending order. If v is in a different permutation, then the result is permuted that way too. Consequently, an input in descending order yields a result in normal lexicographic order. The result has size factorial (n)-by-n, where n is the length of v. Any repeated elements are included in the output.

If the optional argument "unique" is given then only unique permutations are returned, using less memory and taking less time than calling unique (perms (v), "rows").

Example 1

perms ([1, 2, 3])
⇒ 
3   2   1
3   1   2
2   3   1
2   1   3
1   3   2
1   2   3

Example 2

perms ([1, 1, 2, 2], "unique")
⇒ 
2   2   1   1
2   1   2   1
2   1   1   2
1   2   2   1
1   2   1   2
1   1   2   2

Programming Note: If the "unique" option is not used, the length of v should be no more than 10-12 to limit memory consumption. Even with "unique", there should be no more than 10-12 unique elements in v.
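The ordering and size statements above can be checked with a short sketch. The variable names are illustrative, and the comparison against unique (…, "rows") sorts the rows first because the two approaches return them in different orders.

```octave
v  = [1 2 3];
P  = perms (v);                      # 6-by-3, reverse lexicographic order
Pd = perms (fliplr (v));             # descending input
isequal (Pd, sortrows (Pd))          # true: rows are in lexicographic order

# Repeated elements: 24 rows in total, only 6 of them distinct.
w = [1 1 2 2];
rows (perms (w))                     # 24
U = perms (w, "unique");             # 6-by-4
isequal (sortrows (U), unique (perms (w), "rows"))   # true
```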

See also: permute, randperm, nchoosek.

y = ranks (x)

y = ranks (x, dim)

y = ranks (x, dim, rtype)

Return the ranks (in the sense of order statistics) of x along the first non-singleton dimension adjusted for ties.

If the optional dim argument is given, operate along this dimension.

The optional parameter rtype determines how ties are handled. All examples below assume an input of [ 1, 2, 2, 4 ].

0 or "fractional" (default) for fractional ranking (1, 2.5, 2.5, 4);
1 or "competition" for competition ranking (1, 2, 2, 4);
2 or "modified" for modified competition ranking (1, 3, 3, 4);
3 or "ordinal" for ordinal ranking (1, 2, 3, 4);
4 or "dense" for dense ranking (1, 2, 2, 3).
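The tie-handling options listed above can be reproduced with the same input. Since x is a row vector, dim is given as 2 whenever rtype is passed; the small matrix at the end is a made-up addition showing the default column-wise behavior.

```octave
x = [1 2 2 4];                          # the input assumed in the list above

r_frac  = ranks (x);                    # 1 2.5 2.5 4 (fractional, default)
r_comp  = ranks (x, 2, "competition");  # 1 2 2 4
r_mod   = ranks (x, 2, "modified");     # 1 3 3 4
r_ord   = ranks (x, 2, "ordinal");      # 1 2 3 4
r_dense = ranks (x, 2, 4);              # 1 2 2 3 (dense, numeric form)

# For a matrix, each column is ranked separately.
m  = [30 1; 10 2; 20 3];
rm = ranks (m);                         # [3 1; 1 2; 2 3]
```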

See also: spearman, kendall.

cnt = run_count (x, n)

cnt = run_count (x, n, dim)

Count the upward runs along the first non-singleton dimension of x of length 1, 2, …, n-1 and greater than or equal to n.

If the optional argument dim is given then operate along this dimension.
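As a small worked case, the made-up column vector below contains three upward runs: 1 2 3 (length 3), 1 2 (length 2), and 1 (length 1). With n = 3 the result therefore has one count each for length 1, length 2, and length 3 or more. This is a sketch of the expected behavior, not a description of the implementation.

```octave
x   = [1; 2; 3; 1; 2; 1];   # made-up data with runs of length 3, 2, and 1
cnt = run_count (x, 3);     # counts for length 1, length 2, and >= 3
disp (cnt(:).')             # 1 1 1
```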

See also: runlength.

count = runlength (x)

[count, value] = runlength (x)

Find the lengths of all sequences of common values.

count is a vector containing the length of each run of repeated values.

The optional output value contains the value that was repeated in the sequence.

runlength ([2, 2, 0, 4, 4, 4, 0, 1, 1, 1, 1])
⇒    2   1   3   1   4
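Requesting the second output makes the result a simple run-length encoding. The sketch below reuses the input from the example above and decodes it again with repelem, which is used here only as one convenient way to expand the pairs.

```octave
x = [2, 2, 0, 4, 4, 4, 0, 1, 1, 1, 1];
[count, value] = runlength (x);
# count = [2 1 3 1 4], value = [2 0 4 0 1]

# Decode: repeat each value by its count to recover x.
xr = repelem (value, count);
isequal (xr, x)              # true
```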

See also: run_count.