Statistics Functions in compute
compute supports the following statistical operations:
- mean - arithmetic mean.
- median - median value (50th percentile) (details).
- q1 - First quartile (details)
- q3 - Third quartile.
- iqr - Inter-quartile range.
- mode - The value appearing most often in the data (details).
- antimode - The value appearing least often in the data.
- pstdev - Standard-deviation of a set representing the entire population (details).
- sstdev - Standard-deviation of a set representing a sample of a population.
- pvar - Variance of a population (details).
- svar - Variance of a sample.
- mad - Median Absolute Deviation, scaled by 1.4826 for normal distribution (details).
- madraw- Median Absolute Deviation, unscaled.
- pskew - Skewness of a set representing the entire population (details).
- sskew - Skewness of a set representing a sample of a population.
- pkurt - Excess Kurtosis of a set representing the entire population (details).
- skurt - Excess Kurtosis of a set representing a sample of a population.
- jarque - p-Value of Jarque-Bera test for normality (details).
- dpo - p-Value of D'Agostino-Pearson Omnibus test for normality (details).
Equivalent R functions
compute is designed to closely follow R project's
statistical functions. See the R equivalent code for each
of compute's operators. When building
compute from source code on your local computer,
these operators are checked using the
make check command.
Using statistical functions
Example of calculating the (Five-Number Summary)http://en.wikipedia.org/wiki/Five-number_summary of all values in the first column of the input file:
$ compute min 1 q1 1 median 1 q3 1 max 1 < FILE.TXT 78 93 100 107 120
The same command, with header lines for better clarity:
$ compute -H min 1 q1 1 median 1 q3 1 max 1 < FILE.TXT min(x) q1(x) median(x) q3(x) max(x) 78 93 100 107 120
Finding out the count,mean and sample standard-deviation:
$ compute -H count 1 mean 1 sstdev 1 < FILE.TXT count(x) mean(x) sstdev(x) 100 100.06 9.5767184
Testing for normality (See next section for discussion about normality testing):
$ compute -H sskew 1 skurt 1 dpo 1 jarque 1 < FILE.TXT sskew(x) skurt(x) dpo(x) jarque(x) -0.207246 -0.5770543 0.3341 0.3271
Testing for normality
A normal distribution is required for many applications of. Testing whether your data fits a normal distribution or not is commonly a first step before starting a more thorough analysis.
Several operations can be used to test for normality, thought care must be taken when inferring results. Always consult a trained statistician before making final analysis.
Skewness operations (sskew, pskew) test the data set for asymmetry of the probability distribution of a real-valued random variable about its mean. A normally distributed set should have low skewness.
The rule-of-thumb ranges were suggested in Mulmer, M.G., Principles of Statistics (1979):
x > 0 - positively skewed / skewed right 0 > x - negatively skewed / skewed left x > 1 - highly skewed right 1 > x > 0.5 - moderately skewed right 0.5 > x > -0.5 - approximately symmetric -0.5 > x > -1 - moderately skewed left -1 > x - highly skewed left
Jarque-Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution.
compute's jarque operator returns the p-value calculated for the data set
based on the Jarque-Bera test, under the null-hepythesis of normality.
A high p-value indicates the null hypothesis cannot be rejected,
and therefor the input data might be normally distributed.
A low p-value indicates the null hypothesis can be rejected, and the
data is likely not normally distributed.
D'Agostino-Peason Omnibus Test
D'Agostino-Pearson Omnibus test detects deviations from normality due to either skewness or kurtosis.
Similarly to jarque operator, the dpo operator returns the p-value calculated for the data set based on the Jarque-Bera test, under the null-hepythesis of normality. A high p-value indicates the null hypothesis cannot be rejected, and therefor the input data might be normally distributed. A low p-value indicates the null hypothesis can be rejected, and the data is likely not normally distributed.
Examples - Testing for normality
The files used in the following examples:
- seq20 - 100 normally-distributed random values.
- seq21 - 100 random values drawn from a non-normal distribution.
Testing normally-distribted values:
$ compute -H sskew 1 jarque 1 dpo 1 < seq20.txt sskew(x) jarque(x) dpo(x) -0.207246 0.327135 0.334111
The skewness result is close enough to zero to be considered approximately symmetric, while both Jarque-Bera test and D'Agostino-Pearson-Omnibus test return high p-values - indicating the null-hypothesis of normal-distribution cannot be rejected.
NOTE: This does not yet prove the data is truly normally distributed - but this is a positive first step.
Testing non-normally-distributed values:
$ compute -H sskew 1 jarque 1 dpo 1 < seq21.txt sskew(x) jarque(x) dpo(x) 1.212020 8.0113e-09 7.6899e-10
The skewness result is large enough to indicate the values are highly skewed. The Jarque-Bera test and D'Agostino-Pearson-Omnibus test results show very low p-values - indicating the null-hypothesis of normal-distribution can be rejected.
For an informative tutorial about skewness and kurtosis, see Measures of Shape: Skewness and Kurtosis, by Stan Brown