# compute - Statistics examples

## Simple Data Processing

### Statistics Functions in compute

compute supports the following statistical operations:

1. mean - arithmetic mean.
2. median - median value (50th percentile) (details).
3. q1 - First quartile (details)
4. q3 - Third quartile.
5. iqr - Inter-quartile range.
6. mode - The value appearing most often in the data (details).
7. antimode - The value appearing least often in the data.
8. pstdev - Standard-deviation of a set representing the entire population (details).
9. sstdev - Standard-deviation of a set representing a sample of a population.
10. pvar - Variance of a population (details).
11. svar - Variance of a sample.
12. mad - Median Absolute Deviation, scaled by 1.4826 for normal distribution (details).
13. madraw- Median Absolute Deviation, unscaled.
14. pskew - Skewness of a set representing the entire population (details).
15. sskew - Skewness of a set representing a sample of a population.
16. pkurt - Excess Kurtosis of a set representing the entire population (details).
17. skurt - Excess Kurtosis of a set representing a sample of a population.
18. jarque - p-Value of Jarque-Bera test for normality (details).
19. dpo - p-Value of D'Agostino-Pearson Omnibus test for normality (details).

### Equivalent R functions

compute is designed to closely follow R project's statistical functions. See the R equivalent code for each of compute's operators. When building `compute` from source code on your local computer, these operators are checked using the `make check` command.

### Using statistical functions

Example of calculating the (Five-Number Summary)http://en.wikipedia.org/wiki/Five-number_summary of all values in the first column of the input file:

``````\$ compute min 1 q1 1 median 1 q3 1 max 1 < FILE.TXT
78      93     100        107    120
``````

The same command, with header lines for better clarity:

``````\$ compute -H min 1 q1 1 median 1 q3 1 max 1 < FILE.TXT
min(x)  q1(x)  median(x)  q3(x)  max(x)
78      93     100        107    120
``````

Finding out the count,mean and sample standard-deviation:

``````\$ compute -H count 1 mean 1 sstdev 1 < FILE.TXT
count(x)   mean(x)   sstdev(x)
100        100.06    9.5767184
``````

Testing for normality (See next section for discussion about normality testing):

``````\$ compute -H sskew 1 skurt 1 dpo 1 jarque 1 < FILE.TXT
sskew(x)     skurt(x)     dpo(x)    jarque(x)
-0.207246    -0.5770543   0.3341    0.3271
``````

### Testing for normality

A normal distribution is required for many applications of. Testing whether your data fits a normal distribution or not is commonly a first step before starting a more thorough analysis.

Several operations can be used to test for normality, thought care must be taken when inferring results. Always consult a trained statistician before making final analysis.

Skewness

Skewness operations (sskew, pskew) test the data set for asymmetry of the probability distribution of a real-valued random variable about its mean. A normally distributed set should have low skewness.

The rule-of-thumb ranges were suggested in Mulmer, M.G., Principles of Statistics (1979):

``````                     x > 0       -  positively skewed / skewed right
0 > x           -  negatively skewed / skewed left

x > 1       -  highly skewed right
1 > x >  0.5    -  moderately skewed right
0.5 > x > -0.5    -  approximately symmetric
-0.5 > x > -1      -  moderately skewed left
-1 > x           -  highly skewed left
``````

Jarque-Bera Test

Jarque-Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution.

`compute`'s jarque operator returns the p-value calculated for the data set based on the Jarque-Bera test, under the null-hepythesis of normality. A high p-value indicates the null hypothesis cannot be rejected, and therefor the input data might be normally distributed. A low p-value indicates the null hypothesis can be rejected, and the data is likely not normally distributed.

D'Agostino-Peason Omnibus Test

D'Agostino-Pearson Omnibus test detects deviations from normality due to either skewness or kurtosis.

Similarly to jarque operator, the dpo operator returns the p-value calculated for the data set based on the Jarque-Bera test, under the null-hepythesis of normality. A high p-value indicates the null hypothesis cannot be rejected, and therefor the input data might be normally distributed. A low p-value indicates the null hypothesis can be rejected, and the data is likely not normally distributed.

Examples - Testing for normality

The files used in the following examples:

• seq20 - 100 normally-distributed random values.
• seq21 - 100 random values drawn from a non-normal distribution.

Testing normally-distribted values:

``````\$ compute -H sskew 1 jarque 1 dpo 1 < seq20.txt
sskew(x)    jarque(x)  dpo(x)
-0.207246   0.327135   0.334111
``````

The skewness result is close enough to zero to be considered approximately symmetric, while both Jarque-Bera test and D'Agostino-Pearson-Omnibus test return high p-values - indicating the null-hypothesis of normal-distribution cannot be rejected.

NOTE: This does not yet prove the data is truly normally distributed - but this is a positive first step.

Testing non-normally-distributed values:

``````\$ compute -H sskew 1 jarque 1 dpo 1 < seq21.txt
sskew(x)   jarque(x)   dpo(x)
1.212020   8.0113e-09  7.6899e-10
``````

The skewness result is large enough to indicate the values are highly skewed. The Jarque-Bera test and D'Agostino-Pearson-Omnibus test results show very low p-values - indicating the null-hypothesis of normal-distribution can be rejected.

Further information

For an informative tutorial about skewness and kurtosis, see Measures of Shape: Skewness and Kurtosis, by Stan Brown