17.4.1.2 Algorithms (One-Way ANOVA)


Theory of One-Way ANOVA

Assume we have response data measured at k levels of a factor, where y_{ij}\,\! is the ith observation (i = 1, 2, ..., n_j) at the jth factor level (j = 1, 2, ..., k). The one-way ANOVA model can then be written as:

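y_{ij}=\mu +\tau _j+\varepsilon _{ij},\quad j = 1, 2, ..., k;\ i = 1, 2, ..., n_j

where \mu\,\! is the overall mean, \tau _j is the effect of the jth level, and \varepsilon _{ij} is the random error.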

ANOVA tests whether the means of two or more populations (levels) are equal. The null hypothesis is that the means of the different populations are the same, and the alternative hypothesis is that at least one population mean differs from the others. Mathematically, this is expressed as:

H_0:\mu =\mu _1=\mu _2=\cdots =\mu _k

H_1:\mu _p\neq \mu _q for some p and q, 1 \leq p, q \leq k.

where \mu _j\,\! is the mean of the jth population (level). To test the hypothesis, the total sample variation is divided into variation between groups and variation within groups, and an F-test is then used to determine whether the between-group variation is significantly larger than the within-group variation.

Algebraically, the total sum of squares is partitioned into two parts, whose respective mean squares are then used to estimate the variation:

\sum_{j=1}^k\sum_{i=1}^{n_j}(y_{ij}-\bar y)^2=\sum_{j=1}^kn_j(\bar y_j-\bar y)^2+\sum_{j=1}^k\sum_{i=1}^{n_j}(y_{ij}-\bar y_j)^2

where the term on the left is called the "total sum of squares", the first term on the right is called the "sum of squares of treatments" and represents the variation between groups, and the second term on the right is called the "sum of squares of error" and represents the variation within groups. Here \bar y_j is the mean of the jth level and \bar y is the overall mean. The equation is commonly abbreviated to

SS_{Total}=SS_{Treatment}+SS_{Error}\,\!
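
As a minimal sketch of the decomposition above, the following Python function computes the three sums of squares directly from their definitions. The function name `sum_of_squares` and the sample data are illustrative assumptions, not part of Origin.

```python
def sum_of_squares(groups):
    """Return (SS_Total, SS_Treatment, SS_Error) for a list of groups.

    `groups` is a list of lists; groups[j][i] plays the role of y_ij.
    """
    all_values = [y for g in groups for y in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Total variation of every observation around the overall mean.
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)
    # Between-group variation, weighted by the group sizes n_j.
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                       for g, m in zip(groups, group_means))
    # Within-group variation around each group mean.
    ss_error = sum((y - m) ** 2
                   for g, m in zip(groups, group_means) for y in g)
    return ss_total, ss_treatment, ss_error


# Hypothetical data: three levels with unequal sample sizes.
data = [[2.0, 4.0], [5.0, 6.0, 7.0], [8.0, 9.0, 10.0, 12.0]]
ss_total, ss_treatment, ss_error = sum_of_squares(data)
# The two printed values agree up to floating-point rounding.
print(ss_total, ss_treatment + ss_error)
```

The group sizes are deliberately unequal to show that the identity does not depend on a balanced design.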

When H_0\,\! is true and the observations at the k levels are normally and independently distributed with common variance \sigma ^2\,\!, all levels share the same mean \mu\,\!. Thus the statistic

F=\frac{MS_{Treatment}}{MS_{Error}}=\frac{SS_{Treatment}/(k-1)}{SS_{Error}/(n-k)}

follows an F distribution F_{(k-1, n-k)}\,\!, where n=\sum_{j=1}^k n_j is the total number of observations, MS_{Treatment} is the mean square for treatments, and MS_{Error} is the mean square for error; each is formed by dividing the corresponding sum of squares by its degrees of freedom. Given a significance level \alpha\,\!, the null hypothesis should be rejected if the F statistic exceeds the critical value F_{(k-1,n-k,\alpha)}\,\!, the tabulated value of the F distribution with k-1 and n-k degrees of freedom at level \alpha\,\!, or equivalently, if the associated P value is less than the significance level.

The results of the analysis of variance are commonly presented in an ANOVA table:

| Source of Variation | Degrees of Freedom (DF) | Sum of Squares (SS) | Mean Square (MS) | F Value | Prob > F |
|---|---|---|---|---|---|
| Model (Factor) | k-1 | SS_{Treatment} | MS_{Treatment} | MS_{Treatment} / MS_{Error} | P\{F_{(k-1,n-k)}\geq F\} |
| Error | n-k | SS_{Error} | MS_{Error} | | |
| Total | n-1 | SS_{Total} | | | |
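
The entries of the ANOVA table can be sketched in Python as follows. This is an illustration under stated assumptions rather than Origin's implementation: the names `one_way_anova`, `betai` and `_betacf` are hypothetical, and the P value is obtained from the regularized incomplete beta function (evaluated here by a standard continued fraction) instead of a library routine.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c, d = 1.0, 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        for num in (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                    -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h


def betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def one_way_anova(groups):
    """Return the ANOVA table entries for a list of groups (assumes MS_Error > 0)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_treatment = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_error = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_model, df_error = k - 1, n - k
    ms_treatment, ms_error = ss_treatment / df_model, ss_error / df_error
    f_value = ms_treatment / ms_error
    # Prob > F is the upper tail of F(k-1, n-k), written as an incomplete beta.
    p_value = betai(df_error / 2, df_model / 2,
                    df_error / (df_error + df_model * f_value))
    return {"DF": (df_model, df_error, n - 1),
            "SS": (ss_treatment, ss_error, ss_treatment + ss_error),
            "MS": (ms_treatment, ms_error),
            "F": f_value, "Prob>F": p_value}


# Hypothetical balanced data: F = 27 and Prob > F is approximately 0.001.
table = one_way_anova([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(table)
```

The null hypothesis would be rejected at significance level \alpha when `table["Prob>F"]` is less than \alpha.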

Homogeneity of Variance

In the analysis of variance, the different samples are assumed to have equal variances, a property commonly called homogeneity of variance. The Levene test and the Brown-Forsythe test can be used to verify this assumption. Suppose we have k samples of response data, where y_{ij}\,\! is the ith observation (i = 1, 2, ..., n_j) at the jth factor level (j = 1, 2, ..., k). The hypotheses of both tests can be expressed as:

H_0: \sigma^2 _1=\sigma^2 _2=\cdots =\sigma^2 _k

H_1: \sigma^2 _p\neq \sigma^2 _q , for at least one pair (p, q), 1\leq p,q\leq k

Z_{ij}\,\! is defined in one of three ways, depending on the test:

  1. Absolute Levene test: Z_{ij}=|y_{ij}-\bar y_j|
  2. Squared Levene test: Z_{ij}=(y_{ij}-\bar y_j)^2
  3. Brown-Forsythe test: Z_{ij}=|y_{ij}-m_j|\,\!

where \bar y_j is the mean and m_j is the median of the jth level.

When H_0 holds, the test statistic

F=\frac{\sum_{j=1}^kn_j(\bar Z_j-\bar Z)^2/(k-1)}{\sum_{j=1}^k\sum_{i=1}^{n_j}(Z_{ij}-\bar Z_j)^2/(n-k)}

will (approximately) follow an F distribution F_{(k-1,n-k)}\,\!, where \overline{Z_j} and \overline{Z} are the group mean and the overall mean of the Z_{ij}\,\!, respectively.
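
The test statistic above can be sketched in Python for all three definitions of Z_{ij}. The function name `homogeneity_f` and its `test` argument are illustrative assumptions; the sketch returns only the F statistic, which would then be compared with the F distribution F_{(k-1,n-k)} to obtain a P value.

```python
from statistics import fmean, median


def homogeneity_f(groups, test="absolute"):
    """F statistic of the Levene or Brown-Forsythe test.

    test: "absolute"        Z_ij = |y_ij - mean_j|
          "squared"         Z_ij = (y_ij - mean_j)^2
          "brown-forsythe"  Z_ij = |y_ij - median_j|
    """
    if test not in ("absolute", "squared", "brown-forsythe"):
        raise ValueError("unknown test: " + test)

    # Center each group at its mean (Levene) or median (Brown-Forsythe).
    centers = [median(g) if test == "brown-forsythe" else fmean(g)
               for g in groups]
    if test == "squared":
        z = [[(y - c) ** 2 for y in g] for g, c in zip(groups, centers)]
    else:
        z = [[abs(y - c) for y in g] for g, c in zip(groups, centers)]

    # One-way ANOVA F statistic computed on the transformed values Z_ij.
    k = len(z)
    n = sum(len(g) for g in z)
    z_bar = fmean([v for g in z for v in g])
    z_means = [fmean(g) for g in z]
    between = sum(len(g) * (m - z_bar) ** 2
                  for g, m in zip(z, z_means)) / (k - 1)
    within = sum((v - m) ** 2
                 for g, m in zip(z, z_means) for v in g) / (n - k)
    return between / within


# Hypothetical data: the second group is more spread out than the first.
sample = [[1, 2, 3], [2, 5, 8]]
for name in ("absolute", "squared", "brown-forsythe"):
    print(name, homogeneity_f(sample, name))
```

In this example the group means and medians coincide, so the absolute Levene and Brown-Forsythe statistics are equal; with skewed data they generally differ.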

Multiple Means Comparisons

Given that an ANOVA experiment has determined that at least one of the population means is significantly different, multiple means comparison then compares all possible pairs of factor level means to determine which mean or means differ significantly. Origin provides several methods for means comparison and uses the NAG function nag_anova_confid_interval (g04dbc) to perform them.

Two types of multiple means comparison methods are included in Origin:

  1. Single-step methods. These create simultaneous confidence intervals to show how the means differ. They include Tukey-Kramer, Bonferroni, Dunn-Sidak, Fisher's LSD, and Scheffe.
  2. Stepwise methods. These perform the hypothesis tests sequentially. They include the Holm-Bonferroni and Holm-Sidak tests.
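
To illustrate what "sequentially" means for the stepwise methods, here is a minimal Python sketch of the standard Holm step-down procedure with either Bonferroni or Sidak thresholds. It assumes the raw P values of the pairwise comparisons have already been computed; the function name `holm_stepdown` and the example P values are hypothetical, and this is not the NAG routine that Origin calls.

```python
def holm_stepdown(p_values, alpha=0.05, sidak=False):
    """Return a list of booleans: True where the hypothesis is rejected.

    The P values are tested from smallest to largest. At each step the
    threshold depends on how many hypotheses remain, and testing stops
    at the first P value that exceeds its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        remaining = m - rank
        if sidak:
            threshold = 1.0 - (1.0 - alpha) ** (1.0 / remaining)
        else:
            threshold = alpha / remaining
        if p_values[idx] > threshold:
            break  # this and all larger P values are not rejected
        reject[idx] = True
    return reject


# Hypothetical raw P values for four pairwise comparisons.
raw_p = [0.01, 0.04, 0.03, 0.005]
print(holm_stepdown(raw_p))              # Holm-Bonferroni
print(holm_stepdown(raw_p, sidak=True))  # Holm-Sidak
```

The Sidak thresholds are slightly larger than the Bonferroni ones, so Holm-Sidak rejects at least as many hypotheses as Holm-Bonferroni at the same \alpha.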

Power Analysis

The power analysis procedure calculates the actual power for the sample data, as well as the hypothetical power if additional sample sizes are specified.

The power of a one-way analysis of variance is a measure of its sensitivity. Power is the probability that the one-way ANOVA will detect differences among the level means when real differences exist. In terms of the null and alternative hypotheses, power is the probability that the test statistic F is extreme enough to reject the null hypothesis when it should in fact be rejected (i.e., when the null hypothesis is not true).

Power is defined by the equation:

power=1-probf(f,dfa,dfe,nc)\,\!

where f is the deviate from the non-central F-distribution with dfa (model) and dfe (error) degrees of freedom, and nc = SST/MSE, where SST is the sum of squares of the model and MSE is the mean square of the errors. The value of probf( ) is obtained using the NAG function nag_prob_non_central_f_dist (g01gdc). Please see the NAG documentation for more detailed information.
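
The power formula can be sketched in pure Python as below. Several points are assumptions of this sketch rather than statements about Origin or NAG: f is taken to be the critical value of the central F distribution at significance level \alpha, probf( ) is taken to be the cumulative distribution function of the non-central F distribution (evaluated here as a Poisson-weighted series of incomplete beta functions), and the helper names `betai`, `f_cdf`, `f_critical` and `anova_power` are hypothetical.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c, d = 1.0, 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        for num in (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                    -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h


def betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def f_cdf(f, dfa, dfe):
    """CDF of the central F distribution."""
    return betai(dfa / 2, dfe / 2, dfa * f / (dfa * f + dfe)) if f > 0 else 0.0


def probf(f, dfa, dfe, nc, max_terms=1000):
    """CDF of the non-central F distribution as a Poisson-weighted series."""
    if f <= 0:
        return 0.0
    x = dfa * f / (dfa * f + dfe)
    weight, total = math.exp(-nc / 2), 0.0  # underflows for very large nc
    for j in range(max_terms):
        total += weight * betai(dfa / 2 + j, dfe / 2, x)
        weight *= (nc / 2) / (j + 1)
        if j > nc / 2 and weight < 1e-16:
            break
    return total


def f_critical(alpha, dfa, dfe):
    """Upper-alpha critical value of the central F distribution (bisection)."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, dfa, dfe) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, dfa, dfe) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def anova_power(alpha, dfa, dfe, nc):
    """power = 1 - probf(f, dfa, dfe, nc), with f the central-F critical value."""
    return 1.0 - probf(f_critical(alpha, dfa, dfe), dfa, dfe, nc)


# Hypothetical inputs: SST = 54 and MSE = 1 give nc = 54, with dfa = 2, dfe = 6.
print(anova_power(0.05, 2, 6, 54.0 / 1.0))
```

A useful sanity check on the sketch is that the power reduces to \alpha when nc = 0, since the non-central distribution then coincides with the central one.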

The above is a brief outline of the algorithm for one-way analysis of variance. For details of the mathematical derivation, please refer to the corresponding part of the user's manual and the NAG documentation.