2.71.1.2 Algorithm for Identify Data Distribution

In this app, the maximum likelihood estimation (MLE) method is used to estimate parameters for each distribution except the Gaussian Mixture distribution, which is fitted with the expectation-maximization (EM) algorithm.

In the MLE method, for input data \mathbf{y} and parameter vector \theta:

  1. First calculate the likelihood function \mathcal{L}_{n}( \theta \,;\mathbf {y} ) for the distribution, which is the product of the probability density function values of the input data.
  2. Then maximize the logarithm of the likelihood function \ell ( \theta \,;\mathbf {y} ) by setting its partial derivatives with respect to the parameters equal to zero.
  3. Solve the equations from step 2 for the parameters with the Newton-Raphson method.
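As an illustration of these three steps, here is a minimal Python sketch for the 2-parameter Weibull distribution, where the score equation for the shape \beta is solved by Newton-Raphson and the scale \alpha then follows in closed form. The function name weibull_mle is hypothetical and this is a sketch of the procedure, not the app's implementation.

```python
import numpy as np

def weibull_mle(x, beta0=1.0, tol=1e-10, max_iter=100):
    """Fit a 2-parameter Weibull by MLE: solve the profile score
    equation g(beta) = 0 for the shape with Newton-Raphson, then
    recover the scale alpha in closed form."""
    x = np.asarray(x, dtype=float)
    lnx = np.log(x)
    beta = beta0
    for _ in range(max_iter):
        xb = x ** beta
        s0, s1, s2 = xb.sum(), (xb * lnx).sum(), (xb * lnx ** 2).sum()
        g = s1 / s0 - 1.0 / beta - lnx.mean()       # d(log-likelihood)/d(beta) = 0
        dg = (s2 * s0 - s1 ** 2) / s0 ** 2 + 1.0 / beta ** 2
        step = g / dg
        beta -= step                                # Newton-Raphson update
        if abs(step) < tol:
            break
    alpha = (x ** beta).mean() ** (1.0 / beta)      # scale, closed form given beta
    return alpha, beta
```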

Distributions

This app includes the following distributions. For each one, the parameters, probability density function (PDF), cumulative distribution function (CDF), and inverse cumulative distribution function (inverse CDF) are given below.

Normal
Parameters: location \mu, scale \sigma
PDF: f(x)=\frac{1}{ \sqrt{2 \pi} \sigma } e^{ -\frac{( x - \mu )^2}{2 \sigma^2} }, \; \sigma > 0
CDF: P(x) = \Phi \left( \frac{x - \mu}{\sigma} \right)
Inverse CDF: x = \mu + \sigma \Phi^{-1}(p)

Lognormal
Parameters: location \mu, scale \sigma
PDF: f(x)=\frac{1}{ \sqrt{2 \pi} \sigma x } e^{ -\frac{( \text{ln}(x) - \mu )^2}{2 \sigma^2} }, \; x > 0, \; \sigma > 0
CDF: P(x) = \Phi \left( \frac{\text{ln}(x) - \mu}{\sigma} \right)
Inverse CDF: \text{ln}(x) = \mu + \sigma \Phi^{-1}(p)

3-Parameter Lognormal
Parameters: location \mu, scale \sigma, threshold \lambda
PDF: f(x)=\frac{1}{\sqrt{2 \pi} \sigma (x -\lambda)} e^{ -\frac{( \text{ln}(x -\lambda) - \mu )^2}{2 \sigma^2} }, \; x > \lambda, \; \sigma > 0
CDF: P(x) = \Phi \left( \frac{\text{ln}(x - \lambda) - \mu}{\sigma} \right)
Inverse CDF: \text{ln}(x - \lambda) = \mu + \sigma \Phi^{-1}(p)

Gamma
Parameters: shape \alpha, scale \beta
PDF: f(x) = \frac{ x^{\alpha-1} e^{-\frac{x}{\beta}}}{ \beta^{\alpha} \Gamma(\alpha) }, \; x > 0, \; \alpha > 0, \; \beta > 0
CDF: P(x) = \frac{1}{ \Gamma(\alpha) } \gamma( \alpha, \frac{x}{\beta})
Inverse CDF: x = \beta \gamma^{-1}( \alpha, \Gamma(\alpha) p )

Exponential
Parameters: scale \theta
PDF: f(x) = \frac{ 1 }{ \theta } e^{-\frac{x}{\theta}}, \; x > 0, \; \theta > 0
CDF: P(x) = 1- e^{ -\frac{x}{\theta} }
Inverse CDF: x = \theta ( - \text{ln}(1-p) )

2-Parameter Exponential
Parameters: scale \theta, threshold \lambda
PDF: f(x) = \frac{ 1 }{ \theta } e^{-\frac{x - \lambda}{\theta}}, \; x > \lambda, \; \theta > 0
CDF: P(x) = 1- e^{ -\frac{x - \lambda}{\theta} }
Inverse CDF: x = \lambda + \theta ( - \text{ln}(1-p) )

Smallest Extreme Value
Parameters: location \alpha, scale \beta
PDF: f(x) = \frac{1}{\beta } e^{ \frac{x- \alpha}{\beta}} e^{ - e^{ \frac{x- \alpha}{\beta} } }, \; \beta > 0
CDF: P(x) = 1- e^{-e^{\frac{x-\alpha}{\beta}}}
Inverse CDF: x = \alpha + \beta \text{ln}( - \text{ln}(1-p) )

Weibull
Parameters: scale \alpha, shape \beta
PDF: f(x) = \frac{\beta}{\alpha} \left(\frac{ x }{ \alpha }\right)^{\beta - 1} e^{ -(\frac{ x }{ \alpha })^\beta }, \; x > 0, \; \alpha > 0, \; \beta > 0
CDF: P(x) = 1 - e^{ -(\frac{ x }{ \alpha })^\beta }
Inverse CDF: \text{ln}(x) = \text{ln}(\alpha) + \frac{1}{\beta} \text{ln}( - \text{ln}(1-p) )

3-Parameter Weibull
Parameters: scale \alpha, shape \beta, threshold \lambda
PDF: f(x) = \frac{\beta}{\alpha} \left(\frac{ x - \lambda }{ \alpha }\right)^{\beta - 1} e^{ -\left(\frac{ x - \lambda }{ \alpha }\right)^\beta }, \; x > \lambda, \; \alpha > 0, \; \beta > 0
CDF: P(x) = 1 - e^{ -\left(\frac{ x - \lambda }{ \alpha }\right)^\beta }
Inverse CDF: \text{ln}(x - \lambda) = \text{ln}(\alpha) + \frac{1}{\beta} \text{ln}( - \text{ln}(1-p) )

Largest Extreme Value
Parameters: location \alpha, scale \beta
PDF: f(x) = \frac{1}{\beta } e^{- \frac{x- \alpha}{\beta}} e^{ - e^{ -\frac{x- \alpha}{\beta} } }, \; \beta > 0
CDF: P(x) = e^{-e^{-\frac{x-\alpha}{\beta}}}
Inverse CDF: x = \alpha - \beta \text{ln}( - \text{ln}(p) )

Logistic
Parameters: location \mu, scale \sigma
PDF: f(x) = \frac{e^{\frac{x- \mu}{\sigma}}}{\sigma(1+e^{\frac{x- \mu}{\sigma}})^2 }, \; \sigma > 0
CDF: P(x) = \frac{1}{1+e^{-\frac{x- \mu}{\sigma}}}
Inverse CDF: x = \mu + \sigma \text{ln}(\frac{p}{1-p})

Loglogistic
Parameters: location \mu, scale \sigma
PDF: f(x) = \frac{e^{\frac{\text{ln}(x)- \mu}{\sigma}}}{x\sigma(1+e^{\frac{\text{ln}(x)- \mu}{\sigma}})^2 }, \; x > 0, \; \sigma > 0
CDF: P(x) = \frac{1}{1+e^{-\frac{\text{ln}(x)- \mu}{\sigma}}}
Inverse CDF: x = e^{\mu + \sigma \text{ln}(\frac{p}{1-p})}

Folded Normal
Parameters: location \mu, scale \sigma
PDF: f(x) = \frac{1}{ \sqrt{ 2 \pi } \sigma} e^{ - \frac{(x - \mu)^2}{ 2 \sigma^2 } } + \frac{1}{ \sqrt{ 2 \pi } \sigma} e^{ - \frac{(x + \mu)^2}{ 2 \sigma^2 } }, \; x \geqslant 0, \; \sigma > 0
CDF: P(x) = \Phi \left( \frac{x + \mu}{\sigma} \right) - \Phi \left( \frac{\mu - x}{\sigma} \right)
Inverse CDF: x = \text{foldnorminv}(p, \, \mu, \, \sigma)

Rayleigh
Parameters: scale \sigma
PDF: f(x) = \frac{x}{\sigma^2}e^{-\frac{x^2}{2\sigma^2}}, \; x \geqslant 0, \; \sigma > 0
CDF: P(x) = 1-e^{-\frac{x^2}{2\sigma^2}}
Inverse CDF: x=\sqrt{2}\sigma \sqrt{-\text{ln}(1-p)}

Gaussian Mixture
Parameters: number of components g; for the ith component, weight w_i, location \mu_i, scale \sigma_i
PDF: f(x) = \sum_{i=1}^g \frac{w_i}{ \sqrt{2 \pi} \sigma_i } e^{ -\frac{( x - \mu_i )^2}{2 \sigma_i^2} }, \; 0 < w_i < 1, \; \sigma_i > 0
CDF: P(x) = \sum_{i=1}^g w_i \Phi \left( \frac{x - \mu_i}{\sigma_i} \right)
Inverse CDF: no explicit solution; it can be calculated by interpolation.

where Φ is the CDF of the standard normal distribution, and γ is the lower incomplete gamma function.
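For the Gaussian Mixture inverse CDF, where no closed form exists, interpolation can be used as noted above. A minimal sketch, assuming SciPy is available; gmm_ppf is a hypothetical helper name, not part of the app:

```python
import numpy as np
from scipy.stats import norm

def gmm_ppf(p, w, mu, sigma, grid_size=2001):
    """Approximate the Gaussian-mixture inverse CDF by evaluating
    the mixture CDF on a grid and interpolating."""
    w, mu, sigma = (np.asarray(a, dtype=float) for a in (w, mu, sigma))
    # grid wide enough to cover essentially all mixture mass
    x = np.linspace((mu - 8 * sigma).min(), (mu + 8 * sigma).max(), grid_size)
    cdf = sum(wi * norm.cdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))
    return np.interp(p, cdf, x)  # the CDF is monotone, so this inverts it
```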

Note:

  • The Box-Cox Transformation, Johnson Transformation, and Yeo-Johnson Transformation distributions transform the data first and then fit the result with the Normal distribution. For more, see Algorithm for Data Transformation.
  • The scale parameter in the Normal and Lognormal distributions is corrected to the unbiased standard deviation, i.e. the divisor is n-1 instead of n, where n is the number of points.
  • The threshold parameter λ in the 2-Parameter Exponential distribution is estimated as below:

\lambda = X_{(1)} - \frac{\bar{X} - X_{(1)}}{n-1} ,

where X_{(1)} is the minimum of X, and \bar{X} is the mean of X.
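This estimator translates directly into code; a minimal sketch, where exp2_threshold is a hypothetical name:

```python
import numpy as np

def exp2_threshold(x):
    """Threshold estimate for the 2-parameter exponential:
    lambda = X_(1) - (mean(X) - X_(1)) / (n - 1)."""
    x = np.asarray(x, dtype=float)
    x1 = x.min()
    return x1 - (x.mean() - x1) / (x.size - 1)
```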

Goodness of Fit

Anderson-Darling Test

The hypotheses for the Anderson-Darling test are:

H0: The input data follows the distribution.
H1: The input data does not follow the distribution.
  • Anderson-Darling Statistic
The input data is sorted first, and the fitted CDF is used to calculate the probability Z_i for each point. The Anderson-Darling statistic A^2 is calculated as below:

 A^2 = -n - \frac{1}{n} \sum_{i=1}^n \left[ (2i-1) \text{ln} (Z_i) + ( 2n+1-2i ) \text{ln} (1 - Z_i) \right]

The A^2 value can be used to assess the fit: the smaller the value, the better the model.
  • P-value for Anderson-Darling Test
The P-value is calculated by interpolating the statistic A^2 in the tables in references [1] and [2]. If the P-value is less than the significance level, e.g. 0.05, H0 is rejected.
For some distributions, e.g. the 3-Parameter Lognormal distribution, no reference table is available, so their P-values are missing.
For the Folded Normal, Gaussian Mixture, and 3-Parameter Weibull (with fixed shape) distributions, the P-value is calculated by the Monte Carlo method: 1000 samples are drawn from the null-hypothesized distribution with the estimated parameters, and the A^2 statistic is computed for each sample. The P-value is the proportion of samples whose statistic is greater than or equal to the input data's [4] (a sketch of both computations follows this list), i.e.

\text{P-value} = \frac{b + 1}{N + 1} ,

where b is the number of sample statistics that are greater than or equal to the input data's, and N is the number of samples, here N=1000.
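Below is a minimal Python sketch of both computations. The callables fit, cdf_factory, and sampler are hypothetical placeholders for a distribution's parameter estimator, fitted-CDF builder, and random sampler; whether parameters are refit for each synthetic sample is a common bootstrap choice assumed here, not a detail stated by the app.

```python
import numpy as np

def anderson_darling(x, cdf):
    """A^2 statistic of data x against a candidate CDF callable."""
    z = cdf(np.sort(np.asarray(x, dtype=float)))
    n = z.size
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * np.log(z) + (2 * n + 1 - 2 * i) * np.log(1 - z))

def mc_pvalue(x, fit, cdf_factory, sampler, N=1000, seed=None):
    """Monte Carlo p-value: draw N samples from the fitted null
    distribution and count A^2 values >= the observed one."""
    rng = np.random.default_rng(seed)
    theta = fit(x)
    a2_obs = anderson_darling(x, cdf_factory(theta))
    b = 0
    for _ in range(N):
        y = sampler(theta, len(x), rng)
        b += anderson_darling(y, cdf_factory(fit(y))) >= a2_obs  # refit each sample
    return (b + 1) / (N + 1)
```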

BIC

For the Gaussian Mixture distribution, the BIC value is calculated as follows:

\text{BIC} = -2 \, \ell + k \text{ln}(n) ,

where \ell is the logarithm of the likelihood from the estimation, k is the number of parameters, and n is the number of points.

The smaller the BIC value is, the better the model.
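As a sketch of how the EM log-likelihood feeds into BIC, here is a minimal one-dimensional Gaussian-mixture EM fit in Python. The parameter count k = 3g - 1 (since the weights sum to 1) is a common convention assumed here, and gmm_em_bic is a hypothetical name; this is not the app's implementation.

```python
import numpy as np

def gmm_em_bic(x, g, n_iter=200, tol=1e-8, seed=None):
    """Fit a g-component 1-D Gaussian mixture with EM and return
    the parameters plus BIC = -2*loglik + k*ln(n)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    w = np.full(g, 1.0 / g)                 # initial equal weights
    mu = rng.choice(x, g, replace=False)    # initial means from the data
    sigma = np.full(g, x.std())
    loglik = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (np.sqrt(2 * np.pi) * sigma)
        total = dens.sum(axis=1)
        r = dens / total[:, None]
        new_loglik = np.log(total).sum()
        # M-step: update weights, means, and standard deviations
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        if new_loglik - loglik < tol:
            loglik = new_loglik
            break
        loglik = new_loglik
    k = 3 * g - 1                           # g-1 free weights + g means + g sigmas
    return w, mu, sigma, -2 * loglik + k * np.log(n)
```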

Likelihood-ratio Test (LRT)

The likelihood-ratio test (LRT) is used to compare two nested models: a simple model A and a complex model B, where model A is a subset of model B, e.g. model A is Weibull and model B is 3-Parameter Weibull. The test determines whether the complex model B is significantly better than the simple model A. In this app, the likelihood-ratio test supports comparing Lognormal vs. 3-Parameter Lognormal, Exponential vs. 2-Parameter Exponential, and Weibull vs. 3-Parameter Weibull.

The hypotheses for the likelihood-ratio test are:

H0: The complex model B is the same as the simple model A.
H1: The complex model B is significantly better than the simple model A.

If the P-value for the likelihood-ratio test is less than the significance level, e.g. 0.05, H0 is rejected, indicating that the complex model B is significantly better than the simple model A.

  • Likelihood-ratio test statistic

\text{LRT} = -2 \, ( \ell(A) - \ell(B) ) ,

where \ell(A) and \ell(B) are the log-likelihoods of model A and model B respectively.
  • Likelihood-ratio test P-value

The likelihood-ratio test statistic follows a Chi-squared distribution with df degrees of freedom, where df is the difference between the numbers of parameters in model B and model A; for the three comparisons supported here, df = 1. Thus the likelihood-ratio test P-value can be calculated as follows:

\text{P-val} = \text{chi2cdf}( \text{LRT}, 1, 1 ) ,

where chi2cdf is the CDF function for the \chi^2 distribution, and the third argument 1 selects the upper tail.
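This is a direct translation into Python, assuming SciPy (chi2.sf gives the upper-tail probability); lrt_pvalue is a hypothetical name:

```python
from scipy.stats import chi2

def lrt_pvalue(loglik_A, loglik_B, df=1):
    """LRT of nested models: A simple, B complex (df = 1 for the
    three comparisons supported in this app)."""
    stat = -2.0 * (loglik_A - loglik_B)
    return stat, chi2.sf(stat, df)  # upper-tail chi-square probability
```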

Estimate Parameter's Standard Error

Once the parameters are estimated by the MLE method, the Fisher information matrix (FIM) can be calculated from the Hessian matrix:

 [\mathcal{I}(\theta)]_{i,j} = - E\left[ \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \text{ln}\left( f(Y; \theta) \right) \, \Big| \, \theta \right]

The covariance matrix of the parameters can be expressed as:

C = (n \mathcal{I} )^{-1}

where n is the number of points.

A parameter's standard error is the square root of the corresponding diagonal element of the covariance matrix C.
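In practice the expectation can be approximated by the observed information, i.e. the Hessian of the full-sample negative log-likelihood at the MLE, which already contains the factor n, so that C \approx H^{-1}. A minimal numerical sketch of this approximation (mle_std_errors is a hypothetical helper, not the app's code):

```python
import numpy as np

def mle_std_errors(negloglik, theta_hat, h=1e-5):
    """Standard errors from the observed information: numerically
    differentiate the negative log-likelihood at the MLE, invert
    the Hessian, and take square roots of the diagonal."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    k = theta_hat.size
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            e_i = np.zeros(k); e_i[i] = h
            e_j = np.zeros(k); e_j[j] = h
            # central finite difference for the mixed second derivative
            H[i, j] = (negloglik(theta_hat + e_i + e_j)
                       - negloglik(theta_hat + e_i - e_j)
                       - negloglik(theta_hat - e_i + e_j)
                       + negloglik(theta_hat - e_i - e_j)) / (4 * h * h)
    C = np.linalg.inv(H)  # observed information approximates n * I(theta)
    return np.sqrt(np.diag(C))
```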

Probability Plot

Points in the probability plot are sorted in ascending order; for the ith point, its probability is calculated by the median rank (Benard's method):

P = \frac{i - 0.3}{n + 0.4}


The middle line shows the expected percentiles for the given probabilities, computed from the inverse cumulative distribution function with the parameters from the MLE method.
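For example, for a fitted Normal distribution the plot coordinates could be computed as follows. A sketch assuming SciPy; median_ranks is a hypothetical helper, and the example data are synthetic:

```python
import numpy as np
from scipy.stats import norm

def median_ranks(n):
    """Benard's median-rank probabilities for n sorted points."""
    i = np.arange(1, n + 1)
    return (i - 0.3) / (n + 0.4)

# Probability-plot coordinates: sorted data on one axis, expected
# percentiles (the middle line) on the other.
x = np.sort(np.random.default_rng(0).normal(10.0, 2.0, 50))  # example data
mu_hat, sigma_hat = x.mean(), x.std(ddof=1)   # MLE with the n-1 correction
p = median_ranks(x.size)
expected = norm.ppf(p, loc=mu_hat, scale=sigma_hat)  # the middle line
```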

To estimate confidence limits of percentiles, the variance of a percentile is calculated by propagation of error:

\sigma^2_{x_p} = \left( \frac{\partial x_p}{\partial \theta} \right)^T C \left( \frac{\partial x_p}{\partial \theta} \right)

where x_p is the percentile for the given probability p, and (\cdot)^T denotes the transpose of a vector.

  • Confidence limits of percentiles can be calculated as below (see the sketch after this list):

x_{pL} = x_p - Z_{\alpha} \sigma_{x_p}

x_{pU} = x_p + Z_{\alpha} \sigma_{x_p}

where Z_{\alpha} = \Phi^{-1}(1- \alpha / 2) .

  • For some positive random variables, e.g. in the Lognormal, Gamma, Exponential, Weibull, Loglogistic, Folded Normal, and Rayleigh distributions, the confidence limits are:

x_{pL} = e^{ \text{ln}(x_p) - Z_{\alpha} \frac{\sigma_{x_p}}{x_p} }

x_{pU} = e^{ \text{ln}(x_p) + Z_{\alpha} \frac{\sigma_{x_p}}{x_p} }

  • For random variables with a threshold parameter λ, e.g. in the 3-Parameter Lognormal, 2-Parameter Exponential, and 3-Parameter Weibull distributions, the confidence limits are:
If \lambda < 0 ,

x_{pL} = x_p - Z_{\alpha} \sigma_{x_p}

x_{pU} = x_p + Z_{\alpha} \sigma_{x_p}

If \lambda \ge 0 ,

x_{pL} = e^{ \text{ln}(x_p) - Z_{\alpha} \frac{\sigma_{x_p}}{x_p} }

x_{pU} = e^{ \text{ln}(x_p) + Z_{\alpha} \frac{\sigma_{x_p}}{x_p} }

If x_{pL} < \lambda , the x_{pL} value is corrected to \lambda .
  • A Q-Q plot is shown for the Gaussian Mixture distribution. Its expected percentiles are calculated by interpolation.
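The three confidence-limit cases above can be folded into one helper. A minimal sketch assuming SciPy, where percentile_ci is a hypothetical name, grad_xp is the vector of partial derivatives \partial x_p / \partial \theta, and C is the parameter covariance matrix from the previous section:

```python
import numpy as np
from scipy.stats import norm

def percentile_ci(xp, grad_xp, C, alpha=0.05, log_scale=False):
    """Confidence limits of a percentile xp by error propagation:
    var(xp) = grad^T C grad. Use log_scale=True for positive
    variables (and for threshold cases with lambda >= 0)."""
    grad_xp = np.asarray(grad_xp, dtype=float)
    se = float(np.sqrt(grad_xp @ C @ grad_xp))
    z = norm.ppf(1 - alpha / 2)          # Z_alpha = Phi^{-1}(1 - alpha/2)
    if log_scale:
        half = z * se / xp               # limits built on the log scale
        return xp * np.exp(-half), xp * np.exp(half)
    return xp - z * se, xp + z * se
```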

References

  1. R. A. Lockhart and M. A. Stephens (1994). "Estimation and Tests of Fit for the Three-parameter Weibull Distribution". Journal of the Royal Statistical Society, Vol. 56, No. 3, pp. 491-500.
  2. Ralph B. D'Agostino and Michael A. Stephens (Eds.) (1986). Goodness-of-Fit Techniques. New York: Marcel Dekker.
  3. J. A. Hartigan (1975). Clustering Algorithms. New York: Wiley.
  4. W. Stute, W. G. Manteiga, and M. P. Quindimil (1993). "Bootstrap based goodness-of-fit-tests". Metrika, Vol. 40, No. 1, pp. 243-256.