Title: Package for Environmental Statistics, Including US EPA Guidance
Description: Graphical and statistical analyses of environmental data, with focus on analyzing chemical concentrations and physical parameters, usually in the context of mandated environmental monitoring. Major environmental statistical methods found in the literature and regulatory guidance documents, with extensive help that explains what these methods do, how to use them, and where to find them in the literature. Numerous built-in data sets from regulatory guidance documents and environmental statistics literature. Includes scripts reproducing analyses presented in the book "EnvStats: An R Package for Environmental Statistics" (Millard, 2013, Springer, ISBN 978-1-4614-8455-4, <doi:10.1007/978-1-4614-8456-1>).
Authors: Steven P. Millard [aut], Alexander Kowarik [ctb, cre]
Maintainer: Alexander Kowarik <[email protected]>
License: GPL (>= 3)
Version: 3.0.0
Built: 2024-10-28 05:36:47 UTC
Source: https://github.com/alexkowa/envstats
Trichloroethylene (TCE) concentrations (mg/L) at 10 groundwater monitoring wells before and after remediation.
data(ACE.13.TCE.df)
A data frame with 20 observations on the following 3 variables.

TCE.mg.per.L: TCE concentrations (mg/L)
Well: a factor indicating the well number
Period: a factor indicating the period (before vs. after remediation)
USACE. (2013). Environmental Quality - Environmental Statistics. Engineer Manual EM 200-1-16, 31 May 2013. Department of the Army, U.S. Army Corps of Engineers, Washington, D.C. 20314-1000, p. M-10. https://www.publications.usace.army.mil/Portals/76/Publications/EngineerManuals/EM_200-1-16.pdf.
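A minimal usage sketch (not part of the original help file): it assumes the rows are ordered consistently by well within each period so that a paired comparison lines up, and a nonparametric paired test is used purely for illustration.

data(ACE.13.TCE.df)
# Split the TCE concentrations by remediation period and compare with a
# paired Wilcoxon signed rank test (one value per well in each period).
tce.by.period <- split(ACE.13.TCE.df$TCE.mg.per.L, ACE.13.TCE.df$Period)
wilcox.test(tce.by.period[[1]], tce.by.period[[2]], paired = TRUE)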
Compute a lack-of-fit and pure error anova table for either a linear model with one predictor variable or else a linear model for which all predictor variables in the model are functions of a single variable (for example, x, x^2, etc.). There must be replicate observations for at least one value of the predictor variable(s).
anovaPE(object)
object: an object of class "lm" fitted with a single predictor variable, or with predictor variables that are all functions of a single variable (e.g., x, x^2). There must be replicate observations for at least one value of the predictor variable(s), and the total number of observations must be such that the degrees of freedom associated with the residual sums of squares is greater than the number of observations minus the number of unique observations.
Produces an anova table with the sums of squares partitioned into “Lack of Fit” and “Pure Error” components. See Draper and Smith (1998, pp. 47-53) for details. This function is called by the function calibrate.
An object of class "anova" inheriting from class "data.frame".
Steven P. Millard ([email protected])
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
# The data frame EPA.97.cadmium.111.df contains calibration data for # cadmium at mass 111 (ng/L) that appeared in Gibbons et al. (1997b) # and were provided to them by the U.S. EPA. # # First, display a plot of these data along with the fitted calibration line # and 99% non-simultaneous prediction limits. See # Millard and Neerchal (2001, pp.566-569) for more details on this # example. EPA.97.cadmium.111.df # Cadmium Spike #1 0.88 0 #2 1.57 0 #3 0.70 0 #... #33 99.20 100 #34 93.71 100 #35 100.43 100 Cadmium <- EPA.97.cadmium.111.df$Cadmium Spike <- EPA.97.cadmium.111.df$Spike calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) newdata <- data.frame(Spike = seq(min(Spike), max(Spike), length.out = 100)) pred.list <- predict(calibrate.list, newdata = newdata, se.fit = TRUE) pointwise.list <- pointwise(pred.list, coverage = 0.99, individual = TRUE) plot(Spike, Cadmium, ylim = c(min(pointwise.list$lower), max(pointwise.list$upper)), xlab = "True Concentration (ng/L)", ylab = "Observed Concentration (ng/L)") abline(calibrate.list, lwd = 2) lines(newdata$Spike, pointwise.list$lower, lty = 8, lwd = 2) lines(newdata$Spike, pointwise.list$upper, lty = 8, lwd = 2) title(paste("Calibration Line and 99% Prediction Limits", "for US EPA Cadmium 111 Data", sep="\n")) rm(Cadmium, Spike, newdata, calibrate.list, pred.list, pointwise.list) #---------- # Now fit the linear model and produce the anova table to check for # lack of fit. There is no evidence for lack of fit (p = 0.41). fit <- lm(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) anova(fit) #Analysis of Variance Table # #Response: Cadmium # Df Sum Sq Mean Sq F value Pr(>F) #Spike 1 43220 43220 9356.9 < 2.2e-16 *** #Residuals 33 152 5 #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #Analysis of Variance Table # #Response: Cadmium # #Terms added sequentially (first to last) # Df Sum of Sq Mean Sq F Value Pr(F) # Spike 1 43220.27 43220.27 9356.879 0 #Residuals 33 152.43 4.62 anovaPE(fit) # Df Sum Sq Mean Sq F value Pr(>F) #Spike 1 43220 43220 9341.559 <2e-16 *** #Lack of Fit 3 14 5 0.982 0.4144 #Pure Error 30 139 5 #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 rm(fit)
Compute the sample sizes necessary to achieve a specified power for a one-way fixed-effects analysis of variance test, given the population means, population standard deviation, and significance level.
aovN(mu.vec, sigma = 1, alpha = 0.05, power = 0.95, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
mu.vec: required numeric vector of population means. The length of mu.vec determines the number of groups.
sigma: optional numeric scalar specifying the population standard deviation for each group. The default value is sigma=1.
alpha: optional numeric scalar between 0 and 1 indicating the Type I error level associated with the hypothesis test. The default value is alpha=0.05.
power: optional numeric scalar between 0 and 1 indicating the power associated with the hypothesis test. The default value is power=0.95.
round.up: optional logical scalar indicating whether to round the computed sample size up to the next integer. The default value is round.up=TRUE.
n.max: positive integer greater than 1 indicating the maximum sample size per group. The default value is n.max=5000.
tol: optional numeric scalar indicating the tolerance to use in the search algorithm. The default value is tol=1e-07.
maxiter: optional positive integer indicating the maximum number of iterations to use in the search algorithm. The default value is maxiter=1000.
The F-statistic to test the equality of k population means, assuming each population has a normal distribution with the same standard deviation sigma, is presented in most basic statistics texts, including Zar (2010, Chapter 10), Berthouex and Brown (2002, Chapter 24), and Helsel and Hirsch (1992, pp. 164-169). The formula for the power of this test is given in Scheffe (1959, pp. 38-39, 62-65). The power of the one-way fixed-effects ANOVA depends on the sample sizes for each of the k groups, the values of the population means for each of the k groups, the population standard deviation sigma, and the significance level alpha. See the help file for aovPower.

The function aovN assumes equal sample sizes for each of the k groups and uses a search algorithm to determine the sample size n required to attain a specified power, given the values of the population means, the population standard deviation, and the significance level.

A numeric scalar indicating the required sample size for each group. (The number of groups is equal to the length of the argument mu.vec.)
The normal and lognormal distributions are probably the two most frequently used distributions to model environmental data. Sometimes it is necessary to compare several means to determine whether any are significantly different from each other (e.g., USEPA, 2009, p.6-38). In this case, assuming normally distributed data, you perform a one-way parametric analysis of variance.
In the course of designing a sampling program, an environmental
scientist may wish to determine the relationship between sample
size, Type I error level, power, and differences in means if
one of the objectives of the sampling program is to determine
whether a particular mean differs from a group of means. The
functions aovPower
, aovN
, and
plotAovDesign
can be used to investigate these
relationships for the case of normally-distributed observations.
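As a quick consistency check (this snippet is not part of the original help file), the sample size returned by aovN should achieve at least the requested power when passed back to aovPower:

n <- aovN(mu.vec = c(10, 12, 15), sigma = 5, power = 0.9)
n
#[1] 27
aovPower(n.vec = rep(n, 3), mu.vec = c(10, 12, 15), sigma = 5)
# Should be at least 0.9, since the sample size is rounded up by default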
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 27, 29, 30.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Scheffe, H. (1959). The Analysis of Variance. John Wiley and Sons, New York, 477pp.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 10.
aovPower, plotAovDesign, Normal, aov.
# Look at how the required sample size for a one-way ANOVA # increases with increasing power: aovN(mu.vec = c(10, 12, 15), sigma = 5, power = 0.8) #[1] 21 aovN(mu.vec = c(10, 12, 15), sigma = 5, power = 0.9) #[1] 27 aovN(mu.vec = c(10, 12, 15), sigma = 5, power = 0.95) #[1] 33 #---------------------------------------------------------------- # Look at how the required sample size for a one-way ANOVA, # given a fixed power, decreases with increasing variability # in the population means: aovN(mu.vec = c(10, 10, 11), sigma=5) #[1] 581 aovN(mu.vec = c(10, 10, 15), sigma = 5) #[1] 25 aovN(mu.vec = c(10, 13, 15), sigma = 5) #[1] 33 aovN(mu.vec = c(10, 15, 20), sigma = 5) #[1] 10 #---------------------------------------------------------------- # Look at how the required sample size for a one-way ANOVA, # given a fixed power, decreases with increasing values of # Type I error: aovN(mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.001) #[1] 89 aovN(mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.01) #[1] 67 aovN(mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.05) #[1] 50 aovN(mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.1) #[1] 42
Compute the power of a one-way fixed-effects analysis of variance, given the sample sizes, population means, population standard deviation, and significance level.
aovPower(n.vec, mu.vec = rep(0, length(n.vec)), sigma = 1, alpha = 0.05)
n.vec: numeric vector of sample sizes, one element for each group. The length of n.vec determines the number of groups.
mu.vec: numeric vector of population means. The length of mu.vec must equal the length of n.vec. The default value is a vector of zeros (mu.vec = rep(0, length(n.vec))).
sigma: numeric scalar specifying the population standard deviation for each group. The default value is sigma=1.
alpha: numeric scalar between 0 and 1 indicating the Type I error level associated with the hypothesis test. The default value is alpha=0.05.
Consider k normally distributed populations with common standard deviation sigma. Let mu_i denote the mean of the i'th group (i = 1, 2, ..., k), and let x_{i1}, x_{i2}, ..., x_{in_i} denote the n_i observations from the i'th group. The statistical method of analysis of variance (ANOVA) tests the null hypothesis:

H_0: \mu_1 = \mu_2 = \cdots = \mu_k    (1)

against the alternative hypothesis that at least one of the means is different from the rest by using the F-statistic given by:

F = \frac{SS_B / (k - 1)}{SS_W / (N - k)}    (2)

where

SS_B = \sum_{i=1}^{k} n_i (\bar{x}_{i.} - \bar{x}_{..})^2    (3)

SS_W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2    (4)

\bar{x}_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, \quad \bar{x}_{..} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}, \quad N = \sum_{i=1}^{k} n_i    (5)

Under the null hypothesis (1), the F-statistic in (2) follows an F-distribution with k-1 and N-k degrees of freedom. Analysis of variance rejects the null hypothesis (1) at significance level alpha when

F > F_{k-1, N-k}(1 - \alpha)    (6)

where F_{\nu_1, \nu_2}(p) denotes the p'th quantile of the F-distribution with nu_1 and nu_2 degrees of freedom (Zar, 2010, Chapter 10; Berthouex and Brown, 2002, Chapter 24; Helsel and Hirsch, 1992, pp. 164-169).

The power of this test, denoted by 1-beta, where beta denotes the probability of a Type II error, is given by:

1 - \beta = \Pr[F_{k-1, N-k, \Delta} > F_{k-1, N-k}(1 - \alpha)]    (7)

where

\Delta = \frac{1}{\sigma^2} \sum_{i=1}^{k} n_i (\mu_i - \bar{\mu})^2, \quad \bar{\mu} = \frac{1}{N} \sum_{i=1}^{k} n_i \mu_i    (8)

and F_{k-1, N-k, \Delta} denotes a non-central F random variable with k-1 and N-k degrees of freedom and non-centrality parameter Delta. Equation (7) can be re-written as:

1 - \beta = 1 - H[F_{k-1, N-k}(1 - \alpha); k-1, N-k, \Delta]    (9)

where H[x; \nu_1, \nu_2, \Delta] denotes the cumulative distribution function of this non-central F random variable evaluated at x (Scheffe, 1959, pp. 38-39, 62-65).

The power of the one-way fixed-effects ANOVA depends on the sample sizes for each of the k groups, the values of the population means for each of the k groups, the population standard deviation sigma, and the significance level alpha.
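The power computation in Equations (7)-(9) can be reproduced with the F-distribution functions in base R. The following sketch is not part of the original help file; it simply evaluates Equation (9) for one set of inputs and should agree, up to rounding, with the corresponding call to aovPower shown in the examples below (which returns 0.7015083).

# Direct evaluation of Equation (9), with Delta as defined in Equation (8)
n.vec  <- rep(5, 3)
mu.vec <- c(10, 15, 20)
sigma  <- 5
alpha  <- 0.05
k <- length(n.vec)
N <- sum(n.vec)
mu.bar <- sum(n.vec * mu.vec) / N
Delta  <- sum(n.vec * (mu.vec - mu.bar)^2) / sigma^2
F.crit <- qf(1 - alpha, df1 = k - 1, df2 = N - k)
1 - pf(F.crit, df1 = k - 1, df2 = N - k, ncp = Delta)
# Compare with aovPower(n.vec, mu.vec, sigma = sigma)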
a numeric scalar indicating the power of the one-way fixed-effects ANOVA for the given sample sizes, population means, population standard deviation, and significance level.
The normal and lognormal distributions are probably the two most frequently used distributions to model environmental data. Sometimes it is necessary to compare several means to determine whether any are significantly different from each other (e.g., USEPA, 2009, p.6-38). In this case, assuming normally distributed data, you perform a one-way parametric analysis of variance.
In the course of designing a sampling program, an environmental
scientist may wish to determine the relationship between sample
size, Type I error level, power, and differences in means if
one of the objectives of the sampling program is to determine
whether a particular mean differs from a group of means. The
functions aovPower
, aovN
, and
plotAovDesign
can be used to investigate these
relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 27, 29, 30.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Scheffe, H. (1959). The Analysis of Variance. John Wiley and Sons, New York, 477pp.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 10.
aovN, plotAovDesign, Normal, aov.
# Look at how the power of a one-way ANOVA increases # with increasing sample size: aovPower(n.vec = rep(5, 3), mu.vec = c(10, 15, 20), sigma = 5) #[1] 0.7015083 aovPower(n.vec = rep(10, 3), mu.vec = c(10, 15, 20), sigma = 5) #[1] 0.9732551 #---------------------------------------------------------------- # Look at how the power of a one-way ANOVA increases # with increasing variability in the population means: aovPower(n.vec = rep(5,3), mu.vec = c(10, 10, 11), sigma=5) #[1] 0.05795739 aovPower(n.vec = rep(5, 3), mu.vec = c(10, 10, 15), sigma = 5) #[1] 0.2831863 aovPower(n.vec = rep(5, 3), mu.vec = c(10, 13, 15), sigma = 5) #[1] 0.2236093 aovPower(n.vec = rep(5, 3), mu.vec = c(10, 15, 20), sigma = 5) #[1] 0.7015083 #---------------------------------------------------------------- # Look at how the power of a one-way ANOVA increases # with increasing values of Type I error: aovPower(n.vec = rep(10,3), mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.001) #[1] 0.02655785 aovPower(n.vec = rep(10,3), mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.01) #[1] 0.1223527 aovPower(n.vec = rep(10,3), mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.05) #[1] 0.3085313 aovPower(n.vec = rep(10,3), mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.1) #[1] 0.4373292 #========== # The example on pages 5-11 to 5-14 of USEPA (1989b) shows # log-transformed concentrations of lead (mg/L) at two # background wells and four compliance wells, where observations # were taken once per month over four months (the data are # stored in EPA.89b.loglead.df.) Assume the true mean levels # at each well are 3.9, 3.9, 4.5, 4.5, 4.5, and 5, respectively. # Compute the power of a one-way ANOVA to test for mean # differences between wells. Use alpha=0.05, and assume the # true standard deviation is equal to the one estimated from # the data in this example. # First look at the data names(EPA.89b.loglead.df) #[1] "LogLead" "Month" "Well" "Well.type" dev.new() stripChart(LogLead ~ Well, data = EPA.89b.loglead.df, show.ci = FALSE, xlab = "Well Number", ylab="Log [ Lead (ug/L) ]", main="Lead Concentrations at Six Wells") # Note: The assumption of a constant variance across # all wells is suspect. # Now perform the ANOVA and get the estimated sd aov.list <- aov(LogLead ~ Well, data=EPA.89b.loglead.df) summary(aov.list) # Df Sum Sq Mean Sq F value Pr(>F) #Well 5 5.7447 1.14895 3.3469 0.02599 * #Residuals 18 6.1791 0.34328 #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '' 1 # Now call the function aovPower aovPower(n.vec = rep(4, 6), mu.vec = c(3.9,3.9,4.5,4.5,4.5,5), sigma=sqrt(0.34)) #[1] 0.5523148 # Clean up rm(aov.list)
Representation of a Number
For any number represented in base 10, compute the representation in any user-specified base.
base(n, base = 10, num.digits = max(0, floor(log(n, base))) + 1)
n: a non-negative integer (base 10).
base: a positive integer greater than 1 indicating what base to represent n in. The default value is base=10.
num.digits: a positive integer indicating how many digits to use to represent n in base base. The default value is the number of digits required (num.digits = max(0, floor(log(n, base))) + 1).
If b is a positive integer greater than 1, and n is a positive integer, then n can be expressed uniquely in the form

n = a_k b^k + a_{k-1} b^{k-1} + \cdots + a_1 b + a_0    (1)

where k is a non-negative integer, the coefficients a_0, a_1, ..., a_k are non-negative integers less than b, and a_k \ne 0 (Rosen, 1988, p.105). The function base computes the coefficients a_0, a_1, ..., a_k.
A numeric vector of length num.digits showing the representation of n in base base.
The function base is included in EnvStats because it is called by the function oneSamplePermutationTest.
Steven P. Millard ([email protected])
Rosen, K.H. (1988). Discrete Mathematics and Its Applications. Random House, New York, pp.105-107.
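As a quick check of Equation (1) (this snippet is not part of the original help file): the digits returned by base are ordered from most significant to least significant, so they can be recombined using powers of the base.

digits <- base(7, 2, num.digits = 5)
digits
#[1] 0 0 1 1 1
sum(digits * 2^((length(digits) - 1):0))
#[1] 7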
# Compute the value of 7 in base 2.

base(7, 2)
#[1] 1 1 1

base(7, 2, num.digits = 5)
#[1] 0 0 1 1 1
Lead (Pb) concentrations (mg/kg) in 29 discrete environmental soil samples from a site suspected to be contaminated with lead.
Beal.2010.Pb.df
data(Beal.2010.Pb.df)
A data frame with 29 observations on the following 3 variables.

Pb.char: character vector of lead concentrations, with nondetects indicated by a less-than sign (e.g., <1)
Pb: numeric vector of lead concentrations
Censored: logical vector indicating censoring status
Beal, D. (2010). A Macro for Calculating Summary Statistics on Left Censored Environmental Data Using the Kaplan-Meier Method. Paper SDA-09, presented at Southeast SAS Users Group 2010, September 26-28, Savannah, GA. https://analytics.ncsu.edu/sesug/2010/SDA09.Beal.pdf.
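A minimal usage sketch (not part of the original help file), summarizing the censored lead concentrations with the Kaplan-Meier based estimator discussed in the source paper; the choice of the EnvStats function enparCensored here is an assumption made for illustration.

data(Beal.2010.Pb.df)
# Nonparametric (Kaplan-Meier) estimates of the mean and standard deviation,
# treating the values flagged in Censored as left-censored at the reported limit.
with(Beal.2010.Pb.df, enparCensored(x = Pb, censored = Censored))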
Benthic data from a monitoring program in the Chesapeake Bay, Maryland, covering July 1984 - December 1991.
Benthic.df
A data frame with 585 observations on the following 7 variables.

Site.ID: site ID
Stratum: stratum number (101-131)
Latitude: latitude (degrees North)
Longitude: longitude (negative values; degrees West)
Index: benthic index (between 1 and 5)
Salinity: salinity (ppt)
Silt: silt content (% clay in soil)
Data from the Long Term Benthic Monitoring Program of the Chesapeake Bay. The data consist of measurements of benthic characteristics and a computed index of benthic health for several locations in the bay. Sampling methods and designs of the program are discussed in Ranasinghe et al. (1992).
The data represent observations collected at 585 separate point locations (sites). The sites are divided into 31 different strata, numbered 101 through 131, each stratum consisting of geographically close sites with similar degradation conditions. The benthic index values range from 1 to 5 on a continuous scale, where high values correspond to healthier benthos. Salinity was measured in parts per thousand (ppt), and silt content is expressed as a percentage of clay in the soil with high numbers corresponding to muddy areas.
The United States Environmental Protection Agency (USEPA) established an initiative for the Chesapeake Bay in partnership with the states bordering the bay in 1984. The goal of the initiative is the restoration (abundance, health, and diversity) of living resources to the bay by reducing nutrient loadings, reducing toxic chemical impacts, and enhancing habitats. USEPA's Chesapeake Bay Program Office is responsible for implementing this initiative and has established an extensive monitoring program that includes traditional water chemistry sampling, as well as collecting data on living resources to measure progress towards meeting the restoration goals.
Sampling benthic invertebrate assemblages has been an integral part of the Chesapeake Bay monitoring program due to their ecological importance and their value as biological indicators. The condition of benthic assemblages is a measure of the ecological health of the bay, including the effects of multiple types of environmental stresses. Nevertheless, regional-scale assessments of ecological status and trends using benthic assemblages are limited by the fact that benthic assemblages are strongly influenced by naturally variable habitat elements, such as salinity, sediment type, and depth. Also, different state agencies and USEPA programs use different sampling methodologies, limiting the ability to integrate data into a unified assessment. To circumvent these limitations, USEPA has standardized benthic data from several different monitoring programs into a single database, and from that database developed a Restoration Goals Benthic Index that identifies whether benthic restoration goals are being met.
Ranasinghe, J.A., L.C. Scott, and R. Newport. (1992). Long-term Benthic Monitoring and Assessment Program for the Maryland Portion of the Bay, Jul 1984-Dec 1991. Report prepared for the Maryland Department of the Environment and the Maryland Department of Natural Resources by Versar, Inc., Columbia, MD.
attach(Benthic.df) # Show station locations #----------------------- dev.new() plot(Longitude, Latitude, xlab = "-Longitude (Degrees West)", ylab = "Latitude", main = "Sampling Station Locations") # Scatterplot matrix of benthic index, salinity, and silt #-------------------------------------------------------- dev.new() pairs(~ Index + Salinity + Silt, data = Benthic.df) # Contour and perspective plots based on loess fit # showing only predicted values within the convex hull # of station locations #----------------------------------------------------- library(sp) loess.fit <- loess(Index ~ Longitude * Latitude, data=Benthic.df, normalize=FALSE, span=0.25) lat <- Benthic.df$Latitude lon <- Benthic.df$Longitude Latitude <- seq(min(lat), max(lat), length=50) Longitude <- seq(min(lon), max(lon), length=50) predict.list <- list(Longitude=Longitude, Latitude=Latitude) predict.grid <- expand.grid(predict.list) predict.fit <- predict(loess.fit, predict.grid) index.chull <- chull(lon, lat) inside <- point.in.polygon(point.x = predict.grid$Longitude, point.y = predict.grid$Latitude, pol.x = lon[index.chull], pol.y = lat[index.chull]) predict.fit[inside == 0] <- NA dev.new() contour(Longitude, Latitude, predict.fit, levels=seq(1, 5, by=0.5), labcex=0.75, xlab="-Longitude (degrees West)", ylab="Latitude (degrees North)") title(main=paste("Contour Plot of Benthic Index", "Based on Loess Smooth", sep="\n")) dev.new() persp(Longitude, Latitude, predict.fit, xlim = c(-77.3, -75.9), ylim = c(38.1, 39.5), zlim = c(0, 6), theta = -45, phi = 30, d = 0.5, xlab="-Longitude (degrees West)", ylab="Latitude (degrees North)", zlab="Benthic Index", ticktype = "detailed") title(main=paste("Surface Plot of Benthic Index", "Based on Loess Smooth", sep="\n")) detach("Benthic.df") rm(loess.fit, lat, lon, Latitude, Longitude, predict.list, predict.grid, predict.fit, index.chull, inside)
Analyte concentrations (µg/g) in 11 discrete environmental soil samples.
BJC.2000.df
data(BJC.2000.df)
A data frame with 11 observations on the following 4 variables.

Analyte.char: character vector of analyte concentrations, with nondetects indicated by the letter U after the value (e.g., 0.10U)
Analyte: numeric vector of analyte concentrations
Censored: logical vector indicating censoring status
Detect: numeric vector of 0s (nondetects) and 1s (detects) indicating censoring status
BJC. (2000). Improved Methods for Calculating Concentrations Used in Exposure Assessments. BJC/OR-416, Prepared by the Lockheed Martin Energy Research Corporation. Prepared for the U.S. Department of Energy Office of Environmental Management. Bechtel Jacobs Company, LLC. January, 2000. https://rais.ornl.gov/documents/bjc_or416.pdf.
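A minimal usage sketch (not part of the original help file); the lognormal model and the choice of the EnvStats function elnormAltCensored are assumptions made purely for illustration.

data(BJC.2000.df)
# Maximum likelihood estimates of the mean and CV of a lognormal distribution
# fit to the left-censored analyte concentrations.
with(BJC.2000.df, elnormAltCensored(x = Analyte, censored = Censored, method = "mle"))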
boxcox is a generic function used to compute the value(s) of an objective for one or more Box-Cox power transformations, or to compute an optimal power transformation based on a specified objective. The function invokes particular methods which depend on the class of the first argument. Currently, there is a default method and a method for objects of class "lm".
boxcox(x, ...)

## Default S3 method:
boxcox(x, lambda = {if (optimize) c(-2, 2) else seq(-2, 2, by = 0.5)},
    optimize = FALSE, objective.name = "PPCC",
    eps = .Machine$double.eps, include.x = TRUE, ...)

## S3 method for class 'lm'
boxcox(x, lambda = {if (optimize) c(-2, 2) else seq(-2, 2, by = 0.5)},
    optimize = FALSE, objective.name = "PPCC",
    eps = .Machine$double.eps, include.x = TRUE, ...)
x: an object of class "lm" (see the "lm" method above), or else a numeric vector of positive numbers (see the default method above).
lambda: numeric vector of finite values indicating what powers to use for the Box-Cox transformation. When optimize=FALSE, the default value is lambda=seq(-2, 2, by=0.5). When optimize=TRUE, lambda must be a vector of length 2 giving the range over which the optimization is performed, and the default value is lambda=c(-2, 2).
optimize: logical scalar indicating whether to simply evaluate the objective function at the given values of lambda (optimize=FALSE; the default), or to compute the optimal power transformation within the bounds specified by lambda (optimize=TRUE).
objective.name: character string indicating what objective to use. The possible values are "PPCC" (probability plot correlation coefficient; the default), "Shapiro-Wilk" (Shapiro-Wilk goodness-of-fit statistic), and "Log-Likelihood" (log-likelihood function).
eps: finite, positive numeric scalar. When the absolute value of lambda is less than eps, lambda is treated as 0 for the Box-Cox transformation. The default value is eps=.Machine$double.eps.
include.x: logical scalar indicating whether to include the finite, non-missing values of the argument x with the returned object. The default value is include.x=TRUE.
...: optional arguments for possible future methods. Currently not used.
Two common assumptions for several standard parametric hypothesis tests are:
The observations all come from a normal distribution.
The observations all come from distributions with the same variance.
For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups.
When the original data do not satisfy the above assumptions, data transformations
are often used to attempt to satisfy these assumptions. The rest of this section
is divided into two parts: one that discusses Box-Cox transformations in the
context of the original observations, and one that discusses Box-Cox
transformations in the context of linear models.
Box-Cox Transformations Based on the Original Observations

Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable X from some distribution with only positive values, the Box-Cox family of power transformations is defined as:

Y = \frac{X^\lambda - 1}{\lambda}, \quad \lambda \ne 0
Y = \log(X), \quad \lambda = 0    (1)

where Y is assumed to come from a normal distribution. This transformation is continuous in lambda. Note that this transformation also preserves ordering. See the help file for boxcoxTransform for more information on data transformations.

Let x_1, x_2, ..., x_n denote a random sample of n observations from some distribution and assume that there exists some value of lambda such that the transformed observations

y_i = \frac{x_i^\lambda - 1}{\lambda}, \quad \lambda \ne 0
y_i = \log(x_i), \quad \lambda = 0    (2)

(i = 1, 2, ..., n) form a random sample from a normal distribution.

Box and Cox (1964) proposed choosing the appropriate value of lambda based on maximizing the likelihood function. Alternatively, an appropriate value of lambda can be chosen based on another objective, such as maximizing the probability plot correlation coefficient or the Shapiro-Wilk goodness-of-fit statistic.
In the case when optimize=TRUE, the function boxcox calls the R function nlminb to minimize the negative value of the objective (i.e., maximize the objective) over the range of possible values of lambda specified in the argument lambda. The starting value for the optimization is always lambda=1 (i.e., no transformation).

The rest of this sub-section explains how the objective is computed for the various options for objective.name.
Objective Based on Probability Plot Correlation Coefficient (objective.name="PPCC")

When objective.name="PPCC", the objective is computed as the value of the normal probability plot correlation coefficient based on the transformed data (see the description of the Probability Plot Correlation Coefficient (PPCC) goodness-of-fit test in the help file for gofTest). That is, the objective is the correlation coefficient for the normal quantile-quantile plot for the transformed data. Large values of the PPCC tend to indicate a good fit to a normal distribution.
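As a rough illustration of this objective (not part of the original help file), the PPCC for a single candidate value of lambda can be approximated by correlating the ordered transformed data with normal quantiles evaluated at Blom-type plotting positions; the exact plotting-position constant used internally by boxcox and gofTest may differ, so the values below are only indicative.

set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 2)
ppcc <- function(x, lambda) {
  y <- boxcoxTransform(x, lambda = lambda)
  # Correlation between ordered transformed data and normal quantiles
  cor(sort(y), qnorm(ppoints(length(y), a = 0.375)))
}
sapply(c(-1, 0, 0.5, 1), function(l) ppcc(x, l))
# The PPCC should peak near lambda = 0 for these lognormal data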
Objective Based on Shapiro-Wilk Goodness-of-Fit Statistic (objective.name="Shapiro-Wilk")

When objective.name="Shapiro-Wilk", the objective is computed as the value of the Shapiro-Wilk goodness-of-fit statistic based on the transformed data (see the description of the Shapiro-Wilk test in the help file for gofTest). Large values of the Shapiro-Wilk statistic tend to indicate a good fit to a normal distribution.
Objective Based on Log-Likelihood Function (objective.name="Log-Likelihood")

When objective.name="Log-Likelihood", the objective is computed as the value of the log-likelihood function. Assuming the transformed observations in Equation (2) above come from a normal distribution with mean mu and standard deviation sigma, we can use the change of variable formula to write the log-likelihood function as:

\log L(\lambda, \mu, \sigma) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 + (\lambda - 1) \sum_{i=1}^{n} \log(x_i)    (3)

where y_i is defined in Equation (2) above (Box and Cox, 1964). For a fixed value of lambda, the log-likelihood function is maximized by replacing mu and sigma with their maximum likelihood estimators:

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i    (4)

\hat{\sigma} = \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2 \right]^{1/2}    (5)

Thus, when optimize=TRUE, Equation (3) is maximized by iteratively solving for lambda using the values for mu and sigma given in Equations (4) and (5). When optimize=FALSE, the value of the objective is computed by using Equation (3), using the values of lambda specified in the argument lambda, and using the values for mu and sigma given in Equations (4) and (5).
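The following snippet (not part of the original help file) evaluates Equations (3)-(5) directly for a fixed value of lambda; the result should be comparable to the value reported by boxcox(..., objective.name = "Log-Likelihood") for the same data and power.

boxcox.loglik <- function(x, lambda, eps = .Machine$double.eps) {
  # Transformed data, Equation (2)
  y <- if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
  n <- length(y)
  mu.hat    <- mean(y)                     # Equation (4)
  sigma.hat <- sqrt(mean((y - mu.hat)^2))  # Equation (5), MLE with divisor n
  # Profile log-likelihood, Equation (3), with mu and sigma replaced by their MLEs
  -n / 2 * log(2 * pi * sigma.hat^2) - n / 2 + (lambda - 1) * sum(log(x))
}
set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 2)
boxcox.loglik(x, lambda = 0)
# Compare with boxcox(x, objective.name = "Log-Likelihood")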
Box-Cox Transformation for Linear Models

In the case of a standard linear regression model with n observations and p predictors:

Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i, \quad i = 1, 2, \ldots, n    (6)

the standard assumptions are:

The error terms epsilon_i come from a normal distribution with mean 0.

The variance is the same for all of the error terms and does not depend on the predictor variables.

Assuming Y is a random variable from some distribution that may depend on the predictor variables and Y takes on only positive values, the Box-Cox family of power transformations is defined as:

Y^* = \frac{Y^\lambda - 1}{\lambda}, \quad \lambda \ne 0
Y^* = \log(Y), \quad \lambda = 0    (7)

where Y^* becomes the new response variable and the errors are now assumed to come from a normal distribution with a mean of 0 and a constant variance.

In this case, the objective is computed as described above, but it is based on the residuals from the fitted linear model in which the response variable is now Y^* instead of Y.
When x is an object of class "lm", boxcox returns a list of class "boxcoxLm" containing the results. See the help file for boxcoxLm.object for details.

When x is simply a numeric vector of positive numbers, boxcox returns a list of class "boxcox" containing the results. See the help file for boxcox.object for details.
Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used everyday, such as the pH scale for measuring acidity. Johnson and Wichern (2007, p.192) note that "Transformations are nothing more than a reexpression of the data in different units."
In the case of a linear model, there are at least two approaches to improving a model fit: transform the Y and/or X variable(s), and/or use more predictor variables. Often in environmental data analysis, we assume the observations come from a lognormal distribution and automatically take logarithms of the data. For a simple linear regression (i.e., one predictor variable), if regression diagnostic plots indicate that a straight-line fit is not adequate, but the variance of the errors appears to be fairly constant, you may only need to transform the predictor variable X, or perhaps use a quadratic or cubic model in X. On the other hand, if the diagnostic plots indicate that the constant variance and/or normality assumptions are suspect, you probably need to consider transforming the response variable Y. Data transformations for linear regression models are discussed in Draper and Smith (1998, Chapter 13) and Helsel and Hirsch (1992, pp. 228-229).
One problem with data transformations is that translating results on the
transformed scale back to the original scale is not always straightforward.
Estimating quantities such as means, variances, and confidence limits in the
transformed scale and then transforming them back to the original scale
usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149;
van Belle et al., 2004, p.400). For example, exponentiating the confidence
limits for a mean based on log-transformed data does not yield a
confidence interval for the mean on the original scale. Instead, this yields
a confidence interval for the median (see the help file for elnormAlt
).
It should be noted, however, that quantiles (percentiles) and rank-based
procedures are invariant to monotonic transformations
(Helsel and Hirsch, 1992, p.12).
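A small simulation (not part of the original help file) illustrating this point: for lognormal data, back-transforming the mean of the log-transformed values recovers the median, not the mean, of the original distribution.

set.seed(123)
mu <- 1; sigma <- 1
x <- rlnorm(1e5, meanlog = mu, sdlog = sigma)
exp(mean(log(x)))   # close to the median, exp(1) ~ 2.72
mean(x)             # close to the mean, exp(1.5) ~ 4.48
median(x)           # also close to exp(1)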
Finally, there is no guarantee that a Box-Cox transformation based on the “optimal” value of lambda will provide an adequate transformation to allow the assumption of approximate normality and constant variance. Any set of transformed data should be inspected relative to the assumptions you want to make about it (Johnson and Wichern, 2007, p.194).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.
Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.
Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 13.
boxcox.object, plot.boxcox, print.boxcox, boxcoxLm.object, plot.boxcoxLm, print.boxcoxLm, boxcoxTransform, Data Transformations, Goodness-of-Fit Tests.
# Generate 30 observations from a lognormal distribution with # mean=10 and cv=2. Look at some values of various objectives # for various transformations. Note that for both the PPCC and # the Log-Likelihood objective, the optimal value of lambda is # about 0, indicating that a log transformation is appropriate. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) dev.new() hist(x, col = "cyan") # Using the PPCC objective: #-------------------------- boxcox(x) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # # lambda PPCC # -2.0 0.5423739 # -1.5 0.6402782 # -1.0 0.7818160 # -0.5 0.9272219 # 0.0 0.9921702 # 0.5 0.9581178 # 1.0 0.8749611 # 1.5 0.7827009 # 2.0 0.7004547 boxcox(x, optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.04530789 # #Value of Objective: PPCC = 0.9925919 # Using the Log-Likelihodd objective #----------------------------------- boxcox(x, objective.name = "Log-Likelihood") #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: Log-Likelihood # #Data: x # #Sample Size: 30 # # lambda Log-Likelihood # -2.0 -154.94255 # -1.5 -128.59988 # -1.0 -106.23882 # -0.5 -90.84800 # 0.0 -85.10204 # 0.5 -88.69825 # 1.0 -99.42630 # 1.5 -115.23701 # 2.0 -134.54125 boxcox(x, objective.name = "Log-Likelihood", optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: Log-Likelihood # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.0405156 # #Value of Objective: Log-Likelihood = -85.07123 #---------- # Plot the results based on the PPCC objective #--------------------------------------------- boxcox.list <- boxcox(x) dev.new() plot(boxcox.list) #Look at QQ-Plots for the candidate values of lambda #--------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) #========== # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll plot ozone vs. # temperature and look at the Q-Q plot of the residuals. Then # we'll look at possible Box-Cox transformations. The "optimal" one # based on the PPCC looks close to a log-transformation # (i.e., lambda=0). The power that produces the largest PPCC is # about 0.2, so a cube root (lambda=1/3) transformation might work too. head(Environmental.df) # ozone radiation temperature wind #05/01/1973 41 190 67 7.4 #05/02/1973 36 118 72 8.0 #05/03/1973 12 149 74 12.6 #05/04/1973 18 313 62 11.5 #05/05/1973 NA NA 56 14.3 #05/06/1973 28 NA 66 14.9 tail(Environmental.df) # ozone radiation temperature wind #09/25/1973 14 20 63 16.6 #09/26/1973 30 193 70 6.9 #09/27/1973 NA 145 77 13.2 #09/28/1973 14 191 75 14.3 #09/29/1973 18 131 76 8.0 #09/30/1973 20 223 68 11.5 # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) # Plot Ozone vs. 
Temperature, with fitted line #--------------------------------------------- dev.new() with(Environmental.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)", main = "Ozone vs. Temperature")) abline(ozone.fit) # Look at the Q-Q Plot for the residuals #--------------------------------------- dev.new() qqPlot(ozone.fit$residuals, add.line = TRUE) # Look at Box-Cox transformations of Ozone #----------------------------------------- boxcox.list <- boxcox(ozone.fit) boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Linear Model: ozone.fit # #Sample Size: 116 # # lambda PPCC # -2.0 0.4286781 # -1.5 0.4673544 # -1.0 0.5896132 # -0.5 0.8301458 # 0.0 0.9871519 # 0.5 0.9819825 # 1.0 0.9408694 # 1.5 0.8840770 # 2.0 0.8213675 # Plot PPCC vs. lambda based on Q-Q plots of residuals #----------------------------------------------------- dev.new() plot(boxcox.list) # Look at Q-Q plots of residuals for the various transformation #-------------------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Compute the "optimal" transformation #------------------------------------- boxcox(ozone.fit, optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Linear Model: ozone.fit # #Sample Size: 116 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.2004305 # #Value of Objective: PPCC = 0.9940222 #========== # Clean up #--------- rm(x, boxcox.list, ozone.fit) graphics.off()
Objects of S3 class "boxcox"
are returned by the EnvStats
function boxcox
, which computes objective values for
user-specified powers, or computes the optimal power for the specified
objective.
Objects of class "boxcox"
are lists that contain
information about the powers that were used, the objective that was used,
the values of the objective for the given powers, and whether an
optimization was specified.
Required Components
The following components must be included in a legitimate list of class "boxcox".

lambda: numeric vector containing the powers used in the Box-Cox transformations. When optimize=TRUE, this is a scalar containing the optimal power; otherwise it contains the user-specified powers.
objective: numeric vector containing the value(s) of the objective for the given value(s) of lambda.
objective.name: character string indicating the objective that was used. The possible values are "PPCC" (probability plot correlation coefficient), "Shapiro-Wilk" (Shapiro-Wilk goodness-of-fit statistic), and "Log-Likelihood" (log-likelihood function).
optimize: logical scalar indicating whether the objective was simply evaluated at the given values of lambda (optimize=FALSE), or whether the optimal power transformation was computed (optimize=TRUE).
optimize.bounds: numeric vector of length 2 with a names attribute (lower and upper) indicating the bounds within which the optimization took place (relevant when optimize=TRUE).
eps: finite, positive numeric scalar indicating the value of eps that was used; values of lambda whose absolute value is less than eps are treated as 0.
sample.size: numeric scalar indicating the number of finite, non-missing observations.
data.name: the name of the data object used for the Box-Cox computations.
bad.obs: the number of missing (NA), undefined (NaN), and/or infinite (Inf, -Inf) values that were removed from the data object prior to performing the Box-Cox computations.
Optional Component
The following component may optionally be included in a legitimate list of class "boxcox". It must be included if you want to call the function plot.boxcox and specify Q-Q plots or Tukey Mean-Difference Q-Q plots.

data: numeric vector containing the data actually used for the Box-Cox computations (i.e., the original data without any missing or infinite values).
Generic functions that have methods for objects of class "boxcox" include: plot and print. Since objects of class "boxcox" are lists, you may extract their components with the $ and [[ operators.
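For example (not part of the original help file), individual components can be extracted directly:

boxcox.list <- boxcox(rlnormAlt(30, mean = 10, cv = 2))
boxcox.list$lambda          # the powers that were evaluated
boxcox.list[["objective"]]  # the value of the objective at each power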
Steven P. Millard ([email protected])
boxcox, plot.boxcox, print.boxcox, boxcoxLm.object.
# Create an object of class "boxcox", then print it out. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) dev.new() hist(x, col = "cyan") boxcox.list <- boxcox(x) data.class(boxcox.list) #[1] "boxcox" names(boxcox.list) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "sample.size" "data.name" #[10] "bad.obs" boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # # lambda PPCC # -2.0 0.5423739 # -1.5 0.6402782 # -1.0 0.7818160 # -0.5 0.9272219 # 0.0 0.9921702 # 0.5 0.9581178 # 1.0 0.8749611 # 1.5 0.7827009 # 2.0 0.7004547 boxcox(x, optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.04530789 # #Value of Objective: PPCC = 0.9925919 #---------- # Clean up #--------- rm(x, boxcox.list)
# Create an object of class "boxcox", then print it out. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) dev.new() hist(x, col = "cyan") boxcox.list <- boxcox(x) data.class(boxcox.list) #[1] "boxcox" names(boxcox.list) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "sample.size" "data.name" #[10] "bad.obs" boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # # lambda PPCC # -2.0 0.5423739 # -1.5 0.6402782 # -1.0 0.7818160 # -0.5 0.9272219 # 0.0 0.9921702 # 0.5 0.9581178 # 1.0 0.8749611 # 1.5 0.7827009 # 2.0 0.7004547 boxcox(x, optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.04530789 # #Value of Objective: PPCC = 0.9925919 #---------- # Clean up #--------- rm(x, boxcox.list)
Compute the value(s) of an objective function for one or more Box-Cox power transformations, or compute an optimal power transformation based on a specified objective, using Type I censored data.
boxcoxCensored(x, censored, censoring.side = "left", lambda = {if (optimize) c(-2, 2) else seq(-2, 2, by = 0.5)}, optimize = FALSE, objective.name = "PPCC", eps = .Machine$double.eps, include.x.and.censored = TRUE, prob.method = "michael-schucany", plot.pos.con = 0.375)
x |
a numeric vector of positive numbers.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
lambda |
numeric vector of finite values indicating what powers to use for the
Box-Cox transformation. When |
optimize |
logical scalar indicating whether to simply evaluate the objective function at the
given values of |
objective.name |
character string indicating what objective to use. The possible values are
|
eps |
finite, positive numeric scalar. When the absolute value of |
include.x.and.censored |
logical scalar indicating whether to include the finite, non-missing values of
the argument |
prob.method |
for multiply censored data,
character string indicating what method to use to compute the plotting positions
(empirical probabilities) when
The default value is The This argument is ignored if |
plot.pos.con |
for multiply censored data,
numeric scalar between 0 and 1 containing the value of the plotting position
constant when This argument is ignored if |
Two common assumptions for several standard parametric hypothesis tests are:
The observations all come from a normal distribution.
The observations all come from distributions with the same variance.
For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups.
When the original data do not satisfy the above assumptions, data transformations
are often used to attempt to satisfy these assumptions.
Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable $X$ from some distribution with only positive values, the Box-Cox family of power transformations is defined as:

$$Y = \begin{cases} \dfrac{X^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(X) & \lambda = 0 \end{cases}$$

where $Y$ is assumed to come from a normal distribution. This transformation is continuous in $\lambda$. Note that this transformation also preserves ordering.
See the help file for
boxcoxTransform
for more information on data
transformations.
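As a point of reference, the transformation defined above is easy to sketch directly in base R for a single power (a minimal illustration only; the EnvStats function boxcoxTransform performs the same computation and also uses its eps argument to treat powers that are numerically close to zero as zero):

# Minimal sketch of the Box-Cox family for positive x and a single lambda
bc.transform <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
x <- c(0.5, 1, 2, 4, 8)
bc.transform(x, lambda = 0)     # log transformation
bc.transform(x, lambda = 0.5)   # shifted and rescaled square-root transformation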
Box and Cox (1964) proposed choosing the appropriate value of $\lambda$ based on maximizing the likelihood function. Alternatively, an appropriate value of $\lambda$ can be chosen based on another objective, such as maximizing the probability plot correlation coefficient or the Shapiro-Wilk goodness-of-fit statistic.
Shumway et al. (1989) investigated extending the method of Box and Cox (1964) to the case of Type I censored data, motivated by the desire to produce estimated means and confidence intervals for air monitoring data that included censored values.
In the case when optimize=TRUE, the function boxcoxCensored calls the R function nlminb to minimize the negative value of the objective (i.e., maximize the objective) over the range of possible values of $\lambda$ specified in the argument lambda. The starting value for the optimization is always $\lambda = 1$ (i.e., no transformation).
The next section explains assumptions and notation, and the section after that
explains how the objective is computed for the various options for
objective.name
.
Assumptions and Notation
Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a random sample of $N$ observations from some continuous distribution. Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c = N - n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \qquad k \ge 1$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.
Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^{k} c_j = c$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.
Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.
Finally, let $\Omega$ (omega) denote the set of $n$ subscripts in the “ordered” sample that correspond to uncensored observations, and let $\Omega_j$ denote the set of $c_j$ subscripts in the “ordered” sample that correspond to the censored observations censored at censoring level $T_j$ for $j = 1, 2, \ldots, k$.
We assume that there exists some value of $\lambda$ such that the transformed observations

$$y_i = \begin{cases} \dfrac{x_i^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(x_i) & \lambda = 0 \end{cases} \qquad (4)$$

($i = 1, 2, \ldots, N$) form a random sample of Type I censored data from a normal distribution.
Note that for the censored observations, Equation (4) becomes:

$$y_{(i)} = \begin{cases} \dfrac{T_j^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(T_j) & \lambda = 0 \end{cases} \qquad (5)$$

where $i \in \Omega_j$.
Computing the Objective
Objective Based on Probability Plot Correlation Coefficient (objective.name="PPCC"
)
When objective.name="PPCC"
, the objective is computed as the value of the
normal probability plot correlation coefficient based on the transformed data
(see the description of the Probability Plot Correlation Coefficient (PPCC)
goodness-of-fit test in the help file for gofTestCensored
). That is,
the objective is the correlation coefficient for the normal
quantile-quantile plot for the transformed data.
Large values of the PPCC tend to indicate a good fit to a normal distribution.
Objective Based on Shapiro-Wilk Goodness-of-Fit Statistic (objective.name="Shapiro-Wilk"
)
When objective.name="Shapiro-Wilk"
, the objective is computed as the value of
the Shapiro-Wilk goodness-of-fit statistic based on the transformed data
(see the description of the Shapiro-Wilk test in the help file for
gofTestCensored
). Large values of the Shapiro-Wilk statistic tend to
indicate a good fit to a normal distribution.
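To make these two objectives concrete for the simpler case of complete (uncensored) data, the PPCC is just the correlation between the ordered transformed values and normal quantiles evaluated at plotting positions, and the Shapiro-Wilk statistic is returned by shapiro.test. A rough sketch (the censored-data versions used by boxcoxCensored are the ones computed by gofTestCensored):

# Sketch: both objectives for complete data at the candidate power lambda = 0 (log)
set.seed(123)
x <- rlnormAlt(40, mean = 10, cv = 2)
y <- log(x)                                            # transformed values
cor(sort(y), qnorm(ppoints(length(y), a = 0.375)))     # normal PPCC
shapiro.test(y)$statistic                              # Shapiro-Wilk W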
Objective Based on Log-Likelihood Function (objective.name="Log-Likelihood"
)
When objective.name="Log-Likelihood"
, the objective is computed as the value
of the log-likelihood function. Assuming the transformed observations in Equation (4) above come from a normal distribution with mean $\mu$ and standard deviation $\sigma$, we can use the change of variable formula to write the log-likelihood function as follows.
For Type I left censored data, the likelihood function is given by:

$$L(\lambda, \mu, \sigma \mid \underline{x}) = \frac{N!}{c_1!\, c_2! \cdots c_k!\, n!} \prod_{j=1}^{k} \left[F(y_{T_j})\right]^{c_j} \prod_{i \in \Omega} f\!\left[y_{(i)}\right] x_{(i)}^{\lambda - 1} \qquad (6)$$

where $y_{T_j}$ denotes the transformed censoring level $T_j$ (Equation (5) above), and $f$ and $F$ denote the probability density function (pdf) and cumulative distribution function (cdf) of the population. That is,

$$f(t) = \frac{1}{\sigma}\,\phi\!\left(\frac{t - \mu}{\sigma}\right), \qquad F(t) = \Phi\!\left(\frac{t - \mu}{\sigma}\right)$$

where $\phi$ and $\Phi$ denote the pdf and cdf of the standard normal distribution, respectively (Shumway et al., 1989). For left singly censored data, Equation (6) simplifies to:

$$L(\lambda, \mu, \sigma \mid \underline{x}) = \frac{N!}{c!\, n!} \left[F(y_T)\right]^{c} \prod_{i = c+1}^{N} f\!\left[y_{(i)}\right] x_{(i)}^{\lambda - 1}$$

Similarly, for Type I right censored data, the likelihood function is given by:

$$L(\lambda, \mu, \sigma \mid \underline{x}) = \frac{N!}{c_1!\, c_2! \cdots c_k!\, n!} \prod_{j=1}^{k} \left[1 - F(y_{T_j})\right]^{c_j} \prod_{i \in \Omega} f\!\left[y_{(i)}\right] x_{(i)}^{\lambda - 1} \qquad (10)$$

and for right singly censored data this simplifies to:

$$L(\lambda, \mu, \sigma \mid \underline{x}) = \frac{N!}{c!\, n!} \left[1 - F(y_T)\right]^{c} \prod_{i = 1}^{n} f\!\left[y_{(i)}\right] x_{(i)}^{\lambda - 1}$$
For a fixed value of $\lambda$, the log-likelihood function is maximized by replacing $\mu$ and $\sigma$ with their maximum likelihood estimators (see the section Maximum Likelihood Estimation in the help file for enormCensored).
Thus, when optimize=TRUE, Equation (6) or (10) is maximized by iteratively solving for $\lambda$ using the MLEs for $\mu$ and $\sigma$. When optimize=FALSE, the value of the objective is computed by using Equation (6) or (10), using the values of $\lambda$ specified in the argument lambda, and using the MLEs of $\mu$ and $\sigma$.
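A rough sketch of how the log-likelihood objective could be evaluated at a single power for left-censored data, using the MLEs of the mean and standard deviation returned by enormCensored. This ignores the additive combinatorial constant, and the exact bookkeeping inside boxcoxCensored may differ; it is an illustration only:

# Sketch: censored Box-Cox log-likelihood at lambda = 0 (log transformation),
# up to an additive constant, for Type I left-censored data.
set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 2)
censored <- x < 3
x[censored] <- 3                       # left-censor at 3
lambda <- 0
y <- log(x)                            # transformed data and censoring level

fit <- enormCensored(y, censored, method = "mle")
mu    <- fit$parameters["mean"]
sigma <- fit$parameters["sd"]

sum(dnorm(y[!censored], mu, sigma, log = TRUE)) +      # uncensored contribution
  sum(pnorm(y[censored], mu, sigma, log.p = TRUE)) +   # left-censored contribution
  (lambda - 1) * sum(log(x[!censored]))                # Jacobian of the transformation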
boxcoxCensored
returns a list of class "boxcoxCensored"
containing the results.
See the help file for boxcoxCensored.object
for details.
Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used every day, such as the pH scale for measuring acidity. Johnson and Wichern (2007, p.192) note that "Transformations are nothing more than a reexpression of the data in different units."
Shumway et al. (1989) investigated extending the method of Box and Cox (1964) to the case of Type I censored data, motivated by the desire to produce estimated means and confidence intervals for air monitoring data that included censored values.
Stoline (1991) compared the goodness-of-fit of Box-Cox transformed data (based on
using the “optimal” power transformation from a finite set of values between
-1.5 and 1.5) with log-transformed data for 17 groundwater chemistry variables.
Using the Probability Plot Correlation Coefficient statistic for censored data as a
measure of goodness-of-fit (see gofTest
), Stoline (1991) found that
only 6 of the variables were adequately modeled by a Box-Cox transformation
(p > 0.10 for these 6 variables). Of these variables, five were adequately modeled by a log transformation. Ten of the variables were “marginally” fit by an optimal Box-Cox transformation, and of these 10 only 6 were marginally fit by a
log transformation. Based on these results, Stoline (1991) recommends checking
the assumption of lognormality before automatically assuming environmental data fit
a lognormal distribution.
One problem with data transformations is that translating results on the
transformed scale back to the original scale is not always straightforward.
Estimating quantities such as means, variances, and confidence limits in the
transformed scale and then transforming them back to the original scale
usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149;
van Belle et al., 2004, p.400). For example, exponentiating the confidence
limits for a mean based on log-transformed data does not yield a
confidence interval for the mean on the original scale. Instead, this yields
a confidence interval for the median (see the help file for
elnormAltCensored
).
It should be noted, however, that quantiles (percentiles) and rank-based
procedures are invariant to monotonic transformations
(Helsel and Hirsch, 1992, p.12).
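The point about back-transformed confidence limits can be illustrated with a small sketch: exponentiating a confidence interval for the mean of log-transformed lognormal data brackets the true median, not the true mean (the parameter values below are arbitrary):

# Sketch: the exponentiated CI for the log-scale mean targets the median, not the mean
set.seed(42)
meanlog <- 1; sdlog <- 1
x <- rlnorm(200, meanlog = meanlog, sdlog = sdlog)
exp(t.test(log(x))$conf.int)       # back-transformed CI on the original scale
exp(meanlog)                       # true median: typically inside the interval
exp(meanlog + sdlog^2 / 2)         # true mean: typically above it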
Finally, there is no guarantee that a Box-Cox transformation based on the “optimal” value of $\lambda$ will provide an adequate transformation to allow the assumption of approximate normality and constant variance. Any set of transformed data should be inspected relative to the assumptions you want to make about it (Johnson and Wichern, 2007, p.194).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, pp.50–59.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.
Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.
Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 13.
boxcoxCensored.object
, plot.boxcoxCensored
,
print.boxcoxCensored
,
boxcox
, Data Transformations, Goodness-of-Fit Tests.
# Generate 15 observations from a lognormal distribution with # mean=10 and cv=2 and censor the observations less than 2. # Then generate 15 more observations from this distribution and # censor the observations less than 4. # Then Look at some values of various objectives for various transformations. # Note that for both the PPCC objective the optimal value is about -0.3, # whereas for the Log-Likelihood objective it is about 0.3. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x.1 <- rlnormAlt(15, mean = 10, cv = 2) censored.1 <- x.1 < 2 x.1[censored.1] <- 2 x.2 <- rlnormAlt(15, mean = 10, cv = 2) censored.2 <- x.2 < 4 x.2[censored.2] <- 4 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) #-------------------------- # Using the PPCC objective: #-------------------------- boxcoxCensored(x, censored) #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: PPCC # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # # lambda PPCC # -2.0 0.8954683 # -1.5 0.9338467 # -1.0 0.9643680 # -0.5 0.9812969 # 0.0 0.9776834 # 0.5 0.9471025 # 1.0 0.8901990 # 1.5 0.8187488 # 2.0 0.7480494 boxcoxCensored(x, censored, optimize = TRUE) #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: PPCC # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = -0.3194799 # #Value of Objective: PPCC = 0.9827546 #----------------------------------- # Using the Log-Likelihodd objective #----------------------------------- boxcoxCensored(x, censored, objective.name = "Log-Likelihood") #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: Log-Likelihood # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # # lambda Log-Likelihood # -2.0 -95.38785 # -1.5 -84.76697 # -1.0 -75.36204 # -0.5 -68.12058 # 0.0 -63.98902 # 0.5 -63.56701 # 1.0 -66.92599 # 1.5 -73.61638 # 2.0 -82.87970 boxcoxCensored(x, censored, objective.name = "Log-Likelihood", optimize = TRUE) #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: Log-Likelihood # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.3049744 # #Value of Objective: Log-Likelihood = -63.2733 #---------- # Plot the results based on the PPCC objective #--------------------------------------------- boxcox.list <- boxcoxCensored(x, censored) dev.new() plot(boxcox.list) #Look at QQ-Plots for the candidate values of lambda #--------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list) graphics.off()
Objects of S3 class "boxcoxCensored"
are returned by the EnvStats
function boxcoxCensored
, which computes objective values for
user-specified powers, or computes the optimal power for the specified
objective, based on Type I censored data.
Objects of class "boxcoxCensored"
are lists that contain
information about the powers that were used, the objective that was used,
the values of the objective for the given powers, and whether an
optimization was specified.
Required Components
The following components must be included in a legitimate list of
class "boxcoxCensored"
.
lambda |
Numeric vector containing the powers used in the Box-Cox transformations.
If the value of the |
objective |
Numeric vector containing the value(s) of the objective for the given value(s)
of |
objective.name |
Character string indicating the objective that was used. The possible values are
|
optimize |
Logical scalar indicating whether the objective was simply evaluated at the
given values of |
optimize.bounds |
Numeric vector of length 2 with a names attribute indicating the bounds within
which the optimization took place. When |
eps |
Finite, positive numeric scalar indicating what value of |
sample.size |
Numeric scalar indicating the number of finite, non-missing observations. |
censoring.side |
Character string indicating the censoring side. Possible values are
|
censoring.levels |
Numeric vector containing the censoring levels. |
percent.censored |
Numeric scalar indicating the percent of observations that are censored. |
data.name |
The name of the data object used for the Box-Cox computations. |
censoring.name |
The name of the data object indicating which observations are censored. |
bad.obs |
The number of missing ( |
Optional Components
The following components may optionally be included in a legitimate
list of class "boxcoxCensored"
. They must be included if you want to
call the function plot.boxcoxCensored
and specify Q-Q plots or
Tukey Mean-Difference Q-Q plots.
data |
Numeric vector containing the data actually used for the Box-Cox computations (i.e., the original data without any missing or infinite values). |
censored |
Logical vector indicating which of the values in the component |
Generic functions that have methods for objects of class
"boxcoxCensored"
include: plot
, print
.
Since objects of class "boxcoxCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
boxcoxCensored
, plot.boxcoxCensored
,
print.boxcoxCensored
.
# Create an object of class "boxcoxCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x.1 <- rlnormAlt(15, mean = 10, cv = 2) censored.1 <- x.1 < 2 x.1[censored.1] <- 2 x.2 <- rlnormAlt(15, mean = 10, cv = 2) censored.2 <- x.2 < 4 x.2[censored.2] <- 4 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) boxcox.list <- boxcoxCensored(x, censored) data.class(boxcox.list) #[1] "boxcoxCensored" names(boxcox.list) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "censored" "sample.size" #[10] "censoring.side" "censoring.levels" "percent.censored" #[13] "data.name" "censoring.name" "bad.obs" boxcox.list #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: PPCC # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # # lambda PPCC # -2.0 0.8954683 # -1.5 0.9338467 # -1.0 0.9643680 # -0.5 0.9812969 # 0.0 0.9776834 # 0.5 0.9471025 # 1.0 0.8901990 # 1.5 0.8187488 # 2.0 0.7480494 boxcox.list2 <- boxcox(x, optimize = TRUE) names(boxcox.list2) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "sample.size" "data.name" #[10] "bad.obs" boxcox.list2 #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = -0.5826431 # #Value of Objective: PPCC = 0.9755402 #========== # Clean up #--------- rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list, boxcox.list2)
# Create an object of class "boxcoxCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x.1 <- rlnormAlt(15, mean = 10, cv = 2) censored.1 <- x.1 < 2 x.1[censored.1] <- 2 x.2 <- rlnormAlt(15, mean = 10, cv = 2) censored.2 <- x.2 < 4 x.2[censored.2] <- 4 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) boxcox.list <- boxcoxCensored(x, censored) data.class(boxcox.list) #[1] "boxcoxCensored" names(boxcox.list) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "censored" "sample.size" #[10] "censoring.side" "censoring.levels" "percent.censored" #[13] "data.name" "censoring.name" "bad.obs" boxcox.list #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: PPCC # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # # lambda PPCC # -2.0 0.8954683 # -1.5 0.9338467 # -1.0 0.9643680 # -0.5 0.9812969 # 0.0 0.9776834 # 0.5 0.9471025 # 1.0 0.8901990 # 1.5 0.8187488 # 2.0 0.7480494 boxcox.list2 <- boxcox(x, optimize = TRUE) names(boxcox.list2) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "sample.size" "data.name" #[10] "bad.obs" boxcox.list2 #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = -0.5826431 # #Value of Objective: PPCC = 0.9755402 #========== # Clean up #--------- rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list, boxcox.list2)
Objects of S3 class "boxcoxLm"
are returned by the EnvStats
function boxcox
when the argument x
is an object
of class "lm"
. In this case, boxcox
computes
values of an objective function for user-specified powers, or computes the
optimal power for the specified objective, based on residuals from the linear model.
Objects of class "boxcoxLm"
are lists that contain
information about the "lm"
object that was supplied,
the powers that were used, the objective that was used,
the values of the objective for the given powers, and whether an
optimization was specified.
The following components must be included in a legitimate list of
class "boxcoxLm"
.
lambda |
Numeric vector containing the powers used in the Box-Cox transformations.
If the value of the |
objective |
Numeric vector containing the value(s) of the objective for the given value(s)
of |
objective.name |
character string indicating the objective that was used. The possible values are
|
optimize |
logical scalar indicating whether the objective was simply evaluated at the
given values of |
optimize.bounds |
Numeric vector of length 2 with a names attribute indicating the bounds within
which the optimization took place. When |
eps |
finite, positive numeric scalar indicating what value of |
lm.obj |
the value of the argument |
sample.size |
Numeric scalar indicating the number of finite, non-missing observations. |
data.name |
The name of the data object used for the Box-Cox computations. |
Generic functions that have methods for objects of class
"boxcoxLm"
include: plot
, print
.
Since objects of class "boxcoxLm"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
boxcox
, plot.boxcoxLm
, print.boxcoxLm
,
boxcox.object
.
# Create an object of class "boxcoxLm", then print it out. # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll plot ozone vs. # temperature and look at the Q-Q plot of the residuals. Then # we'll look at possible Box-Cox transformations. The "optimal" one # based on the PPCC looks close to a log-transformation # (i.e., lambda=0). The power that produces the largest PPCC is # about 0.2, so a cube root (lambda=1/3) transformation might work too. # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) # Plot Ozone vs. Temperature, with fitted line #--------------------------------------------- dev.new() with(Environmental.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)", main = "Ozone vs. Temperature")) abline(ozone.fit) # Look at the Q-Q Plot for the residuals #--------------------------------------- dev.new() qqPlot(ozone.fit$residuals, add.line = TRUE) # Look at Box-Cox transformations of Ozone #----------------------------------------- boxcox.list <- boxcox(ozone.fit) boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Linear Model: ozone.fit # #Sample Size: 116 # # lambda PPCC # -2.0 0.4286781 # -1.5 0.4673544 # -1.0 0.5896132 # -0.5 0.8301458 # 0.0 0.9871519 # 0.5 0.9819825 # 1.0 0.9408694 # 1.5 0.8840770 # 2.0 0.8213675 #---------- # Clean up #--------- rm(ozone.fit, boxcox.list)
# Create an object of class "boxcoxLm", then print it out. # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll plot ozone vs. # temperature and look at the Q-Q plot of the residuals. Then # we'll look at possible Box-Cox transformations. The "optimal" one # based on the PPCC looks close to a log-transformation # (i.e., lambda=0). The power that produces the largest PPCC is # about 0.2, so a cube root (lambda=1/3) transformation might work too. # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) # Plot Ozone vs. Temperature, with fitted line #--------------------------------------------- dev.new() with(Environmental.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)", main = "Ozone vs. Temperature")) abline(ozone.fit) # Look at the Q-Q Plot for the residuals #--------------------------------------- dev.new() qqPlot(ozone.fit$residuals, add.line = TRUE) # Look at Box-Cox transformations of Ozone #----------------------------------------- boxcox.list <- boxcox(ozone.fit) boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Linear Model: ozone.fit # #Sample Size: 116 # # lambda PPCC # -2.0 0.4286781 # -1.5 0.4673544 # -1.0 0.5896132 # -0.5 0.8301458 # 0.0 0.9871519 # 0.5 0.9819825 # 1.0 0.9408694 # 1.5 0.8840770 # 2.0 0.8213675 #---------- # Clean up #--------- rm(ozone.fit, boxcox.list)
Apply a Box-Cox power transformation to a set of data to attempt to induce normality and homogeneity of variance.
boxcoxTransform(x, lambda, eps = .Machine$double.eps)
x |
a numeric vector of positive numbers. |
lambda |
finite numeric scalar indicating what power to use for the Box-Cox transformation. |
eps |
finite, positive numeric scalar. When the absolute value of |
Two common assumptions for several standard parametric hypothesis tests are:
The observations all come from a normal distribution.
The observations all come from distributions with the same variance.
For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups. For standard linear regression models, these assumptions can be stated as: the error terms all come from a normal distribution with mean 0 and a constant variance.
Often, especially with environmental data, the above assumptions do not hold because the original data are skewed and/or they follow a distribution that is not really shaped like a normal distribution. It is sometimes possible, however, to transform the original data so that the transformed observations in fact come from a normal distribution or close to a normal distribution. The transformation may also induce homogeneity of variance and, for the case of a linear regression model, a linear relationship between the response and predictor variable(s).
Sometimes, theoretical considerations indicate an appropriate transformation. For example, count data often follow a Poisson distribution, and it can be shown that taking the square root of observations from a Poisson distribution tends to make these data look more bell-shaped (Johnson et al., 1992, p.163; Johnson and Wichern, 2007, p.192; Zar, 2010, p.291). A common example in the environmental field is that chemical concentration data often appear to come from a lognormal distribution or some other positively-skewed distribution (e.g., gamma). In this case, taking the logarithm of the observations often appears to yield normally distributed data.
Ideally, a data transformation is chosen based on knowledge of the process generating the data, as well as graphical tools such as quantile-quantile plots and histograms.
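Both points can be sketched quickly with the EnvStats function qqPlot (the Poisson mean below is arbitrary):

# Sketch: square-root transformation of Poisson counts looks more bell-shaped
set.seed(17)
counts <- rpois(200, lambda = 4)
dev.new()
qqPlot(counts, add.line = TRUE)          # skewed, discrete-looking Q-Q plot
dev.new()
qqPlot(sqrt(counts), add.line = TRUE)    # noticeably closer to a straight line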
Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable $X$ from some distribution with only positive values, the Box-Cox family of power transformations is defined as:

$$Y = \begin{cases} \dfrac{X^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(X) & \lambda = 0 \end{cases} \qquad (1)$$

where $Y$ is assumed to come from a normal distribution. This transformation is continuous in $\lambda$. Note that this transformation also preserves ordering; that is, if $X_1 < X_2$ then $Y_1 < Y_2$.
Box and Cox (1964) proposed choosing the appropriate value of $\lambda$ based on maximizing a likelihood function. See the help file for boxcox for details.
Note that for non-zero values of $\lambda$, instead of using the formula of Box and Cox in Equation (1), you may simply use the power transformation:

$$Y = X^\lambda$$

since these two equations differ only by a scale difference and origin shift, and the essential character of the transformed distribution remains unchanged.
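A quick sketch confirming the equivalence: for a fixed nonzero power, the values returned by boxcoxTransform are an exact linear rescaling of the simple power transform, so correlations and the shapes of plots based on either version are the same:

# Sketch: Box-Cox transform vs. simple power transform for lambda = 0.5
set.seed(99)
x <- rlnormAlt(20, mean = 10, cv = 2)
lambda <- 0.5
y1 <- boxcoxTransform(x, lambda = lambda)   # (x^lambda - 1) / lambda
y2 <- x^lambda                              # simple power transform
all.equal(y1, (y2 - 1) / lambda)            # identical up to shift and scale
cor(y1, y2)                                 # exactly 1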
The value $\lambda = 1$ corresponds to no transformation. Values of $\lambda$ less than 1 shrink large values of $X$, and are therefore useful for transforming positively-skewed (right-skewed) data. Values of $\lambda$ larger than 1 inflate large values of $X$, and are therefore useful for transforming negatively-skewed (left-skewed) data (Helsel and Hirsch, 1992, pp.13-14; Johnson and Wichern, 2007, p.193). Commonly used values of $\lambda$ include 0 (log transformation), 0.5 (square-root transformation), -1 (reciprocal), and -0.5 (reciprocal root).
It is often recommended that when dealing with several similar data sets, it is best to find a common transformation that works reasonably well for all the data sets, rather than using slightly different transformations for each data set (Helsel and Hirsch, 1992, p.14; Shumway et al., 1989).
numeric vector of transformed observations.
Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used every day, such as the pH scale for measuring acidity.
In the case of a linear model, there are at least two approaches to improving a model fit: transform the $Y$ and/or $X$ variable(s), and/or use more predictor variables. Often in environmental data analysis, we assume the observations come from a lognormal distribution and automatically take logarithms of the data. For a simple linear regression (i.e., one predictor variable), if regression diagnostic plots indicate that a straight line fit is not adequate, but that the variance of the errors appears to be fairly constant, you may only need to transform the predictor variable $X$ or perhaps use a quadratic or cubic model in $X$. On the other hand, if the diagnostic plots indicate that the constant variance and/or normality assumptions are suspect, you probably need to consider transforming the response variable $Y$. Data transformations for linear regression models are discussed in Draper and Smith (1998, Chapter 13) and Helsel and Hirsch (1992, pp. 228-229).
One problem with data transformations is that translating results on the
transformed scale back to the original scale is not always straightforward.
Estimating quantities such as means, variances, and confidence limits in the
transformed scale and then transforming them back to the original scale
usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149;
van Belle et al., 2004, p.400). For example, exponentiating the confidence
limits for a mean based on log-transformed data does not yield a
confidence interval for the mean on the original scale. Instead, this yields
a confidence interval for the median (see the help file for elnormAlt
).
It should be noted, however, that quantiles (percentiles) and rank-based
procedures are invariant to monotonic transformations
(Helsel and Hirsch, 1992, p.12).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.
Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.
Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 13.
boxcox
, Data Transformations, Goodness-of-Fit Tests.
# Generate 30 observations from a lognormal distribution with # mean=10 and cv=2, then look at some normal quantile-quantile # plots for various transformations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) dev.new() qqPlot(x, add.line = TRUE) dev.new() qqPlot(boxcoxTransform(x, lambda = 0.5), add.line = TRUE) dev.new() qqPlot(boxcoxTransform(x, lambda = 0), add.line = TRUE) # Clean up #--------- rm(x)
Fit a calibration line or curve based on linear regression.
calibrate(formula, data, test.higher.orders = TRUE, max.order = 4, p.crit = 0.05, F.test = "partial", weights, subset, na.action, method = "qr", model = FALSE, x = FALSE, y = FALSE, contrasts = NULL, warn = TRUE, ...)
formula |
a |
data |
an optional data frame, list or environment (or object coercible by |
test.higher.orders |
logical scalar indicating whether to start with a model that contains a single predictor
variable and test the fit of higher order polynomials to consider for the calibration
curve ( |
max.order |
integer indicating the maximum order of the polynomial to consider for the
calibration curve. The default value is |
p.crit |
numeric scalar between 0 and 1 indicating the p-value to use for the stepwise regression
when determining which polynomial model to use. The default value is |
F.test |
character string indicating whether to perform the stepwise regression using the
standard partial F-test ( |
weights |
optional vector of observation weights; if supplied, the algorithm fits to minimize the sum of the
weights multiplied into the squared residuals. The length of weights must be the same as
the number of observations. The weights must be nonnegative and it is strongly recommended
that they be strictly positive, since zero weights are ambiguous, compared to use of the
|
subset |
optional expression saying which subset of the rows of the data should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of observations), or a numeric vector indicating which observation numbers are to be included, or a character vector of the row names to be included. All observations are included by default. |
na.action |
optional function which indicates what should happen when the data contain |
method |
optional method to be used; for fitting, currently only |
model , x , y , qr
|
optional logicals. If |
contrasts |
an optional list. See the argument |
warn |
logical scalar indicating whether to issue a warning ( |
... |
additional arguments to be passed to the low level regression fitting functions
(see |
A simple and frequently used calibration model is a straight line:

$$S = \beta_0 + \beta_1 C + \epsilon$$

where the response variable S denotes the signal of the machine and the predictor variable C denotes the true concentration in the physical sample. The error term $\epsilon$ is assumed to follow a normal distribution with mean 0. Note that the average value of the signal for a blank (C = 0) is the intercept $\beta_0$. Other possible calibration models include higher order polynomial models such as a quadratic or cubic model.
In a typical setup, a small number of samples (e.g., n = 6) with known concentrations are measured and the signal is recorded. A sample with no chemical in it, called a blank, is also measured. (You have to be careful to define exactly what you mean by a “blank.” A blank could mean a container from the lab that has nothing in it but is prepared in a similar fashion to containers with actual samples in them. Or it could mean a field blank: the container was taken out to the field and subjected to the same process that all other containers were subjected to, except a physical sample of soil or water was not placed in the container.) Usually, replicate measures at the same known concentrations are taken. (The term “replicate” must be well defined to distinguish between for example the same physical samples that are measured more than once vs. two different physical samples of the same known concentration.)
The function calibrate
initially fits a linear calibration model. If the argument
max.order
is greater than 1, calibrate
then performs forward stepwise linear
regression to determine the “best” polynomial model.
In the case where replicates are not available, calibrate
uses standard stepwise
ANOVA to compare models (Draper and Smith, 1998, p.335). In this case, if the p-value
for the partial F-test to compare models is greater than or equal to p.crit
, then
the model with fewer terms is used as the final model.
In the case where replicates are available, if F.test="lof"
, then for each model
calibrate
computes the p-value of the ANOVA for lack-of-fit vs. pure error
(Draper and Smith, 1998, Chapters 2; see anovaPE
). If the p-value is
greater than or equal to p.crit
, then this is the final model; otherwise the next
higher-order term is added to the polynomial and the model is re-fit. If, during the
stepwise procedure, the degrees of freedom associated with the residual sums of squares
of a model to be tested is less than or equal to the number of observations minus the
number of unique observations, calibrate
uses the partial F-test instead of the
lack-of-fit F-test.
The stepwise algorithm terminates when either the p-value is greater than or equal to
p.crit
, or the currently selected model in the algorithm is of order
max.order
. The algorithm will terminate earlier than this if the next model to be
fit includes singularities so that not all coefficients can be estimated.
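Because the EPA cadmium data used in the Examples section below contain replicate spikes, the stepwise options described above can be exercised as in the following sketch (the argument values are illustrative only):

# Sketch: let calibrate() consider up to a cubic model and compare models
# with the lack-of-fit F-test rather than the default partial F-test.
cal.lof <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df,
  max.order = 3, F.test = "lof", p.crit = 0.05)
summary(cal.lof)    # the final fitted calibration model (inherits from "lm")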
An object of class
"calibrate"
that inherits from
class
"lm"
and includes a component called
x
that stores the model matrix (the values of the predictor variables for the final
calibration model).
Almost always the process of determining the concentration of a chemical in a soil,
water, or air sample involves using some kind of machine that produces a signal, and
this signal is related to the concentration of the chemical in the physical sample.
The process of relating the machine signal to the concentration of the chemical is
called calibration. Once calibration has been performed, estimated
concentrations in physical samples with unknown concentrations are computed using
inverse regression (see inversePredictCalibrate
). The uncertainty
in the process used to estimate the concentration may be quantified with decision,
detection, and quantitation limits.
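As a sketch of that workflow, a fitted calibration object can be passed to inversePredictCalibrate to turn new machine signals back into estimated concentrations. The argument name obs.y is assumed here; see the inversePredictCalibrate help file for the exact interface:

# Sketch: inverse prediction from observed signals back to concentrations
cal.fit <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df)
new.signal <- c(10, 40, 80)                      # hypothetical new machine signals
inversePredictCalibrate(cal.fit, obs.y = new.signal)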
Steven P. Millard ([email protected])
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 3 and p.335.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley & Sons, Hoboken. Chapter 6, p. 111.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey. Chapter 3, p. 22.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.562-575.
calibrate.object
, anovaPE
,
inversePredictCalibrate
,
detectionLimitCalibrate
, lm
.
# The data frame EPA.97.cadmium.111.df contains calibration data for # cadmium at mass 111 (ng/L) that appeared in Gibbons et al. (1997b) # and were provided to them by the U.S. EPA. # Display a plot of these data along with the fitted calibration line # and 99% non-simultaneous prediction limits. See # Millard and Neerchal (2001, pp.566-569) for more details on this # example. Cadmium <- EPA.97.cadmium.111.df$Cadmium Spike <- EPA.97.cadmium.111.df$Spike calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) newdata <- data.frame(Spike = seq(min(Spike), max(Spike), len = 100)) pred.list <- predict(calibrate.list, newdata = newdata, se.fit = TRUE) pointwise.list <- pointwise(pred.list, coverage = 0.99, individual = TRUE) dev.new() plot(Spike, Cadmium, ylim = c(min(pointwise.list$lower), max(pointwise.list$upper)), xlab = "True Concentration (ng/L)", ylab = "Observed Concentration (ng/L)") abline(calibrate.list, lwd = 2) lines(newdata$Spike, pointwise.list$lower, lty = 8, lwd = 2) lines(newdata$Spike, pointwise.list$upper, lty = 8, lwd = 2) title(paste("Calibration Line and 99% Prediction Limits", "for US EPA Cadmium 111 Data", sep = "\n")) #---------- # Clean up #--------- rm(Cadmium, Spike, newdata, calibrate.list, pred.list, pointwise.list) graphics.off()
Objects of S3 class "calibrate"
are returned by the EnvStats
function calibrate
, which fits a calibration line or curve based
on linear regression.
Objects of class "calibrate"
are lists that inherit from
class
"lm"
and include a component called
x
that stores the model matrix (the values of the predictor variables
for the final calibration model).
See the help file for lm
.
Required Components
Besides the usual components in the list returned by the function lm
,
the following components must be included in a legitimate list of
class "calibrate"
.
x |
the model matrix from the linear model fit. |
Generic functions that have methods for objects of class
"calibrate"
include:
NONE AT PRESENT.
Since objects of class "calibrate"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
calibrate
, inversePredictCalibrate
,
detectionLimitCalibrate
.
# Create an object of class "calibrate", then print it out. # The data frame EPA.97.cadmium.111.df contains calibration data for # cadmium at mass 111 (ng/L) that appeared in Gibbons et al. (1997b) # and were provided to them by the U.S. EPA. calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) names(calibrate.list) calibrate.list #---------- # Clean up #--------- rm(calibrate.list)
# Create an object of class "calibrate", then print it out. # The data frame EPA.97.cadmium.111.df contains calibration data for # cadmium at mass 111 (ng/L) that appeared in Gibbons et al. (1997b) # and were provided to them by the U.S. EPA. calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) names(calibrate.list) calibrate.list #---------- # Clean up #--------- rm(calibrate.list)
Detailed abstract of the manuscript:
Castillo, E., and A. Hadi. (1994). Parameter and Quantile Estimation for the Generalized Extreme-Value Distribution. Environmetrics 5, 417–432.
Abstract
Castillo and Hadi (1994) introduce a new way to estimate the parameters and quantiles of the generalized extreme value distribution (GEVD) with parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$. The estimator is based on a two-stage procedure using order statistics, denoted here by “TSOE”, which stands for two-stage order-statistics estimator. Castillo and Hadi (1994) compare the TSOE to the maximum likelihood estimator (MLE; Jenkinson, 1969; Prescott and Walden, 1983) and the probability-weighted moments estimator (PWME; Hosking et al., 1985).
Castillo and Hadi (1994) note that for some samples the likelihood may not have a local maximum, and that for others the likelihood can be made infinite, so the MLE does not exist. They also note, as do Hosking et al. (1985), that when $\kappa \le -1$, the moments and probability-weighted moments of the GEVD do not exist, hence neither does the PWME. (Hosking et al., however, claim that in practice the shape parameter usually lies between -1/2 and 1/2.) On the other hand, the TSOE exists for all values of $\kappa$.
Based on computer simulations, Castillo and Hadi (1994) found that the performance (bias and root mean squared error) of the TSOE is comparable to the PWME for values of $\kappa$ in the range $-1/2 \le \kappa \le 1/2$. They also found that the TSOE is superior to the PWME for large values of $\kappa$. Their results, however, are based on using the PWME computed using the approximation given in equation (14) of Hosking et al. (1985, p.253). The true PWME is computed using equation (12) of Hosking et al. (1985, p.253). Hosking et al. (1985) introduced the approximation as a matter of computational convenience, and noted that it is valid in the range $-1/2 \le \kappa \le 1/2$. If Castillo and Hadi (1994) had used the true PWME for values of $\kappa$ larger than 1/2, they probably would have gotten very different results for the PWME. (Note: the function egevd with method="pwme" uses the exact equation (12) of Hosking et al. (1985), not the approximation (14).)
Castillo and Hadi (1994) suggest using the bootstrap or jackknife to obtain
variance estimates and confidence intervals for the distribution parameters
based on the TSOE.
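A sketch of that suggestion using the EnvStats functions rgevd and egevd (the availability of method="tsoe" in egevd is assumed here; see its help file, and note the simulation parameters are arbitrary):

# Sketch: bootstrap standard errors for the TSOE of the GEVD parameters
set.seed(21)
x <- rgevd(60, location = 10, scale = 2, shape = 0.2)
boot.est <- replicate(200, {
  xb <- sample(x, replace = TRUE)
  egevd(xb, method = "tsoe")$parameters
})
apply(boot.est, 1, sd)    # bootstrap standard errors for location, scale, shape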
More Details
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from a generalized extreme value distribution with parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$ with cumulative distribution function $F$. Also, let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote the ordered values of $\underline{x}$.
First Stage
Castillo and Hadi (1994) propose as initial estimates of the distribution parameters the solutions to the following set of simultaneous equations based on just three observations from the total sample of size $n$:

$$F[x_{(1)}; \eta, \theta, \kappa] = \hat{p}_1, \qquad F[x_{(i)}; \eta, \theta, \kappa] = \hat{p}_i, \qquad F[x_{(n)}; \eta, \theta, \kappa] = \hat{p}_n \qquad (1)$$

where $2 \le i \le n - 1$, and $\hat{p}_i$ denotes the $i$'th plotting position for a sample of size $n$; that is, a nonparametric estimate of the value of $F$ at $x_{(i)}$. Typically, plotting positions have the form:

$$\hat{p}_i = \frac{i - a}{n + b}$$

where $a$ and $b$ are suitably chosen constants. In their simulation studies, Castillo and Hadi (1994) used a=0.35, b=0.
Since $i$ is arbitrary in the above set of equations (1), denote the solutions to these equations by:

$$\hat{\eta}_i, \qquad \hat{\theta}_i, \qquad \hat{\kappa}_i \qquad (2)$$

There are thus $n - 2$ sets of estimates.
Castillo and Hadi (1994) show that the estimate of the shape parameter, $\hat{\kappa}_i$, is the solution to a single equation in $\hat{\kappa}_i$ (their Equation (3)), and they show how to easily solve equation (3) using the method of bisection. Once the estimate of the shape parameter is obtained, the estimates of the location and scale parameters, $\hat{\eta}_i$ and $\hat{\theta}_i$, follow in closed form from the equations in (1).
Second Stage
Apply a robust function to the $n - 2$ sets of estimates obtained in the first stage. Castillo and Hadi (1994) suggest using either the median or the least median of squares (using a column of 1's as the predictor variable; see the help file for lmsreg in the package MASS). Using the median, for example, the final distribution parameter estimates are given by:

$$\hat{\eta} = \mathrm{median}(\hat{\eta}_2, \ldots, \hat{\eta}_{n-1}), \qquad \hat{\theta} = \mathrm{median}(\hat{\theta}_2, \ldots, \hat{\theta}_{n-1}), \qquad \hat{\kappa} = \mathrm{median}(\hat{\kappa}_2, \ldots, \hat{\kappa}_{n-1})$$
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Hosking, J.R.M. (1985). Algorithm AS 215: Maximum-Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 34(3), 301–310.
Jenkinson, A.F. (1969). Statistics of Extremes. Technical Note 98, World Meteorological Office, Geneva.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Prescott, P., and A.T. Walden. (1983). Maximum Likelihood Estimation of the Three-Parameter Generalized Extreme-Value Distribution from Censored Samples. Journal of Statistical Computing and Simulation 16, 241–250.
Generalized Extreme Value Distribution, egevd.
For one sample, plots the empirical cumulative distribution function (ecdf) along with a theoretical cumulative distribution function (cdf). For two samples, plots the two ecdf's. These plots are used to graphically assess goodness of fit.
cdfCompare(x, y = NULL, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = NULL, distribution = "norm", param.list = NULL, estimate.params = is.null(param.list), est.arg.list = NULL, x.col = "blue", y.or.fitted.col = "black", x.lwd = 3 * par("cex"), y.or.fitted.lwd = 3 * par("cex"), x.lty = 1, y.or.fitted.lty = 2, digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
y |
a numeric vector (not necessarily of the same length as |
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities). Possible values are
|
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
When |
distribution |
when |
param.list |
when |
estimate.params |
when |
est.arg.list |
when |
x.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
y.or.fitted.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
x.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
y.or.fitted.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
x.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
y.or.fitted.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
digits |
when |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
When both x and y are supplied, the function cdfCompare creates the empirical cdf plot of x and y on the same plot by calling the function ecdfPlot.

When y is not supplied, the function cdfCompare creates the empirical cdf plot of x (by calling ecdfPlot) and the theoretical cdf plot (by calling cdfPlot and using the argument distribution) on the same plot.
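A minimal sketch of the two modes just described, using only the cdfCompare arguments documented on this page:

# One-sample mode: empirical cdf of x overlaid with a fitted lognormal cdf
set.seed(123)
x <- rlnorm(25, meanlog = 1, sdlog = 0.5)
cdfCompare(x, distribution = "lnorm")

# Two-sample mode: empirical cdfs of x and y on the same plot
y <- rlnorm(40, meanlog = 1.2, sdlog = 0.5)
cdfCompare(x, y)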
When y
is supplied, cdfCompare
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
y.ecdf.list |
a list with components |
When y
is not supplied, cdfCompare
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
fitted.cdf.list |
a list with components |
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
Chambers et al. (1983, pp.11-16) plot the observed order statistics on the y-axis vs. the ecdf on the x-axis and call this a quantile plot.
Empirical cumulative distribution function (ecdf) plots are often plotted with
theoretical cdf plots (see cdfPlot
and cdfCompare
) to
graphically assess whether a sample of observations comes from a particular
distribution. The Kolmogorov-Smirnov goodness-of-fit test
(see gofTest
) is the statistical companion of this kind of
comparison; it is based on the maximum vertical distance between the empirical
cdf plot and the theoretical cdf plot. More often, however,
quantile-quantile (Q-Q) plots are used instead of ecdf plots to graphically assess
departures from an assumed distribution (see qqPlot
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
# Generate 20 observations from a normal (Gaussian) distribution # with mean=10 and sd=2 and compare the empirical cdf with a # theoretical normal cdf that is based on estimating the parameters. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rnorm(20, mean = 10, sd = 2) dev.new() cdfCompare(x) #---------- # Generate 30 observations from an exponential distribution with parameter # rate=0.1 (see the R help file for Exponential) and compare the empirical # cdf with the empirical cdf of the normal observations generated in the # previous example: set.seed(432) y <- rexp(30, rate = 0.1) dev.new() cdfCompare(x, y) #========== # Generate 20 observations from a Poisson distribution with parameter lambda=10 # (see the R help file for Poisson) and compare the empirical cdf with a # theoretical Poisson cdf based on estimating the distribution parameters. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rpois(20, lambda = 10) dev.new() cdfCompare(x, dist = "pois") #========== # Clean up #--------- rm(x, y) graphics.off()
For one sample, plots the empirical cumulative distribution function (ecdf) along with a theoretical cumulative distribution function (cdf). For two samples, plots the two ecdf's. These plots are used to graphically assess goodness of fit.
cdfCompareCensored(x, censored, censoring.side = "left", y = NULL, y.censored = NULL, y.censoring.side = censoring.side, discrete = FALSE, prob.method = "michael-schucany", plot.pos.con = NULL, distribution = "norm", param.list = NULL, estimate.params = is.null(param.list), est.arg.list = NULL, x.col = "blue", y.or.fitted.col = "black", x.lwd = 3 * par("cex"), y.or.fitted.lwd = 3 * par("cex"), x.lty = 1, y.or.fitted.lty = 2, include.x.cen = FALSE, x.cen.pch = ifelse(censoring.side == "left", 6, 2), x.cen.cex = par("cex"), x.cen.col = "red", include.y.cen = FALSE, y.cen.pch = ifelse(y.censoring.side == "left", 6, 2), y.cen.cex = par("cex"), y.cen.col = "black", digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
cdfCompareCensored(x, censored, censoring.side = "left", y = NULL, y.censored = NULL, y.censoring.side = censoring.side, discrete = FALSE, prob.method = "michael-schucany", plot.pos.con = NULL, distribution = "norm", param.list = NULL, estimate.params = is.null(param.list), est.arg.list = NULL, x.col = "blue", y.or.fitted.col = "black", x.lwd = 3 * par("cex"), y.or.fitted.lwd = 3 * par("cex"), x.lty = 1, y.or.fitted.lty = 2, include.x.cen = FALSE, x.cen.pch = ifelse(censoring.side == "left", 6, 2), x.cen.cex = par("cex"), x.cen.col = "red", include.y.cen = FALSE, y.cen.pch = ifelse(y.censoring.side == "left", 6, 2), y.cen.cex = par("cex"), y.cen.col = "black", digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
y |
a numeric vector (not necessarily of the same length as |
y.censored |
numeric or logical vector indicating which values of This argument is ignored when |
y.censoring.side |
character string indicating on which side the censoring occurs for the values of
|
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions (empirical probabilities).
Possible values are
The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
When |
distribution |
when |
param.list |
when |
estimate.params |
when |
est.arg.list |
when |
x.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
y.or.fitted.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
x.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
y.or.fitted.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
x.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
y.or.fitted.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
include.x.cen |
logical scalar indicating whether to include censored values in |
x.cen.pch |
numeric scalar or character string indicating the plotting character to use to plot
censored values in |
x.cen.cex |
numeric scalar that determines the size of the plotting character used to plot
censored values in |
x.cen.col |
numeric scalar or character string that determines the color of the plotting
character used to plot censored values in |
include.y.cen |
logical scalar indicating whether to include censored values in |
y.cen.pch |
numeric scalar or character string indicating the plotting character to use to plot
censored values in |
y.cen.cex |
numeric scalar that determines the size of the plotting character used to plot
censored values in |
y.cen.col |
numeric scalar or character string that determines the color of the plotting
character used to plot censored values in |
digits |
when |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
When both x and y are supplied, the function cdfCompareCensored creates the empirical cdf plot of x and y on the same plot by calling the function ecdfPlotCensored.

When y is not supplied, the function cdfCompareCensored creates the empirical cdf plot of x (by calling ecdfPlotCensored) and the theoretical cdf plot (by calling cdfPlot and using the argument distribution) on the same plot.
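A minimal sketch of the one-sample mode, using only the cdfCompareCensored arguments documented on this page:

# Left-censor values below a detection limit of 5, then compare the
# censored empirical cdf with a fitted lognormal cdf
set.seed(42)
x <- rlnorm(30, meanlog = 2, sdlog = 1)
censored <- x < 5
x[censored] <- 5
cdfCompareCensored(x, censored, distribution = "lnorm")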
When y
is supplied, cdfCompareCensored
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
y.ecdf.list |
a list with components |
When y
is not supplied, cdfCompareCensored
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
fitted.cdf.list |
a list with components |
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
Chambers et al. (1983, pp.11-16) plot the observed order statistics on the y-axis vs. the ecdf on the x-axis and call this a quantile plot.
Censored observations complicate the procedures used to graphically explore data.
Techniques from survival analysis and life testing have been developed to generalize
the procedures for constructing plotting positions, empirical cdf plots, and
q-q plots to data sets with censored observations
(see ppointsCensored
).
Empirical cumulative distribution function (ecdf) plots are often plotted with
theoretical cdf plots to graphically assess whether a sample of observations
comes from a particular distribution. More often, however, quantile-quantile
(Q-Q) plots are used instead of ecdf plots to graphically assess departures from
an assumed distribution (see qqPlotCensored
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
cdfPlot
, ecdfPlotCensored
, qqPlotCensored
.
# Generate 20 observations from a normal distribution with mean=20 and sd=5, # censor all observations less than 18, then compare the empirical cdf with a # theoretical normal cdf that is based on estimating the parameters. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(333) x <- sort(rnorm(20, mean=20, sd=5)) x # [1] 9.743551 12.370197 14.375499 15.628482 15.883507 17.080124 # [7] 17.197588 18.097714 18.654182 19.585942 20.219308 20.268505 #[13] 20.552964 21.388695 21.763587 21.823639 23.168039 26.165269 #[19] 26.843362 29.673405 censored <- x < 18 x[censored] <- 18 sum(censored) #[1] 7 dev.new() cdfCompareCensored(x, censored) # Clean up #--------- rm(x, censored) #========== # Example 15-1 of USEPA (2009, page 15-10) gives an example of # computing plotting positions based on censored manganese # concentrations (ppb) in groundwater collected at 5 monitoring # wells. The data for this example are stored in # EPA.09.Ex.15.1.manganese.df. Here we will compare the empirical # cdf based on Kaplan-Meier plotting positions or Michael-Schucany # plotting positions with various assumed distributions # (based on estimating the parameters of these distributions): # 1) normal distribution # 2) lognormal distribution # 3) gamma distribution # First look at the data: #------------------------ EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #4 4 Well.1 21.6 21.6 FALSE #5 5 Well.1 <2 2.0 TRUE #... #21 1 Well.5 17.9 17.9 FALSE #22 2 Well.5 22.7 22.7 FALSE #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Assume a normal distribution #----------------------------- # Michael-Schucany plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored)) # Kaplan-Meier plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, prob.method = "kaplan-meier")) # Assume a lognormal distribution #-------------------------------- # Michael-Schucany plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, dist = "lnorm")) # Kaplan-Meier plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, dist = "lnorm", prob.method = "kaplan-meier")) # Assume a gamma distribution #---------------------------- # Michael-Schucany plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, dist = "gamma")) # Kaplan-Meier plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, dist = "gamma", prob.method = "kaplan-meier")) # Clean up #--------- graphics.off() #========== # Compare the distributions of copper and zinc between the Alluvial Fan Zone # and the Basin-Trough Zone using the data of Millard and Deverel (1988). # The data are stored in Millard.Deverel.88.df. Millard.Deverel.88.df # Cu.orig Cu Cu.censored Zn.orig Zn Zn.censored Zone Location #1 < 1 1 TRUE <10 10 TRUE Alluvial.Fan 1 #2 < 1 1 TRUE 9 9 FALSE Alluvial.Fan 2 #3 3 3 FALSE NA NA FALSE Alluvial.Fan 3 #. #. 
#. #116 5 5 FALSE 50 50 FALSE Basin.Trough 48 #117 14 14 FALSE 90 90 FALSE Basin.Trough 49 #118 4 4 FALSE 20 20 FALSE Basin.Trough 50 Cu.AF <- with(Millard.Deverel.88.df, Cu[Zone == "Alluvial.Fan"]) Cu.AF.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Alluvial.Fan"]) Cu.BT <- with(Millard.Deverel.88.df, Cu[Zone == "Basin.Trough"]) Cu.BT.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Basin.Trough"]) Zn.AF <- with(Millard.Deverel.88.df, Zn[Zone == "Alluvial.Fan"]) Zn.AF.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Alluvial.Fan"]) Zn.BT <- with(Millard.Deverel.88.df, Zn[Zone == "Basin.Trough"]) Zn.BT.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Basin.Trough"]) # First compare the copper concentrations #---------------------------------------- dev.new() cdfCompareCensored(x = Cu.AF, censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen) # Now compare the zinc concentrations #------------------------------------ dev.new() cdfCompareCensored(x = Zn.AF, censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen) # Compare the Zinc concentrations again, but delete # the one "outlier". #-------------------------------------------------- summaryStats(Zn.AF) # N Mean SD Median Min Max NA's N.Total #Zn.AF 67 23.5075 74.4192 10 3 620 1 68 summaryStats(Zn.BT) # N Mean SD Median Min Max #Zn.BT 50 21.94 18.7044 18.5 3 90 which(Zn.AF == 620) #[1] 38 summaryStats(Zn.AF[-38]) # N Mean SD Median Min Max NA's N.Total #Zn.AF[-38] 66 14.4697 8.1604 10 3 50 1 67 dev.new() cdfCompareCensored(x = Zn.AF[-38], censored = Zn.AF.cen[-38], y = Zn.BT, y.censored = Zn.BT.cen) #---------- # Clean up #--------- rm(Cu.AF, Cu.AF.cen, Cu.BT, Cu.BT.cen, Zn.AF, Zn.AF.cen, Zn.BT, Zn.BT.cen) graphics.off()
Produce a cumulative distribution function (cdf) plot for a user-specified distribution.
cdfPlot(distribution = "norm", param.list = list(mean = 0, sd = 1), left.tail.cutoff = ifelse(is.finite(supp.min), 0, 0.001), right.tail.cutoff = ifelse(is.finite(supp.max), 0, 0.001), plot.it = TRUE, add = FALSE, n.points = 1000, cdf.col = "black", cdf.lwd = 3 * par("cex"), cdf.lty = 1, curve.fill = FALSE, curve.fill.col = "cyan", digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
cdfPlot(distribution = "norm", param.list = list(mean = 0, sd = 1), left.tail.cutoff = ifelse(is.finite(supp.min), 0, 0.001), right.tail.cutoff = ifelse(is.finite(supp.max), 0, 0.001), plot.it = TRUE, add = FALSE, n.points = 1000, cdf.col = "black", cdf.lwd = 3 * par("cex"), cdf.lty = 1, curve.fill = FALSE, curve.fill.col = "cyan", digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
distribution |
a character string denoting the distribution abbreviation. The default value is
|
param.list |
a list with values for the parameters of the distribution. The default value is
|
left.tail.cutoff |
a numeric scalar indicating what proportion of the left-tail of the probability
distribution to omit from the plot. For densities with a finite support minimum
(e.g., Lognormal) the default value is |
right.tail.cutoff |
a scalar indicating what proportion of the right-tail of the probability
distribution to omit from the plot. For densities with a finite support maximum
(e.g., Binomial) the default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the cumulative distribution function curve
to the existing plot ( |
n.points |
a numeric scalar specifying at how many evenly-spaced points the cumulative
distribution function will be evaluated. The default value is |
cdf.col |
a numeric scalar or character string determining
the color of the cdf line in the plot.
The default value is |
cdf.lwd |
a numeric scalar determining the width of the cdf
line in the plot.
The default value is |
cdf.lty |
a numeric scalar determining the line type of
the cdf line in the plot.
The default value is |
curve.fill |
a logical value indicating whether to fill in
the area below the cumulative distribution function curve with the color specified by
|
curve.fill.col |
when |
digits |
a scalar indicating how many significant digits to print for the distribution
parameters. The default value is |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The cumulative distribution function (cdf) of a random variable X, usually denoted F, is defined as:

F(x) = Pr(X <= x)        (1)

That is, F(x) is the probability that X is less than or equal to x. This is the probability that the random variable X takes on a value in the interval (-Inf, x], and is simply the (Lebesgue) integral of the pdf evaluated between -Inf and x. That is,

F(x) = Pr(X <= x) = Integral from -Inf to x of f(t) dt        (2)

where f(t) denotes the probability density function of X evaluated at t. For discrete distributions, Equation (2) translates to summing up the probabilities of all values in this interval:

F(x) = Pr(X <= x) = Sum over {t <= x} of Pr(X = t)        (3)
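As a quick numerical illustration of Equation (2) (a sketch, not part of the original help file), for a standard normal random variable the cdf at a point equals the integral of the density up to that point:

x0 <- 1.3
pnorm(x0)                                          # F(x0) for the standard normal
integrate(dnorm, lower = -Inf, upper = x0)$value   # integral of the pdf up to x0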
A cumulative distribution function (cdf) plot plots the values of the cdf against quantiles of the specified distribution. Theoretical cdf plots are sometimes plotted along with empirical cdf plots to visually assess whether data have a particular distribution.
cdfPlot
invisibly returns a list giving coordinates of the points
that have been or would have been plotted:
Quantiles |
The quantiles used for the plot. |
Cumulative.Probabilities |
The values of the cdf associated with the quantiles. |
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Distribution.df
, ecdfPlot
, cdfCompare
,
pdfPlot
.
# Plot the cdf of the standard normal distribution #------------------------------------------------- dev.new() cdfPlot() #========== # Plot the cdf of the standard normal distribution # and a N(2, 2) distribution on the sample plot. #------------------------------------------------- dev.new() cdfPlot(param.list = list(mean=2, sd=2), main = "") cdfPlot(add = TRUE, cdf.col = "red") legend("topleft", legend = c("N(2,2)", "N(0,1)"), col = c("black", "red"), lwd = 3 * par("cex")) title("CDF Plots for Two Normal Distributions") #========== # Clean up #--------- graphics.off()
For a skewed distribution, estimate the mean, standard deviation, and skew; test the null hypothesis that the mean is equal to a user-specified value vs. a one-sided alternative; and create a one-sided confidence interval for the mean.
chenTTest(x, y = NULL, alternative = "greater", mu = 0, paired = !is.null(y), conf.level = 0.95, ci.method = "z")
chenTTest(x, y = NULL, alternative = "greater", mu = 0, paired = !is.null(y), conf.level = 0.95, ci.method = "z")
x |
numeric vector of observations. Missing ( |
y |
optional numeric vector of observations that are paired with the observations in
|
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
mu |
numeric scalar indicating the hypothesized value of the mean. The default value is
|
paired |
character string indicating whether to perform a paired or one-sample t-test. The
possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the
confidence interval for the population mean. The default value is |
ci.method |
character string indicating which critical value to use to construct the confidence
interval for the mean. The possible values are |
One-Sample Case (paired=FALSE)

Let x = (x_1, x_2, ..., x_n) be a vector of n independent and identically distributed (i.i.d.) observations from some distribution with mean mu and standard deviation sigma.
Background: The Conventional Student's t-Test
Assume that the n observations come from a normal (Gaussian) distribution, and
consider the test of the null hypothesis:

H0: mu = mu0        (1)

The three possible alternative hypotheses are the upper one-sided alternative
(alternative="greater"):

Ha: mu > mu0        (2)

the lower one-sided alternative (alternative="less"):

Ha: mu < mu0        (3)

and the two-sided alternative:

Ha: mu != mu0        (4)

The test of the null hypothesis (1) versus any of the three alternatives (2)-(4) is usually based on the Student t-statistic:

t = (x.bar - mu0) / (s / sqrt(n))        (5)

where

x.bar = (1/n) * Sum(x_i)        (6)

s^2 = [1/(n-1)] * Sum[(x_i - x.bar)^2]        (7)

(see the R help file for t.test). Under the null hypothesis (1),
the t-statistic in (5) follows a Student's t-distribution with n-1
degrees of freedom (Zar, 2010, p.99; Johnson et al., 1995, pp.362-363).
The t-statistic is fairly robust to departures from normality in terms of
maintaining Type I error and power, provided that the sample size is sufficiently
large.
Chen's Modified t-Test for Skewed Distributions
In the case when the underlying distribution of the observations is
positively skewed and the sample size is small, the sampling distribution of the
t-statistic under the null hypothesis (1) does not follow a Student's t-distribution,
but is instead negatively skewed. For the test against the upper alternative in (2)
above, this leads to a Type I error smaller than the one assumed and a loss of power
(Chen, 1995b, p.767).
Similarly, in the case when the underlying distribution of the observations
is negatively skewed and the sample size is small, the sampling distribution of the
t-statistic is positively skewed. For the test against the lower alternative in (3)
above, this also leads to a Type I error smaller than the one assumed and a loss of
power.
In order to overcome these problems, Chen (1995b) proposed the following modified t-statistic that takes into account the skew of the underlying distribution:

t2 = t + a*(1 + 2*t^2) + 4*a^2*(t + 2*t^3)        (8)

where

a = skew.hat / (6 * sqrt(n))        (9)

and skew.hat denotes the estimated skew of the underlying distribution, which is based on unbiased estimators of central moments (see the help file for skewness).
For a positively-skewed distribution, Chen's modified t-test rejects the null hypothesis (1) in favor of the upper one-sided alternative (2) if the t-statistic in (8) is too large. For a negatively-skewed distribution, Chen's modified t-test rejects the null hypothesis (1) in favor of the lower one-sided alternative (3) if the t-statistic in (8) is too small.
Chen's modified t-test is not applicable to testing the two-sided alternative
(4). It should also not be used to test the upper one-sided alternative (2)
based on negatively-skewed data, nor should it be used to test the lower one-sided
alternative (3) based on positively-skewed data.
Determination of Critical Values and p-Values
Chen (1995b) performed a simulation study in which the modified t-statistic in (8)
was compared to a critical value based on the normal distribution (z-value),
a critical value based on Student's t-distribution (t-value), and the average of the
critical z-value and t-value. Based on the simulation study, Chen (1995b) suggests
using either the z-value or the average of the z-value and t-value when the sample
size n or the Type I error rate alpha is small, and using either the t-value or the
average of the z-value and t-value when n or alpha is larger.
The function chenTTest
returns three different p-values: one based on the
normal distribution, one based on Student's t-distribution, and one based on the
average of these two p-values. This last p-value should roughly correspond to a
p-value based on the distribution of the average of a normal and Student's t
random variable.
Computing Confidence Intervals
The function chenTTest computes a one-sided confidence interval for the true
mean mu based on finding all possible values of mu for which the null
hypothesis (1) would not be rejected, with the confidence level determined by the
argument conf.level. The argument ci.method determines which p-value
is used in the algorithm to determine the bounds on mu. When
ci.method="z", the p-value is based on the normal distribution; when
ci.method="t", the p-value is based on Student's t-distribution; and when
ci.method="Avg. of z and t", the p-value is based on the average of the
p-values from the normal and Student's t-distributions.
Paired-Sample Case (paired=TRUE)

When the argument paired=TRUE, the arguments x and y are assumed
to have the same length, and the n differences d_i = x_i - y_i (i = 1, 2, ..., n)
are assumed to be i.i.d. observations from some distribution with mean mu
and standard deviation sigma. Chen's modified t-test can then be applied
to the differences.
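A small sketch of the paired case (not part of the original help file), using simulated positively-skewed differences:

# Test whether the mean of the differences x - y is greater than 0
set.seed(15)
y <- rlnorm(25, meanlog = 2, sdlog = 1)
x <- y + rlnorm(25, meanlog = 0, sdlog = 1)   # differences are positively skewed
chenTTest(x, y, paired = TRUE, mu = 0, alternative = "greater")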
a list of class "htest"
containing the results of the hypothesis test. See
the help file for htest.object
for details.
The presentation of Chen's (1995b) method in USEPA (2002d) and Singh et al. (2010b, p. 52) is incorrect for two reasons: it is based on an intermediate formula instead of the actual statistic that Chen proposes, and it uses the intermediate formula to compute an upper confidence limit for the mean when the sample data are positively skewed. As explained above, for the case of positively skewed data, Chen's method is appropriate to test the upper one-sided alternative hypothesis that the population mean is greater than some specified value, and a one-sided upper alternative corresponds to creating a one-sided lower confidence limit, not an upper confidence limit (see, for example, Millard and Neerchal, 2001, p. 371).
A frequent question in environmental statistics is “Is the concentration of chemical X greater than Y units?” For example, in groundwater assessment (compliance) monitoring at hazardous and solid waste sites, the concentration of a chemical in the groundwater at a downgradient may be compared to a groundwater protection standard (GWPS). If the concentration is “above” the GWPS, then the site enters corrective action monitoring. As another example, soil screening at a Superfund site involves comparing the concentration of a chemical in the soil with a pre-determined soil screening level (SSL). If the concentration is “above” the SSL, then further investigation and possible remedial action is required. Determining what it means for the chemical concentration to be “above” a GWPS or an SSL is a policy decision: the average of the distribution of the chemical concentration must be above the GWPS or SSL, or the median must be above the GWPS or SSL, or the 95'th percentile must be above the GWPS or SSL, or something else. Often, the first interpretation is used.
The regulatory guidance document Soil Screening Guidance: Technical Background Document (USEPA, 1996c, Part 4) recommends using Chen's t-test as one possible method to compare chemical concentrations in soil samples to a soil screening level (SSL). The document notes that the distribution of chemical concentrations will almost always be positively-skewed, but not necessarily fit a lognormal distribution well (USEPA, 1996c, pp.107, 117-119). It also notes that using a confidence interval based on Land's (1971) method is extremely sensitive to the assumption of a lognormal distribution, while Chen's test is robust with respect to maintaining Type I and Type II errors for a variety of positively-skewed distributions (USEPA, 1996c, pp.99, 117-119, 123-125).
Hypothesis tests you can use to perform tests of location include: Student's t-test, Fisher's randomization test, the Wilcoxon signed rank test, Chen's modified t-test, the sign test, and a test based on a bootstrap confidence interval. For a discussion comparing the performance of these tests, see Millard and Neerchal (2001, pp.408–409).
Steven P. Millard ([email protected])
Chen, L. (1995b). Testing the Mean of Skewed Distributions. Journal of the American Statistical Association 90(430), 767–772.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 28, 31.
Land, C.E. (1971). Confidence Intervals for Linear Functions of the Normal Mean and Variance. The Annals of Mathematical Statistics 42(4), 1187–1205.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.402–404.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (1996c). Soil Screening Guidance: Technical Background Document. EPA/540/R-95/128, PB96963502. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., May, 1996.
USEPA. (2002d). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# The guidance document "Calculating Upper Confidence Limits for # Exposure Point Concentrations at Hazardous Waste Sites" # (USEPA, 2002d, Exhibit 9, p. 16) contains an example of 60 observations # from an exposure unit. Here we will use Chen's modified t-test to test # the null hypothesis that the average concentration is less than 30 mg/L # versus the alternative that it is greater than 30 mg/L. # In EnvStats these data are stored in the vector EPA.02d.Ex.9.mg.per.L.vec. sort(EPA.02d.Ex.9.mg.per.L.vec) # [1] 16 17 17 17 18 18 20 20 20 21 21 21 21 21 21 22 #[17] 22 22 23 23 23 23 24 24 24 25 25 25 25 25 25 26 #[33] 26 26 26 27 27 28 28 28 28 29 29 30 30 31 32 32 #[49] 32 33 33 35 35 97 98 105 107 111 117 119 dev.new() hist(EPA.02d.Ex.9.mg.per.L.vec, col = "cyan", xlab = "Concentration (mg/L)") # The Shapiro-Wilk goodness-of-fit test rejects the null hypothesis of a # normal, lognormal, and gamma distribution: gofTest(EPA.02d.Ex.9.mg.per.L.vec)$p.value #[1] 2.496781e-12 gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "lnorm")$p.value #[1] 3.349035e-09 gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "gamma")$p.value #[1] 1.564341e-10 # Use Chen's modified t-test to test the null hypothesis that # the average concentration is less than 30 mg/L versus the # alternative that it is greater than 30 mg/L. chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mean = 30 # #Alternative Hypothesis: True mean is greater than 30 # #Test Name: One-sample t-Test # Modified for # Positively-Skewed Distributions # (Chen, 1995) # #Estimated Parameter(s): mean = 34.566667 # sd = 27.330598 # skew = 2.365778 # #Data: EPA.02d.Ex.9.mg.per.L.vec # #Sample Size: 60 # #Test Statistic: t = 1.574075 # #Test Statistic Parameter: df = 59 # #P-values: z = 0.05773508 # t = 0.06040889 # Avg. of z and t = 0.05907199 # #Confidence Interval for: mean # #Confidence Interval Method: Based on z # #Confidence Interval Type: Lower # #Confidence Level: 95% # #Confidence Interval: LCL = 29.82 # UCL = Inf # The estimated mean, standard deviation, and skew are 35, 27, and 2.4, # respectively. The p-value is 0.06, and the lower 95% confidence interval # is [29.8, Inf). Depending on what you use for your Type I error rate, you # may or may not want to reject the null hypothesis.
# The guidance document "Calculating Upper Confidence Limits for # Exposure Point Concentrations at Hazardous Waste Sites" # (USEPA, 2002d, Exhibit 9, p. 16) contains an example of 60 observations # from an exposure unit. Here we will use Chen's modified t-test to test # the null hypothesis that the average concentration is less than 30 mg/L # versus the alternative that it is greater than 30 mg/L. # In EnvStats these data are stored in the vector EPA.02d.Ex.9.mg.per.L.vec. sort(EPA.02d.Ex.9.mg.per.L.vec) # [1] 16 17 17 17 18 18 20 20 20 21 21 21 21 21 21 22 #[17] 22 22 23 23 23 23 24 24 24 25 25 25 25 25 25 26 #[33] 26 26 26 27 27 28 28 28 28 29 29 30 30 31 32 32 #[49] 32 33 33 35 35 97 98 105 107 111 117 119 dev.new() hist(EPA.02d.Ex.9.mg.per.L.vec, col = "cyan", xlab = "Concentration (mg/L)") # The Shapiro-Wilk goodness-of-fit test rejects the null hypothesis of a # normal, lognormal, and gamma distribution: gofTest(EPA.02d.Ex.9.mg.per.L.vec)$p.value #[1] 2.496781e-12 gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "lnorm")$p.value #[1] 3.349035e-09 gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "gamma")$p.value #[1] 1.564341e-10 # Use Chen's modified t-test to test the null hypothesis that # the average concentration is less than 30 mg/L versus the # alternative that it is greater than 30 mg/L. chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mean = 30 # #Alternative Hypothesis: True mean is greater than 30 # #Test Name: One-sample t-Test # Modified for # Positively-Skewed Distributions # (Chen, 1995) # #Estimated Parameter(s): mean = 34.566667 # sd = 27.330598 # skew = 2.365778 # #Data: EPA.02d.Ex.9.mg.per.L.vec # #Sample Size: 60 # #Test Statistic: t = 1.574075 # #Test Statistic Parameter: df = 59 # #P-values: z = 0.05773508 # t = 0.06040889 # Avg. of z and t = 0.05907199 # #Confidence Interval for: mean # #Confidence Interval Method: Based on z # #Confidence Interval Type: Lower # #Confidence Level: 95% # #Confidence Interval: LCL = 29.82 # UCL = Inf # The estimated mean, standard deviation, and skew are 35, 27, and 2.4, # respectively. The p-value is 0.06, and the lower 95% confidence interval # is [29.8, Inf). Depending on what you use for your Type I error rate, you # may or may not want to reject the null hypothesis.
Density, distribution function, quantile function, and random generation for the chi distribution.
dchi(x, df) pchi(q, df) qchi(p, df) rchi(n, df)
dchi(x, df) pchi(q, df) qchi(p, df) rchi(n, df)
x |
vector of (positive) quantiles. |
q |
vector of (positive) quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
df |
vector of (positive) degrees of freedom (> 0). Non-integer values are allowed. |
Elements of x
, q
, p
, or df
that are missing will
cause the corresponding elements of the result to be missing.
The chi distribution with nu degrees of freedom is the distribution of the
positive square root of a random variable having a
chi-squared distribution with nu degrees of freedom.

The chi density function is given by:

f(x; nu) = 2 * x * g(x^2; nu),  x > 0

where g(y; nu) denotes the density function of a chi-square random variable
with nu degrees of freedom, evaluated at y.
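The relationship above can be checked numerically; the following sketch (not part of the original help file) assumes the dchi and pchi functions documented on this page:

# The chi cdf equals the chi-squared cdf evaluated at x^2, and
# dchi(x, df) equals 2 * x * dchisq(x^2, df)
x <- 1.7; df <- 4
all.equal(pchi(x, df), pchisq(x^2, df))
all.equal(dchi(x, df), 2 * x * dchisq(x^2, df))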
density (dchi
), probability (pchi
), quantile (qchi
), or
random sample (rchi
) for the chi distribution with df
degrees of freedom.
The chi distribution takes on positive real values. It is important because,
for a sample of n observations from a normal distribution,
the sample standard deviation multiplied by the square root of the degrees of
freedom (n-1) and divided by the true standard deviation follows a chi
distribution with n-1 degrees of freedom. The chi distribution is also
used in computing exact prediction intervals for the next k observations
from a normal distribution (see predIntNorm).
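A small simulation sketch of the first statement (not part of the original help file), assuming the qchi function documented above:

# For samples of size n from a normal distribution, s*sqrt(n-1)/sigma
# behaves like a chi random variable with n-1 degrees of freedom
set.seed(1)
n <- 10; sigma <- 2
stat <- replicate(5000, sd(rnorm(n, sd = sigma)) * sqrt(n - 1) / sigma)
quantile(stat, 0.9)     # compare the empirical 90th percentile ...
qchi(0.9, df = n - 1)   # ... with the corresponding chi quantile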
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Chisquare, Normal, predIntNorm
,
Probability Distributions and Random Numbers.
# Density of a chi distribution with 4 degrees of freedom, evaluated at 3: dchi(3, 4) #[1] 0.1499715 #---------- # The 95'th percentile of a chi distribution with 10 degrees of freedom: qchi(.95, 10) #[1] 4.278672 #---------- # The cumulative distribution function of a chi distribution with # 5 degrees of freedom evaluated at 3: pchi(3, 5) #[1] 0.8909358 #---------- # A random sample of 2 numbers from a chi distribution with 7 degrees of freedom. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) rchi(2, 7) #[1] 3.271632 2.035179
Compute the half-width of a confidence interval for a binomial proportion or the difference between two proportions, given the sample size(s), estimated proportion(s), and confidence level.
ciBinomHalfWidth(n.or.n1, p.hat.or.p1.hat = 0.5, n2 = n.or.n1, p2.hat = 0.4, conf.level = 0.95, sample.type = "one.sample", ci.method = "score", correct = TRUE, warn = TRUE)
n.or.n1 |
numeric vector of sample sizes. |
p.hat.or.p1.hat |
numeric vector of estimated proportions. |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of |
p2.hat |
numeric vector of estimated proportions for group 2.
This argument is ignored when |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level associated with
the confidence interval(s). The default value is |
sample.type |
character string indicating whether this is a one-sample or two-sample confidence interval.
When |
ci.method |
character string indicating which method to use to construct the confidence interval.
Possible values are |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning when |
If the arguments n.or.n1
, p.hat.or.p1.hat
, n2
, p2.hat
, and
conf.level
are not all the same length, they are replicated to be the same length as
the length of the longest argument.
The values of p.hat.or.p1.hat
and p2.hat
are automatically adjusted
to the closest legitimate values, given the user-supplied values of n.or.n1
and
n2
. For example, if n.or.n1=5
, legitimate values for
p.hat.or.p1.hat
are 0, 0.2, 0.4, 0.6, 0.8 and 1. In this case, if the
user supplies p.hat.or.p1.hat=0.45
, then p.hat.or.p1.hat
is reset to p.hat.or.p1.hat=0.4
, and if the user supplies p.hat.or.p1.hat=0.55
,
then p.hat.or.p1.hat
is reset to p.hat.or.p1.hat=0.6
. In cases where
the two closest legitimate values are equal distance from the user-suppled value of
p.hat.or.p1.hat
or p2.hat
, the value closest to 0.5 is chosen since
that will tend to yield the wider confidence interval.
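A quick sketch of this rounding rule (not part of the original help file); the p.hat component is described in the Value section below:

# With n = 5 the attainable proportions are multiples of 0.2, so an
# estimated proportion of 0.45 is moved to 0.4 before the half-width
# is computed
ciBinomHalfWidth(n.or.n1 = 5, p.hat.or.p1.hat = 0.45)$p.hat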
One-Sample Case (sample.type="one.sample"
).
ci.method="score"
The confidence interval for p based on the
score method was developed by Wilson (1927) and is discussed by Newcombe (1998a),
Agresti and Coull (1998), and Agresti and Caffo (2000). When
ci=TRUE
and
ci.method="score"
, the function ebinom
calls the R function
prop.test
to compute the confidence interval. This method
has been shown to provide the best performance (in terms of actual coverage matching
assumed coverage) of all the methods provided here, although unlike the exact
method, the actual coverage can fall below the assumed coverage.
ci.method="exact"
The confidence interval for p based on the
exact (Clopper-Pearson) method is discussed by Newcombe (1998a), Agresti and Coull (1998),
and Zar (2010, pp.543-547). This is the method used in the R function
binom.test
. This method ensures the actual coverage is greater than
or equal to the assumed coverage.
ci.method="Wald"
The confidence interval for p based on the
Wald method (with or without a correction for continuity) is the usual
“normal approximation” method and is discussed by Newcombe (1998a),
Agresti and Coull (1998), Agresti and Caffo (2000), and Zar (2010, pp.543-547).
This method is never recommended but is included for historical purposes.
ci.method="adjusted Wald"
The confidence interval for p based on the
adjusted Wald method is discussed by Agresti and Coull (1998), Agresti and Caffo (2000), and
Zar (2010, pp.543-547). This is a simple modification of the Wald method and
performs surprisingly well.
Two-Sample Case (sample.type="two.sample"
).
ci.method="score"
This method is presented in Newcombe (1998b) and
is based on the score method developed by Wilson (1927) for the one-sample case.
This is the method used by the R function prop.test
. In a comparison of
11 methods, Newcombe (1998b) showed this method performs remarkably well.
ci.method="Wald"
The confidence interval for the difference between two proportions based on the Wald method (with or without a correction for continuity) is the usual “normal approximation” method and is discussed by Newcombe (1998b), Agresti and Caffo (2000), and Zar (2010, pp.549-552). This method is not recommended but is included for historical purposes.
ci.method="adjusted Wald"
This method is discussed by Agresti and Caffo (2000), and Zar (2010, pp.549-552). This is a simple modification of the Wald method and performs surprisingly well.
a list with information about the half-widths, sample sizes, and estimated proportions.
One-Sample Case (sample.type="one.sample"
).
When sample.type="one.sample"
, the function ciBinomHalfWidth
returns a list with these components:
half.width |
the half-width(s) of the confidence interval(s) |
n |
the sample size(s) associated with the confidence interval(s) |
p.hat |
the estimated proportion(s) |
method |
the method used to construct the confidence interval(s) |
Two-Sample Case (sample.type="two.sample"
).
When sample.type="two.sample"
, the function ciBinomHalfWidth
returns a list with these components:
half.width |
the half-width(s) of the confidence interval(s) |
n1 |
the sample size(s) for group 1 associated with the confidence interval(s) |
p1.hat |
the estimated proportion(s) for group 1 |
n2 |
the sample size(s) for group 2 associated with the confidence interval(s) |
p2.hat |
the estimated proportion(s) for group 2 |
method |
the method used to construct the confidence interval(s) |
The binomial distribution is used to model processes with binary
(Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any
one trial is independent of any other trial, and that the probability of “success”, p,
is the same on each trial. A binomial discrete random variable X is the number of
“successes” in n independent trials. A special case of the binomial distribution
occurs when n=1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model
the proportion of times a chemical concentration exceeds a set standard in a given period of time
(e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a
background well (e.g., USEPA, 1989b, Chapter 8, p.3-7). (However, USEPA 2009, p.8-27
recommends using the Wilcoxon rank sum test (wilcox.test
) instead of
comparing proportions.)
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives of
the sampling program is to produce confidence intervals. The functions ciBinomHalfWidth
,
ciBinomN
, and plotCiBinomDesign
can be used to investigate these
relationships for the case of binomial proportions.
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Newcombe, R.G. (1998b). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
ciBinomN
, plotCiBinomDesign
,
ebinom
, binom.test
, prop.test
.
# Look at how the half-width of a one-sample confidence interval # decreases with sample size: ciBinomHalfWidth(n.or.n1 = c(10, 50, 100, 500)) #$half.width #[1] 0.26340691 0.13355486 0.09616847 0.04365873 # #$n #[1] 10 50 100 500 # #$p.hat #[1] 0.5 0.5 0.5 0.5 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Look at how the half-width of a one-sample confidence interval # tends to decrease as the estimated value of p decreases below # 0.5 or increases above 0.5: seq(0.2, 0.8, by = 0.1) #[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 ciBinomHalfWidth(n.or.n1 = 30, p.hat = seq(0.2, 0.8, by = 0.1)) #$half.width #[1] 0.1536299 0.1707256 0.1801322 0.1684587 0.1801322 0.1707256 #[7] 0.1536299 # #$n #[1] 30 30 30 30 30 30 30 # #$p.hat #[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Look at how the half-width of a one-sample confidence interval # increases with increasing confidence level: ciBinomHalfWidth(n.or.n1 = 20, conf.level = c(0.8, 0.9, 0.95, 0.99)) #$half.width #[1] 0.1377380 0.1725962 0.2007020 0.2495523 # #$n #[1] 20 20 20 20 # #$p.hat #[1] 0.5 0.5 0.5 0.5 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Compare the half-widths for a one-sample # confidence interval based on the different methods: ciBinomHalfWidth(n.or.n1 = 30, ci.method = "score")$half.width #[1] 0.1684587 ciBinomHalfWidth(n.or.n1 = 30, ci.method = "exact")$half.width #[1] 0.1870297 ciBinomHalfWidth(n.or.n1 = 30, ci.method = "adjusted Wald")$half.width #[1] 0.1684587 ciBinomHalfWidth(n.or.n1 = 30, ci.method = "Wald")$half.width #[1] 0.1955861 #---------------------------------------------------------------- # Look at how the half-width of a two-sample # confidence interval decreases with increasing # sample sizes: ciBinomHalfWidth(n.or.n1 = c(10, 50, 100, 500), sample.type = "two") #$half.width #[1] 0.53385652 0.21402654 0.14719748 0.06335658 # #$n1 #[1] 10 50 100 500 # #$p1.hat #[1] 0.5 0.5 0.5 0.5 # #$n2 #[1] 10 50 100 500 # #$p2.hat #[1] 0.4 0.4 0.4 0.4 # #$method #[1] "Score normal approximation, with continuity correction"
Compute the sample size necessary to achieve a specified half-width of a confidence interval for a binomial proportion or the difference between two proportions, given the estimated proportion(s), and confidence level.
ciBinomN(half.width, p.hat.or.p1.hat = 0.5, p2.hat = 0.4, conf.level = 0.95, sample.type = "one.sample", ratio = 1, ci.method = "score", correct = TRUE, warn = TRUE, n.or.n1.min = 2, n.or.n1.max = 10000, tol.half.width = 5e-04, tol.p.hat = 5e-04, tol = 1e-7, maxiter = 1000)
half.width |
numeric vector of (positive) half-widths.
Missing ( |
p.hat.or.p1.hat |
numeric vector of estimated proportions. |
p2.hat |
numeric vector of estimated proportions for group 2.
This argument is ignored when |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level associated with
the confidence interval(s). The default value is |
sample.type |
character string indicating whether this is a one-sample or two-sample confidence interval. |
ratio |
numeric vector indicating the ratio of sample size in group 2 to
sample size in group 1 ( |
ci.method |
character string indicating which method to use to construct the confidence interval. Possible values are:
The exact method is only available for the one-sample case, i.e., when |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning when |
n.or.n1.min |
integer indicating the minimum allowed value for |
n.or.n1.max |
integer indicating the maximum allowed value for |
tol.half.width |
numeric scalar indicating the tolerance to use for the half width for
the search algorithm. The sample sizes are computed so that the actual
half width is less than or equal to |
tol.p.hat |
numeric scalar indicating the tolerance to use for the estimated
proportion(s) for the search algorithm.
For the one-sample case, the sample sizes are computed so that
the absolute value of the difference between the user supplied
value of |
tol |
positive scalar indicating the tolerance to use for the search algorithm
(passed to |
maxiter |
integer indicating the maximum number of iterations to use for
the search algorithm (passed to |
If the arguments half.width
, p.hat.or.p1.hat
, p2.hat
,
conf.level
and ratio
are not all the same length, they are
replicated to be the same length as the length of the longest argument.
For the one-sample case, the arguments p.hat.or.p1.hat
, tol.p.hat
,
half.width
, and tol.half.width
must satisfy: (p.hat.or.p1.hat + tol.p.hat + half.width + tol.half.width) <= 1
,
and (p.hat.or.p1.hat - tol.p.hat - half.width - tol.half.width) >= 0
.
For the two-sample case, the arguments p.hat.or.p1.hat
, p2.hat
,
tol.p.hat
, half.width
, and tol.half.width
must satisfy: ((p.hat.or.p1.hat + tol.p.hat) - (p2.hat - tol.p.hat) + half.width + tol.half.width) <= 1
, and ((p.hat.or.p1.hat - tol.p.hat) - (p2.hat + tol.p.hat) - half.width - tol.half.width) >= -1
.
The function ciBinomN
uses the search algorithm in the
function uniroot
to call the function
ciBinomHalfWidth
to find the value of n (sample.type="one.sample") or the values of
n1 and n2 (sample.type="two.sample")
that satisfy the requirements for the half-width,
estimated proportions, and confidence level. See the Details section of the help file for
ciBinomHalfWidth
for more information.
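As a minimal round-trip sketch (not part of the original examples), the sample size returned by ciBinomN should, when passed back to ciBinomHalfWidth, give a half-width within tol.half.width of the requested value:

n <- ciBinomN(half.width = 0.1)$n
n
#[1] 92
ciBinomHalfWidth(n.or.n1 = n)$half.width
# about 0.1001, i.e., within tol.half.width = 5e-04 of the requested 0.1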
a list with information about the sample sizes, estimated proportions, and half-widths.
One-Sample Case (sample.type="one.sample"
).
When sample.type="one.sample"
, the function ciBinomN
returns a list with these components:
n |
the sample size(s) associated with the confidence interval(s) |
p.hat |
the estimated proportion(s) |
half.width |
the half-width(s) of the confidence interval(s) |
method |
the method used to construct the confidence interval(s) |
Two-Sample Case (sample.type="two.sample"
).
When sample.type="two.sample"
, the function ciBinomN
returns a list with these components:
n1 |
the sample size(s) for group 1 associated with the confidence interval(s) |
n2 |
the sample size(s) for group 2 associated with the confidence interval(s) |
p1.hat |
the estimated proportion(s) for group 1 |
p2.hat |
the estimated proportion(s) for group 2 |
half.width |
the half-width(s) of the confidence interval(s) |
method |
the method used to construct the confidence interval(s) |
The binomial distribution is used to model processes with binary
(Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any
one trial is independent of any other trial, and that the probability of “success”, p,
is the same on each trial. A binomial discrete random variable X is the number of
“successes” in n independent trials. A special case of the binomial distribution
occurs when n = 1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model
the proportion of times a chemical concentration exceeds a set standard in a given period of time
(e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a
background well (e.g., USEPA, 1989b, Chapter 8, p.3-7). (However, USEPA 2009, p.8-27
recommends using the Wilcoxon rank sum test (wilcox.test
) instead of
comparing proportions.)
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives of
the sampling program is to produce confidence intervals. The functions ciBinomHalfWidth
,
ciBinomN
, and plotCiBinomDesign
can be used to investigate these
relationships for the case of binomial proportions.
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Newcombe, R.G. (1998b). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
ciBinomHalfWidth
, uniroot
,
plotCiBinomDesign
, ebinom
, binom.test
, prop.test
.
# Look at how the required sample size of a one-sample # confidence interval increases with decreasing # required half-width: ciBinomN(half.width = c(0.1, 0.05, 0.03)) #$n #[1] 92 374 1030 # #$p.hat #[1] 0.5 0.5 0.5 # #$half.width #[1] 0.10010168 0.05041541 0.03047833 # #$method #[1] "Score normal approximation, with continuity correction" #---------- # Note that the required sample size decreases if we are less # stringent about how much the confidence interval width can # deviate from the supplied value of the 'half.width' argument: ciBinomN(half.width = c(0.1, 0.05, 0.03), tol.half.width = 0.005) #$n #[1] 84 314 782 # #$p.hat #[1] 0.5 0.5 0.5 # #$half.width #[1] 0.10456066 0.05496837 0.03495833 # #$method #[1] "Score normal approximation, with continuity correction" #-------------------------------------------------------------------- # Look at how the required sample size for a one-sample # confidence interval tends to decrease as the estimated # value of p decreases below 0.5 or increases above 0.5: seq(0.2, 0.8, by = 0.1) #[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 ciBinomN(half.width = 0.1, p.hat = seq(0.2, 0.8, by = 0.1)) #$n #[1] 70 90 100 92 100 90 70 # #$p.hat #[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 # #$half.width #[1] 0.09931015 0.09839843 0.09910818 0.10010168 0.09910818 0.09839843 #[7] 0.09931015 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Look at how the required sample size for a one-sample # confidence interval increases with increasing confidence level: ciBinomN(half.width = 0.05, conf.level = c(0.8, 0.9, 0.95, 0.99)) #$n #[1] 160 264 374 644 # #$p.hat #[1] 0.5 0.5 0.5 0.5 # #$half.width #[1] 0.05039976 0.05035948 0.05041541 0.05049152 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Compare required sample size for a one-sample # confidence interval based on the different methods: ciBinomN(half.width = 0.05, ci.method = "score") #$n #[1] 374 # #$p.hat #[1] 0.5 # #$half.width #[1] 0.05041541 # #$method #[1] "Score normal approximation, with continuity correction" ciBinomN(half.width = 0.05, ci.method = "exact") #$n #[1] 394 # #$p.hat #[1] 0.5 # #$half.width #[1] 0.05047916 # #$method #[1] "Exact" ciBinomN(half.width = 0.05, ci.method = "adjusted Wald") #$n #[1] 374 # #$p.hat #[1] 0.5 # #$half.width #[1] 0.05041541 # #$method #[1] "Adjusted Wald normal approximation" ciBinomN(half.width = 0.05, ci.method = "Wald") #$n #[1] 398 # #$p.hat #[1] 0.5 # #$half.width #[1] 0.05037834 # #$method #[1] "Wald normal approximation, with continuity correction" #---------------------------------------------------------------- ## Not run: # Look at how the required sample size of a two-sample # confidence interval increases with decreasing # required half-width: ciBinomN(half.width = c(0.1, 0.05, 0.03), sample.type = "two") #$n1 #[1] 210 778 2089 # #$n2 #[1] 210 778 2089 # #$p1.hat #[1] 0.5000000 0.5000000 0.4997607 # #$p2.hat #[1] 0.4000000 0.3997429 0.4001915 # #$half.width #[1] 0.09943716 0.05047044 0.03049753 # #$method #[1] "Score normal approximation, with continuity correction" ## End(Not run)
Compute the half-width of a confidence interval for the mean of a normal distribution or the difference between two means, given the sample size(s), estimated standard deviation, and confidence level.
ciNormHalfWidth(n.or.n1, n2 = n.or.n1, sigma.hat = 1, conf.level = 0.95, sample.type = ifelse(missing(n2), "one.sample", "two.sample"))
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s). |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level
associated with the confidence interval(s). The default value is |
sample.type |
character string indicating whether this is a one-sample |
If the arguments n.or.n1
, n2
, sigma.hat
, and
conf.level
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
One-Sample Case (sample.type="one.sample"
)
Let x_1, x_2, …, x_n denote a vector of n observations from a normal distribution with mean μ and standard deviation σ. A two-sided (1-α)100% confidence interval for μ is given by:

$$\left[ \hat{\mu} - t(n-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n}}, \;\; \hat{\mu} + t(n-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n}} \right]$$

where

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

and t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom
(Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992). Thus, the
half-width of this confidence interval is given by:

$$HW = t(n-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n}}$$
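As a minimal sketch (assumed inputs, not part of the original examples), the one-sample half-width can be computed directly from the formula above and compared with the output of ciNormHalfWidth:

n <- 20; sigma.hat <- 1; conf.level <- 0.95
qt(1 - (1 - conf.level)/2, df = n - 1) * sigma.hat / sqrt(n)
# about 0.468, which rounds to the 0.47 shown for n = 20 in the
# first example below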
Two-Sample Case (sample.type="two.sample"
)
Let x_11, x_12, …, x_1n1 denote a vector of n1 observations from a normal distribution with mean μ1 and standard deviation σ, and let x_21, x_22, …, x_2n2 denote a vector of n2 observations from a normal distribution with mean μ2 and standard deviation σ. A two-sided (1-α)100% confidence interval for μ1 - μ2 is given by:

$$\left[ (\hat{\mu}_1 - \hat{\mu}_2) - t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \;\; (\hat{\mu}_1 - \hat{\mu}_2) + t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right]$$

where

$$\hat{\mu}_1 = \bar{x}_1, \qquad \hat{\mu}_2 = \bar{x}_2, \qquad \hat{\sigma}^2 = s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$$

(Zar, 2010, p.142; Helsel and Hirsch, 1992, p.135; Berthouex and Brown, 2002, pp.157–158). Thus, the half-width of this confidence interval is given by:

$$HW = t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
Note that for the two-sample case, the function ciNormHalfWidth
assumes the
two populations have the same standard deviation.
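Similarly, a minimal sketch (assumed inputs) of the two-sample half-width computed directly from the pooled-variance formula above, for comparison with ciNormHalfWidth(n.or.n1 = 10, n2 = 10, sigma.hat = 2):

n1 <- 10; n2 <- 10; sigma.hat <- 2; conf.level <- 0.95
qt(1 - (1 - conf.level)/2, df = n1 + n2 - 2) * sigma.hat * sqrt(1/n1 + 1/n2)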
a numeric vector of half-widths.
The normal distribution and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with confidence intervals.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives
of the sampling program is to produce confidence intervals. The functions
ciNormHalfWidth
, ciNormN
, and plotCiNormDesign
can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-3.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapters 7 and 8.
ciNormN
, plotCiNormDesign
, Normal
,
enorm
, t.test
Estimating Distribution Parameters.
# Look at how the half-width of a one-sample confidence interval # decreases with increasing sample size: seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 hw <- ciNormHalfWidth(n.or.n1 = seq(5, 30, by = 5)) round(hw, 2) #[1] 1.24 0.72 0.55 0.47 0.41 0.37 #---------------------------------------------------------------- # Look at how the half-width of a one-sample confidence interval # increases with increasing estimated standard deviation: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 hw <- ciNormHalfWidth(n.or.n1 = 20, sigma.hat = seq(0.5, 2, by = 0.5)) round(hw, 2) #[1] 0.23 0.47 0.70 0.94 #---------------------------------------------------------------- # Look at how the half-width of a one-sample confidence interval # increases with increasing confidence level: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 hw <- ciNormHalfWidth(n.or.n1 = 20, conf.level = seq(0.5, 0.9, by = 0.1)) round(hw, 2) #[1] 0.15 0.19 0.24 0.30 0.39 #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), # determine how adding another four months of observations to # increase the sample size from 4 to 8 will affect the half-width # of a two-sided 95% confidence interval for the Aldicarb level at # the first compliance well. # # Use the estimated standard deviation from the first four months # of data. (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) # Note that the half-width changes from 34% of the observed mean to # 18% of the observed mean by increasing the sample size from # 4 to 8. EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #... mu.hat <- with(EPA.09.Ex.21.1.aldicarb.df, mean(Aldicarb.ppb[Well=="Well.1"])) mu.hat #[1] 23.1 sigma.hat <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well=="Well.1"])) sigma.hat #[1] 4.93491 hw.4 <- ciNormHalfWidth(n.or.n1 = 4, sigma.hat = sigma.hat) hw.4 #[1] 7.852543 hw.8 <- ciNormHalfWidth(n.or.n1 = 8, sigma.hat = sigma.hat) hw.8 #[1] 4.125688 100 * hw.4/mu.hat #[1] 33.99369 100 * hw.8/mu.hat #[1] 17.86012 #========== # Clean up #--------- rm(hw, mu.hat, sigma.hat, hw.4, hw.8)
Compute the sample size necessary to achieve a specified half-width of a confidence interval for the mean of a normal distribution or the difference between two means, given the estimated standard deviation and confidence level.
ciNormN(half.width, sigma.hat = 1, conf.level = 0.95, sample.type = ifelse(is.null(n2), "one.sample", "two.sample"), n2 = NULL, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
half.width |
numeric vector of (positive) half-widths.
Missing ( |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s). |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level
associated with the confidence interval(s). The default value is |
sample.type |
character string indicating whether this is a one-sample |
n2 |
numeric vector of sample sizes for group 2. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed sample size(s)
to the next smallest integer. The default value is |
n.max |
positive integer greater than 1 specifying the maximum sample size for the single
group when |
tol |
numeric scalar indicating the tolerance to use in the |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
If the arguments half.width
, n2
, sigma.hat
, and
conf.level
are not all the same length, they are replicated to be the same length as
the length of the longest argument.
The function ciNormN
uses the formulas given in the help file for
ciNormHalfWidth
for the half-width of the confidence interval
to iteratively solve for the sample size. For the two-sample case, the default
is to assume equal sample sizes for each group unless the argument n2
is supplied.
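As a minimal round-trip sketch (not part of the original examples), the sample size returned by ciNormN should yield a half-width at or just below the requested value:

n <- ciNormN(half.width = 0.5, sigma.hat = 1)
n
#[1] 18
ciNormHalfWidth(n.or.n1 = n, sigma.hat = 1)
# about 0.497, i.e., no larger than the requested half-width of 0.5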
When sample.type="one.sample"
, or sample.type="two.sample"
and n2
is not supplied (so equal sample sizes for each group is assumed),
the function ciNormN
returns a numeric vector of sample sizes.
When sample.type="two.sample"
and n2
is supplied,
the function ciNormN
returns a list with two components called n1
and n2
,
specifying the sample sizes for each group.
The normal distribution and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with confidence intervals.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives
of the sampling program is to produce confidence intervals. The functions
ciNormHalfWidth
, ciNormN
, and plotCiNormDesign
can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-3.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapters 7 and 8.
ciNormHalfWidth
, plotCiNormDesign
, Normal
,
enorm
, t.test
,
Estimating Distribution Parameters.
# Look at how the required sample size for a one-sample # confidence interval decreases with increasing half-width: seq(0.25, 1, by = 0.25) #[1] 0.25 0.50 0.75 1.00 ciNormN(half.width = seq(0.25, 1, by = 0.25)) #[1] 64 18 10 7 ciNormN(seq(0.25, 1, by=0.25), round = FALSE) #[1] 63.897899 17.832337 9.325967 6.352717 #---------------------------------------------------------------- # Look at how the required sample size for a one-sample # confidence interval increases with increasing estimated # standard deviation for a fixed half-width: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 ciNormN(half.width = 0.5, sigma.hat = seq(0.5, 2, by = 0.5)) #[1] 7 18 38 64 #---------------------------------------------------------------- # Look at how the required sample size for a one-sample # confidence interval increases with increasing confidence # level for a fixed half-width: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 ciNormN(half.width = 0.25, conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 9 13 19 28 46 #---------------------------------------------------------------- # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), # determine the required sample size in order to achieve a # half-width that is 10% of the observed mean (based on the first # four months of observations) for the Aldicarb level at the first # compliance well. Assume a 95% confidence level and use the # estimated standard deviation from the first four months of data. # (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) # # The required sample size is 20, so almost two years of data are # required assuming observations are taken once per month. EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #... mu.hat <- with(EPA.09.Ex.21.1.aldicarb.df, mean(Aldicarb.ppb[Well=="Well.1"])) mu.hat #[1] 23.1 sigma.hat <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well=="Well.1"])) sigma.hat #[1] 4.93491 ciNormN(half.width = 0.1 * mu.hat, sigma.hat = sigma.hat) #[1] 20 #---------- # Clean up rm(mu.hat, sigma.hat)
Compute the confidence level associated with a nonparametric confidence interval for a quantile, given the sample size and order statistics associated with the lower and upper bounds.
ciNparConfLevel(n, p = 0.5, lcl.rank = ifelse(ci.type == "upper", 0, 1), n.plus.one.minus.ucl.rank = ifelse(ci.type == "lower", 0, 1), ci.type = "two.sided")
n |
numeric vector of sample sizes.
Missing ( |
p |
numeric vector of probabilities specifying which quantiles to consider for
the sample size calculation. All values of |
lcl.rank , n.plus.one.minus.ucl.rank
|
numeric vectors of non-negative integers indicating the ranks of the
order statistics that are used for the lower and upper bounds of the
confidence interval for the specified quantile(s). When |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
If the arguments n
, p
, lcl.rank
, and
n.plus.one.minus.ucl.rank
are not all the same length, they are
replicated to be the
same length as the length of the longest argument.
The help file for eqnpar
explains how nonparametric confidence
intervals for quantiles are constructed and how the confidence level
associated with the confidence interval is computed based on specified values
for the sample size and the ranks of the order statistics used for
the bounds of the confidence interval.
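As a minimal sketch of the underlying relationship for the lower one-sided case (the general formulas are given in the help file for eqnpar), the confidence level is the probability that at least lcl.rank of the n observations fall at or below the p'th quantile:

n <- 12; p <- 0.95; lcl.rank <- 10
1 - pbinom(lcl.rank - 1, n, p)
# about 0.98043, matching the 98.04317% confidence level in the
# nitrate example below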
A numeric vector of confidence levels.
See the help file for eqnpar
.
Steven P. Millard ([email protected])
See the help file for eqnpar
.
eqnpar
, ciNparN
,
plotCiNparDesign
.
# Look at how the confidence level of a nonparametric confidence interval # increases with increasing sample size for a fixed quantile: seq(5, 25, by = 5) #[1] 5 10 15 20 25 round(ciNparConfLevel(n = seq(5, 25, by = 5), p = 0.9), 2) #[1] 0.41 0.65 0.79 0.88 0.93 #--------- # Look at how the confidence level of a nonparametric confidence interval # decreases as the quantile moves away from 0.5: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(ciNparConfLevel(n = 10, p = seq(0.5, 0.9, by = 0.1)), 2) #[1] 1.00 0.99 0.97 0.89 0.65 #========== # Reproduce Example 21-6 on pages 21-21 to 21-22 of USEPA (2009). # Use 12 measurements of nitrate (mg/L) at a well used for drinking water # to determine with 95% confidence whether or not the infant-based, acute # risk standard of 10 mg/L has been violated. Assume that the risk # standard represents an upper 95'th percentile limit on nitrate # concentrations. So what we need to do is construct a one-sided # lower nonparametric confidence interval for the 95'th percentile # that has associated confidence level of no more than 95%, and we will # compare the lower confidence limit with the MCL of 10 mg/L. # # The data for this example are stored in EPA.09.Ex.21.6.nitrate.df. # Look at the data: #------------------ EPA.09.Ex.21.6.nitrate.df # Sampling.Date Date Nitrate.mg.per.l.orig Nitrate.mg.per.l Censored #1 7/28/1999 1999-07-28 <5.0 5.0 TRUE #2 9/3/1999 1999-09-03 12.3 12.3 FALSE #3 11/24/1999 1999-11-24 <5.0 5.0 TRUE #4 5/3/2000 2000-05-03 <5.0 5.0 TRUE #5 7/14/2000 2000-07-14 8.1 8.1 FALSE #6 10/31/2000 2000-10-31 <5.0 5.0 TRUE #7 12/14/2000 2000-12-14 11 11.0 FALSE #8 3/27/2001 2001-03-27 35.1 35.1 FALSE #9 6/13/2001 2001-06-13 <5.0 5.0 TRUE #10 9/16/2001 2001-09-16 <5.0 5.0 TRUE #11 11/26/2001 2001-11-26 9.3 9.3 FALSE #12 3/2/2002 2002-03-02 10.3 10.3 FALSE # Determine what order statistic to use for the lower confidence limit # in order to achieve no more than 95% confidence. #--------------------------------------------------------------------- conf.levels <- ciNparConfLevel(n = 12, p = 0.95, lcl.rank = 1:12, ci.type = "lower") names(conf.levels) <- 1:12 round(conf.levels, 2) # 1 2 3 4 5 6 7 8 9 10 11 12 #1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.88 0.54 # Using the 11'th largest observation for the lower confidence limit # yields a confidence level of 88%. Using the 10'th largest # observation yields a confidence level of 98%. The example in # USEPA (2009) uses the 10'th largest observation. # # The 10'th largest observation is 11 mg/L which exceeds the # MCL of 10 mg/L, so there is evidence of contamination. #-------------------------------------------------------------------- with(EPA.09.Ex.21.6.nitrate.df, eqnpar(Nitrate.mg.per.l, p = 0.95, ci = TRUE, ci.type = "lower", lcl.rank = 10)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 95'th %ile = 22.56 # #Quantile Estimation Method: Nonparametric # #Data: Nitrate.mg.per.l # #Sample Size: 12 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: lower # #Confidence Level: 98.04317% # #Confidence Limit Rank(s): 10 # #Confidence Interval: LCL = 11 # UCL = Inf #========== # Clean up #--------- rm(conf.levels)
Compute the sample size necessary to achieve a specified confidence level for a nonparametric confidence interval for a quantile.
ciNparN(p = 0.5, lcl.rank = ifelse(ci.type == "upper", 0, 1), n.plus.one.minus.ucl.rank = ifelse(ci.type == "lower", 0, 1), ci.type = "two.sided", conf.level = 0.95)
p |
numeric vector of probabilities specifying the quantiles.
All values of |
lcl.rank , n.plus.one.minus.ucl.rank
|
numeric vectors of non-negative integers indicating the ranks of the
order statistics that are used for the lower and upper bounds of the
confidence interval for the specified quantile(s). When |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level
associated with the confidence interval(s). The default value is
|
If the arguments p
, lcl.rank
,
n.plus.one.minus.ucl.rank
and conf.level
are not all the
same length, they are replicated to be the
same length as the length of the longest argument.
The help file for eqnpar
explains how nonparametric confidence
intervals for quantiles are constructed and how the confidence level
associated with the confidence interval is computed based on specified values
for the sample size and the ranks of the order statistics used for
the bounds of the confidence interval.
The function ciNparN
determines the required sample size via
a nonlinear optimization.
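As a minimal sketch (assuming the default two-sided interval based on the sample minimum and maximum), the result should agree with the smallest sample size for which ciNparConfLevel meets the desired confidence level:

n <- 2:50
min(n[ciNparConfLevel(n = n, p = 0.9) >= 0.95])
#[1] 29   # same as ciNparN(p = 0.9, conf.level = 0.95)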
numeric vector of sample sizes.
See the help file for eqnpar
.
Steven P. Millard ([email protected])
See the help file for eqnpar
.
eqnpar
, ciNparConfLevel
,
plotCiNparDesign
.
# Look at how the required sample size for a confidence interval # increases with increasing confidence level for a fixed quantile: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 ciNparN(p = 0.9, conf.level=seq(0.5, 0.9, by = 0.1)) #[1] 7 9 12 16 22 #---------- # Look at how the required sample size for a confidence interval increases # as the quantile moves away from 0.5: ciNparN(p = seq(0.5, 0.9, by = 0.1)) #[1] 6 7 9 14 29
Create a table of confidence intervals for the mean of a normal distribution or the difference between two means, following Bacchetti (2010), by varying the estimated standard deviation and the estimated mean or difference between the two estimated means, given the sample size(s).
ciTableMean(n1 = 10, n2 = n1, diff.or.mean = 2:0, SD = 1:3, sample.type = "two.sample", ci.type = "two.sided", conf.level = 0.95, digits = 1)
n1 |
positive integer greater than 1 specifying the sample size when |
n2 |
positive integer greater than 1 specifying the sample size for group 2 when
|
diff.or.mean |
numeric vector indicating either the assumed difference between the two sample means
when |
SD |
numeric vector of positive values specifying the assumed estimated standard
deviation. The default value is |
sample.type |
character string specifying whether to create confidence intervals for the difference
between two means ( |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
positive integer indicating how many decimal places to display in the table. The
default value is |
Following Bacchetti (2010) (see NOTE below), the function ciTableMean
allows you to perform sensitivity analyses while planning future studies by
producing a table of confidence intervals for the mean or the difference
between two means by varying the estimated standard deviation and the
estimated mean or difference between the two estimated means given the
sample size(s).
One Sample Case (sample.type="one.sample"
)
Let x_1, x_2, …, x_n1 be a vector of n1 observations from a normal (Gaussian) distribution with parameters mean=μ and sd=σ.

The usual confidence interval for μ is constructed as follows. If ci.type="two-sided", the (1-α)100% confidence interval for μ is given by:

$$\left[ \hat{\mu} - t(n_1-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n_1}}, \;\; \hat{\mu} + t(n_1-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n_1}} \right] \quad\quad (1)$$

where

$$\hat{\mu} = \bar{x} = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i \quad\quad (2)$$

$$\hat{\sigma}^2 = s^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (x_i - \bar{x})^2 \quad\quad (3)$$

and t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom (Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992).

If ci.type="lower", the (1-α)100% confidence interval for μ is given by:

$$\left[ \hat{\mu} - t(n_1-1, 1-\alpha) \frac{\hat{\sigma}}{\sqrt{n_1}}, \;\; \infty \right] \quad\quad (4)$$

and if ci.type="upper", the confidence interval is given by:

$$\left[ -\infty, \;\; \hat{\mu} + t(n_1-1, 1-\alpha) \frac{\hat{\sigma}}{\sqrt{n_1}} \right] \quad\quad (5)$$

For the one-sample case, the argument n1 corresponds to n1 in Equation (1), the argument diff.or.mean corresponds to the estimated mean in Equation (2), and the argument SD corresponds to the estimated standard deviation in Equation (3).
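As a minimal sketch (inputs taken from the second example below), a single cell of the one-sample table can be reproduced by hand from Equation (1). For n1 = 15, an estimated mean of 5, and SD = 1:

5 + c(-1, 1) * qt(0.975, df = 15 - 1) * 1 / sqrt(15)
# about 4.45 and 5.55, displayed as [ 4.4, 5.6] in the table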
Two Sample Case (sample.type="two.sample"
)
Let x_11, x_12, …, x_1n1 be a vector of n1 observations from a normal (Gaussian) distribution with parameters mean=μ1 and sd=σ, and let x_21, x_22, …, x_2n2 be a vector of n2 observations from a normal (Gaussian) distribution with parameters mean=μ2 and sd=σ.

The usual confidence interval for the difference between the two population means μ1 - μ2 is constructed as follows. If ci.type="two-sided", the (1-α)100% confidence interval for μ1 - μ2 is given by:

$$\left[ (\hat{\mu}_1 - \hat{\mu}_2) - t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \;\; (\hat{\mu}_1 - \hat{\mu}_2) + t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right] \quad\quad (6)$$

where

$$\hat{\mu}_1 = \bar{x}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_{1i} \quad\quad (7)$$

$$\hat{\mu}_2 = \bar{x}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} x_{2i} \quad\quad (8)$$

$$\hat{\sigma}^2 = s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \quad\quad (9)$$

and t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom (Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992).

If ci.type="lower", the (1-α)100% confidence interval for μ1 - μ2 is given by:

$$\left[ (\hat{\mu}_1 - \hat{\mu}_2) - t(n_1+n_2-2, 1-\alpha) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \;\; \infty \right] \quad\quad (10)$$

and if ci.type="upper", the confidence interval is given by:

$$\left[ -\infty, \;\; (\hat{\mu}_1 - \hat{\mu}_2) + t(n_1+n_2-2, 1-\alpha) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right] \quad\quad (11)$$

For the two-sample case, the arguments n1 and n2 correspond to n1 and n2 in Equation (6), the argument diff.or.mean corresponds to the difference between the two estimated means defined in Equations (7) and (8), and the argument SD corresponds to the pooled estimate of the standard deviation in Equation (9).
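Likewise, a minimal sketch (inputs matching the default call shown in the first example below) reproduces one cell of the two-sample table from Equation (6). For n1 = n2 = 10, an estimated difference of 2, and SD = 2:

2 + c(-1, 1) * qt(0.975, df = 10 + 10 - 2) * 2 * sqrt(1/10 + 1/10)
# about 0.12 and 3.88, displayed as [ 0.1, 3.9] in the table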
a data frame with the rows varying the standard deviation and the columns varying the estimated mean or difference between the means. Elements of the data frame are character strings indicating the confidence intervals.
Bacchetti (2010) presents strong arguments against the current convention in scientific research for computing sample size that is based on formulas that use a fixed Type I error (usually 5%) and a fixed minimal power (often 80%) without regard to costs. He notes that a key input to these formulas is a measure of variability (usually a standard deviation) that is difficult to measure accurately "unless there is so much preliminary data that the study isn't really needed." Also, study designers often avoid defining what a scientifically meaningful difference is by presenting sample size results in terms of the effect size (i.e., the difference of interest divided by the elusive standard deviation). Bacchetti (2010) encourages study designers to use simple tables in a sensitivity analysis to see what results of a study may look like for low, moderate, and high rates of variability and large, intermediate, and no underlying differences in the populations or processes being studied.
Steven P. Millard ([email protected])
Bacchetti, P. (2010). Current sample size conventions: Flaws, Harms, and Alternatives. BMC Medicine 8, 17–23.
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
enorm
, t.test
, ciTableProp
,
ciNormHalfWidth
, ciNormN
,
plotCiNormDesign
.
# Show how potential confidence intervals for the difference between two means # will look assuming standard deviations of 1, 2, or 3, differences between # the two means of 2, 1, or 0, and a sample size of 10 in each group. ciTableMean() # Diff=2 Diff=1 Diff=0 #SD=1 [ 1.1, 2.9] [ 0.1, 1.9] [-0.9, 0.9] #SD=2 [ 0.1, 3.9] [-0.9, 2.9] [-1.9, 1.9] #SD=3 [-0.8, 4.8] [-1.8, 3.8] [-2.8, 2.8] #========== # Show how a potential confidence interval for a mean will look assuming # standard deviations of 1, 2, or 5, a sample mean of 5, 3, or 1, and # a sample size of 15. ciTableMean(n1 = 15, diff.or.mean = c(5, 3, 1), SD = c(1, 2, 5), sample.type = "one") # Mean=5 Mean=3 Mean=1 #SD=1 [ 4.4, 5.6] [ 2.4, 3.6] [ 0.4, 1.6] #SD=2 [ 3.9, 6.1] [ 1.9, 4.1] [-0.1, 2.1] #SD=5 [ 2.2, 7.8] [ 0.2, 5.8] [-1.8, 3.8] #========== # The data frame EPA.09.Ex.16.1.sulfate.df contains sulfate concentrations # (ppm) at one background and one downgradient well. The estimated # mean and standard deviation for the background well are 536 and 27 ppm, # respectively, based on a sample size of n = 8 quarterly samples taken over # 2 years. A two-sided 95% confidence interval for this mean is [514, 559], # which has a half-width of 23 ppm. # # The estimated mean and standard deviation for the downgradient well are # 608 and 18 ppm, respectively, based on a sample size of n = 6 quarterly # samples. A two-sided 95% confidence interval for the difference between # this mean and the background mean is [44, 100] ppm. # # Suppose we want to design a future sampling program and are interested in # the size of the confidence interval for the difference between the two means. # We will use ciTableMean to generate a table of possible confidence intervals # by varying the assumed standard deviation and assumed differences between # the means. # Look at the data #----------------- EPA.09.Ex.16.1.sulfate.df # Month Year Well.type Sulfate.ppm #1 Jan 1995 Background 560 #2 Apr 1995 Background 530 #3 Jul 1995 Background 570 #4 Oct 1995 Background 490 #5 Jan 1996 Background 510 #6 Apr 1996 Background 550 #7 Jul 1996 Background 550 #8 Oct 1996 Background 530 #9 Jan 1995 Downgradient NA #10 Apr 1995 Downgradient NA #11 Jul 1995 Downgradient 600 #12 Oct 1995 Downgradient 590 #13 Jan 1996 Downgradient 590 #14 Apr 1996 Downgradient 630 #15 Jul 1996 Downgradient 610 #16 Oct 1996 Downgradient 630 # Compute the estimated mean and standard deviation for the # background well. #----------------------------------------------------------- Sulfate.back <- with(EPA.09.Ex.16.1.sulfate.df, Sulfate.ppm[Well.type == "Background"]) enorm(Sulfate.back, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 536.2500 # sd = 26.6927 # #Estimation Method: mvue # #Data: Sulfate.back # #Sample Size: 8 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 513.9343 # UCL = 558.5657 # Compute the estimated mean and standard deviation for the # downgradient well. 
#---------------------------------------------------------- Sulfate.down <- with(EPA.09.Ex.16.1.sulfate.df, Sulfate.ppm[Well.type == "Downgradient"]) enorm(Sulfate.down, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 608.33333 # sd = 18.34848 # #Estimation Method: mvue # #Data: Sulfate.down # #Sample Size: 6 # #Number NA/NaN/Inf's: 2 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 589.0778 # UCL = 627.5889 # Compute the estimated difference between the means and the confidence # interval for the difference: #---------------------------------------------------------------------- t.test(Sulfate.down, Sulfate.back, var.equal = TRUE) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: difference in means = 0 # #Alternative Hypothesis: True difference in means is not equal to 0 # #Test Name: Two Sample t-test # #Estimated Parameter(s): mean of x = 608.3333 # mean of y = 536.2500 # #Data: Sulfate.down and Sulfate.back # #Test Statistic: t = 5.660985 # #Test Statistic Parameter: df = 12 # #P-value: 0.0001054306 # #95% Confidence Interval: LCL = 44.33974 # UCL = 99.82693 # Use ciTableMean to look how the confidence interval for the difference # between the background and downgradient means in a future study using eight # quarterly samples at each well varies with assumed value of the pooled standard # deviation and the observed difference between the sample means. #-------------------------------------------------------------------------------- # Our current estimate of the pooled standard deviation is 24 ppm: summary(lm(Sulfate.ppm ~ Well.type, data = EPA.09.Ex.16.1.sulfate.df))$sigma #[1] 23.57759 # We can see that if this is overly optimistic and in our next study the # pooled standard deviation is around 50 ppm, then if the observed difference # between the means is 50 ppm, the lower end of the confidence interval for # the difference between the two means will include 0, so we may want to # increase our sample size. ciTableMean(n1 = 8, n2 = 8, diff = c(100, 50, 0), SD = c(15, 25, 50), digits = 0) # Diff=100 Diff=50 Diff=0 #SD=15 [ 84, 116] [ 34, 66] [-16, 16] #SD=25 [ 73, 127] [ 23, 77] [-27, 27] #SD=50 [ 46, 154] [ -4, 104] [-54, 54] #========== # Clean up #--------- rm(Sulfate.back, Sulfate.down)
Create a table of confidence intervals for the probability of "success" for a binomial distribution, or for the difference between two proportions, following Bacchetti (2010), by varying the estimated proportion or the difference between the two estimated proportions given the sample size(s).
ciTableProp(n1 = 10, p1.hat = c(0.1, 0.2, 0.3), n2 = n1, p2.hat.minus.p1.hat = c(0.2, 0.1, 0), sample.type = "two.sample", ci.type = "two.sided", conf.level = 0.95, digits = 2, ci.method = "score", correct = TRUE, tol = 10^-(digits + 1))
n1 |
positive integer greater than 1 specifying the sample size when |
p1.hat |
numeric vector of values between 0 and 1 indicating the estimated proportion
( |
n2 |
positive integer greater than 1 specifying the sample size for group 2 when
|
p2.hat.minus.p1.hat |
numeric vector indicating the assumed difference between the two sample proportions
when |
sample.type |
character string specifying whether to create confidence intervals for the difference
between two proportions ( |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
positive integer indicating how many decimal places to display in the table. The
default value is |
ci.method |
character string indicating the method to use to construct the confidence interval.
The default value is |
correct |
logical scalar indicating whether to use the correction for continuity when |
tol |
numeric scalar indicating how close the values of the adjusted elements of |
One-Sample Case (sample.type="one.sample"
)
For the one-sample case, the function ciTableProp
calls the R function
prop.test
when ci.method="score"
, and calls the R function
binom.test
, when ci.method="exact"
. To ensure that the
user-supplied values of p1.hat
are valid for the given user-supplied values
of n1
, values for the argument x
to the function
prop.test
or binom.test
are computed using the formula
x <- unique(round((p1.hat * n1), 0))
and the argument p1.hat
is then adjusted using the formula
p1.hat <- x/n1
Two-Sample Case (sample.type="two.sample"
)
For the two-sample case, the function ciTableProp
calls the R function
prop.test
. To ensure that the user-supplied values of p1.hat
are valid for the given user-supplied values of n1
, the values for the
first component of the argument x
to the function
prop.test
are computed using the formula
x1 <- unique(round((p1.hat * n1), 0))
and the argument p1.hat
is then adjusted using the formula
p1.hat <- x1/n1
Next, the estimated proportions from group 2 are computed by adding together all
possible combinations from the elements of p1.hat
and
p2.hat.minus.p1.hat
. These estimated proportions from group 2 are then
adjusted using the formulas:
x2.rep <- round((p2.hat.rep * n2), 0)
p2.hat.rep <- x2.rep/n2
If any of these adjusted proportions from group 2 are less than 0 or greater than 1,
the function terminates with a message indicating that impossible
values have been supplied.
In cases where the sample sizes are small there may be instances where the
user-supplied values of p1.hat
and/or p2.hat.minus.p1.hat
are not
attainable. The argument tol
is used to determine whether to return
the table in conventional form or whether it is necessary to modify the table
to include twice as many columns (see EXAMPLES section below).
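As a small illustration of the adjustment step described above (a minimal sketch, not part of ciTableProp; the values are the ones used in the small-sample example in the EXAMPLES section):

n1 <- 5
p1.hat <- c(0.3, 0.6, 0.12)
x1 <- unique(round(p1.hat * n1, 0))  # attainable counts of "successes": 2, 3, 1
x1 / n1                              # adjusted proportions: 0.4, 0.6, 0.2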
a data frame with elements that are character strings indicating the confidence intervals.
When sample.type="two.sample"
, a data frame with the rows varying
the estimated proportion for group 1 (i.e., the values of p1.hat
) and
the columns varying the estimated difference between the proportions from
group 2 and group 1 (i.e., the values of p2.hat.minus.p1.hat
). In cases
where the sample sizes are small, it may not be possible to obtain certain
differences for given values of p1.hat
, in which case the returned
data frame contains twice as many columns indicating the actual difference
in one column and the compute confidence interval next to it (see EXAMPLES
section below).
When sample.type="one.sample"
, a 1-row data frame with the columns
varying the estimated proportion (i.e., the values of p1.hat
).
Bacchetti (2010) presents strong arguments against the current convention in scientific research for computing sample size that is based on formulas that use a fixed Type I error (usually 5%) and a fixed minimal power (often 80%) without regard to costs. He notes that a key input to these formulas is a measure of variability (usually a standard deviation) that is difficult to measure accurately "unless there is so much preliminary data that the study isn't really needed." Also, study designers often avoid defining what a scientifically meaningful difference is by presenting sample size results in terms of the effect size (i.e., the difference of interest divided by the elusive standard deviation). Bacchetti (2010) encourages study designers to use simple tables in a sensitivity analysis to see what results of a study may look like for low, moderate, and high rates of variability and large, intermediate, and no underlying differences in the populations or processes being studied.
Steven P. Millard ([email protected])
Bacchetti, P. (2010). Current sample size conventions: Flaws, Harms, and Alternatives. BMC Medicine 8, 17–23.
Also see the references in the help files for prop.test and binom.test.
prop.test, binom.test, ciTableMean, ciBinomHalfWidth, ciBinomN, plotCiBinomDesign.
# Reproduce Table 1 in Bacchetti (2010). This involves planning a study with # n1 = n2 = 935 subjects per group, where Group 1 is the control group and # Group 2 is the treatment group. The outcome in the study is proportion of # subjects with serious outcomes or death. A negative value for the difference # in proportions between groups (Group 2 proportion - Group 1 proportion) # indicates the treatment group has a better outcome. In this table, the # proportion of subjects in Group 1 with serious outcomes or death is set # to 3%, 6.5%, and 12%, and the difference in proportions between the two # groups is set to -2.8 percentage points, -1.4 percentage points, and 0. ciTableProp(n1 = 935, p1.hat = c(0.03, 0.065, 0.12), n2 = 935, p2.hat.minus.p1.hat = c(-0.028, -0.014, 0), digits = 3) # Diff=-0.028 Diff=-0.014 Diff=0 #P1.hat=0.030 [-0.040, -0.015] [-0.029, 0.001] [-0.015, 0.015] #P1.hat=0.065 [-0.049, -0.007] [-0.036, 0.008] [-0.022, 0.022] #P1.hat=0.120 [-0.057, 0.001] [-0.044, 0.016] [-0.029, 0.029] #========== # Show how the returned data frame has to be modified for cases of small # sample sizes where not all user-supplied differenes are possible. ciTableProp(n1 = 5, n2 = 5, p1.hat = c(0.3, 0.6, 0.12), p2.hat = c(0.2, 0.1, 0)) # Diff CI Diff CI Diff CI #P1.hat=0.4 0.2 [-0.61, 1.00] 0.0 [-0.61, 0.61] 0 [-0.61, 0.61] #P1.hat=0.6 0.2 [-0.55, 0.95] 0.2 [-0.55, 0.95] 0 [-0.61, 0.61] #P1.hat=0.2 0.2 [-0.55, 0.95] 0.2 [-0.55, 0.95] 0 [-0.50, 0.50] #========== # Suppose we are planning a study to compare the proportion of nondetects at # a background and downgradient well, and we can use ciTableProp to look how # the confidence interval for the difference between the two proportions using # say 36 quarterly samples at each well varies with the observed estimated # proportions. Here we'll let the argument "p1.hat" denote the proportion of # nondetects observed at the downgradient well and set this equal to # 20%, 40% and 60%. The argument "p2.hat.minus.p1.hat" represents the proportion # of nondetects at the background well minus the proportion of nondetects at the # downgradient well. ciTableProp(n1 = 36, p1.hat = c(0.2, 0.4, 0.6), n2 = 36, p2.hat.minus.p1.hat = c(0.3, 0.15, 0)) # Diff=0.31 Diff=0.14 Diff=0 #P1.hat=0.19 [ 0.07, 0.54] [-0.09, 0.37] [-0.18, 0.18] #P1.hat=0.39 [ 0.06, 0.55] [-0.12, 0.39] [-0.23, 0.23] #P1.hat=0.61 [ 0.09, 0.52] [-0.10, 0.38] [-0.23, 0.23] # We see that even if the observed difference in the proportion of nondetects # is about 15 percentage points, all of the confidence intervals for the # difference between the proportions of nondetects at the two wells contain 0, # so if a difference of 15 percentage points is important to substantiate, we # may need to increase our sample sizes.
Compute the sample coefficient of variation.
cv(x, method = "moments", sd.method = "sqrt.unbiased", l.moment.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), na.rm = FALSE)
x |
numeric vector of observations. |
method |
character string specifying what method to use to compute the sample coefficient
of variation. The possible values are |
sd.method |
character string specifying what method to use to compute the sample standard
deviation when |
l.moment.method |
character string specifying what method to use to compute the
|
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for
the plotting positions when |
na.rm |
logical scalar indicating whether to remove missing values from |
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a random sample of $n$ observations from
some distribution with mean $\mu$ and standard deviation $\sigma$.
Product Moment Coefficient of Variation (method="moments"
)
The coefficient of variation (sometimes denoted CV) of a distribution is
defined as the ratio of the standard deviation to the mean. That is:

$$CV = \frac{\sigma}{\mu} \;\;\;\;\;\; (1)$$
The coefficient of variation measures how spread out the distribution is relative to the size of the mean. It is usually used to characterize positive, right-skewed distributions such as the lognormal distribution.
When sd.method="sqrt.unbiased"
, the coefficient of variation is estimated
using the sample mean and the square root of the unbaised estimator of variance:
where
Note that the estimator of standard deviation in equation (4) is not unbiased.
When sd.method="moments"
, the coefficient of variation is estimated using
the sample mean and the square root of the method of moments estimator of variance:
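As a minimal sketch, the two product-moment estimates can be computed by hand and compared with cv() (the data vector x below is hypothetical):

x <- c(2, 4, 7, 11)                             # hypothetical data
x.bar <- mean(x)
s <- sqrt(sum((x - x.bar)^2) / (length(x) - 1)) # square root of unbiased variance, equation (4)
s.m <- sqrt(sum((x - x.bar)^2) / length(x))     # method of moments estimator, equation (6)
c(s / x.bar, cv(x))                             # agree: default is sd.method = "sqrt.unbiased"
c(s.m / x.bar, cv(x, sd.method = "moments"))    # agree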
L-Moment Coefficient of Variation (method="l.moments"
)
Hosking (1990) defines an $L$-moment analog of the coefficient of variation
(denoted the $L$-CV) as:

$$\tau = \frac{\lambda_2}{\lambda_1} \;\;\;\;\;\; (7)$$

that is, the second $L$-moment divided by the first $L$-moment.
He shows that for a positive-valued random variable, the $L$-CV lies in the
interval (0, 1).

When l.moment.method="unbiased", the $L$-CV is estimated by:

$$t = \frac{l_2}{l_1} \;\;\;\;\;\; (8)$$

that is, the unbiased estimator of the second $L$-moment divided by
the unbiased estimator of the first $L$-moment.

When l.moment.method="plotting.position", the $L$-CV is estimated by:

$$\tilde{t} = \frac{\tilde{l}_2}{\tilde{l}_1} \;\;\;\;\;\; (9)$$

that is, the plotting-position estimator of the second $L$-moment divided by
the plotting-position estimator of the first $L$-moment.
See the help file for lMoment
for more information on
estimating $L$-moments.
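As a rough numerical check (a sketch, assuming the EnvStats function lMoment and a hypothetical data vector x), the $L$-CV estimate returned by cv() with method="l.moments" equals the ratio of the first two sample $L$-moments:

x <- c(2, 4, 7, 11)                            # hypothetical data
lMoment(x, order = 2) / lMoment(x, order = 1)  # second sample L-moment over the first
cv(x, method = "l.moments")                    # same value (default l.moment.method = "unbiased")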
A numeric scalar – the sample coefficient of variation.
Traditionally, the coefficient of variation has been estimated using
product moment estimators. Hosking (1990) introduced the idea of
$L$-moments and the
$L$-CV. Vogel and Fennessey (1993) argue that
$L$-moment ratios should replace product moment ratios because of their
superior performance (they are nearly unbiased and better for discriminating
between distributions).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Vogel, R.M., and N.M. Fennessey. (1993). L Moment Diagrams Should Replace
Product Moment Diagrams. Water Resources Research 29(6), 1745–1752.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
Summary Statistics, summaryFull, var, sd, skewness, kurtosis.
# Generate 20 observations from a lognormal distribution with # parameters mean=10 and cv=1, and estimate the coefficient of variation. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 1) cv(dat) #[1] 0.5077981 cv(dat, sd.method = "moments") #[1] 0.4949403 cv(dat, method = "l.moments") #[1] 0.2804148 #---------- # Clean up rm(dat)
Determine the detection limit based on using a calibration line (or curve) and inverse regression.
detectionLimitCalibrate(object, coverage = 0.99, simultaneous = TRUE)
object |
an object of class |
coverage |
optional numeric scalar between 0 and 1 indicating the confidence level associated with
the prediction intervals used in determining the detection limit.
The default value is |
simultaneous |
optional logical scalar indicating whether to base the prediction intervals on
simultaneous or non-simultaneous prediction limits. The default value is |
The idea of a decision limit and detection limit is directly related to calibration and
can be framed in terms of a hypothesis test, as shown in the table below.
The null hypothesis is that the chemical is not present in the physical sample, i.e.,
$H_0: C = 0$, where $C$ denotes the concentration.
Your Decision | $H_0$ True ($C = 0$) | $H_0$ False ($C > 0$) |
---|---|---|
Reject $H_0$ (Declare Chemical Present) | Type I Error (Probability = $\alpha$) | |
Do Not Reject $H_0$ (Declare Chemical Absent) | | Type II Error (Probability = $\beta$) |
Ideally, you would like to minimize both the Type I and Type II error rates.
Just as we use critical values to compare against the test statistic for a hypothesis test,
we need to use a critical signal level $S_D$ called the decision limit to decide
whether the chemical is present or absent. If the signal is less than or equal to $S_D$
we will declare the chemical is absent, and if the signal is greater than $S_D$ we will
declare the chemical is present.
First, suppose no chemical is present (i.e., the null hypothesis is true).
If we want to guard against the mistake of declaring that the chemical is present when in fact it is
absent (Type I error), then we should choose $S_D$ so that the probability of this happening is
some small value $\alpha$. Thus, the value of $S_D$ depends on what we want to use for $\alpha$
(the Type I error rate), and the true (but unknown) value of $\sigma$
(the standard deviation of the errors assuming a constant standard deviation)
(Massart et al., 1988, p. 111).
When the true concentration is 0, the decision limit is the $(1-\alpha)100$th percentile of the
distribution of the signal S. Note that the decision limit $S_D$ is on the scale of, and in units
of, the signal S.
Now suppose that in fact the chemical is present in some concentration C
(i.e., the null hypothesis is false). If we want to guard against the mistake of
declaring that the chemical is absent when in fact it is present (Type II error),
then we need to determine a minimal concentration called the detection limit (DL)
that we know will yield a signal less than the decision limit $S_D$ only a small fraction of the
time ($\beta$).
In practice we do not know the true value of the standard deviation of the errors ($\sigma$),
so we cannot compute the true decision limit. Also, we do not know the true values of the
intercept and slope of the calibration line, so we cannot compute the true detection limit.
Instead, we usually set $\beta = \alpha$ and estimate the decision and detection limits
by computing prediction limits for the calibration line and using inverse regression.
The estimated detection limit corresponds to the upper confidence bound on concentration given that the signal is equal to the estimated decision limit. Currie (1997) discusses other ways to define the detection limit, and Glaser et al. (1981) define a quantity called the method detection limit.
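A rough sketch of this idea, using the calibrate/predict/pointwise workflow shown in the EXAMPLES section (an approximation for illustration only, not the exact algorithm used by detectionLimitCalibrate):

calfit <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df)
conc <- seq(0, 20, by = 0.01)
pred <- predict(calfit, newdata = data.frame(Spike = conc), se.fit = TRUE)
pl <- pointwise(pred, coverage = 0.99, individual = TRUE)
S.D <- pl$upper[conc == 0]        # approximate decision limit (signal scale)
DL <- min(conc[pl$lower >= S.D])  # smallest concentration whose lower prediction
                                  # bound exceeds the decision limit
c(S.D, DL)                        # compare with the values in the EXAMPLES section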
A numeric vector of length 2 indicating the signal detection limit and the concentration
detection limit. This vector has two attributes called coverage
and simultaneous
indicating the values of these arguments that were used in the
call to detectionLimitCalibrate
.
Perhaps no other topic in environmental statistics has generated as much confusion or controversy as the topic of detection limits. After decades of disparate terminology, ISO and IUPAC provided harmonized guidance on the topic in 1995 (Currie, 1997). Intuitively, the idea of a detection limit is simple to grasp: the detection limit is “the smallest amount or concentration of a particular substance that can be reliably detected in a given type of sample or medium by a specific measurement process” (Currie, 1997, p. 152). Unfortunately, because of the exceedingly complex nature of measuring chemical concentrations, this simple idea is difficult to apply in practice.
Detection and quantification capabilities are fundamental performance characteristics of the Chemical Measurement Process (CMP) (Currie, 1996, 1997). In this help file we discuss some currently accepted definitions of the terms decision, detection, and quantification limits. For more details, the reader should consult the references listed in this help file.
The quantification limit is defined as the concentration C at which the coefficient of variation (also called relative standard deviation or RSD) for the distribution of the signal S is some small value, usually taken to be 10% (Currie, 1968, 1997). In practice the quantification limit is difficult to estimate because we have to estimate both the mean and the standard deviation of the signal S for any particular concentration, and usually the standard deviation varies with concentration. Variations of the quantification limit include the quantitation limit (Keith, 1991, p. 109), minimum level (USEPA, 1993), and alternative minimum level (Gibbons et al., 1997a).
Steven P. Millard ([email protected])
Clark, M.J.R., and P.H. Whitfield. (1994). Conflicting Perspectives About Detection Limits and About the Censoring of Environmental Data. Water Resources Bulletin 30(6), 1063–1079.
Clayton, C.A., J.W. Hines, and P.D. Elkins. (1987). Detection Limits with Specified Assurance Probabilities. Analytical Chemistry 59, 2506–2514.
Code of Federal Regulations. (1996). Definition and Procedure for the Determination of the Method Detection Limit–Revision 1.11. Title 40, Part 136, Appendix B, 7-1-96 Edition, pp.265–267.
Currie, L.A. (1968). Limits for Qualitative Detection and Quantitative Determination: Application to Radiochemistry. Analytical Chemistry 40, 586–593.
Currie, L.A. (1988). Detection in Analytical Chemistry: Importance, Theory, and Practice. American Chemical Society, Washington, D.C.
Currie, L.A. (1995). Nomenclature in Evaluation of Analytical Methods Including Detection and Quantification Capabilities. Pure & Applied Chemistry 67(10), 1699-1723.
Currie, L.A. (1996). Foundations and Future of Detection and Quantification Limits. Proceedings of the Section on Statistics and the Environment, American Statistical Association, Alexandria, VA.
Currie, L.A. (1997). Detection: International Update, and Some Emerging Di-Lemmas Involving Calibration, the Blank, and Multiple Detection Decisions. Chemometrics and Intelligent Laboratory Systems 37, 151-181.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B. (1997). Challenges in Regulatory Environmetrics. Chemometrics and Intelligent Laboratory Systems 37, 43–53.
Gibbons, R.D. (1995). Some Statistical and Conceptual Issues in the Detection of Low-Level Environmental Pollutants (with Discussion). Environmetrics 2, 125-167.
Gibbons, R.D., D.E. Coleman, and R.F. Maddalone. (1997a). An Alternative Minimum Level Definition for Analytical Quantification. Environmental Science & Technology 31(7), 2071–2077. Comments and Discussion in Volume 31(12), 3727–3731, and Volume 32(15), 2346–2353.
Gibbons, R.D., D.E. Coleman, and R.F. Maddalone. (1997b). Response to Comment on “An Alternative Minimum Level Definition for Analytical Quantification”. Environmental Science and Technology 31(12), 3729–3731.
Gibbons, R.D., D.E. Coleman, and R.F. Maddalone. (1998). Response to Comment on “An Alternative Minimum Level Definition for Analytical Quantification”. Environmental Science and Technology 32(15), 2349–2353.
Gibbons, R.D., N.E. Grams, F.H. Jarke, and K.P. Stoub. (1992). Practical Quantitation Limits. Chemometrics Intelligent Laboratory Systems 12, 225–235.
Gibbons, R.D., F.H. Jarke, and K.P. Stoub. (1991). Detection Limits: For Linear Calibration Curves with Increasing Variance and Multiple Future Detection Decisions. In Tatsch, D.E., editor. Waste Testing and Quality Assurance: Volume 3. American Society for Testing and Materials, Philadelphia, PA.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley & Sons, Hoboken. Chapter 6, p. 111.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey. Chapter 3, p. 22.
Glaser, J.A., D.L. Foerst, G.D. McKee, S.A. Quave, and W.L. Budde. (1981). Trace Analyses for Wastewaters. Environmental Science and Technology 15, 1426–1435.
Hubaux, A., and G. Vos. (1970). Decision and Detection Limits for Linear Calibration Curves. Analytical Chemistry 42, 849–855.
Kahn, H.D., C.E. White, K. Stralka, and R. Kuznetsovski. (1997). Alternative Estimates of Detection. Proceedings of the Twentieth Annual EPA Conference on Analysis of Pollutants in the Environment, May 7-8, Norfolk, VA. U.S. Environmental Protection Agency, Washington, D.C.
Kahn, H.D., W.A. Telliard, and C.E. White. (1998). Comment on “An Alternative Minimum Level Definition for Analytical Quantification” (with Response). Environmental Science & Technology 32(5), 2346–2353.
Kaiser, H. (1965). Zum Problem der Nachweisgrenze. Fresenius' Z. Anal. Chem. 209, 1.
Keith, L.H. (1991). Environmental Sampling and Analysis: A Practical Guide. Lewis Publishers, Boca Raton, FL, Chapter 10.
Kimbrough, D.E. (1997). Comment on “An Alternative Minimum Level Definition for Analytical Quantification” (with Response). Environmental Science & Technology 31(12), 3727–3731.
Lambert, D., B. Peterson, and I. Terpenning. (1991). Nondetects, Detection Limits, and the Probability of Detection. Journal of the American Statistical Association 86(414), 266–277.
Massart, D.L., B.G.M. Vandeginste, S.N. Deming, Y. Michotte, and L. Kaufman. (1988). Chemometrics: A Textbook. Elsevier, New York, Chapter 7.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Porter, P.S., R.C. Ward, and H.F. Bell. (1988). The Detection Limit. Environmental Science & Technology 22(8), 856–861.
Rocke, D.M., and S. Lorenzato. (1995). A Two-Component Model for Measurement Error in Analytical Chemistry. Technometrics 37(2), 176–184.
Singh, A. (1993). Multivariate Decision and Detection Limits. Analytica Chimica Acta 277, 205-214.
Spiegelman, C.H. (1997). A Discussion of Issues Raised by Lloyd Currie and a Cross Disciplinary View of Detection Limits and Estimating Parameters That Are Often At or Near Zero. Chemometrics and Intelligent Laboratory Systems 37, 183–188.
USEPA. (1987c). List (Phase 1) of Hazardous Constituents for Ground-Water Monitoring; Final Rule. Federal Register 52(131), 25942–25953 (July 9, 1987).
Zorn, M.E., R.D. Gibbons, and W.C. Sonzogni. (1997). Weighted Least-Squares Approach to Calculating Limits of Detection and Quantification by Modeling Variability as a Function of Concentration. Analytical Chemistry 69, 3069–3075.
calibrate, inversePredictCalibrate, pointwise.
# The data frame EPA.97.cadmium.111.df contains calibration # data for cadmium at mass 111 (ng/L) that appeared in # Gibbons et al. (1997b) and were provided to them by the U.S. EPA. # # The Example section in the help file for calibrate shows how to # plot these data along with the fitted calibration line and 99% # non-simultaneous prediction limits. # # For the current example, we will compute the decision limit (7.68) # and detection limit (12.36 ng/L) based on using alpha = beta = 0.01 # and a linear calibration line with constant variance. See # Millard and Neerchal (2001, pp.566-575) for more details on this # example. calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) detectionLimitCalibrate(calibrate.list, simultaneous = FALSE) # Decision Limit (Signal) Detection Limit (Concentration) # 7.677842 12.364670 #attr(,"coverage") #[1] 0.99 #attr(,"simultaneous") #[1] FALSE #---------- # Clean up #--------- rm(calibrate.list)
Perform a series of goodness-of-fit tests from a (possibly user-specified) set of candidate probability distributions to determine which probability distribution provides the best fit for a data set.
distChoose(y, ...) ## S3 method for class 'formula' distChoose(y, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: distChoose(y, alpha = 0.05, method = "sw", choices = c("norm", "gamma", "lnorm"), est.arg.list = NULL, warn = TRUE, keep.data = TRUE, data.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...)
y |
an object containing data for the goodness-of-fit tests. In the default
method, the argument |
data |
specifies an optional data frame, list or environment (or object coercible
by |
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
alpha |
numeric scalar between 0 and 1 specifying the Type I error associated with each
goodness-of-fit test. When |
method |
character string defining which method to use. Possible values are:
"sw" (Shapiro-Wilk; the default), "sf" (Shapiro-Francia), "ppcc" (Probability Plot Correlation Coefficient), and "proucl".
See the DETAILS section below. |
choices |
a character vector denoting the distribution abbreviations of the candidate
distributions. See the help file for This argument is ignored when |
est.arg.list |
a list containing one or more lists of arguments to be passed to the
function(s) estimating the distribution parameters. The name(s) of
the components of the list must be equal to or a subset of the values of the
argument When testing for some form of normality (i.e., Normal, Lognormal, Three-Parameter Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta)), the estimated parameters are provided in the output merely for information, and the choice of the method of estimation has no effect on the goodness-of-fit test statistics or p-values. This argument is ignored when |
warn |
logical scalar indicating whether to print a warning message when
observations with |
keep.data |
logical scalar indicating whether to return the original data. The
default value is |
data.name |
optional character string indicating the name of the data used for argument |
parent.of.data |
character string indicating the source of the data used for the goodness-of-fit test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the goodness-of-fit test. |
The function distChoose
returns a list with information on the goodness-of-fit
tests for various distributions and which distribution appears to best fit the
data based on the p-values from the goodness-of-fit tests. This function was written in
order to compare ProUCL's way of choosing the best-fitting distribution (USEPA, 2015) with
other ways of choosing the best-fitting distribution.
Method Based on Shapiro-Wilk, Shapiro-Francia, or Probability Plot Correlation Test
(method="sw"
, method="sf"
, or method="ppcc"
)
For each value of the argument choices
, the function distChoose
runs the goodness-of-fit test using the data in y
assuming that particular
distribution. For example, if choices=c("norm", "gamma", "lnorm")
,
indicating the Normal, Gamma, and Lognormal distributions, and
method="sw"
, then the usual Shapiro-Wilk test is performed for the Normal
and Lognormal distributions, and the extension of the Shapiro-Wilk test is performed
for the Gamma distribution (see the section
Testing Goodness-of-Fit for Any Continuous Distribution in the help
file for gofTest
for an explanation of the latter). The distribution associated
with the largest p-value is the chosen distribution. In the case when all p-values are
less than the value of the argument alpha
, the distribution “Nonparametric” is chosen.
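A condensed sketch of this "largest p-value wins" rule in terms of gofTest (the helper name pickBySW is hypothetical; distChoose does this bookkeeping, along with parameter estimation, internally):

pickBySW <- function(y, choices = c("norm", "gamma", "lnorm"), alpha = 0.05) {
  # Shapiro-Wilk p-value (or its extension) for each candidate distribution
  p <- sapply(choices, function(d) gofTest(y, distribution = d, test = "sw")$p.value)
  if (all(p < alpha)) "Nonparametric" else choices[which.max(p)]
}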
Method Based on ProUCL Algorithm (method="proucl"
)
When method="proucl"
, the function distChoose
uses the
algorithm that ProUCL (USEPA, 2015) uses to determine the best fitting
distribution. The candidate distributions are the
Normal, Gamma, and Lognormal distributions. The algorithm
used by ProUCL is as follows:
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the
Normal distribution, i.e., call the function gofTest
with
distribution = "norm", test="sw"
and distribution = "norm", test="lillie"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Normal distribution. Otherwise, proceed to the next step.
Perform the “ProUCL Anderson-Darling” and “ProUCL Kolmogorov-Smirnov” goodness-of-fit
tests for the Gamma distribution,
i.e., call the function gofTest
with distribution="gamma", test="proucl.ad.gamma"
and distribution="gamma", test="proucl.ks.gamma"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Gamma distribution. Otherwise, proceed to the next step.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the
Lognormal distribution, i.e., call the function gofTest
with
distribution="lnorm", test="sw"
and distribution="lnorm", test="lillie"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Lognormal distribution. Otherwise, proceed to the next step.
If none of the goodness-of-fit tests above yields a p-value greater than or equal to the user-supplied value
of alpha
, then choose the “Nonparametric” distribution.
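A minimal sketch of this cascade in terms of gofTest (the helper name prouclChoose is hypothetical, and the handling of the ProUCL gamma tests, whose p-values may only be reported as bounds such as ">= 0.10", is simplified here; distChoose(method = "proucl") performs these steps for you):

prouclChoose <- function(y, alpha = 0.05) {
  # Step 1: Shapiro-Wilk and Lilliefors tests for normality
  p.norm <- c(gofTest(y, distribution = "norm", test = "sw")$p.value,
              gofTest(y, distribution = "norm", test = "lillie")$p.value)
  if (any(p.norm >= alpha)) return("Normal")
  # Step 2: ProUCL Anderson-Darling and Kolmogorov-Smirnov tests for the Gamma
  p.gam <- c(gofTest(y, distribution = "gamma", test = "proucl.ad.gamma")$p.value,
             gofTest(y, distribution = "gamma", test = "proucl.ks.gamma")$p.value)
  if (any(p.gam >= alpha)) return("Gamma")
  # Step 3: Shapiro-Wilk and Lilliefors tests for lognormality
  p.lnorm <- c(gofTest(y, distribution = "lnorm", test = "sw")$p.value,
               gofTest(y, distribution = "lnorm", test = "lillie")$p.value)
  if (any(p.lnorm >= alpha)) return("Lognormal")
  "Nonparametric"
}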
a list of class "distChoose"
containing the results of the goodness-of-fit tests.
Objects of class "distChoose"
have a special printing method.
See the help files for distChoose.object
for details.
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlot
).
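For example, a Q-Q plot of the data against a fitted candidate distribution can be drawn with qqPlot (a sketch; dat is a hypothetical data vector):

qqPlot(dat, distribution = "gamma", estimate.params = TRUE, add.line = TRUE)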
Steven P. Millard ([email protected])
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of $g_1$.
Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality.
Empirical Results for the Distributions of $b_2$ and $\sqrt{b_1}$.
Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of $\sqrt{b_1}$.
Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.
Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari 4, 83-91.
Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). doi:10.18637/jss.v008.i18.
Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, pp.63-95.
Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.
Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.
Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.
Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
gofTest, distChoose.object, print.distChoose.
# Generate 20 observations from a gamma distribution with # parameters shape = 2 and scale = 3 and: # # 1) Call distChoose using the Shapiro-Wilk method. # # 2) Call distChoose using the Shapiro-Wilk method and specify # the bias-corrected method of estimating shape for the Gamma # distribution. # # 3) Compare the results in 2) above with the results using the # ProUCL method. # # Notes: The call to set.seed lets you reproduce this example. # # The ProUCL method chooses the Normal distribution, whereas the # Shapiro-Wilk method chooses the Gamma distribution. set.seed(47) dat <- rgamma(20, shape = 2, scale = 3) # 1) Call distChoose using the Shapiro-Wilk method. #-------------------------------------------------- distChoose(dat) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: MLE # #Data: dat # #Sample Size: 20 # #Test Results: # # Normal # Test Statistic: W = 0.9097488 # P-value: 0.06303695 # # Gamma # Test Statistic: W = 0.9834958 # P-value: 0.970903 # # Lognormal # Test Statistic: W = 0.9185006 # P-value: 0.09271768 #-------------------- # 2) Call distChoose using the Shapiro-Wilk method and specify # the bias-corrected method of estimating shape for the Gamma # distribution. #--------------------------------------------------------------- distChoose(dat, method = "sw", est.arg.list = list(gamma = list(method = "bcmle"))) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 1.656376 # scale = 4.676680 # #Estimation Method: Bias-Corrected MLE # #Data: dat # #Sample Size: 20 # #Test Results: # # Normal # Test Statistic: W = 0.9097488 # P-value: 0.06303695 # # Gamma # Test Statistic: W = 0.9834346 # P-value: 0.9704046 # # Lognormal # Test Statistic: W = 0.9185006 # P-value: 0.09271768 #-------------------- # 3) Compare the results in 2) above with the results using the # ProUCL method. #--------------------------------------------------------------- distChoose(dat, method = "proucl") #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: ProUCL # #Type I Error per Test: 0.05 # #Decision: Normal # #Estimated Parameter(s): mean = 7.746340 # sd = 5.432175 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Test Results: # # Normal # Shapiro-Wilk GOF # Test Statistic: W = 0.9097488 # P-value: 0.06303695 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.1547851 # P-value: 0.238092 # # Gamma # ProUCL Anderson-Darling Gamma GOF # Test Statistic: A = 0.1853826 # P-value: >= 0.10 # ProUCL Kolmogorov-Smirnov Gamma GOF # Test Statistic: D = 0.0988692 # P-value: >= 0.10 # # Lognormal # Shapiro-Wilk GOF # Test Statistic: W = 0.9185006 # P-value: 0.09271768 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.149317 # P-value: 0.2869177 #-------------------- # Clean up #--------- rm(dat) #==================================================================== # Example 10-2 of USEPA (2009, page 10-14) gives an example of # using the Shapiro-Wilk test to test the assumption of normality # for nickel concentrations (ppb) in groundwater collected over # 4 years. 
The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #4 8 Well.1 56.0 #5 10 Well.1 8.7 #6 1 Well.2 19.0 #7 3 Well.2 81.5 #8 6 Well.2 331.0 #9 8 Well.2 14.0 #10 10 Well.2 64.4 #11 1 Well.3 39.0 #12 3 Well.3 151.0 #13 6 Well.3 27.0 #14 8 Well.3 21.4 #15 10 Well.3 578.0 #16 1 Well.4 3.1 #17 3 Well.4 942.0 #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Use distChoose with the probability plot correlation method, # and for the lognormal distribution specify the # mean and CV parameterization: #------------------------------------------------------------ distChoose(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df, choices = c("norm", "gamma", "lnormAlt"), method = "ppcc") #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: PPCC # #Type I Error per Test: 0.05 # #Decision: Lognormal # #Estimated Parameter(s): mean = 213.415628 # cv = 2.809377 # #Estimation Method: mvue # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Results: # # Normal # Test Statistic: r = 0.8199825 # P-value: 5.753418e-05 # # Gamma # Test Statistic: r = 0.9749044 # P-value: 0.317334 # # Lognormal # Test Statistic: r = 0.9912528 # P-value: 0.9187852 #-------------------- # Repeat the above example using the ProUCL method. #-------------------------------------------------- distChoose(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df, method = "proucl") #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: ProUCL # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 0.5198727 # scale = 326.0894272 # #Estimation Method: MLE # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Results: # # Normal # Shapiro-Wilk GOF # Test Statistic: W = 0.6788888 # P-value: 2.17927e-05 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.3267052 # P-value: 5.032807e-06 # # Gamma # ProUCL Anderson-Darling Gamma GOF # Test Statistic: A = 0.5076725 # P-value: >= 0.10 # ProUCL Kolmogorov-Smirnov Gamma GOF # Test Statistic: D = 0.1842904 # P-value: >= 0.10 # # Lognormal # Shapiro-Wilk GOF # Test Statistic: W = 0.978946 # P-value: 0.9197735 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.08405167 # P-value: 0.9699648 #==================================================================== ## Not run: # 1) Simulate 1000 trials where for each trial you: # a) Generate 20 observations from a Gamma distribution with # parameters mean = 10 and CV = 1. # b) Use distChoose with the Shapiro-Wilk method. # c) Use distChoose with the ProUCL method. # # 2) Compare the proportion of times the # Normal vs. Gamma vs. Lognormal vs. Nonparametric distribution # is chosen for b) and c) above. 
#------------------------------------------------------------------ set.seed(58) N <- 1000 Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric")) Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac) for(i in 1:N) { dat <- rgammaAlt(20, mean = 10, cv = 1) Choose.df[i, "SW"] <- distChoose(dat, method = "sw")$decision Choose.df[i, "ProUCL"] <- distChoose(dat, method = "proucl")$decision } summaryStats(Choose.df, digits = 0) # ProUCL(N) ProUCL(Pct) SW(N) SW(Pct) #Normal 443 44 41 4 #Gamma 546 55 733 73 #Lognormal 9 1 215 22 #Nonparametric 2 0 11 1 #Combined 1000 100 1000 100 #-------------------- # Repeat above example for the Lognormal Distribution with mean=10 and CV = 1. #----------------------------------------------------------------------------- set.seed(297) N <- 1000 Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric")) Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac) for(i in 1:N) { dat <- rlnormAlt(20, mean = 10, cv = 1) Choose.df[i, "SW"] <- distChoose(dat, method = "sw")$decision Choose.df[i, "ProUCL"] <- distChoose(dat, method = "proucl")$decision } summaryStats(Choose.df, digits = 0) # ProUCL(N) ProUCL(Pct) SW(N) SW(Pct) #Normal 313 31 15 2 #Gamma 556 56 254 25 #Lognormal 121 12 706 71 #Nonparametric 10 1 25 2 #Combined 1000 100 1000 100 #-------------------- # Clean up #--------- rm(N, Choose.fac, Choose.df, i, dat) ## End(Not run)
Objects of S3 class "distChoose"
are returned by the EnvStats function
distChoose
.
Objects of S3 class "distChoose"
are lists that contain
information about the candidate distributions, the estimated distribution
parameters for each candidate distribution, and the test statistics and
p-values associated with each candidate distribution.
Required Components
The following components must be included in a legitimate list of
class "distChoose"
.
choices |
a character vector containing the full names
of the candidate distributions. (see |
method |
a character string denoting which method was used. |
decision |
a character vector containing the full name of the chosen distribution. |
alpha |
a numeric scalar between 0 and 1 specifying the Type I error associated with each goodness-of-fit test. |
distribution.parameters |
a numeric vector containing the estimated parameters associated with the chosen distribution. |
estimation.method |
a character string indicating the method
used to compute the estimated parameters associated with the chosen
distribution. The value of this component will depend on the
available estimation methods (see |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit tests. |
test.results |
a list with the same number of components as the number
of elements in the component |
data.name |
character string indicating the name of the data object used for the goodness-of-fit tests. |
Optional Components
The following component is included in the result of
calling distChoose
when the argument keep.data=TRUE
:
data |
numeric vector containing the data actually used for the goodness-of-fit tests (i.e., the original data without any missing or infinite values). |
The following component is included in the result of
calling distChoose
when missing (NA
),
undefined (NaN
) and/or infinite (Inf
, -Inf
)
values are present:
bad.obs |
numeric scalar indicating the number of missing ( |
Generic functions that have methods for objects of class
"distChoose"
include: print
.
Since objects of class "distChoose"
are lists, you may extract
their components with the $
and [[
operators.
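The following is a brief sketch (not part of the original help text) of pulling pieces out of a "distChoose" object with these operators. It assumes, based on the printed output shown in the example below, that each element of the test.results component contains a p.value element.

library(EnvStats)

set.seed(47)
dat <- rgamma(20, shape = 2, scale = 3)
obj <- distChoose(dat)

obj$decision                       # full name of the chosen distribution
obj[["distribution.parameters"]]   # estimated parameters for that choice

# p-values for all candidate distributions
sapply(obj$test.results, function(z) z$p.value)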
Steven P. Millard ([email protected])
distChoose
, print.distChoose
,
Goodness-of-Fit Tests,
Distribution.df
.
# Create an object of class "distChoose", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(47) dat <- rgamma(20, shape = 2, scale = 3) distChoose.obj <- distChoose(dat) mode(distChoose.obj) #[1] "list" class(distChoose.obj) #[1] "distChoose" names(distChoose.obj) #[1] "choices" "method" #[3] "decision" "alpha" #[5] "distribution.parameters" "estimation.method" #[7] "sample.size" "test.results" #[9] "data" "data.name" distChoose.obj #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: MLE # #Data: dat # #Sample Size: 20 # #Test Results: # # Normal # Test Statistic: W = 0.9097488 # P-value: 0.06303695 # # Gamma # Test Statistic: W = 0.9834958 # P-value: 0.970903 # # Lognormal # Test Statistic: W = 0.9185006 # P-value: 0.09271768 #========== # Extract the choices #-------------------- distChoose.obj$choices #[1] "Normal" "Gamma" "Lognormal" #========== # Clean up #--------- rm(dat, distChoose.obj)
Perform a series of goodness-of-fit tests for censored data from a (possibly user-specified) set of candidate probability distributions to determine which probability distribution provides the best fit for a data set.
distChooseCensored(x, censored, censoring.side = "left", alpha = 0.05, method = "sf", choices = c("norm", "gamma", "lnorm"), est.arg.list = NULL, prob.method = "hirsch-stedinger", plot.pos.con = 0.375, warn = TRUE, keep.data = TRUE, data.name = NULL, censoring.name = NULL)
x |
a numeric vector containing data for the goodness-of-fit tests.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
alpha |
numeric scalar between 0 and 1 specifying the Type I error associated with each
goodness-of-fit test. When |
method |
character string defining which method to use. Possible values are:
The Shapiro-Wilk method is only available for singly censored data. See the DETAILS section for more information. |
choices |
a character vector denoting the distribution abbreviations of the candidate
distributions. See the help file for This argument is ignored when |
est.arg.list |
a list containing one or more lists of arguments to be passed to the
function(s) estimating the distribution parameters. The name(s) of
the components of the list must be equal to or a subset of the values of the
argument In the course of testing for some form of normality (i.e., Normal, Lognormal),
the estimated parameters are saved in the This argument is ignored when |
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities) when
The default value is The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant to use when |
warn |
logical scalar indicating whether to print a warning message when
observations with |
keep.data |
logical scalar indicating whether to return the original data. The
default value is |
data.name |
optional character string indicating the name of the data used for argument |
censoring.name |
optional character string indicating the name for the data used for argument |
The function distChooseCensored
returns a list with information on the goodness-of-fit
tests for various distributions and which distribution appears to best fit the
data based on the p-values from the goodness-of-fit tests. This function was written in
order to compare ProUCL's way of choosing the best-fitting distribution (USEPA, 2015) with
other ways of choosing the best-fitting distribution.
Method Based on Shapiro-Wilk, Shapiro-Francia, or Probability Plot Correlation Test
(method="sw"
, method="sf"
, or method="ppcc"
)
For each value of the argument choices
, the function distChooseCensored
runs the goodness-of-fit test using the data in x
assuming that particular
distribution. For example, if choices=c("norm", "gamma", "lnorm")
,
indicating the Normal, Gamma, and Lognormal distributions, and
method="sf"
, then the usual Shapiro-Francia test is performed for the Normal
and Lognormal distributions, and the extension of the Shapiro-Francia test is performed
for the Gamma distribution (see the section
Testing Goodness-of-Fit for Any Continuous Distribution in the help
file for gofTestCensored
for an explanation of the latter). The distribution associated
with the largest p-value is the chosen distribution. In the case when all p-values are
less than the value of the argument alpha
, the distribution “Nonparametric” is chosen.
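Below is a minimal sketch of the "largest p-value" rule just described, assuming a left-censored sample. It is not the EnvStats implementation, and the helper name pickDist is made up for illustration; it simply calls gofTestCensored for each candidate and keeps the distribution with the largest p-value, falling back to "Nonparametric" when every p-value is below alpha.

library(EnvStats)

pickDist <- function(x, censored, choices = c("norm", "gamma", "lnorm"),
                     test = "sf", alpha = 0.05) {
  # One goodness-of-fit p-value per candidate distribution
  p.vals <- sapply(choices, function(dist)
    gofTestCensored(x, censored, distribution = dist, test = test)$p.value)
  if (all(p.vals < alpha)) "Nonparametric" else choices[which.max(p.vals)]
}

# Example use with simulated left-censored lognormal data:
set.seed(123)
x <- rlnormAlt(30, mean = 10, cv = 1)
censored <- x < 5
x[censored] <- 5
pickDist(x, censored)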
Method Based on ProUCL Algorithm (method="proucl"
)
When method="proucl"
, the function distChooseCensored
uses the
algorithm that ProUCL (USEPA, 2015) uses to determine the best fitting
distribution. The candidate distributions are the
Normal, Gamma, and Lognormal distributions. The algorithm
used by ProUCL is as follows:
Remove all censored observations and use only the uncensored observations.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the Normal distribution,
i.e., call the function gofTest
with distribution="norm", test="sw"
and distribution = "norm", test="lillie"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Normal distribution. Otherwise, proceed to the next step.
Perform the “ProUCL Anderson-Darling” and
“ProUCL Kolmogorov-Smirnov” goodness-of-fit tests for the Gamma distribution,
i.e., call the function gofTest
with distribution="gamma", test="proucl.ad.gamma"
and distribution="gamma", test="proucl.ks.gamma"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Gamma distribution. Otherwise, proceed to the next step.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the
Lognormal distribution, i.e., call the function gofTest
with distribution = "lnorm", test="sw"
and distribution = "lnorm", test="lillie"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Lognormal distribution. Otherwise, proceed to the next step.
If none of the goodness-of-fit tests above yields a p-value greater than or equal to the user-supplied value
of alpha
, then choose the “Nonparametric” distribution.
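The cascade above can be sketched roughly as follows for left-censored data. This is not the EnvStats implementation: the helper names (prouclChoose, passes, p.val.num) are invented for illustration, and treating a range-style p-value such as ">= 0.10" from the ProUCL gamma tests by taking its first number is an assumption about how those results may be reported.

library(EnvStats)

p.val.num <- function(p) {
  # Guard against p-values reported as a range (e.g., ">= 0.10") rather than a number.
  if (is.character(p)) p <- as.numeric(regmatches(p, regexpr("[0-9.]+", p)))
  p
}

passes <- function(gof.obj, alpha) p.val.num(gof.obj$p.value) >= alpha

prouclChoose <- function(x, censored, alpha = 0.05) {
  y <- x[!censored]  # use only the uncensored observations

  # Shapiro-Wilk and Lilliefors tests for the Normal distribution
  if (passes(gofTest(y, test = "sw",     distribution = "norm"), alpha) ||
      passes(gofTest(y, test = "lillie", distribution = "norm"), alpha))
    return("Normal")

  # ProUCL Anderson-Darling and Kolmogorov-Smirnov tests for the Gamma distribution
  if (passes(gofTest(y, test = "proucl.ad.gamma", distribution = "gamma"), alpha) ||
      passes(gofTest(y, test = "proucl.ks.gamma", distribution = "gamma"), alpha))
    return("Gamma")

  # Shapiro-Wilk and Lilliefors tests for the Lognormal distribution
  if (passes(gofTest(y, test = "sw",     distribution = "lnorm"), alpha) ||
      passes(gofTest(y, test = "lillie", distribution = "lnorm"), alpha))
    return("Lognormal")

  "Nonparametric"
}

# Example use:
set.seed(58)
x <- rgammaAlt(30, mean = 10, cv = 1)
censored <- x < 5
x[censored] <- 5
prouclChoose(x, censored)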
a list of class "distChooseCensored"
containing the results of the goodness-of-fit tests.
Objects of class "distChooseCensored"
have a special printing method.
See the help file for distChooseCensored.object
for details.
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlotCensored
).
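As a quick illustration of pairing the two approaches (using simulated data rather than any of the built-in data sets), one might compare the distChooseCensored decision with censored Q-Q plots for two of the candidate distributions:

library(EnvStats)

set.seed(42)
x <- rlnormAlt(40, mean = 10, cv = 1)
censored <- x < 4
x[censored] <- 4

distChooseCensored(x, censored)$decision

# Censored Q-Q plots for the normal and lognormal candidates
qqPlotCensored(x, censored, distribution = "norm",  add.line = TRUE)
qqPlotCensored(x, censored, distribution = "lnorm", add.line = TRUE)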
Steven P. Millard ([email protected])
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of g1.
Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality.
Empirical Results for the Distributions of b2 and √b1.
Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of √b1.
Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.
Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell' Istituto Italiano degle Attuari 4, 83-91.
Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). doi:10.18637/jss.v008.i18.
Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, pp.63-95.
Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.
Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.
Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.
Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
gofTestCensored
, distChooseCensored.object
,
print.distChooseCensored
, distChoose
.
# Generate 30 observations from a gamma distribution with # parameters mean=10 and cv=1 and then censor observations less than 5. # Then: # # 1) Call distChooseCensored using the Shapiro-Wilk method and specify # choices of the # normal, # gamma (alternative parameterzation), and # lognormal (alternative parameterization) # distributions. # # 2) Compare the results in 1) above with the results using the # ProUCL method. # # Notes: The call to set.seed lets you reproduce this example. # # The ProUCL method chooses the Normal distribution, whereas the # Shapiro-Wilk method chooses the Gamma distribution. set.seed(598) dat <- sort(rgammaAlt(30, mean = 10, cv = 1)) dat # [1] 0.5313509 1.4741833 1.9936208 2.7980636 3.4509840 # [6] 3.7987348 4.5542952 5.5207531 5.5253596 5.7177872 #[11] 5.7513827 9.1086375 9.8444090 10.6247123 10.9304922 #[16] 11.7925398 13.3432689 13.9562777 14.6029065 15.0563342 #[21] 15.8730642 16.0039936 16.6910715 17.0288922 17.8507891 #[26] 19.1105522 20.2657141 26.3815970 30.2912797 42.8726101 dat.censored <- dat censored <- dat.censored < 5 dat.censored[censored] <- 5 # 1) Call distChooseCensored using the Shapiro-Wilk method. #---------------------------------------------------------- distChooseCensored(dat.censored, censored, method = "sw", choices = c("norm", "gammaAlt", "lnormAlt")) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): mean = 12.4911448 # cv = 0.7617343 # #Estimation Method: MLE # #Data: dat.censored # #Sample Size: 30 # #Censoring Side: left # #Censoring Variable: censored # #Censoring Level(s): 5 # #Percent Censored: 23.33333% # #Test Results: # # Normal # Test Statistic: W = 0.9372741 # P-value: 0.1704876 # # Gamma # Test Statistic: W = 0.9613711 # P-value: 0.522329 # # Lognormal # Test Statistic: W = 0.9292406 # P-value: 0.114511 #-------------------- # 2) Compare the results in 1) above with the results using the # ProUCL method. #--------------------------------------------------------------- distChooseCensored(dat.censored, censored, method = "proucl") #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: ProUCL # #Type I Error per Test: 0.05 # #Decision: Normal # #Estimated Parameter(s): mean = 15.397584 # sd = 8.688302 # #Estimation Method: mvue # #Data: dat.censored # #Sample Size: 30 # #Censoring Side: left # #Censoring Variable: censored # #Censoring Level(s): 5 # #Percent Censored: 23.33333% # #ProUCL Sample Size: 23 # #Test Results: # # Normal # Shapiro-Wilk GOF # Test Statistic: W = 0.861652 # P-value: 0.004457924 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.1714435 # P-value: 0.07794315 # # Gamma # ProUCL Anderson-Darling Gamma GOF # Test Statistic: A = 0.3805556 # P-value: >= 0.10 # ProUCL Kolmogorov-Smirnov Gamma GOF # Test Statistic: D = 0.1035271 # P-value: >= 0.10 # # Lognormal # Shapiro-Wilk GOF # Test Statistic: W = 0.9532604 # P-value: 0.3414187 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.115588 # P-value: 0.5899259 #-------------------- # Clean up #--------- rm(dat, censored, dat.censored) #==================================================================== # Check the assumption that the silver data stored in Helsel.Cohn.88.silver.df # follows a lognormal distribution. 
# Note that the small p-value and the shape of the Q-Q plot # (an inverted S-shape) suggests that the log transformation is not quite strong # enough to "bring in" the tails (i.e., the log-transformed silver data has tails # that are slightly too long relative to a normal distribution). # Helsel and Cohn (1988, p.2002) note that the gross outlier of 560 mg/L tends to # make the shape of the data resemble a gamma distribution, but # the distChooseCensored function decision is neither Gamma nor Lognormal, # but instead Nonparametric. # First create a lognormal Q-Q plot #---------------------------------- dev.new() with(Helsel.Cohn.88.silver.df, qqPlotCensored(Ag, Censored, distribution = "lnorm", points.col = "blue", add.line = TRUE)) #---------- # Now call the distChooseCensored function using the default settings. #--------------------------------------------------------------------- with(Helsel.Cohn.88.silver.df, distChooseCensored(Ag, Censored)) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Francia # #Type I Error per Test: 0.05 # #Decision: Nonparametric # #Data: Ag # #Sample Size: 56 # #Censoring Side: left # #Censoring Variable: Censored # #Censoring Level(s): 0.1 0.2 0.3 0.5 1.0 2.0 2.5 5.0 6.0 10.0 20.0 25.0 # #Percent Censored: 60.71429% # #Test Results: # # Normal # Test Statistic: W = 0.3065529 # P-value: 8.346126e-08 # # Gamma # Test Statistic: W = 0.6254148 # P-value: 1.884155e-05 # # Lognormal # Test Statistic: W = 0.8957198 # P-value: 0.03490314 #---------- # Clean up #--------- graphics.off() #==================================================================== # Chapter 15 of USEPA (2009) gives several examples of looking # at normal Q-Q plots and estimating the mean and standard deviation # for manganese concentrations (ppb) in groundwater at five background # wells (USEPA, 2009, p. 15-10). The Q-Q plot shown in Figure 15-4 # on page 15-13 clearly indicates that the Lognormal distribution # is a good fit for these data. # In EnvStats these data are stored in the data frame EPA.09.Ex.15.1.manganese.df. # Here we will call the distChooseCensored function to determine # whether the data appear to come from a normal, gamma, or lognormal # distribution. # # Note that using the Probability Plot Correlation Coefficient method # (equivalent to using the Shapiro-Francia method) yields a decision of # Lognormal, but using the ProUCL method yields a decision of Gamma. #---------------------------------------------------------------------- # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... 
#23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Use distChooseCensored with the probability plot correlation method, # and for the gamma and lognormal distribution specify the # mean and CV parameterization: #------------------------------------------------------------ with(EPA.09.Ex.15.1.manganese.df, distChooseCensored(Manganese.ppb, Censored, choices = c("norm", "gamma", "lnormAlt"), method = "ppcc")) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: PPCC # #Type I Error per Test: 0.05 # #Decision: Lognormal # #Estimated Parameter(s): mean = 23.003987 # cv = 2.300772 # #Estimation Method: MLE # #Data: Manganese.ppb # #Sample Size: 25 # #Censoring Side: left # #Censoring Variable: Censored # #Censoring Level(s): 2 5 # #Percent Censored: 24% # #Test Results: # # Normal # Test Statistic: r = 0.9147686 # P-value: 0.004662658 # # Gamma # Test Statistic: r = 0.9844875 # P-value: 0.6836625 # # Lognormal # Test Statistic: r = 0.9931982 # P-value: 0.9767731 #-------------------- # Repeat the above example using the ProUCL method. #-------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, distChooseCensored(Manganese.ppb, Censored, method = "proucl")) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: ProUCL # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 1.284882 # scale = 19.813413 # #Estimation Method: MLE # #Data: Manganese.ppb # #Sample Size: 25 # #Censoring Side: left # #Censoring Variable: Censored # #Censoring Level(s): 2 5 # #Percent Censored: 24% # #ProUCL Sample Size: 19 # #Test Results: # # Normal # Shapiro-Wilk GOF # Test Statistic: W = 0.7423947 # P-value: 0.0001862975 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.2768771 # P-value: 0.0004771155 # # Gamma # ProUCL Anderson-Darling Gamma GOF # Test Statistic: A = 0.6857121 # P-value: 0.05 <= p < 0.10 # ProUCL Kolmogorov-Smirnov Gamma GOF # Test Statistic: D = 0.1830034 # P-value: >= 0.10 # # Lognormal # Shapiro-Wilk GOF # Test Statistic: W = 0.969805 # P-value: 0.7725528 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.138547 # P-value: 0.4385195 #==================================================================== ## Not run: # 1) Simulate 1000 trials where for each trial you: # a) Generate 30 observations from a Gamma distribution with # parameters mean = 10 and CV = 1. # b) Censor observations less than 5 (the 39th percentile). # c) Use distChooseCensored with the Shapiro-Francia method. # d) Use distChooseCensored with the ProUCL method. # # 2) Compare the proportion of times the # Normal vs. Gamma vs. Lognormal vs. Nonparametric distribution # is chosen for c) and d) above. 
#------------------------------------------------------------------ set.seed(58) N <- 1000 Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric")) Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac) for(i in 1:N) { dat <- rgammaAlt(30, mean = 10, cv = 1) censored <- dat < 5 dat[censored] <- 5 Choose.df[i, "SW"] <- distChooseCensored(dat, censored, method = "sw")$decision Choose.df[i, "ProUCL"] <- distChooseCensored(dat, censored, method = "proucl")$decision } summaryStats(Choose.df, digits = 0) # ProUCL(N) ProUCL(Pct) SW(N) SW(Pct) #Normal 520 52 398 40 #Gamma 336 34 375 38 #Lognormal 105 10 221 22 #Nonparametric 39 4 6 1 #Combined 1000 100 1000 100 #-------------------- # Repeat above example for the Lognormal Distribution with mean=10 and CV = 1. # In this case, 5 is the 34th percentile. #----------------------------------------------------------------------------- set.seed(297) N <- 1000 Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric")) Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac) for(i in 1:N) { dat <- rlnormAlt(30, mean = 10, cv = 1) censored <- dat < 5 dat[censored] <- 5 Choose.df[i, "SW"] <- distChooseCensored(dat, censored, method = "sf")$decision Choose.df[i, "ProUCL"] <- distChooseCensored(dat, censored, method = "proucl")$decision } summaryStats(Choose.df, digits = 0) # ProUCL(N) ProUCL(Pct) SW(N) SW(Pct) #Normal 277 28 92 9 #Gamma 393 39 231 23 #Lognormal 190 19 624 62 #Nonparametric 140 14 53 5 #Combined 1000 100 1000 100 #-------------------- # Clean up #--------- rm(N, Choose.fac, Choose.df, i, dat, censored) ## End(Not run)
Objects of S3 class "distChooseCensored"
are returned by the EnvStats function
distChooseCensored
.
Objects of S3 class "distChooseCensored"
are lists that contain
information about the candidate distributions, the estimated distribution
parameters for each candidate distribution, and the test statistics and
p-values associated with each candidate distribution.
Required Components
The following components must be included in a legitimate list of
class "distChooseCensored"
.
choices |
a character vector containing the full names
of the candidate distributions. (see |
method |
a character string denoting which method was used. |
decision |
a character vector containing the full name of the chosen distribution. |
alpha |
a numeric scalar between 0 and 1 specifying the Type I error associated with each goodness-of-fit test. |
distribution.parameters |
a numeric vector containing the estimated parameters associated with the chosen distribution. |
estimation.method |
a character string indicating the method
used to compute the estimated parameters associated with the chosen
distribution. The value of this component will depend on the
available estimation methods (see |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit tests. |
censoring.side |
character string indicating whether the data are left- or right-censored. |
censoring.levels |
numeric scalar or vector indicating the censoring level(s). |
percent.censored |
numeric scalar indicating the percent of non-missing observations that are censored. |
test.results |
a list with the same number of components as the number
of elements in the component |
data.name |
character string indicating the name of the data object used for the goodness-of-fit tests. |
censoring.name |
character string indicating the name of the data object used to identify which values are censored. |
Optional Components
The following components are included in the result of
calling distChooseCensored
when the argument keep.data=TRUE
:
data |
numeric vector containing the data actually used for the goodness-of-fit tests (i.e., the original data without any missing or infinite values). |
censored |
logical vector containing the censoring status for the data actually used for the goodness-of-fit tests (i.e., the original data without any missing or infinite values). |
The following component is included in the result of
calling distChooseCensored
when missing (NA
),
undefined (NaN
) and/or infinite (Inf
, -Inf
)
values are present:
bad.obs |
numeric scalar indicating the number of missing ( |
Generic functions that have methods for objects of class
"distChooseCensored"
include: print
.
Since objects of class "distChooseCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
distChooseCensored
, print.distChooseCensored
,
Censored Data,
Goodness-of-Fit Tests,
Distribution.df
.
# Create an object of class "distChooseCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(598) dat <- rgammaAlt(30, mean = 10, cv = 1) censored <- dat < 5 dat[censored] <- 5 distChooseCensored.obj <- distChooseCensored(dat, censored, method = "sw", choices = c("norm", "gammaAlt", "lnormAlt")) mode(distChooseCensored.obj) #[1] "list" class(distChooseCensored.obj) #[1] "distChooseCensored" names(distChooseCensored.obj) # [1] "choices" "method" # [3] "decision" "alpha" # [5] "distribution.parameters" "estimation.method" # [7] "sample.size" "censoring.side" # [9] "censoring.levels" "percent.censored" #[11] "test.results" "data" #[13] "censored" "data.name" #[15] "censoring.name" distChooseCensored.obj #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): mean = 12.4911448 # cv = 0.7617343 # #Estimation Method: MLE # #Data: dat.censored # #Sample Size: 30 # #Censoring Side: left # #Censoring Variable: censored # #Censoring Level(s): 5 # #Percent Censored: 23.33333% # #Test Results: # # Normal # Test Statistic: W = 0.9372741 # P-value: 0.1704876 # # Gamma # Test Statistic: W = 0.9613711 # P-value: 0.522329 # # Lognormal # Test Statistic: W = 0.9292406 # P-value: 0.114511 #========== # Extract the choices #-------------------- distChooseCensored.obj$choices #[1] "Normal" "Gamma" "Lognormal" #========== # Clean up #--------- rm(dat, censored, distChooseCensored.obj)
Data frame summarizing information about available probability distributions in R and the EnvStats package, and which distributions have associated functions for estimating distribution parameters.
Distribution.df
Distribution.df
A data frame with 35 rows corresponding to 35 different available probability distributions, and 25 columns containing information associated with these probability distributions.
Name
a character vector containing the name of the probability distribution (see the column labeled Name in the table below).
Type
a character vector indicating the type of
distribution (see the column labeled Type in the table below).
Possible values are "Finite Discrete"
, "Discrete"
,
"Continuous"
, and "Mixed"
.
Support.Min
a character vector indicating the minimum value
the random variable can assume (see the column labeled Range in
the table below). The reason this is a character vector instead of a
numeric vector is because some distributions have a lower bound that
depends on the value of a distribution parameter. For example,
the minimum value for a Uniform distribution is given by the
value of the parameter min
.
Support.Max
a character vector indicating the maximum value
the random variable can assume (see the column labeled Range in
the table below). The reason this is a character vector instead of a
numeric vector is because some distributions have an upper bound that
depends on the value of a distribution parameter. For example,
the maximum value for a Uniform distribution is given by the value
of the parameter max
.
Estimation.Method(s)
a character vector indicating the
names of the methods available to estimate the distribution parameter(s)
(see the column labeled Estimation Method(s) in the table below).
Possible values include "mle"
(maximum likelihood), "mme"
(method of moments), "mmue"
(method of moments based on the
unbiased estimate of variance), "mvue"
(minimum variance unbiased),
"qmle"
(quasi-mle), etc., or some combination of these. In
cases where an estimator is more than one kind, a slash (/
) is
used to denote all methods covered by the single estimator. For example,
for the Binomial distribution, the sample proportion is the maximum
likelihood, method of moments, and minimum variance unbiased estimator,
so this method is denoted as "mle/mme/mvue"
. See the help files
for the specific function listed under
Estimating Distribution Parameters for an
explanation of each of these estimation methods.
Quantile.Estimation.Method(s)
a character vector indicating
the names of the methods available to estimate the distribution
quantiles. For many distributions, these are the same as
Estimation.Method(s)
. See the help files for the specific
function listed under
Estimating Distribution Quantiles for an
explanation of each of these estimation methods.
Prediction.Interval.Method(s)
a character vector indicating the names of the methods available to create prediction intervals. See the help files for the specific function listed under Prediction Intervals for an explanation of each of these estimation methods.
Singly.Censored.Estimation.Method(s)
a character vector indicating the names of the methods available to estimate the distribution parameter(s) for Type I singly-censored data. See the help files for the specific function listed under Estimating Distribution Parameters in the help file for Censored Data for an explanation of each of these estimation methods.
Multiply.Censored.Estimation.Method(s)
a character vector indicating the names of the methods available to estimate the distribution parameter(s) for Type I multiply-censored data. See the help files for the specific function listed under Estimating Distribution Parameters in the help file for Censored Data for an explanation of each of these estimation methods.
Number.parameters
a numeric vector indicating the number of parameters associated with the distribution (see the column labeled Parameters in the table below).
Parameter.1
the columns labeled Parameter.1, Parameter.2, ..., Parameter.5 are character vectors containing the names of the distribution parameters (see the column labeled Parameters in the table below). If a distribution has n parameters and n < 5, then the columns labeled Parameter.(n+1), ..., Parameter.5 are empty. For example, the Normal distribution has only two parameters associated with it (mean and sd), so the fields in Parameter.3, Parameter.4, and Parameter.5 are empty.
Parameter.2
see Parameter.1
Parameter.3
see Parameter.1
Parameter.4
see Parameter.1
Parameter.5
see Parameter.1
Parameter.1.Min
the columns labeled Parameter.1.Min, Parameter.2.Min, ..., Parameter.5.Min are character vectors containing the minimum values that can be assumed by the distribution parameters (see the column labeled Parameter Range(s) in the table below). The reason these are character vectors instead of numeric vectors is because some parameters have a lower bound of 0 but must be strictly bigger than 0 (e.g., the parameter sd for the Normal distribution), in which case the lower bound is .Machine$double.eps, which may vary from machine to machine. Also, some parameters have a lower bound that depends on the value of another parameter. For example, the parameter max for a Uniform distribution is bounded below by the value of the parameter min.
If a distribution has n parameters and n < 5, then the columns labeled Parameter.(n+1).Min, ..., Parameter.5.Min have the missing value code (NA). For example, the Normal distribution has only two parameters associated with it (mean and sd), so the fields in Parameter.3.Min, Parameter.4.Min, and Parameter.5.Min have NAs in them.
Parameter.2.Min
see Parameter.1.Min
Parameter.3.Min
see Parameter.1.Min
Parameter.4.Min
see Parameter.1.Min
Parameter.5.Min
see Parameter.1.Min
Parameter.1.Max
the columns labeled Parameter.1.Max, Parameter.2.Max, ..., Parameter.5.Max are character vectors containing the maximum values that can be assumed by the distribution parameters (see the column labeled Parameter Range(s) in the table below). The reason these are character vectors instead of numeric vectors is because some parameters have an upper bound that depends on the value of another parameter. For example, the parameter min for a Uniform distribution is bounded above by the value of the parameter max.
If a distribution has n parameters and n < 5, then the columns labeled Parameter.(n+1).Max, ..., Parameter.5.Max have the missing value code (NA). For example, the Normal distribution has only two parameters associated with it (mean and sd), so the fields in Parameter.3.Max, Parameter.4.Max, and Parameter.5.Max have NAs in them.
Parameter.2.Max
see Parameter.1.Max
Parameter.3.Max
see Parameter.1.Max
Parameter.4.Max
see Parameter.1.Max
Parameter.5.Max
see Parameter.1.Max
The table below summarizes the probability distributions available in R and EnvStats. For each distribution, there are four associated functions for computing density values, percentiles, quantiles, and random numbers. The form of the names of these functions are dabb, pabb, qabb, and rabb, where abb is the abbreviated name of the distribution (see table below). These functions are described in the help file with the name of the distribution (see the first column of the table below). For example, the help file for Beta describes the behavior of dbeta, pbeta, qbeta, and rbeta.
For most distributions, there is also an associated function for estimating the distribution parameters, and the form of the names of these functions is eabb, where abb is the abbreviated name of the distribution (see table below). All of these functions are listed in the help file Estimating Distribution Parameters. For example, the function ebeta estimates the shape parameters of a Beta distribution based on a random sample of observations from this distribution.
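As a quick, informal illustration of this naming convention (not part of the original help file), the following R snippet uses the d, p, q, r, and e functions for the Beta distribution:

library(EnvStats)

set.seed(123)
x <- rbeta(20, shape1 = 2, shape2 = 4)  # r + abb: random numbers
dbeta(0.3, shape1 = 2, shape2 = 4)      # d + abb: density at 0.3
pbeta(0.3, shape1 = 2, shape2 = 4)      # p + abb: cumulative probability at 0.3
qbeta(0.5, shape1 = 2, shape2 = 4)      # q + abb: 50th percentile (median)
ebeta(x)                                # e + abb: estimate shape1 and shape2 from the sample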
For some distributions, there are functions to estimate distribution parameters based on Type I censored data. The form of the names of these functions is eabbSinglyCensored for singly censored data and eabbMultiplyCensored for multiply censored data. All of these functions are listed under the heading Estimating Distribution Parameters in the help file Censored Data.
Table 1a. Available Distributions: Name, Abbreviation, Type, and Range

Name | Abbreviation | Type | Range
Beta | beta | Continuous | [0, 1]
Binomial | binom | Finite Discrete | 0, 1, ..., size (integer)
Cauchy | cauchy | Continuous | (-Inf, Inf)
Chi | chi | Continuous | [0, Inf)
Chi-square | chisq | Continuous | [0, Inf)
Exponential | exp | Continuous | [0, Inf)
Extreme Value | evd | Continuous | (-Inf, Inf)
F | f | Continuous | [0, Inf)
Gamma | gamma | Continuous | [0, Inf)
Gamma (Alternative) | gammaAlt | Continuous | [0, Inf)
Generalized Extreme Value | gevd | Continuous | (-Inf, Inf) for shape = 0; (-Inf, location + scale/shape] for shape > 0; [location + scale/shape, Inf) for shape < 0
Geometric | geom | Discrete | 0, 1, 2, ... (integer)
Hypergeometric | hyper | Finite Discrete | max(0, k - n), ..., min(k, m) (integer)
Logistic | logis | Continuous | (-Inf, Inf)
Lognormal | lnorm | Continuous | [0, Inf)
Lognormal (Alternative) | lnormAlt | Continuous | [0, Inf)
Lognormal Mixture | lnormMix | Continuous | [0, Inf)
Lognormal Mixture (Alternative) | lnormMixAlt | Continuous | [0, Inf)
Three-Parameter Lognormal | lnorm3 | Continuous | [threshold, Inf)
Truncated Lognormal | lnormTrunc | Continuous | [min, max]
Truncated Lognormal (Alternative) | lnormTruncAlt | Continuous | [min, max]
Negative Binomial | nbinom | Discrete | 0, 1, 2, ... (integer)
Normal | norm | Continuous | (-Inf, Inf)
Normal Mixture | normMix | Continuous | (-Inf, Inf)
Truncated Normal | normTrunc | Continuous | [min, max]
Pareto | pareto | Continuous | [location, Inf)
Poisson | pois | Discrete | 0, 1, 2, ... (integer)
Student's t | t | Continuous | (-Inf, Inf)
Triangular | tri | Continuous | [min, max]
Uniform | unif | Continuous | [min, max]
Weibull | weibull | Continuous | [0, Inf)
Wilcoxon Rank Sum | wilcox | Finite Discrete | 0, 1, ..., m*n (integer)
Zero-Modified Lognormal (Delta) | zmlnorm | Mixed | [0, Inf)
Zero-Modified Lognormal (Delta) (Alternative) | zmlnormAlt | Mixed | [0, Inf)
Zero-Modified Normal | zmnorm | Mixed | (-Inf, Inf)
Table 1b. Available Distributions: Name, Parameters, Parameter Default Values, Parameter Ranges, Estimation Method(s)

Name | Parameter(s) | Default Value(s) | Parameter Range(s) | Estimation Method(s)
Beta | shape1 | | shape1 > 0 | mle, mme, mmue
 | shape2 | | shape2 > 0 |
 | ncp | 0 | ncp >= 0 |
Binomial | size | | size = 1, 2, 3, ... | mle/mme/mvue
 | prob | | 0 <= prob <= 1 |
Cauchy | location | 0 | -Inf < location < Inf |
 | scale | 1 | scale > 0 |
Chi | df | | df > 0 |
Chi-square | df | | df > 0 |
 | ncp | 0 | ncp >= 0 |
Exponential | rate | 1 | rate > 0 | mle/mme
Extreme Value | location | 0 | -Inf < location < Inf | mle, mme, mmue, pwme
 | scale | 1 | scale > 0 |
F | df1 | | df1 > 0 |
 | df2 | | df2 > 0 |
 | ncp | 0 | ncp >= 0 |
Gamma | shape | | shape > 0 | mle, bcmle, mme, mmue
 | scale | 1 | scale > 0 |
Gamma (Alternative) | mean | | mean > 0 | mle, bcmle, mme, mmue
 | cv | 1 | cv > 0 |
Generalized Extreme Value | location | 0 | -Inf < location < Inf | mle, pwme, tsoe
 | scale | 1 | scale > 0 |
 | shape | 0 | -Inf < shape < Inf |
Geometric | prob | | 0 < prob <= 1 | mle/mme, mvue
Hypergeometric | m | | m = 0, 1, 2, ... | mle, mvue
 | n | | n = 0, 1, 2, ... |
 | k | | k = 1, 2, ..., m + n |
Logistic | location | 0 | -Inf < location < Inf | mle, mme, mmue
 | scale | 1 | scale > 0 |
Lognormal | meanlog | 0 | -Inf < meanlog < Inf | mle/mme, mvue
 | sdlog | 1 | sdlog > 0 |
Lognormal (Alternative) | mean | exp(1/2) | mean > 0 | mle, mme, mmue, mvue, qmle
 | cv | sqrt(exp(1)-1) | cv > 0 |
Lognormal Mixture | meanlog1 | 0 | -Inf < meanlog1 < Inf |
 | sdlog1 | 1 | sdlog1 > 0 |
 | meanlog2 | 0 | -Inf < meanlog2 < Inf |
 | sdlog2 | 1 | sdlog2 > 0 |
 | p.mix | 0.5 | 0 <= p.mix <= 1 |
Lognormal Mixture (Alternative) | mean1 | exp(1/2) | mean1 > 0 |
 | cv1 | sqrt(exp(1)-1) | cv1 > 0 |
 | mean2 | exp(1/2) | mean2 > 0 |
 | cv2 | sqrt(exp(1)-1) | cv2 > 0 |
 | p.mix | 0.5 | 0 <= p.mix <= 1 |
Three-Parameter Lognormal | meanlog | 0 | -Inf < meanlog < Inf | lmle, mme, mmue, mmme, royston.skew, zero.skew
 | sdlog | 1 | sdlog > 0 |
 | threshold | 0 | -Inf < threshold < Inf |
Truncated Lognormal | meanlog | 0 | -Inf < meanlog < Inf |
 | sdlog | 1 | sdlog > 0 |
 | min | 0 | 0 <= min < max |
 | max | Inf | min < max <= Inf |
Truncated Lognormal (Alternative) | mean | exp(1/2) | mean > 0 |
 | cv | sqrt(exp(1)-1) | cv > 0 |
 | min | 0 | 0 <= min < max |
 | max | Inf | min < max <= Inf |
Negative Binomial | size | | size = 1, 2, 3, ... | mle/mme, mvue
 | prob | | 0 < prob <= 1 |
 | mu | | mu > 0 |
Normal | mean | 0 | -Inf < mean < Inf | mle/mme, mvue
 | sd | 1 | sd > 0 |
Normal Mixture | mean1 | 0 | -Inf < mean1 < Inf |
 | sd1 | 1 | sd1 > 0 |
 | mean2 | 0 | -Inf < mean2 < Inf |
 | sd2 | 1 | sd2 > 0 |
 | p.mix | 0.5 | 0 <= p.mix <= 1 |
Truncated Normal | mean | 0 | -Inf < mean < Inf |
 | sd | 1 | sd > 0 |
 | min | -Inf | -Inf <= min < max |
 | max | Inf | min < max <= Inf |
Pareto | location | | location > 0 | lse, mle
 | shape | 1 | shape > 0 |
Poisson | lambda | | lambda > 0 | mle/mme/mvue
Student's t | df | | df > 0 |
 | ncp | 0 | -Inf < ncp < Inf |
Triangular | min | 0 | -Inf < min < max |
 | max | 1 | min < max < Inf |
 | mode | 0.5 | min < mode < max |
Uniform | min | 0 | -Inf < min < max | mle, mme, mmue
 | max | 1 | min < max < Inf |
Weibull | shape | | shape > 0 | mle, mme, mmue
 | scale | 1 | scale > 0 |
Wilcoxon Rank Sum | m | | m = 1, 2, 3, ... |
 | n | | n = 1, 2, 3, ... |
Zero-Modified Lognormal (Delta) | meanlog | 0 | -Inf < meanlog < Inf | mvue
 | sdlog | 1 | sdlog > 0 |
 | p.zero | 0.5 | 0 <= p.zero <= 1 |
Zero-Modified Lognormal (Delta) (Alternative) | mean | exp(1/2) | mean > 0 | mvue
 | cv | sqrt(exp(1)-1) | cv > 0 |
 | p.zero | 0.5 | 0 <= p.zero <= 1 |
Zero-Modified Normal | mean | 0 | -Inf < mean < Inf | mvue
 | sd | 1 | sd > 0 |
 | p.zero | 0.5 | 0 <= p.zero <= 1 |
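For quick reference from within R, the information summarized in these tables can also be looked up directly in the Distribution.df data frame. A minimal sketch, assuming the row names of Distribution.df are the abbreviations listed above:

library(EnvStats)

# Peek at a few of the 25 columns for three distributions
Distribution.df[c("norm", "gamma", "lnorm"),
    c("Name", "Type", "Support.Min", "Support.Max", "Estimation.Method(s)")]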
The EnvStats package.
Millard, S.P. (2013). EnvStats: An R Package for Environmental Statistics. Springer, New York. https://link.springer.com/book/10.1007/978-1-4614-8456-1.
Estimate the shape parameters of a beta distribution.
ebeta(x, method = "mle")
x |
numeric vector of observations. All observations must be greater than 0 and less than 1. |
method |
character string specifying the method of estimation. The possible values are "mle" (maximum likelihood; the default), "mme" (method of moments), and "mmue" (method of moments based on the unbiased estimator of variance). See the DETAILS section for more information. |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let x = (x_1, x_2, ..., x_n) be a vector of n observations from a beta distribution with parameters shape1=α and shape2=β.
Maximum Likelihood Estimation (method="mle")
The maximum likelihood estimates of α and β are the values that satisfy the simultaneous equations:

ψ(α) - ψ(α + β) = (1/n) Σ log(x_i)

ψ(β) - ψ(α + β) = (1/n) Σ log(1 - x_i)

where ψ denotes the digamma function (Forbes et al., 2011).
Method of Moments Estimators (method="mme")
The method of moments estimators (mme's) of the shape parameters α and β are given by (Forbes et al., 2011):

α̂ = x̄ [x̄(1 - x̄)/s_m^2 - 1]

β̂ = (1 - x̄) [x̄(1 - x̄)/s_m^2 - 1]

where x̄ = (1/n) Σ x_i denotes the sample mean and s_m^2 = (1/n) Σ (x_i - x̄)^2 denotes the method of moments estimator of variance.
Method of Moments Estimators Based on the Unbiased Estimator of Variance (method="mmue")
These estimators are the same as the method of moments estimators except that the method of moments estimator of variance is replaced with the unbiased estimator of variance:

s^2 = [1/(n - 1)] Σ (x_i - x̄)^2
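As an informal check of these formulas (not part of the original examples), the following sketch computes the mme and mmue estimates by hand and compares them with ebeta():

library(EnvStats)

set.seed(250)
x <- rbeta(20, shape1 = 2, shape2 = 4)

m <- mean(x)
v.mme  <- mean((x - m)^2)   # method of moments variance (divide by n)
v.mmue <- var(x)            # unbiased variance (divide by n - 1)

mme.shape <- function(m, v) {
  k <- m * (1 - m) / v - 1
  c(shape1 = m * k, shape2 = (1 - m) * k)
}

mme.shape(m, v.mme)    # compare with ebeta(x, method = "mme")$parameters
mme.shape(m, v.mmue)   # compare with ebeta(x, method = "mmue")$parameters
ebeta(x, method = "mme")$parameters
ebeta(x, method = "mmue")$parameters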
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The beta distribution takes real values between 0 and 1. Special cases of the beta are the Uniform[0,1] distribution when shape1=1 and shape2=1, and the arcsin distribution when shape1=0.5 and shape2=0.5. The arcsin distribution appears in the theory of random walks. The beta distribution is used in Bayesian analyses as a conjugate to the binomial distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Beta.
# Generate 20 observations from a beta distribution with parameters # shape1=2 and shape2=4, then estimate the parameters via # maximum likelihood. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rbeta(20, shape1 = 2, shape2 = 4) ebeta(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Beta # #Estimated Parameter(s): shape1 = 5.392221 # shape2 = 11.823233 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 #========== # Repeat the above, but use the method of moments estimators: ebeta(dat, method = "mme") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Beta # #Estimated Parameter(s): shape1 = 5.216311 # shape2 = 11.461341 # #Estimation Method: mme # #Data: dat # #Sample Size: 20 #========== # Clean up #--------- rm(dat)
Estimate p (the probability of “success”) for a binomial distribution, and optionally construct a confidence interval for p.
ebinom(x, size = NULL, method = "mle/mme/mvue", ci = FALSE, ci.type = "two-sided", ci.method = "score", correct = TRUE, var.denom = "n", conf.level = 0.95, warn = TRUE)
x |
numeric or logical vector of observations. When |
size |
positive integer indicating the number of trials; |
method |
character string specifying the method of estimation. The only possible value is "mle/mme/mvue" (maximum likelihood, method of moments, and minimum variance unbiased). See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the mean. The default value
is |
ci.type |
character string indicating what kind of confidence interval to compute. The possible values are
|
ci.method |
character string indicating which method to use to construct the confidence interval. Possible values
are |
correct |
logical scalar indicating whether to use the continuity correction when |
var.denom |
character string indicating what value to use in the denominator of the variance estimator when
|
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default
value is |
warn |
a logical scalar indicating whether to issue a warning in the case when |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
If x = (x_1, x_2, ..., x_n) is a vector of n observations from a binomial distribution with parameters size=1 and prob=p, then the sum of all the values in x is an observation from a binomial distribution with parameters size=n and prob=p.
If X is an observation from a binomial distribution with parameters size=n and prob=p, the maximum likelihood estimator (mle), method of moments estimator (mme), and minimum variance unbiased estimator (mvue) of p is simply X/n.
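A minimal sketch of this relationship (illustrative only, not from the original examples):

set.seed(251)
y <- rbinom(20, size = 1, prob = 0.2)   # 20 Bernoulli (0/1) trials
x <- sum(y)                             # one Binomial(20, p) observation
x / 20                                  # mle/mme/mvue of p; ebinom(y) returns the same estimate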
Confidence Intervals.
ci.method="score"
The confidence interval for p based on the
score method was developed by Wilson (1927) and is discussed by Newcombe (1998a),
Agresti and Coull (1998), and Agresti and Caffo (2000). When
ci=TRUE
and
ci.method="score"
, the function ebinom
calls the R function
prop.test
to compute the confidence interval. This method
has been shown to provide the best performance (in terms of actual coverage matching assumed
coverage) of all the methods provided here, although unlike the exact method, the actual
coverage can fall below the assumed coverage.
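For illustration only (the values x = 7 and n = 20 below are arbitrary), the Wilson score interval can be computed directly from its formula and compared with prop.test(), which ebinom() calls when ci.method="score":

x <- 7            # number of "successes"
n <- 20           # number of trials
conf.level <- 0.95
z <- qnorm(1 - (1 - conf.level) / 2)
p.hat <- x / n
center <- (p.hat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(p.hat * (1 - p.hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
c(LCL = center - half, UCL = center + half)

# Same interval (without continuity correction) from prop.test():
prop.test(x, n, conf.level = conf.level, correct = FALSE)$conf.int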
ci.method="exact"
The confidence interval for p based on the
exact (Clopper-Pearson) method is discussed by Newcombe (1998a), Agresti and Coull (1998),
and Zar (2010, pp.543-547). This is the method used in the R function
binom.test
. This method ensures the actual coverage is greater than or
equal to the assumed coverage.
ci.method="Wald"
The confidence interval for p based on the Wald method
(with or without a correction for continuity) is the usual “normal approximation”
method and is discussed by Newcombe (1998a), Agresti and Coull (1998), Agresti and Caffo (2000),
and Zar (2010, pp.543-547). This method is never recommended but is included
for historical purposes.
ci.method="adjusted Wald"
The confidence interval for p based on the
adjusted Wald method is discussed by Agresti and Coull (1998), Agresti and Caffo (2000), and
Zar (2010, pp.543-547). This is a simple modification of the Wald method and
performs surprisingly well.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The binomial distribution is used to model processes with binary (Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent of any other trial, and that the probability of “success”, p, is the same on each trial. A binomial discrete random variable X is the number of “successes” in n independent trials. A special case of the binomial distribution occurs when size=1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model the proportion of times a chemical concentration exceeds a set standard in a given period of time (e.g., Gilbert, 1987, p.143). The binomial distribution is also used to compute an upper bound on the overall Type I error rate for deciding whether a facility or location is in compliance with some set standard. Assume the null hypothesis is that the facility is in compliance. If a test of hypothesis is conducted periodically over time to test compliance and/or several tests are performed during each time period, and the facility or location is always in compliance, and each single test has a Type I error rate of α, and the result of each test is independent of the result of any other test (usually not a reasonable assumption), then the number of times the facility is declared out of compliance when in fact it is in compliance is a binomial random variable with probability of “success” equal to α, the probability of being declared out of compliance (see USEPA, 2009).
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 3.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
Binomial, prop.test
, binom.test
,
ciBinomHalfWidth
, ciBinomN
,
plotCiBinomDesign
.
# Generate 20 observations from a binomial distribution with # parameters size=1 and prob=0.2, then estimate the 'prob' parameter. # (Note: the call to set.seed simply allows you to reproduce this # example. Also, the only parameter estimated is 'prob'; 'size' is # specified in the call to ebinom. The parameter 'size' is printed # inorder to show all of the parameters associated with the # distribution.) set.seed(251) dat <- rbinom(20, size = 1, prob = 0.2) ebinom(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Binomial # #Estimated Parameter(s): size = 20.0 # prob = 0.1 # #Estimation Method: mle/mme/mvue for 'prob' # #Data: dat # #Sample Size: 20 #---------------------------------------------------------------- # Generate one observation from a binomial distribution with # parameters size=20 and prob=0.2, then estimate the "prob" # parameter and compute a confidence interval: set.seed(763) dat <- rbinom(1, size=20, prob=0.2) ebinom(dat, size = 20, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Binomial # #Estimated Parameter(s): size = 20.00 # prob = 0.35 # #Estimation Method: mle/mme/mvue for 'prob' # #Data: dat # #Sample Size: 20 # #Confidence Interval for: prob # #Confidence Interval Method: Score normal approximation # (With continuity correction) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.1630867 # UCL = 0.5905104 #---------------------------------------------------------------- # Using the data from the last example, compare confidence # intervals based on the various methods ebinom(dat, size = 20, ci = TRUE, ci.method = "score", correct = TRUE)$interval$limits # LCL UCL #0.1630867 0.5905104 ebinom(dat, size = 20, ci = TRUE, ci.method = "score", correct = FALSE)$interval$limits # LCL UCL #0.1811918 0.5671457 ebinom(dat, size = 20, ci = TRUE, ci.method = "exact")$interval$limits # LCL UCL #0.1539092 0.5921885 ebinom(dat, size = 20, ci = TRUE, ci.method = "adjusted Wald")$interval$limits # LCL UCL #0.1799264 0.5684112 ebinom(dat, size = 20, ci = TRUE, ci.method = "Wald", correct = TRUE)$interval$limits # LCL UCL #0.1159627 0.5840373 ebinom(dat, size = 20, ci = TRUE, ci.method = "Wald", correct = FALSE)$interval$limits # LCL UCL #0.1409627 0.5590373 #---------------------------------------------------------------- # Use the cadmium data on page 8-6 of USEPA (1989b) to compute # two-sided 95% confidence intervals for the probability of # detection at background and compliance wells. The data are # stored in EPA.89b.cadmium.df. EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background #... 
#86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance attach(EPA.89b.cadmium.df) # Probability of detection at Background well: #-------------------------------------------- ebinom(!Censored[Well.type=="Background"], ci=TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Binomial # #Estimated Parameter(s): size = 24.0000000 # prob = 0.3333333 # #Estimation Method: mle/mme/mvue for 'prob' # #Data: !Censored[Well.type == "Background"] # #Sample Size: 24 # #Confidence Interval for: prob # #Confidence Interval Method: Score normal approximation # (With continuity correction) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.1642654 # UCL = 0.5530745 # Probability of detection at Compliance well: #-------------------------------------------- ebinom(!Censored[Well.type=="Compliance"], ci=TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Binomial # #Estimated Parameter(s): size = 64.000 # prob = 0.375 # #Estimation Method: mle/mme/mvue for 'prob' # #Data: !Censored[Well.type == "Compliance"] # #Sample Size: 64 # #Confidence Interval for: prob # #Confidence Interval Method: Score normal approximation # (With continuity correction) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.2597567 # UCL = 0.5053034 #---------------------------------------------------------------- # Clean up rm(dat) detach("EPA.89b.cadmium.df")
Produce an empirical cumulative distribution function plot.
ecdfPlot(x, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = 0.375, plot.it = TRUE, add = FALSE, ecdf.col = "black", ecdf.lwd = 3 * par("cex"), ecdf.lty = 1, curve.fill = FALSE, curve.fill.col = "cyan", ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions (empirical probabilities).
Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is |
plot.it |
logical scalar indicating whether to produce a plot or add to the current plot (see |
add |
logical scalar indicating whether to add the empirical cdf to the current plot ( |
ecdf.col |
a numeric scalar or character string determining the color of the empirical cdf line or points.
The default value is |
ecdf.lwd |
a numeric scalar determining the width of the empirical cdf line. The default value is
|
ecdf.lty |
a numeric scalar determining the line type of the empirical cdf line. The default value is
|
curve.fill |
a logical scalar indicating whether to fill in the area below the empirical cdf curve with the
color specified by |
curve.fill.col |
a numeric scalar or character string indicating what color to use to fill in the area below the
empirical cdf curve. The default value is |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The cumulative distribution function (cdf) of a random variable X is the function F such that

F(x) = Pr(X ≤ x)

for all values of x. That is, if p = F(x), then p is the proportion of the population that is less than or equal to x, and x is called the p'th quantile, or the 100p'th percentile. A plot of the quantiles on the x-axis (i.e., the possible values of the random variable X) vs. the fraction of the population less than or equal to that number on the y-axis is called the cumulative distribution function plot, and the y-axis is usually labeled as the “cumulative probability” or “cumulative frequency”.
When we have a sample of data from some population, we usually do not know what percentiles our observations correspond to because we do not know the form of the cumulative distribution function F, so we have to use the sample data to estimate the cdf F. An empirical cumulative distribution function (ecdf) plot, also called a quantile plot, is a plot of the observed quantiles (i.e., the ordered observations) on the x-axis vs. the estimated cumulative probabilities on the y-axis (Chambers et al., 1983, pp. 11-19; Cleveland, 1993, pp. 17-20; Cleveland, 1994, pp. 136-139; Helsel and Hirsch, 1992, pp. 21-24). (Note: Some authors (e.g., Chambers et al., 1983, pp.11-16; Cleveland, 1993, pp.17-20) reverse the axes on a quantile plot, i.e., the observed order statistics from the random sample are on the y-axis and the estimated cumulative probabilities are on the x-axis.)
The empirical cumulative distribution function (ecdf) is an estimate of the cdf based on a random sample of n observations from the distribution. Let x_1, x_2, ..., x_n denote the n observations, and let x_(1), x_(2), ..., x_(n) denote the ordered observations (i.e., the order statistics). The cdf is usually estimated by either the empirical probabilities estimator or the plotting-position estimator. The empirical probabilities estimator is given by:

F̂[x_(i)] = (number of observations less than or equal to x_(i)) / n

The plotting-position estimator is given by:

F̂[x_(i)] = (i - a) / (n - 2a + 1)

where 0 ≤ a ≤ 1 denotes the plotting position constant (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8, 25).
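As an informal illustration (not part of the original help file), the two estimators can be computed by hand for a small sample and compared with ppoints(), which uses the same plotting-position formula:

set.seed(250)
x <- sort(rnorm(10))
n <- length(x)
i <- seq_len(n)

p.emp <- i / n                      # empirical probabilities estimator
a <- 0.375                          # plotting position constant (default in ecdfPlot)
p.pos <- (i - a) / (n - 2 * a + 1)  # plotting-position estimator

cbind(x, p.emp, p.pos)
all.equal(p.pos, ppoints(n, a = 0.375))   # ppoints() uses the same formula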
For any value x such that x_(1) ≤ x ≤ x_(n), the ecdf is usually defined as either a step function:

F̂(x) = F̂[x_(i)],  for x_(i) ≤ x < x_(i+1)

(e.g., D'Agostino, 1986a), or linear interpolation between order statistics is used:

F̂(x) = (1 - r) F̂[x_(i)] + r F̂[x_(i+1)],  for x_(i) ≤ x ≤ x_(i+1),  where r = [x - x_(i)] / [x_(i+1) - x_(i)]

(e.g., Chambers et al., 1983). For the step function version, the ecdf stays flat until it hits a value on the x-axis corresponding to one of the order statistics, then it makes a jump. For the linear interpolation version, the ecdf plot looks like lines connecting the points.
By default, the function ecdfPlot uses the step function version when discrete=TRUE, and the linear interpolation version when discrete=FALSE. The user may override these defaults by supplying the graphics parameter type (type="s" for a step function, type="l" for linear interpolation, type="p" for points only, etc.).
The empirical probabilities estimator is intuitively appealing. This is the estimator used when
prob.method="emp.probs"
. The disadvantage of this estimator is that it implies the largest
observed value is the maximum possible value of the distribution (i.e., the 100'th percentile). This
may be satisfactory if the underlying distribution is known to be discrete, but it is usually not
satisfactory if the underlying distribution is known to be continuous.
The plotting-position estimator with various values of a is often used when the goal is to produce a probability plot (see qqPlot) rather than an empirical cdf plot. It is used to compute the estimated expected values or medians of the order statistics for a probability plot. This is the estimator used when prob.method="plot.pos". The argument plot.pos.con refers to the constant a. Based on certain principles from statistical theory, certain values of the constant a make sense for specific underlying distributions (see the help file for qqPlot for more information).
Because x is a random sample, the empirical cdf changes from sample to sample and the variability in these estimates can be dramatic for small sample sizes.
ecdfPlot
invisibly returns a list with the following components:
Order.Statistics |
numeric vector of the ordered observations. |
Cumulative.Probabilities |
numeric vector of the associated plotting positions. |
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
Chambers et al. (1983, pp.11-16) plot the observed order statistics on the y-axis vs. the ecdf on the x-axis and call this a quantile plot.
Empirical cumulative distribution function (ecdf) plots are often plotted with
theoretical cdf plots (see cdfPlot
and cdfCompare
) to
graphically assess whether a sample of observations comes from a particular
distribution. The Kolmogorov-Smirnov goodness-of-fit test
(see gofTest
) is the statistical companion of this kind of
comparison; it is based on the maximum vertical distance between the empirical
cdf plot and the theoretical cdf plot. More often, however,
quantile-quantile (Q-Q) plots are used instead of ecdf plots to graphically assess
departures from an assumed distribution (see qqPlot
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
ppoints
, cdfPlot
, cdfCompare
,
qqPlot
, ecdfPlotCensored
.
# Generate 20 observations from a normal distribution with # mean=0 and sd=1 and create an ecdf plot. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rnorm(20) dev.new() ecdfPlot(x) #---------- # Repeat the above example, but fill in the area under the # empirical cdf curve. dev.new() ecdfPlot(x, curve.fill = TRUE) #---------- # Repeat the above example, but plot only the points. dev.new() ecdfPlot(x, type = "p") #---------- # Repeat the above example, but force a step function. dev.new() ecdfPlot(x, type = "s") #---------- # Clean up rm(x) #------------------------------------------------------------------------------------- # The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. # # Create an empirical CDF plot for the reference area data. dev.new() with(EPA.94b.tccb.df, ecdfPlot(TcCB[Area == "Reference"], xlab = "TcCB (ppb)")) #========== # Clean up #--------- graphics.off()
Produce an empirical cumulative distribution function plot for Type I left-censored or right-censored data.
ecdfPlotCensored(x, censored, censoring.side = "left", discrete = FALSE, prob.method = "michael-schucany", plot.pos.con = 0.375, plot.it = TRUE, add = FALSE, ecdf.col = 1, ecdf.lwd = 3 * par("cex"), ecdf.lty = 1, include.cen = FALSE, cen.pch = ifelse(censoring.side == "left", 6, 2), cen.cex = par("cex"), cen.col = 4, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions (empirical probabilities).
Possible values are The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is |
plot.it |
logical scalar indicating whether to produce a plot or add to the current plot (see |
add |
logical scalar indicating whether to add the empirical cdf to the current plot ( |
ecdf.col |
a numeric scalar or character string determining the color of the empirical cdf line or points.
The default value is |
ecdf.lwd |
a numeric scalar determining the width of the empirical cdf line. The default value is
|
ecdf.lty |
a numeric scalar determining the line type of the empirical cdf line. The default value is
|
include.cen |
logical scalar indicating whether to include censored values in the plot. The default value is
|
cen.pch |
numeric scalar or character string indicating the plotting character to use to plot censored values.
The default value is |
cen.cex |
numeric scalar that determines the size of the plotting character used to plot censored values.
The default value is the current value of the cex graphics parameter. See the entry for |
cen.col |
numeric scalar or character string that determines the color of the plotting character used to
plot censored values. The default value is |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The function ecdfPlotCensored
does exactly the same thing as
ecdfPlot
, except it calls the function ppointsCensored
to compute the plotting positions (estimated cumulative probabilities) for the
uncensored observations.
If plot.it=TRUE
, the estimated cumulative probabilities for the uncensored
observations are plotted against the uncensored observations. By default, the
function ecdfPlotCensored
plots a step function when discrete=TRUE
,
and plots a straight line between points when discrete=FALSE
. The user may
override these defaults by supplying the graphics parameter
type (type="s"
for a step function, type="l"
for linear interpolation,
type="p"
for points only, etc.).
If include.cen=TRUE
, censored observations are included on the plot as points. The arguments
cen.pch
, cen.cex
, and cen.col
control the appearance of these points.
In cases where x is a random sample, the empirical cdf will change from sample to sample and the variability in these estimates can be dramatic for small sample sizes. Caution must be used in interpreting the empirical cdf when a large percentage of the observations are censored.
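A minimal sketch (illustrative only, with argument names as described in the help file for ppointsCensored) of obtaining the plotting positions directly from ppointsCensored(), which ecdfPlotCensored() uses internally:

library(EnvStats)

set.seed(333)
x <- rnorm(15, mean = 20, sd = 5)
censored <- x < 18
x[censored] <- 18    # single left-censoring level at 18

pp <- ppointsCensored(x, censored, censoring.side = "left",
    prob.method = "kaplan-meier")
str(pp)              # ordered observations, plotting positions, censoring indicators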
ecdfPlotCensored
returns a list with the following components:
Order.Statistics |
numeric vector of the “ordered” observations. |
Cumulative.Probabilities |
numeric vector of the associated plotting positions. |
Censored |
logical vector indicating which of the ordered observations are censored. |
Censoring.Side |
character string indicating whether the data are left- or right-censored.
This is same value as the argument |
Prob.Method |
character string indicating what method was used to compute the plotting positions.
This is the same value as the argument |
Optional Component (only present when prob.method="michael-schucany"
or prob.method="hirsch-stedinger"
):
Plot.Pos.Con |
numeric scalar containing the value of the plotting position constant that was used.
This is the same as the argument |
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data.
Censored observations complicate the procedures used to graphically explore data. Techniques from
survival analysis and life testing have been developed to generalize the procedures for constructing
plotting positions, empirical cdf plots, and q-q plots to data sets with censored observations
(see ppointsCensored
).
Empirical cumulative distribution function (ecdf) plots are often plotted with theoretical cdf plots
(see cdfPlot
and cdfCompareCensored
) to graphically assess whether a
sample of observations comes from a particular distribution. More often, however, quantile-quantile
(Q-Q) plots are used instead (see qqPlot
and qqPlotCensored
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
ppoints
, ppointsCensored
, ecdfPlot
,
qqPlot
, qqPlotCensored
, cdfPlot
,
cdfCompareCensored
.
# Generate 20 observations from a normal distribution with mean=20 and sd=5, # censor all observations less than 18, then generate an empirical cdf plot # for the complete data set and the censored data set. Note that the empirical # cdf plot for the censored data set starts at the first ordered uncensored # observation, and that for values of x > 18 the two emprical cdf plots are # exactly the same. This is because there is only one censoring level and # no uncensored observations fall below the censored observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(333) x <- rnorm(20, mean=20, sd=5) censored <- x < 18 sum(censored) #[1] 7 new.x <- x new.x[censored] <- 18 dev.new() ecdfPlot(x, xlim = range(pretty(x)), main = "Empirical CDF Plot for\nComplete Data Set") dev.new() ecdfPlotCensored(new.x, censored, xlim = range(pretty(x)), main="Empirical CDF Plot for\nCensored Data Set") # Clean up #--------- rm(x, censored, new.x) #------------------------------------------------------------------------------------ # Example 15-1 of USEPA (2009, page 15-10) gives an example of # computing plotting positions based on censored manganese # concentrations (ppb) in groundwater collected at 5 monitoring # wells. The data for this example are stored in # EPA.09.Ex.15.1.manganese.df. Here we will create an empirical # CDF plot based on the Kaplan-Meier method. EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #4 4 Well.1 21.6 21.6 FALSE #5 5 Well.1 <2 2.0 TRUE #... #21 1 Well.5 17.9 17.9 FALSE #22 2 Well.5 22.7 22.7 FALSE #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE dev.new() with(EPA.09.Ex.15.1.manganese.df, ecdfPlotCensored(Manganese.ppb, Censored, prob.method = "kaplan-meier", ecdf.col = "blue", main = "Empirical CDF of Manganese Data\nBased on Kaplan-Meier")) #========== # Clean up #--------- graphics.off()
Estimate the location and scale parameters of an extreme value distribution, and optionally construct a confidence interval for one of the parameters.
eevd(x, method = "mle", pwme.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), ci = FALSE, ci.parameter = "location", ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are "mle" (maximum likelihood; the default), "mme" (method of moments), "mmue" (method of moments based on the unbiased estimator of variance), and "pwme" (probability-weighted moments). See the DETAILS section for more information on these estimation methods. |
pwme.method |
character string specifying what method to use to compute the
probability-weighted moments when |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
ci |
logical scalar indicating whether to compute a confidence interval for the
location or scale parameter. The default value is |
ci.parameter |
character string indicating the parameter for which the confidence interval is
desired. The possible values are |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the location or scale parameter. Currently, the only possible value is
|
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let x = (x_1, x_2, ..., x_n) be a vector of n observations from an extreme value distribution with parameters location=η and scale=θ.
Estimation
Maximum Likelihood Estimation (method="mle")
The maximum likelihood estimators (mle's) of η and θ are the solutions of the simultaneous equations (Forbes et al., 2011):

η̂ = -θ̂ log[ (1/n) Σ exp(-x_i/θ̂) ]

θ̂ = x̄ - [ Σ x_i exp(-x_i/θ̂) ] / [ Σ exp(-x_i/θ̂) ]

where x̄ denotes the sample mean.
Method of Moments Estimation (method="mme")
The method of moments estimators (mme's) of η and θ are given by (Johnson et al., 1995, p.27):

η̂ = x̄ - γ θ̂

θ̂ = (√6 / π) s_m

where γ ≈ 0.5772 denotes Euler's constant, x̄ denotes the sample mean, and s_m denotes the square root of the method of moments estimator of variance:

s_m^2 = (1/n) Σ (x_i - x̄)^2
Method of Moments Estimators Based on the Unbiased Estimator of Variance (method="mmue")
These estimators are the same as the method of moments estimators except that the method of moments estimator of variance is replaced with the unbiased estimator of variance:

s^2 = [1/(n - 1)] Σ (x_i - x̄)^2
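As an informal check of these formulas (not part of the original examples), the following sketch computes the mme and mmue estimates by hand and compares them with eevd():

library(EnvStats)

set.seed(250)
x <- revd(20, location = 2, scale = 1)

s.mme  <- sqrt(mean((x - mean(x))^2))   # method of moments sd (divide by n)
s.mmue <- sd(x)                         # unbiased-variance version (divide by n - 1)
euler  <- -digamma(1)                   # Euler's constant, about 0.5772

c(location = mean(x) - euler * s.mme * sqrt(6) / pi,
  scale    = s.mme * sqrt(6) / pi)      # compare with eevd(x, method = "mme")$parameters
c(location = mean(x) - euler * s.mmue * sqrt(6) / pi,
  scale    = s.mmue * sqrt(6) / pi)     # compare with eevd(x, method = "mmue")$parameters

eevd(x, method = "mme")$parameters
eevd(x, method = "mmue")$parameters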
Probability-Weighted Moments Estimation (method="pwme")
Greenwood et al. (1979) show that the relationship between the distribution parameters η and θ and the probability-weighted moments is given by:

η = M(1, 0, 0) - γ θ

θ = [2 M(1, 1, 0) - M(1, 0, 0)] / log(2)

where M(1, j, k) = E[X F(X)^j (1 - F(X))^k] denotes a probability-weighted moment and γ denotes Euler's constant. The probability-weighted moment estimators (pwme's) of η and θ are computed by simply replacing the M's in the above two equations with estimates of the M's (and, for the estimate of η, replacing θ with its estimated value). See the help file for pwMoment for more information on how to estimate the M's. Also, see Landwehr et al. (1979) for an example of this method of estimation using the unbiased (U-statistic type) probability-weighted moment estimators. Hosking et al. (1985) note that this method of estimation using the U-statistic type probability-weighted moments is equivalent to Downton's (1966) linear estimates with linear coefficients.
Confidence Intervals
When ci=TRUE, an approximate 100(1 - α)% confidence interval for η can be constructed assuming the distribution of the estimator of η is approximately normally distributed. A two-sided confidence interval is constructed as:

[ η̂ - t(n - 1, 1 - α/2) σ̂(η̂),  η̂ + t(n - 1, 1 - α/2) σ̂(η̂) ]

where t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom, and the quantity σ̂(η̂) denotes the estimated asymptotic standard deviation of the estimator of η.
Similarly, a two-sided confidence interval for θ is constructed as:

[ θ̂ - t(n - 1, 1 - α/2) σ̂(θ̂),  θ̂ + t(n - 1, 1 - α/2) σ̂(θ̂) ]

One-sided confidence intervals for η and θ are computed in a similar fashion.
Maximum Likelihood (method="mle")
Downton (1966) shows that the estimated asymptotic variances of the mle's of η and θ are given by:

Var(η̂) = (θ̂^2 / n) [1 + 6(1 - γ)^2 / π^2]

Var(θ̂) = (θ̂^2 / n) (6 / π^2)

where γ ≈ 0.5772 denotes Euler's constant.
Method of Moments (method="mme" or method="mmue")
Tiago de Oliveira (1963) and Johnson et al. (1995, p.27) give the estimated asymptotic variances of the mme's of η and θ; the expressions involve the skew and kurtosis of the extreme value distribution and Euler's constant γ. The estimated asymptotic variances of the mmue's of η and θ are the same, except that the method of moments estimator of variance is replaced with the unbiased estimator of variance.
Probability-Weighted Moments (method="pwme")
As stated above, Hosking et al. (1985) note that this method of estimation using the U-statistic type probability-weighted moments is equivalent to Downton's (1966) linear estimates with linear coefficients. Downton (1966) provides exact values of the variances of the estimates of the location and scale parameters for the smallest extreme value distribution. For the largest extreme value distribution, the formula for the estimate of scale is the same, but the formula for the estimate of location must be modified; the modified form of Downton's (1966) equation (3.4) involves Euler's constant γ and quantities defined in Downton (1966, p.8). Using Downton's (1966) equations (3.9)-(3.12), the exact variance of the pwme of η can be derived. Note that when method="pwme" and pwme.method="plotting.position", these are only the asymptotically correct variances.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
There are three families of extreme value distributions. The one described here is the Type I, also called the Gumbel extreme value distribution or simply Gumbel distribution. The name “extreme value” comes from the fact that this distribution is the limiting distribution (as $n$ approaches infinity) of the greatest value among $n$ independent random variables each having the same continuous distribution.
The Gumbel extreme value distribution is related to the exponential distribution as follows. Let $W$ be an exponential random variable with parameter rate=1. Then $Y = \eta - \theta \log(W)$ has an extreme value distribution with parameters location=$\eta$ and scale=$\theta$.
The distribution described above and assumed by eevd is the largest extreme value distribution. The smallest extreme value distribution is the limiting distribution (as $n$ approaches infinity) of the smallest value among $n$ independent random variables each having the same continuous distribution. If $X$ has a largest extreme value distribution with parameters location=$\eta$ and scale=$\theta$, then $Y = -X$ has a smallest extreme value distribution with parameters location=$-\eta$ and scale=$\theta$. The smallest extreme value distribution is related to the Weibull distribution as follows. Let $W$ be a Weibull random variable with parameters shape=$\beta$ and scale=$\alpha$. Then $Y = \log(W)$ has a smallest extreme value distribution with parameters location=$\log(\alpha)$ and scale=$1/\beta$.
The extreme value distribution has been used extensively to model the distribution of streamflow, flooding, rainfall, temperature, wind speed, and other meteorological variables, as well as material strength and life data.
Steven P. Millard ([email protected])
Castillo, E. (1988). Extreme Value Theory in Engineering. Academic Press, New York, pp.184–198.
Downton, F. (1966). Linear Estimates of Parameters in the Extreme Value Distribution. Technometrics 8(1), 3–17.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Tiago de Oliveira, J. (1963). Decision Results for the Parameters of the Extreme Value (Gumbel) Distribution Based on the Mean and Standard Deviation. Trabajos de Estadistica 14, 61–81.
Extreme Value Distribution, Euler's Constant.
# Generate 20 observations from an extreme value distribution with # parameters location=2 and scale=1, then estimate the parameters # and construct a 90% confidence interval for the location parameter. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- revd(20, location = 2) eevd(dat, ci = TRUE, conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Extreme Value # #Estimated Parameter(s): location = 1.9684093 # scale = 0.7481955 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: location # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Interval: LCL = 1.663809 # UCL = 2.273009 #---------- #Compare the values of the different types of estimators: eevd(dat, method = "mle")$parameters # location scale #1.9684093 0.7481955 eevd(dat, method = "mme")$parameters # location scale #1.9575980 0.8339256 eevd(dat, method = "mmue")$parameters # location scale #1.9450932 0.8555896 eevd(dat, method = "pwme")$parameters # location scale #1.9434922 0.8583633 #---------- # Clean up #--------- rm(dat)
Estimate the rate parameter of an exponential distribution, and optionally construct a confidence interval for the rate parameter.
eexp(x, method = "mle/mme", ci = FALSE, ci.type = "two-sided", ci.method = "exact", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Currently the only possible value is "mle/mme" (maximum likelihood/method of moments; the default). |

ci |
logical scalar indicating whether to compute a confidence interval for the rate parameter. The default value is ci=FALSE. |

ci.type |
character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". |

ci.method |
character string indicating what method to use to construct the confidence interval for the rate parameter. Currently, the only possible value is "exact" (the default). |

conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from an exponential distribution with parameter rate=$\lambda$.

Estimation

The maximum likelihood estimator (mle) of $\lambda$ is given by:

$$\hat{\lambda} = \frac{1}{\bar{x}}$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

(Forbes et al., 2011). That is, the mle is the reciprocal of the sample mean.
Sometimes the exponential distribution is parameterized with a scale parameter instead of a rate parameter. The scale parameter is the reciprocal of the rate parameter, and the sample mean is both the mle and the minimum variance unbiased estimator (mvue) of the scale parameter.
Confidence Interval

When ci=TRUE, an exact $(1-\alpha)100\%$ confidence interval for $\lambda$ can be constructed based on the relationship between the exponential distribution, the gamma distribution, and the chi-square distribution. An exponential distribution with parameter rate=$\lambda$ is equivalent to a gamma distribution with parameters shape=1 and scale=$1/\lambda$. The sum of $n$ iid gamma random variables with parameters shape=1 and scale=$1/\lambda$ is a gamma random variable with parameters shape=$n$ and scale=$1/\lambda$. Finally, a gamma distribution with parameters shape=$n$ and scale=$1/\lambda$ is equivalent to $1/(2\lambda)$ times a chi-square distribution with degrees of freedom df=$2n$. Thus, the quantity

$$2 \lambda \sum_{i=1}^{n} x_i = 2 n \lambda \bar{x}$$

has a chi-square distribution with degrees of freedom df=$2n$.

A two-sided $(1-\alpha)100\%$ confidence interval for $\lambda$ is therefore constructed as:

$$\left[\frac{\chi^2(2n, \alpha/2)}{2 n \bar{x}}, \;\; \frac{\chi^2(2n, 1-\alpha/2)}{2 n \bar{x}}\right]$$

where $\chi^2(\nu, p)$ is the $p$'th quantile of a chi-square distribution with $\nu$ degrees of freedom.

One-sided confidence intervals are computed in a similar fashion.
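A minimal sketch of this exact chi-square-based interval, which should match the interval returned by eexp in the example below; object names are arbitrary.

# A minimal sketch of the exact confidence interval for the rate parameter
set.seed(250)
dat <- rexp(20, rate = 2)
n <- length(dat)

# Two-sided 90% confidence interval for the rate parameter
qchisq(c(0.05, 0.95), df = 2 * n) / (2 * sum(dat))
# Should match the interval reported by eexp(dat, ci = TRUE, conf = 0.9)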
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The exponential distribution is a special case of the gamma distribution, and takes on positive real values. A major use of the exponential distribution is in life testing where it is used to model the lifetime of a product, part, person, etc.
The exponential distribution is the only continuous distribution with a “lack of memory” property. That is, if the lifetime of a part follows the exponential distribution, then the distribution of the time until failure is the same as the distribution of the time until failure given that the part has survived to time $t$.
The exponential distribution is related to the double exponential (also called Laplace) distribution, and to the extreme value distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
# Generate 20 observations from an exponential distribution with parameter # rate=2, then estimate the parameter and construct a 90% confidence interval. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rexp(20, rate = 2) eexp(dat, ci=TRUE, conf = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Exponential # #Estimated Parameter(s): rate = 2.260587 # #Estimation Method: mle/mme # #Data: dat # #Sample Size: 20 # #Confidence Interval for: rate # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Interval: LCL = 1.498165 # UCL = 3.151173 #---------- # Clean up #--------- rm(dat)
Estimate the shape and scale parameters (or the mean and coefficient of variation) of a Gamma distribution.
egamma(x, method = "mle", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", normal.approx.transform = "kulkarni.powar", conf.level = 0.95) egammaAlt(x, method = "mle", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", normal.approx.transform = "kulkarni.powar", conf.level = 0.95)
x |
numeric vector of non-negative observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |

method |
character string specifying the method of estimation. The possible values are "mle" (maximum likelihood; the default), "bcmle" (bias-corrected mle), "mme" (method of moments), and "mmue" (method of moments based on the unbiased estimator of variance). See the DETAILS section for more information. |

ci |
logical scalar indicating whether to compute a confidence interval for the mean. The default value is ci=FALSE. |

ci.type |
character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". |

ci.method |
character string indicating which method to use to construct the confidence interval. Possible values are "normal.approx" (the default), "profile.likelihood", "chisq.approx", and "chisq.adj". |

normal.approx.transform |
character string indicating which power transformation to use when ci.method="normal.approx". Possible values are "kulkarni.powar" (the default), "cube.root", and "fourth.root". |

conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a random sample of $n$ observations from a gamma distribution with parameters shape=$\kappa$ and scale=$\theta$. The relationship between these parameters and the mean (mean=$\mu$) and coefficient of variation (cv=$\tau$) of this distribution is given by:

$$\mu = \kappa \theta \;\;\;\; (1)$$

$$\tau = \kappa^{-1/2} \;\;\;\; (2)$$

$$\kappa = \tau^{-2} \;\;\;\; (3)$$

$$\theta = \mu \tau^2 \;\;\;\; (4)$$

The function egamma returns estimates of the shape and scale parameters. The function egammaAlt returns estimates of the mean ($\mu$) and coefficient of variation ($\tau$) based on the estimates of the shape and scale parameters.
Estimation

Maximum Likelihood Estimation (method="mle")

The maximum likelihood estimators (mle's) of the shape and scale parameters $\kappa$ and $\theta$ are solutions of the simultaneous equations:

$$\log(\hat{\kappa}_{mle}) - \psi(\hat{\kappa}_{mle}) = \log(\bar{x}) - \frac{1}{n} \sum_{i=1}^{n} \log(x_i) \;\;\;\; (5)$$

$$\hat{\theta}_{mle} = \frac{\bar{x}}{\hat{\kappa}_{mle}} \;\;\;\; (6)$$

where $\psi()$ denotes the digamma function, and $\bar{x}$ denotes the sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

(Forbes et al., 2011, chapter 22; Johnson et al., 1994, chapter 17).
Bias-Corrected Maximum Likelihood Estimation (method="bcmle")

The “bias-corrected” maximum likelihood estimator of the shape parameter is based on the suggestion of Anderson and Ray (1975; see also Johnson et al., 1994, p.366 and Singh et al., 2010b, p.48), who noted that the bias of the maximum likelihood estimator of the shape parameter can be considerable when the sample size is small. This estimator is given by:

$$\hat{\kappa}_{bcmle} = \frac{n-3}{n} \, \hat{\kappa}_{mle} + \frac{2}{3n}$$

When method="bcmle", Equation (6) above is modified so that the estimate of the scale parameter is based on the “bias-corrected” maximum likelihood estimator of the shape parameter:

$$\hat{\theta}_{bcmle} = \frac{\bar{x}}{\hat{\kappa}_{bcmle}}$$
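The following sketch solves the likelihood equation for the shape parameter numerically with uniroot and then applies the small-sample bias correction just described. It is an illustration of the equations only, not the internal code of egamma; the object names and search interval are arbitrary.

# A sketch of the mle and bias-corrected mle of the gamma shape and scale
set.seed(250)
x <- rgamma(20, shape = 3, scale = 2)
n <- length(x)

# mle of shape: solve log(k) - digamma(k) = log(mean(x)) - mean(log(x))
rhs <- log(mean(x)) - mean(log(x))
shape.mle <- uniroot(function(k) log(k) - digamma(k) - rhs,
  interval = c(0.01, 100))$root
scale.mle <- mean(x) / shape.mle

# bias-corrected mle of shape and the corresponding estimate of scale
shape.bcmle <- (n - 3) / n * shape.mle + 2 / (3 * n)
scale.bcmle <- mean(x) / shape.bcmle

c(shape = shape.mle,   scale = scale.mle)    # compare with egamma(x)$parameters
c(shape = shape.bcmle, scale = scale.bcmle)  # compare with egamma(x, method = "bcmle")$parameters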
Method of Moments Estimation (method="mme")

The method of moments estimators (mme's) of the shape and scale parameters $\kappa$ and $\theta$ are:

$$\hat{\kappa}_{mme} = \left(\frac{\bar{x}}{s_m}\right)^2$$

$$\hat{\theta}_{mme} = \frac{s_m^2}{\bar{x}}$$

where $s_m^2$ denotes the method of moments estimator of variance:

$$s_m^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Method of Moments Estimation Based on the Unbiased Estimator of Variance (method="mmue")

The method of moments estimators based on the unbiased estimator of variance are exactly the same as the method of moments estimators, except that the method of moments estimator of variance is replaced with the unbiased estimator of variance:

$$\hat{\kappa}_{mmue} = \left(\frac{\bar{x}}{s}\right)^2$$

$$\hat{\theta}_{mmue} = \frac{s^2}{\bar{x}}$$

where $s^2$ denotes the unbiased estimator of variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
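A minimal sketch of the two moment-based estimators described above; object names are arbitrary.

# Method of moments estimators for the gamma shape and scale
set.seed(250)
x <- rgamma(20, shape = 3, scale = 2)

s2.m <- mean((x - mean(x))^2)                          # mme of variance
c(shape = mean(x)^2 / s2.m, scale = s2.m / mean(x))    # compare with egamma(x, method = "mme")$parameters

s2.u <- var(x)                                         # unbiased estimator of variance
c(shape = mean(x)^2 / s2.u, scale = s2.u / mean(x))    # compare with egamma(x, method = "mmue")$parameters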
Confidence Intervals

This section discusses how confidence intervals for the mean $\mu$ are computed.

Normal Approximation (ci.method="normal.approx")

The normal approximation method is based on the method of Kulkarni and Powar (2010), who use a power transformation of the original data to approximate a sample from a normal distribution, compute the confidence interval for the mean on the transformed scale using the usual formula for a confidence interval for the mean of a normal distribution, and then transform the limits back to the original space using equations based on the expected value of a gamma random variable raised to a power.
The particular power used for the normal approximation is defined by the argument normal.approx.transform. The value normal.approx.transform="cube.root" uses the cube root transformation suggested by Wilson and Hilferty (1931), and the value "fourth.root" uses the fourth root transformation suggested by Hawkins and Wixley (1986). The default value "kulkarni.powar" uses the “Optimum Power Normal Approximation Method” of Kulkarni and Powar (2010), who show this method performs the best in terms of maintaining coverage and minimizing confidence interval width compared to eight other methods. The “optimum” power $r$ is determined by:

$$r = -0.0705 - 0.178 \hat{\kappa} + 0.475 \sqrt{\hat{\kappa}} \;\; \mbox{ if } \hat{\kappa} \le 1.5$$

$$r = 0.246 \;\; \mbox{ if } \hat{\kappa} > 1.5 \;\;\;\; (16)$$

where $\hat{\kappa}$ denotes the estimate of the shape parameter. Kulkarni and Powar (2010) derived this equation by determining what power transformation yields a skew closest to 0 and a kurtosis closest to 3 for a gamma random variable with a given shape parameter.
Although Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to
determine the power to use to induce approximate normality, for the functions
egamma
and egammaAlt
the power is based on whatever estimate of
shape is used (e.g., method="mle"
, method="bcmle"
, etc.).
Likelihood Profile (ci.method="profile.likelihood")

This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988) introduced an efficient method of computation. This method is also discussed by Stryhn and Christensen (2003) and Royston (2007).

The idea behind this method is to invert the likelihood-ratio test to obtain a confidence interval for the mean $\mu$ while treating the coefficient of variation $\tau$ as a nuisance parameter. The likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = \prod_{i=1}^{n} \frac{x_i^{\kappa - 1} \, e^{-x_i/\theta}}{\theta^{\kappa} \, \Gamma(\kappa)}$$

where $\kappa$, $\theta$, $\mu$, and $\tau$ are defined in Equations (1)-(4) above, and $\Gamma(t)$ denotes the Gamma function evaluated at $t$.

Following Stryhn and Christensen (2003), denote the maximum likelihood estimates of the mean and coefficient of variation by $(\mu^*, \tau^*)$. The likelihood ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$ (where $\mu_0$ is a fixed value) equals the drop in $2 \log(L)$ between the “full” model and the reduced model with $\mu$ fixed at $\mu_0$, i.e.,

$$G^2 = 2 \left[\log L(\mu^*, \tau^*) - \log L(\mu_0, \tau_0^*)\right]$$

where $\tau_0^*$ is the maximum likelihood estimate of $\tau$ for the reduced model (i.e., when $\mu = \mu_0$). Under the null hypothesis, the test statistic $G^2$ follows a chi-squared distribution with 1 degree of freedom.

Alternatively, we may express the test statistic in terms of the profile likelihood function $L_1$ for the mean $\mu$, which is obtained from the usual likelihood function by maximizing over the parameter $\tau$, i.e.,

$$L_1(\mu) = \max_{\tau} L(\mu, \tau)$$

Then we have

$$G^2 = 2 \left[\log L_1(\mu^*) - \log L_1(\mu_0)\right]$$

A two-sided $(1-\alpha)100\%$ confidence interval for the mean $\mu$ consists of all values of $\mu_0$ for which the test is not significant at level $\alpha$:

$$\left\{\mu_0 : G^2 \le \chi^2_{1, 1-\alpha}\right\} \;\;\;\; (21)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the chi-squared distribution with $\nu$ degrees of freedom. One-sided lower and one-sided upper confidence intervals are computed in a similar fashion, except that the quantity $\chi^2_{1, 1-\alpha}$ in Equation (21) is replaced with $\chi^2_{1, 1-2\alpha}$.
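The following sketch illustrates the profile-likelihood construction by writing the gamma log-likelihood in terms of the mean and coefficient of variation, profiling out the coefficient of variation with optimize, and inverting the likelihood-ratio test with uniroot. It is an illustration only (not the internal code of egamma); the object names and search intervals are arbitrary choices.

# A sketch of the profile-likelihood confidence interval for the mean
set.seed(250)
x <- rgamma(20, shape = 3, scale = 2)

# Gamma log-likelihood written in terms of the mean (mu) and cv (tau)
loglik <- function(mu, tau, x) {
  shape <- 1 / tau^2
  scale <- mu * tau^2
  sum(dgamma(x, shape = shape, scale = scale, log = TRUE))
}

# Profile log-likelihood for the mean: maximize over tau for fixed mu
profile.loglik <- function(mu, x) {
  optimize(function(tau) loglik(mu, tau, x), interval = c(0.01, 10),
    maximum = TRUE)$objective
}

mu.mle <- mean(x)                       # the mle of the mean is the sample mean
max.ll <- profile.loglik(mu.mle, x)
crit <- qchisq(0.95, df = 1)            # two-sided 95% interval

# The confidence limits are the values of mu where the deviance equals crit
dev <- function(mu) 2 * (max.ll - profile.loglik(mu, x)) - crit
lcl <- uniroot(dev, lower = min(x), upper = mu.mle)$root
ucl <- uniroot(dev, lower = mu.mle, upper = 3 * mu.mle)$root
c(LCL = lcl, UCL = ucl)
# Compare with egamma(x, ci = TRUE, ci.method = "profile.likelihood")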
Chi-Square Approximation (ci.method="chisq.approx")

This method is based on the relationship between the sample mean of the gamma distribution and the chi-squared distribution (Grice and Bain, 1980):

$$\frac{2 n \kappa \bar{x}}{\mu} \sim \chi^2_{2 n \kappa}$$

Therefore, an exact one-sided upper $(1-\alpha)100\%$ confidence interval for the mean $\mu$ is given by:

$$\left[0, \;\; \frac{2 n \kappa \bar{x}}{\chi^2_{2 n \kappa, \, \alpha}}\right] \;\;\;\; (23)$$

an exact one-sided lower $(1-\alpha)100\%$ confidence interval is given by:

$$\left[\frac{2 n \kappa \bar{x}}{\chi^2_{2 n \kappa, \, 1-\alpha}}, \;\; \infty\right] \;\;\;\; (24)$$

and a two-sided $(1-\alpha)100\%$ confidence interval is given by:

$$\left[\frac{2 n \kappa \bar{x}}{\chi^2_{2 n \kappa, \, 1-\alpha/2}}, \;\; \frac{2 n \kappa \bar{x}}{\chi^2_{2 n \kappa, \, \alpha/2}}\right] \;\;\;\; (25)$$

Because this method is exact only when the shape parameter $\kappa$ is known, the method used here is called the “chi-square approximation” method because the estimate of the shape parameter, $\hat{\kappa}$, is used in place of $\kappa$ in Equations (23)-(25) above. The Chi-Square Approximation method is not the method proposed by Grice and Bain (1980), in which the confidence interval is adjusted to account for the fact that the shape parameter $\kappa$ is estimated (see the explanation of the Chi-Square Adjusted method below). The Chi-Square Approximation method used by egamma and egammaAlt is equivalent to the “approximate gamma” method of ProUCL (USEPA, 2015, equation (2-34), p.62).
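A minimal sketch of the chi-square approximation limit described above, substituting an estimate of the shape parameter for the unknown true value; object names are arbitrary, and this is not the internal code of egamma.

# A sketch of the chi-square approximation upper confidence limit for the mean
set.seed(250)
x <- rgamma(20, shape = 3, scale = 2)
n <- length(x)
shape.hat <- unname(egamma(x)$parameters["shape"])   # mle of shape

# One-sided upper 90% confidence limit for the mean
2 * n * shape.hat * mean(x) / qchisq(0.10, df = 2 * n * shape.hat)
# Compare with egamma(x, ci = TRUE, ci.type = "upper", conf.level = 0.9,
#   ci.method = "chisq.approx")$interval$limits["UCL"]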
Chi-Square Adjusted (ci.method="chisq.adj")

This is the same method as the Chi-Square Approximation method discussed above, except that the value of $\alpha$ is adjusted to account for the fact that the shape parameter $\kappa$ is estimated rather than known. Grice and Bain (1980) performed Monte Carlo simulations to determine how to adjust $\alpha$, and the values in their Table 2 are given in the matrix Grice.Bain.80.mat. This method requires that the sample size $n$ is at least 5 and the confidence level is between 75% and 99.5% (except when $n = 5$, in which case the confidence level must be less than 99%). For values of the sample size $n$ and/or $\alpha$ that are not listed in the table, linear interpolation is used (when the sample size $n$ is greater than 40, linear interpolation on $1/n$ is used, as recommended by Grice and Bain (1980)). The Chi-Square Adjusted method used by egamma and egammaAlt is equivalent to the “adjusted gamma” method of ProUCL (USEPA, 2015, equation (2-35), p.63).
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
When ci=TRUE
and ci.method="normal.approx"
, it is possible for the
lower confidence limit based on the transformed data to be less than 0.
In this case, the lower confidence limit on the original scale is set to 0 and a warning is
issued stating that the normal approximation is not accurate in this case.
The gamma distribution takes values on the positive real line. Special cases of the gamma are the exponential distribution and the chi-square distributions. Applications of the gamma include life testing, statistical ecology, queuing theory, inventory control, and precipitation processes. A gamma distribution starts to resemble a normal distribution as the shape parameter tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
Steven P. Millard ([email protected])
Anderson, C.W., and W.D. Ray. (1975). Improved Maximum Likelihood Estimators for the Gamma Distribution. Communications in Statistics, 4, 437–448.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Grice, J.V., and L.J. Bain. (1980). Inferences Concerning the Mean of the Gamma Distribution. Journal of the American Statistical Association, 75, 929-933.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
GammaDist
, estimate.object
, eqgamma
,
predIntGamma
, tolIntGamma
.
# Generate 20 observations from a gamma distribution with parameters # shape=3 and scale=2, then estimate the parameters. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rgamma(20, shape = 3, scale = 2) egamma(dat, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.203862 # scale = 2.174928 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: mean # #Confidence Interval Method: Optimum Power Normal Approximation # of Kulkarni & Powar (2010) # using mle of 'shape' # #Normal Transform Power: 0.246 # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 3.361652 # UCL = 6.746794 # Clean up rm(dat) #==================================================================== # Using the reference area TcCB data in EPA.94b.tccb.df, assume a # gamma distribution, estimate the parameters based on the # bias-corrected mle of shape, and compute a one-sided upper 90% # confidence interval for the mean. #---------- # First test to see whether the data appear to follow a gamma # distribution. with(EPA.94b.tccb.df, gofTest(TcCB[Area == "Reference"], dist = "gamma", est.arg.list = list(method = "bcmle")) ) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 4.5695247 # scale = 0.1309788 # #Estimation Method: bcmle # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Test Statistic: W = 0.9703827 # #Test Statistic Parameter: n = 47 # #P-value: 0.2739512 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. #---------- # Now estimate the paramters and compute the upper confidence limit. with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "bcmle", ci = TRUE, ci.type = "upper", conf.level = 0.9) ) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 4.5695247 # scale = 0.1309788 # #Estimation Method: bcmle # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Optimum Power Normal Approximation # of Kulkarni & Powar (2010) # using bcmle of 'shape' # #Normal Transform Power: 0.246 # #Confidence Interval Type: upper # #Confidence Level: 90% # #Confidence Interval: LCL = 0.0000000 # UCL = 0.6561838 #------------------------------------------------------------------ # Repeat the above example but use the alternative parameterization. 
with(EPA.94b.tccb.df, egammaAlt(TcCB[Area == "Reference"], method = "bcmle", ci = TRUE, ci.type = "upper", conf.level = 0.9) ) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): mean = 0.5985106 # cv = 0.4678046 # #Estimation Method: bcmle of 'shape' # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Optimum Power Normal Approximation # of Kulkarni & Powar (2010) # using bcmle of 'shape' # #Normal Transform Power: 0.246 # #Confidence Interval Type: upper # #Confidence Level: 90% # #Confidence Interval: LCL = 0.0000000 # UCL = 0.6561838 #------------------------------------------------------------------ # Compare the upper confidence limit based on # 1) the default method: # normal approximation method based on Kulkarni and Powar (2010) # 2) Profile Likelihood # 3) Chi-Square Approximation # 4) Chi-Square Adjusted # Default Method #--------------- with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "bcmle", ci = TRUE, ci.type = "upper", conf.level = 0.9)$interval$limits["UCL"] ) # UCL #0.6561838 # Profile Likelihood #------------------- with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "mle", ci = TRUE, ci.type = "upper", conf.level = 0.9, ci.method = "profile.likelihood")$interval$limits["UCL"] ) # UCL #0.6527009 # Chi-Square Approximation #------------------------- with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "mle", ci = TRUE, ci.type = "upper", conf.level = 0.9, ci.method = "chisq.approx")$interval$limits["UCL"] ) # UCL #0.6532188 # Chi-Square Adjusted #-------------------- with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "mle", ci = TRUE, ci.type = "upper", conf.level = 0.9, ci.method = "chisq.adj")$interval$limits["UCL"] ) # UCL #0.65467
Estimate the mean and coefficient of variation of a gamma distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
egammaAltCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ci.sample.size = sum(!censored))
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |

censored |
numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to censored elements of x. If the mode of censored is "numeric", a value of 1 indicates a censored observation and a value of 0 an uncensored one. |

method |
character string specifying the method of estimation. Currently, the only available method is maximum likelihood (method="mle"; the default). |

censoring.side |
character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right". |

ci |
logical scalar indicating whether to compute a confidence interval for the mean. The default value is ci=FALSE. |

ci.method |
character string indicating what method to use to construct the confidence interval for the mean. The possible values are "profile.likelihood" (profile likelihood; the default), "normal.approx" (normal approximation), and "bootstrap" (based on bootstrapping). |

ci.type |
character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". |

conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. |

n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the confidence interval for the mean when ci.method="bootstrap". The default value is n.bootstraps=1000. |

pivot.statistic |
character string indicating which pivot statistic to use in the construction of the confidence interval for the mean when ci.method="normal.approx". The possible values are "z" (the default) and "t". |

ci.sample.size |
numeric scalar indicating what sample size to assume to construct the confidence interval for the mean if ci.method="normal.approx". The default value is the number of uncensored observations, ci.sample.size=sum(!censored). |
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a gamma distribution with parameters shape=$\kappa$ and scale=$\theta$. The relationship between these parameters and the mean $\mu$ and coefficient of variation $\tau$ of this distribution is given by:

$$\mu = \kappa \theta \;\;\;\; (1)$$

$$\tau = \kappa^{-1/2} \;\;\;\; (2)$$
Assume $n_1$ ($0 < n_1 < n$) of these observations are known and $c$ ($c = n - n_1$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \;\; k \ge 1 \;\;\;\; (3)$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n_1$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n_1$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^{k} c_j = c \;\;\;\; (4)$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first. Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.

Finally, let $\Omega$ (omega) denote the set of $n_1$ subscripts in the “ordered” sample that correspond to uncensored observations.
Estimation

Maximum Likelihood Estimation (method="mle")

For Type I left censored data, the likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = {n \choose c_1 c_2 \ldots c_k n_1} \prod_{j=1}^{k} [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (7)$$

where $f$ and $F$ denote the probability density function (pdf) and cumulative distribution function (cdf) of the population (Cohen, 1963; Cohen, 1991, pp.6, 50). That is,

$$f(t) = \frac{t^{\kappa - 1} \, e^{-t/\theta}}{\theta^{\kappa} \, \Gamma(\kappa)}$$

(Johnson et al., 1994, p.343), where $\kappa$ and $\theta$ are defined in terms of $\mu$ and $\tau$ by Equations (1) and (2) above.

For left singly censored data, Equation (7) simplifies to:

$$L(\mu, \tau | \underline{x}) = {n \choose c} [F(T)]^{c} \prod_{i=c+1}^{n} f[x_{(i)}]$$

Similarly, for Type I right censored data, the likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = {n \choose c_1 c_2 \ldots c_k n_1} \prod_{j=1}^{k} [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (10)$$

and for right singly censored data this simplifies to:

$$L(\mu, \tau | \underline{x}) = {n \choose c} [1 - F(T)]^{c} \prod_{i=1}^{n_1} f[x_{(i)}]$$

The maximum likelihood estimators are computed by minimizing the negative log-likelihood function.
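The following sketch illustrates direct minimization of the negative log-likelihood for Type I left-censored gamma data using optim, parameterized in terms of the mean and coefficient of variation. It is an illustration only (not the internal code of egammaAltCensored); the starting values and object names are arbitrary, and the data frame is the one used in the example below.

# A sketch of maximum likelihood estimation for left-censored gamma data
neg.loglik <- function(par, x, censored) {
  mu <- exp(par[1]); tau <- exp(par[2])     # optimize on the log scale
  shape <- 1 / tau^2
  scale <- mu * tau^2
  # Uncensored values contribute the density; left-censored values contribute
  # the cdf evaluated at the censoring level. (The combinatorial constant in
  # the likelihood does not affect the maximization and is omitted.)
  -sum(dgamma(x[!censored], shape = shape, scale = scale, log = TRUE)) -
    sum(pgamma(x[censored], shape = shape, scale = scale, log.p = TRUE))
}

with(EPA.09.Ex.15.1.manganese.df, {
  start <- log(c(mean(Manganese.ppb), sd(Manganese.ppb) / mean(Manganese.ppb)))
  fit <- optim(start, neg.loglik, x = Manganese.ppb, censored = Censored)
  setNames(exp(fit$par), c("mean", "cv"))
  # Compare with egammaAltCensored(Manganese.ppb, Censored)$parameters
})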
Confidence Intervals

This section explains how confidence intervals for the mean $\mu$ are computed.

Likelihood Profile (ci.method="profile.likelihood")

This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988) introduced an efficient method of computation. This method is also discussed by Stryhn and Christensen (2003) and Royston (2007).

The idea behind this method is to invert the likelihood-ratio test to obtain a confidence interval for the mean $\mu$ while treating the coefficient of variation $\tau$ as a nuisance parameter. Equation (7) above shows the form of the likelihood function $L(\mu, \tau | \underline{x})$ for multiply left-censored data, where the censoring levels $T_j$ and censoring counts $c_j$ are defined by Equations (3) and (4), and Equation (10) shows the function for multiply right-censored data.

Following Stryhn and Christensen (2003), denote the maximum likelihood estimates of the mean and coefficient of variation by $(\mu^*, \tau^*)$. The likelihood ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$ (where $\mu_0$ is a fixed value) equals the drop in $2 \log(L)$ between the “full” model and the reduced model with $\mu$ fixed at $\mu_0$, i.e.,

$$G^2 = 2 \left[\log L(\mu^*, \tau^*) - \log L(\mu_0, \tau_0^*)\right]$$

where $\tau_0^*$ is the maximum likelihood estimate of $\tau$ for the reduced model (i.e., when $\mu = \mu_0$). Under the null hypothesis, the test statistic $G^2$ follows a chi-squared distribution with 1 degree of freedom.

Alternatively, we may express the test statistic in terms of the profile likelihood function $L_1$ for the mean $\mu$, which is obtained from the usual likelihood function by maximizing over the parameter $\tau$, i.e.,

$$L_1(\mu) = \max_{\tau} L(\mu, \tau)$$

Then we have

$$G^2 = 2 \left[\log L_1(\mu^*) - \log L_1(\mu_0)\right]$$

A two-sided $(1-\alpha)100\%$ confidence interval for the mean $\mu$ consists of all values of $\mu_0$ for which the test is not significant at level $\alpha$:

$$\left\{\mu_0 : G^2 \le \chi^2_{1, 1-\alpha}\right\} \;\;\;\; (15)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the chi-squared distribution with $\nu$ degrees of freedom. One-sided lower and one-sided upper confidence intervals are computed in a similar fashion, except that the quantity $\chi^2_{1, 1-\alpha}$ in Equation (15) is replaced with $\chi^2_{1, 1-2\alpha}$.
Normal Approximation (ci.method="normal.approx")

This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$ is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t_{1-\alpha/2, \, m-1} \, \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} + t_{1-\alpha/2, \, m-1} \, \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $m$ denotes the assumed sample size for the confidence interval, and $t_{p, \nu}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

The argument ci.sample.size determines the value of $m$ and by default is equal to the number of uncensored observations. This is simply an ad-hoc method of constructing confidence intervals and is not based on any published theoretical results.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.

The standard deviation of the mle of $\mu$ is estimated based on the inverse of the Fisher Information matrix.
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate confidence interval for the population mean $\mu$, the bootstrap can be broken down into the following steps:
1. Create a bootstrap sample by taking a random sample of size $n$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample (e.g., the number of censored observations in the bootstrap sample will often differ from the number of censored observations in the original sample).

2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$.

3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function egammaAltCensored, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.

4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of this estimator of $\mu$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.
The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$[\hat{G}^{-1}(\alpha/2), \;\; \hat{G}^{-1}(1-\alpha/2)] \;\;\;\; (17)$$

where $\hat{G}(t)$ denotes the empirical cdf of the bootstrapped estimates of $\mu$ evaluated at $t$, and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha), \;\; \infty] \;\;\;\; (18)$$

and a one-sided upper confidence interval is computed as:

$$[0, \;\; \hat{G}^{-1}(1-\alpha)] \;\;\;\; (19)$$

The function egammaAltCensored calls the R function quantile to compute the empirical quantiles used in Equations (17)-(19).
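A minimal sketch of the percentile bootstrap described above, resampling (value, censoring indicator) pairs and re-estimating the mean on each resample. It is an illustration only: the number of bootstraps is kept small to keep the example fast, and degenerate resamples (e.g., with no censored values) are simply skipped via tryCatch rather than handled more carefully.

# A sketch of the percentile bootstrap confidence interval for the mean
set.seed(47)
with(EPA.09.Ex.15.1.manganese.df, {
  n <- length(Manganese.ppb)
  boot.means <- replicate(200, {
    i <- sample(n, replace = TRUE)          # resample (value, censoring indicator) pairs
    tryCatch(
      egammaAltCensored(Manganese.ppb[i], Censored[i])$parameters["mean"],
      error = function(e) NA)
  })
  quantile(boot.means, c(0.025, 0.975), na.rm = TRUE)   # two-sided 95% percentile interval
})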
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{n}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/n$ instead of $k/\sqrt{n}$. The two-sided bias-corrected and accelerated confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)]$$

where

$$\alpha_1 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right] \;\;\;\; (21)$$

$$\alpha_2 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right] \;\;\;\; (22)$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^{n} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6 \left[\sum_{i=1}^{n} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2\right]^{3/2}}$$

where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$[\hat{G}^{-1}(\alpha_1), \;\; \infty]$$

and a one-sided upper confidence interval is given by:

$$[0, \;\; \hat{G}^{-1}(\alpha_2)]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$ in Equations (21) and (22).

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.

When ci.method="bootstrap", the function egammaAltCensored computes both the percentile method and bias-corrected and accelerated method bootstrap confidence intervals.
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation. Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators for parameters of a normal or lognormal distribution based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation (or coefficient of variation), rather than rely on a single point-estimate of the mean. Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and coefficient of variation of a gamma distribution when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard ([email protected])
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
egammaCensored
, GammaDist, egamma
,
estimateCensored.object
.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and coefficient of variation # ON THE ORIGINAL SCALE using the MLE and # assuming a gamma distribution. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean and coefficient of variation # using the MLE, and compute a confidence interval # for the mean using the profile-likelihood method. #--------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, egammaAltCensored(Manganese.ppb, Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Gamma # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 19.664797 # cv = 1.252936 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 12.25151 # UCL = 34.35332 #---------- # Compare the confidence interval for the mean # based on assuming a lognormal distribution versus # assuming a gamma distribution. with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, ci = TRUE))$interval$limits # LCL UCL #12.37629 69.87694 with(EPA.09.Ex.15.1.manganese.df, egammaAltCensored(Manganese.ppb, Censored, ci = TRUE))$interval$limits # LCL UCL #12.25151 34.35332
Estimate the shape and scale parameters of a gamma distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
egammaCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ci.sample.size = sum(!censored))
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |

censored |
numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to censored elements of x. If the mode of censored is "numeric", a value of 1 indicates a censored observation and a value of 0 an uncensored one. |

method |
character string specifying the method of estimation. Currently, the only available method is maximum likelihood (method="mle"; the default). |

censoring.side |
character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right". |

ci |
logical scalar indicating whether to compute a confidence interval for the mean. The default value is ci=FALSE. |

ci.method |
character string indicating what method to use to construct the confidence interval for the mean. The possible values are "profile.likelihood" (profile likelihood; the default), "normal.approx" (normal approximation), and "bootstrap" (based on bootstrapping). |

ci.type |
character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". |

conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. |

n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the confidence interval for the mean when ci.method="bootstrap". The default value is n.bootstraps=1000. |

pivot.statistic |
character string indicating which pivot statistic to use in the construction of the confidence interval for the mean when ci.method="normal.approx". The possible values are "z" (the default) and "t". |

ci.sample.size |
numeric scalar indicating what sample size to assume to construct the confidence interval for the mean if ci.method="normal.approx". The default value is the number of uncensored observations, ci.sample.size=sum(!censored). |
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a gamma distribution with parameters shape=$\kappa$ and scale=$\theta$. The relationship between these parameters and the mean $\mu$ and coefficient of variation $\tau$ of this distribution is given by:

$$\mu = \kappa \theta \;\;\;\; (1)$$

$$\tau = \kappa^{-1/2} \;\;\;\; (2)$$
Assume $n_1$ ($0 < n_1 < n$) of these observations are known and $c$ ($c = n - n_1$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \;\; k \ge 1 \;\;\;\; (3)$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n_1$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n_1$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^{k} c_j = c \;\;\;\; (4)$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first. Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.

Finally, let $\Omega$ (omega) denote the set of $n_1$ subscripts in the “ordered” sample that correspond to uncensored observations.
Estimation

Maximum Likelihood Estimation (method="mle")

For Type I left censored data, the likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = {n \choose c_1 c_2 \ldots c_k n_1} \prod_{j=1}^{k} [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (7)$$

where $f$ and $F$ denote the probability density function (pdf) and cumulative distribution function (cdf) of the population (Cohen, 1963; Cohen, 1991, pp.6, 50). That is,

$$f(t) = \frac{t^{\kappa - 1} \, e^{-t/\theta}}{\theta^{\kappa} \, \Gamma(\kappa)}$$

(Johnson et al., 1994, p.343). For left singly censored data, Equation (7) simplifies to:

$$L(\mu, \tau | \underline{x}) = {n \choose c} [F(T)]^{c} \prod_{i=c+1}^{n} f[x_{(i)}]$$

Similarly, for Type I right censored data, the likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = {n \choose c_1 c_2 \ldots c_k n_1} \prod_{j=1}^{k} [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (10)$$

and for right singly censored data this simplifies to:

$$L(\mu, \tau | \underline{x}) = {n \choose c} [1 - F(T)]^{c} \prod_{i=1}^{n_1} f[x_{(i)}]$$

The maximum likelihood estimators are computed by minimizing the negative log-likelihood function.
Confidence Intervals
This section explains how confidence intervals for the mean are
computed.
Likelihood Profile (ci.method="profile.likelihood"
)
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean while treating the coefficient of variation
as a nuisance parameter. Equation (7) above
shows the form of the likelihood function
for
multiply left-censored data and Equation (10) shows the function for
multiply right-censored data, where
and
are defined by
Equations (3) and (4).
Following Stryhn and Christensen (2003), denote the maximum likelihood estimates
of the mean and coefficient of variation by $(\hat{\mu}, \hat{\tau})$. The likelihood
ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$
(where $\mu_0$ is a fixed value) equals the drop in $2 \log(L)$ between the
“full” model and the reduced model with $\mu$ fixed at $\mu_0$, i.e.,

$$G^2 = 2 \left\{ \log[L(\hat{\mu}, \hat{\tau})] - \log[L(\mu_0, \hat{\tau}_0)] \right\}$$

where $\hat{\tau}_0$ is the maximum likelihood estimate of $\tau$ for the
reduced model (i.e., when $\mu = \mu_0$). Under the null hypothesis,
the test statistic $G^2$ follows a
chi-squared distribution with 1 degree of freedom.
Alternatively, we may
express the test statistic in terms of the profile likelihood function $L_1$
for the mean $\mu$, which is obtained from the usual likelihood function by
maximizing over the parameter $\tau$, i.e.,

$$L_1(\mu) = \max_{\tau} L(\mu, \tau)$$

Then we have

$$G^2 = 2 \left\{ \log[L_1(\hat{\mu})] - \log[L_1(\mu_0)] \right\}$$

A two-sided $(1-\alpha)100\%$ confidence interval for the mean $\mu$
consists of all values of $\mu_0$ for which the test is not significant at
level $\alpha$:

$$\left\{ \mu_0 : G^2 \le \chi^2_{1, 1-\alpha} \right\} \;\;\;\;\;\; (15)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the
chi-squared distribution with $\nu$ degrees of freedom.
One-sided lower and one-sided upper confidence intervals are computed in a similar
fashion, except that the quantity $1-\alpha$ in Equation (15) is replaced with
$1-2\alpha$.
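A minimal R sketch of the profile-likelihood idea (again not EnvStats' implementation) reparameterizes the likelihood in terms of the mean and coefficient of variation, maximizes over the coefficient of variation for each fixed value of the mean, and then locates the values of the mean at which the likelihood-ratio statistic reaches the chi-squared cutoff; the search interval for the cv is an arbitrary assumption.

loglik <- function(mean, cv, x, censored) {
  shape <- 1 / cv^2; scale <- mean * cv^2
  sum(pgamma(x[censored], shape = shape, scale = scale, log.p = TRUE)) +
    sum(dgamma(x[!censored], shape = shape, scale = scale, log = TRUE))
}
profile.loglik <- function(mean, x, censored) {
  optimize(function(cv) loglik(mean, cv, x, censored),
    interval = c(0.01, 10), maximum = TRUE)$objective
}
# The confidence limits are the two values of the mean (one on either side of the
# MLE) at which
#   2 * (maximized log-likelihood - profile.loglik(mean)) = qchisq(conf.level, df = 1)
# which can be located with uniroot().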
Normal Approximation (ci.method="normal.approx"
)
This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$
based on the assumption that the estimator of $\mu$ is
approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$
confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t(m-1, 1-\alpha/2) \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} + t(m-1, 1-\alpha/2) \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$,
$\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard
deviation of the estimator of $\mu$, $m$ denotes the assumed sample
size for the confidence interval, and $t(\nu, p)$ denotes the $p$'th
quantile of Student's t-distribution with $\nu$
degrees of freedom. One-sided confidence intervals are computed in a
similar fashion.

The argument ci.sample.size
determines the value of $m$ and by
default is equal to the number of uncensored observations.
This is simply an ad-hoc method of constructing
confidence intervals and is not based on any published theoretical results.

When pivot.statistic="z"
, the $p$'th quantile from the
standard normal distribution is used in place of the
$p$'th quantile from Student's t-distribution.

The standard deviation of the mle of $\mu$ is
estimated based on the inverse of the Fisher Information matrix.
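For illustration, here is a hedged R sketch of the normal-approximation interval, assuming the point estimate of the mean, its estimated standard deviation, and the assumed sample size are already available (all numbers below are hypothetical).

mean.hat <- 19.7     # estimated mean (hypothetical)
sd.mean.hat <- 5.2   # estimated standard deviation of the estimator (hypothetical)
m <- 19              # ci.sample.size: number of uncensored observations (hypothetical)
alpha <- 0.05
mean.hat + c(-1, 1) * qt(1 - alpha/2, df = m - 1) * sd.mean.hat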
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate confidence
interval for the population mean $\mu$, the bootstrap can be broken down into the
following steps:

Create a bootstrap sample by taking a random sample of size $N$ from
the observations in $\underline{x}$, where sampling is done with
replacement. Note that because sampling is done with replacement, the same
element of $\underline{x}$ can appear more than once in the bootstrap
sample. Thus, the bootstrap sample will usually not look exactly like the
original sample (e.g., the number of censored observations in the bootstrap
sample will often differ from the number of censored observations in the
original sample).

Estimate $\mu$ based on the bootstrap sample created in Step 1, using
the same method that was used to estimate $\mu$ using the original
observations in $\underline{x}$. Because the bootstrap sample usually
does not match the original sample, the estimate of $\mu$ based on the
bootstrap sample will usually differ from the original estimate based on
$\underline{x}$.

Repeat Steps 1 and 2 $B$ times, where $B$ is some large number.
For the function
egammaCensored
, the number of bootstraps is
determined by the argument
n.bootstraps
(see the section ARGUMENTS above).
The default value of n.bootstraps
is 1000
.
Use the $B$ estimated values of $\mu$ to compute the empirical
cumulative distribution function of this estimator of $\mu$ (see
ecdfPlot
), and then create a confidence interval for $\mu$
based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$\left[\hat{G}^{-1}\left(\frac{\alpha}{2}\right), \;\; \hat{G}^{-1}\left(1 - \frac{\alpha}{2}\right)\right] \;\;\;\;\;\; (17)$$

where $\hat{G}(t)$ denotes the empirical cdf evaluated at $t$ and thus
$\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile, that is,
the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower
confidence interval is computed as:

$$\left[\hat{G}^{-1}(\alpha), \;\; \infty\right] \;\;\;\;\;\; (18)$$

and a one-sided upper confidence interval is computed as:

$$\left[0, \;\; \hat{G}^{-1}(1 - \alpha)\right] \;\;\;\;\;\; (19)$$

The function egammaCensored
calls the R function quantile
to compute the empirical quantiles used in Equations (17)-(19).
The percentile method bootstrap confidence interval is only first-order
accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability
that the confidence interval will contain the true value of $\mu$ can be
off by $k_1/\sqrt{N}$, where $k_1$ is some constant. Efron and Tibshirani
(1993, pp.184-188) proposed a bias-corrected and accelerated interval that is
second-order accurate, meaning that the probability that the confidence interval
will contain the true value of $\mu$ may be off by $k_2/N$
instead of $k_1/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is
computed as:

$$\left[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)\right] \;\;\;\;\;\; (20)$$

where

$$\alpha_1 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right] \;\;\;\;\;\; (21)$$

$$\alpha_2 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right] \;\;\;\;\;\; (22)$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6\left[\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2\right]^{3/2}}$$

Here $\Phi$ denotes the cumulative distribution function of the standard normal
distribution and $z_p$ denotes the $p$'th quantile of the standard normal
distribution. The quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using
all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{N}\sum_{i=1}^N \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$\left[\hat{G}^{-1}(\alpha_1), \;\; \infty\right]$$

and a one-sided upper confidence interval is given by:

$$\left[0, \;\; \hat{G}^{-1}(\alpha_2)\right]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence
interval, except $\alpha/2$ is replaced with $\alpha$
in Equations (21) and (22).

The constant $\hat{z}_0$ incorporates the bias correction, and the constant
$\hat{a}$ is the acceleration constant. The term “acceleration” refers
to the rate of change of the standard error of the estimate of $\mu$ with
respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a
normal (Gaussian) distribution, the standard error of the estimate of $\mu$
does not depend on the value of $\mu$, hence the acceleration constant is not
really necessary.
When ci.method="bootstrap"
, the function egammaCensored
computes both
the percentile method and bias-corrected and accelerated method
bootstrap confidence intervals.
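The mechanics of the percentile interval can be sketched directly in R (this is illustrative only and not how egammaCensored is implemented internally); it resamples the manganese data used in the EXAMPLES section below and recomputes the estimated mean (shape times scale) for each bootstrap sample.

set.seed(47)
x <- EPA.09.Ex.15.1.manganese.df$Manganese.ppb
cens <- EPA.09.Ex.15.1.manganese.df$Censored
boot.means <- replicate(200, {   # a small number of bootstraps to keep the sketch fast
  i <- sample(length(x), replace = TRUE)
  est <- egammaCensored(x[i], cens[i])$parameters
  unname(est["shape"] * est["scale"])
})
quantile(boot.means, c(0.025, 0.975))   # two-sided 95% percentile interval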
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation. Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators for parameters of a normal or lognormal distribution based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation (or coefficient of variation), rather than rely on a single point-estimate of the mean. Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and coefficient of variation of a gamma distribution when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard ([email protected])
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
egammaAltCensored
, GammaDist, egamma
,
estimateCensored.object
.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the shape and scale parameters using # the data ON THE ORIGINAL SCALE, using the MLE and # assuming a gamma distribution. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the shape and scale parameters # using the MLE, and compute a confidence interval # for the mean using the profile-likelihood method. #--------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, egammaCensored(Manganese.ppb, Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Gamma # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): shape = 0.6370043 # scale = 30.8707533 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 12.25151 # UCL = 34.35332 #---------- # Compare the confidence interval for the mean # based on assuming a lognormal distribution versus # assuming a gamma distribution. with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, ci = TRUE))$interval$limits # LCL UCL #12.37629 69.87694 with(EPA.09.Ex.15.1.manganese.df, egammaCensored(Manganese.ppb, Censored, ci = TRUE))$interval$limits # LCL UCL #12.25151 34.35332
Estimate the probability parameter of a geometric distribution.
egeom(x, method = "mle/mme")
x |
vector of non-negative integers indicating the number of trials that took place
before the first “success” occurred. (The total number of trials
that took place is |
method |
character string specifying the method of estimation. Possible values are |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ independent observations from a geometric distribution
with parameter prob=$p$.

It can be shown (e.g., Forbes et al., 2011) that if $y$ is defined as:

$$y = \sum_{i=1}^n x_i$$

then $y$ is an observation from a
negative binomial distribution with
parameters prob=$p$ and size=$n$.
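A quick R sketch (illustrative only) of this relationship compares the empirical distribution of such sums with the corresponding negative binomial distribution function:

set.seed(1)
n <- 5; p <- 0.3
sums <- replicate(2000, sum(rgeom(n, prob = p)))
cbind(empirical = ecdf(sums)(c(5, 10, 20)),
  theoretical = pnbinom(c(5, 10, 20), size = n, prob = p))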
Estimation
The maximum likelihood and method of moments estimator (mle/mme) of $p$
is given by:

$$\hat{p}_{mle} = \frac{1}{1 + \bar{x}} = \frac{n}{n + \sum_{i=1}^n x_i}$$

and the minimum variance unbiased estimator (mvue) of $p$ is given by:

$$\hat{p}_{mvue} = \frac{n - 1}{n + \sum_{i=1}^n x_i - 1}$$

(Forbes et al., 2011). Note that the mvue of $p$ is not defined for
$n = 1$.
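These two estimators are easy to compute directly; the following R sketch reproduces the value returned by egeom(dat, method = "mvue") in the EXAMPLES section below (the data are the same three values shown in the second example).

x <- c(0, 1, 2)
n <- length(x)
p.mle  <- n / (n + sum(x))              # mle/mme: 1 / (1 + mean(x))
p.mvue <- (n - 1) / (n + sum(x) - 1)    # mvue (requires n > 1)
c(mle = p.mle, mvue = p.mvue)
# mle mvue
# 0.5  0.4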
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The geometric distribution with parameter prob=$p$ is a special case of the
negative binomial distribution with parameters size=1 and prob=p.
The negative binomial distribution has its roots in a gambling game where participants would bet on the number of tosses of a coin necessary to achieve a fixed number of heads. The negative binomial distribution has been applied in a wide variety of fields, including accident statistics, birth-and-death processes, and modeling spatial distributions of biological organisms.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 5.
Geometric, enbinom
, NegBinomial.
# Generate an observation from a geometric distribution with parameter # prob=0.2, then estimate the parameter prob. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgeom(1, prob = 0.2) dat #[1] 4 egeom(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Geometric # #Estimated Parameter(s): prob = 0.2 # #Estimation Method: mle/mme # #Data: dat # #Sample Size: 1 #---------- # Generate 3 observations from a geometric distribution with parameter # prob=0.2, then estimate the parameter prob with the mvue. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(200) dat <- rgeom(3, prob = 0.2) dat #[1] 0 1 2 egeom(dat, method = "mvue") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Geometric # #Estimated Parameter(s): prob = 0.4 # #Estimation Method: mvue # #Data: dat # #Sample Size: 3 #---------- # Clean up #--------- rm(dat)
Estimate the location, scale and shape parameters of a generalized extreme value distribution, and optionally construct a confidence interval for one of the parameters.
egevd(x, method = "mle", pwme.method = "unbiased", tsoe.method = "med", plot.pos.cons = c(a = 0.35, b = 0), ci = FALSE, ci.parameter = "location", ci.type = "two-sided", ci.method = "normal.approx", information = "observed", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
|
pwme.method |
character string specifying what method to use to compute the
probability-weighted moments when |
tsoe.method |
character string specifying the robust function to apply in the second stage of
the two-stage order-statistics estimator when |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
ci |
logical scalar indicating whether to compute a confidence interval for the
location, scale, or shape parameter. The default value is |
ci.parameter |
character string indicating the parameter for which the confidence interval is
desired. The possible values are |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the location or scale parameter. Currently, the only possible value is
|
information |
character string indicating which kind of Fisher information to use when
computing the variance-covariance matrix of the maximum likelihood estimators.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from a generalized extreme value distribution with
parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$.
Estimation
Maximum Likelihood Estimation (method="mle"
)
The log likelihood function is given by:

$$\log L(\eta, \theta, \kappa \mid \underline{x}) = -n \log(\theta) + \left(\frac{1}{\kappa} - 1\right) \sum_{i=1}^n \log(z_i) - \sum_{i=1}^n z_i^{1/\kappa}$$

where

$$z_i = 1 - \frac{\kappa (x_i - \eta)}{\theta}$$

(see, for example, Jenkinson, 1969; Prescott and Walden, 1980; Prescott and Walden,
1983; Hosking, 1985; MacLeod, 1989). The maximum likelihood estimators (MLE's) of
$\eta$, $\theta$, and $\kappa$ are those values that maximize the
likelihood function, subject to the following constraints:

$$\theta > 0$$

$$\kappa \le 1$$

$$z_i > 0, \;\; i = 1, 2, \ldots, n$$

Although in theory the value of $\kappa$ may lie anywhere in the interval
$(-\infty, \infty)$ (see GEVD), the constraint $\kappa \le 1$ is
imposed because when $\kappa > 1$ the likelihood can be made infinite and
thus the MLE does not exist (Castillo and Hadi, 1994). Hence, this method of
estimation is not valid when the true value of $\kappa$ is larger than 1.
Hosking (1985) and Hosking et al. (1985) note that in practice the value of $\kappa$
tends to lie in the interval $-1/2 < \kappa < 1/2$.
The value of the negative log-likelihood is minimized using the R function
nlminb
.
Prescott and Walden (1983) give formulas for the gradient and Hessian. Only
the gradient is supplied in the call to nlminb
. The values of
the PWME (see below) are used as the starting values. If the starting value of $\kappa$
is less than 0.001 in absolute value, it is reset to
sign(k) * 0.001
, as suggested by Hosking (1985).
Probability-Weighted Moments Estimation (method="pwme"
)
The idea of probability-weighted moments was introduced by Greenwood et al. (1979).
Landwehr et al. (1979) derived probability-weighted moment estimators (PWME's) for
the parameters of the Type I (Gumbel) extreme value distribution.
Hosking et al. (1985) extended these results to the generalized extreme value
distribution. See the abstract for Hosking et al. (1985)
for details on how these estimators are computed.
Two-Stage Order Statistics Estimation (method="tsoe"
)
The two-stage order statistics estimator (TSOE) was introduced by
Castillo and Hadi (1994) as an alternative to the MLE and PWME. Unlike the
MLE and PWME, the TSOE of exists for all combinations of sample
values and possible values of
. See the
abstract for Castillo and Hadi (1994) for details
on how these estimators are computed. In the second stage,
Castillo and Hadi (1994) suggest using either the median or the least median of
squares as the robust function. The function
egevd
allows three options
for the robust function: median (tsoe.method="med"
; see the R help file for
median
), least median of squares (tsoe.method="lms"
;
see the help file for lmsreg
in the package MASS),
and least trimmed squares (tsoe.method="lts"
; see the help file for
ltsreg
in the package MASS).
Confidence Intervals
When ci=TRUE
, an approximate $(1-\alpha)100\%$ confidence interval
for $\eta$, $\theta$, or $\kappa$ can be constructed assuming the distribution of the estimator of
the parameter is approximately normally distributed. A two-sided confidence
interval for $\eta$ is constructed as:

$$[\hat{\eta} - t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\eta}}, \;\; \hat{\eta} + t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\eta}}]$$

where $t(\nu, p)$ is the $p$'th quantile of Student's t-distribution with $\nu$
degrees of freedom, and the quantity $\hat{\sigma}_{\hat{\eta}}$
denotes the estimated asymptotic standard deviation of the estimator of $\eta$.
Similarly, a two-sided confidence interval for $\theta$ is constructed as:

$$[\hat{\theta} - t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\theta}}, \;\; \hat{\theta} + t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\theta}}]$$

and a two-sided confidence interval for $\kappa$ is constructed as:

$$[\hat{\kappa} - t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\kappa}}, \;\; \hat{\kappa} + t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\kappa}}]$$

One-sided confidence intervals for $\eta$, $\theta$, and $\kappa$ are
computed in a similar fashion.
Maximum Likelihood Estimator (method="mle"
)
Prescott and Walden (1980) derive the elements of the Fisher information matrix
(the expected information). The inverse of this matrix, evaluated at the values
of the MLE, is the estimated asymptotic variance-covariance matrix of the MLE.
This method is used to estimate the standard deviations of the estimated
distribution parameters when information="expected"
. The necessary
regularity conditions hold for $\kappa < 1/2$. Thus, this method of
constructing confidence intervals is not valid when the true value of $\kappa$
is greater than or equal to 1/2.
Prescott and Walden (1983) derive expressions for the observed information matrix
(i.e., the Hessian). This matrix is used to compute the estimated asymptotic
variance-covariance matrix of the MLE when information="observed"
.
In computer simulations, Prescott and Walden (1983) found that the
variance-covariance matrix based on the observed information gave slightly more
accurate estimates of the variance of the MLE of $\kappa$ compared to the
estimated variance based on the expected information.
Probability-Weighted Moments Estimator (method="pwme"
)
Hosking et al. (1985) show that these estimators are asymptotically multivariate
normal and derive the asymptotic variance-covariance matrix. See the
abstract for Hosking et al. (1985) for details on how
this matrix is computed.
Two-Stage Order Statistics Estimator (method="tsoe"
)
Currently there is no built-in method in EnvStats for computing confidence
intervals when method="tsoe"
. Castillo and Hadi (1994) suggest
using the bootstrap or jackknife method.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
Two-parameter extreme value distributions (EVD) have been applied extensively since the 1930's to several fields of study, including the distributions of hydrological and meteorological variables, human lifetimes, and strength of materials. The three-parameter generalized extreme value distribution (GEVD) was introduced by Jenkinson (1955) to model annual maximum and minimum values of meteorological events. Since then, it has been used extensively in the hydrological and meteorological fields.
The three families of EVDs are all special kinds of GEVDs. When the shape
parameter $\kappa = 0$, the GEVD reduces to the Type I extreme value (Gumbel)
distribution. (The function
zTestGevdShape
allows you to test
the null hypothesis $H_0: \kappa = 0$.) When $\kappa < 0$, the GEVD is
the same as the Type II extreme value distribution, and when $\kappa > 0$
it is the same as the Type III extreme value distribution.
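As a quick numerical check of the first special case, here is a sketch using the EnvStats density functions dgevd and devd:

x <- seq(-2, 6, by = 0.5)
all.equal(dgevd(x, location = 1, scale = 2, shape = 0),
  devd(x, location = 1, scale = 2))
# should be TRUE: the two densities coincide when shape = 0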
Hosking et al. (1985) compare the asymptotic and small-sample statistical
properties of the PWME with the MLE and Jenkinson's (1969) method of sextiles.
Castillo and Hadi (1994) compare the small-sample statistical properties of the
MLE, PWME, and TSOE. Hosking and Wallis (1995) compare the small-sample properties
of unbiased $L$-moment estimators vs. plotting-position $L$-moment
estimators. (PWMEs can be written as linear combinations of $L$-moments and
thus have equivalent statistical properties.) Hosking and Wallis (1995) conclude
that unbiased estimators should be used for almost all applications.
Steven P. Millard ([email protected])
Castillo, E., and A. Hadi. (1994). Parameter and Quantile Estimation for the Generalized Extreme-Value Distribution. Environmetrics 5, 417–432.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hosking, J.R.M. (1984). Testing Whether the Shape Parameter is Zero in the Generalized Extreme-Value Distribution. Biometrika 71(2), 367–374.
Hosking, J.R.M. (1985). Algorithm AS 215: Maximum-Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 34(3), 301–310.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Jenkinson, A.F. (1969). Statistics of Extremes. Technical Note 98, World Meteorological Office, Geneva.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Macleod, A.J. (1989). Remark AS R76: A Remark on Algorithm AS 215: Maximum Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 38(1), 198–199.
Prescott, P., and A.T. Walden. (1980). Maximum Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Biometrika 67(3), 723–724.
Prescott, P., and A.T. Walden. (1983). Maximum Likelihood Estimation of the Three-Parameter Generalized Extreme-Value Distribution from Censored Samples. Journal of Statistical Computing and Simulation 16, 241–250.
Generalized Extreme Value Distribution,
zTestGevdShape
, Extreme Value Distribution,
eevd
.
# Generate 20 observations from a generalized extreme value distribution # with parameters location=2, scale=1, and shape=0.2, then compute the # MLE and construct a 90% confidence interval for the location parameter. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(498) dat <- rgevd(20, location = 2, scale = 1, shape = 0.2) egevd(dat, ci = TRUE, conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Generalized Extreme Value # #Estimated Parameter(s): location = 1.6144631 # scale = 0.9867007 # shape = 0.2632493 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: location # #Confidence Interval Method: Normal Approximation # (t Distribution) based on # observed information # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Interval: LCL = 1.225249 # UCL = 2.003677 #---------- # Compare the values of the different types of estimators: egevd(dat, method = "mle")$parameters # location scale shape #1.6144631 0.9867007 0.2632493 egevd(dat, method = "pwme")$parameters # location scale shape #1.5785779 1.0187880 0.2257948 egevd(dat, method = "pwme", pwme.method = "plotting.position")$parameters # location scale shape #1.5509183 0.9804992 0.1657040 egevd(dat, method = "tsoe")$parameters # location scale shape #1.5372694 1.0876041 0.2927272 egevd(dat, method = "tsoe", tsoe.method = "lms")$parameters #location scale shape #1.519469 1.081149 0.284863 egevd(dat, method = "tsoe", tsoe.method = "lts")$parameters # location scale shape #1.4840198 1.0679549 0.2691914 #---------- # Clean up #--------- rm(dat)
Estimate $M$, the number of white balls in the urn, or $T = M + N$, the total number of balls in the urn, for a
hypergeometric distribution.
ehyper(x, m = NULL, total = NULL, k, method = "mle")
x |
non-negative integer indicating the number of white balls out of a sample of
size |
m |
non-negative integer indicating the number of white balls in the urn.
You must supply |
total |
positive integer indicating the total number of balls in the urn (i.e.,
|
k |
positive integer indicating the number of balls drawn without replacement from the
urn. Missing values ( |
method |
character string specifying the method of estimation. Possible values are
|
Missing (NA
), undefined (NaN
), and infinite (Inf
, -Inf
)
values are not allowed.
Let $x$ be an observation from a
hypergeometric distribution with
parameters m=$M$, n=$N$, and k=$K$.
In R nomenclature, $x$ represents the number of white balls drawn out of a
sample of $K$ balls drawn without replacement from an urn containing $M$
white balls and $N$ black balls. The total number of balls in the
urn is thus $M + N$. Denote the total number of balls by $T = M + N$.
Estimation
Estimating M, Given T and K are known
When $T$ and $K$ are known, the maximum likelihood estimator (mle) of $M$
is given by (Forbes et al., 2011):

$$\hat{M}_{mle} = \lfloor (T + 1) x / K \rfloor$$

where $\lfloor y \rfloor$ represents the floor function.
That is, $\lfloor y \rfloor$ is the largest integer less than or equal to $y$.
If the quantity $(T+1)x/K$ is an integer, then the mle of $M$
is also given by (Johnson et al., 1992, p.263):

$$\hat{M}_{mle} = \frac{(T + 1) x}{K} - 1$$

which is what the function ehyper
uses for this case.
The minimum variance unbiased estimator (mvue) of $M$ is given by
(Forbes et al., 2011):

$$\hat{M}_{mvue} = \frac{T x}{K}$$
Estimating T, given M and K are known
When $M$ and $K$ are known, the maximum likelihood estimator (mle) of $T$
is given by (Forbes et al., 2011):

$$\hat{T}_{mle} = \lfloor K M / x \rfloor$$
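A small R sketch of these point estimators (floor-function formulas; the numbers mirror the EXAMPLES section below, where x = 1 white ball is seen in k = 5 draws):

x <- 1; k <- 5
T.total <- 40                            # total number of balls, when known
M.mle  <- floor((T.total + 1) * x / k)   # 8, as in the first example below
M.mvue <- T.total * x / k                # minimum variance unbiased estimator
M.known <- 10                            # number of white balls, when known
T.mle <- floor(k * M.known / x)          # 50, so n = T - M = 40 in the second example
c(M.mle = M.mle, M.mvue = M.mvue, T.mle = T.mle)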
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The hypergeometric distribution can be described by
an urn model with $M$ white balls and $N$ black balls. If $K$ balls
are drawn with replacement, then the number of white balls in the sample
of size $K$ follows a binomial distribution with
parameters size=$K$ and prob=$M/(M+N)$. If $K$ balls are
drawn without replacement, then the number of white balls in the sample of
size $K$ follows a hypergeometric distribution
with parameters m=$M$, n=$N$, and k=$K$.
The name “hypergeometric” comes from the fact that the probabilities associated with this distribution can be written as successive terms in the expansion of a function of a Gaussian hypergeometric series.
The hypergeometric distribution is applied in a variety of fields, including quality control and estimation of animal population size. It is also the distribution used to compute probabilities for Fisher's exact test for a 2x2 contingency table.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 6.
# Generate an observation from a hypergeometric distribution with # parameters m=10, n=30, and k=5, then estimate the parameter m. # Note: the call to set.seed simply allows you to reproduce this example. # Also, the only parameter actually estimated is m; once m is estimated, # n is computed by subtracting the estimated value of m (8 in this example) # from the given of value of m+n (40 in this example). The parameters # n and k are shown in the output in order to provide information on # all of the parameters associated with the hypergeometric distribution. set.seed(250) dat <- rhyper(nn = 1, m = 10, n = 30, k = 5) dat #[1] 1 ehyper(dat, total = 40, k = 5) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Hypergeometric # #Estimated Parameter(s): m = 8 # n = 32 # k = 5 # #Estimation Method: mle for 'm' # #Data: dat # #Sample Size: 1 #---------- # Use the same data as in the previous example, but estimate m+n instead. # Note: The only parameter estimated is m+n. Once this is estimated, # n is computed by subtracting the given value of m (10 in this case) # from the estimated value of m+n (50 in this example). ehyper(dat, m = 10, k = 5) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Hypergeometric # #Estimated Parameter(s): m = 10 # n = 40 # k = 5 # #Estimation Method: mle for 'm+n' # #Data: dat # #Sample Size: 1 #---------- # Clean up #--------- rm(dat)
Estimate the mean and standard deviation parameters of the logarithm of a lognormal distribution, and optionally construct a confidence interval for the mean.
elnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "exact", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
|
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean or variance. The only possible value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $X$ denote a random variable with a
lognormal distribution with
parameters meanlog=$\mu$ and sdlog=$\sigma$. Then $Y = \log(X)$
has a normal (Gaussian) distribution with
parameters mean=$\mu$ and sd=$\sigma$. Thus, the function
elnorm
simply calls the function enorm
using the
log-transformed values of x
.
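Because of this relationship, the same estimates (and the same confidence interval for the mean of the log-transformed distribution) can be obtained by applying enorm to the log-transformed data; for example, using the Reference-area TcCB data from the EXAMPLES section below:

with(EPA.94b.tccb.df,
  enorm(log(TcCB[Area == "Reference"]), ci = TRUE))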
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The normal and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean or variance. This is done with confidence intervals.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, Chapter 5.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 2.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Limpert, E., W.A. Stahel, and M. Abbt. (2001). Log-Normal Distributions Across the Sciences: Keys and Clues. BioScience 51, 341–352.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Lognormal, LognormalAlt, Normal.
# Using the Reference area TcCB data in the data frame EPA.94b.tccb.df, # estimate the mean and standard deviation of the log-transformed distribution, # and construct a 95% confidence interval for the mean. with(EPA.94b.tccb.df, elnorm(TcCB[Area == "Reference"], ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.6195712 # sdlog = 0.4679530 # #Estimation Method: mvue # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.7569673 # UCL = -0.4821751
Estimate the mean, standard deviation, and threshold parameters for a three-parameter lognormal distribution, and optionally construct a confidence interval for the threshold or the median of the distribution.
elnorm3(x, method = "lmle", ci = FALSE, ci.parameter = "threshold", ci.method = "avar", ci.type = "two-sided", conf.level = 0.95, threshold.lb.sd = 100, evNormOrdStats.method = "royston")
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are:
See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for either
the threshold or median of the distribution. The default value is |
ci.parameter |
character string indicating the parameter for which the confidence interval is
desired. The possible values are |
ci.method |
character string indicating the method to use to construct the confidence interval.
The possible values are |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
threshold.lb.sd |
a positive numeric scalar specifying the range over which to look for the
local maximum likelihood ( |
evNormOrdStats.method |
character string indicating which method to use in the call to
|
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $X$ denote a random variable from a
three-parameter lognormal distribution with
parameters meanlog=$\mu$, sdlog=$\sigma$, and threshold=$\gamma$. Let
$\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$
observations from this distribution. Furthermore, let $x_{(i)}$ denote
the $i$'th order statistic in the sample, so that $x_{(1)}$ denotes the
smallest value and $x_{(n)}$ denotes the largest value in $\underline{x}$.
Finally, denote the sample mean and variance by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$$

Note that the sample variance is the unbiased version. Denote the method of moments estimator of variance by:

$$s^2_m = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$
Estimation
Local Maximum Likelihood Estimation (method="lmle"
)
Hill (1963) showed that the likelihood function approaches infinity as
approaches
, so that the global maximum likelihood estimators of
are
, which are
inadmissible, since
must be smaller than
. Cohen (1951)
suggested using local maximum likelihood estimators (lmle's), derived by equating
partial derivatives of the log-likelihood function to zero. These estimators were
studied by Harter and Moore (1966), Calitz (1973), Cohen and Whitten (1980), and
Griffiths (1980), and appear to possess most of the desirable properties ordinarily
associated with maximum likelihood estimators.
Cohen (1951) showed that the lmle of is given by the solution to the
following equation:
where
and that the lmle's of and
then follow as:
Unfortunately, while equation (4) simplifies the task of computing the lmle's,
for certain data sets there still may be convergence problems (Calitz, 1973), and
occasionally multiple roots of equation (4) may exist. When multiple roots to
equation (4) exist, Cohen and Whitten (1980) recommend using the one that results
in closest agreement between the mle of (equation (7)) and the sample
mean (equation (1)).
On the other hand, Griffiths (1980) showed that for a given value of the threshold
parameter , the maximized value of the log-likelihood (the
“profile likelihood” for
) is given by:
where the estimates of and
are defined in equations (7)
and (8), so the lmle of
reduces to an iterative search over the values
of
. Griffiths (1980) noted that the distribution of the lmle of
is far from normal and that
is not quadratic
near the lmle of
. He suggested a better parameterization based on
Thus, once the lmle of is found using equations (9) and (10), the lmle of
is given by:
When method="lmle"
, the function elnorm3
uses the function
nlminb
to search for the minimum of , using the
modified method of moments estimator (
method="mmme"
; see below) as the
starting value for . Equation (11) is then used to solve for the
lmle of
, and equation (4) is used to “fine tune” the estimated
value of
. The lmle's of
and
are then computed
using equations (6)-(8).
Method of Moments Estimation (method="mme"
)
Denote the 'th sample central moment by:
and note that
Equating the sample first moment (the sample mean) with its population value (the population mean), and equating the second and third sample central moments with their population values yields (Johnson et al., 1994, p.228):
where
Combining equations (15) and (16) yields:
The quantity on the left-hand side of equation (19) is the usual estimator of
skewness. Solving equation (19) for yields:
where
Using equation (18), the method of moments estimator of is then
computed as:
Combining equations (15) and (17), the method of moments estimator of
is computed as:
Finally, using equations (14), (17), and (18), the method of moments estimator of
is computed as:
There are two major problems with using method of moments estimators for the
three-parameter lognormal distribution. First, they are subject to very large
sampling error due to the use of second and third sample moments
(Cohen, 1988, p.121; Johnson et al., 1994, p.228). Second, Heyde (1963) showed
that the lognormal distribution is not uniquely determined by its moments.
Method of Moments Estimators Using an Unbiased Estimate of Variance (method="mmue"
)
This method of estimation is exactly the same as the method of moments
(method="mme"
), except that the unbiased estimator of variance (equation (3))
is used in place of the method of moments one (equation (4)). This modification is
given in Cohen (1988, pp.119-120).
Modified Method of Moments Estimation (method="mmme"
)
This method of estimation is described by Cohen (1988, pp.125-132). It was
introduced by Cohen and Whitten (1980; their MME-II with r=1) and was further
investigated by Cohen et al. (1985). It is motivated by the fact that the first
order statistic in the sample, , contains more information about
the threshold parameter
than any other observation and often more
information than all of the other observations combined (Cohen, 1988, p.125).
The first two sets of equations are the same as for the method of moments
estimators based on the unbiased estimate of variance (method="mmue"
), i.e., equations (14) and (15) with the
unbiased estimator of variance (equation (3)) used in place of the method of
moments one (equation (4)). The third equation replaces equation (16)
by equating a function of the first order statistic with its expected value:
where denotes the expected value of the
'th order
statistic in a random sample of
observations from a standard normal
distribution. (See the help file for
evNormOrdStats
for information
on how is computed.) Using equations (17) and (18),
equation (26) can be rewritten as:
Combining equations (14), (15), (17), (18), and (27) yields the following equation
for the estimate of :
After equation (28) is solved for , the estimate of
is again computed using equation (23), and the estimate of
is computed
using equation (24), where the unbiased estimate of variance is used in place of
the biased one (just as for
method="mmue"
).
Zero-Skewness Estimation (method="zero.skew"
)
This method of estimation was introduced by Griffiths (1980), and elaborated upon
by Royston (1992b). The idea is that if the threshold parameter $\gamma$ were
known, then the distribution of:

$$Y = \log(X - \gamma)$$

is normal, so the skew of $Y$ is 0. Thus, the threshold parameter $\gamma$
is estimated as that value that forces the sample skew (defined in equation (19)) of
the observations defined in equation (6) to be 0. That is, the zero-skewness
estimator of $\gamma$ is the value that satisfies the following equation:

$$0 = \frac{\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^3}{\left[\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2\right]^{3/2}} \;\;\;\;\;\; (30)$$

where

$$y_i = \log(x_i - \gamma), \;\; i = 1, 2, \ldots, n \;\;\;\;\;\; (31)$$

Note that since the denominator in equation (30) is always positive (assuming
there are at least two unique values in $\underline{x}$), only the numerator
needs to be used to determine the value of $\gamma$.
Once the value of $\gamma$ has been determined, $\mu$ and $\sigma$
are estimated using equations (7) and (8), except the unbiased estimator of variance
is used in equation (8).
Royston (1992b) developed a modification of the Shapiro-Wilk goodness-of-fit test
for normality based on transforming the data using equation (6) and the zero-skewness
estimator of $\gamma$ (see
gofTest
).
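The zero-skewness idea is easy to prototype; the following R sketch (not EnvStats' implementation, and assuming the skewness changes sign over the chosen search interval) finds the threshold with uniroot and then estimates meanlog and sdlog from the log-transformed, shifted data:

skew.num <- function(y) mean((y - mean(y))^3)   # numerator of the sample skew
zero.skew.threshold <- function(x) {
  uniroot(function(gamma) skew.num(log(x - gamma)),
    lower = min(x) - 100 * sd(x),
    upper = min(x) - .Machine$double.eps^0.25)$root
}
set.seed(20)
x <- rlnorm3(50, meanlog = 1, sdlog = 0.5, threshold = 10)
gamma.hat <- zero.skew.threshold(x)
c(threshold = gamma.hat, meanlog = mean(log(x - gamma.hat)),
  sdlog = sd(log(x - gamma.hat)))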
Estimators Based on Royston's Index of Skewness (method="royston.skew"
)
This method of estimation is discussed by Royston (1992b), and is similar to the
zero-skewness method discussed above, except a different measure of skewness is used.
Royston's (1992b) index of skewness is given by:
where denotes the
'th order statistic of
and
is defined in equation (31) above, and
denotes the median of
.
Royston (1992b) shows that the value of
that yields a value of
is given by:
Again, as for the zero-skewness method, once the value of has
been determined,
and
are estimated using equations (7) and (8),
except the unbiased estimator of variance is used in equation (8).
Royston (1992b) developed this estimator as a quick way to estimate .
Confidence Intervals
This section explains three different methods for constructing confidence intervals
for the threshold parameter , or the median of the three-parameter
lognormal distribution, which is given by:
Normal Approximation Based on Asymptotic Variances and Covariances (ci.method="avar"
)
Formulas for asymptotic variances and covariances for the three-parameter lognormal
distribution, based on the information matrix, are given in Cohen (1951), Cohen and
Whitten (1980), Cohen et al., (1985), and Cohen (1988). The relevant quantities for
and the median are:
where
A two-sided confidence interval for
is computed as:
where denotes the
'th quantile of
Student's t-distribution with
degrees of freedom, and the
quantity
is computed using equations (35) and (38)
and substituting estimated values of
,
, and
.
One-sided confidence intervals are computed in a similar manner.
A two-sided confidence interval for the median (see equation
(34) above) is computed as:
where
is computed using equations (35)-(38) and substituting estimated values of
,
, and
. One-sided confidence intervals are
computed in a similar manner.
This method of constructing confidence intervals is analogous to using the Wald test (e.g., Silvey, 1975, pp.115-118) to test hypotheses on the parameters.
Because of the regularity problems associated with the global maximum likelihood estimators, it is questionable whether the asymptotic variances and covariances shown above apply to local maximum likelihood estimators. Simulation studies, however, have shown that these estimates of variance and covariance perform reasonably well (Harter and Moore, 1966; Cohen and Whitten, 1980).
Note that this method of constructing confidence intervals can be used with
estimators other than the lmle's. Cohen and Whitten (1980) and Cohen et al. (1985)
found that the asymptotic variances and covariances are reasonably close to
corresponding simulated variances and covariances for the modified method of moments
estimators (method="mmme"
).
Likelihood Profile (ci.method="likelihood.profile"
)
Griffiths (1980) suggested constructing confidence intervals for the threshold
parameter based on the profile likelihood function given in equations
(9) and (10). Royston (1992b) further elaborated upon this procedure. A
two-sided
confidence interval for
is constructed as:
by finding the two values of (one larger than the lmle of
and
one smaller than the lmle of
) that satisfy:
where denotes the
'th quantile of the
chi-square distribution with
degrees of freedom.
Once these values are found, the two-sided confidence for
is computed as:
where
One-sided intervals are constructed in a similar manner.
This method of constructing confidence intervals is analogous to using the likelihood-ratio test (e.g., Silvey, 1975, pp.108-115) to test hypotheses on the parameters.
To construct a two-sided confidence interval for the median
(see equation (34)), Royston (1992b) suggested the following procedure:
Construct a confidence interval for using the likelihood
profile procedure.
Construct a confidence interval for as:
Construct the confidence interval for the median as:
Royston (1992b) actually suggested using the quantile from the standard normal
distribution instead of Student's t-distribution in step 2 above. The function
elnorm3
, however, uses the Student's t quantile.
Note that this method of constructing confidence intervals can be used with
estimators other than the lmle's.
Royston's Confidence Interval Based on Significant Skewness (ci.method="skewness"
)
Royston (1992b) suggested constructing confidence intervals for the threshold
parameter based on the idea behind the zero-skewness estimator
(
method="zero.skew"
). A two-sided confidence interval
for
is constructed by finding the two values of
that yield
a p-value of
for the test of zero-skewness on the observations
defined in equation (6) (see
gofTest
). One-sided
confidence intervals are constructed in a similar manner.
To construct confidence intervals for the median
(see equation (34)), the exact same procedure is used as for
ci.method="likelihood.profile"
, except that the confidence interval for
is based on the zero-skewness method just described instead of the
likelihood profile method.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The problem of estimating the parameters of a three-parameter lognormal distribution has been extensively discussed by Aitchison and Brown (1957, Chapter 6), Calitz (1973), Cohen (1951), Cohen (1988), Cohen and Whitten (1980), Cohen et al. (1985), Griffiths (1980), Harter and Moore (1966), Hill (1963), and Royston (1992b). Stedinger (1980) and Hoshi et al. (1984) discuss fitting the three-parameter lognormal distribution to hydrologic data.
The global maximum likelihood estimates are inadmissible. In the past, several
researchers have found that the local maximum likelihood estimates (lmle's)
occasionally fail because of convergence problems, but they were not using the
likelihood profile and reparameterization of Griffiths (1980). Cohen (1988)
recommends the modified method of moments estimators over lmle's because they are
easy to compute, they are unbiased with respect to the mean and standard deviation
on the log-scale, their variances are minimal or near minimal, and they do not
suffer from regularity problems.
Because the distribution of the lmle of the threshold parameter is far
from normal for moderate sample sizes (Griffiths, 1980), it is questionable whether
confidence intervals for the threshold or the median based on asymptotic variances
and covariances will perform well. Cohen and Whitten (1980) and Cohen et al. (1985),
however, found that the asymptotic variances and covariances are reasonably close to
corresponding simulated variances and covariances for the modified method of moments
estimators (method="mmme"). In a simulation study (5,000 Monte Carlo trials),
Royston (1992b) found that the coverage of confidence intervals for the threshold
based on the likelihood profile (ci.method="likelihood.profile") was very
close to the nominal level (94.1% for a nominal level of 95%), although not
symmetric. Royston (1992b) also found that the coverage of confidence intervals
for the threshold based on the skewness method (ci.method="skewness") was also
very close (95.4%) and symmetric.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, Chapter 5.
Calitz, F. (1973). Maximum Likelihood Estimation of the Parameters of the Three-Parameter Lognormal Distribution–a Reconsideration. Australian Journal of Statistics 15(3), 185–190.
Cohen, A.C. (1951). Estimating Parameters of Logarithmic-Normal Distributions by Maximum Likelihood. Journal of the American Statistical Association 46, 206–212.
Cohen, A.C. (1988). Three-Parameter Estimation. In Crow, E.L., and K. Shimizu, eds. Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 4.
Cohen, A.C., and B.J. Whitten. (1980). Estimation in the Three-Parameter Lognormal Distribution. Journal of the American Statistical Association 75, 399–404.
Cohen, A.C., B.J. Whitten, and Y. Ding. (1985). Modified Moment Estimation for the Three-Parameter Lognormal Distribution. Journal of Quality Technology 17, 92–99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 2.
Griffiths, D.A. (1980). Interval Estimation for the Three-Parameter Lognormal Distribution via the Likelihood Function. Applied Statistics 29, 58–68.
Harter, H.L., and A.H. Moore. (1966). Local-Maximum-Likelihood Estimation of the Parameters of Three-Parameter Lognormal Populations from Complete and Censored Samples. Journal of the American Statistical Association 61, 842–851.
Heyde, C.C. (1963). On a Property of the Lognormal Distribution. Journal of the Royal Statistical Society, Series B 25, 392–393.
Hill, B.M. (1963). The Three-Parameter Lognormal Distribution and Bayesian Analysis of a Point-Source Epidemic. Journal of the American Statistical Association 58, 72–84.
Hoshi, K., J.R. Stedinger, and J. Burges. (1984). Estimation of Log-Normal Quantiles: Monte Carlo Results and First-Order Approximations. Journal of Hydrology 71, 1–30.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897–912.
Stedinger, J.R. (1980). Fitting Lognormal Distributions to Hydrologic Data. Water Resources Research 16(3), 481–490.
Lognormal3, Lognormal, LognormalAlt, Normal.
# Generate 20 observations from a 3-parameter lognormal distribution # with parameters meanlog=1.5, sdlog=1, and threshold=10, then use # Cohen and Whitten's (1980) modified moments estimators to estimate # the parameters, and construct a confidence interval for the # threshold based on the estimated asymptotic variance. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnorm3(20, meanlog = 1.5, sdlog = 1, threshold = 10) elnorm3(dat, method = "mmme", ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: 3-Parameter Lognormal # #Estimated Parameter(s): meanlog = 1.5206664 # sdlog = 0.5330974 # threshold = 9.6620403 # #Estimation Method: mmme # #Data: dat # #Sample Size: 20 # #Confidence Interval for: threshold # #Confidence Interval Method: Normal Approximation # Based on Asymptotic Variance # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 6.985258 # UCL = 12.338823 #---------- # Repeat the above example using the other methods of estimation # and compare. round(elnorm3(dat, "lmle")$parameters, 1) #meanlog sdlog threshold # 1.3 0.7 10.5 round(elnorm3(dat, "mme")$parameters, 1) #meanlog sdlog threshold # 2.1 0.3 6.0 round(elnorm3(dat, "mmue")$parameters, 1) #meanlog sdlog threshold # 2.2 0.3 5.8 round(elnorm3(dat, "mmme")$parameters, 1) #meanlog sdlog threshold # 1.5 0.5 9.7 round(elnorm3(dat, "zero.skew")$parameters, 1) #meanlog sdlog threshold # 1.3 0.6 10.3 round(elnorm3(dat, "royston")$parameters, 1) #meanlog sdlog threshold # 1.4 0.6 10.1 #---------- # Compare methods for computing a two-sided 95% confidence interval # for the threshold: # modified method of moments estimator using asymptotic variance, # lmle using asymptotic variance, # lmle using likelihood profile, and # zero-skewness estimator using the skewness method. elnorm3(dat, method = "mmme", ci = TRUE, ci.method = "avar")$interval$limits # LCL UCL # 6.985258 12.338823 elnorm3(dat, method = "lmle", ci = TRUE, ci.method = "avar")$interval$limits # LCL UCL # 9.017223 11.980107 elnorm3(dat, method = "lmle", ci = TRUE, ci.method="likelihood.profile")$interval$limits # LCL UCL # 3.699989 11.266029 elnorm3(dat, method = "zero.skew", ci = TRUE, ci.method = "skewness")$interval$limits # LCL UCL #-25.18851 11.18652 #---------- # Now construct a confidence interval for the median of the distribution # based on using the modified method of moments estimator for threshold # and the asymptotic variances and covariances. Note that the true median # is given by threshold + exp(meanlog) = 10 + exp(1.5) = 14.48169. 
elnorm3(dat, method = "mmme", ci = TRUE, ci.parameter = "median") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: 3-Parameter Lognormal # #Estimated Parameter(s): meanlog = 1.5206664 # sdlog = 0.5330974 # threshold = 9.6620403 # #Estimation Method: mmme # #Data: dat # #Sample Size: 20 # #Confidence Interval for: median # #Confidence Interval Method: Normal Approximation # Based on Asymptotic Variance # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 11.20541 # UCL = 17.26922 #---------- # Compare methods for computing a two-sided 95% confidence interval # for the median: # modified method of moments estimator using asymptotic variance, # lmle using asymptotic variance, # lmle using likelihood profile, and # zero-skewness estimator using the skewness method. elnorm3(dat, method = "mmme", ci = TRUE, ci.parameter = "median", ci.method = "avar")$interval$limits # LCL UCL #11.20541 17.26922 elnorm3(dat, method = "lmle", ci = TRUE, ci.parameter = "median", ci.method = "avar")$interval$limits # LCL UCL #12.28326 15.87233 elnorm3(dat, method = "lmle", ci = TRUE, ci.parameter = "median", ci.method = "likelihood.profile")$interval$limits # LCL UCL # 6.314583 16.165525 elnorm3(dat, method = "zero.skew", ci = TRUE, ci.parameter = "median", ci.method = "skewness")$interval$limits # LCL UCL #-22.38322 16.33569 #---------- # Clean up #--------- rm(dat)
Estimate the mean and coefficient of variation of a lognormal distribution, and optionally construct a confidence interval for the mean.
elnormAlt(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "land", conf.level = 0.95, parkin.list = NULL)
elnormAlt(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "land", conf.level = 0.95, parkin.list = NULL)
x |
numeric vector of positive observations. |
method |
character string specifying the method of estimation. Possible values are
|
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
parkin.list |
a list containing arguments for the function |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let x = (x1, x2, …, xn) be a vector of n observations from a lognormal distribution with parameters mean=θ and cv=τ. Let η denote the standard deviation of this distribution, so that η = θτ. Set y = log(x). Then y is a vector of n observations from a normal distribution with parameters mean=μ and sd=σ. See the help file for LognormalAlt for the relationship between θ, τ, η, μ, and σ.
Estimation
This section explains how each of the estimators of mean=θ and cv=τ is computed. The approach is to first compute estimates of θ and η² (the mean and variance of the lognormal distribution) and then compute the estimate of the cv as the ratio of the estimated standard deviation to the estimated mean.
Minimum Variance Unbiased Estimation (method="mvue"
)
The minimum variance unbiased estimators (mvue's) of θ and η² were derived
by Finney (1941) and are discussed in Gilbert (1987, pp. 164-167) and Cohn et al. (1989).
These estimators are computed from the sample mean and variance of the log-transformed
observations using Finney's g function. The expected value and variance of the mvue of θ
are given by Bradu and Mundlak (1970) and Cohn et al. (1989).
Maximum Likelihood Estimation (method="mle"
)
The maximum likelihood estimators (mle's) of θ and η² are obtained by replacing μ and σ²
with their mle's in the expressions for the lognormal mean and variance, so the mle of θ
is the antilog of the mle of μ plus half the mle of σ². The expected value and variance of
the mle of θ are given by Cohn et al. (1989). As can be seen from equation (12), the
expected value of the mle of θ does not exist when σ² is too large relative to the sample
size n; in general, higher-order moments of the mle of θ fail to exist for even smaller
values of σ².
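As a concrete sketch of this relationship, the code below computes the mle of the mean and cv directly from the log-scale mle's, using the standard lognormal moment formulas (mean = exp(meanlog + sdlog^2/2), cv = sqrt(exp(sdlog^2) - 1)); it should agree with elnormAlt(..., method = "mle") under these assumptions.

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
y <- log(x)
mu.hat <- mean(y)
sigma2.hat <- mean((y - mu.hat)^2)         # mle of the log-scale variance (divisor n)
c(mean = exp(mu.hat + sigma2.hat / 2),     # mle of the lognormal mean
  cv   = sqrt(exp(sigma2.hat) - 1))        # mle of the coefficient of variation
elnormAlt(x, method = "mle")$parameters    # compare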
Quasi Maximum Likelihood Estimation (method="qmle"
)
The quasi maximum likelihood estimators (qmle's; Cohn et al., 1989; Gilbert, 1987, p.167) of
θ and η² are the same as the mle's, except that the mle of σ² in equations (8) and (10) is
replaced with the more commonly used mvue of σ² shown in equation (4) (i.e., the usual
unbiased sample variance of the log-transformed observations).
The expected value and variance of the qmle of θ are given by Cohn et al. (1989). As can be
seen from equation (17), the expected value of the qmle of θ does not exist when σ² is too
large relative to the sample size n; in general, higher-order moments of the qmle of θ fail
to exist for even smaller values of σ².
Note that Gilbert (1987, p. 167) incorrectly presents equation (12) rather than
equation (17) as the expected value of the qmle of θ. For large values of n relative to σ²,
however, equations (12) and (17) are virtually identical.
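A minimal sketch of the qmle under the description above (i.e., the mle formulas with the unbiased log-scale variance substituted):

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
y <- log(x)
s2 <- var(y)                               # unbiased (divisor n-1) log-scale variance
c(mean = exp(mean(y) + s2 / 2),            # qmle of the lognormal mean
  cv   = sqrt(exp(s2) - 1))                # qmle of the coefficient of variation
elnormAlt(x, method = "qmle")$parameters   # compare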
Method of Moments Estimation (method="mme"
)
The method of moments estimators (mme's) of θ and η² are found by equating
the sample mean and variance with their population values: the mme of θ is the sample mean,
and the mme of η² is the sample variance computed with divisor n.
Note that the estimator of variance in equation (20) is biased.
The expected value and variance of the mme of θ are the usual ones for a sample mean
(expected value θ and variance η²/n).
Method of Moments Estimation Based on the Unbiased Estimate of Variance (method="mmue"
)
These estimators are exactly the same as the method of moments estimators described above, except
that the usual unbiased estimate of variance (divisor n-1) is used.
Since the mmue of θ is equivalent to the mme of θ (both are the sample mean), so are its mean and variance.
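A short sketch of the moment estimators as defined above (sample mean for the mean; divisor-n variance for "mme" and divisor-(n-1) variance for "mmue"):

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
s2.mme  <- mean((x - mean(x))^2)    # biased (divisor n) sample variance
s2.mmue <- var(x)                   # unbiased (divisor n-1) sample variance
rbind(mme  = c(mean = mean(x), cv = sqrt(s2.mme)  / mean(x)),
      mmue = c(mean = mean(x), cv = sqrt(s2.mmue) / mean(x)))
elnormAlt(x, method = "mme")$parameters    # compare
elnormAlt(x, method = "mmue")$parameters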
Confidence Intervals
This section explains the different methods for constructing confidence intervals
for , the mean of the lognormal distribution.
Land's Method (ci.method="land"
)
Land (1971, 1975) derived a method for computing one-sided (lower or upper)
uniformly most accurate unbiased confidence intervals for θ. A
two-sided confidence interval can be constructed by combining an optimal lower
confidence limit with an optimal upper confidence limit. This procedure for
two-sided confidence intervals is only asymptotically optimal, but for most
purposes should be acceptable (Land, 1975, p.387).
As shown in equation (3) in the help file for LognormalAlt, the mean θ of a lognormal
random variable is related to the mean μ and standard deviation σ of the log-transformed
random variable by the relationship θ = exp(β), where β = μ + σ²/2. Land (1971) developed
confidence bounds for the quantity β; the mvue of β is the sample mean of the
log-transformed observations plus half the mvue of σ² shown in equation (4).
The two-sided confidence interval for β, as well as the one-sided upper and one-sided
lower intervals, are computed from the mvue of β, the estimate of σ² (see equation (4)
above), and a factor given in tables in Land (1975). Thus, by equations (25)-(30), the
two-sided and one-sided confidence intervals for θ are obtained by exponentiating the
corresponding limits for β.
Note that Gilbert (1987, pp. 169-171, 264-265) denotes this factor by H and reproduces
a subset of Land's (1975) tables. Some guidance documents (e.g., USEPA, 1992d) refer to
this quantity as the H-statistic.
Zou et al.'s Method (ci.method="zou"
)
Zou et al. (2009) proposed the following approximation for the two-sided
confidence interval for θ. The lower and upper limits (equations (34) and (35)) are
computed from the sample mean and variance of the log-transformed observations together
with a quantile of the standard normal distribution and quantiles of the chi-square
distribution with n-1 degrees of freedom. The one-sided lower confidence
limit and one-sided upper confidence limit are given by equations (34) and (35),
respectively, with α/2 replaced by α.
Parkin et al.'s Method (ci.method="parkin"
)
This method was developed by Parkin et al. (1990). It can be shown that the
mean of a lognormal distribution corresponds to the p'th quantile of the distribution,
where p = Φ(σ/2) (equation (36)) and Φ denotes the cumulative distribution function of
the standard normal distribution. Parkin et al. (1990) suggested estimating p by
replacing σ in equation (36) with the estimate as computed in equation (4). Once an
estimate of p is obtained, a nonparametric confidence interval can be constructed for
the p'th quantile, assuming p is equal to its estimated value (see
eqnpar
).
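A hedged sketch of this idea is shown below; the nonparametric interval method assumed here (the eqnpar defaults) may differ from the one elnormAlt uses internally, so the limits may not match the ci.method="parkin" output exactly.

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
sigma.hat <- sd(log(x))          # estimate of the log-scale standard deviation
p.hat <- pnorm(sigma.hat / 2)    # estimated probability associated with the mean
eqnpar(x, p = p.hat, ci = TRUE)$interval$limits   # nonparametric CI for that quantile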
Cox's Method (ci.method="cox"
)
This method was suggested by Professor D.R. Cox and is illustrated in Land (1972).
El-Shaarawi (1989) adapts this method to the case of censored water quality data.
Cox's idea is to construct an approximate confidence interval for the quantity β
defined in equation (26) above assuming the estimate of β is approximately normally
distributed, and then exponentiate the confidence limits. That is, a two-sided
confidence interval for θ is constructed by exponentiating the limits of the interval
formed by the estimate of β plus or minus the appropriate quantile of Student's
t-distribution times the estimated standard deviation of the estimator of β.
Note that this method, unlike the normal approximation method discussed below,
guarantees a positive value for the lower confidence limit. One-sided confidence
intervals are computed in a similar fashion.
An estimator of β can be defined as the estimate of μ plus half the estimate of σ², and
its variance follows from the variances and covariance of those two estimates. The function
elnormAlt
follows Land (1972) and uses the minimum variance unbiased estimator for β shown in
equation (27) above, so the variance and estimated variance of this estimator follow
accordingly (equation (41)). Note that El-Shaarawi (1989, equation 5) simply replaces the
value of σ² in equation (41) with some estimator of σ² (the mle or mvue of σ²), rather than
using the mvue of the variance of the estimator of β as shown in equation (41).
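A hedged sketch of the Cox-type interval: form an approximate interval for the quantity beta = meanlog + sdlog^2/2 on the log scale and exponentiate. The simple plug-in standard error used here is a common large-sample form, not the mvue-based variance that elnormAlt uses, so the limits will differ slightly from the ci.method="cox" output shown in the EXAMPLES section.

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
y <- log(x); n <- length(y); s2 <- var(y)
beta.hat <- mean(y) + s2 / 2                       # estimate of beta
se.beta <- sqrt(s2 / n + s2^2 / (2 * (n - 1)))     # plug-in standard error (assumption)
exp(beta.hat + c(-1, 1) * qt(0.975, n - 1) * se.beta)   # approximate two-sided 95% CI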
Normal Approximation (ci.method="normal.approx"
)
This method constructs approximate confidence intervals for θ based on the assumption
that the estimator of θ is approximately normally distributed. That is, a two-sided
confidence interval for θ is constructed as the estimate of θ plus or minus the
appropriate quantile times the estimated standard deviation of the estimator of θ
(equation (42)). One-sided confidence intervals are computed in a similar fashion.
When method="mvue"
is used to estimate θ, an unbiased estimate of the variance of the estimator of θ is
used in equation (42) (Bradu and Mundlak, 1970, equation 4.3; Gilbert, 1987, equation 13.5).
When method="mle"
is used to estimate θ, the estimate of the variance of the estimator of θ is computed
by replacing μ and σ² in equation (13) with their mle's.
When method="qmle"
is used to estimate θ, the estimate of the variance of the estimator of θ is computed
by replacing μ and σ² in equation (18) with their mvue's. Note that equation (45) is
exactly the same as Gilbert's (1987, p. 167) equation 13.8a, except for one quantity
that Gilbert (1987) presents erroneously; for large values of n relative to σ², however,
this makes little difference.
When method="mme"
, the estimate of the variance of the estimator of θ is computed by replacing η² in
equation (22) with the mme of η² defined in equation (20).
When method="mmue"
, the estimate of the variance of the estimator of θ is computed by replacing η² in
equation (22) with the mmue of η² defined in equation (24).
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The normal and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean or variance. This is done with confidence intervals.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
USEPA (1992d) directs persons involved in risk assessment for Superfund sites to
use Land's (1971, 1975) method (ci.method="land"
) for computing the upper
95% confidence interval for the mean, assuming the data follow a lognormal
distribution (the guidance document cites Gilbert (1987) as a source of descriptions
and tables for this method). The last example in the EXAMPLES section below
reproduces an example from this guidance document.
In the past, some authors suggested using the geometric mean, also called the
"rating curve" estimator (Cohn et al., 1989), as the estimator of the mean θ. This
estimator is the antilog of the sample mean of the log-transformed observations.
Cohn et al. (1989) cite several authors who have pointed out that this estimator is
biased and is not even a consistent estimator of the mean. In fact, it is the
maximum likelihood estimator of the median of the distribution
(see eqlnorm).
Finney (1941) computed the efficiency of the method of moments estimators of the
mean (θ) and variance (η²) of the lognormal distribution
(equations (19)-(20)) relative to the mvue's (equations (1)-(2)) as a function of
σ² (the variance of the log-transformed observations), and found that
while the mme of θ is reasonably efficient compared to the mvue of θ,
the mme of η² performs quite poorly relative to the mvue of η².
Cohn et al. (1989) and Parkin et al. (1988) have shown that the qmle and the mle of the mean can be severely biased for typical environmental data, and suggest always using the mvue.
Parkin et al. (1990) studied the performance of various methods for constructing a
confidence interval for the mean via Monte Carlo simulation. They compared
approximate methods to Land's optimal method (ci.method="land"
). They used
four parent lognormal distributions to generate observations; all had mean 10, but
differed in coefficient of variation: 50, 100, 200, and 500%. They also generated
sample sizes from 6 to 100 in increments of 2. For each combination of parent
distribution and sample size, they generated 25,000 Monte Carlo trials.
Parkin et al. found that for small sample sizes, none of the
approximate methods ("parkin", "cox", "normal.approx") worked
very well. For larger sample sizes, their method ("parkin") provided reasonably
accurate coverage. Cox's method ("cox") also worked well for larger sample sizes, and
performed slightly better than Parkin et al.'s method ("parkin") for highly
skewed populations.
Zou et al. (2009) used Monte Carlo simulation to compare the performance of their method with the CGI method of Krishnamoorthy and Mathew (2003) and the modified Cox method of Armstrong (1992) and El-Shaarawi and Lin (2007). Performance was assessed based on 1) percentage of times the interval contained the parameter value (coverage%), 2) balance between left and right tail errors, and 3) confidence interval width. All three methods showed acceptable coverage percentages. The modified Cox method showed unbalanced tail errors, and Zou et al.'s method showed consistently narrower average width.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, Chapter 5.
Armstrong, B.G. (1992). Confidence Intervals for Arithmetic Means of Lognormally Distributed Exposures. American Industrial Hygiene Association Journal 53, 481–485.
Bradu, D., and Y. Mundlak. (1970). Estimation in Lognormal Linear Models. Journal of the American Statistical Association 65, 198–211.
Cohn, T.A., L.L. DeLong, E.J. Gilroy, R.M. Hirsch, and D.K. Wells. (1989). Estimating Constituent Loads. Water Resources Research 25(5), 937–942.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 2.
El-Shaarawi, A.H., and J. Lin. (2007). Interval Estimation for Log-Normal Mean with Applications to Water Quality. Environmetrics 18, 1–10.
El-Shaarawi, A.H., and R. Viveros. (1997). Inference About the Mean in Log-Regression with Environmental Applications. Environmetrics 8, 569–582.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Finney, D.J. (1941). On the Distribution of a Variate Whose Logarithm is Normally Distributed. Supplement to the Journal of the Royal Statistical Society 7, 155–161.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Krishnamoorthy, K., and T.P. Mathew. (2003). Inferences on the Means of Lognormal Distributions Using Generalized p-Values and Generalized Confidence Intervals. Journal of Statistical Planning and Inference 115, 103–121.
Land, C.E. (1971). Confidence Intervals for Linear Functions of the Normal Mean and Variance. The Annals of Mathematical Statistics 42(4), 1187–1205.
Land, C.E. (1972). An Evaluation of Approximate Confidence Interval Estimation Methods for Lognormal Means. Technometrics 14(1), 145–158.
Land, C.E. (1973). Standard Confidence Limits for Linear Functions of the Normal Mean and Variance. Journal of the American Statistical Association 68(344), 960–963.
Land, C.E. (1975). Tables of Confidence Limits for Linear Functions of the Normal Mean and Variance, in Selected Tables in Mathematical Statistics, Vol. III. American Mathematical Society, Providence, RI, pp. 385–419.
Likes, J. (1980). Variance of the MVUE for Lognormal Variance. Technometrics 22(2), 253–258.
Limpert, E., W.A. Stahel, and M. Abbt. (2001). Log-Normal Distributions Across the Sciences: Keys and Clues. BioScience 51, 341–352.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Parkin, T.B., J.J. Meisinger, S.T. Chester, J.L. Starr, and J.A. Robinson. (1988). Evaluation of Statistical Estimation Methods for Lognormally Distributed Variables. Journal of the Soil Science Society of America 52, 323–329.
Parkin, T.B., S.T. Chester, and J.A. Robinson. (1990). Calculating Confidence Intervals for the Mean of a Lognormally Distributed Variable. Journal of the Soil Science Society of America 54, 321–326.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (1992d). Supplemental Guidance to RAGS: Calculating the Concentration Term. Publication 9285.7-081, May 1992. Intermittent Bulletin, Volume 1, Number 1. Office of Emergency and Remedial Response, Hazardous Site Evaluation Division, OS-230. Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zou, G.Y., C.Y. Huo, and J. Taleban. (2009). Simple Confidence Intervals for Lognormal Means and their Differences with Environmental Applications. Environmetrics 20, 172–180.
LognormalAlt, Lognormal, Normal.
# Using the Reference area TcCB data in the data frame EPA.94b.tccb.df, # estimate the mean and coefficient of variation, # and construct a 95% confidence interval for the mean. with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): mean = 0.5989072 # cv = 0.4899539 # #Estimation Method: mvue # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Land # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.5243787 # UCL = 0.7016992 #---------- # Compare the different methods of estimating the distribution parameters using the # Reference area TcCB data. with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue"))$parameters # mean cv #0.5989072 0.4899539 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "qmle"))$parameters # mean cv #0.6004468 0.4947791 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mle"))$parameters # mean cv #0.5990497 0.4888968 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mme"))$parameters # mean cv #0.5985106 0.4688423 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mmue"))$parameters # mean cv #0.5985106 0.4739110 #---------- # Compare the different methods of constructing the confidence interval for # the mean using the Reference area TcCB data. with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "land"))$interval$limits # LCL UCL #0.5243787 0.7016992 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "zou"))$interval$limits # LCL UCL #0.5230444 0.6962071 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "parkin"))$interval$limits # LCL UCL #0.50 0.74 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "cox"))$interval$limits # LCL UCL #0.5196213 0.6938444 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "normal.approx"))$interval$limits # LCL UCL #0.5130160 0.6847984 #---------- # Reproduce the example in Highlights 7 and 8 of USEPA (1992d). This example shows # how to compute the upper 95% confidence limit of the mean of a lognormal distribution # and compares it to the result of computing the upper 95% confidence limit assuming a # normal distribution. The data for this example are chromium concentrations (mg/kg) in # soil samples collected randomly over a Superfund site, and are stored in the data frame # EPA.92d.chromium.vec. # First look at the data EPA.92d.chromium.vec # [1] 10 13 20 36 41 59 67 110 110 136 140 160 200 230 1300 stripChart(EPA.92d.chromium.vec, ylab = "Chromium (mg/kg)") # Note there is one very large "outlier" (1300). 
# Perform a goodness-of-fit test to determine whether a lognormal distribution # is appropriate: gof.list <- gofTest(EPA.92d.chromium.vec, dist = 'lnormAlt') gof.list #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): mean = 159.855185 # cv = 1.493994 # #Estimation Method: mvue # #Data: EPA.92d.chromium.vec # #Sample Size: 15 # #Test Statistic: W = 0.9607179 # #Test Statistic Parameter: n = 15 # #P-value: 0.7048747 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. plot(gof.list, digits = 2) # The lognormal distribution seems to provide an adequate fit, although the largest # observation (1300) is somewhat suspect, and given the small sample size there is # not much power to detect any kind of mild deviation from a lognormal distribution. # Now compute the one-sided 95% upper confidence limit for the mean. # Note that the value of 502 mg/kg shown in Hightlight 7 of USEPA (1992d) is a bit # larger than the exact value of 496.6 mg/kg shown below. # This is simply due to rounding error. elnormAlt(EPA.92d.chromium.vec, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): mean = 159.855185 # cv = 1.493994 # #Estimation Method: mvue # #Data: EPA.92d.chromium.vec # #Sample Size: 15 # #Confidence Interval for: mean # #Confidence Interval Method: Land # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0 # UCL = 496.6282 # Now compare this result with the upper 95% confidence limit based on assuming # a normal distribution. Again note that the value of 325 mg/kg shown in # Hightlight 8 is slightly larger than the exact value of 320.3 mg/kg shown below. # This is simply due to rounding error. enorm(EPA.92d.chromium.vec, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 175.4667 # sd = 318.5440 # #Estimation Method: mvue # #Data: EPA.92d.chromium.vec # #Sample Size: 15 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 320.3304 #---------- # Clean up #--------- rm(gof.list)
Estimate the mean and coefficient of variation of a lognormal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
elnormAltCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ...)
elnormAltCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ...)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
method |
character string specifying the method of estimation. For singly censored data, the possible values are: For multiply censored data, the possible values are: See the DETAILS section for more information. |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are
The confidence interval method See the DETAILS section for more information.
This argument is ignored if |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when |
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when |
... |
additional arguments to pass to other functions.
|
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let x = (x1, x2, …, xN) denote a vector of N observations from a lognormal distribution with parameters mean=θ and cv=τ. Let η denote the standard deviation of this distribution, so that η = θτ. Set y = log(x). Then y is a vector of N observations from a normal distribution with parameters mean=μ and sd=σ. See the help file for LognormalAlt for the relationship between θ, τ, η, μ, and σ.
Assume n (0 < n < N) of these N observations are known and c (c = N - n) of the
observations are all censored below (left-censored) or all censored above
(right-censored) at k fixed censoring levels T1, T2, …, Tk (k ≥ 1).
For the case when k ≥ 2, the data are said to be Type I
multiply censored. For the case when k = 1, set T = T1. If the data are left-censored
and all n known observations are greater than or equal to T, or if the data are
right-censored and all n known observations are less than or equal to T, then the data
are said to be Type I singly censored (Nelson, 1982, p.7), otherwise
they are considered to be Type I multiply censored.
Let cj denote the number of observations censored below or above censoring
level Tj for j = 1, 2, …, k, so that the cj sum to c.
Let x(1), x(2), …, x(N) denote the “ordered” observations,
where now “observation” means either the actual observation (for uncensored
observations) or the censoring level (for censored observations). For
right-censored data, if a censored observation has the same value as an
uncensored one, the uncensored observation should be placed first.
For left-censored data, if a censored observation has the same value as an
uncensored one, the censored observation should be placed first.
Note that in this case the quantity x(i) does not necessarily represent
the i'th “largest” observation from the (unknown) complete sample.
Finally, let Ω (omega) denote the set of n subscripts in the
“ordered” sample that correspond to uncensored observations.
ESTIMATION
This section explains how each of the estimators of mean=θ and cv=τ is computed. The approach is to first compute estimates of θ and η² (the mean and variance of the lognormal distribution) and then compute the estimate of the cv as the ratio of the estimated standard deviation to the estimated mean.
Maximum Likelihood Estimation (method="mle"
)
The maximum likelihood estimators of θ, τ, and η are computed from the maximum
likelihood estimators of μ and σ fitted to the censored log-transformed observations,
using the standard relationships between the lognormal parameters (θ, τ, η) and the
normal parameters (μ, σ). See the help file for
enormCensored
for information on how the mle's of μ and σ are computed.
Quasi Minimum Variance Unbiased Estimation Based on the MLE's (method="qmvue"
)
The maximum likelihood estimators of θ and η² are biased.
Even for complete (uncensored) samples these estimators are biased
(see equation (12) in the help file for
elnormAlt
).
The bias tends to 0 as the sample size increases, but it can be considerable for
small sample sizes.
(Cohn et al., 1989, demonstrate the bias for complete data sets.)
For the case of complete samples, the minimum variance unbiased estimators (mvue's)
of θ and η² were derived by Finney (1941) and are discussed in
Gilbert (1987, pp.164-167) and Cohn et al. (1989). These estimators are computed from
the sample mean and variance of the log-transformed observations using Finney's
g function (see the help file for elnormAlt).
For Type I censored samples, the quasi minimum variance unbiased estimators
(qmvue's) of θ and η² are computed using equations (6) and (7)
and estimating μ and σ with their mle's (see
elnormCensored
).
For singly censored data, this is apparently the LM method of Gilliom and Helsel
(1986, p.137) (it is not clear from their description on page 137 whether their
LM method is the straight method="mle"
described above or
method="qmvue"
described here). This method was also used by
Newman et al. (1989, p.915, equations 10-11).
For multiply censored data, this is apparently the MM method of Helsel and Cohn
(1988, p.1998). (It is not clear from their description on page 1998 and the
description in Gilliom and Helsel, 1986, page 137 whether Helsel and Cohn's (1988)
MM method is the straight method="mle"
described above or method="qmvue"
described here.)
Bias-Corrected Maximum Likelihood Estimation (method="bcmle"
)
This method was derived by El-Shaarawi (1989) and can be applied to complete or
censored data sets. For complete data, the exact relative bias of the mle of
the mean is given as:
(see equation (12) in the help file for elnormAlt
).
For the case of complete or censored data, El-Shaarawi (1989) proposed the following “bias-corrected” maximum likelihood estimator:
where
and denotes the asymptotic variance-covariance of the mle's of
and
, which is based on the observed information matrix, formulas for
which are given in Cohen (1991). El-Shaarawi (1989) does not propose a
bias-corrected estimator of the variance
, so the mle of
is computed when
method="bcmle"
.
Robust Regression on Order Statistics (method="rROS"
) or
Imputation Using Quantile-Quantile Regression (method ="impute.w.qq.reg"
)
This is the robust Regression on Order Statistics (rROS) method discussed in USEPA (2009)
and Helsel (2012). This method involves using quantile-quantile regression on the
log-transformed observations to fit a regression line (and thus initially estimate the
mean μ and standard deviation σ in log-space), imputing the log-transformed values of the
censored observations by predicting them from the regression equation, transforming the
log-scale imputed values back to the original scale, and then computing the method of
moments estimates of the mean and standard deviation based on the observed and imputed values.
The steps are:
Estimate μ and σ by computing the least-squares estimates in a regression of the
log-transformed uncensored observations on the normal scores of their plotting positions,
where the plotting position associated with the i'th largest value depends on a constant
a (0 ≤ a ≤ 1; the default value is 0.375), Φ denotes the cumulative distribution function
(cdf) of the standard normal distribution, and Ω denotes the set of subscripts associated
with the uncensored observations in the ordered sample. The plotting positions are
computed by calling the function
ppointsCensored
.
Compute the log-scale imputed values for the censored observations by predicting them
from the fitted regression line.
Retransform the log-scale imputed values by exponentiating them.
Compute the usual method of moments estimates of the mean and variance from the
uncensored observations and the imputed values (a simplified sketch of these steps is
given at the end of this subsection).
Note that the estimate of variance is actually the usual unbiased one (not the method of moments one) in the case of complete data.
For singly censored data, this method is discussed by Hashimoto and Trussell (1983), Gilliom and Helsel (1986), and El-Shaarawi (1989), and is referred to as the LR (Log-Regression) or Log-Probability Method.
For multiply censored data, this is the MR method of Helsel and Cohn (1988, p.1998).
They used it with the probability method of Hirsch and Stedinger (1987) and
Weibull plotting positions (i.e., prob.method="hirsch-stedinger"
and
plot.pos.con=0
).
The argument plot.pos.con
(see the entry for ... in the ARGUMENTS
section above) determines the value of the plotting positions computed in
equations (14) and (15) when method
equals "hirsch-stedinger"
or
"michael-schucany"
. The default value is plot.pos.con=0.375
.
See the help file for ppointsCensored
for more information.
The arguments lb.impute
and ub.impute
(see the entry for ... in
the ARGUMENTS section above) determine the lower and upper bounds for the
imputed values. Imputed values smaller than lb.impute
are set to this
value. Imputed values larger than ub.impute
are set to this value.
The default values are lb.impute=0
and ub.impute=Inf
.
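The following simplified, hypothetical sketch illustrates the rROS steps for left singly censored data. It uses crude Blom-type plotting positions for all observations instead of ppointsCensored, so its results will only roughly agree with elnormAltCensored(..., method = "rROS").

rROS.sketch <- function(x, censored, plot.pos.con = 0.375) {
  n <- length(x)
  # crude plotting positions (the real method calls ppointsCensored instead)
  pp <- (rank(x, ties.method = "first") - plot.pos.con) / (n - 2 * plot.pos.con + 1)
  z <- qnorm(pp)                                      # normal scores
  fit <- lm(log(x) ~ z, subset = !censored)           # q-q regression on uncensored obs
  y.cen <- coef(fit)[1] + coef(fit)[2] * z[censored]  # impute log-scale values
  x.new <- x
  x.new[censored] <- exp(y.cen)                       # back-transform imputed values
  c(mean = mean(x.new), cv = sd(x.new) / mean(x.new))
}
set.seed(47)
x <- rlnormAlt(25, mean = 10, cv = 1)
censored <- x < 5
x[censored] <- 5                                      # left-censor at 5
rROS.sketch(x, censored)
elnormAltCensored(x, censored, method = "rROS")$parameters   # compare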
Imputation Using Quantile-Quantile Regression Including the Censoring Level
(method ="impute.w.qq.reg.w.cen.level"
)
This method is only available for singly censored data. It was
proposed by El-Shaarawi (1989), who denoted it the Modified LR Method.
It is exactly the same method as imputation
using quantile-quantile regression (method="impute.w.qq.reg"
), except that
the quantile-quantile regression includes the censoring level: for left singly
censored data, a point corresponding to the censoring level is added to the plot
before fitting the least-squares line, and for right singly censored data the
analogous point is added before fitting the least-squares line.
Imputation Using Maximum Likelihood (method ="impute.w.mle"
)
This method is only available for singly censored data.
This is exactly the same method as robust Regression on Order Statistics (i.e.,
the same as using method="rROS"
or method="impute.w.qq.reg"
),
except that the maximum likelihood method (method="mle"
) is used to compute
the initial estimates of the mean and standard deviation.
In the context of lognormal data, this method is discussed
by El-Shaarawi (1989), which he denotes as the Modified Maximum Likelihood Method.
Setting Censored Observations to Half the Censoring Level (method="half.cen.level"
)
This method is applicable only to left censored data that is bounded below by 0.
This method involves simply replacing all the censored observations with half their
detection limit, and then computing the usual moment estimators of the mean and
variance. That is, all censored observations are imputed to be half the detection
limit, and then Equations (17) and (18) are used to estimate the mean and variance.
This method is included only to allow comparison of this method to other methods.
Setting left-censored observations to half the censoring level is not
recommended. In particular, El-Shaarawi and Esterby (1992) show that these
estimators are biased and inconsistent (i.e., the bias remains even as the sample
size increases).
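For comparison only (this substitution method is not recommended), the sketch below shows what the half-the-censoring-level substitution amounts to; the variance divisor assumed here (n-1) may differ from the one the function uses.

set.seed(47)
x <- rlnormAlt(25, mean = 10, cv = 1)
censored <- x < 5
x[censored] <- 5                    # left-censor at 5
x.sub <- x
x.sub[censored] <- 5 / 2            # replace censored values with half the censoring level
c(mean = mean(x.sub), cv = sd(x.sub) / mean(x.sub))
elnormAltCensored(x, censored, method = "half.cen.level")$parameters   # compare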
CONFIDENCE INTERVALS
This section explains how confidence intervals for the mean are
computed.
Likelihood Profile (ci.method="profile.likelihood"
)
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean while treating the coefficient of
variation
as a nuisance parameter.
For Type I left censored data, the likelihood function is given by:
where and
denote the probability density function (pdf) and
cumulative distribution function (cdf) of the population. That is,
where
and and
denote the pdf and cdf of the standard normal
distribution, respectively (Cohen, 1963; 1991, pp.6, 50). For left singly
censored data, equation (3) simplifies to:
Similarly, for Type I right censored data, the likelihood function is given by:
and for right singly censored data this simplifies to:
Following Stryhn and Christensen (2003), denote the maximum likelihood estimates
of the mean and coefficient of variation by .
The likelihood ratio test statistic (
) of the hypothesis
(where
is a fixed value) equals the
drop in
between the “full” model and the reduced model with
fixed at
, i.e.,
where is the maximum likelihood estimate of
for the
reduced model (i.e., when
). Under the null hypothesis,
the test statistic
follows a
chi-squared distribution with 1 degree of freedom.
Alternatively, we may
express the test statistic in terms of the profile likelihood function
for the mean
, which is obtained from the usual likelihood function by
maximizing over the parameter
, i.e.,
Then we have
A two-sided confidence interval for the mean
consists of all values of
for which the test is not significant at
level
:
where denotes the
'th quantile of the
chi-squared distribution with
degrees of freedom.
One-sided lower and one-sided upper confidence intervals are computed in a similar
fashion, except that the quantity
in Equation (30) is replaced with
.
Direct Normal Approximations (ci.method="delta"
or ci.method="normal.approx"
)
An approximate confidence interval for
can be
constructed assuming the distribution of the estimator of
is
approximately normally distributed. That is, a two-sided
confidence interval for
is constructed as:
where denotes the estimate of
,
denotes the estimated asymptotic standard
deviation of the estimator of
,
denotes the assumed sample
size for the confidence interval, and
denotes the
'th
quantile of Student's t-distribution with
degrees of freedom. One-sided confidence intervals are computed in a
similar fashion.
The argument ci.sample.size
determines the assumed sample size (see the entry for ... in the ARGUMENTS section above).
When
method
equals "mle"
, "qmvue"
, or "bcmle"
and the data are singly censored, the default value is the
expected number of uncensored observations; otherwise it is n,
the observed number of uncensored observations. This is simply an ad-hoc
method of constructing confidence intervals and is not based on any
published theoretical results.
When pivot.statistic="z"
, the corresponding quantile from the standard normal distribution is used in place
of the quantile from Student's t-distribution.
Direct Normal Approximation Based on the Delta Method (ci.method="delta"
)
This method is usually applied with the maximum likelihood estimators
(method="mle"
). It should also work approximately for the quasi minimum
variance unbiased estimators (method="qmvue"
) and the bias-corrected maximum
likelihood estimators (method="bcmle"
).
When method="mle"
, the variance of the mle of can be estimated
based on the variance-covariance matrix of the mle's of
and
(denoted
), and the delta method:
where
(Shumway et al., 1989). The variance-covariance matrix of the mle's of
and
is estimated based on the inverse of the observed Fisher
Information matrix, formulas for which are given in Cohen (1991).
Direct Normal Approximation Based on the Moment Estimators (ci.method="normal.approx")

This method is valid only for the moment estimators based on imputed values
(i.e., method="impute.w.qq.reg" or method="half.cen.level"). For
these cases, the standard deviation of the estimated mean is assumed to be
approximated by

σ̂_{θ̂} = σ̂ / √m

where, as already noted, m denotes the assumed sample size.
This is simply an ad-hoc method of constructing confidence intervals and is not
based on any published theoretical results.
Cox's Method (ci.method="cox")

This method may be applied with the maximum likelihood estimators
(method="mle"), the quasi minimum variance unbiased estimators
(method="qmvue"), and the bias-corrected maximum likelihood estimators
(method="bcmle").

This method was proposed by El-Shaarawi (1989) and is an extension of the
method derived by Cox and presented in Land (1972) for the case of
complete data (see the explanation of ci.method="cox" in the help file
for elnormAlt). The idea is to construct an approximate
(1 − α)100% confidence interval for the quantity

β = log(θ) = μ + σ²/2

assuming the estimate of β is approximately normally distributed, and then
exponentiate the confidence limits.
That is, a two-sided (1 − α)100% confidence interval for θ is constructed as:

[exp(β̂ − t_{1−α/2, m−1} σ̂_{β̂}),  exp(β̂ + t_{1−α/2, m−1} σ̂_{β̂})]

where

β̂ = μ̂ + σ̂²/2

and σ̂_{β̂} denotes the estimated asymptotic standard
deviation of the estimator of β, m denotes the assumed sample
size for the confidence interval, and t_{p, ν} denotes the p'th
quantile of Student's t-distribution with ν degrees of freedom.

El-Shaarawi (1989) shows that the standard deviation of the mle of β can
be estimated by:

σ̂_{β̂} = √( V̂_11 + 2 σ̂ V̂_12 + σ̂² V̂_22 )

where V denotes the variance-covariance matrix of the mle's of μ and σ
and is estimated based on the inverse of the Fisher Information matrix.
One-sided confidence intervals are computed in a similar fashion.
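A hedged numerical illustration of Cox's method as described above (not EnvStats internals); mu.hat, sigma.hat, V, and m are hypothetical values chosen only to show the arithmetic:

# Hedged sketch: construct the interval for beta = mu + sigma^2/2, exponentiate.
mu.hat    <- 2.2
sigma.hat <- 1.36
V         <- matrix(c(0.08, 0.01, 0.01, 0.04), 2, 2)  # hypothetical cov of mle's
m         <- 19
alpha     <- 0.05
beta.hat  <- mu.hat + sigma.hat^2 / 2
se.beta   <- sqrt(V[1, 1] + 2 * sigma.hat * V[1, 2] + sigma.hat^2 * V[2, 2])
exp(beta.hat + c(-1, 1) * qt(1 - alpha/2, df = m - 1) * se.beta)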
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate (1 − α)100% confidence interval
for the population mean θ, the bootstrap can be broken down into the
following steps:

1. Create a bootstrap sample by taking a random sample of size N from
the observations in x, where sampling is done with
replacement. Note that because sampling is done with replacement, the same
element of x can appear more than once in the bootstrap
sample. Thus, the bootstrap sample will usually not look exactly like the
original sample (e.g., the number of censored observations in the bootstrap
sample will often differ from the number of censored observations in the
original sample).

2. Estimate θ based on the bootstrap sample created in Step 1, using
the same method that was used to estimate θ using the original
observations in x. Because the bootstrap sample usually
does not match the original sample, the estimate of θ based on the
bootstrap sample will usually differ from the original estimate based on x.

3. Repeat Steps 1 and 2 B times, where B is some large number.
The number of bootstraps B is determined by the argument
n.bootstraps (see the section ARGUMENTS above).
The default value of n.bootstraps is 1000.

4. Use the B estimated values of θ to compute the empirical
cumulative distribution function of this estimator of θ (see
ecdfPlot), and then create a confidence interval for θ
based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

[Ĝ⁻¹(α/2),  Ĝ⁻¹(1 − α/2)]    (42)

where Ĝ(t) denotes the empirical cdf evaluated at t and thus
Ĝ⁻¹(p) denotes the p'th empirical quantile, that is,
the p'th quantile associated with the empirical cdf. Similarly, a one-sided lower
confidence interval is computed as:

[Ĝ⁻¹(α),  ∞]    (43)

and a one-sided upper confidence interval is computed as:

[0,  Ĝ⁻¹(1 − α)]    (44)
The function elnormAltCensored
calls the R function quantile
to compute the empirical quantiles used in Equations (42)-(44).
The percentile method bootstrap confidence interval is only first-order
accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability
that the confidence interval will contain the true value of θ can be
off by k/√N, where k is some constant. Efron and Tibshirani
(1993, pp.184-188) proposed a bias-corrected and accelerated interval that is
second-order accurate, meaning that the probability that the confidence interval
will contain the true value of θ may be off by k/N
instead of k/√N. The two-sided bias-corrected and accelerated confidence interval is
computed as:

[Ĝ⁻¹(α_1),  Ĝ⁻¹(α_2)]    (45)

where

α_1 = Φ[ ẑ_0 + (ẑ_0 + z_{α/2}) / (1 − â(ẑ_0 + z_{α/2})) ]    (46)

α_2 = Φ[ ẑ_0 + (ẑ_0 + z_{1−α/2}) / (1 − â(ẑ_0 + z_{1−α/2})) ]    (47)

ẑ_0 = Φ⁻¹[Ĝ(θ̂)]    (48)

â = Σ_{i=1}^{N} (θ̂_(·) − θ̂_(i))³ / { 6 [ Σ_{i=1}^{N} (θ̂_(·) − θ̂_(i))² ]^{3/2} }    (49)

where z_p denotes the p'th quantile of the standard normal distribution,
the quantity θ̂_(i) denotes the estimate of θ using
all the values in x except the i'th one, and

θ̂_(·) = (1/N) Σ_{i=1}^{N} θ̂_(i)    (50)

A one-sided lower confidence interval is given by:

[Ĝ⁻¹(α_1),  ∞]    (51)

and a one-sided upper confidence interval is given by:

[0,  Ĝ⁻¹(α_2)]    (52)

where α_1 and α_2 are computed as for a two-sided confidence
interval, except that α/2 is replaced with α in Equations (46) and (47).

The constant ẑ_0 incorporates the bias correction, and the constant
â is the acceleration constant. The term “acceleration” refers
to the rate of change of the standard error of the estimate of θ with
respect to the true value of θ (Efron and Tibshirani, 1993, p.186). For a
normal (Gaussian) distribution, the standard error of the estimate of θ
does not depend on the value of θ, hence the acceleration constant is not
really necessary.
When ci.method="bootstrap"
, the function elnormAltCensored
computes both
the percentile method and bias-corrected and accelerated method bootstrap confidence
intervals.
This method of constructing confidence intervals for censored data was studied by Shumway et al. (1989).
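A minimal, hedged sketch of the percentile bootstrap for the manganese example used below (not the EnvStats implementation): resample (value, censoring indicator) pairs with replacement, re-estimate the mean with elnormAltCensored, and take empirical quantiles of the bootstrap estimates. The number of bootstraps is kept small here only for speed, and tryCatch() skips the rare resample that cannot be fit.

# Hedged percentile-bootstrap sketch for the mean, assuming the EnvStats data
# frame EPA.09.Ex.15.1.manganese.df is available (it is used in the examples).
set.seed(47)
dat  <- EPA.09.Ex.15.1.manganese.df
B    <- 200
boot <- rep(NA_real_, B)
for (b in 1:B) {
  idx <- sample(nrow(dat), replace = TRUE)
  boot[b] <- tryCatch(
    elnormAltCensored(dat$Manganese.ppb[idx], dat$Censored[idx])$parameters["mean"],
    error = function(e) NA_real_)
}
quantile(boot, c(0.025, 0.975), na.rm = TRUE)   # two-sided 95% percentile interval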
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation on the original scale, not the log-scale, when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard ([email protected])
Bain, L.J., and M. Engelhardt. (1991). Statistical Analysis of Reliability and Life-Testing Models. Marcel Dekker, New York, 496pp.
Cohen, A.C. (1959). Simplified Estimators for the Normal Distribution When Samples are Singly Censored or Truncated. Technometrics 1(3), 217–237.
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
El-Shaarawi, A.H. (1989). Inferences About the Mean from Censored Water Quality Data. Water Resources Research 25(4) 685–690.
El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.
El-Shaarawi, A.H., and S.R. Esterby. (1992). Replacement of Censored Observations by a Constant: An Evaluation. Water Research 26(6), 835–844.
El-Shaarawi, A.H., and A. Naderi. (1991). Statistical Inference from Multiply Censored Environmental Data. Environmental Monitoring and Assessment 17, 339–347.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Gleit, A. (1985). Estimation for Small Normal Data Sets with Detection Limits. Environmental Science and Technology 19, 1201–1206.
Haas, C.N., and P.A. Scheff. (1990). Estimation of Averages in Truncated Samples. Environmental Science and Technology 24(6), 912–919.
Hashimoto, L.K., and R.R. Trussell. (1983). Evaluating Water Quality Data Near the Detection Limit. Paper presented at the Advanced Technology Conference, American Water Works Association, Las Vegas, Nevada, June 5-9, 1983.
Helsel, D.R. (1990). Less than Obvious: Statistical Treatment of Data Below the Detection Limit. Environmental Science and Technology 24(12), 1766–1774.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715–727.
Korn, L.R., and D.E. Tyler. (2001). Robust Estimation for Chemical Concentration Data Subject to Detection Limits. In Fernholz, L., S. Morgenthaler, and W. Stahel, eds. Statistics in Genetics and in the Environmental Sciences. Birkhauser Verlag, Basel, pp.41–63.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461–496.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Newman, M.C., P.M. Dixon, B.B. Looney, and J.E. Pinder. (1989). Estimating Mean and Variance for Environmental Samples with Below Detection Limit Observations. Water Resources Bulletin 25(4), 905–916.
Pettitt, A. N. (1983). Re-Weighted Least Squares Estimation with Censored and Grouped Data: An Application of the EM Algorithm. Journal of the Royal Statistical Society, Series B 47, 253–260.
Regal, R. (1982). Applying Order Statistic Censored Normal Confidence Intervals to Time Censored Data. Unpublished manuscript, University of Minnesota, Duluth, Department of Mathematical Sciences.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Saw, J.G. (1961b). The Bias of the Maximum Likelihood Estimators of Location and Scale Parameters Given a Type II Censored Normal Sample. Biometrika 48, 448–451.
Schmee, J., D.Gladstein, and W. Nelson. (1985). Confidence Limits for Parameters of a Normal Distribution from Singly Censored Samples, Using Maximum Likelihood. Technometrics 27(2) 119–128.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, New York, 273pp.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Travis, C.C., and M.L. Land. (1990). Estimating the Mean of Data Sets with Nondetectable Values. Environmental Science and Technology 24, 961–962.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
LognormalAlt
, elnormAlt
,
elnormCensored
, enormCensored
,
estimateCensored.object
.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and coefficient of variation # ON THE ORIGINAL SCALE using the MLE, QMVUE, # and robust ROS (imputation with Q-Q regression). # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean and coefficient of variation # using the MLE: #--------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 23.003987 # cv = 2.300772 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # Now compare the MLE with the QMVUE and the # estimator based on robust ROS #------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored))$parameters # mean cv #23.003987 2.300772 with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, method = "qmvue"))$parameters # mean cv #21.566945 1.841366 with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, method = "rROS"))$parameters # mean cv #19.886180 1.298868 #---------- # The method used to estimate quantiles for a Q-Q plot is # determined by the argument prob.method. For the function # elnormCensoredAlt, for any estimation method that involves # Q-Q regression, the default value of prob.method is # "hirsch-stedinger" and the default value for the # plotting position constant is plot.pos.con=0.375. # Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger # probability method but set the plotting position constant to 0. with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, method = "rROS", plot.pos.con = 0))$parameters # mean cv #19.827673 1.304725 #---------- # Using the same data as above, compute a confidence interval # for the mean using the profile-likelihood method. 
with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 23.003987 # cv = 2.300772 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 12.37629 # UCL = 69.87694
Estimate the mean and standard deviation parameters of the logarithm of a lognormal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
elnormCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", nmc = 1000, seed = NULL, ...)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
method |
character string specifying the method of estimation. For singly censored data, the possible values are: For multiply censored data, the possible values are: See the DETAILS section for more information. |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are: See the DETAILS section for more information.
This argument is ignored if |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when |
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when |
nmc |
numeric scalar indicating the number of Monte Carlo simulations to run when
|
seed |
integer supplied to the function |
... |
additional arguments to pass to other functions.
|
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let X denote a random variable with a lognormal distribution with
parameters meanlog=μ and sdlog=σ. Then Y = log(X)
has a normal (Gaussian) distribution with parameters mean=μ and sd=σ.
Thus, the function elnormCensored simply calls the function enormCensored
using the log-transformed values of x.
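Because of this relationship, the log-scale estimates can be checked directly. A small illustration using the manganese example data shown below (it assumes the default MLE method for both functions): the parameters component of the second call should agree with the meanlog and sdlog values reported by the first.

# elnormCensored on x should match enormCensored on log(x) for the same method.
with(EPA.09.Ex.15.1.manganese.df,
  elnormCensored(Manganese.ppb, Censored))$parameters
with(EPA.09.Ex.15.1.manganese.df,
  enormCensored(log(Manganese.ppb), Censored))$parameters
# Same values, reported as meanlog/sdlog in the first call and mean/sd in the second.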
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Schmee et al. (1985) studied Type II censoring for a normal distribution and
noted that the bias and variances of the maximum likelihood estimators are of the
order 1/N, and that the bias is negligible for N = 100 and as much as
90% censoring. (If the proportion of censored observations is less than 90%,
the bias becomes negligible for smaller sample sizes.) For small samples with
moderate to high censoring, however, the bias of the mle's causes confidence
intervals based on them using a normal approximation (e.g.,
method="mle" and ci.method="normal.approx") to be too short. Schmee et al. (1985)
provide tables for exact confidence intervals for sample sizes up to N = 100
that were created based on Monte Carlo simulation. Schmee et al. (1985) state
that these tables should work well for Type I censored data as well.
Shumway et al. (1989) evaluated the coverage of 90% confidence intervals for the mean based on using a Box-Cox transformation to induce normality, computing the mle's based on the normal distribution, then computing the mean in the original scale. They considered three methods of constructing confidence intervals: the delta method, the bootstrap, and the bias-corrected bootstrap. Shumway et al. (1989) used three parent distributions in their study: Normal(3,1), the square of this distribution, and the exponentiation of this distribution (i.e., a lognormal distribution). Based on sample sizes of 10 and 50 with a censoring level at the 10'th or 20'th percentile, Shumway et al. (1989) found that the delta method performed quite well and was superior to the bootstrap method.
Millard et al. (2014; in preparation) show that the coverage of the profile likelihood method is excellent.
Steven P. Millard ([email protected])
Bain, L.J., and M. Engelhardt. (1991). Statistical Analysis of Reliability and Life-Testing Models. Marcel Dekker, New York, 496pp.
Cohen, A.C. (1959). Simplified Estimators for the Normal Distribution When Samples are Singly Censored or Truncated. Technometrics 1(3), 217–237.
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
El-Shaarawi, A.H. (1989). Inferences About the Mean from Censored Water Quality Data. Water Resources Research 25(4) 685–690.
El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.
El-Shaarawi, A.H., and S.R. Esterby. (1992). Replacement of Censored Observations by a Constant: An Evaluation. Water Research 26(6), 835–844.
El-Shaarawi, A.H., and A. Naderi. (1991). Statistical Inference from Multiply Censored Environmental Data. Environmental Monitoring and Assessment 17, 339–347.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Gleit, A. (1985). Estimation for Small Normal Data Sets with Detection Limits. Environmental Science and Technology 19, 1201–1206.
Haas, C.N., and P.A. Scheff. (1990). Estimation of Averages in Truncated Samples. Environmental Science and Technology 24(6), 912–919.
Hashimoto, L.K., and R.R. Trussell. (1983). Evaluating Water Quality Data Near the Detection Limit. Paper presented at the Advanced Technology Conference, American Water Works Association, Las Vegas, Nevada, June 5-9, 1983.
Helsel, D.R. (1990). Less than Obvious: Statistical Treatment of Data Below the Detection Limit. Environmental Science and Technology 24(12), 1766–1774.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715–727.
Korn, L.R., and D.E. Tyler. (2001). Robust Estimation for Chemical Concentration Data Subject to Detection Limits. In Fernholz, L., S. Morgenthaler, and W. Stahel, eds. Statistics in Genetics and in the Environmental Sciences. Birkhauser Verlag, Basel, pp.41–63.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461–496.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Newman, M.C., P.M. Dixon, B.B. Looney, and J.E. Pinder. (1989). Estimating Mean and Variance for Environmental Samples with Below Detection Limit Observations. Water Resources Bulletin 25(4), 905–916.
Pettitt, A. N. (1983). Re-Weighted Least Squares Estimation with Censored and Grouped Data: An Application of the EM Algorithm. Journal of the Royal Statistical Society, Series B 47, 253–260.
Regal, R. (1982). Applying Order Statistic Censored Normal Confidence Intervals to Time Censored Data. Unpublished manuscript, University of Minnesota, Duluth, Department of Mathematical Sciences.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Saw, J.G. (1961b). The Bias of the Maximum Likelihood Estimators of Location and Scale Parameters Given a Type II Censored Normal Sample. Biometrika 48, 448–451.
Schmee, J., D.Gladstein, and W. Nelson. (1985). Confidence Limits for Parameters of a Normal Distribution from Singly Censored Samples, Using Maximum Likelihood. Technometrics 27(2) 119–128.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, New York, 273pp.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Travis, C.C., and M.L. Land. (1990). Estimating the Mean of Data Sets with Nondetectable Values. Environmental Science and Technology 24, 961–962.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
enormCensored
, Lognormal, elnorm
,
estimateCensored.object
.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and standard deviation using the MLE, # Q-Q regression (also called parametric regression on order statistics # or ROS; e.g., USEPA, 2009 and Helsel, 2012), and imputation with Q-Q # regression (also called robust ROS or rROS). # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean and standard deviation on the log-scale # using the MLE: #--------------------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # Now compare the MLE with the estimators based on # Q-Q regression (ROS) and imputation with Q-Q regression (rROS) #--------------------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored))$parameters # meanlog sdlog #2.215905 1.356291 with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored, method = "ROS"))$parameters # meanlog sdlog #2.293742 1.283635 with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored, method = "rROS"))$parameters # meanlog sdlog #2.298656 1.238104 #---------- # The method used to estimate quantiles for a Q-Q plot is # determined by the argument prob.method. For the functions # enormCensored and elnormCensored, for any estimation # method that involves Q-Q regression, the default value of # prob.method is "hirsch-stedinger" and the default value for the # plotting position constant is plot.pos.con=0.375. # Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger # probability method but set the plotting position constant to 0. with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored, method = "rROS", plot.pos.con = 0))$parameters # meanlog sdlog #2.277175 1.261431 #---------- # Using the same data as above, compute a confidence interval # for the mean on the log-scale using the profile-likelihood # method. 
with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: meanlog # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 1.595062 # UCL = 2.771197
Estimate the location and scale parameters of a logistic distribution, and optionally construct a confidence interval for the location parameter.
elogis(x, method = "mle", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
|
ci |
logical scalar indicating whether to compute a confidence interval for the
location or scale parameter. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the location or scale parameter. Currently, the only possible value is
|
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let x = (x_1, x_2, …, x_n) be a vector of n observations from a logistic distribution with
parameters location=η and scale=θ.

Estimation

Maximum Likelihood Estimation (method="mle")

The maximum likelihood estimators (mle's) of η and θ are
the solutions of the simultaneous equations (Forbes et al., 2011):

Σ_{i=1}^{n} 1/(1 + exp(z_i)) = n/2    (1)

Σ_{i=1}^{n} z_i [ (exp(z_i) − 1) / (exp(z_i) + 1) ] = n    (2)

where

z_i = (x_i − η̂_mle) / θ̂_mle    (3)

Method of Moments Estimation (method="mme")

The method of moments estimators (mme's) of η and θ are
given by:

η̂_mme = x̄    (4)

θ̂_mme = (√3 / π) s_m    (5)

where

x̄ = (1/n) Σ_{i=1}^{n} x_i    (6)

s_m² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²    (7)

that is, s_m denotes the square root of the method of moments estimator
of variance.

Method of Moments Estimators Based on the Unbiased Estimator of Variance (method="mmue")

These estimators are exactly the same as the method of moments estimators given in
equations (4-7) above, except that the method of moments estimator of variance in
equation (7) is replaced with the unbiased estimator of variance:

s² = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)²    (8)

Confidence Intervals

When ci=TRUE, an approximate (1 − α)100% confidence interval
for η can be constructed assuming the distribution of the estimator of
η is approximately normally distributed. A two-sided confidence
interval is constructed as:

[η̂ − t_{1−α/2, n−1} σ̂_{η̂},  η̂ + t_{1−α/2, n−1} σ̂_{η̂}]

where t_{p, n−1} is the p'th quantile of
Student's t-distribution with n−1 degrees of freedom, and the quantity σ̂_{η̂}
denotes the estimated asymptotic standard deviation of the estimator of η.
One-sided confidence intervals for η and θ are computed in
a similar fashion.
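A small, hedged check of the method-of-moments formulas above (not the elogis internals): the mme of the location is the sample mean, and the mme of the scale follows from Var(X) = θ²π²/3.

# Hedged check of the mme formulas, compared with elogis(..., method = "mme").
set.seed(250)
dat <- rlogis(20)
s.m <- sqrt(mean((dat - mean(dat))^2))        # square root of the mme of variance
c(location = mean(dat), scale = sqrt(3) * s.m / pi)
elogis(dat, method = "mme")$parameters        # should agree with the line above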
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The logistic distribution is defined on the real line and is unimodal and symmetric about its location parameter (the mean). It has longer tails than a normal (Gaussian) distribution. It is used to model growth curves and bioassay data.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
# Generate 20 observations from a logistic distribution with # parameters location=0 and scale=1, then estimate the parameters # and construct a 90% confidence interval for the location parameter. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlogis(20) elogis(dat, ci = TRUE, conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Logistic # #Estimated Parameter(s): location = -0.2181845 # scale = 0.8152793 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: location # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Interval: LCL = -0.7899382 # UCL = 0.3535693 #---------- # Clean up #--------- rm(dat)
Density, distribution function, quantile function, and random generation for the empirical distribution based on a set of observations
demp(x, obs, discrete = FALSE, density.arg.list = NULL) pemp(q, obs, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = 0.375) qemp(p, obs, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = 0.375) remp(n, obs)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
obs |
numeric vector of observations. Missing ( |
discrete |
logical scalar indicating whether the assumed parent distribution of |
density.arg.list |
list with arguments to the R |
prob.method |
character string indicating what method to use to compute the empirical
probabilities. Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant. The default value is |
Let x_1, x_2, …, x_n denote a random sample of n observations
from some unknown probability distribution (i.e., the elements of the argument
obs), and let x_(i) denote the i'th order statistic, that is,
the i'th smallest observation, for i = 1, 2, …, n.
Estimating Density
The function demp computes the empirical probability density function. If
the observations are assumed to come from a discrete distribution, the probability
density (mass) function is estimated by:

f̂(x) = (1/n) Σ_{i=1}^{n} I_[x_i = x]

where I_[x_i = x] is the indicator function:

I_[x_i = x] = 1 if x_i = x, and I_[x_i = x] = 0 if x_i ≠ x.

That is, the estimated probability of observing the value x is simply the
observed proportion of observations equal to x.
If the observations are assumed to come from a continuous distribution, the
function demp
calls the R function density
to compute the
estimated density based on the values specified in the argument obs
,
and then uses linear interpolation to estimate the density at the values
specified in the argument x
. See the R help file for
density
for more information on how the empirical density is
computed in the continuous case.
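The continuous-case behavior just described can be mimicked with base R: estimate the density from obs with density(), then linearly interpolate at the values of x. Assuming the same default bandwidth choice, the interpolated values should be close to what demp returns.

# Hedged illustration of the continuous case: density() plus linear interpolation.
set.seed(3)
obs <- rgamma(100, shape = 4, scale = 5)
x <- c(5, 10, 20)
d <- density(obs)
cbind(demp = demp(x, obs), interp = approx(d$x, d$y, xout = x)$y)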
Estimating Probabilities
The function pemp
computes the estimated cumulative distribution function
(cdf), also called the empirical cdf (ecdf). If the observations are assumed to
come from a discrete distribution, the value of the cdf evaluated at the i'th
order statistic is usually estimated by:

F̂[x_(i)] = p̂_i = (1/n) Σ_{j=1}^{n} I_(−∞, x_(i)](x_j)

where:

I_(−∞, t](x_j) = 1 if x_j ≤ t, and I_(−∞, t](x_j) = 0 if x_j > t

(D'Agostino, 1986a). That is, the estimated value of the cdf at the i'th
order statistic is simply the observed proportion of observations less than or
equal to the i'th order statistic. This estimator is sometimes called the
“empirical probabilities” estimator and is intuitively appealing.
The function pemp uses the above equations to compute the empirical cdf when
prob.method="emp.probs".

For any general value of x, when the observations are assumed to come from a
discrete distribution, the value of the cdf is estimated by:

F̂(x) = 0 if x < x_(1),
F̂(x) = p̂_i if x_(i) ≤ x < x_(i+1),
F̂(x) = 1 if x ≥ x_(n)

The function pemp uses the above equation when discrete=TRUE.
If the observations are assumed to come from a continuous distribution, the value
of the cdf evaluated at the i'th order statistic is usually estimated by:

F̂[x_(i)] = p̂_i = (i − a) / (n − 2a + 1)

where a denotes the plotting position constant and 0 ≤ a ≤ 1
(Cleveland, 1993, p.18; D'Agostino, 1986a, pp.8,25). The estimators defined by
the above equation are called plotting positions and are used to construct
probability plots. The function pemp uses the above equation
when prob.method="plot.pos".

For any general value of x, the value of the cdf is estimated by linear
interpolation:

F̂(x) = p̂_1 if x ≤ x_(1),
F̂(x) = (1 − r) p̂_i + r p̂_(i+1) if x_(i) ≤ x ≤ x_(i+1),
F̂(x) = p̂_n if x ≥ x_(n)

where

r = (x − x_(i)) / (x_(i+1) − x_(i))

(Chambers et al., 1983). The function pemp uses the above two equations
when discrete=FALSE.
Estimating Quantiles
The function qemp computes the estimated quantiles based on the observed
data. If the observations are assumed to come from a discrete distribution, the
p'th quantile is usually estimated by:

x̂_p = x_(1) if p ≤ p̂_1,
x̂_p = x_(i) if p̂_(i−1) < p ≤ p̂_i,
x̂_p = x_(n) if p > p̂_n

The function qemp uses the above equation when discrete=TRUE.

If the observations are assumed to come from a continuous distribution, the
p'th quantile is usually estimated by linear interpolation:

x̂_p = x_(1) if p ≤ p̂_1,
x̂_p = (1 − r) x_(i−1) + r x_(i) if p̂_(i−1) < p ≤ p̂_i,
x̂_p = x_(n) if p > p̂_n

where

r = (p − p̂_(i−1)) / (p̂_i − p̂_(i−1))

The function qemp uses the above two equations when discrete=FALSE.
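A small, hedged check of the plotting-position formula given above for the continuous case: evaluated at the order statistics, pemp() should return (i − a)/(n − 2a + 1) with the default plotting position constant a = 0.375.

# Compare the plotting positions with pemp() evaluated at the sorted data.
set.seed(3)
obs <- rnorm(10)
n <- length(obs)
a <- 0.375
cbind(plot.pos = ((1:n) - a) / (n - 2 * a + 1),
      pemp     = pemp(sort(obs), obs, plot.pos.con = a))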
Generating Random Numbers From the Empirical Distribution
The function remp
simply calls the R function sample
to
sample the elements of obs
with replacement.
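Since remp is documented above as a call to sample() with replacement, it behaves like the base R idiom shown here (element-for-element agreement between the two calls is not guaranteed and is not the point):

set.seed(1)
obs <- rgamma(20, shape = 4, scale = 5)
remp(5, obs)
sample(obs, 5, replace = TRUE)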
density (demp
), probability (pemp
), quantile (qemp
), or
random sample (remp
) for the empirical distribution based on the data
contained in the vector obs
.
The function demp lets you perform nonparametric density estimation.
The function pemp
computes the value of the empirical cumulative
distribution function (ecdf) for user-specified quantiles. The ecdf is a
nonparametric estimate of the true cdf (see ecdfPlot
). The
function qemp
computes nonparametric estimates of quantiles
(see the help files for eqnpar
and quantile
).
The function remp lets you sample a set of observations with replacement,
which is often done while bootstrapping or performing some other kind of
Monte Carlo simulation.
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11–16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7–62.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley and Sons, New York.
Sheather, S. J. and Jones M. C. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society B, 683–690.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Wegman, E.J. (1972). Nonparametric Probability Density Estimation. Technometrics 14, 533-546.
density
, approx
, epdfPlot
,
ecdfPlot
, cdfCompare
, qqplot
,
eqnpar
, quantile
, sample
, simulateVector
, simulateMvMatrix
.
# Create a set of 100 observations from a gamma distribution with # parameters shape=4 and scale=5. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(3) obs <- rgamma(100, shape=4, scale=5) # Now plot the empirical distribution (with a histogram) and the true distribution: dev.new() hist(obs, col = "cyan", xlim = c(0, 65), freq = FALSE, ylab = "Relative Frequency") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) box() # Now plot the empirical distribution (based on demp) with the # true distribution: x <- qemp(p = seq(0, 1, len = 100), obs = obs) y <- demp(x, obs) dev.new() plot(x, y, xlim = c(0, 65), type = "n", xlab = "Value of Random Variable", ylab = "Relative Frequency") lines(x, y, lwd = 2, col = "cyan") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) # Alternatively, you can create the above plot with the function # epdfPlot: dev.new() epdfPlot(obs, xlim = c(0, 65), epdf.col = "cyan", xlab = "Value of Random Variable", main = "Empirical and Theoretical PDFs") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) # Clean Up #--------- rm(obs, x, y)
Estimate the probability parameter of a negative binomial distribution.
enbinom(x, size, method = "mle/mme")
x |
vector of non-negative integers indicating the number of trials that took place
before |
size |
vector of positive integers indicating the number of “successes” that
must be observed before the trials are stopped. Missing ( |
method |
character string specifying the method of estimation. Possible values are: |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let x = (x_1, x_2, …, x_n) be a vector of n
independent observations from negative binomial distributions
with parameters prob=p and size=k, where
k = (k_1, k_2, …, k_n) is a vector of n
(possibly different) values.

It can be shown (e.g., Forbes et al., 2011) that if X is defined as:

X = Σ_{i=1}^{n} x_i

then X is an observation from a negative binomial distribution with
parameters prob=p and size=K, where

K = Σ_{i=1}^{n} k_i

Estimation

The maximum likelihood and method of moments estimator (mle/mme) of p
is given by:

p̂_mle = K / (X + K)

and the minimum variance unbiased estimator (mvue) of p is given by:

p̂_mvue = (K − 1) / (X + K − 1)

(Forbes et al., 2011). Note that the mvue of p is not defined for K = 1.
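A quick check of the formulas above, using the same data as the second example below: with X = sum(x) and K = sum(size), the mle/mme of prob is K/(X + K) and the mvue is (K − 1)/(X + K − 1).

# Compare the closed-form estimates with the values returned by enbinom().
set.seed(250)
size.vec <- 2:4
dat <- rnbinom(3, size = size.vec, prob = 0.2)
X <- sum(dat)
K <- sum(size.vec)
c(mle.mme = K / (X + K), mvue = (K - 1) / (X + K - 1))
enbinom(dat, size = size.vec)$parameters["prob"]                    # mle/mme
enbinom(dat, size = size.vec, method = "mvue")$parameters["prob"]   # mvue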
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The negative binomial distribution has its roots in a gambling game where participants would bet on the number of tosses of a coin necessary to achieve a fixed number of heads. The negative binomial distribution has been applied in a wide variety of fields, including accident statistics, birth-and-death processes, and modeling spatial distributions of biological organisms.
The geometric distribution with parameter prob=$p$ is a special case of the negative binomial distribution with parameters size=1 and prob=$p$.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 5.
NegBinomial, egeom, Geometric.
# Generate an observation from a negative binomial distribution with # parameters size=2 and prob=0.2, then estimate the parameter prob. # Note: the call to set.seed simply allows you to reproduce this example. # Also, the only parameter that is estimated is prob; the parameter # size is supplied in the call to enbinom. The parameter size is printed in # order to show all of the parameters associated with the distribution. set.seed(250) dat <- rnbinom(1, size = 2, prob = 0.2) dat #[1] 5 enbinom(dat, size = 2) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Negative Binomial # #Estimated Parameter(s): size = 2.0000000 # prob = 0.2857143 # #Estimation Method: mle/mme for 'prob' # #Data: dat, 2 # #Sample Size: 1 #---------- # Generate 3 observations from negative binomial distributions with # parameters size=c(2,3,4) and prob=0.2, then estimate the parameter # prob using the mvue. # (Note: the call to set.seed simply allows you to reproduce this example.) size.vec <- 2:4 set.seed(250) dat <- rnbinom(3, size = size.vec, prob = 0.2) dat #[1] 5 19 12 enbinom(dat, size = size.vec, method = "mvue") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Negative Binomial # #Estimated Parameter(s): size = 9.0000000 # prob = 0.1818182 # #Estimation Method: mvue for 'prob' # #Data: dat, size.vec # #Sample Size: 3 #---------- # Clean up #--------- rm(dat)
Estimate the mean and standard deviation parameters of a normal (Gaussian) distribution, and optionally construct a confidence interval for the mean or the variance.
enorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "exact", conf.level = 0.95, ci.param = "mean")
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
"mvue" (minimum variance unbiased; the default) and "mle/mme" (maximum likelihood/method of moments). |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is ci=FALSE. |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are "two-sided" (the default), "lower", and "upper". |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean or variance. The only possible value is "exact" (the default). |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is conf.level=0.95. |
ci.param |
character string indicating which parameter to create a confidence interval for.
The possible values are ci.param="mean" (the default) and ci.param="variance". |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from a normal (Gaussian) distribution with parameters mean=$\mu$ and sd=$\sigma$.

Estimation

Minimum Variance Unbiased Estimation (method="mvue")

The minimum variance unbiased estimators (mvue's) of the mean and variance are:

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \;\;\;\; (1)$$

$$\hat{\sigma}^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \;\;\;\; (2)$$

(Johnson et al., 1994; Forbes et al., 2011). Note that when method="mvue", the estimated standard deviation is the square root of the mvue of the variance, but is not itself an mvue.

Maximum Likelihood/Method of Moments Estimation (method="mle/mme")

The maximum likelihood estimator (mle) and method of moments estimator (mme) of the mean are both the same as the mvue of the mean given in equation (1) above. The mle and mme of the variance is given by:

$$\hat{\sigma}^2_{mle} = s^2_m = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \;\;\;\; (3)$$

When method="mle/mme", the estimated standard deviation is the square root of the mle of the variance, and is itself an mle.

Confidence Intervals

Confidence Interval for the Mean (ci.param="mean")

When ci=TRUE and ci.param="mean", the usual confidence interval for $\mu$ is constructed as follows. If ci.type="two-sided", a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is given by:

$$[\hat{\mu} - t_{n-1, 1-\alpha/2} \frac{\hat{\sigma}}{\sqrt{n}}, \; \hat{\mu} + t_{n-1, 1-\alpha/2} \frac{\hat{\sigma}}{\sqrt{n}}] \;\;\;\; (4)$$

where $t_{\nu, p}$ is the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom (Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992).

If ci.type="lower", the $(1-\alpha)100\%$ confidence interval for $\mu$ is given by:

$$[\hat{\mu} - t_{n-1, 1-\alpha} \frac{\hat{\sigma}}{\sqrt{n}}, \; \infty] \;\;\;\; (5)$$

and if ci.type="upper", the confidence interval is given by:

$$[-\infty, \; \hat{\mu} + t_{n-1, 1-\alpha} \frac{\hat{\sigma}}{\sqrt{n}}] \;\;\;\; (6)$$

Confidence Interval for the Variance (ci.param="variance")

When ci=TRUE and ci.param="variance", the usual confidence interval for $\sigma^2$ is constructed as follows. A two-sided $(1-\alpha)100\%$ confidence interval for $\sigma^2$ is given by:

$$[\frac{(n-1)s^2}{\chi^2_{n-1, 1-\alpha/2}}, \; \frac{(n-1)s^2}{\chi^2_{n-1, \alpha/2}}] \;\;\;\; (7)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the chi-squared distribution with $\nu$ degrees of freedom. Similarly, a one-sided upper $(1-\alpha)100\%$ confidence interval for the population variance is given by:

$$[0, \; \frac{(n-1)s^2}{\chi^2_{n-1, \alpha}}] \;\;\;\; (8)$$

and a one-sided lower $(1-\alpha)100\%$ confidence interval for the population variance is given by:

$$[\frac{(n-1)s^2}{\chi^2_{n-1, 1-\alpha}}, \; \infty] \;\;\;\; (9)$$

(van Belle et al., 2004; Zar, 2010).
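As a quick check of equation (4), the following minimal sketch (reusing the simulated data from the first example below, so the particular numbers are only illustrative) computes the exact two-sided 95% confidence interval for the mean by hand; it should agree with the interval reported by enorm(dat, ci = TRUE):

set.seed(250)
dat <- rnorm(20, mean = 3, sd = 2)
n <- length(dat)
xbar <- mean(dat)                                  # mvue of the mean
s <- sd(dat)                                       # square root of the mvue of the variance
half.width <- qt(0.975, df = n - 1) * s / sqrt(n)
c(LCL = xbar - half.width, UCL = xbar + half.width)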
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The normal and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean or variance. This is done with confidence intervals.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# Generate 20 observations from a N(3, 2) distribution, then estimate # the parameters and create a 95% confidence interval for the mean. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) enorm(dat, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 2.308798 # UCL = 3.413523 #---------- # Using the same data, construct an upper 90% confidence interval for # the variance. enorm(dat, ci = TRUE, ci.type = "upper", ci.param = "variance")$interval #Confidence Interval for: variance # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.000000 # UCL = 2.615963 #---------- # Clean up #--------- rm(dat) #---------- # Using the Reference area TcCB data in the data frame EPA.94b.tccb.df, # estimate the mean and standard deviation of the log-transformed data, # and construct a 95% confidence interval for the mean. with(EPA.94b.tccb.df, enorm(log(TcCB[Area == "Reference"]), ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = -0.6195712 # sd = 0.4679530 # #Estimation Method: mvue # #Data: log(TcCB[Area == "Reference"]) # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.7569673 # UCL = -0.4821751
Estimate the mean and standard deviation of a normal (Gaussian) distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
enormCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", nmc = 1000, seed = NULL, ...)
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
censored |
numeric or logical vector indicating which values of x are censored. This must be the same length as x. |
method |
character string specifying the method of estimation. For singly censored data, the possible values are "mle" (the default), "bcmle", "ROS" (or its synonym "qq.reg"), "rROS" (or its synonym "impute.w.qq.reg"), "qq.reg.w.cen.level", "impute.w.qq.reg.w.cen.level", "impute.w.mle", "iterative.impute.w.qq.reg", "m.est", and "half.cen.level". For multiply censored data, the possible values are "mle" (the default), "ROS" (or "qq.reg"), "rROS" (or "impute.w.qq.reg"), and "half.cen.level". See the DETAILS section for more information. |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are "left" (the default) and "right". |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is ci=FALSE. |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are "profile.likelihood" (the default), "normal.approx", "normal.approx.w.cov" (singly censored data only), "bootstrap", and "gpq". See the DETAILS section for more information. This argument is ignored if ci=FALSE. |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are "two-sided" (the default), "lower", and "upper". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is conf.level=0.95. |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when ci.method="bootstrap". The default value is n.bootstraps=1000. |
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when ci.method="normal.approx" or ci.method="normal.approx.w.cov". The possible values are "z" (the default) and "t". |
nmc |
numeric scalar indicating the number of Monte Carlo simulations to run when
ci.method="gpq". The default value is nmc=1000. |
seed |
integer supplied to the function set.seed and used when ci.method="bootstrap" or ci.method="gpq". The default value is seed=NULL. |
... |
additional arguments to pass to other functions (e.g., prob.method, plot.pos.con, lb.impute, ub.impute, ci.sample.size, tol, convergence, t.df; see the DETAILS section).
|
If x or censored contain any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a vector of $N$ observations from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c = N - n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \;\; k \ge 1 \;\;\;\; (1)$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^{k} c_j = c \;\;\;\; (2)$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.

Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.

Finally, let $\Omega$ (omega) denote the set of $n$ subscripts in the “ordered” sample that correspond to uncensored observations.
ESTIMATION
Estimation Methods for Multiply and Singly Censored Data
The following methods are available for multiply and singly censored data.
Maximum Likelihood Estimation (method="mle"
)
For Type I left censored data, the likelihood function is given by:

$$L(\mu, \sigma | \underline{x}) = \frac{N!}{c_1! c_2! \cdots c_k! \, n!} \prod_{j=1}^{k} [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (3)$$

where $f$ and $F$ denote the probability density function (pdf) and cumulative distribution function (cdf) of the population. That is,

$$f(t) = \frac{1}{\sigma} \phi(\frac{t - \mu}{\sigma}) \;\;\;\; (4)$$

$$F(t) = \Phi(\frac{t - \mu}{\sigma}) \;\;\;\; (5)$$

where $\phi$ and $\Phi$ denote the pdf and cdf of the standard normal distribution, respectively (Cohen, 1963; 1991, pp.6, 50). For left singly censored data, Equation (3) simplifies to:

$$L(\mu, \sigma | \underline{x}) = {N \choose c} [F(T)]^{c} \prod_{i = c+1}^{N} f[x_{(i)}] \;\;\;\; (6)$$

Similarly, for Type I right censored data, the likelihood function is given by:

$$L(\mu, \sigma | \underline{x}) = \frac{N!}{c_1! c_2! \cdots c_k! \, n!} \prod_{j=1}^{k} [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (7)$$

and for right singly censored data this simplifies to:

$$L(\mu, \sigma | \underline{x}) = {N \choose c} [1 - F(T)]^{c} \prod_{i = 1}^{n} f[x_{(i)}] \;\;\;\; (8)$$

The maximum likelihood estimators are computed by maximizing the likelihood function. For right-censored data, Cohen (1963; 1991, pp.50-51) shows that taking partial derivatives of the log-likelihood function with respect to $\mu$ and $\sigma$ and setting these to 0 produces the following two simultaneous equations:

$$\bar{x} - \hat{\mu} = -\hat{\sigma} \, h \, Q(\hat{\xi}) \;\;\;\; (9)$$

$$s^2 + (\bar{x} - \hat{\mu})^2 = \hat{\sigma}^2 [1 - h \, \hat{\xi} \, Q(\hat{\xi})] \;\;\;\; (10)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i \in \Omega} x_{(i)} \;\;\;\; (11)$$

$$s^2 = \frac{1}{n} \sum_{i \in \Omega} (x_{(i)} - \bar{x})^2 \;\;\;\; (12)$$

$$\hat{\xi} = \frac{T - \hat{\mu}}{\hat{\sigma}} \;\;\;\; (13)$$

$$h = \frac{c}{n} \;\;\;\; (14)$$

$$Q(t) = \frac{\phi(t)}{1 - \Phi(t)} \;\;\;\; (15)$$

Note that the quantity defined in Equation (11) is simply the mean of the uncensored observations, the quantity defined in Equation (12) is simply the method of moments estimator of variance based on the uncensored observations, and the function $Q$ defined in Equation (15) is the hazard function for the standard normal distribution.

For left-censored data, Equations (9) and (10) stay the same, except $Q(\hat{\xi})$ is replaced with $-Q(-\hat{\xi})$.
The function enormCensored
computes the maximum likelihood estimators by
solving Equations (9) and (10) and uses the quantile-quantile regression
estimators (see below) as initial values.
Regression on Order Statistics (method="ROS"
) or
Quantile-Quantile Regression (method="qq.reg"
)
This method is sometimes called the probability plot method
(Nelson, 1982, Chapter 3; Gilbert, 1987, pp.134-136;
Helsel and Hirsch, 1992, p. 361), and more recently also called
parametric regression on order statistics or ROS
(USEPA, 2009; Helsel, 2012). In the case of no censoring, it is well known
(e.g., Nelson, 1982, p.113; Cleveland, 1993, p.31) that for the standard
normal (Gaussian) quantile-quantile plot (i.e., the plot of the sorted observations
(empirical quantiles) versus standard normal quantiles; see qqPlot
),
the intercept and slope of the fitted least-squares line estimate the mean and
standard deviation, respectively. Specifically, the estimates of $\mu$ and $\sigma$ are found by computing the least-squares estimates in the following model:

$$x_{(i)} = \mu + \sigma \Phi^{-1}(p_i) + \epsilon_i, \;\; i = 1, 2, \ldots, N \;\;\;\; (16)$$

where

$$p_i = \frac{i - a}{N - 2a + 1} \;\;\;\; (17)$$

denotes the plotting position associated with the $i$'th largest value, $a$ is a constant such that $0 \le a \le 1$ (the plotting position constant), and $\Phi$ denotes the cumulative distribution function (cdf) of the standard normal distribution. The default value of $a$ is 0.375 (see below).

This method can be adapted to the case of left (right) singly censored data as follows. Plot the $n$ uncensored observations against the $n$ largest (smallest) normal quantiles, where the normal quantiles are computed based on a sample size of $N$, fit the least-squares line to this plot, and estimate the mean and standard deviation from the intercept and slope, respectively. That is, use Equations (16) and (17), but for right singly censored data use $i = 1, 2, \ldots, n$, and for left singly censored data use $i = c+1, c+2, \ldots, N$.

The argument plot.pos.con (see the entry for ... in the ARGUMENTS section above) determines the value of the plotting position constant $a$ used in Equation (17). The default value is plot.pos.con=0.375.
See the help file for qqPlot
for more information.
This method is discussed by Haas and Scheff (1990). In the context of lognormal data, Travis and Land (1990) suggest exponentiating the predicted 50'th percentile from this fit to estimate the geometric mean (i.e., the median of the lognormal distribution).
This method is easily extended to multiply censored data. Equation (16) becomes

$$x_{(i)} = \mu + \sigma \Phi^{-1}(p_i) + \epsilon_i, \;\; i \in \Omega \;\;\;\; (18)$$

where $\Omega$ denotes the set of $n$ subscripts associated with the uncensored observations in the ordered sample. The plotting positions are computed by calling the EnvStats function ppointsCensored.
The argument prob.method
determines the method of computing the plotting
positions (default is prob.method="hirsch-stedinger"
), and the argument
plot.pos.con
determines the plotting position constant (default is
plot.pos.con=0.375
). (See the entry for ... in the ARGUMENTS section above.)
Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger probability
method but set the plotting position constant to 0.
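To illustrate the idea behind the Q-Q regression (ROS) estimators just described, here is a minimal sketch using complete, uncensored simulated data (the sample and parameter values are assumptions chosen only for illustration): the intercept and slope of the least-squares line through a normal Q-Q plot recover the mean and standard deviation.

set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)
p <- ppoints(length(x), a = 0.375)   # plotting positions with constant a = 0.375
fit <- lm(sort(x) ~ qnorm(p))        # regress ordered observations on standard normal quantiles
coef(fit)                            # intercept estimates the mean, slope estimates the sd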
Robust Regression on Order Statistics (method="rROS"
) or
Imputation Using Quantile-Quantile Regression (method="impute.w.qq.reg"
)
This is the robust Regression on Order Statistics (rROS) method discussed in
USEPA (2009) and Helsel (2012). It involves using the quantile-quantile regression
method (method="qq.reg"
or method="ROS"
) to fit a regression line
(and thus initially estimate the mean and standard deviation), and then imputing the
values of the censored observations by predicting them from the regression equation.
The final estimates of the mean and standard deviation are then computed using
the usual formulas (see enorm
) based on the observed and imputed
values.
The imputed values are computed as:

$$\hat{x}_{(i)} = \hat{\mu}_{QQ} + \hat{\sigma}_{QQ} \Phi^{-1}(p_i), \;\; i \notin \Omega \;\;\;\; (19)$$

where $\hat{\mu}_{QQ}$ and $\hat{\sigma}_{QQ}$ denote the estimates of the mean and standard deviation based on the Q-Q regression fit.
See the help file for ppointsCensored
for information on how the
plotting positions for the censored observations are computed.
The argument prob.method
determines the method of computing the plotting
positions (default is prob.method="hirsch-stedinger"
), and the argument
plot.pos.con
determines the plotting position constant (default is
plot.pos.con=0.375
). (See the entry for ... in the ARGUMENTS section above.)
Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger probability
method but set the plotting position constant to 0.
The arguments lb.impute
and ub.impute
determine the lower and upper
bounds for the imputed values. Imputed values smaller than lb.impute
are
set to this value. Imputed values larger than ub.impute
are set to this
value. The default values are lb.impute=-Inf
and ub.impute=Inf
.
See the entry for ... in the ARGUMENTS section above.
For singly censored data, this is the NR method of Gilliom and Helsel (1986, p. 137). In the context of lognormal data, this method is discussed by Hashimoto and Trussell (1983), Gilliom and Helsel (1986), and El-Shaarawi (1989), and is referred to as the LR or Log-Probability Method.
For multiply censored data, this method was developed in the context of
lognormal data by Helsel and Cohn (1988) using the formulas for plotting
positions given in Hirsch and Stedinger (1987) and Weibull plotting positions
(i.e., prob.method="hirsch-stedinger"
and plot.pos.con=0
).
Setting Censored Observations to Half the Censoring Level (method="half.cen.level"
)
This method is applicable only to left censored data that is bounded below by 0.
This method involves simply replacing all the censored observations with half their
detection limit, and then computing the mean and standard deviation with the usual
formulas (see enorm
).
This method is included only to allow comparison of this method to other methods. Setting left-censored observations to half the censoring level is not recommended.
For singly censored data, this method is discussed by Gleit (1985), Haas and Scheff (1990), and El-Shaarawi and Esterby (1992). El-Shaarawi and Esterby (1992) show that these estimators are biased and inconsistent (i.e., the bias remains even as the sample size increases).
For multiply censored data, this method was studied by Helsel and Cohn
(1988).
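As a concrete illustration of the substitution approach just described (the observations and detection limit below are made up purely for illustration), the computation amounts to nothing more than:

x <- c(2.3, 4.1, 5.0, 5.0, 7.8, 9.2)            # 5.0 is the hypothetical detection limit
censored <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
x.sub <- ifelse(censored, x / 2, x)             # replace censored values with half the censoring level
c(mean = mean(x.sub), sd = sd(x.sub))           # usual formulas applied to the substituted data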
Estimation Methods for Singly Censored Data
The following methods are available only for singly censored data.
Bias-Corrected Maximum Likelihood Estimation (method="bcmle"
)
The maximum likelihood estimates of $\mu$ and $\sigma$ are biased. The bias tends to 0 as the sample size increases, but it can be considerable for small sample sizes, especially in the case of a large percentage of censored observations (Saw, 1961b). Schmee et al. (1985) note that bias and variances of the mle's are of the order $1/N$ (see for example, Bain and Engelhardt, 1991), and that for 90% censoring the bias is negligible if $N$ is at least 100. (For less intense censoring, even fewer observations are needed.)

The exact bias of each estimator is extremely difficult to compute. Saw (1961b), however, derived the first-order term (i.e., the term of order $1/N$) in the bias of the mle's of $\mu$ and $\sigma$ and
proposed bias-corrected mle's. His bias-corrected estimators were derived for
the case of Type II singly censored data. Schneider (1986, p.110) and
Haas and Scheff (1990), however, state that this bias correction should
reduce the bias of the estimators in the case of Type I censoring as well.
Based on the tables of bias-correction terms given in Saw (1961b), Schneider (1986, pp.107-110) performed a least-squares fit to produce the following computational formulas for right-censored data:
For left-censored data, Equation (22) becomes:
Quantile-Quantile Regression Including the Censoring Level (method="qq.reg.w.cen.level"
)
This is a modification of the quantile-quantile regression method and was proposed
by El-Shaarawi (1989) in the context of lognormal data. El-Shaarawi's idea is to
include the censoring level and an associated plotting position, along with the
uncensored observations and their associated plotting positions, in order to include
information about the value of the censoring level .
For left singly censored data, the modification involves adding the point
to the plot before fitting the least-squares line.
For right singly censored data, the point
is added to the plot before fitting the least-squares line.
El-Shaarawi (1989) also proposed replacing the estimated normal quantiles with the
exact expected values of normal order statistics, and using the values in their
variance-covariance matrix to perform a weighted least least-squared regression.
These last two modifications are not incorporated here.
Imputation Using Quantile-Quantile Regression Including the Censoring Level
(method ="impute.w.qq.reg.w.cen.level"
)
This is exactly the same method as imputation using quantile-quantile regression
(method="impute.w.qq.reg"
), except that the quantile-quantile regression
including the censoring level method (method="qq.reg.w.cen.level"
) is used
to fit the regression line. In the context of lognormal data, this method is
discussed by El-Shaarawi (1989), which he denotes as the Modified LR Method.
Imputation Using Maximum Likelihood (method ="impute.w.mle"
)
This is exactly the same method as imputation with quantile-quantile regression
(method="impute.w.qq.reg"
), except that the maximum likelihood method
(method="mle"
) is used to compute the initial estimates of the mean and
standard deviation. In the context of lognormal data, this method is discussed
by El-Shaarawi (1989), which he denotes as the Modified Maximum Likelihood Method.
Iterative Imputation Using Quantile-Quantile Regression (method="iterative.impute.w.qq.reg"
)
This method is similar to the imputation with quantile-quantile regression method
(method="impute.w.qq.reg"
), but iterates until the estimates of the mean
and standard deviation converge. The algorithm is:
1. Compute the initial estimates of $\mu$ and $\sigma$ using the "impute.w.qq.reg" method. (Actually, any suitable estimates will do.)

2. Using the current values of $\mu$ and $\sigma$ and Equation (19), compute new imputed values of the censored observations.

3. Use the new imputed values along with the uncensored observations to compute new estimates of $\mu$ and $\sigma$ based on the usual formulas (see enorm).

4. Repeat Steps 2 and 3 until the estimates converge (the convergence criterion is determined by the arguments tol and convergence; see the entry for ... in the ARGUMENTS section above).
This method is discussed by Gleit (1985), which he denotes as
“Fill-In with Expected Values”.
M-Estimators (method="m.est"
)
This method was contributed by Leo R. Korn (Korn and Tyler, 2001).
This method finds location and scale estimates that are consistent at the
normal model and robust to deviations from the normal model, including both
outliers on the right and outliers on the left above and below the limit of
detection. The estimates are found by solving the simultaneous equations:
where
and and
denote the probability density function
(pdf) and cumulative distribution function (cdf) of
Student's t-distribution with
degrees of freedom.
This results in an M-estimating equation based on the t-density function (Korn and Tyler., 2001). Since the t-density has heavier tails than the normal density, this M-estimator will tend to down-weight values that are far away from the center of the data. When censoring is present, neither the location nor the scale estimates are consistent at the normal model. A computational correction is performed that converts the above M-estimator to another M-estimator that is consistent at the normal model, even under censoring.
The degrees of freedom parameter is set by the argument
t.df
and
may be viewed as a tuning parameter that will determine the robustness and
efficiency properties. When t.df
is large, the estimator is similar to
the usual mle and the output will then be very close to that when method="mle"
.
As t.df
decreases, the efficiency will decline and the outlier rejection
property will increase in strength. Choosing t.df=3
(the default) provides
a good combination of efficiency and robustness. A reasonable strategy is to
transform the data so that they are approximately symmetric (often the log
transformation for environmental data is appropriate) and then apply the
M-estimator using t.df=3
.
CONFIDENCE INTERVALS
This section explains how confidence intervals for the mean are
computed.
Likelihood Profile (ci.method="profile.likelihood"
)
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean while treating the standard deviation
as a nuisance parameter. Equation (3) above
shows the form of the likelihood function
for
multiply left-censored data, and Equation (7) shows the function for
multiply right-censored data.
Following Stryhn and Christensen (2003), denote the maximum likelihood estimates of the mean and standard deviation by $(\mu^*, \sigma^*)$. The likelihood ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$ (where $\mu_0$ is a fixed value) equals the drop in $2 \log(L)$ between the “full” model and the reduced model with $\mu$ fixed at $\mu_0$, i.e.,

$$G^2 = 2 \{\log[L(\mu^*, \sigma^*)] - \log[L(\mu_0, \sigma_0^*)]\} \;\;\;\; (30)$$

where $\sigma_0^*$ is the maximum likelihood estimate of $\sigma$ for the reduced model (i.e., when $\mu = \mu_0$). Under the null hypothesis, the test statistic $G^2$ follows a chi-squared distribution with 1 degree of freedom.

Alternatively, we may express the test statistic in terms of the profile likelihood function $L_1$ for the mean $\mu$, which is obtained from the usual likelihood function by maximizing over the parameter $\sigma$, i.e.,

$$L_1(\mu) = \max_{\sigma} L(\mu, \sigma) \;\;\;\; (31)$$

Then we have

$$G^2 = 2 \{\log[L_1(\mu^*)] - \log[L_1(\mu_0)]\} \;\;\;\; (32)$$

A two-sided $(1-\alpha)100\%$ confidence interval for the mean $\mu$ consists of all values of $\mu_0$ for which the test is not significant at level $\alpha$:

$$\mu_0: \; G^2 \le \chi^2_{1, 1-\alpha} \;\;\;\; (33)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the chi-squared distribution with $\nu$ degrees of freedom. One-sided lower and one-sided upper confidence intervals are computed in a similar fashion, except that the quantity $1-\alpha$ in Equation (33) is replaced with $1-2\alpha$.
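To make the mechanics concrete, here is a minimal, self-contained sketch of inverting the likelihood-ratio test for a left-singly-censored normal sample. The simulated data, detection limit, and search grid are assumptions chosen only for illustration, and this is not EnvStats' own implementation (enormCensored with ci.method="profile.likelihood" handles this internally):

set.seed(42)
x <- rnorm(30, mean = 5, sd = 2)
DL <- 4                                   # hypothetical detection limit
cens <- x < DL
x[cens] <- DL                             # left-censor at DL
loglik <- function(mu, sigma) {
  # censored-normal log-likelihood: density for uncensored values, cdf for censored values
  sum(dnorm(x[!cens], mu, sigma, log = TRUE)) +
    sum(cens) * pnorm(DL, mu, sigma, log.p = TRUE)
}
prof <- function(mu) {
  # profile log-likelihood: maximize over sigma for a fixed value of the mean
  optimize(function(s) loglik(mu, s), interval = c(0.01, 20), maximum = TRUE)$objective
}
fit <- optim(c(mean(x), log(sd(x))), function(p) -loglik(p[1], exp(p[2])))
logL.max <- -fit$value                    # maximized log-likelihood of the full model
mu.grid <- seq(3, 7, by = 0.01)
keep <- 2 * (logL.max - sapply(mu.grid, prof)) <= qchisq(0.95, 1)
range(mu.grid[keep])                      # approximate 95% profile-likelihood CI for the mean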
Normal Approximation (ci.method="normal.approx"
)
This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$ is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t_{m-1, 1-\alpha/2} \, \hat{\sigma}_{\hat{\mu}}, \; \hat{\mu} + t_{m-1, 1-\alpha/2} \, \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $m$ denotes the assumed sample size for the confidence interval, and $t_{\nu, p}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

The argument ci.sample.size determines the value of $m$ (see the entry for ... in the ARGUMENTS section above). When method equals "mle" or "bcmle", the default value is the expected number of uncensored observations, otherwise it is the observed number of uncensored observations. This is simply an ad-hoc method of constructing confidence intervals and is not based on any published theoretical results.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.
Approximate Confidence Interval Based on Maximum Likelihood Estimators
When method="mle", the standard deviation of the mle of $\mu$ is estimated based on the inverse of the Fisher Information matrix. The estimated variance-covariance matrix for the estimates of $\mu$ and $\sigma$ is based on the observed information matrix, formulas for which are given in Cohen (1991).
Approximate Confidence Interval Based on Bias-Corrected Maximum Likelihood Estimators
When method="bcmle"
(available only for singly censored data),
the same procedures are used to construct the
confidence interval as for method="mle"
. The true variance of the bias-corrected mle of $\mu$ is necessarily larger than the variance of the mle of $\mu$ (although the difference in the variances goes to 0 as the sample size gets large). Hence this method of constructing a confidence interval leads to intervals that are too short for small sample sizes, but these intervals should be better centered about the true value of $\mu$.
Approximate Confidence Interval Based on Other Estimators
When method is some value other than "mle"
, the standard deviation of the estimated mean is approximated by

$$\hat{\sigma}_{\hat{\mu}} = \frac{\hat{\sigma}}{\sqrt{m}}$$

where, as already noted, $m$ denotes the assumed sample size. This is simply an ad-hoc method of constructing confidence intervals and is not based on any published theoretical results.
Normal Approximation Using Covariance (ci.method="normal.approx.w.cov"
)
This method is only available for singly censored data and only applicable when
method="mle"
or method="bcmle"
. It was proposed by Schneider
(1986, pp. 191-193) for the case of Type II censoring, but is applicable to any
situation where the estimated mean and standard deviation are consistent estimators
and are correlated. In particular, the mle's of and
are
correlated under Type I censoring as well.
Schneider's idea is to determine two positive quantities such that
so that
is a confidence interval for
.
For cases where the estimators of and
are independent
(e.g., complete samples), it is well known that setting
yields an exact confidence interval and setting
where denotes the
'th quantile of the standard normal distribution
yields an approximate confidence interval that is asymptotically correct.
For the general case, Schneider (1986) considers the random variable
and provides formulas for and
.
Note that the resulting confidence interval for the mean is not symmetric about
the estimated mean. Also note that the quantity is a random variable for
Type I censoring, while Schneider (1986) assumed it to be fixed since he derived
the result for Type II censoring (in which case
).
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate confidence interval
for the population mean
, the bootstrap can be broken down into the
following steps:
1. Create a bootstrap sample by taking a random sample of size $N$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample (e.g., the number of censored observations in the bootstrap sample will often differ from the number of censored observations in the original sample).

2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$.

3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function enormCensored, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.

4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of this estimator of $\mu$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.
The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$[\hat{G}^{-1}(\frac{\alpha}{2}), \; \hat{G}^{-1}(1 - \frac{\alpha}{2})]$$

where $\hat{G}(t)$ denotes the empirical cdf of the bootstrap estimates of $\mu$ evaluated at $t$, and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha), \; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[-\infty, \; \hat{G}^{-1}(1 - \alpha)]$$

The function enormCensored calls the R function quantile to compute the empirical quantiles used in these intervals.
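For intuition, here is a minimal sketch of the percentile interval for the mean using complete (uncensored) simulated data; the sample, distribution, and number of bootstraps are assumptions chosen purely for illustration:

set.seed(123)
x <- rlnorm(25, meanlog = 1, sdlog = 0.5)                # hypothetical skewed sample
boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))
quantile(boot.means, probs = c(0.025, 0.975))            # two-sided 95% percentile interval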
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{N}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/N$ instead of $k/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha_1), \; \hat{G}^{-1}(\alpha_2)]$$

where

$$\alpha_1 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}]$$

$$\alpha_2 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}]$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^{N} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6 [\sum_{i=1}^{N} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2]^{3/2}}$$

where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$[\hat{G}^{-1}(\alpha_1), \; \infty]$$

and a one-sided upper confidence interval is given by:

$$[-\infty, \; \hat{G}^{-1}(\alpha_2)]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$.

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.
When ci.method="bootstrap"
, the function enormCensored
computes both
the percentile method and bias-corrected and accelerated method
bootstrap confidence intervals.
This method of constructing confidence intervals for censored data was studied by
Shumway et al. (1989).
Generalized Pivotal Quantity (ci.method="gpq"
)
This method was introduced by Schmee et al. (1985) and is discussed by
Krishnamoorthy and Mathew (2009). The idea is essentially to use a parametric
bootstrap to estimate the correct pivotal quantities in Equation (38) above. For singly censored data, these quantities are computed as follows:

1. Generate a random sample of $N$ observations from a standard normal (i.e., N(0,1)) distribution and let $z_{(1)}, z_{(2)}, \ldots, z_{(N)}$ denote the ordered (sorted) observations.

2. Set the smallest $c$ observations to be censored.

3. Compute the estimates of $\mu$ and $\sigma$ using the method specified by the method argument, and denote these estimates as $\hat{\mu}^*$ and $\hat{\sigma}^*$.

4. Compute the t-like pivotal quantity $\hat{t} = \hat{\mu}^* / \hat{\sigma}^*$.

5. Repeat steps 1-4 nmc times to produce an empirical distribution of the t-like pivotal quantity.
The function enormCensored
calls the function
gpqCiNormSinglyCensored
to generate the distribution of
pivotal quantities in the case of singly censored data.
A two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is then computed as:

$$[\hat{\mu} - \hat{t}_{1-\alpha/2} \, \hat{\sigma}, \; \hat{\mu} - \hat{t}_{\alpha/2} \, \hat{\sigma}]$$

where $\hat{t}_p$ denotes the $p$'th empirical quantile of the nmc generated values of the pivotal quantity.
Schmee et al. (1985) derived this method in the context of Type II singly censored data (for which these limits are exact within Monte Carlo error), but state that according to Regal (1982) this method produces confidence intervals that are close approximations to the correct limits for Type I censored data.
For multiply censored data, this method has been extended as follows. The
algorithm stays the same, except that Step 2 becomes:
2. Set the $i$'th ordered generated observation to be censored or not censored according to whether the $i$'th observed observation in the original data is censored or not censored.
The function enormCensored
calls the function
gpqCiNormMultiplyCensored
to generate the distribution of
pivotal quantities in the case of multiply censored data.
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Schmee et al. (1985) studied Type II censoring for a normal distribution and noted that the bias and variances of the maximum likelihood estimators are of the order $1/N$, and that the bias is negligible for $N \ge 100$ and as much as 90% censoring. (If the proportion of censored observations is less than 90%, the bias becomes negligible for smaller sample sizes.) For small samples with moderate to high censoring, however, the bias of the mle's causes confidence intervals based on them using a normal approximation (e.g., method="mle" and ci.method="normal.approx") to be too short. Schmee et al. (1985) provide tables of exact confidence intervals for small sample sizes that were created based on Monte Carlo simulation. Schmee et al. (1985) state that these tables should work well for Type I censored data as well.
Shumway et al. (1989) evaluated the coverage of 90% confidence intervals for the mean based on using a Box-Cox transformation to induce normality, computing the mle's based on the normal distribution, then computing the mean in the original scale. They considered three methods of constructing confidence intervals: the delta method, the bootstrap, and the bias-corrected bootstrap. Shumway et al. (1989) used three parent distributions in their study: Normal(3,1), the square of this distribution, and the exponentiation of this distribution (i.e., a lognormal distribution). Based on sample sizes of 10 and 50 with a censoring level at the 10'th or 20'th percentile, Shumway et al. (1989) found that the delta method performed quite well and was superior to the bootstrap method.
Millard et al. (2015; in preparation) show that the coverage of the profile likelihood method is excellent.
Steven P. Millard ([email protected])
Bain, L.J., and M. Engelhardt. (1991). Statistical Analysis of Reliability and Life-Testing Models. Marcel Dekker, New York, 496pp.
Cohen, A.C. (1959). Simplified Estimators for the Normal Distribution When Samples are Singly Censored or Truncated. Technometrics 1(3), 217–237.
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
El-Shaarawi, A.H. (1989). Inferences About the Mean from Censored Water Quality Data. Water Resources Research 25(4) 685–690.
El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.
El-Shaarawi, A.H., and S.R. Esterby. (1992). Replacement of Censored Observations by a Constant: An Evaluation. Water Research 26(6), 835–844.
El-Shaarawi, A.H., and A. Naderi. (1991). Statistical Inference from Multiply Censored Environmental Data. Environmental Monitoring and Assessment 17, 339–347.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Gleit, A. (1985). Estimation for Small Normal Data Sets with Detection Limits. Environmental Science and Technology 19, 1201–1206.
Haas, C.N., and P.A. Scheff. (1990). Estimation of Averages in Truncated Samples. Environmental Science and Technology 24(6), 912–919.
Hashimoto, L.K., and R.R. Trussell. (1983). Evaluating Water Quality Data Near the Detection Limit. Paper presented at the Advanced Technology Conference, American Water Works Association, Las Vegas, Nevada, June 5-9, 1983.
Helsel, D.R. (1990). Less than Obvious: Statistical Treatment of Data Below the Detection Limit. Environmental Science and Technology 24(12), 1766–1774.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715–727.
Korn, L.R., and D.E. Tyler. (2001). Robust Estimation for Chemical Concentration Data Subject to Detection Limits. In Fernholz, L., S. Morgenthaler, and W. Stahel, eds. Statistics in Genetics and in the Environmental Sciences. Birkhauser Verlag, Basel, pp.41–63.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461–496.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Newman, M.C., P.M. Dixon, B.B. Looney, and J.E. Pinder. (1989). Estimating Mean and Variance for Environmental Samples with Below Detection Limit Observations. Water Resources Bulletin 25(4), 905–916.
Pettitt, A. N. (1983). Re-Weighted Least Squares Estimation with Censored and Grouped Data: An Application of the EM Algorithm. Journal of the Royal Statistical Society, Series B 47, 253–260.
Regal, R. (1982). Applying Order Statistic Censored Normal Confidence Intervals to Time Censored Data. Unpublished manuscript, University of Minnesota, Duluth, Department of Mathematical Sciences.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Saw, J.G. (1961b). The Bias of the Maximum Likelihood Estimators of Location and Scale Parameters Given a Type II Censored Normal Sample. Biometrika 48, 448–451.
Schmee, J., D.Gladstein, and W. Nelson. (1985). Confidence Limits for Parameters of a Normal Distribution from Singly Censored Samples, Using Maximum Likelihood. Technometrics 27(2) 119–128.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, New York, 273pp.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Travis, C.C., and M.L. Land. (1990). Estimating the Mean of Data Sets with Nondetectable Values. Environmental Science and Technology 24, 961–962.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
Normal, enorm, estimateCensored.object.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and standard deviation using the MLE, # Q-Q regression (also called parametric regression on order statistics # or ROS; e.g., USEPA, 2009 and Helsel, 2012), and imputation with Q-Q # regression (also called robust regression on order statistics or rROS). # We will log-transform the original observations and then call # enormCensored. Alternatively, we could have more simply called # elnormCensored. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean and standard deviation on the log-scale # using the MLE: #--------------------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 0.6931472 1.6094379 # #Estimated Parameter(s): mean = 2.215905 # sd = 1.356291 # #Estimation Method: MLE # #Data: log(Manganese.ppb) # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # Now compare the MLE with the estimators based on # Q-Q regression and imputation with Q-Q regression #-------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored))$parameters # mean sd #2.215905 1.356291 with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored, method = "ROS"))$parameters # mean sd #2.293742 1.283635 with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored, method = "rROS"))$parameters # mean sd #2.298656 1.238104 #---------- # The method used to estimate quantiles for a Q-Q plot is # determined by the argument prob.method. For the functions # enormCensored and elnormCensored, for any estimation # method that involves Q-Q regression, the default value of # prob.method is "hirsch-stedinger" and the default value for the # plotting position constant is plot.pos.con=0.375. # Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger # probability method but set the plotting position constant to 0. with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored, method = "rROS", plot.pos.con = 0))$parameters # mean sd #2.277175 1.261431 #---------- # Using the same data as above, compute a confidence interval # for the mean on the log-scale using the profile-likelihood # method. 
with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 0.6931472 1.6094379 # #Estimated Parameter(s): mean = 2.215905 # sd = 1.356291 # #Estimation Method: MLE # #Data: log(Manganese.ppb) # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 1.595062 # UCL = 2.771197
Estimate the mean, standard deviation, and standard error of the mean nonparametrically given a sample of data, and optionally construct a confidence interval for the mean.
enpar(x, ci = FALSE, ci.method = "bootstrap", ci.type = "two-sided", conf.level = 0.95, pivot.statistic = "z", n.bootstraps = 1000, seed = NULL)
x |
numeric vector of observations.
Missing ( |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are
|
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
pivot.statistic |
character string indicating which statistic to use for the confidence interval
for the mean when |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean. This argument is ignored if
|
seed |
integer supplied to the function |
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from some distribution with mean $\mu$ and standard deviation $\sigma$.

Estimation

Unbiased and consistent estimators of the mean and variance are:

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\hat{\sigma}^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

A consistent (but not unbiased) estimate of the standard deviation is given by the square root of the estimated variance above:

$$\hat{\sigma} = s$$

It can be shown that the variance of the sample mean is given by:

$$Var(\bar{x}) = \sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$$

so the standard deviation of the sample mean (usually called the standard error) can be estimated by:

$$\hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}}$$
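A minimal sketch in base R of the estimators just described (the data values are hypothetical and used only for illustration):

# Sample mean, standard deviation, and standard error of the mean,
# computed directly from the formulas above (hypothetical data).
x <- c(2.1, 3.5, 4.2, 5.8, 7.0)

n       <- length(x)
mean.x  <- sum(x) / n                     # unbiased estimate of the mean
var.x   <- sum((x - mean.x)^2) / (n - 1)  # unbiased estimate of the variance
sd.x    <- sqrt(var.x)                    # consistent estimate of the standard deviation
se.mean <- sd.x / sqrt(n)                 # standard error of the mean

c(mean = mean.x, sd = sd.x, se.mean = se.mean)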
Confidence Intervals

This section explains how confidence intervals for the mean $\mu$ are computed.

Normal Approximation (ci.method="normal.approx")

This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$, i.e., the sample mean, is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t_{1-\alpha/2, n-1} \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} + t_{1-\alpha/2, n-1} \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $n$ denotes the assumed sample size for the confidence interval, and $t_{p,\nu}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.
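A minimal sketch in base R of the two-sided interval above using the Student's t pivot (hypothetical data), which is the computation this method performs:

# Two-sided normal-approximation confidence interval for the mean (t pivot).
x <- c(2.1, 3.5, 4.2, 5.8, 7.0)   # hypothetical data
conf.level <- 0.95
alpha      <- 1 - conf.level

n       <- length(x)
mean.x  <- mean(x)
se.mean <- sd(x) / sqrt(n)
t.quant <- qt(1 - alpha / 2, df = n - 1)

c(LCL = mean.x - t.quant * se.mean, UCL = mean.x + t.quant * se.mean)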
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap")

The bootstrap is a nonparametric method of estimating the distribution (and associated distribution parameters and quantiles) of a sample statistic, regardless of the distribution of the population from which the sample was drawn. The bootstrap was introduced by Efron (1979) and a general reference is Efron and Tibshirani (1993).

In the context of deriving an approximate $(1-\alpha)100\%$ confidence interval for the population mean $\mu$, the bootstrap can be broken down into the following steps:

1. Create a bootstrap sample by taking a random sample of size $n$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample.

2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$. For the bootstrap-t method (see below), this step also involves estimating the standard error of the estimate of the mean and computing the statistic $T^* = (\hat{\mu}_B - \hat{\mu}) / \hat{\sigma}_{\hat{\mu}_B}$, where $\hat{\mu}$ denotes the estimate of the mean based on the original sample, and $\hat{\mu}_B$ and $\hat{\sigma}_{\hat{\mu}_B}$ denote the estimate of the mean and the estimate of the standard error of the estimate of the mean based on the bootstrap sample.

3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function enpar, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.

4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of the estimator of $\mu$, or to compute the empirical cumulative distribution function of the statistic $T^*$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$[\hat{G}^{-1}(\tfrac{\alpha}{2}), \;\; \hat{G}^{-1}(1 - \tfrac{\alpha}{2})]$$

where $\hat{G}(t)$ denotes the empirical cdf of $\hat{\mu}_B$ evaluated at $t$ and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile of the distribution of $\hat{\mu}_B$, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha), \;\; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[-\infty, \;\; \hat{G}^{-1}(1 - \alpha)]$$

The function enpar calls the R function quantile to compute the empirical quantiles used in the percentile intervals above.
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{n}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/n$ instead of $k/\sqrt{n}$. The two-sided bias-corrected and accelerated confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)]$$

where

$$\alpha_1 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right]$$

$$\alpha_2 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right]$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^{n} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6 \left[\sum_{i=1}^{n} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2\right]^{3/2}}$$

where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$[\hat{G}^{-1}(\alpha_1), \;\; \infty]$$

and a one-sided upper confidence interval is given by:

$$[-\infty, \;\; \hat{G}^{-1}(\alpha_2)]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$ in the equations for $\alpha_1$ and $\alpha_2$ above.

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.
For the bootstrap-t method, the two-sided confidence interval (Efron and Tibshirani, 1993, p.160) is computed as:

$$[\hat{\mu} - t^*_{1-\alpha/2} \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} - t^*_{\alpha/2} \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ and $\hat{\sigma}_{\hat{\mu}}$ denote the estimate of the mean and standard error of the estimate of the mean based on the original sample, and $t^*_p$ denotes the $p$'th empirical quantile of the bootstrap distribution of the statistic $T^*$. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{\mu} - t^*_{1-\alpha} \hat{\sigma}_{\hat{\mu}}, \;\; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[-\infty, \;\; \hat{\mu} - t^*_{\alpha} \hat{\sigma}_{\hat{\mu}}]$$
When ci.method="bootstrap"
, the function enpar
computes
the percentile method, bias-corrected and accelerated method, and bootstrap-t
bootstrap confidence intervals. The percentile method is transformation respecting,
but not second-order accurate. The bootstrap-t method is second-order accurate, but not
transformation respecting. The bias-corrected and accelerated method is both
transformation respecting and second-order accurate (Efron and Tibshirani, 1993, p.188).
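A minimal sketch in base R of Steps 1-4 for the percentile interval described above (hypothetical data; enpar() with ci.method="bootstrap" additionally reports the BCa and bootstrap-t limits):

# Percentile bootstrap confidence interval for the mean (hypothetical data).
set.seed(476)
x <- c(2.1, 3.5, 4.2, 5.8, 7.0, 9.3, 12.4)
B <- 1000
conf.level <- 0.95
alpha      <- 1 - conf.level

# Steps 1-3: B bootstrap estimates of the mean
boot.means <- replicate(B, mean(sample(x, replace = TRUE)))

# Step 4: percentile interval from the empirical cdf of the bootstrap means
quantile(boot.means, probs = c(alpha / 2, 1 - alpha / 2))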
a list of class "estimate"
containing the estimated parameters
and other information. See estimate.object
for details.
The function enpar
is related to the companion function
enparCensored
for censored data. To estimate the median and
compute a confidence interval, use eqnpar
.
The result of the call to enpar
with ci.method="normal.approx"
and pivot.statistic="t"
produces the same result as the call to
enorm
with ci.param="mean"
.
Steven P. Millard ([email protected])
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
enparCensored
, eqnpar
, enorm
,
mean
, sd
, estimate.object
.
# The data frame ACE.13.TCE.df contains observations on # Trichloroethylene (TCE) concentrations (mg/L) at # 10 groundwater monitoring wells before and after remediation. # # Compute the mean concentration for each period along with # a 95% bootstrap BCa confidence interval for the mean. # # NOTE: Use of the argument "seed" is necessary to reproduce this example. # # Before remediation: 21.6 [14.2, 30.1] # After remediation: 3.6 [ 1.6, 5.7] with(ACE.13.TCE.df, enpar(TCE.mg.per.L[Period=="Before"], ci = TRUE, seed = 476)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Parameter(s): mean = 21.62400 # sd = 13.51134 # se.mean = 4.27266 # #Estimation Method: Sample Mean # #Data: TCE.mg.per.L[Period == "Before"] # #Sample Size: 10 # #Confidence Interval for: mean # #Confidence Interval Method: Bootstrap # #Number of Bootstraps: 1000 # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: Pct.LCL = 13.95560 # Pct.UCL = 29.79510 # BCa.LCL = 14.16080 # BCa.UCL = 30.06848 # t.LCL = 12.41945 # t.UCL = 32.47306 #---------- with(ACE.13.TCE.df, enpar(TCE.mg.per.L[Period=="After"], ci = TRUE, seed = 543)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Parameter(s): mean = 3.632900 # sd = 3.554419 # se.mean = 1.124006 # #Estimation Method: Sample Mean # #Data: TCE.mg.per.L[Period == "After"] # #Sample Size: 10 # #Confidence Interval for: mean # #Confidence Interval Method: Bootstrap # #Number of Bootstraps: 1000 # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: Pct.LCL = 1.833843 # Pct.UCL = 5.830230 # BCa.LCL = 1.631655 # BCa.UCL = 5.677514 # t.LCL = 1.683791 # t.UCL = 8.101829
Estimate the mean, standard deviation, and standard error of the mean nonparametrically given a sample of data from a positive-valued distribution that has been subjected to left- or right-censoring, and optionally construct a confidence interval for the mean.
enparCensored(x, censored, censoring.side = "left", correct.se = TRUE, restricted = FALSE, left.censored.min = "Censoring Level", right.censored.max = "Censoring Level", ci = FALSE, ci.method = "normal.approx", ci.type = "two-sided", conf.level = 0.95, pivot.statistic = "t", ci.sample.size = "Total", n.bootstraps = 1000, seed = NULL, warn = FALSE)
x |
numeric vector of positive-valued observations.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
correct.se |
logical scalar indicating whether to multiply the estimated standard error
by a factor to correct for bias. The default value is |
restricted |
logical scalar indicating whether to compute the restricted mean in the case when
the smallest censored value is less than or equal to the smallest uncensored value
(left-censored data) or the largest censored value is greater than or equal to the
largest uncensored value (right-censored data). The default value is
|
left.censored.min |
Only relevant for the case when |
right.censored.max |
Only relevant for the case when |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are
|
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
pivot.statistic |
character string indicating which statistic to use for the confidence interval
for the mean when |
ci.sample.size |
character string indicating what sample size to assume when
computing the confidence interval for the mean when |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when |
seed |
integer supplied to the function |
warn |
logical scalar indicating whether to issue a notification in the case when a
restricted mean will be estimated, but setting the smallest censored value(s)
to an uncensored value (left-censored data) or setting the largest censored
value(s) to an uncensored value (right-censored data) results in no censored
values in the data. In this case, the function |
Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a vector of $N$ observations from some positive-valued distribution with mean $\mu$ and standard deviation $\sigma$. Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c = N - n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ censoring levels

$$T_1, T_2, \ldots, T_k$$

Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ denote the $n$ ordered uncensored observations, and let $r_1, r_2, \ldots, r_n$ denote the order of these uncensored observations within the context of all the observations (censored and uncensored). For example, if the left-censored data are {<10, 14, 14, <15, 20}, then $x_{(1)} = 14$, $x_{(2)} = 14$, $x_{(3)} = 20$, and $r_1 = 2$, $r_2 = 3$, $r_3 = 5$.

Let $x'_{(1)} < x'_{(2)} < \cdots < x'_{(m)}$ denote the $m$ ordered distinct uncensored observations, let $d_j$ denote the number of detects at $x'_{(j)}$ ($j = 1, 2, \ldots, m$), and let $n_j$ denote the number of $x_i \le x'_{(j)}$, i.e., the number of observations (censored and uncensored) less than or equal to $x'_{(j)}$ ($j = 1, 2, \ldots, m$). For example, if the left-censored data are {<10, 14, 14, <15, 20}, then $m = 2$, $x'_{(1)} = 14$, $x'_{(2)} = 20$, $d_1 = 2$, $d_2 = 1$, and $n_1 = 3$, $n_2 = 5$.
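A minimal sketch in base R of this bookkeeping for the small example above, assuming the non-detects are coded at their censoring levels (the helper objects are illustrative, not part of the package):

# Left-censored example {<10, 14, 14, <15, 20}, non-detects coded at censoring levels.
x        <- c(10, 14, 14, 15, 20)
censored <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

x.uncen <- sort(x[!censored])                         # ordered uncensored observations
r       <- rank(x, ties.method = "first")[!censored]  # order among all observations: 2, 3, 5
x.dist  <- sort(unique(x.uncen))                      # distinct uncensored values: 14, 20
m       <- length(x.dist)                             # m = 2
d       <- sapply(x.dist, function(v) sum(x == v & !censored))  # detects at each value: 2, 1
n.j     <- sapply(x.dist, function(v) sum(x <= v))    # observations <= each value: 3, 5

list(r = r, m = m, d = d, n.j = n.j)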
Estimation

This section explains how the mean $\mu$, standard deviation $\sigma$, and standard error of the mean are estimated, as well as the restricted mean.

Estimating the Mean

It can be shown that the mean of a positive-valued distribution is equal to the area under the survival curve (Klein and Moeschberger, 2003, p.33):

$$\mu = \int_0^\infty [1 - F(t)] \, dt = \int_0^\infty S(t) \, dt$$

where $F(t)$ denotes the cumulative distribution function evaluated at $t$ and $S(t) = 1 - F(t)$ denotes the survival function evaluated at $t$. When the Kaplan-Meier estimator is used to construct the survival function, you can use the area under this curve to estimate the mean of the distribution, and the estimator can be as efficient or more efficient than parametric estimators of the mean (Meier, 2004; Helsel, 2012; Lee and Wang, 2003). Let $\hat{F}(t)$ denote the Kaplan-Meier estimator of the empirical cumulative distribution function (ecdf) evaluated at $t$, and let $\hat{S}(t) = 1 - \hat{F}(t)$ denote the estimated survival function evaluated at $t$. (See the help files for ecdfPlotCensored and qqPlotCensored for an explanation of how the Kaplan-Meier estimator of the ecdf is computed.)

The formula for the estimated mean is given by (Lee and Wang, 2003, p. 74):

$$\hat{\mu} = \sum_{i=1}^{n} \hat{S}(x_{(i-1)}) [x_{(i)} - x_{(i-1)}]$$

where $x_{(0)} = 0$ and $\hat{S}(x_{(0)}) = 1$ by definition. It can be shown that this formula is equivalent to:

$$\hat{\mu} = \sum_{i=1}^{n} x_{(i)} [\hat{F}(x_{(i)}) - \hat{F}(x_{(i-1)})]$$

where $\hat{F}(x_{(0)}) = 0$ by definition, and this is equivalent to:

$$\hat{\mu} = \sum_{j=1}^{m} x'_{(j)} [\hat{F}(x'_{(j)}) - \hat{F}(x'_{(j-1)})]$$

(USEPA, 2009, pp. 15-7 to 15-12; Beal, 2010; USEPA, 2022, pp. 128-129).
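A minimal sketch in base R of this estimator for the small left-censored example above, computed as a weighted sum of the distinct detected values (the data and helper objects are illustrative, not from the guidance documents):

# Kaplan-Meier estimate of the mean for {<10, 14, 14, <15, 20},
# with non-detects coded at their censoring levels.
x        <- c(10, 14, 14, 15, 20)
censored <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

x.dist <- sort(unique(x[!censored]))                           # distinct detected values: 14, 20
d      <- sapply(x.dist, function(v) sum(x == v & !censored))  # detects at each value
n.j    <- sapply(x.dist, function(v) sum(x <= v))              # observations at or below each value
m      <- length(x.dist)

# Kaplan-Meier estimate of the cdf at each distinct detected value
F.hat <- sapply(seq_len(m), function(j) {
  if (j == m) 1 else prod((n.j[(j + 1):m] - d[(j + 1):m]) / n.j[(j + 1):m])
})

# Estimated mean: each distinct detected value times its estimated probability mass
mass    <- diff(c(0, F.hat))
mean.km <- sum(x.dist * mass)
mean.km   # 15.2 for this example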
Estimating the Standard Deviation

The formula for the estimated standard deviation is:

$$\hat{\sigma} = \left\{ \sum_{i=1}^{n} [x_{(i)} - \hat{\mu}]^2 [\hat{F}(x_{(i)}) - \hat{F}(x_{(i-1)})] \right\}^{1/2}$$

which is equivalent to:

$$\hat{\sigma} = \left\{ \sum_{j=1}^{m} [x'_{(j)} - \hat{\mu}]^2 [\hat{F}(x'_{(j)}) - \hat{F}(x'_{(j-1)})] \right\}^{1/2}$$

(USEPA, 2009, p. 15-10; Beal, 2010).
Estimating the Standard Error of the Mean

For left-censored data, the formula for the estimated standard error of the mean is:

$$\hat{\sigma}_{\hat{\mu}} = \left[ \sum_{j=1}^{m} A_j^2 \, \frac{d_j}{n_j (n_j - d_j)} \right]^{1/2}$$

where $A_j$ denotes the area under the estimated cumulative distribution function to the left of $x'_{(j)}$:

$$A_j = \sum_{i=1}^{j-1} \hat{F}(x'_{(i)}) [x'_{(i+1)} - x'_{(i)}]$$

(Beal, 2010; USEPA, 2022, pp. 128-129).

For right-censored data, the estimated standard error of the mean has the analogous form, with $A_j$ equal to the area under the estimated survival curve to the right of $x'_{(j)}$ and $n_j$ equal to the number of observations greater than or equal to $x'_{(j)}$ (Lee and Wang, 2003, p. 74).
Kaplan and Meier suggest using a bias correction of $n/(n-1)$ for the estimated variance of the mean, where $n$ denotes the number of uncensored observations (Lee and Wang, 2003, p.75):

$$\widehat{Var}_{BC}(\hat{\mu}) = \frac{n}{n-1} \, \widehat{Var}(\hat{\mu})$$

When correct.se=TRUE (the default), this bias-corrected estimate of the variance (and hence of the standard error) of the mean is used. Beal (2010), ProUCL 5.2.0 (USEPA, 2022), and the kmms function in the STAND package (Frome and Frome, 2015) all compute the bias-corrected estimate of the standard error of the mean as well.
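As a minimal illustration of the bias correction applied when correct.se=TRUE (the object names and values are hypothetical):

# Bias-corrected standard error of the mean, given an uncorrected standard
# error and the number of uncensored observations.
n.uncensored   <- 19
se.uncorrected <- 100
se.corrected   <- se.uncorrected * sqrt(n.uncensored / (n.uncensored - 1))
se.corrected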
Estimating the Restricted Mean
If the smallest value for left-censored data is censored and less than or equal to
the smallest uncensored value, then the estimated mean will be biased high, and
if the largest value for right-censored data is censored and greater than or equal to
the largest uncensored value, then the estimated mean will be biased low. One solution
to this problem is to instead estimate what is called the restricted mean
(Miller, 1981; Lee and Wang, 2003, p. 74; Meier, 2004; Barker, 2009).
To compute the restricted mean (restricted=TRUE) for left-censored data, the smallest censored observation(s) are treated as observed and set to the smallest censoring level (left.censored.min="Censoring Level") or to some other value that is less than the smallest censoring level and greater than 0, and the above formulas are then applied. To compute the restricted mean for right-censored data, the largest censored observation(s) are treated as observed and set to the censoring level (right.censored.max="Censoring Level") or to some value greater than the largest censoring level.
ProUCL 5.2.0 (USEPA, 2022, pp. 128–129) and Beal (2010) do not compute the restricted
mean in cases where it could be applied, whereas USEPA (2009, pp. 15–7 to 15–12) and
the kmms
function in Version 2.0 of the R package STAND
(Frome and Frome, 2015) do compute the restricted mean and set the smallest
censored observation(s) equal to the censoring level (i.e., what
enparCensored
does when restricted=TRUE
and
left.censored.min="Censoring Level"
).
To be consistent with ProUCL 5.2.0, by default the function enparCensored
does not compute the restricted mean (i.e., restricted=FALSE
). It should
be noted that when the restricted mean is computed, the number of uncensored
observations increases because the smallest (left-censored) or largest
(right-censored) censored observation(s) is/are set to a specified value and
treated as uncensored. The kmms
function in Version 2.0 of the
STAND package (Frome and Frome, 2015) is inconsistent in how it treats the
number of uncensored observations when computing estimates associated with the
restricted mean. Although kmms
sets the smallest censored observations to the
observed censoring level and treats them as not censored, when it computes
the bias correction factor for the standard error of the mean, it assumes those
observations are still censored (see the EXAMPLES section below).
In the unusual case when a restricted mean will be estimated and setting the
smallest censored value(s) to an uncensored value (left-censored data), or
setting the largest censored value(s) to an uncensored value
(right-censored data), results in no censored values in the data, the Kaplan-Meier
estimate of the mean reduces to the sample mean, so the function
enpar
is called and, if warn=TRUE
, a warning is returned.
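An illustrative call (using the small hypothetical example from the Details above, not data from the guidance documents) showing the unrestricted and restricted estimates when the smallest observation is censored below the smallest detect:

# Left-censored toy example {<10, 14, 14, <15, 20}
x        <- c(10, 14, 14, 15, 20)
censored <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Unrestricted mean (the default, consistent with ProUCL 5.2.0)
enparCensored(x = x, censored = censored)$parameters

# Restricted mean, setting the smallest censored value to its censoring level
enparCensored(x = x, censored = censored, restricted = TRUE,
  left.censored.min = "Censoring Level")$parameters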
Confidence Intervals

This section explains how confidence intervals for the mean $\mu$ are computed.

Normal Approximation (ci.method="normal.approx")

This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$ is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t_{1-\alpha/2, n^*-1} \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} + t_{1-\alpha/2, n^*-1} \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $n^*$ denotes the assumed sample size for the confidence interval, and $t_{p,\nu}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

The argument ci.sample.size determines the value of $n^*$. The possible values are the total number of observations, $N$ (ci.sample.size="Total"), or the number of uncensored observations, $n$ (ci.sample.size="Uncensored"). To be consistent with ProUCL 5.2.0, in enparCensored the default value is the total number of observations. The kmms function in the STAND package, on the other hand, uses the number of uncensored observations.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.
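A minimal sketch in base R of the one-sided upper confidence limit under this method, using the Kaplan-Meier estimates from the Beal (2010) lead example shown in the EXAMPLES section below (ci.sample.size="Total", pivot.statistic="t"):

# Upper confidence limit for the mean based on the t pivot and the total sample size.
mean.km <- 325.3396   # Kaplan-Meier estimate of the mean
se.km   <- 315.0023   # bias-corrected standard error of the mean
N       <- 29         # total number of observations
mean.km + qt(0.95, df = N - 1) * se.km   # approximately 861.2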
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap")

The bootstrap is a nonparametric method of estimating the distribution (and associated distribution parameters and quantiles) of a sample statistic, regardless of the distribution of the population from which the sample was drawn. The bootstrap was introduced by Efron (1979) and a general reference is Efron and Tibshirani (1993).

In the context of deriving an approximate $(1-\alpha)100\%$ confidence interval for the population mean $\mu$, the bootstrap can be broken down into the following steps:

1. Create a bootstrap sample by taking a random sample of size $N$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample (e.g., the number of censored observations in the bootstrap sample will often differ from the number of censored observations in the original sample).

2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$. For the bootstrap-t method (see below), this step also involves estimating the standard error of the estimate of the mean and computing the statistic $T^* = (\hat{\mu}_B - \hat{\mu}) / \hat{\sigma}_{\hat{\mu}_B}$, where $\hat{\mu}$ denotes the estimate of the mean based on the original sample, and $\hat{\mu}_B$ and $\hat{\sigma}_{\hat{\mu}_B}$ denote the estimate of the mean and the estimate of the standard error of the estimate of the mean based on the bootstrap sample.

3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function enparCensored, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.

4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of the estimator of $\mu$, or to compute the empirical cumulative distribution function of the statistic $T^*$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$[\hat{G}^{-1}(\tfrac{\alpha}{2}), \;\; \hat{G}^{-1}(1 - \tfrac{\alpha}{2})]$$

where $\hat{G}(t)$ denotes the empirical cdf of $\hat{\mu}_B$ evaluated at $t$ and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile of the distribution of $\hat{\mu}_B$, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha), \;\; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[0, \;\; \hat{G}^{-1}(1 - \alpha)]$$

The function enparCensored calls the R function quantile to compute the empirical quantiles used in the percentile intervals above.
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{N}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/N$ instead of $k/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)]$$

where

$$\alpha_1 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right]$$

$$\alpha_2 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right]$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^{N} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6 \left[\sum_{i=1}^{N} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2\right]^{3/2}}$$

where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$[\hat{G}^{-1}(\alpha_1), \;\; \infty]$$

and a one-sided upper confidence interval is given by:

$$[0, \;\; \hat{G}^{-1}(\alpha_2)]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$ in the equations for $\alpha_1$ and $\alpha_2$ above.

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.
For the bootstrap-t method, the two-sided confidence interval (Efron and Tibshirani, 1993, p.160) is computed as:

$$[\hat{\mu} - t^*_{1-\alpha/2} \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} - t^*_{\alpha/2} \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ and $\hat{\sigma}_{\hat{\mu}}$ denote the estimate of the mean and standard error of the estimate of the mean based on the original sample, and $t^*_p$ denotes the $p$'th empirical quantile of the bootstrap distribution of the statistic $T^*$. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{\mu} - t^*_{1-\alpha} \hat{\sigma}_{\hat{\mu}}, \;\; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[0, \;\; \hat{\mu} - t^*_{\alpha} \hat{\sigma}_{\hat{\mu}}]$$
When ci.method="bootstrap"
, the function enparCensored
computes
the percentile method, bias-corrected and accelerated method, and bootstrap-t
bootstrap confidence intervals. The percentile method is transformation respecting,
but not second-order accurate. The bootstrap-t method is second-order accurate, but not
transformation respecting. The bias-corrected and accelerated method is both
transformation respecting and second-order accurate (Efron and Tibshirani, 1993, p.188).
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard ([email protected])
Barker, C. (2009). The Mean, Median, and Confidence Intervals of the Kaplan-Meier Survival Estimate – Computations and Applications. The American Statistician 63(1), 78–80.
Beal, D. (2010). A Macro for Calculating Summary Statistics on Left Censored Environmental Data Using the Kaplan-Meier Method. Paper SDA-09, presented at Southeast SAS Users Group 2010, September 26-28, Savannah, GA. https://analytics.ncsu.edu/sesug/2010/SDA09.Beal.pdf.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.
Frome E.L., and D.P. Frome (2015). STAND: Statistical Analysis of Non-Detects. R package version 2.0, https://CRAN.R-project.org/package=STAND.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Irwin, J.O. (1949). The Standard Error of an Estimate of Expectation of Life, with Special Reference to Expectation of Tumourless Life in Experiments with Mice. Journal of Hygiene 47, 188–189.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Klein, J.P., and M.L. Moeschberger. (2003). Survival Analysis: Techniques for Censored and Truncated Data, Second Edition. Springer, New York, 537pp.
Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.
Meier, P., T. Karrison, R. Chappell, and H. Xie. (2004). The Price of Kaplan-Meier. Journal of the American Statistical Association 99(467), 890–896.
Miller, R.G. (1981). Survival Analysis. John Wiley and Sons, New York.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2022). ProUCL Version 5.2.0 Technical Guide: Statistical Software for Environmental Applications for Data Sets with and without Nondetect Observations. Prepared by: Neptune and Company, Inc., 1435 Garrison Street, Suite 201, Lakewood, CO 80215. pp. 128–129, 143. https://www.epa.gov/land-research/proucl-software.
ppointsCensored
, ecdfPlotCensored
,
qqPlotCensored
,estimateCensored.object
,
enpar
.
# Using the lead concentration data from soil samples shown in # Beal (2010), compute the Kaplan-Meier estimators of the mean, # standard deviation, and standard error of the mean, as well as # a 95% upper confidence limit for the mean. Compare these # results to those given in Beal (2010), and also to the results # produced by ProUCL 5.2.0. # First look at the data: #----------------------- head(Beal.2010.Pb.df) # Pb.char Pb Censored #1 <1 1.0 TRUE #2 <1 1.0 TRUE #3 2 2.0 FALSE #4 2.5 2.5 FALSE #5 2.8 2.8 FALSE #6 <3 3.0 TRUE tail(Beal.2010.Pb.df) # Pb.char Pb Censored #24 <10 10 TRUE #25 10 10 FALSE #26 15 15 FALSE #27 49 49 FALSE #28 200 200 FALSE #29 9060 9060 FALSE # enparCensored Results: #----------------------- Beal.unrestricted <- with(Beal.2010.Pb.df, enparCensored(x = Pb, censored = Censored, ci = TRUE, ci.type = "upper")) Beal.unrestricted #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: None # #Censoring Side: left # #Censoring Level(s): 1 3 4 6 9 10 # #Estimated Parameter(s): mean = 325.3396 # sd = 1651.0950 # se.mean = 315.0023 # #Estimation Method: Kaplan-Meier # (Bias-corrected se.mean) # #Data: Pb # #Censoring Variable: Censored # #Sample Size: 29 # #Percent Censored: 34.48276% # #Confidence Interval for: mean # #Assumed Sample Size: 29 # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.0000 # UCL = 861.1996 c(Beal.unrestricted$parameters, Beal.unrestricted$interval$limits) # mean sd se.mean LCL UCL # 325.3396 1651.0950 315.0023 0.0000 861.1996 # Beal (2010) published results: #------------------------------- # Mean Std. Dev. SE of Mean # 325.34 1651.09 315.00 # ProUCL 5.2.0 results: #---------------------- # Mean Std. Dev. SE of Mean 95% UCL # 325.2 1651 315 861.1 #---------- # Now compute the restricted mean and associated quantities, # and compare these results with those produced by the # kmms() function in the STAND package. 
#----------------------------------------------------------- Beal.restricted <- with(Beal.2010.Pb.df, enparCensored(x = Pb, censored = Censored, restricted = TRUE, ci = TRUE, ci.type = "upper")) Beal.restricted #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: None # #Censoring Side: left # #Censoring Level(s): 1 3 4 6 9 10 # #Estimated Parameter(s): mean = 325.2011 # sd = 1651.1221 # se.mean = 314.1774 # #Estimation Method: Kaplan-Meier (Restricted Mean) # Smallest censored value(s) # set to Censoring Level # (Bias-corrected se.mean) # #Data: Pb # #Censoring Variable: Censored # #Sample Size: 29 # #Percent Censored: 34.48276% # #Confidence Interval for: mean # #Assumed Sample Size: 29 # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.000 # UCL = 859.658 c(Beal.restricted$parameters, Beal.restricted$interval$limits) # mean sd se.mean LCL UCL # 325.2011 1651.1221 314.1774 0.0000 859.6580 # kmms() results: #---------------- # KM.mean KM.LCL KM.UCL KM.se gamma # 325.2011 -221.0419 871.4440 315.0075 0.9500 # NOTE: as pointed out above, the kmms() function treats the # smallest censored observations (<1 and <1) as NOT # censored when computing the mean and uncorrected # standard error of the mean, but assumes these # observations ARE censored when computing the corrected # standard error of the mean. #-------------------------------------------------------------- Beal.restricted$parameters["se.mean"] * sqrt((20/21)) * sqrt((19/18)) # se.mean # 315.0075 #========== # Repeat the above example, estimating the unrestricted mean and # computing an upper confidence limit based on the bootstrap # instead of on the normal approximation with a t pivot statistic. # Compare results to those from ProUCL 5.2.0. # Note: Setting the seed argument lets you reproduce this example. #------------------------------------------------------------------ Beal.unrestricted.boot <- with(Beal.2010.Pb.df, enparCensored(x = Pb, censored = Censored, ci = TRUE, ci.type = "upper", ci.method = "bootstrap", seed = 923)) Beal.unrestricted.boot #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: None # #Censoring Side: left # #Censoring Level(s): 1 3 4 6 9 10 # #Estimated Parameter(s): mean = 325.3396 # sd = 1651.0950 # se.mean = 315.0023 # #Estimation Method: Kaplan-Meier # (Bias-corrected se.mean) # #Data: Pb # #Censoring Variable: Censored # #Sample Size: 29 # #Percent Censored: 34.48276% # #Confidence Interval for: mean # #Assumed Sample Size: 29 # #Confidence Interval Method: Bootstrap # #Number of Bootstraps: 1000 # #Number of Bootstrap Samples #With No Censored Values: 0 # #Number of Times Bootstrap #Repeated Because Too Few #Uncensored Observations: 0 # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: Pct.LCL = 0.0000 # Pct.UCL = 948.7342 # BCa.LCL = 0.0000 # BCa.UCL = 942.6596 # t.LCL = 0.0000 # t.UCL = 62121.8909 c(Beal.unrestricted.boot$interval$limits) # Pct.LCL Pct.UCL BCa.LCL BCa.UCL t.LCL t.UCL # 0.0000 948.7342 0.0000 942.6596 0.0000 62121.8909 # ProUCL 5.2.0 results: #---------------------- # Pct.LCL Pct.UCL BCa.LCL BCa.UCL t.LCL t.UCL # 0.0000 944.3 0.0000 947.8 0.0000 62169 #========== # Clean up #--------- rm(Beal.unrestricted, Beal.restricted, Beal.unrestricted.boot)
Daily measurements of ozone concentration, wind speed, temperature, and solar radiation in New York City for 153 consecutive days between May 1 and September 30, 1973.
Environmental.df Air.df
The data frame Environmental.df
has 153 observations on the following 4 variables.
ozone
Average ozone concentration (of hourly measurements) in parts per billion.
radiation
Solar radiation (from 08:00 to 12:00) in langleys.
temperature
Maximum daily temperature in degrees Fahrenheit.
wind
Average wind speed (at 07:00 and 10:00) in miles per hour.
Row names are the dates the data were collected.
The data frame Air.df
is the same as Environmental.df
except that the
column ozone
is the cube root of average ozone concentration.
Data on ozone (ppb), solar radiation (langleys), temperature (degrees Fahrenheit), and wind speed (mph)
for 153 consecutive days between May 1 and September 30, 1973. These data are a superset of the data
contained in the data frame environmental
in the package lattice.
Chambers et al. (1983), pp. 347-349.
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, 395pp.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
Cleveland, W.S. (1994). The Elements of Graphing Data. Revised Edition. Hobart Press, Summit, New Jersey, 297pp.
# Scatterplot matrix pairs(Environmental.df) pairs(Air.df) # Time series plot for ozone attach(Environmental.df) dates <- as.Date(row.names(Environmental.df), format = "%m/%d/%Y") plot(dates, ozone, type = "l", xlab = "Time (Year = 1973)", ylab = "Ozone (ppb)", main = "Time Series Plot of Daily Ozone Measures") detach("Environmental.df") rm(dates)
Concentrations (µg/L) from an exposure unit.
data(EPA.02d.Ex.2.ug.per.L.vec)
a numeric vector of concentrations (µg/L)
USEPA. (2002d). Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites. OSWER 9285.6-10, December 2002. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., p. 9.
Concentrations (mg/kg) from an exposure unit.
data(EPA.02d.Ex.4.mg.per.kg.vec)
a numeric vector of concentrations (mg/kg)
USEPA. (2002d). Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites. OSWER 9285.6-10, December 2002. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., p. 11.
Concentrations (mg/kg) from an exposure unit.
data(EPA.02d.Ex.6.mg.per.kg.vec)
a numeric vector of concentrations (mg/kg)
USEPA. (2002d). Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites. OSWER 9285.6-10, December 2002. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., p. 13.
Concentrations (mg/L) from an exposure unit.
data(EPA.02d.Ex.9.mg.per.L.vec)
a numeric vector of concentrations (mg/L)
USEPA. (2002d). Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites. OSWER 9285.6-10, December 2002. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., p. 16.
Nickel concentrations (ppb) from four wells (five observations per year for each well). The Guidance Document uses the label “Year” instead of “Well”; this was corrected in the Errata.
EPA.09.Ex.10.1.nickel.df
A data frame with 20 observations on the following 3 variables.
Month
a numeric vector indicating the month the sample was taken
Well
a factor indicating the well number
Nickel.ppb
a numeric vector of nickel concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery, Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C., p.10-12.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Arsenic concentrations (ppb) at six wells (four observations per well).
EPA.09.Ex.11.1.arsenic.df
A data frame with 24 observations on the following 3 variables.
Arsenic.ppb
a numeric vector of arsenic concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.11-3.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Carbon tetrachloride (CCL4) concentrations (ppb) at five background wells (four measures at each well).
EPA.09.Ex.12.1.ccl4.df
A data frame with 20 observations on the following 2 variables.
Well
a factor indicating the well number
CCL4.ppb
a numeric vector of CCL4 concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.12-3.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Naphthalene concentrations (ppb) at five background wells (five quarterly measures at each well).
EPA.09.Ex.12.4.naphthalene.df
A data frame with 25 observations on the following 3 variables.
Quarter
a numeric vector indicating the quarter the sample was taken
Well
a factor indicating the well number
Naphthalene.ppb
a numeric vector of naphthalene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.12-12.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Dissolved iron (Fe) concentrations (ppm) at six upgradient wells (four quarterly measures at each well).
EPA.09.Ex.13.1.iron.df
A data frame with 24 observations on the following 4 variables.
Month
a numeric vector indicating the month the sample was taken
Year
a numeric vector indicating the year the sample was taken
Well
a factor indicating the well number
Iron.ppm
a numeric vector of iron concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.13-3.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Manganese concentrations (ppm) at four background wells (eight quarterly measures at each well).
EPA.09.Ex.14.1.manganese.df
A data frame with 32 observations on the following 3 variables.
Quarter
a numeric vector indicating the quarter the sample was taken
Well
a factor indicating the well number
Manganese.ppm
a numeric vector of manganese concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.14-5.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Alkalinity measures (mg/L) collected from leachate at a solid waste landfill during a four and a half year period.
EPA.09.Ex.14.3.alkalinity.df
A data frame with 54 observations on the following 2 variables.
Date
a Date object indicating the date of collection
Alkalinity.mg.per.L
a numeric vector of alkalinity measures (mg/L)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.14-14.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Sixteen quarterly measures of arsenic concentrations (ppb).
EPA.09.Ex.14.4.arsenic.df
A data frame with 16 observations on the following 4 variables.
Sample.Date
a factor indicating the month and year of collection
Month
a factor indicating the month of collection
Year
a factor indicating the year of collection
Arsenic.ppb
a numeric vector of arsenic concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.14-18.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Monthly unadjusted and adjusted analyte concentrations over a 3-year period. Adjusted concentrations are computed by subtracting the monthly mean and adding the overall mean.
EPA.09.Ex.14.8.df
A data frame with 36 observations on the following 4 variables.
Month
a factor indicating the month of collection
Year
a numeric vector indicating the year of collection
Unadj.Conc
a numeric vector of unadjusted concentrations
Adj.Conc
a numeric vector of adjusted concentrations
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.14-32.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Manganese concentrations (ppb) at five background wells (five measures at each well).
EPA.09.Ex.15.1.manganese.df
EPA.09.Ex.15.1.manganese.df
A data frame with 25 observations on the following 5 variables.
Sample
a numeric vector indicating the sample number (1-5)
Well
a factor indicating the well number
Manganese.Orig.ppb
a character vector of the original manganese concentrations (ppb)
Manganese.ppb
a numeric vector of manganese concentrations with non-detects coded to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.15-10.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
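A minimal sketch (not part of the guidance document) of summarizing the extent of censoring in this data set, assuming the EnvStats package is attached:

library(EnvStats)
df <- EPA.09.Ex.15.1.manganese.df

# Overall proportion of nondetects.
mean(df$Censored)

# Censoring status broken down by well.
with(df, table(Well, Censored))

# Estimation that accounts for the nondetects could then be carried out with,
# for example, one of the EnvStats *Censored functions, such as
# elnormCensored(df$Manganese.ppb, df$Censored).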
Sulfate concentrations (ppm) at one background well and one downgradient well (eight quarterly measures at each well).
EPA.09.Ex.16.1.sulfate.df
EPA.09.Ex.16.1.sulfate.df
A data frame with 16 observations on the following 4 variables.
Month
a factor indicating the month of collection
Year
a factor indicating the year of collection
Well.type
a factor indicating the well type (background vs. downgradient)
Sulfate.ppm
a numeric vector of sulfate concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.16-6.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Benzene concentrations (ppb) at one background and one downgradient well (eight monthly measures at each well).
EPA.09.Ex.16.2.benzene.df
EPA.09.Ex.16.2.benzene.df
A data frame with 16 observations on the following 3 variables.
Month
a factor indicating the month of collection
Well.type
a factor indicating the well type (background vs. downgradient)
Benzene.ppb
a numeric vector of benzene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.16-9.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
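As an illustrative (and not authoritative) two-sample comparison of the downgradient and background wells, assuming the EnvStats package is attached:

library(EnvStats)
df <- EPA.09.Ex.16.2.benzene.df

# Nonparametric comparison of the two well types.
wilcox.test(Benzene.ppb ~ Well.type, data = df)

# Side-by-side summaries.
with(df, tapply(Benzene.ppb, Well.type, summary))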
Copper concentrations (ppb) at two background wells and one compliance well (six measures at each well).
EPA.09.Ex.16.4.copper.df
EPA.09.Ex.16.4.copper.df
A data frame with 18 observations on the following 4 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Copper.ppb
a numeric vector of copper concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.16-19.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Tetrachloroethylene (PCE) concentrations (ppb) at one background well and one compliance well.
EPA.09.Ex.16.5.PCE.df
EPA.09.Ex.16.5.PCE.df
A data frame with 14 observations on the following 4 variables.
Well.type
a factor with levels Background and Compliance
PCE.Orig.ppb
a character vector of original PCE concentrations (ppb)
PCE.ppb
a numeric vector of PCE concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.16-22.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Log-transformed lead concentrations (ppb) at two background and four compliance wells (four quarterly measures at each well).
EPA.09.Ex.17.1.loglead.df
EPA.09.Ex.17.1.loglead.df
A data frame with 24 observations on the following 4 variables.
Month
a factor indicating the month of collection; 1 = Jan, 2 = Apr, 3 = Jul, 4 = Oct
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
LogLead
a numeric vector of log-transformed lead concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-7.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Toluene concentrations (ppb) at two background and three compliance wells (five monthly measures at each well).
EPA.09.Ex.17.2.toluene.df
EPA.09.Ex.17.2.toluene.df
A data frame with 25 observations on the following 6 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Toluene.ppb.orig
a character vector of original toluene concentrations (ppb)
Toluene.ppb
a numeric vector of toluene concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-13.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Chrysene concentrations (ppb) at two background and three compliance wells (four monthly measures at each well).
EPA.09.Ex.17.3.chrysene.df
EPA.09.Ex.17.3.chrysene.df
A data frame with 20 observations on the following 4 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Chrysene.ppb
a numeric vector of chrysene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-17.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Log-transformed chrysene concentrations (ppb) at two background and three compliance wells (four monthly measures at each well).
EPA.09.Ex.17.3.log.chrysene.df
EPA.09.Ex.17.3.log.chrysene.df
A data frame with 20 observations on the following 4 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Log.Chrysene.ppb
a numeric vector of log-transformed chrysene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-18.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
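A purely illustrative way to use the raw and log-transformed chrysene data sets together is to compare a one-way analysis of variance across wells on both scales; this sketch assumes the EnvStats package is attached.

library(EnvStats)

# ANOVA across wells on the original scale ...
summary(aov(Chrysene.ppb ~ Well, data = EPA.09.Ex.17.3.chrysene.df))

# ... and on the log scale, where the equal-variance assumption is
# usually more plausible for concentration data.
summary(aov(Log.Chrysene.ppb ~ Well, data = EPA.09.Ex.17.3.log.chrysene.df))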
Copper concentrations (ppb) at three background and two compliance wells (eight monthly measures at the background wells, four monthly measures at the compliance wells).
EPA.09.Ex.17.4.copper.df
EPA.09.Ex.17.4.copper.df
A data frame with 40 observations on the following 6 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Copper.ppb.orig
a character vector of original copper concentrations (ppb)
Copper.ppb
a numeric vector of copper concentrations with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-21.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Chloride concentrations (ppm) collected over a five-year period at a solid waste landfill.
EPA.09.Ex.17.5.chloride.df
EPA.09.Ex.17.5.chloride.df
A data frame with 19 observations on the following 4 variables.
Date
a Date object indicating the date of collection
Chloride.ppm
a numeric vector of chloride concentrations (ppm)
Elapsed.Days
a numeric vector indicating the number of days since January 1, 2002
Residuals
a numeric vector of residuals from a linear regression trend fit
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-26.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
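Because the Residuals column is described as coming from a linear regression trend fit, it can be reproduced (up to rounding) with a simple lm() fit. The sketch below is illustrative and assumes the EnvStats package is attached.

library(EnvStats)
df <- EPA.09.Ex.17.5.chloride.df

# Fit a linear trend of chloride on elapsed time.
fit <- lm(Chloride.ppm ~ Elapsed.Days, data = df)

# Compare freshly computed residuals with the Residuals column
# (small differences may arise from rounding in the source table).
head(cbind(residuals(fit), df$Residuals))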
Sulfate concentrations (ppm) collected over several years.
The date of collection is simply indicated by month and year of collection.
The column Date
is a Date object where the day of the month has been arbitrarily set to 1.
EPA.09.Ex.17.6.sulfate.df
EPA.09.Ex.17.6.sulfate.df
A data frame with 23 observations on the following 6 variables.
Sample.No
a numeric vector indicating the sample number
Year
a numeric vector indicating the year of collection
Month
a numeric vector indicating the month of collection
Sampling.Date
a numeric vector indicating the year and month of collection
Date
a Date object indicating the date of collection, where the day of the month is arbitrarily set to 1
Sulfate.ppm
a numeric vector of sulfate concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-33.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Sodium concentrations (ppm) collected over several years. The sample dates are recorded as the year of collection (2-digit format) plus a fractional part indicating when during the year the sample was collected.
EPA.09.Ex.17.7.sodium.df
EPA.09.Ex.17.7.sodium.df
A data frame with 10 observations on the following 2 variables.
Year
a numeric vector indicating the year of collection (a fractional number)
Sodium.ppm
a numeric vector of sodium concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-36.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Arsenic concentrations (ppb) in a single well at a solid waste landfill. Four observations per year over four years. Years 1-3 are the background period and Year 4 is the compliance period.
EPA.09.Ex.18.1.arsenic.df
EPA.09.Ex.18.1.arsenic.df
A data frame with 16 observations on the following 3 variables.
Year
a factor indicating the year of collection
Sampling.Period
a factor indicating the sampling period (background vs. compliance)
Arsenic.ppb
a numeric vector of arsenic concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.18-10.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
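A hedged sketch of how the background-period observations might be used to build an upper prediction limit against which the compliance-period values are compared. The factor level names ("Background", "Compliance") and the use of the EnvStats function predIntNorm with these particular settings are assumptions of this sketch, not a restatement of the guidance document's calculation.

library(EnvStats)
df <- EPA.09.Ex.18.1.arsenic.df

bkgd <- df$Arsenic.ppb[df$Sampling.Period == "Background"]  # assumed level name
comp <- df$Arsenic.ppb[df$Sampling.Period == "Compliance"]  # assumed level name

# Upper 95% prediction limit for the next 4 observations,
# based on the background data.
predIntNorm(bkgd, k = 4, pi.type = "upper", conf.level = 0.95)

# Compliance-period values to compare against the limit.
comp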
Chrysene concentrations (ppb) at two background wells and one compliance well (four monthly measures at each well).
EPA.09.Ex.18.2.chrysene.df
EPA.09.Ex.18.2.chrysene.df
A data frame with 12 observations on the following 4 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Chrysene.ppb
a numeric vector of chrysene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.18-15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Trichloroethylene (TCE) concentrations (ppb) at three background wells and one compliance well. Six monthly measures at each background well, three monthly measures at the compliance well.
EPA.09.Ex.18.3.TCE.df
EPA.09.Ex.18.3.TCE.df
A data frame with 24 observations on the following 6 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
TCE.ppb.orig
a character vector of original TCE concentrations (ppb)
TCE.ppb
a numeric vector of TCE concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.18-19.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Xylene concentrations (ppb) at three background wells and one compliance well. Eight monthly measures at each background well; three monthly measures at the compliance well.
EPA.09.Ex.18.4.xylene.df
EPA.09.Ex.18.4.xylene.df
A data frame with 32 observations on the following 6 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Xylene.ppb.orig
a character vector of original xylene concentrations (ppb)
Xylene.ppb
a numeric vector of xylene concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.18-22.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Sulfate concentrations (mg/L) at four background wells.
EPA.09.Ex.19.1.sulfate.df
EPA.09.Ex.19.1.sulfate.df
A data frame with 25 observations on the following 7 variables.
Well
a factor indicating the well number
Month
a numeric vector indicating the month of collection
Day
a numeric vector indicating the day of the month of collection
Year
a numeric vector indicating the year of collection
Date
a Date object indicating the date of collection
Sulfate.mg.per.l
a numeric vector of sulfate concentrations (mg/L)
log.Sulfate.mg.per.l
a numeric vector of log-transformed sulfate concentrations (mg/L)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.19-17.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Chloride concentrations (mg/L) at 10 compliance wells at a solid waste landfill. One year of quarterly measures at each well.
EPA.09.Ex.19.2.chloride.df
EPA.09.Ex.19.2.chloride.df
A data frame with 40 observations on the following 2 variables.
Well
a factor indicating the well number
Chloride.mg.per.l
a numeric vector of chloride concentrations (mg/L)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.19-19.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Mercury concentrations (ppb) at four background and two compliance wells.
EPA.09.Ex.19.5.mercury.df
EPA.09.Ex.19.5.mercury.df
A data frame with 36 observations on the following 6 variables.
Event
a factor indicating the time of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Mercury.ppb.orig
a character vector of original mercury concentrations (ppb)
Mercury.ppb
a numeric vector of mercury concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.19-33.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Nickel concentrations (ppb) at a single well. Eight monthly measures during the background period and eight monthly measures during the compliance period.
EPA.09.Ex.20.1.nickel.df
EPA.09.Ex.20.1.nickel.df
A data frame with 16 observations on the following 4 variables.
Month
a factor indicating the month of collection
Year
a factor indicating the year of collection
Period
a factor indicating the period (baseline vs. compliance)
Nickel.ppb
a numeric vector of nickel concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.20-4.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
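An illustrative sketch (not the control-chart procedure of the guidance document) that compares the compliance-period nickel values against a crude limit derived from the baseline period. The factor level names ("Baseline", "Compliance") are assumed; the EnvStats package is assumed to be attached.

library(EnvStats)
df <- EPA.09.Ex.20.1.nickel.df

base <- df$Nickel.ppb[df$Period == "Baseline"]    # assumed level name
comp <- df$Nickel.ppb[df$Period == "Compliance"]  # assumed level name

# A crude Shewhart-style upper limit from the baseline data.
upper.limit <- mean(base) + 3 * sd(base)

# Flag any compliance observations exceeding the limit.
data.frame(Nickel.ppb = comp, Exceeds = comp > upper.limit)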
Aldicarb concentrations (ppb) at three compliance wells (four monthly measures at each well).
EPA.09.Ex.21.1.aldicarb.df
EPA.09.Ex.21.1.aldicarb.df
A data frame with 12 observations on the following 3 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Aldicarb.ppb
a numeric vector of aldicarb concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-4.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Benzene concentrations (ppb) collected at a landfill that previously handled smelter waste and is now undergoing remediation efforts.
EPA.09.Ex.21.2.benzene.df
EPA.09.Ex.21.2.benzene.df
A data frame with 8 observations on the following 4 variables.
Month
a numeric vector indicating the month of collection
Benzene.ppb.orig
a character vector of original benzene concentrations (ppb)
Benzene.ppb
a numeric vector of benzene concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-7.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Beryllium concentrations (ppb) at one well (four years of quarterly measures).
data(EPA.09.Ex.21.5.beryllium.df)
data(EPA.09.Ex.21.5.beryllium.df)
A data frame with 16 observations on the following 3 variables.
Year
a factor indicating the year of collection
Quarter
a factor indicating the quarter of collection
Beryllium.ppb
a numeric vector of beryllium concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-18.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Nitrate concentrations (mg/L) at a well used for drinking water.
EPA.09.Ex.21.6.nitrate.df
EPA.09.Ex.21.6.nitrate.df
A data frame with 12 observations on the following 5 variables.
Sampling.Date
a character vector indicating the sampling date
Date
a Date object indicating the sampling date
Nitrate.mg.per.l.orig
a character vector of original nitrate concentrations (mg/L)
Nitrate.mg.per.l
a numeric vector of nitrate concentrations (mg/L) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-22.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Trichloroethylene (TCE) concentrations (ppb) at a site undergoing remediation.
EPA.09.Ex.21.7.TCE.df
EPA.09.Ex.21.7.TCE.df
A data frame with 10 observations on the following 2 variables.
Month
a numeric vector indicating the month of collection
TCE.ppb
a numeric vector of TCE concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-26.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
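A minimal, illustrative sketch of checking whether TCE is declining over the course of remediation, assuming the EnvStats package is attached:

library(EnvStats)
df <- EPA.09.Ex.21.7.TCE.df

# Plot the concentrations against month of collection.
plot(df$Month, df$TCE.ppb, type = "b",
  xlab = "Month", ylab = "TCE (ppb)")

# One-sided Kendall test for a decreasing trend.
cor.test(df$Month, df$TCE.ppb, method = "kendall", alternative = "less")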
Vinyl Chloride (VC) concentrations (ppb) during detection monitoring for two compliance wells. Four years of quarterly measures at each well. Compliance monitoring began with Year 2 of the sampling record.
EPA.09.Ex.22.1.VC.df
EPA.09.Ex.22.1.VC.df
A data frame with 32 observations on the following 5 variables.
Year
a factor indicating the year of collection
Quarter
a factor indicating the quarter of collection
Period
a factor indicating the period (background vs. compliance)
Well
a factor indicating the well number
VC.ppb
a numeric vector of VC concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.22-6.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Specific conductance (µmho) collected over several years at two wells at a hazardous waste facility.
EPA.09.Ex.22.2.Specific.Conductance.df
EPA.09.Ex.22.2.Specific.Conductance.df
A data frame with 43 observations on the following 3 variables.
Well
a factor indicating the well number
Date
a Date object indicating the date of collection
Specific.Conductance.umho
a numeric vector of specific conductance (µmho)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.22-11.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Sulfate concentrations (ppm) at two background wells (five quarterly measures at each well).
EPA.09.Ex.6.3.sulfate.df
EPA.09.Ex.6.3.sulfate.df
A data frame with 10 observations on the following 4 variables.
Month
a numeric vector indicating the month the observation was taken
Year
a numeric vector indicating the year the observation was taken
Well
a factor indicating the well number
Sulfate.ppm
a numeric vector of sulfate concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-20.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Arsenic concentrations (µg/L) at a single well, consisting of:
8 historical observations,
4 future observations for Case 1, and
4 future observations for Case 2.
EPA.09.Ex.7.1.arsenic.df
EPA.09.Ex.7.1.arsenic.df
A data frame with 16 observations on the following 2 variables.
Data.Source
a factor with levels Historical, Case.1, and Case.2
Arsenic.ug.per.l
a numeric vector of arsenic concentrations (µg/L)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.7-26.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Time series of trichloroethene (TCE) concentrations (mg/L) taken at 2 separate
wells. Some observations are annotated with a data qualifier of U
(nondetect)
or J
(estimated detected concentration).
EPA.09.Table.9.1.TCE.df
EPA.09.Table.9.1.TCE.df
A data frame with 30 observations on the following 5 variables.
Date.Collected
a factor indicating the date of collection
Date
a Date object indicating the date of collection
Well
a factor indicating the well number
TCE.mg.per.L
a numeric vector of TCE concentrations (mg/L)
Data.Qualifier
a factor indicating the data qualifier
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.9-3.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Arsenic, mercury, and strontium concentrations (mg/L) from a single well
collected approximately quarterly. Nondetects are indicated by the
data qualifier U
.
EPA.09.Table.9.3.df
EPA.09.Table.9.3.df
A data frame with 15 observations on the following 8 variables.
Date.Collected
a factor indicating the date of collection
Date
a Date object indicating the date of collection
Arsenic.mg.per.L
a numeric vector of arsenic concentrations (mg/L)
Arsenic.Data.Qualifier
a factor indicating the data qualifier for arsenic
Mercury.mg.per.L
a numeric vector of mercury concentrations (mg/L)
Mercury.Data.Qualifier
a factor indicating the data qualifier for mercury
Strontium.mg.per.L
a numeric vector of strontium concentrations (mg/L)
Strontium.Data.Qualifier
a factor indicating the data qualifier for strontium
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.9-13.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Nickel concentrations (ppb) from a single well.
EPA.09.Table.9.4.nickel.vec
EPA.09.Table.9.4.nickel.vec
a numeric vector of nickel concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.9-18.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Aldicarb concentrations (ppb) at three compliance wells (four monthly samples at each well).
EPA.89b.aldicarb1.df
EPA.89b.aldicarb1.df
A data frame with 12 observations on the following 3 variables.
Aldicarb
Aldicarb concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.6-4.
Aldicarb concentrations (ppm) at three compliance wells (four monthly samples at each well).
EPA.89b.aldicarb2.df
EPA.89b.aldicarb2.df
A data frame with 12 observations on the following 3 variables.
Aldicarb
Aldicarb concentrations (ppm)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.6-13.
Benzene concentrations (ppm) at one background and five compliance wells (four monthly samples for each well).
EPA.89b.benzene.df
EPA.89b.benzene.df
A data frame with 24 observations on the following 6 variables.
Benzene.orig
a character vector of the original observations
Benzene
a numeric vector with <1
observations coded as 1
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.5-18.
Cadmium concentrations (mg/L) at one set of background and one set of compliance wells. Nondetects reported as "BDL". Detection limit not given.
EPA.89b.cadmium.df
EPA.89b.cadmium.df
A data frame with 88 observations on the following 4 variables.
Cadmium.orig
a character vector of the original cadmium observations (mg/L)
Cadmium
a numeric vector with BDL
coded as 0
Censored
a logical vector indicating which observations are censored
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.8-6.
Chlordane concentrations (ppm) in 24 water samples. Two possible phases: dissolved (18 observations) and immiscible (6 observations).
EPA.89b.chlordane1.df
EPA.89b.chlordane1.df
A data frame with 24 observations on the following 2 variables.
Chlordane
Chlordane concentrations (ppm)
Phase
a factor indicating the phase (dissolved vs. immiscible)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.4-8.
Chlordane concentrations (ppb) at one background and one compliance well. Observations taken during four separate months over two years. Four replicates taken for each “month/year/well type” combination.
data(EPA.89b.chlordane2.df)
data(EPA.89b.chlordane2.df)
A data frame with 32 observations on the following 5 variables.
Chlordane
Chlordane concentration (ppb)
Month
a factor indicating the month of collection
Year
a numeric vector indicating the year of collection (85 or 86)
Replicate
a factor indicating the replicate number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.5-27.
EDB concentrations (ppb) at three compliance wells (four monthly samples at each well).
EPA.89b.edb.df
EPA.89b.edb.df
A data frame with 12 observations on the following 3 variables.
EDB
EDB concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.6-6.
Lead concentrations (ppm) at two background and four compliance wells (four monthly samples for each well).
EPA.89b.lead.df
EPA.89b.lead.df
A data frame with 24 observations on the following 4 variables.
Lead
Lead concentrations (ppm)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.5-23.
Log-transformed lead concentrations (µg/L) at two background and four
compliance wells (four monthly samples for each well).
EPA.89b.loglead.df
EPA.89b.loglead.df
A data frame with 24 observations on the following 4 variables.
LogLead
Natural logarithm of lead concentrations (µg/L)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.5-11.
Manganese concentrations at six monitoring wells (four monthly samples for each well).
EPA.89b.manganese.df
EPA.89b.manganese.df
A data frame with 24 observations on the following 3 variables.
Manganese
Manganese concentrations
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.4-19.
Sulfate concentrations (mg/L). Nondetects reported as <1450
.
data(EPA.89b.sulfate.df)
data(EPA.89b.sulfate.df)
A data frame with 24 observations on the following 3 variables.
Sulfate.orig
a character vector of original sulfate concentration (mg/L)
Sulfate
a numeric vector of sulfate concentrations with <1450
coded as 1450
Censored
a logical vector indicating which observations are censored
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.8-9.
T-29 concentrations (ppm) at two compliance wells (four monthly samples at each well, four replicates within each month). Detection limit is not given.
EPA.89b.t29.df
EPA.89b.t29.df
A data frame with 32 observations on the following 6 variables.
T29.orig
a character vector of the original T-29 concentrations (ppm)
T29
a numeric vector of T-29 concentrations with <?
coded as 0
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Replicate
a factor indicating the replicate number
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.6-10.
Numeric vector containing total organic carbon (TOC) concentrations (mg/L).
EPA.89b.toc.vec
EPA.89b.toc.vec
A numeric vector with 19 elements containing TOC concentrations (mg/L).
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.8-13.
Arsenic concentrations (ppm) at six monitoring wells (four monthly samples for each well).
EPA.92c.arsenic1.df
EPA.92c.arsenic1.df
A data frame with 24 observations on the following 3 variables.
Arsenic
Arsenic concentrations (ppm)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.21.
Arsenic concentrations (ppb) at three background wells and one compliance well
(six monthly samples for each well; first four missing at compliance well). Nondetects
reported as <5
.
EPA.92c.arsenic2.df
EPA.92c.arsenic2.df
A data frame with 24 observations on the following 6 variables.
Arsenic.orig
a character vector of original arsenic concentrations (ppb)
Arsenic
a numeric vector of arsenic concentrations with <5
coded as 5
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.60.
Arsenic concentrations at one background and one compliance monitoring well. Three years of observations for the background well, two years of observations for the compliance well, four samples per year for each well.
EPA.92c.arsenic3.df
EPA.92c.arsenic3.df
A data frame with 20 observations on the following 3 variables.
Arsenic
a numeric vector of arsenic concentrations
Year
a factor indicating the year of collection
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
Benzene concentrations (ppb) at six background wells
(six monthly samples for each well). Nondetects reported as <2
.
EPA.92c.benzene1.df
EPA.92c.benzene1.df
A data frame with 36 observations on the following 5 variables.
Benzene.orig
a character vector of original benzene concentrations (ppb)
Benzene
a numeric vector of benzene concentrations with <2
coded as 2
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.36.
Benzene concentrations (ppb) at one background and one compliance well. Four observations per month for each well. Background well sampled in months 1, 2, and 3; compliance well sampled in months 4 and 5.
EPA.92c.benzene2.df
EPA.92c.benzene2.df
A data frame with 20 observations on the following 3 variables.
Benzene
a numeric vector of benzene concentrations (ppb)
Month
a factor indicating the month of collection
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.56.
Carbon tetrachloride (CCL4) concentrations (ppb) at five wells (four monthly samples at each well).
EPA.92c.ccl4.df
EPA.92c.ccl4.df
A data frame with 20 observations on the following 3 variables.
CCL4
a numeric vector of carbon tetrachloride concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.80.
Chrysene concentrations (ppb) at five compliance wells (four monthly samples for each well).
EPA.92c.chrysene.df
EPA.92c.chrysene.df
A data frame with 20 observations on the following 3 variables.
Chrysene
a numeric vector of chrysene concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.52.
Copper concentrations (ppb) at two background wells and one compliance well (six monthly samples for each well).
EPA.92c.copper1.df
EPA.92c.copper1.df
A data frame with 18 observations on the following 4 variables.
Copper
a numeric vector of copper concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.47.
Copper concentrations (ppb) at three background and two compliance wells
(eight monthly samples for each well; first four missing at compliance wells).
Nondetects reported as <5
.
EPA.92c.copper2.df
EPA.92c.copper2.df
A data frame with 40 observations on the following 6 variables.
Copper.orig
a character vector of original copper concentrations (ppb)
Copper
a numeric vector of copper concentrations with <5
coded as 5
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.55.
Log-transformed nickel concentrations (ppb) at four monitoring wells (five monthly samples for each well).
EPA.92c.lognickel1.df
EPA.92c.lognickel1.df
A data frame with 20 observations on the following 3 variables.
LogNickel
a numeric vector of log-transformed nickel concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.15.
Nickel concentrations (ppb) at four monitoring wells (five monthly samples for each well).
EPA.92c.nickel1.df
EPA.92c.nickel1.df
A data frame with 20 observations on the following 3 variables.
Nickel
a numeric vector of nickel concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.7.
Nickel concentrations (ppb) at a monitoring well (eight months of samples, two samples for each sampling occasion).
EPA.92c.nickel2.df
EPA.92c.nickel2.df
A data frame with 16 observations on the following 3 variables.
Nickel
a numeric vector of nickel concentrations (ppb)
Month
a factor indicating the month of collection
Sample
a factor indicating the sample (replicate) number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.78.
Toluene concentrations (ppb) at two background and three compliance wells
(five monthly samples at each well). Nondetects reported as <5
.
EPA.92c.toluene.df
EPA.92c.toluene.df
A data frame with 25 observations on the following 6 variables.
Toluene.orig
a character vector of original toluene concentrations (ppb)
Toluene
a numeric vector of toluene concentrations with <5
coded as 5
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.43.
Zinc concentrations (ppb) at five background wells
(eight samples for each well). Nondetects reported as <7
.
EPA.92c.zinc.df
EPA.92c.zinc.df
A data frame with 40 observations on the following 5 variables.
Zinc.orig
a character vector of original zinc concentrations (ppb)
Zinc
a numeric vector of zinc concentrations with <7
coded as 7
Censored
a logical vector indicating which observations are censored
Sample
a factor indicating the sample number
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.30.
Chromium concentrations (mg/kg) in soil samples collected randomly over a Superfund site.
EPA.92d.chromium.df
EPA.92d.chromium.df
A data frame with 15 observations on the following variable.
Cr
a numeric vector of chromium concentrations (mg/kg)
USEPA. (1992d). Supplemental Guidance to RAGS: Calculating the Concentration Term. Publication 9285.7-081, May 1992. Intermittent Bulletin, Volume 1, Number 1. Office of Emergency and Remedial Response, Hazardous Site Evaluation Division, OS-230. Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Chromium concentrations (mg/kg) in soil samples collected randomly over a Superfund site.
EPA.92d.chromium.vec
EPA.92d.chromium.vec
A numeric vector with 15 observations.
USEPA. (1992d). Supplemental Guidance to RAGS: Calculating the Concentration Term. Publication 9285.7-081, May 1992. Intermittent Bulletin, Volume 1, Number 1. Office of Emergency and Remedial Response, Hazardous Site Evaluation Division, OS-230. Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Lead concentrations (mg/Kg) in soil samples at a reference area and a
cleanup area. Nondetects reported as <39
. There are 14 observations
for each area.
EPA.94b.lead.df
EPA.94b.lead.df
A data frame with 28 observations on the following 4 variables.
Lead.orig
a character vector of original lead concentrations (mg/Kg)
Lead
a numeric vector of lead concentrations with <39
coded as 39
Censored
a logical vector indicating which observations are censored
Area
a factor indicating the area (cleanup vs. reference)
USEPA. (1994b). Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media. EPA/230-R-94-004. Office of Policy, Planning, and Evaluation, U.S. Environmental Protection Agency, Washington, D.C. pp.6.20–6.21.
1,2,3,4-Tetrachlorobenzene (TcCB) concentrations (ppb) in soil samples at a
reference area and a cleanup area. There are 47 observations for the reference area
and 77 for the cleanup area. There is only one nondetect in the dataset (it's in the
cleanup area), and it is reported as ND
. Here it is assumed the nondetect is
less than the smallest reported value, which is 0.09 ppb. Note that on page 6.23 of
USEPA (1994b), a value of 25.5 for the Cleanup Unit was erroneously omitted.
EPA.94b.tccb.df
EPA.94b.tccb.df
A data frame with 124 observations on the following 4 variables.
TcCB.orig
a character vector with the original tetrachlorobenzene concentrations (ppb)
TcCB
a numeric vector of TcCB concentrations with the nondetect coded as 0.09 (the smallest reported value)
Censored
a logical vector indicating which observations are censored
Area
a factor indicating the area (cleanup vs. reference)
USEPA. (1994b). Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media. EPA/230-R-94-004. Office of Policy, Planning, and Evaluation, U.S. Environmental Protection Agency, Washington, D.C. pp.6.22-6.25.
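As a quick, non-authoritative illustration of comparing the cleanup and reference areas in this data set (assuming the EnvStats package is attached), the concentrations are typically examined on the log scale:

library(EnvStats)
df <- EPA.94b.tccb.df

# Summary statistics by area.
with(df, tapply(TcCB, Area, summary))

# Boxplots of log-transformed concentrations for the two areas.
boxplot(log(TcCB) ~ Area, data = df, ylab = "log(TcCB) (log ppb)")

# A simple nonparametric comparison of the two areas.
wilcox.test(TcCB ~ Area, data = df)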
Calibration data for cadmium at mass 111 (ng/L; method 1638 ICPMS) that appeared in Gibbons et al. (1997b) and were provided to them by the U.S. EPA.
EPA.97.cadmium.111.df
EPA.97.cadmium.111.df
A data frame with 35 observations on the following 2 variables.
Cadmium
Observed concentration of cadmium (ng/L)
Spike
“True” concentration of cadmium taken from a standard (ng/L)
Gibbons, R.D., D.E. Coleman, and R.F. Maddalone. (1997b). Response to Comment on "An Alternative Minimum Level Definition for Analytical Quantification". Environmental Science and Technology, 31(12), 3729–3731.
Estimate the location and shape parameters of a Pareto distribution.
epareto(x, method = "mle", plot.pos.con = 0.375)
epareto(x, method = "mle", plot.pos.con = 0.375)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
"mle" (maximum likelihood; the default) and "lse" (least-squares). |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the values of the empirical cdf (used only when method="lse"). The default value is
plot.pos.con=0.375. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let be a vector of
observations from a Pareto distribution with
parameters
location=
and
shape=
.
Maximum Likelihood Estimatation (method="mle"
)
The maximum likelihood estimators (mle's) of and
are
given by (Evans et al., 1993; p.122; Johnson et al., 1994, p.581):
where denotes the first order statistic (i.e., the minimum value).
Least-Squares Estimation (method="lse")
The least-squares estimators (lse's) of \(\eta\) and \(\theta\) are derived as
follows. Let \(X\) denote a Pareto random variable with parameters
location=\(\eta\) and shape=\(\theta\). It can be shown that
\[\log[1 - F(x)] = \theta \log(\eta) - \theta \log(x)\]
where \(F\) denotes the cumulative distribution function of \(X\). Set
\[y_i = \log[1 - \hat{F}(x_i)], \qquad z_i = \log(x_i)\]
where \(\hat{F}(x_i)\) denotes the empirical cumulative distribution function
evaluated at \(x_i\). The least-squares estimates of \(\eta\) and \(\theta\)
are obtained by solving the regression equation
\[y_i = \beta_0 + \beta_1 z_i + \epsilon_i\]
and setting
\[\hat{\theta}_{lse} = -\hat{\beta}_1, \qquad \hat{\eta}_{lse} = \exp\!\left( \frac{\hat{\beta}_0}{\hat{\theta}_{lse}} \right)\]
(Johnson et al., 1994, p.580).
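As a rough illustration of the least-squares procedure described above, here is a minimal sketch in base R (not the internal code of epareto, and assuming the plotting-position formula (i - a)/(n + 1 - 2a) with the default constant a = 0.375 for the empirical cdf):

# Hand-rolled least-squares fit, for comparison with epareto(..., method = "lse")
set.seed(479)
x <- rpareto(25, location = 2, shape = 1.5)
n <- length(x)
a <- 0.375                                   # plotting position constant (assumed formula)
F.hat <- (rank(x) - a) / (n + 1 - 2 * a)     # empirical cdf at each observation
fit <- lm(log(1 - F.hat) ~ log(x))           # regression of y on z
theta.lse <- -coef(fit)[[2]]
eta.lse <- exp(coef(fit)[[1]] / theta.lse)
c(location = eta.lse, shape = theta.lse)
epareto(x, method = "lse", plot.pos.con = a)$parameters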
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The Pareto distribution is named after Vilfredo Pareto (1848-1923), a professor
of economics. It is derived from Pareto's law, which states that the number of
persons \(N\) having income at least \(x\) is given by:
\[N = A x^{-\theta}\]
where \(\theta\) denotes Pareto's constant and is the shape parameter for the
probability distribution.
The Pareto distribution takes values on the positive real line. All values must be
larger than the “location” parameter \(\eta\), which is really a threshold
parameter. There are three kinds of Pareto distributions. The one described here
is the Pareto distribution of the first kind. Stable Pareto distributions have
\(0 < \theta < 2\). Note that the \(r\)'th moment only exists if \(r < \theta\).
The Pareto distribution is related to the
exponential distribution and
logistic distribution as follows.
Let \(X\) denote a Pareto random variable with
location=\(\eta\) and shape=\(\theta\).
Then \(\log(X/\eta)\) has an exponential distribution
with parameter rate=\(\theta\), and
\(-\log\{[(X/\eta)^{\theta}] - 1\}\)
has a logistic distribution with parameters
location=\(0\) and scale=\(1\).
The Pareto distribution has a very long right-hand tail. It is often applied in the study of socioeconomic data, including the distribution of income, firm size, population, and stock price fluctuations.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
# Generate 30 observations from a Pareto distribution with parameters
# location=1 and shape=1 then estimate the parameters.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- rpareto(30, location = 1, shape = 1)
epareto(dat)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Pareto
#
#Estimated Parameter(s):          location = 1.009046
#                                 shape    = 1.079850
#
#Estimation Method:               mle
#
#Data:                            dat
#
#Sample Size:                     30

#----------

# Compare the results of using the least-squares estimators:

epareto(dat, method="lse")$parameters

#location    shape
#1.085924 1.144180

#----------

# Clean up
#---------
rm(dat)
Produces an empirical probability density function plot.
epdfPlot(x, discrete = FALSE, density.arg.list = NULL, plot.it = TRUE,
    add = FALSE, epdf.col = "black", epdf.lwd = 3 * par("cex"), epdf.lty = 1,
    curve.fill = FALSE, curve.fill.col = "cyan", ...,
    type = ifelse(discrete, "h", "l"),
    main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
discrete |
logical scalar indicating whether the assumed parent distribution of |
density.arg.list |
list with arguments to the |
plot.it |
logical scalar indicating whether to produce a plot or add to the current plot (see |
add |
logical scalar indicating whether to add the empirical pdf to the current plot
( |
epdf.col |
a numeric scalar or character string determining the color of the empirical pdf
line or points. The default value is |
epdf.lwd |
a numeric scalar determining the width of the empirical pdf line.
The default value is |
epdf.lty |
a numeric scalar determining the line type of the empirical pdf line.
The default value is |
curve.fill |
a logical scalar indicating whether to fill in the area below the empirical pdf
curve with the
color specified by |
curve.fill.col |
a numeric scalar or character string indicating what color to use to fill in the
area below the empirical pdf curve. The default value is
|
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
When a distribution is discrete and can only take on a finite number of values,
the empirical pdf plot is the same as the standard relative frequency histogram;
that is, each bar of the histogram represents the proportion of the sample
equal to that particular number (or category). When a distribution is continuous,
the function epdfPlot
calls the R function density
to
compute the estimated probability density at a number of evenly spaced points
between the minimum and maximum values.
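The following minimal sketch (an illustration using base R functions, not the internals of epdfPlot) shows the two computations described above:

# Discrete case: the empirical pdf is just the relative frequency of each value
set.seed(123)
x.disc <- rpois(50, lambda = 4)
rel.freq <- table(x.disc) / length(x.disc)
rel.freq

# Continuous case: a kernel density estimate over the range of the data
x.cont <- rlnorm(50)
dens <- density(x.cont)
head(cbind(x = dens$x, f.x = dens$y))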
epdfPlot
invisibly returns a list with the following components:
x |
numeric vector of ordered quantiles. |
f.x |
numeric vector of the associated estimated values of the pdf. |
An empirical probability density function (epdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms and boxplots to assess the characteristics of a set of data.
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA.
See the REFERENCES section in the help file for density
.
Empirical, pdfPlot
, ecdfPlot
,
cdfPlot
, cdfCompare
, qqPlot
.
# Using Reference Area TcCB data in EPA.94b.tccb.df,
# create a histogram of the log-transformed observations,
# then superimpose the empirical pdf plot.

dev.new()
log.TcCB <- with(EPA.94b.tccb.df, log(TcCB[Area == "Reference"]))

hist(log.TcCB, freq = FALSE, xlim = c(-2, 1), col = "cyan",
    xlab = "log [ TcCB (ppb) ]", ylab = "Relative Frequency",
    main = "Reference Area TcCB with Empirical PDF")

epdfPlot(log.TcCB, add = TRUE)

#==========

# Generate 20 observations from a Poisson distribution with
# parameter lambda = 10, and plot the empirical PDF.

set.seed(875)
x <- rpois(20, lambda = 10)
dev.new()
epdfPlot(x, discrete = TRUE)

#==========

# Clean up
#---------
rm(log.TcCB, x)
graphics.off()
Estimate the mean of a Poisson distribution, and optionally construct a confidence interval for the mean.
epois(x, method = "mle/mme/mvue", ci = FALSE, ci.type = "two-sided", ci.method = "exact", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Currently the only possible
value is |
ci |
logical scalar indicating whether to compute a confidence interval for the
location or scale parameter. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the location or scale parameter. Possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let \(\underline{x} = (x_1, x_2, \ldots, x_n)\) be a vector of \(n\) observations from a Poisson distribution with
parameter lambda=\(\lambda\). It can be shown (e.g., Forbes et al., 2011)
that if \(y\) is defined as:
\[y = \sum_{i=1}^{n} x_i \;\;\;\; (1)\]
then \(y\) is an observation from a Poisson distribution with parameter
lambda=\(n\lambda\).

Estimation
The maximum likelihood, method of moments, and minimum variance unbiased estimator
(mle/mme/mvue) of \(\lambda\) is given by:
\[\hat{\lambda} = \bar{x} \;\;\;\; (2)\]
where
\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{y}{n} \;\;\;\; (3)\]
Confidence Intervals
There are three possible ways to construct a confidence interval for
\(\lambda\): based on the exact distribution of the estimator of \(\lambda\)
(ci.method="exact"), based on an approximation of
Pearson and Hartley (ci.method="pearson.hartley.approx"), or based on the
normal approximation (ci.method="normal.approx").
Exact Confidence Interval (ci.method="exact")
If ci.type="two-sided", an exact \((1-\alpha)100\%\) confidence interval
for \(\lambda\) can be constructed as \([LCL, UCL]\), where the confidence
limits are computed such that:
\[Pr[Y \ge y \,|\, \lambda = LCL] = \frac{\alpha}{2} \;\;\;\; (4)\]
\[Pr[Y \le y \,|\, \lambda = UCL] = \frac{\alpha}{2} \;\;\;\; (5)\]
where \(y\) is defined in equation (1) and \(Y\) denotes a Poisson random
variable with parameter lambda=\(n\lambda\).
If ci.type="lower", \(\alpha/2\) is replaced with \(\alpha\) in
equation (4) and \(UCL\) is set to \(\infty\).
If ci.type="upper", \(\alpha/2\) is replaced with \(\alpha\) in
equation (5) and \(LCL\) is set to 0.
Note that an exact upper confidence bound can be computed even when all
observations are 0.
Pearson-Hartley Approximation (ci.method="pearson.hartley.approx")
For a two-sided \((1-\alpha)100\%\) confidence interval for \(\lambda\), the
Pearson and Hartley approximation (Zar, 2010, p.587; Pearson and Hartley, 1970, p.81)
is given by:
\[\frac{\chi^2_{2y,\, \alpha/2}}{2n} \le \lambda \le \frac{\chi^2_{2(y+1),\, 1-\alpha/2}}{2n} \;\;\;\; (6)\]
where \(\chi^2_{\nu, p}\) denotes the \(p\)'th quantile of the
chi-square distribution with \(\nu\) degrees of freedom.
One-sided confidence intervals are computed in a similar fashion.
Normal Approximation (ci.method="normal.approx")
An approximate \((1-\alpha)100\%\) confidence interval for \(\lambda\) can be
constructed assuming the distribution of the estimator of \(\lambda\) is
approximately normally distributed. A two-sided confidence interval is constructed
as:
\[[\hat{\lambda} - z_{1-\alpha/2}\,\hat{\sigma}_{\hat{\lambda}}, \;\; \hat{\lambda} + z_{1-\alpha/2}\,\hat{\sigma}_{\hat{\lambda}}] \;\;\;\; (7)\]
where \(z_p\) is the \(p\)'th quantile of the standard normal distribution, and
the quantity
\[\hat{\sigma}_{\hat{\lambda}} = \sqrt{\hat{\lambda}/n} \;\;\;\; (8)\]
denotes the estimated asymptotic standard deviation of the estimator of \(\lambda\).
One-sided confidence intervals are constructed in a similar manner.
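The limits above can also be computed directly in base R. The sketch below illustrates the formulas (using the chi-square representation of the Poisson tail probabilities for the exact limits); it is an illustration of the equations, not a reproduction of the internal code of epois:

set.seed(250)
dat <- rpois(20, lambda = 2)
n <- length(dat)
y <- sum(dat)                       # equation (1)
lambda.hat <- y / n                 # equations (2)-(3)
alpha <- 0.10

# Exact-style limits via chi-square quantiles (cf. equation (6))
c(LCL = qchisq(alpha/2, 2 * y) / (2 * n),
  UCL = qchisq(1 - alpha/2, 2 * (y + 1)) / (2 * n))

# Normal-approximation limits (equations (7)-(8))
lambda.hat + c(-1, 1) * qnorm(1 - alpha/2) * sqrt(lambda.hat / n)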
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The Poisson distribution is named after Poisson, who
derived this distribution as the limiting distribution of the
binomial distribution with parameters size=\(n\) and
prob=\(p\), where \(n\) tends to infinity,
\(p\) tends to 0, and \(np = \lambda\) stays constant.
In this context, the Poisson distribution was used by Bortkiewicz (1898) to model
the number of deaths (per annum) from kicks by horses in Prussian Army Corps. In
this case, \(p\), the probability of death from this cause, was small, but the
number of soldiers exposed to this risk, \(n\), was large.
The Poisson distribution has been applied in a variety of fields, including quality control (modeling number of defects produced in a process), ecology (number of organisms per unit area), and queueing theory. Gibbons (1987b) used the Poisson distribution to model the number of detected compounds per scan of the 32 volatile organic priority pollutants (VOC), and also to model the distribution of chemical concentration (in ppb).
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gibbons, R.D. (1987b). Statistical Models for the Analysis of Volatile Organic Compounds in Waste Disposal Sites. Ground Water 25, 572-580.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 4.
Pearson, E.S., and H.O. Hartley, eds. (1970). Biometrika Tables for Statisticians, Volume 1. Cambridge University Press, New York, p.81.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, pp. 585–586.
# Generate 20 observations from a Poisson distribution with parameter
# lambda=2, then estimate the parameter and construct a 90% confidence
# interval.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- rpois(20, lambda = 2)
epois(dat, ci = TRUE, conf.level = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Poisson
#
#Estimated Parameter(s):          lambda = 1.8
#
#Estimation Method:               mle/mme/mvue
#
#Data:                            dat
#
#Sample Size:                     20
#
#Confidence Interval for:         lambda
#
#Confidence Interval Method:      exact
#
#Confidence Interval Type:        two-sided
#
#Confidence Level:                90%
#
#Confidence Interval:             LCL = 1.336558
#                                 UCL = 2.377037

#----------

# Compare the different ways of constructing confidence intervals for
# lambda using the same data as in the previous example:

epois(dat, ci = TRUE, ci.method = "pearson", conf.level = 0.9)$interval$limits
#     LCL      UCL
#1.336558 2.377037

epois(dat, ci = TRUE, ci.method = "normal.approx", conf.level = 0.9)$interval$limits
#     LCL      UCL
#1.306544 2.293456

#----------

# Clean up
#---------
rm(dat)
Estimate the mean of a Poisson distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
epoisCensored(x, censored, method = "mle", censoring.side = "left",
    ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided",
    conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z",
    ci.sample.size = sum(!censored))
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
method |
character string specifying the method of estimation. The possible values are:
|
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when |
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when |
ci.sample.size |
numeric scalar indicating what sample size to assume to construct the
confidence interval for the mean if |
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let \(\underline{x} = (x_1, x_2, \ldots, x_N)\) denote a vector of \(N\)
observations from a Poisson distribution with mean lambda=\(\lambda\).
Assume \(n\) (\(0 < n < N\)) of these observations are known and
\(c\) (\(c = N - n\)) of these observations are
all censored below (left-censored) or all censored above (right-censored) at
\(k\) fixed censoring levels
\[T_1, T_2, \ldots, T_k; \;\; k \ge 1 \;\;\;\; (1)\]
For the case when \(k \ge 2\), the data are said to be Type I
multiply censored. For the case when \(k = 1\),
set \(T = T_1\). If the data are left-censored
and all \(n\) known observations are greater
than or equal to \(T\), or if the data are right-censored and all \(n\)
known observations are less than or equal to \(T\), then the data are
said to be Type I singly censored (Nelson, 1982, p.7), otherwise
they are considered to be Type I multiply censored.
Let \(c_j\) denote the number of observations censored below or above censoring
level \(T_j\) for \(j = 1, 2, \ldots, k\), so that
\[\sum_{j=1}^{k} c_j = c \;\;\;\; (2)\]
Let \(x_{(1)}, x_{(2)}, \ldots, x_{(N)}\) denote the “ordered” observations,
where now “observation” means either the actual observation (for uncensored
observations) or the censoring level (for censored observations). For
right-censored data, if a censored observation has the same value as an
uncensored one, the uncensored observation should be placed first.
For left-censored data, if a censored observation has the same value as an
uncensored one, the censored observation should be placed first.
Note that in this case the quantity \(x_{(i)}\) does not necessarily represent
the \(i\)'th “largest” observation from the (unknown) complete sample.
Finally, let \(\Omega\) (omega) denote the set of \(n\) subscripts in the
“ordered” sample that correspond to uncensored observations.
Estimation
Maximum Likelihood Estimation (method="mle")
For Type I left censored data, the likelihood function is given by:
\[L(\lambda \,|\, \underline{x}) = \frac{N!}{c_1! c_2! \cdots c_k! \, n!} \prod_{j=1}^{k} [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (3)\]
where \(f\) and \(F\) denote the probability density function (pdf) and
cumulative distribution function (cdf) of the population
(Cohen, 1963; Cohen, 1991, pp.6, 50). That is,
\[f(t) = \frac{e^{-\lambda} \lambda^{t}}{t!} \;\;\;\; (4)\]
\[F(t) = \sum_{j=0}^{t} f(j) \;\;\;\; (5)\]
(Johnson et al., 1992, p.151). For left singly censored data, equation (3) simplifies to:
\[L(\lambda \,|\, \underline{x}) = {N \choose c} [F(T)]^{c} \prod_{i=1}^{n} f[x_{(i)}] \;\;\;\; (6)\]
Similarly, for Type I right censored data, the likelihood function is given by:
\[L(\lambda \,|\, \underline{x}) = \frac{N!}{c_1! c_2! \cdots c_k! \, n!} \prod_{j=1}^{k} [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (7)\]
and for right singly censored data this simplifies to:
\[L(\lambda \,|\, \underline{x}) = {N \choose c} [1 - F(T)]^{c} \prod_{i=1}^{n} f[x_{(i)}] \;\;\;\; (8)\]
The maximum likelihood estimators are computed by maximizing the likelihood function.
For right-censored data, taking the derivative of the log-likelihood function
with respect to \(\lambda\) and setting this to 0 produces the following equation:
\[\frac{\bar{x}}{\lambda} - 1 + \frac{1}{n} \sum_{j=1}^{k} c_j \frac{f(T_j)}{1 - F(T_j)} = 0 \;\;\;\; (9)\]
where
\[\bar{x} = \frac{1}{n} \sum_{i \in \Omega} x_{(i)} \;\;\;\; (10)\]
Note that the quantity defined in equation (10) is simply the mean of the uncensored observations.
For left-censored data, taking the derivative of the log-likelihood function with
respect to \(\lambda\) and setting this to 0 produces the following equation:
\[\frac{\bar{x}}{\lambda} - 1 - \frac{1}{n} \sum_{j=1}^{k} c_j \frac{f(T_j)}{F(T_j)} = 0 \;\;\;\; (11)\]
The function epoisCensored computes the maximum likelihood estimator
of \(\lambda\) by solving Equation (9) (right-censored data) or
Equation (11) (left-censored data); it uses the sample mean of
the uncensored observations as the initial value.
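A minimal sketch of the idea, maximizing a left-censored Poisson log-likelihood directly (this assumes a censored value contributes ppois(T, lambda), i.e., that it is known only to be at or below its censoring level; the internal details of epoisCensored may differ):

loglik <- function(lambda, x, censored) {
  # uncensored values contribute the pdf, censored values the cdf at their level
  sum(dpois(x[!censored], lambda, log = TRUE)) +
    sum(ppois(x[censored], lambda, log.p = TRUE))
}
set.seed(300)
x <- rpois(30, lambda = 10)
censored <- x < 8
x[censored] <- 8                     # censoring level
optimize(loglik, interval = c(0.01, 50), x = x, censored = censored,
    maximum = TRUE)$maximum          # compare with epoisCensored(x, censored)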
Setting Censored Observations to Half the Censoring Level (method="half.cen.level"
)
This method is applicable only to left censored data.
This method involves simply replacing all the censored observations with half their
detection limit, and then computing the mean and standard deviation with the usual
formulas (see epois
).
This method is included only to allow comparison of this method to other methods.
Setting left-censored observations to half the censoring level is not
recommended.
Confidence Intervals
This section explains how confidence intervals for the mean \(\lambda\) are
computed.

Likelihood Profile (ci.method="profile.likelihood")
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean \(\lambda\). Equation (3) above
shows the form of the likelihood function \(L(\lambda \,|\, \underline{x})\) for
multiply left-censored data, and Equation (7) shows the function for
multiply right-censored data.
Following Stryhn and Christensen (2003), denote the maximum likelihood estimate
of the mean by \(\lambda^{*}\). The likelihood
ratio test statistic (\(G^2\)) of the hypothesis
\(H_0: \lambda = \lambda_0\) (where \(\lambda_0\) is a fixed value) equals the drop in
\(2 \log(L)\) between the “full” model and the reduced model with
\(\lambda\) fixed at \(\lambda_0\), i.e.,
\[G^2 = 2 \{\log[L(\lambda^{*})] - \log[L(\lambda_0)]\}\]
Under the null hypothesis, the test statistic \(G^2\) follows a
chi-squared distribution with 1 degree of freedom.
A two-sided \((1-\alpha)100\%\) confidence interval for the mean \(\lambda\)
consists of all values of \(\lambda_0\) for which the test is not significant at
level \(\alpha\):
\[\lambda_0 : G^2 \le \chi^2_{1,\, 1-\alpha} \;\;\;\; (12)\]
where \(\chi^2_{\nu, p}\) denotes the \(p\)'th quantile of the
chi-squared distribution with \(\nu\) degrees of freedom.
One-sided lower and one-sided upper confidence intervals are computed in a similar
fashion, except that the quantity \(1-\alpha\) in Equation (12) is replaced with
\(1-2\alpha\).
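To make the inversion concrete, here is a minimal sketch for the simple uncensored case (an illustration of equation (12) only, not the code used by epoisCensored):

profile.ci <- function(x, conf.level = 0.95) {
  loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
  lambda.hat <- mean(x)
  crit <- qchisq(conf.level, df = 1)
  # G^2 minus the chi-square critical value; roots give the interval endpoints
  g2.minus.crit <- function(lambda) 2 * (loglik(lambda.hat) - loglik(lambda)) - crit
  LCL <- uniroot(g2.minus.crit, c(1e-8, lambda.hat))$root
  UCL <- uniroot(g2.minus.crit, c(lambda.hat, lambda.hat + 10 * sqrt(lambda.hat) + 1))$root
  c(LCL = LCL, UCL = UCL)
}
set.seed(42)
profile.ci(rpois(25, lambda = 3))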
Normal Approximation (ci.method="normal.approx")
This method constructs approximate \((1-\alpha)100\%\) confidence intervals for
\(\lambda\) based on the assumption that the estimator of \(\lambda\) is
approximately normally distributed. That is, a two-sided \((1-\alpha)100\%\)
confidence interval for \(\lambda\) is constructed as:
\[[\hat{\lambda} - t_{1-\alpha/2,\, m-1}\,\hat{\sigma}_{\hat{\lambda}}, \;\; \hat{\lambda} + t_{1-\alpha/2,\, m-1}\,\hat{\sigma}_{\hat{\lambda}}] \;\;\;\; (13)\]
where \(\hat{\lambda}\) denotes the estimate of \(\lambda\),
\(\hat{\sigma}_{\hat{\lambda}}\) denotes the estimated asymptotic standard
deviation of the estimator of \(\lambda\), \(m\) denotes the assumed sample
size for the confidence interval, and \(t_{p,\nu}\) denotes the \(p\)'th
quantile of Student's t-distribution with \(\nu\)
degrees of freedom. One-sided confidence intervals are computed in a
similar fashion.
The argument ci.sample.size determines the value of \(m\) and by
default is equal to the number of uncensored observations.
This is simply an ad-hoc method of constructing
confidence intervals and is not based on any published theoretical results.
When pivot.statistic="z", the \(p\)'th quantile from the
standard normal distribution is used in place of the
\(p\)'th quantile from Student's t-distribution.
When \(\lambda\) is estimated with the maximum likelihood estimator
(method="mle"), the variance of \(\hat{\lambda}\) is
estimated based on the inverse of the Fisher Information matrix. When
\(\lambda\) is estimated using the half-censoring-level method
(method="half.cen.level"), the variance of \(\hat{\lambda}\) is
estimated as:
\[\hat{\sigma}^2_{\hat{\lambda}} = \frac{\hat{\lambda}}{m} \;\;\;\; (14)\]
where \(m\) denotes the assumed sample size (see above).
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate \((1-\alpha)100\%\) confidence interval
for the population mean \(\lambda\), the bootstrap can be broken down into the
following steps:

1. Create a bootstrap sample by taking a random sample of size \(N\) from
the observations in \(\underline{x}\), where sampling is done with
replacement. Note that because sampling is done with replacement, the same
element of \(\underline{x}\) can appear more than once in the bootstrap
sample. Thus, the bootstrap sample will usually not look exactly like the
original sample (e.g., the number of censored observations in the bootstrap
sample will often differ from the number of censored observations in the
original sample).

2. Estimate \(\lambda\) based on the bootstrap sample created in Step 1, using
the same method that was used to estimate \(\lambda\) using the original
observations in \(\underline{x}\). Because the bootstrap sample usually
does not match the original sample, the estimate of \(\lambda\) based on the
bootstrap sample will usually differ from the original estimate based on
\(\underline{x}\).

3. Repeat Steps 1 and 2 \(B\) times, where \(B\) is some large number.
For the function epoisCensored, the number of bootstraps \(B\) is
determined by the argument n.bootstraps (see the section ARGUMENTS above).
The default value of n.bootstraps is 1000.

4. Use the \(B\) estimated values of \(\lambda\) to compute the empirical
cumulative distribution function of this estimator of \(\lambda\) (see
ecdfPlot), and then create a confidence interval for \(\lambda\)
based on this estimated cdf.
The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:
\[[\hat{G}^{-1}(\alpha/2), \;\; \hat{G}^{-1}(1-\alpha/2)] \;\;\;\; (15)\]
where \(\hat{G}(t)\) denotes the empirical cdf evaluated at \(t\) and thus
\(\hat{G}^{-1}(p)\) denotes the \(p\)'th empirical quantile, that is,
the \(p\)'th quantile associated with the empirical cdf. Similarly, a one-sided lower
confidence interval is computed as:
\[[\hat{G}^{-1}(\alpha), \;\; \infty] \;\;\;\; (16)\]
and a one-sided upper confidence interval is computed as:
\[[0, \;\; \hat{G}^{-1}(1-\alpha)] \;\;\;\; (17)\]
The function epoisCensored calls the R function quantile
to compute the empirical quantiles used in Equations (15)-(17).
The percentile method bootstrap confidence interval is only first-order
accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability
that the confidence interval will contain the true value of \(\lambda\) can be
off by \(k/\sqrt{N}\), where \(k\) is some constant. Efron and Tibshirani
(1993, pp.184-188) proposed a bias-corrected and accelerated interval that is
second-order accurate, meaning that the probability that the confidence interval
will contain the true value of \(\lambda\) may be off by \(k/N\)
instead of \(k/\sqrt{N}\). The two-sided bias-corrected and accelerated confidence interval is
computed as:
\[[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)] \;\;\;\; (18)\]
where
\[\alpha_1 = \Phi\!\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right] \;\;\;\; (19)\]
\[\alpha_2 = \Phi\!\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right] \;\;\;\; (20)\]
\[\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\lambda})] \;\;\;\; (21)\]
\[\hat{a} = \frac{\sum_{i=1}^{N} (\hat{\lambda}_{(\cdot)} - \hat{\lambda}_{(i)})^3}{6 \left[ \sum_{i=1}^{N} (\hat{\lambda}_{(\cdot)} - \hat{\lambda}_{(i)})^2 \right]^{3/2}} \;\;\;\; (22)\]
where the quantity \(\hat{\lambda}_{(i)}\) denotes the estimate of \(\lambda\) using
all the values in \(\underline{x}\) except the \(i\)'th one, and
\[\hat{\lambda}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\lambda}_{(i)} \;\;\;\; (23)\]
A one-sided lower confidence interval is given by:
\[[\hat{G}^{-1}(\alpha_1), \;\; \infty] \;\;\;\; (24)\]
and a one-sided upper confidence interval is given by:
\[[0, \;\; \hat{G}^{-1}(\alpha_2)] \;\;\;\; (25)\]
where \(\alpha_1\) and \(\alpha_2\) are computed as for a two-sided confidence
interval, except \(\alpha/2\) is replaced with \(\alpha\) in Equations (19) and (20).
The constant \(\hat{z}_0\) incorporates the bias correction, and the constant
\(\hat{a}\) is the acceleration constant. The term “acceleration” refers
to the rate of change of the standard error of the estimate of \(\lambda\) with
respect to the true value of \(\lambda\) (Efron and Tibshirani, 1993, p.186). For a
normal (Gaussian) distribution, the standard error of the estimate of \(\lambda\)
does not depend on the value of \(\lambda\), hence the acceleration constant is not
really necessary.
When ci.method="bootstrap", the function epoisCensored computes both
the percentile method and bias-corrected and accelerated method bootstrap confidence
intervals.
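A bare-bones percentile-bootstrap sketch for left-censored Poisson data is shown below (illustrative only; it reuses the simple likelihood maximization sketched earlier, assumes a censored value contributes ppois(T, lambda), and does not include the bias-correction and acceleration adjustments):

boot.percentile.ci <- function(x, censored, B = 500, conf.level = 0.95) {
  est <- function(xx, cc) {
    loglik <- function(lambda) sum(dpois(xx[!cc], lambda, log = TRUE)) +
      sum(ppois(xx[cc], lambda, log.p = TRUE))
    optimize(loglik, interval = c(0.01, 10 * max(xx)), maximum = TRUE)$maximum
  }
  n <- length(x)
  boot.est <- replicate(B, {
    i <- sample(n, replace = TRUE)      # resample with replacement (Step 1)
    est(x[i], censored[i])              # re-estimate lambda (Step 2)
  })
  alpha <- 1 - conf.level
  quantile(boot.est, c(alpha/2, 1 - alpha/2))   # percentile interval, cf. equation (15)
}
set.seed(300)
x <- rpois(30, lambda = 10); cens <- x < 8; x[cens] <- 8
boot.percentile.ci(x, cens)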
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation. Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators for parameters of a normal or lognormal distribution based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation (or coefficient of variation), rather than rely on a single point-estimate of the mean. Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and coefficient of variation of a Poisson distribution when data are subjected to single or multiple censoring.
Steven P. Millard ([email protected])
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, Chapter 4.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
Poisson, epois
, estimateCensored.object
.
# Generate 20 observations from a Poisson distribution with
# parameter lambda=10, and censor the values less than 10.
# Then generate 20 more observations from the same distribution
# and censor the values less than 20. Then estimate the mean
# using the maximum likelihood method.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(300)

dat.1 <- rpois(20, lambda=10)
censored.1 <- dat.1 < 10
dat.1[censored.1] <- 10

dat.2 <- rpois(20, lambda=10)
censored.2 <- dat.2 < 20
dat.2[censored.2] <- 20

dat <- c(dat.1, dat.2)
censored <- c(censored.1, censored.2)

epoisCensored(dat, censored, ci = TRUE)

#Results of Distribution Parameter Estimation
#Based on Type I Censored Data
#--------------------------------------------
#
#Assumed Distribution:            Poisson
#
#Censoring Side:                  left
#
#Censoring Level(s):              10 20
#
#Estimated Parameter(s):          lambda = 11.05402
#
#Estimation Method:               MLE
#
#Data:                            dat
#
#Censoring Variable:              censored
#
#Sample Size:                     40
#
#Percent Censored:                65%
#
#Confidence Interval for:         lambda
#
#Confidence Interval Method:      Profile Likelihood
#
#Confidence Interval Type:        two-sided
#
#Confidence Level:                95%
#
#Confidence Interval:             LCL =  9.842894
#                                 UCL = 12.846484

#----------

# Clean up
#---------
rm(dat.1, censored.1, dat.2, censored.2, dat, censored)
Estimate quantiles of a beta distribution.
eqbeta(x, p = 0.5, method = "mle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a beta distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the shape and scale
parameters of the distribution. The possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqbeta
returns estimated quantiles as well as
estimates of the shape1 and shape2 parameters.
Quantiles are estimated by 1) estimating the shape1 and shape2 parameters by
calling ebeta
, and then 2) calling the function
qbeta
and using the estimated values for
shape1 and shape2.
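For illustration, the two-step computation can be carried out by hand (a minimal sketch, assuming the default maximum likelihood fit; eqbeta itself also packages the result into an "estimate" object):

set.seed(250)
dat <- rbeta(20, shape1 = 2, shape2 = 4)
fit <- ebeta(dat, method = "mle")            # step 1: estimate shape1 and shape2
qbeta(0.9, shape1 = fit$parameters["shape1"],
    shape2 = fit$parameters["shape2"])       # step 2: plug the estimates into qbeta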
If x
is a numeric vector, eqbeta
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqbeta
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The beta distribution takes real values between 0 and 1. Special cases of the
beta are the Uniform[0,1] when shape1=1
and
shape2=1
, and the arcsin distribution when shape1=0.5
and shape2=0.5
. The arcsin distribution appears in the theory of random walks.
The beta distribution is used in Bayesian analyses as a conjugate to the binomial
distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
# Generate 20 observations from a beta distribution with parameters
# shape1=2 and shape2=4, then estimate the parameters via
# maximum likelihood and estimate the 90'th percentile.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- rbeta(20, shape1 = 2, shape2 = 4)
eqbeta(dat, p = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Beta
#
#Estimated Parameter(s):          shape1 =  5.392221
#                                 shape2 = 11.823233
#
#Estimation Method:               mle
#
#Estimated Quantile(s):           90'th %ile = 0.4592796
#
#Quantile Estimation Method:      Quantile(s) Based on
#                                 mle Estimators
#
#Data:                            dat
#
#Sample Size:                     20

#----------

# Clean up
rm(dat)
Estimate quantiles of a binomial distribution.
eqbinom(x, size = NULL, p = 0.5, method = "mle/mme/mvue", digits = 0)
x |
numeric or logical vector of observations, or an object resulting from a call to an
estimating function that assumes a binomial distribution
(e.g., |
size |
positive integer indicating the number of trials; |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimation. The only possible value is
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqbinom
returns estimated quantiles as well as
estimates of the prob
parameter.
Quantiles are estimated by 1) estimating the prob parameter by
calling ebinom
, and then 2) calling the function
qbinom
and using the estimated value for
prob
.
If x
is a numeric vector, eqbinom
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqbinom
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The binomial distribution is used to model processes with binary (Yes-No, Success-Failure,
Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent
of any other trial, and that the probability of “success”, \(p\), is the same on
each trial. A binomial discrete random variable \(X\) is the number of “successes” in
\(n\) independent trials. A special case of the binomial distribution occurs when \(n = 1\),
in which case \(X\) is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model
the proportion of times a chemical concentration exceeds a set standard in a given period of
time (e.g., Gilbert, 1987, p.143). The binomial distribution is also used to compute an upper
bound on the overall Type I error rate for deciding whether a facility or location is in
compliance with some set standard. Assume the null hypothesis is that the facility is in compliance.
If a test of hypothesis is conducted periodically over time to test compliance and/or several tests
are performed during each time period, and the facility or location is always in compliance, and
each single test has a Type I error rate of \(\alpha\), and the result of each test is
independent of the result of any other test (usually not a reasonable assumption), then the number
of times the facility is declared out of compliance when in fact it is in compliance is a
binomial random variable with probability of “success” \(p = \alpha\) being the
probability of being declared out of compliance (see USEPA, 2009).
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 3.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
ebinom
, Binomial
,
estimate.object
.
# Generate 20 observations from a binomial distribution with
# parameters size=1 and prob=0.2, then estimate the 'prob'
# parameter and the 90'th percentile.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(251)
dat <- rbinom(20, size = 1, prob = 0.2)
eqbinom(dat, p = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Binomial
#
#Estimated Parameter(s):          size = 20.0
#                                 prob =  0.1
#
#Estimation Method:               mle/mme/mvue for 'prob'
#
#Estimated Quantile(s):           90'th %ile = 4
#
#Quantile Estimation Method:      Quantile(s) Based on
#                                 mle/mme/mvue for 'prob' Estimators
#
#Data:                            dat
#
#Sample Size:                     20

#----------

# Clean up
rm(dat)
Estimate quantiles of an extreme value distribution.
eqevd(x, p = 0.5, method = "mle", pwme.method = "unbiased",
    plot.pos.cons = c(a = 0.35, b = 0), digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes an extreme value distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the location and scale
parameters. Possible values are
|
pwme.method |
character string specifying what method to use to compute the
probability-weighted moments when |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqevd
returns estimated quantiles as well as
estimates of the location and scale parameters.
Quantiles are estimated by 1) estimating the location and scale parameters by
calling eevd
, and then 2) calling the function
qevd
and using the estimated values for
location and scale.
If x
is a numeric vector, eqevd
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqevd
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
There are three families of extreme value distributions. The one
described here is the Type I, also called the Gumbel extreme value
distribution or simply Gumbel distribution. The name
“extreme value” comes from the fact that this distribution is
the limiting distribution (as \(n\) approaches infinity) of the
greatest value among \(n\) independent random variables each
having the same continuous distribution.
The Gumbel extreme value distribution is related to the
exponential distribution as follows.
Let \(X\) be an exponential random variable
with parameter rate=\(1\). Then \(Y = \alpha - \beta \log(X)\)
has an extreme value distribution with parameters
location=\(\alpha\) and scale=\(\beta\).
The distribution described above and assumed by eevd is the
largest extreme value distribution. The smallest extreme value
distribution is the limiting distribution (as \(n\) approaches infinity)
of the smallest value among \(n\)
independent random variables each having the same continuous distribution.
If \(X\) has a largest extreme value distribution with parameters
location=\(\alpha\) and scale=\(\beta\), then \(Y = -X\)
has a smallest extreme value distribution with parameters
location=\(-\alpha\) and scale=\(\beta\). The smallest
extreme value distribution is related to the Weibull distribution
as follows. Let \(X\) be a Weibull random variable with
parameters shape=\(\beta\) and scale=\(\alpha\). Then \(Y = \log(X)\)
has a smallest extreme value distribution with parameters
location=\(\log(\alpha)\) and scale=\(1/\beta\).
The extreme value distribution has been used extensively to model the distribution of streamflow, flooding, rainfall, temperature, wind speed, and other meteorological variables, as well as material strength and life data.
Steven P. Millard ([email protected])
Castillo, E. (1988). Extreme Value Theory in Engineering. Academic Press, New York, pp.184–198.
Downton, F. (1966). Linear Estimates of Parameters in the Extreme Value Distribution. Technometrics 8(1), 3–17.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Tiago de Oliveira, J. (1963). Decision Results for the Parameters of the Extreme Value (Gumbel) Distribution Based on the Mean and Standard Deviation. Trabajos de Estadistica 14, 61–81.
eevd
, Extreme Value Distribution,
estimate.object
.
# Generate 20 observations from an extreme value distribution with
# parameters location=2 and scale=1, then estimate the parameters
# and estimate the 90'th percentile.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- revd(20, location = 2)
eqevd(dat, p = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Extreme Value
#
#Estimated Parameter(s):          location = 1.9684093
#                                 scale    = 0.7481955
#
#Estimation Method:               mle
#
#Estimated Quantile(s):           90'th %ile = 3.652124
#
#Quantile Estimation Method:      Quantile(s) Based on
#                                 mle Estimators
#
#Data:                            dat
#
#Sample Size:                     20

#----------

# Clean up
#---------
rm(dat)
Estimate quantiles of an exponential distribution.
eqexp(x, p = 0.5, method = "mle/mme", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes an exponential distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the rate parameter.
Currently the only possible value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqexp
returns estimated quantiles as well as
the estimate of the rate parameter.
Quantiles are estimated by 1) estimating the rate parameter by
calling eexp
, and then 2) calling the function
qexp
and using the estimated value for
rate.
If x
is a numeric vector, eqexp
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqexp
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The exponential distribution is a special case of the gamma distribution, and takes on positive real values. A major use of the exponential distribution is in life testing where it is used to model the lifetime of a product, part, person, etc.
The exponential distribution is the only continuous distribution with a
“lack of memory” property. That is, if the lifetime of a part follows
the exponential distribution, then the distribution of the time until failure
is the same as the distribution of the time until failure given that the part
has survived to time \(t\).
The exponential distribution is related to the double exponential (also called Laplace) distribution, and to the extreme value distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
eexp
, Exponential,
estimate.object
.
# Generate 20 observations from an exponential distribution with parameter
# rate=2, then estimate the parameter and estimate the 90th percentile.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- rexp(20, rate = 2)
eqexp(dat, p = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Exponential
#
#Estimated Parameter(s):          rate = 2.260587
#
#Estimation Method:               mle/mme
#
#Estimated Quantile(s):           90'th %ile = 1.018578
#
#Quantile Estimation Method:      Quantile(s) Based on
#                                 mle/mme Estimators
#
#Data:                            dat
#
#Sample Size:                     20

#----------

# Clean up
#---------
rm(dat)
Estimate quantiles of a gamma distribution, and optionally construct a confidence interval for a quantile.
eqgamma(x, p = 0.5, method = "mle", ci = FALSE,
    ci.type = "two-sided", conf.level = 0.95,
    normal.approx.transform = "kulkarni.powar", digits = 0)

eqgammaAlt(x, p = 0.5, method = "mle", ci = FALSE,
    ci.type = "two-sided", conf.level = 0.95,
    normal.approx.transform = "kulkarni.powar", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a gamma distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the shape and scale
parameters of the distribution. The possible values are
|
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
normal.approx.transform |
character string indicating which power transformation to use.
Possible values are |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqgamma
returns estimated quantiles as well as
estimates of the shape and scale parameters.
The function eqgammaAlt
returns estimated quantiles as well as
estimates of the mean and coefficient of variation.
Quantiles are estimated by 1) estimating the shape and scale parameters by
calling egamma
, and then 2) calling the function
qgamma
and using the estimated values for shape
and scale.
The confidence interval for a quantile is computed by:
using a power transformation on the original data to induce approximate normality,
using eqnorm
to compute the confidence interval,
and then
back-transforming the interval to create a confidence interval on the original scale.
This is similar to what is done to create tolerance intervals for a gamma distribution
(Krishnamoorthy et al., 2008), and there is a one-to-one relationship between confidence
intervals for a quantile and tolerance intervals (see the DETAILS section of the
help file for eqnorm
). The value normal.approx.transform="cube.root"
uses the cube root transformation suggested by Wilson and Hilferty (1931) and used by
Krishnamoorthy et al. (2008) and Singh et al. (2010b), and the value
normal.approx.transform="fourth.root"
uses the fourth root transformation suggested
by Hawkins and Wixley (1986) and used by Singh et al. (2010b).
The default value normal.approx.transform="kulkarni.powar"
uses the “Optimum Power Normal Approximation Method” of Kulkarni and Powar (2010).
The “optimum” power \(p\) is determined by:
\[p = -0.0705 - 0.178\,\hat{\kappa} + 0.475\,\sqrt{\hat{\kappa}} \qquad \mbox{if } \hat{\kappa} \le 1.5\]
\[p = 0.246 \qquad \mbox{if } \hat{\kappa} > 1.5\]
where \(\hat{\kappa}\) denotes the estimate of the shape parameter. Although
Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to
determine the power \(p\), for the functions eqgamma and
eqgammaAlt the power is based on whatever estimate of shape is used
(e.g., method="mle", method="bcmle", etc.).
If x
is a numeric vector, eqgamma
and eqgammaAlt
return a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqgamma
and
eqgammaAlt
return a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
. In addition, if ci=TRUE
,
the returned list contains a component called interval
containing the
confidence interval information. If x
already has a component called
interval
, this component is replaced with the confidence interval information.
The gamma distribution takes values on the positive real line. Special cases of the gamma are the exponential distribution and the chi-square distributions. Applications of the gamma include life testing, statistical ecology, queuing theory, inventory control, and precipitation processes. A gamma distribution starts to resemble a normal distribution as the shape parameter a tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Krishnamoorthy K., T. Mathew, and S. Mukherjee. (2008). Normal-Based Methods for a Gamma Distribution: Prediction and Tolerance Intervals and Stress-Strength Reliability. Technometrics, 50(1), 69–78.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
egamma
, GammaDist
,
estimate.object
, eqnorm
, tolIntGamma
.
# Generate 20 observations from a gamma distribution with parameters # shape=3 and scale=2, then estimate the 90th percentile and create # a one-sided upper 95% confidence interval for that percentile. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rgamma(20, shape = 3, scale = 2) eqgamma(dat, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.203862 # scale = 2.174928 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 9.113446 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 13.79733 #---------- # Compare these results with the true 90'th percentile: qgamma(p = 0.9, shape = 3, scale = 2) #[1] 10.64464 #---------- # Using the same data as in the previous example, use egammaAlt # to estimate the mean and cv based on the bias-corrected # estimate of shape, and use the cube-root transformation to # normality. eqgammaAlt(dat, p = 0.9, method = "bcmle", ci = TRUE, ci.type = "upper", normal.approx.transform = "cube.root") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): mean = 4.7932408 # cv = 0.7242165 # #Estimation Method: bcmle of 'shape' # #Estimated Quantile(s): 90'th %ile = 9.428 # #Quantile Estimation Method: Quantile(s) Based on # bcmle of 'shape' # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact using # Wilson & Hilferty (1931) cube-root # transformation to Normality # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 12.89643 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and # 95% confidence using chrysene data and assuming a lognormal # distribution. Here we will use the same chrysene data but assume a # gamma distribution. # A beta-content upper tolerance limit with 95% coverage and # 95% confidence is equivalent to the 95% upper confidence limit for # the 95th percentile. attach(EPA.09.Ex.17.3.chrysene.df) Chrysene <- Chrysene.ppb[Well.type == "Background"] eqgamma(Chrysene, p = 0.95, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.806929 # scale = 5.286026 # #Estimation Method: mle # #Estimated Quantile(s): 95'th %ile = 31.74348 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: Chrysene # #Sample Size: 8 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 69.32425 #---------- # Clean up rm(Chrysene) detach("EPA.09.Ex.17.3.chrysene.df")
Estimate quantiles of a geometric distribution.
eqgeom(x, p = 0.5, method = "mle/mme", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a geometric distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the probability parameter.
Possible values are |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqgeom
returns estimated quantiles as well as
the estimate of the rate parameter.
Quantiles are estimated by 1) estimating the probability parameter by
calling egeom
, and then 2) calling the function
qgeom
and using the estimated value for
the probability parameter.
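A minimal sketch of this two-step computation, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(42)
x <- rgeom(10, prob = 0.2)
# Step 1: estimate the probability parameter
p.hat <- egeom(x)$parameters["prob"]
# Step 2: plug the estimate into qgeom() to get the 90th percentile
qgeom(0.9, prob = p.hat)
# eqgeom() performs both steps
eqgeom(x, p = 0.9)$quantiles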
If x
is a numeric vector, eqgeom
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqgeom
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The geometric distribution with parameter prob=p is a special case of the negative binomial distribution with parameters size=1 and prob=p.
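As an illustrative check of this relationship using the base R probability functions:

p <- 0.2
all.equal(dgeom(0:5, prob = p), dnbinom(0:5, size = 1, prob = p))
#[1] TRUE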
The negative binomial distribution has its roots in a gambling game where participants would bet on the number of tosses of a coin necessary to achieve a fixed number of heads. The negative binomial distribution has been applied in a wide variety of fields, including accident statistics, birth-and-death processes, and modeling spatial distributions of biological organisms.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 5.
egeom
, Geometric, enbinom
,
NegBinomial, estimate.object
.
# Generate an observation from a geometric distribution with parameter # prob=0.2, then estimate the parameter prob and the 90'th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgeom(1, prob = 0.2) dat #[1] 4 eqgeom(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Geometric # #Estimated Parameter(s): prob = 0.2 # #Estimation Method: mle/mme # #Estimated Quantile(s): 90'th %ile = 10 # #Quantile Estimation Method: Quantile(s) Based on # mle/mme Estimators # #Data: dat # #Sample Size: 1 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a generalized extreme value distribution.
eqgevd(x, p = 0.5, method = "mle", pwme.method = "unbiased", tsoe.method = "med", plot.pos.cons = c(a = 0.35, b = 0), digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a generalized extreme value distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the location, scale, and
threshold parameters. Possible values are
|
pwme.method |
character string specifying what method to use to compute the
probability-weighted moments when |
tsoe.method |
character string specifying the robust function to apply in the second stage of
the two-stage order-statistics estimator when |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqgevd
returns estimated quantiles as well as
estimates of the location, scale, and shape parameters.
Quantiles are estimated by 1) estimating the location, scale, and shape
parameters by calling egevd
, and then 2) calling the function
qgevd
and using the estimated values for
location, scale, and shape.
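A minimal sketch of this two-step computation, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(498)
x <- rgevd(20, location = 2, scale = 1, shape = 0.2)
# Step 1: estimate location, scale, and shape (maximum likelihood by default)
est <- egevd(x)$parameters
# Step 2: plug the estimates into qgevd() to get the 90th percentile
qgevd(0.9, location = est["location"], scale = est["scale"], shape = est["shape"])
# eqgevd() performs both steps
eqgevd(x, p = 0.9)$quantiles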
If x
is a numeric vector, eqgevd
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqgevd
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
Two-parameter extreme value distributions (EVD) have been applied extensively since the 1930's to several fields of study, including the distributions of hydrological and meteorological variables, human lifetimes, and strength of materials. The three-parameter generalized extreme value distribution (GEVD) was introduced by Jenkinson (1955) to model annual maximum and minimum values of meteorological events. Since then, it has been used extensively in the hydrological and meteorological fields.
The three families of EVDs are all special kinds of GEVDs. When the shape parameter equals 0, the GEVD reduces to the Type I extreme value (Gumbel) distribution. (The function zTestGevdShape allows you to test the null hypothesis that the shape parameter is 0.) When the shape parameter is negative, the GEVD is the same as the Type II extreme value distribution, and when it is positive it is the same as the Type III extreme value distribution.
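As a brief hedged sketch of using zTestGevdShape to check whether a Gumbel model is adequate (made-up data; it is assumed the function returns a standard hypothesis-test object with a p.value component):

library(EnvStats)
set.seed(20)
x <- rgevd(50, location = 2, scale = 1, shape = 0)  # shape = 0, i.e., Gumbel data
# Test H0: shape = 0; a large p-value is consistent with a Gumbel model
zTestGevdShape(x)$p.value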
Hosking et al. (1985) compare the asymptotic and small-sample statistical
properties of the PWME with the MLE and Jenkinson's (1969) method of sextiles.
Castillo and Hadi (1994) compare the small-sample statistical properties of the
MLE, PWME, and TSOE. Hosking and Wallis (1995) compare the small-sample properties
of unbiased L-moment estimators vs. plotting-position L-moment estimators. (PWMEs can be written as linear combinations of L-moments and thus have equivalent statistical properties.) Hosking and Wallis (1995) conclude
that unbiased estimators should be used for almost all applications.
Steven P. Millard ([email protected])
Castillo, E., and A. Hadi. (1994). Parameter and Quantile Estimation for the Generalized Extreme-Value Distribution. Environmetrics 5, 417–432.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hosking, J.R.M. (1984). Testing Whether the Shape Parameter is Zero in the Generalized Extreme-Value Distribution. Biometrika 71(2), 367–374.
Hosking, J.R.M. (1985). Algorithm AS 215: Maximum-Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 34(3), 301–310.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Jenkinson, A.F. (1969). Statistics of Extremes. Technical Note 98, World Meteorological Office, Geneva.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Macleod, A.J. (1989). Remark AS R76: A Remark on Algorithm AS 215: Maximum Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 38(1), 198–199.
Prescott, P., and A.T. Walden. (1980). Maximum Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Biometrika 67(3), 723–724.
Prescott, P., and A.T. Walden. (1983). Maximum Likelihood Estimation of the Three-Parameter Generalized Extreme-Value Distribution from Censored Samples. Journal of Statistical Computing and Simulation 16, 241–250.
egevd
, Generalized Extreme Value Distribution,
Extreme Value Distribution, eevd
, estimate.object
.
# Generate 20 observations from a generalized extreme value distribution # with parameters location=2, scale=1, and shape=0.2, then compute the # MLEs of location, shape,and threshold, and estimate the 90th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(498) dat <- rgevd(20, location = 2, scale = 1, shape = 0.2) eqgevd(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Generalized Extreme Value # #Estimated Parameter(s): location = 1.6144631 # scale = 0.9867007 # shape = 0.2632493 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 3.289912 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a hypergeometric distribution.
eqhyper(x, m = NULL, total = NULL, k = NULL, p = 0.5, method = "mle", digits = 0)
x |
non-negative integer indicating the number of white balls out of a sample of
size |
m |
non-negative integer indicating the number of white balls in the urn.
You must supply |
total |
positive integer indicating the total number of balls in the urn (i.e.,
|
k |
positive integer indicating the number of balls drawn without replacement from the
urn. Missing values ( |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the parameters of the
hypergeometric distribution. Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqhyper
returns estimated quantiles as well as
estimates of the hypergeometric distribution parameters.
Quantiles are estimated by 1) estimating the distribution parameters by
calling ehyper
, and then 2) calling the function
qhyper
and using the estimated values for
the distribution parameters.
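A minimal sketch of this two-step computation, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(250)
x <- rhyper(nn = 1, m = 10, n = 30, k = 5)
# Step 1: estimate m (n is then total - m); total and k are taken as known
est <- ehyper(x, total = 40, k = 5)$parameters
# Step 2: plug the estimates into qhyper() to get the 80th percentile
qhyper(0.8, m = est["m"], n = est["n"], k = est["k"])
# eqhyper() performs both steps
eqhyper(x, total = 40, k = 5, p = 0.8)$quantiles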
If x
is a numeric vector, eqhyper
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqhyper
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The hypergeometric distribution can be described by an urn model with M white balls and N black balls. If K balls are drawn with replacement, then the number of white balls in the sample of size K follows a binomial distribution with parameters size=K and prob=M/(M+N). If K balls are drawn without replacement, then the number of white balls in the sample of size K follows a hypergeometric distribution with parameters m=M, n=N, and k=K.
The name “hypergeometric” comes from the fact that the probabilities associated with this distribution can be written as successive terms in the expansion of a function of a Gaussian hypergeometric series.
The hypergeometric distribution is applied in a variety of fields, including quality control and estimation of animal population size. It is also the distribution used to compute probabilities for Fisher's exact test for a 2x2 contingency table.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 6.
ehyper
, Hypergeometric, estimate.object
.
# Generate an observation from a hypergeometric distribution with # parameters m=10, n=30, and k=5, then estimate the parameter m, and # the 80'th percentile. # Note: the call to set.seed simply allows you to reproduce this example. # Also, the only parameter actually estimated is m; once m is estimated, # n is computed by subtracting the estimated value of m (8 in this example) # from the given of value of m+n (40 in this example). The parameters # n and k are shown in the output in order to provide information on # all of the parameters associated with the hypergeometric distribution. set.seed(250) dat <- rhyper(nn = 1, m = 10, n = 30, k = 5) dat #[1] 1 eqhyper(dat, total = 40, k = 5, p = 0.8) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Hypergeometric # #Estimated Parameter(s): m = 8 # n = 32 # k = 5 # #Estimation Method: mle for 'm' # #Estimated Quantile(s): 80'th %ile = 2 # #Quantile Estimation Method: Quantile(s) Based on # mle for 'm' Estimators # #Data: dat # #Sample Size: 1 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a lognormal distribution, and optionally construct a confidence interval for a quantile.
eqlnorm(x, p = 0.5, method = "qmle", ci = FALSE, ci.method = "exact", ci.type = "two-sided", conf.level = 0.95, digits = 0)
x |
a numeric vector of positive observations, or an object resulting from a call to an
estimating function that assumes a lognormal distribution
(i.e., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string indicating what method to use to estimate the quantile(s).
The possible values are |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the quantile. The possible values are |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Quantiles and their associated confidence intervals are constructed by calling
the function eqnorm
using the log-transformed data and then
exponentiating the quantiles and confidence limits.
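A minimal sketch of this log-scale computation, using made-up data and assuming the EnvStats package is loaded (it is also assumed that, as for other EnvStats estimate objects, the confidence limits are stored in the interval$limits component):

library(EnvStats)
set.seed(47)
x <- rlnorm(20, meanlog = 3, sdlog = 0.5)
# 90th percentile and upper confidence limit on the log scale ...
log.fit <- eqnorm(log(x), p = 0.9, ci = TRUE, ci.type = "upper")
# ... back-transformed to the original scale
exp(log.fit$quantiles)
exp(log.fit$interval$limits)
# eqlnorm() carries out the same steps directly
eqlnorm(x, p = 0.9, ci = TRUE, ci.type = "upper")$quantiles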
In the special case when p=0.5
and method="mvue"
, the estimated
median is computed using the method given in Gilbert (1987, p.172) and
Bradu and Mundlak (1970).
If x
is a numeric vector, eqlnorm
returns a list of class
"estimate"
containing the estimated quantile(s) and other information.
See estimate.object
for details.
If x
is the result of calling an estimation function, eqlnorm
returns a list whose class is the same as x
. The list contains the same
components as x
, as well as components called quantiles
and
quantile.method
. In addition, if ci=TRUE
, the returned list
contains a component called interval
containing the confidence interval
information. If x
already has a component called interval
, this
component is replaced with the confidence interval information.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Bradu, D., and Y. Mundlak. (1970). Estimation in Lognormal Linear Models. Journal of the American Statistical Association 65, 198-211.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.
Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
eqnorm
, Lognormal
, elnorm
,
estimate.object
.
# Generate 20 observations from a lognormal distribution with # parameters meanlog=3 and sdlog=0.5, then estimate the 90th # percentile and create a one-sided upper 95% confidence interval # for that percentile. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(47) dat <- rlnorm(20, meanlog = 3, sdlog = 0.5) eqlnorm(dat, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.9482139 # sdlog = 0.4553215 # #Estimation Method: mvue # #Estimated Quantile(s): 90'th %ile = 34.18312 # #Quantile Estimation Method: qmle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 45.84008 #---------- # Compare these results with the true 90'th percentile: qlnorm(p = 0.9, meanlog = 3, sdlog = 0.5) #[1] 38.1214 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal # distribution. # A beta-content upper tolerance limit with 95% coverage and 95% # confidence is equivalent to the 95% upper confidence limit for the # 95th percentile. attach(EPA.09.Ex.17.3.chrysene.df) Chrysene <- Chrysene.ppb[Well.type == "Background"] eqlnorm(Chrysene, p = 0.95, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.5085773 # sdlog = 0.6279479 # #Estimation Method: mvue # #Estimated Quantile(s): 95'th %ile = 34.51727 # #Quantile Estimation Method: qmle # #Data: Chrysene # #Sample Size: 8 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.0000 # UCL = 90.9247 #---------- # Clean up rm(Chrysene) detach("EPA.09.Ex.17.3.chrysene.df")
Estimate quantiles of a three-parameter lognormal distribution.
eqlnorm3(x, p = 0.5, method = "lmle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a three-parameter lognormal distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Quantiles are estimated by 1) estimating the distribution parameters by
calling elnorm3
, and then 2) calling the function
qlnorm3
and using the estimated distribution
parameters.
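A minimal sketch of this two-step computation, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(250)
x <- rlnorm3(20, meanlog = 1.5, sdlog = 1, threshold = 10)
# Step 1: estimate meanlog, sdlog, and threshold (modified method of moments here)
est <- elnorm3(x, method = "mmme")$parameters
# Step 2: plug the estimates into qlnorm3() to get the 90th percentile
qlnorm3(0.9, meanlog = est["meanlog"], sdlog = est["sdlog"],
  threshold = est["threshold"])
# eqlnorm3() performs both steps
eqlnorm3(x, p = 0.9, method = "mmme")$quantiles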
If x
is a numeric vector, eqlnorm3
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqlnorm3
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The problem of estimating the parameters of a three-parameter lognormal distribution has been extensively discussed by Aitchison and Brown (1957, Chapter 6), Calitz (1973), Cohen (1951), Cohen (1988), Cohen and Whitten (1980), Cohen et al. (1985), Griffiths (1980), Harter and Moore (1966), Hill (1963), and Royston (1992b). Stedinger (1980) and Hoshi et al. (1984) discuss fitting the three-parameter lognormal distribution to hydrologic data.
The global maximum likelihood estimates are inadmissible. In the past, several
researchers have found that the local maximum likelihood estimates (lmle's)
occasionally fail because of convergence problems, but they were not using the
likelihood profile and reparameterization of Griffiths (1980). Cohen (1988)
recommends the modified method of moments estimators over lmle's because they are easy to compute, they are unbiased with respect to the mean and standard deviation on the log-scale, their variances are minimal or near minimal, and they do not suffer from regularity problems.
Because the distribution of the lmle of the threshold parameter is far from normal for moderate sample sizes (Griffiths, 1980), it is questionable whether confidence intervals for the threshold or the median based on asymptotic variances and covariances will perform well. Cohen and Whitten (1980) and Cohen et al. (1985), however, found that the asymptotic variances and covariances are reasonably close to corresponding simulated variances and covariances for the modified method of moments estimators (method="mmme"). In a simulation study (5000 Monte Carlo trials), Royston (1992b) found that the coverage of confidence intervals for the threshold based on the likelihood profile (ci.method="likelihood.profile") was very close to the nominal level (94.1% for a nominal level of 95%), although not symmetric. Royston (1992b) also found that the coverage of confidence intervals for the threshold based on the skewness method (ci.method="skewness") was also very close (95.4%) and symmetric.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, Chapter 5.
Calitz, F. (1973). Maximum Likelihood Estimation of the Parameters of the Three-Parameter Lognormal Distribution–a Reconsideration. Australian Journal of Statistics 15(3), 185–190.
Cohen, A.C. (1951). Estimating Parameters of Logarithmic-Normal Distributions by Maximum Likelihood. Journal of the American Statistical Association 46, 206–212.
Cohen, A.C. (1988). Three-Parameter Estimation. In Crow, E.L., and K. Shimizu, eds. Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 4.
Cohen, A.C., and B.J. Whitten. (1980). Estimation in the Three-Parameter Lognormal Distribution. Journal of the American Statistical Association 75, 399–404.
Cohen, A.C., B.J. Whitten, and Y. Ding. (1985). Modified Moment Estimation for the Three-Parameter Lognormal Distribution. Journal of Quality Technology 17, 92–99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 2.
Griffiths, D.A. (1980). Interval Estimation for the Three-Parameter Lognormal Distribution via the Likelihood Function. Applied Statistics 29, 58–68.
Harter, H.L., and A.H. Moore. (1966). Local-Maximum-Likelihood Estimation of the Parameters of Three-Parameter Lognormal Populations from Complete and Censored Samples. Journal of the American Statistical Association 61, 842–851.
Heyde, C.C. (1963). On a Property of the Lognormal Distribution. Journal of the Royal Statistical Society, Series B 25, 392–393.
Hill, B.M. (1963). The Three-Parameter Lognormal Distribution and Bayesian Analysis of a Point-Source Epidemic. Journal of the American Statistical Association 58, 72–84.
Hoshi, K., J.R. Stedinger, and J. Burges. (1984). Estimation of Log-Normal Quantiles: Monte Carlo Results and First-Order Approximations. Journal of Hydrology 71, 1–30.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897–912.
Stedinger, J.R. (1980). Fitting Lognormal Distributions to Hydrologic Data. Water Resources Research 16(3), 481–490.
elnorm3
, Lognormal3, Lognormal,
LognormalAlt, Normal.
# Generate 20 observations from a 3-parameter lognormal distribution # with parameters meanlog=1.5, sdlog=1, and threshold=10, then use # Cohen and Whitten's (1980) modified moments estimators to estimate # the parameters, and estimate the 90th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnorm3(20, meanlog = 1.5, sdlog = 1, threshold = 10) eqlnorm3(dat, method = "mmme", p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: 3-Parameter Lognormal # #Estimated Parameter(s): meanlog = 1.5206664 # sdlog = 0.5330974 # threshold = 9.6620403 # #Estimation Method: mmme # #Estimated Quantile(s): 90'th %ile = 18.72194 # #Quantile Estimation Method: Quantile(s) Based on # mmme Estimators # #Data: dat # #Sample Size: 20 # Clean up #--------- rm(dat)
Estimate quantiles of a lognormal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for a quantile.
eqlnormCensored(x, censored, censoring.side = "left", p = 0.5, method = "mle", ci = FALSE, ci.method = "exact.for.complete", ci.type = "two-sided", conf.level = 0.95, digits = 0, nmc = 1000, seed = NULL)
x |
a numeric vector of positive observations.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the mean and standard deviation on the log-scale. For singly censored data, the possible values are: For multiply censored data, the possible values are: See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the quantile. The possible values are: See the DETAILS section for more
information. This argument is ignored if |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
nmc |
numeric scalar indicating the number of Monte Carlo simulations to run when
|
seed |
integer supplied to the function |
Quantiles and their associated confidence intervals are constructed by calling
the function eqnormCensored
using the log-transformed data and then
exponentiating the quantiles and confidence limits.
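A minimal sketch of this log-scale computation for censored data, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(47)
x <- rlnorm(20, meanlog = 3, sdlog = 0.5)
censored <- x < 10      # left-censor observations below 10
x[censored] <- 10
# 90th percentile estimated on the log scale, then back-transformed ...
exp(eqnormCensored(log(x), censored, p = 0.9)$quantiles)
# ... which matches the direct call
eqlnormCensored(x, censored, p = 0.9)$quantiles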
eqlnormCensored
returns a list of class "estimateCensored"
containing the estimated quantile(s) and other information.
See estimateCensored.object
for details.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for a percentile, rather than rely on a single point estimate of the percentile. Confidence intervals for percentiles of a normal distribution depend on the properties of the estimators for both the mean and standard deviation.
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring (see, for example, Singh et al., 2006). Studies to evaluate the performance of a confidence interval for a percentile include: Caudill et al. (2007), Hewett and Ganner (2007), Kroll and Stedinger (1996), and Serasinghe (2010).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Caudill, S.P., L.-Y. Wong, W.E. Turner, R. Lee, A. Henderson, D. G. Patterson Jr. (2007). Percentile Estimation Using Variable Censored Data. Chemosphere 68, 169–180.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Hewett, P., and G.H. Ganser. (2007). A Comparison of Several Methods for Analyzing Censored Data. Annals of Occupational Hygiene 51(7), 611–632.
Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kroll, C.N., and J.R. Stedinger. (1996). Estimation of Moments and Quantiles Using Censored Data. Water Resources Research 32(4), 1005–1012.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Serasinghe, S.K. (2010). A Simulation Comparison of Parametric and Nonparametric Estimators of Quantiles from Right Censored Data. A Report submitted in partial fulfillment of the requirements for the degree Master of Science, Department of Statistics, College of Arts and Sciences, Kansas State University, Manhattan, Kansas.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.
Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
eqnormCensored
, enormCensored
,
tolIntNormCensored
,
elnormCensored
, Lognormal
, estimateCensored.object
.
# Generate 15 observations from a lognormal distribution with # parameters meanlog=3 and sdlog=0.5, and censor observations less than 10. # Then generate 15 more observations from this distribution and censor # observations less than 9. # Then estimate the 90th percentile and create a one-sided upper 95% # confidence interval for that percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(47) x.1 <- rlnorm(15, meanlog = 3, sdlog = 0.5) sort(x.1) # [1] 8.051717 9.651611 11.671282 12.271247 12.664108 17.446124 # [7] 17.707301 20.238069 20.487219 21.025510 21.208197 22.036554 #[13] 25.710773 28.661973 54.453557 censored.1 <- x.1 < 10 x.1[censored.1] <- 10 x.2 <- rlnorm(15, meanlog = 3, sdlog = 0.5) sort(x.2) # [1] 6.289074 7.511164 8.988267 9.179006 12.869408 14.130081 # [7] 16.941937 17.060513 19.287572 19.682126 20.363893 22.750203 #[13] 24.744306 28.089325 37.792873 censored.2 <- x.2 < 9 x.2[censored.2] <- 9 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) eqlnormCensored(x, censored, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 9 10 # #Estimated Parameter(s): meanlog = 2.8099300 # sdlog = 0.5137151 # #Estimation Method: MLE # #Estimated Quantile(s): 90'th %ile = 32.08159 # #Quantile Estimation Method: Quantile(s) Based on # MLE Estimators # #Data: x # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 16.66667% # #Confidence Interval for: 90'th %ile # #Assumed Sample Size: 30 # #Confidence Interval Method: Exact for # Complete Data # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 41.38716 #---------- # Compare these results with the true 90'th percentile: qlnorm(p = 0.9, meanlog = 3, sd = 0.5) #[1] 38.1214 #---------- # Clean up rm(x.1, censored.1, x.2, censored.2, x, censored) #-------------------------------------------------------------------- # Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and standard deviation using the MLE, # and then construct an upper 95% confidence limit for the 90th percentile. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... 
#23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean, standard deviation, and 90th percentile # on the log-scale using the MLE, and construct an upper 95% # confidence limit for the 90th percentile: #--------------------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, eqlnormCensored(Manganese.ppb, Censored, p = 0.9, ci = TRUE, ci.type = "upper")) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Estimated Quantile(s): 90'th %ile = 52.14674 # #Quantile Estimation Method: Quantile(s) Based on # MLE Estimators # #Data: Manganese.ppb # #Censoring Variable: censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: 90'th %ile # #Assumed Sample Size: 25 # #Confidence Interval Method: Exact for # Complete Data # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.0000 # UCL = 110.9305
Estimate quantiles of a logistic distribution.
eqlogis(x, p = 0.5, method = "mle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a logistic distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the distribution parameters.
Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqlogis
returns estimated quantiles as well as
estimates of the location and scale parameters.
Quantiles are estimated by 1) estimating the location and scale parameters by
calling elogis
, and then 2) calling the function
qlogis
and using the estimated values for
location and scale.
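As a minimal sketch of these two steps (assuming, as for other EnvStats estimating functions, that the fitted location and scale are stored in the parameters component of the returned "estimate" object):

set.seed(250)
dat <- rlogis(20)

# Step 1: estimate location and scale
est <- elogis(dat)
params <- est$parameters

# Step 2: plug the estimates into qlogis
qlogis(0.9, location = params["location"], scale = params["scale"])

# This should match the 90'th %ile reported by eqlogis(dat, p = 0.9)
# in the EXAMPLES below.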
If x
is a numeric vector, eqlogis
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqlogis
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The logistic distribution is defined on the real line and is unimodal and symmetric about its location parameter (the mean). It has longer tails than a normal (Gaussian) distribution. It is used to model growth curves and bioassay data.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
elogis
, Logistic, estimate.object
.
# Generate 20 observations from a logistic distribution with # parameters location=0 and scale=1, then estimate the parameters # and estimate the 90th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlogis(20) eqlogis(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Logistic # #Estimated Parameter(s): location = -0.2181845 # scale = 0.8152793 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 1.573167 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 #---------- # Clean up rm(dat)
Estimate quantiles of a negative binomial distribution.
eqnbinom(x, size = NULL, p = 0.5, method = "mle/mme", digits = 0)
x |
vector of non-negative integers indicating the number of trials that took place
before |
size |
vector of positive integers indicating the number of “successes” that
must be observed before the trials are stopped. Missing ( |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the probability parameter.
Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqnbinom
returns estimated quantiles as well as
estimates of the prob
parameter.
Quantiles are estimated by 1) estimating the prob parameter by
calling enbinom
, and then 2) calling the function
qnbinom
and using the estimated value for
prob
.
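A minimal sketch of these two steps, using the single observation from the EXAMPLES below (assuming the fitted size and prob are stored in the parameters component of the returned "estimate" object):

# Step 1: estimate prob (size is supplied, not estimated)
est <- enbinom(5, size = 2)
prob.hat <- est$parameters["prob"]

# Step 2: plug the estimate into qnbinom
qnbinom(0.9, size = 2, prob = prob.hat)

# This should match the 90'th %ile of 11 reported by
# eqnbinom(5, size = 2, p = 0.9) in the EXAMPLES below.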
If x
is a numeric vector, eqnbinom
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqnbinom
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The negative binomial distribution has its roots in a gambling game where participants would bet on the number of tosses of a coin necessary to achieve a fixed number of heads. The negative binomial distribution has been applied in a wide variety of fields, including accident statistics, birth-and-death processes, and modeling spatial distributions of biological organisms.
The geometric distribution with parameter prob=p is a special case of the negative binomial distribution with parameters size=1 and prob=p.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 5.
enbinom
, NegBinomial, egeom
,
Geometric, estimate.object
.
# Generate an observation from a negative binomial distribution with # parameters size=2 and prob=0.2, then estimate the parameter prob # and the 90th percentile. # Note: the call to set.seed simply allows you to reproduce this example. # Also, the only parameter that is estimated is prob; the parameter # size is supplied in the call to enbinom. The parameter size is printed in # order to show all of the parameters associated with the distribution. set.seed(250) dat <- rnbinom(1, size = 2, prob = 0.2) dat #[1] 5 eqnbinom(dat, size = 2, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Negative Binomial # #Estimated Parameter(s): size = 2.0000000 # prob = 0.2857143 # #Estimation Method: mle/mme for 'prob' # #Estimated Quantile(s): 90'th %ile = 11 # #Quantile Estimation Method: Quantile(s) Based on # mle/mme for 'prob' Estimators # #Data: dat, 2 # #Sample Size: 1 #---------- # Clean up rm(dat)
Estimate quantiles of a normal distribution, and optionally construct a confidence interval for a quantile.
eqnorm(x, p = 0.5, method = "qmle", ci = FALSE, ci.method = "exact", ci.type = "two-sided", conf.level = 0.95, digits = 0, warn = TRUE)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a normal (Gaussian) distribution
(i.e., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string indicating what method to use to estimate the quantile(s).
Currently the only possible value is |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the quantile. The possible values are |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
warn |
logical scalar indicating whether to warn in the case when |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Quantiles are estimated by 1) estimating the mean and standard deviation parameters by
calling enorm
with method="mvue"
, and then
2) calling the function qnorm
and using the estimated values
for mean and standard deviation. This estimator of the p'th quantile is sometimes called the quasi-maximum likelihood estimator (qmle; Cohn et al., 1989) because if the maximum likelihood estimator of standard deviation were used in place of the minimum variance unbiased one, then this estimator of the quantile would be the mle of the p'th quantile.
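As a minimal sketch of the qmle computation (here the sample mean and sample standard deviation stand in for the estimates returned by enorm with method="mvue"):

set.seed(47)
dat <- rnorm(20, mean = 10, sd = 2)

# qmle of the 90th percentile: plug the estimated mean and sd into qnorm
qnorm(0.9, mean = mean(dat), sd = sd(dat))

# This should reproduce the 90'th %ile of 12.12693 reported by
# eqnorm(dat, p = 0.9) in the EXAMPLES below.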
When ci=TRUE
and ci.method="exact"
, the confidence interval for a
quantile is computed by using the relationship between a confidence interval for
a quantile and a tolerance interval. Specifically, it can be shown
(e.g., Conover, 1980, pp.119-121) that an upper confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to an upper β-content tolerance interval with coverage 100p% and confidence level 100(1-α)%. Also, a lower confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to a lower β-content tolerance interval with coverage 100(1-p)% and confidence level 100(1-α)%. See the help file for
tolIntNorm
for information on tolerance intervals for a normal distribution.
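A sketch of this equivalence (the tolIntNorm argument and limit names below are assumed to follow the usual EnvStats conventions): the 95% upper confidence limit for the 90th percentile and the upper tolerance limit with 90% coverage and 95% confidence should agree.

set.seed(47)
dat <- rnorm(20, mean = 10, sd = 2)

# 95% upper confidence limit for the 90th percentile
eqnorm(dat, p = 0.9, ci = TRUE, ci.type = "upper")$interval$limits["UCL"]

# Upper tolerance limit with 90% coverage and 95% confidence
tolIntNorm(dat, coverage = 0.9, cov.type = "content",
  ti.type = "upper", conf.level = 0.95)$interval$limits["UTL"]

# Both should equal the UCL of 13.30064 shown in the EXAMPLES below.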
When ci=TRUE
and ci.method="normal.approx"
, the confidence interval for a
quantile is computed by assuming the estimated quantile has an approximately normal
distribution and using the asymptotic variance to construct the confidence interval
(see Stedinger, 1983; Stedinger et al., 1993).
If x
is a numeric vector, eqnorm
returns a list of class
"estimate"
containing the estimated quantile(s) and other information.
See estimate.object
for details.
If x
is the result of calling an estimation function, eqnorm
returns a list whose class is the same as x
. The list contains the same
components as x
, as well as components called quantiles
and
quantile.method
. In addition, if ci=TRUE
, the returned list
contains a component called interval
containing the confidence interval
information. If x
already has a component called interval
, this
component is replaced with the confidence interval information.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.
Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
enorm
, tolIntNorm
, Normal
,
estimate.object
.
# Generate 20 observations from a normal distribution with # parameters mean=10 and sd=2, then estimate the 90th # percentile and create a one-sided upper 95% confidence interval # for that percentile. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(47) dat <- rnorm(20, mean = 10, sd = 2) eqnorm(dat, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 9.792856 # sd = 1.821286 # #Estimation Method: mvue # #Estimated Quantile(s): 90'th %ile = 12.12693 # #Quantile Estimation Method: qmle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 13.30064 #---------- # Compare these results with the true 90'th percentile: qnorm(p = 0.9, mean = 10, sd = 2) #[1] 12.56310 #---------- # Clean up rm(dat) #========== # Example 21-4 of USEPA (2009, p. 21-13) shows how to construct a # 99% lower confidence limit for the 95th percentile using chrysene # data and assuming a lognormal distribution. The data for this # example are stored in EPA.09.Ex.21.1.aldicarb.df. # The facility permit has established an ACL of 30 ppb that should not # be exceeded more than 5% of the time. Thus, if the lower confidence limit # for the 95th percentile is greater than 30 ppb, the well is deemed to be # out of compliance. # Look at the data #----------------- head(EPA.09.Ex.21.1.aldicarb.df) # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #5 1 Well.2 23.7 #6 2 Well.2 21.9 longToWide(EPA.09.Ex.21.1.aldicarb.df, "Aldicarb.ppb", "Month", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 #Month.1 19.9 23.7 5.6 #Month.2 29.6 21.9 3.3 #Month.3 18.7 26.9 2.3 #Month.4 24.2 26.1 6.9 # Estimate the 95th percentile and compute the lower # 99% confidence limit for Well 1. #--------------------------------------------------- with(EPA.09.Ex.21.1.aldicarb.df, eqnorm(Aldicarb.ppb[Well == "Well.1"], p = 0.95, ci = TRUE, ci.type = "lower", conf.level = 0.99)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 23.10000 # sd = 4.93491 # #Estimation Method: mvue # #Estimated Quantile(s): 95'th %ile = 31.2172 # #Quantile Estimation Method: qmle # #Data: Aldicarb.ppb[Well == "Well.1"] # #Sample Size: 4 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: lower # #Confidence Level: 99% # #Confidence Interval: LCL = 25.2855 # UCL = Inf # Now compute the 99% lower confidence limit for each of the three # wells all at once. #------------------------------------------------------------------ LCLs <- with(EPA.09.Ex.21.1.aldicarb.df, sapply(split(Aldicarb.ppb, Well), function(x) eqnorm(x, p = 0.95, method = "qmle", ci = TRUE, ci.type = "lower", conf.level = 0.99)$interval$limits["LCL"])) round(LCLs, 2) #Well.1.LCL Well.2.LCL Well.3.LCL # 25.29 25.66 5.46 LCLs > 30 #Well.1.LCL Well.2.LCL Well.3.LCL # FALSE FALSE FALSE # Clean up #--------- rm(LCLs) #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal # distribution. 
# A beta-content upper tolerance limit with 95% coverage and 95% # confidence is equivalent to the 95% upper confidence limit for the # 95th percentile. # Here we will construct a 95% upper confidence limit for the 95th # percentile based on the log-transformed data, then exponentiate the # result to get the confidence limit on the original scale. Note that # it is easier to just use the function eqlnorm with the original data # to achieve the same result. attach(EPA.09.Ex.17.3.chrysene.df) log.Chrysene <- log(Chrysene.ppb[Well.type == "Background"]) eqnorm(log.Chrysene, p = 0.95, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.5085773 # sd = 0.6279479 # #Estimation Method: mvue # #Estimated Quantile(s): 95'th %ile = 3.54146 # #Quantile Estimation Method: qmle # #Data: log.Chrysene # #Sample Size: 8 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 4.510032 exp(4.510032) #[1] 90.92473 #---------- # Clean up rm(log.Chrysene) detach("EPA.09.Ex.17.3.chrysene.df")
Estimate quantiles of a normal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for a quantile.
eqnormCensored(x, censored, censoring.side = "left", p = 0.5, method = "mle", ci = FALSE, ci.method = "exact.for.complete", ci.type = "two-sided", conf.level = 0.95, digits = 0, nmc = 1000, seed = NULL)
x |
a numeric vector of observations.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the mean and standard deviation. For singly censored data, the possible values are: For multiply censored data, the possible values are: See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the quantile. The possible values are: See the DETAILS section for more information. This argument is ignored if |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
nmc |
numeric scalar indicating the number of Monte Carlo simulations to run when
|
seed |
integer supplied to the function |
Estimating Quantiles
Quantiles are estimated by:
estimating the mean and standard deviation parameters by
calling enormCensored
, and then
calling the function qnorm
and using the estimated
values for the mean and standard deviation.
The estimated quantile thus depends on the method of estimating the mean and
standard deviation.
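A minimal sketch of these two steps (assuming, as for other EnvStats estimating functions, that the fitted mean and sd are stored in the parameters component of the object returned by enormCensored):

set.seed(47)
x <- rnorm(15, mean = 10, sd = 2)
censored <- x < 8
x[censored] <- 8

# Step 1: estimate mean and sd from the left-censored sample
est <- enormCensored(x, censored, method = "mle")

# Step 2: plug the estimates into qnorm
qnorm(0.9, mean = est$parameters["mean"], sd = est$parameters["sd"])

# This should match the 90'th %ile reported by
# eqnormCensored(x, censored, p = 0.9).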
Confidence Intervals for Quantiles
Exact Method When Data are Complete (ci.method="exact.for.complete"
)
When ci.method="exact.for.complete"
, the function eqnormCensored
calls the function eqnorm
, supplying it with the estimated mean
and standard deviation, and setting the argument ci.method="exact"
. Thus, this
is the exact method for computing a confidence interval for a quantile had the data
been complete. Because the data have been subjected to Type I censoring, this method
of constructing a confidence interval for the quantile is an approximation.
Normal Approximation (ci.method="normal.approx"
)
When ci.method="normal.approx"
, the function eqnormCensored
calls the function eqnorm
, supplying it with the estimated mean
and standard deviation, and setting the argument ci.method="normal.approx"
.
Thus, this is the normal approximation method for computing a confidence interval
for a quantile had the data been complete. Because the data have been subjected
to Type I censoring, this method of constructing a confidence interval for the
quantile is an approximation both because of the normal approximation and because
the estimates of the mean and standard deviation are based on censored, instead of
complete, data.
Generalized Pivotal Quantity (ci.method="gpq"
)
When ci.method="gpq"
, the function eqnormCensored
uses the
relationship between confidence intervals for quantiles and tolerance intervals
and calls the function tolIntNormCensored
with the argument
ti.method="gpq"
to construct the confidence interval.
Specifically, it can be shown (e.g., Conover, 1980, pp.119-121) that an upper confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to an upper β-content tolerance interval with coverage 100p% and confidence level 100(1-α)%. Also, a lower confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to a lower β-content tolerance interval with coverage 100(1-p)% and confidence level 100(1-α)%.
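A sketch of this relationship (the tolIntNormCensored arguments and limit names below, including the seed setting, are assumptions based on the description above rather than a verified signature; because the gpq limits are computed by Monte Carlo simulation, the two calls should agree when the same seed is used):

set.seed(47)
x <- rnorm(15, mean = 10, sd = 2)
censored <- x < 8
x[censored] <- 8

# 95% upper confidence limit for the 90th percentile, gpq method
eqnormCensored(x, censored, p = 0.9, ci = TRUE, ci.type = "upper",
  ci.method = "gpq", seed = 123)$interval$limits["UCL"]

# Upper tolerance limit with 90% coverage and 95% confidence, gpq method
tolIntNormCensored(x, censored, coverage = 0.9, ti.type = "upper",
  conf.level = 0.95, ti.method = "gpq", seed = 123)$interval$limits["UTL"]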
eqnormCensored
returns a list of class "estimateCensored"
containing the estimated quantile(s) and other information.
See estimateCensored.object
for details.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for a percentile, rather than rely on a single point-estimate of percentile. Confidence intervals for percentiles of a normal distribution depend on the properties of the estimators for both the mean and standard deviation.
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring (see, for example, Singh et al., 2006). Studies to evaluate the performance of a confidence interval for a percentile include: Caudill et al. (2007), Hewett and Ganner (2007), Kroll and Stedinger (1996), and Serasinghe (2010).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Caudill, S.P., L.-Y. Wong, W.E. Turner, R. Lee, A. Henderson, D. G. Patterson Jr. (2007). Percentile Estimation Using Variable Censored Data. Chemosphere 68, 169–180.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Hewett, P., and G.H. Ganser. (2007). A Comparison of Several Methods for Analyzing Censored Data. Annals of Occupational Hygiene 51(7), 611–632.
Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kroll, C.N., and J.R. Stedinger. (1996). Estimation of Moments and Quantiles Using Censored Data. Water Resources Research 32(4), 1005–1012.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Serasinghe, S.K. (2010). A Simulation Comparison of Parametric and Nonparametric Estimators of Quantiles from Right Censored Data. A Report submitted in partial fulfillment of the requirements for the degree Master of Science, Department of Statistics, College of Arts and Sciences, Kansas State University, Manhattan, Kansas.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.
Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
enormCensored
, tolIntNormCensored
, Normal
,
estimateCensored.object
.
# Generate 15 observations from a normal distribution with # parameters mean=10 and sd=2, and censor observations less than 8. # Then generate 15 more observations from this distribution and censor # observations less than 7. # Then estimate the 90th percentile and create a one-sided upper 95% # confidence interval for that percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(47) x.1 <- rnorm(15, mean = 10, sd = 2) sort(x.1) # [1] 6.343542 7.068499 7.828525 8.029036 8.155088 9.436470 # [7] 9.495908 10.030262 10.079205 10.182946 10.217551 10.370811 #[13] 10.987640 11.422285 13.989393 censored.1 <- x.1 < 8 x.1[censored.1] <- 8 x.2 <- rnorm(15, mean = 10, sd = 2) sort(x.2) # [1] 5.355255 6.065562 6.783680 6.867676 8.219412 8.593224 # [7] 9.319168 9.347066 9.837844 9.918844 10.055054 10.498296 #[13] 10.834382 11.341558 12.528482 censored.2 <- x.2 < 7 x.2[censored.2] <- 7 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) eqnormCensored(x, censored, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 7 8 # #Estimated Parameter(s): mean = 9.390624 # sd = 1.827156 # #Estimation Method: MLE # #Estimated Quantile(s): 90'th %ile = 11.73222 # #Quantile Estimation Method: Quantile(s) Based on # MLE Estimators # #Data: x # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 16.66667% # #Confidence Interval for: 90'th %ile # #Assumed Sample Size: 30 # #Confidence Interval Method: Exact for # Complete Data # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 12.63808 #---------- # Compare these results with the true 90'th percentile: qnorm(p = 0.9, mean = 10, sd = 2) #[1] 12.56310 #---------- # Clean up rm(x.1, censored.1, x.2, censored.2, x, censored) #========== # Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and standard deviation using the MLE, # and then construct an upper 95% confidence limit for the 90th percentile. # We will log-transform the original observations and then call # eqnormCensored. Alternatively, we could have more simply called # eqlnormCensored. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... 
#23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean, standard deviation, and 90th percentile # on the log-scale using the MLE, and construct an upper 95% # confidence limit for the 90th percentile on the log-scale: #--------------------------------------------------------------- est.list <- with(EPA.09.Ex.15.1.manganese.df, eqnormCensored(log(Manganese.ppb), Censored, p = 0.9, ci = TRUE, ci.type = "upper")) est.list #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 0.6931472 1.6094379 # #Estimated Parameter(s): mean = 2.215905 # sd = 1.356291 # #Estimation Method: MLE # #Estimated Quantile(s): 90'th %ile = 3.954062 # #Quantile Estimation Method: Quantile(s) Based on # MLE Estimators # #Data: log(Manganese.ppb) # #Censoring Variable: censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: 90'th %ile # #Assumed Sample Size: 25 # #Confidence Interval Method: Exact for # Complete Data # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 4.708904 # To estimate the 90th percentile on the original scale, # we need to exponentiate the results #------------------------------------------------------- exp(est.list$quantiles) #90'th %ile # 52.14674 exp(est.list$interval$limits) # LCL UCL # 0.0000 110.9305 #---------- # Clean up #--------- rm(est.list)
Estimate quantiles of a distribution, and optionally create confidence intervals for them, without making any assumptions about the form of the distribution.
eqnpar(x, p = 0.5, type = 7, ci = FALSE, lcl.rank = NULL, ucl.rank = NULL, lb = -Inf, ub = Inf, ci.type = "two-sided", ci.method = "interpolate", digits = getOption("digits"), approx.conf.level = 0.95, min.coverage = TRUE, tol = 0)
x |
a numeric vector of observations. Missing ( |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
type |
an integer between 1 and 9 indicating which algorithm to use to estimate the
quantile. The default value is |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
lcl.rank , ucl.rank
|
positive integers indicating the ranks of the order statistics that are used
for the lower and upper bounds of the confidence interval for the specified
quantile. Both arguments must be integers between 1 and the number of non-missing
values in |
lb , ub
|
scalars indicating lower and upper bounds on the distribution. By default, |
ci.type |
character string indicating what kind of confidence interval to compute.
The possible values are |
ci.method |
character string indicating the method to use to construct the confidence interval.
The possible values are |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
approx.conf.level |
a scalar between 0 and 1 indicating the desired confidence level of the confidence
interval. The default value is |
min.coverage |
for the case when |
tol |
for the case when |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Estimation
The function eqnpar
calls the R function quantile
to estimate quantiles.
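For example, the point estimate is simply the value returned by quantile with the default type = 7:

set.seed(250)
dat <- rcauchy(20, location = 0, scale = 1)

quantile(dat, probs = 0.75, type = 7)

# This should match the 75'th %ile of 1.524903 reported by
# eqnpar(dat, p = 0.75) in the EXAMPLES below.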
Confidence Intervals
Let x_1, x_2, …, x_n denote a sample of n independent and identically distributed random variables from some arbitrary distribution. Furthermore, let x_(i) denote the i'th order statistic for these n random variables. That is,

x_(1) ≤ x_(2) ≤ … ≤ x_(n)

Finally, let x_p denote the p'th quantile of the distribution, that is:

Pr(X < x_p) ≤ p,    Pr(X ≤ x_p) ≥ p

It can be shown (e.g., Conover, 1980, pp. 114-116) that for the i'th order statistic:

Pr[ X_(i) > x_p ] = F(i - 1; n, p)

for a continuous distribution and

Pr[ X_(i) > x_p ] ≤ F(i - 1; n, p) ≤ Pr[ X_(i) ≥ x_p ]

for a discrete distribution, where F(·; n, p) denotes the cumulative distribution function of a binomial random variable with parameters size=n and prob=p. These facts are used to construct confidence intervals for quantiles (see below).
Two-Sided Confidence Interval (ci.type="two-sided"
)
A two-sided nonparametric confidence interval for the p'th quantile is constructed as:

[ x_(u), x_(v) ]

where 1 ≤ u < v ≤ n. Note that the argument lcl.rank corresponds to u, and the argument ucl.rank corresponds to v.
This confidence interval has an associated confidence level that is at least as large as:

F(v - 1; n, p) - F(u - 1; n, p)

for a discrete distribution and exactly equal to this value for a continuous distribution. This is because, by the results for the order statistics given above:

Pr[ x_(u) ≤ x_p ≤ x_(v) ] ≥ F(v - 1; n, p) - F(u - 1; n, p)

with equality if the distribution is continuous.
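A worked check of this coverage formula: for a continuous distribution, the confidence level associated with the interval [ x_(u), x_(v) ] can be computed directly with pbinom. Using the sample size, probability, and ranks reported for ci.method="exact" in the EXAMPLES below:

n <- 20    # sample size
p <- 0.75  # quantile being estimated
u <- 12    # rank used for the lower confidence limit
v <- 19    # rank used for the upper confidence limit

pbinom(v - 1, size = n, prob = p) - pbinom(u - 1, size = n, prob = p)
#[1] 0.9347622

# This matches the 93.47622% confidence level reported by
# eqnpar(dat, p = 0.75, ci = TRUE, approx.conf.level = 0.9, ci.method = "exact").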
Exact Method (ci.method="exact"
)
When lcl.rank (u) and ucl.rank (v) are not supplied by the user, and ci.method="exact", u and v are initially chosen such that u is the smallest integer, and v the largest integer, whose associated coverage (computed from the binomial distribution function as described above) is consistent with the requested confidence level, where (1 - α) = approx.conf.level. The values of u and v are then each varied (with the restrictions u ≥ 1, u < v, and v ≤ n), and confidence levels computed for each of these combinations. If min.coverage=TRUE, the combination of u and v is selected that provides the closest coverage to approx.conf.level, with coverage greater than or equal to approx.conf.level. If min.coverage=FALSE, the combination of u and v is selected that provides the closest coverage to approx.conf.level, with coverage less than or equal to approx.conf.level + tol.
For this method, the confidence level associated with the confidence interval
is exact if the underlying distribution is continuous.
Approximate Method (ci.method="normal.approx")
Here the term “Approximate” simply refers to the method of initially choosing the ranks for the lower and upper bounds. As for ci.method="exact", the confidence level associated with the confidence interval is exact if the underlying distribution is continuous.
When lcl.rank (u) and ucl.rank (v) are not supplied by the user and ci.method="normal.approx", u and v are initially chosen using a normal approximation to the binomial distribution, based on the appropriate quantile of Student's t-distribution with n - 1 degrees of freedom and (1 - α) = approx.conf.level (Conover, 1980, p. 112). With the restrictions that u ≥ 1 and v ≤ n, u is rounded down to the nearest integer and v is rounded up to the nearest integer. Again, with the restriction that u ≥ 1, if the confidence level obtained after decreasing u by 1 is less than or equal to approx.conf.level, then u is decreased by 1. Once this has been checked, with the restriction that v ≤ n, if the confidence level obtained after increasing v by 1 is less than or equal to approx.conf.level, then v is increased by 1.
Interpolate Method (ci.method="interpolate"
)
Let (1 - α) denote the desired confidence level associated with the confidence interval for the p'th quantile. Based on the work of Hettmansperger and Sheather (1986), Nyblom (1992) showed that when the desired confidence level falls between the confidence levels attained by two adjacent order statistics, a one-sided confidence bound for the p'th quantile can be constructed by interpolating between those two order statistics, and the resulting bound has an associated confidence level approximately equal to the desired level for a wide range of distributions. Thus, to construct an approximate two-sided confidence interval for the p'th quantile with confidence level (1 - α), the lower bound of the two-sided confidence interval is computed by interpolating between the two adjacent order statistics whose one-sided lower confidence levels bracket the required level, and the upper bound of the two-sided confidence interval is computed by interpolating between the two adjacent order statistics whose one-sided upper confidence levels bracket the required level. The ranks of the order statistics used in the interpolation are computed by using ci.method="exact" with the argument min.coverage=TRUE.
One-Sided Lower Confidence Interval (ci.type="lower"
)
A one-sided lower nonparametric confidence interval for the p'th quantile is constructed as:

[ x_(u), ub ]

where ub denotes the value of the ub argument (the user-supplied upper bound).
Exact Method (ci.method="exact"
)
When lcl.rank (u) is not supplied by the user, and ci.method="exact", u is initially chosen as the smallest integer whose associated coverage (computed from the binomial distribution function as described above) is consistent with the requested confidence level, where (1 - α) = approx.conf.level. The value of u is then varied (with the restrictions u ≥ 1 and u ≤ n), and confidence levels computed for each of these values. If min.coverage=TRUE, the value of u is selected that provides the closest coverage to approx.conf.level, with coverage greater than or equal to approx.conf.level. If min.coverage=FALSE, the value of u is selected that provides the closest coverage to approx.conf.level, with coverage less than or equal to approx.conf.level + tol.
For this method, the confidence level associated with the confidence interval
is exact if the underlying distribution is continuous.
Approximate Method (ci.method="normal.approx")
When lcl.rank (u) is not supplied by the user and ci.method="normal.approx", u is initially chosen using a normal approximation to the binomial distribution. With the restrictions that u ≥ 1 and u ≤ n, if p is less than 0.5 then u is rounded up to the nearest integer, otherwise it is rounded down to the nearest integer. With the restriction that u ≥ 1, if the confidence level obtained after decreasing u by 1 is less than or equal to approx.conf.level, then u is decreased by 1.
Interpolate Method (ci.method="interpolate"
)
Let (1 - α) denote the desired confidence level associated with the confidence interval for the p'th quantile. To construct an approximate one-sided lower confidence interval for the p'th quantile with confidence level (1 - α), the lower bound of the confidence interval is computed by interpolating between the two adjacent order statistics whose one-sided lower confidence levels bracket (1 - α). The rank of the order statistic used in the interpolation is computed by using ci.method="exact" with the arguments ci.type="lower" and min.coverage=TRUE.
One-Sided Upper Confidence Interval (ci.type="upper"
)
A one-sided upper nonparametric confidence interval for the p'th quantile is constructed as:

[ lb, x_(v) ]

where lb denotes the value of the lb argument (the user-supplied lower bound).
Exact Method (ci.method="exact"
)
When ucl.rank (v) is not supplied by the user, and ci.method="exact", v is initially chosen as the largest integer whose associated coverage (computed from the binomial distribution function as described above) is consistent with the requested confidence level, where (1 - α) = approx.conf.level. The value of v is then varied (with the restrictions v ≥ 1 and v ≤ n), and confidence levels computed for each of these values. If min.coverage=TRUE, the value of v is selected that provides the closest coverage to approx.conf.level, with coverage greater than or equal to approx.conf.level. If min.coverage=FALSE, the value of v is selected that provides the closest coverage to approx.conf.level, with coverage less than or equal to approx.conf.level + tol.
For this method, the confidence level associated with the confidence interval
is exact if the underlying distribution is continuous.
Approximate Method (ci.method="normal.approx")
When ucl.rank (v) is not supplied by the user and ci.method="normal.approx", v is initially chosen using a normal approximation to the binomial distribution. With the restrictions that v ≥ 1 and v ≤ n, if p is greater than 0.5 then v is rounded down to the nearest integer, otherwise it is rounded up to the nearest integer. With the restriction that v ≤ n, if the confidence level obtained after increasing v by 1 is less than or equal to approx.conf.level, then v is increased by 1.
For this method, the confidence level associated with the confidence interval
is exact if the underlying distribution is continuous.
Interpolate Method (ci.method="interpolate"
)
Let (1 - α) denote the desired confidence level associated with the confidence interval for the p'th quantile. To construct an approximate one-sided upper confidence interval for the p'th quantile with confidence level (1 - α), the upper bound of the confidence interval is computed by interpolating between the two adjacent order statistics whose one-sided upper confidence levels bracket (1 - α). The rank of the order statistic used in the interpolation is computed by using ci.method="exact" with the arguments ci.type="upper" and min.coverage=TRUE.
Note on Value of Confidence Level
Because of the discrete nature of order statistics, when ci.method="exact"
or ci.method="normal.approx"
, the value of the confidence level returned by
eqnpar
will usually differ from the desired confidence level indicated by
the value of the argument approx.conf.level
. When ci.method="interpolate"
,
eqnpar
returns for the confidence level the value of the argument
approx.conf.level
. Nyblom (1992) and Hettmansperger and Sheather (1986) have
shown that the Interpolate method produces confidence intervals with confidence levels
quite close to the assumed confidence level for a wide range of distributions.
a list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
It can be shown (e.g., Conover, 1980, pp.119-121) that an upper confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to an upper β-content tolerance interval with coverage 100p% and confidence level 100(1-α)%. Also, a lower confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to a lower β-content tolerance interval with coverage 100(1-p)% and confidence level 100(1-α)%. See the help file for tolIntNpar for more information on nonparametric tolerance intervals.
Steven P. Millard ([email protected])
The author is grateful to Michael Höhle,
Department of Mathematics, Stockholm University
(http://www2.math.su.se/~hoehle) for making me aware of the work of Nyblom (1992),
and for suggesting improvements to the algorithm that was used in EnvStats
Version 2.1.1 to construct a confidence interval when ci.method="exact"
.
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Hettmansperger, T.P., and Sheather, S.J. (1986). Confidence Intervals Based on Interpolated Order Statistics. Statistics & Probability Letters, 4, 75–79.
Nyblom, J. (1992). Note on Interpolated Order Statistics. Statistics & Probability Letters, 14, 129–131.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
quantile
, tolIntNpar
,
Estimating Distribution Quantiles,
Tolerance Intervals, estimate.object
.
# The data frame ACE.13.TCE.df contains observations on # Trichloroethylene (TCE) concentrations (mg/L) at # 10 groundwater monitoring wells before and after remediation. # # Compute the median concentration for each period along with # a 95% confidence interval for the median. # # Before remediation: 20.3 [8.8, 35.9] # After remediation: 2.5 [0.8, 5.9] with(ACE.13.TCE.df, eqnpar(TCE.mg.per.L[Period=="Before"], ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- #Assumed Distribution: None #Estimated Quantile(s): Median = 20.3 #Quantile Estimation Method: Nonparametric #Data: TCE.mg.per.L[Period == "Before"] #Sample Size: 10 #Confidence Interval for: 50'th %ile #Confidence Interval Method: interpolate (Nyblom, 1992) #Confidence Interval Type: two-sided #Confidence Level: 95% #Confidence Limit Rank(s): 2 9 # 3 8 #Confidence Interval: LCL = 8.804775 # UCL = 35.874775 #---------- with(ACE.13.TCE.df, eqnpar(TCE.mg.per.L[Period=="After"], ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- #Assumed Distribution: None #Estimated Quantile(s): Median = 2.48 #Quantile Estimation Method: Nonparametric #Data: TCE.mg.per.L[Period == "After"] #Sample Size: 10 #Confidence Interval for: 50'th %ile #Confidence Interval Method: interpolate (Nyblom, 1992) #Confidence Interval Type: two-sided #Confidence Level: 95% #Confidence Limit Rank(s): 2 9 # 3 8 #Confidence Interval: LCL = 0.7810901 # UCL = 5.8763063 #========== # Generate 20 observations from a cauchy distribution with parameters # location=0, scale=1. The true 75th percentile of this distribution is 1. # Use eqnpar to estimate the 75th percentile and construct a 90% confidence interval. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rcauchy(20, location = 0, scale = 1) #------------------------------------------------------- # First, use the default method, ci.method="interpolate" #------------------------------------------------------- eqnpar(dat, p = 0.75, ci = TRUE, approx.conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 75'th %ile = 1.524903 # #Quantile Estimation Method: Nonparametric # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 75'th %ile # #Confidence Interval Method: interpolate (Nyblom, 1992) # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Limit Rank(s): 12 19 # 13 18 # #Confidence Interval: LCL = 0.8191423 # UCL = 2.1215570 #---------- #------------------------------------------------------------- # Now use ci.method="exact". # Note that the returned confidence level is greater than 90%. #------------------------------------------------------------- eqnpar(dat, p = 0.75, ci = TRUE, approx.conf.level = 0.9, ci.method = "exact") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 75'th %ile = 1.524903 # #Quantile Estimation Method: Nonparametric # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 75'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: two-sided # #Confidence Level: 93.47622% # #Confidence Limit Rank(s): 12 19 # #Confidence Interval: LCL = 0.7494692 # UCL = 2.2156601 #---------- #---------------------------------------------------------- # Now use ci.method="exact" with min.coverage=FALSE. 
# Note that the returned confidence level is less than 90%. #---------------------------------------------------------- eqnpar(dat, p = 0.75, ci = TRUE, approx.conf.level = 0.9, ci.method = "exact", min.coverage = FALSE, ) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 75'th %ile = 1.524903 # #Quantile Estimation Method: Nonparametric # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 75'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: two-sided # #Confidence Level: 89.50169% # #Confidence Limit Rank(s): 13 20 # #Confidence Interval: LCL = 1.018038 # UCL = 5.002399 #---------- #----------------------------------------------------------- # Now supply our own bounds for the confidence interval. # The first example above based on the Interpolate method # used lcl.rank=12, ucl.rank=19 and lcl.rank=13, ucl.rank=18 # and interpolated between these two confidence intervals. # Here we will specify lcl.rank=13 and ucl.rank=18. The # resulting confidence level is 81%. #----------------------------------------------------------- eqnpar(dat, p = 0.75, ci = TRUE, lcl.rank = 13, ucl.rank = 18) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 75'th %ile = 1.524903 # #Quantile Estimation Method: Nonparametric # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 75'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: two-sided # #Confidence Level: 80.69277% # #Confidence Limit Rank(s): 13 18 # #Confidence Interval: LCL = 1.018038 # UCL = 2.071172 #---------- # Clean up rm(dat) #========== # Modify Example 17-4 on page 17-21 of USEPA (2009). This example uses # copper concentrations (ppb) from 3 background wells to set an upper # limit for 2 compliance wells. Here we will attempt to compute an upper # 95% confidence interval for the 95'th percentile of the distribution of # copper concentrations in the background wells. # # The data are stored in EPA.92c.copper2.df. # # Note that even though these data are Type I left singly censored, # it is still possible to compute an estimate of the 95'th percentile. EPA.92c.copper2.df # Copper.orig Copper Censored Month Well Well.type #1 <5 5.0 TRUE 1 1 Background #2 <5 5.0 TRUE 2 1 Background #3 7.5 7.5 FALSE 3 1 Background #... #9 9.2 9.2 FALSE 1 2 Background #10 <5 5.0 TRUE 2 2 Background #11 <5 5.0 TRUE 3 2 Background #... #17 <5 5.0 TRUE 1 3 Background #18 5.4 5.4 FALSE 2 3 Background #19 6.7 6.7 FALSE 3 3 Background #... #29 6.2 6.2 FALSE 5 4 Compliance #30 <5 5.0 TRUE 6 4 Compliance #31 7.8 7.8 FALSE 7 4 Compliance #... #38 <5 5.0 TRUE 6 5 Compliance #39 5.6 5.6 FALSE 7 5 Compliance #40 <5 5.0 TRUE 8 5 Compliance # Because of the small sample size of n=24 observations, it is not possible # to create a nonparametric confidence interval for the 95th percentile # that has an associated confidence level of 95%. If we tried to do this, # we would get an error message: # with(EPA.92c.copper2.df, # eqnpar(Copper[Well.type=="Background"], p = 0.95, ci = TRUE, lb = 0, # ci.type = "upper", approx.conf.level = 0.95)) # #Error in ci.qnpar.interpolate(x = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, : # Minimum coverage of 0.95 is not possible with the given sample size. # So instead, we will use ci.method="exact" with min.coverage=FALSE # to construct the confidence interval. 
Note that the associated # confidence level is only 71%. with(EPA.92c.copper2.df, eqnpar(Copper[Well.type=="Background"], p = 0.95, ci = TRUE, ci.method = "exact", min.coverage = FALSE, ci.type = "upper", lb = 0, approx.conf.level = 0.95)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 95'th %ile = 7.925 # #Quantile Estimation Method: Nonparametric # #Data: Copper[Well.type == "Background"] # #Sample Size: 24 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: upper # #Confidence Level: 70.8011% # #Confidence Limit Rank(s): NA 24 # #Confidence Interval: LCL = 0.0 # UCL = 9.2 #---------- # For the above example, the true confidence level is 71% instead of 95%. # This is a function of the small sample size. In fact, as Example 17-4 on # pages 17-21 of USEPA (2009) shows, the largest quantile for which you can # construct a nonparametric confidence interval that will have associated # confidence level of 95% is the 88'th percentile: with(EPA.92c.copper2.df, eqnpar(Copper[Well.type=="Background"], p = 0.88, ci = TRUE, ci.type = "upper", lb = 0, ucl.rank = 24, approx.conf.level = 0.95)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 88'th %ile = 6.892 # #Quantile Estimation Method: Nonparametric # #Data: Copper[Well.type == "Background"] # #Sample Size: 24 # #Confidence Interval for: 88'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: upper # #Confidence Level: 95.3486% # #Confidence Limit Rank(s): NA 24 # #Confidence Interval: LCL = 0.0 # UCL = 9.2 #========== # Reproduce Example 21-6 on pages 21-21 to 21-22 of USEPA (2009). # Use 12 measurements of nitrate (mg/L) at a well used for drinking water # to determine with 95% confidence whether or not the infant-based, acute # risk standard of 10 mg/L has been violated. Assume that the risk # standard represents an upper 95'th percentile limit on nitrate # concentrations. So what we need to do is construct a one-sided # lower nonparametric confidence interval for the 95'th percentile # that has associated confidence level of no more than 95%, and we will # compare the lower confidence limit with the MCL of 10 mg/L. # # The data for this example are stored in EPA.09.Ex.21.6.nitrate.df. # Look at the data: #------------------ EPA.09.Ex.21.6.nitrate.df # Sampling.Date Date Nitrate.mg.per.l.orig Nitrate.mg.per.l Censored #1 7/28/1999 1999-07-28 <5.0 5.0 TRUE #2 9/3/1999 1999-09-03 12.3 12.3 FALSE #3 11/24/1999 1999-11-24 <5.0 5.0 TRUE #4 5/3/2000 2000-05-03 <5.0 5.0 TRUE #5 7/14/2000 2000-07-14 8.1 8.1 FALSE #6 10/31/2000 2000-10-31 <5.0 5.0 TRUE #7 12/14/2000 2000-12-14 11 11.0 FALSE #8 3/27/2001 2001-03-27 35.1 35.1 FALSE #9 6/13/2001 2001-06-13 <5.0 5.0 TRUE #10 9/16/2001 2001-09-16 <5.0 5.0 TRUE #11 11/26/2001 2001-11-26 9.3 9.3 FALSE #12 3/2/2002 2002-03-02 10.3 10.3 FALSE # Determine what order statistic to use for the lower confidence limit # in order to achieve no more than 95% confidence. 
#--------------------------------------------------------------------- conf.levels <- ciNparConfLevel(n = 12, p = 0.95, lcl.rank = 1:12, ci.type = "lower") names(conf.levels) <- 1:12 round(conf.levels, 2) # 1 2 3 4 5 6 7 8 9 10 11 12 #1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.88 0.54 # Using the 11'th largest observation for the lower confidence limit # yields a confidence level of 88%. Using the 10'th largest # observation yields a confidence level of 98%. The example in # USEPA (2009) uses the 10'th largest observation. # # The 10'th largest observation is 11 mg/L which exceeds the # MCL of 10 mg/L, so there is evidence of contamination. #-------------------------------------------------------------------- with(EPA.09.Ex.21.6.nitrate.df, eqnpar(Nitrate.mg.per.l, p = 0.95, ci = TRUE, ci.type = "lower", lcl.rank = 10)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 95'th %ile = 22.56 # #Quantile Estimation Method: Nonparametric # #Data: Nitrate.mg.per.l # #Sample Size: 12 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: lower # #Confidence Level: 98.04317% # #Confidence Limit Rank(s): 10 NA # #Confidence Interval: LCL = 11 # UCL = Inf #========== # Clean up #--------- rm(conf.levels)
Estimate quantiles of a Pareto distribution.
eqpareto(x, p = 0.5, method = "mle", plot.pos.con = 0.375, digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Pareto distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
Possible values are
|
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the values of the empirical cdf. The default value is
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqpareto
returns estimated quantiles as well as
estimates of the location and shape parameters.
Quantiles are estimated by 1) estimating the location and shape parameters by
calling epareto, and then 2) calling the function qpareto and using the
estimated values for location and shape.
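As a brief sketch of this two-step procedure (assuming, as for other EnvStats estimating functions, that the fitted parameters are stored in the parameters component of the object returned by epareto):

# Illustrative sketch of the two-step procedure described above.
set.seed(250)
dat <- rpareto(30, location = 1, shape = 1)
est <- epareto(dat)                       # step 1: estimate location and shape
parms <- est$parameters
qpareto(0.9, location = parms["location"],
  shape = parms["shape"])                 # step 2: plug-in quantile
# This matches the 90'th percentile reported by eqpareto(dat, p = 0.9)
# in the Examples section below.
rm(dat, est, parms)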
If x
is a numeric vector, eqpareto
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqpareto
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The Pareto distribution is named after Vilfredo Pareto (1848-1923), a professor of economics. It is derived from Pareto's law, which states that the number of persons having income greater than or equal to x is given by:
k x^(-theta)
where theta denotes Pareto's constant and is the shape parameter for the probability distribution.
The Pareto distribution takes values on the positive real line. All values must be larger than the “location” parameter eta, which is really a threshold parameter. There are three kinds of Pareto distributions. The one described here is the Pareto distribution of the first kind. Stable Pareto distributions have 0 < theta < 2. Note that the r'th moment only exists if r < theta.
The Pareto distribution is related to the exponential distribution and logistic distribution as follows. Let X denote a Pareto random variable with location=eta and shape=theta. Then log(X/eta) has an exponential distribution with parameter rate=theta, and -log[(X/eta)^theta - 1] has a logistic distribution with parameters location=0 and scale=1.
The Pareto distribution has a very long right-hand tail. It is often applied in the study of socioeconomic data, including the distribution of income, firm size, population, and stock price fluctuations.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
epareto
, Pareto, estimate.object
.
# Generate 30 observations from a Pareto distribution with # parameters location=1 and shape=1 then estimate the parameters # and the 90'th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rpareto(30, location = 1, shape = 1) eqpareto(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Pareto # #Estimated Parameter(s): location = 1.009046 # shape = 1.079850 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 8.510708 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 30 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a Poisson distribution, and optionally construct a confidence interval for a quantile.
eqpois(x, p = 0.5, method = "mle/mme/mvue", ci = FALSE, ci.method = "exact", ci.type = "two-sided", conf.level = 0.95, digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Poisson distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the mean. Currently the
only possible value is "mle/mme/mvue" (the maximum likelihood, method of moments, and minimum variance unbiased estimators of the mean all coincide). |
ci |
logical scalar indicating whether to compute a confidence interval for the
specified quantile. The default value is ci=FALSE. |
ci.method |
character string indicating what method to use to construct the confidence
interval for the quantile. The only possible value is "exact" (the default). |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are "two-sided" (the default), "lower", and "upper". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is conf.level=0.95. |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqpois
returns estimated quantiles as well as
the estimate of the mean parameter.
Estimation
Let X denote a Poisson random variable with parameter lambda. Let x(p) denote the p'th quantile of the distribution. That is, x(p) satisfies
Pr[X < x(p)] <= p <= Pr[X <= x(p)]
Note that due to the discrete nature of the Poisson distribution, there will be several values of p associated with one value of x(p). For example, for lambda=2, the value 1 is the p'th quantile for any value of p between 0.14 and 0.406.
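A quick base R check of this example:

# For lambda = 2, the value 1 is the p'th quantile for any
# p in (ppois(0, 2), ppois(1, 2)] = (0.135, 0.406].
ppois(0:1, lambda = 2)
#[1] 0.1353353 0.4060058
qpois(c(0.2, 0.4, 0.41), lambda = 2)
#[1] 1 1 2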
Let x = (x1, x2, ..., xn) denote a vector of n observations from a Poisson distribution with parameter lambda. The p'th quantile is estimated as the p'th quantile from a Poisson distribution assuming the true value of lambda is equal to the estimated value of lambda. That is, the estimated quantile is
q(p, lambda.hat)
where q(p, lambda) denotes the p'th quantile of a Poisson distribution with parameter lambda, and lambda.hat denotes the sample mean.
Because the sample mean is the maximum likelihood estimator of lambda (see the help file for epois), the estimated quantile is also the maximum likelihood estimator.
Quantiles are estimated by 1) estimating the mean parameter by
calling epois
, and then 2) calling the function
qpois
and using the estimated value for
the mean parameter.
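A brief sketch of this in base R (the values shown correspond to the Poisson example in the Examples section below):

# The estimated p'th quantile is the p'th quantile of a Poisson distribution
# with lambda set to the sample mean (the maximum likelihood estimate).
set.seed(250)
dat <- rpois(20, lambda = 2)
lambda.hat <- mean(dat)            # 1.8 for this seed
qpois(0.9, lambda = lambda.hat)
#[1] 4  -- the same value reported by eqpois(dat, p = 0.9) below
rm(dat, lambda.hat)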
Confidence Intervals
It can be shown (e.g., Conover, 1980, pp.119-121) that an upper confidence interval for the p'th quantile with confidence level 100(1-alpha)% is equivalent to an upper beta-content tolerance interval with coverage 100p% and confidence level 100(1-alpha)%. Also, a lower confidence interval for the p'th quantile with confidence level 100(1-alpha)% is equivalent to a lower beta-content tolerance interval with coverage 100(1-p)% and confidence level 100(1-alpha)%.
Thus, based on the theory of tolerance intervals for a Poisson distribution (see tolIntPois), if ci.type="upper", a one-sided upper 100(1-alpha)% confidence interval for the p'th quantile is constructed as:
[0, q(p, UCL)]
where q(p, lambda) denotes the p'th quantile of a Poisson distribution with parameter lambda, and UCL denotes the one-sided upper 100(1-alpha)% confidence limit for lambda (see the help file for epois for information on how UCL is computed).
Similarly, if ci.type="lower", a one-sided lower 100(1-alpha)% confidence interval for the p'th quantile is constructed as:
[q(p, LCL), Inf]
where LCL denotes the one-sided lower 100(1-alpha)% confidence limit for lambda (see the help file for epois for information on how LCL is computed).
Finally, if ci.type="two-sided", a two-sided 100(1-alpha)% confidence interval for the p'th quantile is constructed as:
[q(p, LCL), q(p, UCL)]
where LCL and UCL denote the two-sided lower and upper 100(1-alpha)% confidence limits for lambda (see the help file for epois for information on how LCL and UCL are computed).
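A rough sketch of the two-sided construction (this assumes that epois accepts ci and conf.level arguments analogous to those of eqpois and stores the limits in the interval component of the returned object; see the help file for epois):

# Two-sided limits for the 90'th percentile: the 90'th percentiles of Poisson
# distributions with lambda set to the confidence limits for lambda.
set.seed(250)
dat <- rpois(20, lambda = 2)
lambda.ci <- epois(dat, ci = TRUE, conf.level = 0.95)$interval$limits
qpois(0.9, lambda = lambda.ci)   # compare with LCL = 3, UCL = 5 from eqpois()
rm(dat, lambda.ci)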
If x
is a numeric vector, eqpois
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqpois
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) state:
The U.S. EPA has specifications for air quality monitoring that are, in effect, percentile limitations. ... The U.S. EPA has provided guidance for setting aquatic standards on toxic chemicals that require estimating 99th percentiles and using this statistic to make important decisions about monitoring and compliance. They have also used the 99th percentile to establish maximum daily limits for industrial effluents (e.g., pulp and paper).
Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
The Poisson distribution is named after Siméon Denis Poisson, who derived this distribution as the limiting distribution of the binomial distribution with parameters size=n and prob=p, where n tends to infinity, p tends to 0, and n*p stays constant.
In this context, the Poisson distribution was used by Bortkiewicz (1898) to model the number of deaths (per annum) from kicks by horses in Prussian Army Corps. In this case, p, the probability of death from this cause, was small, but the number of soldiers exposed to this risk, n, was large.
The Poisson distribution has been applied in a variety of fields, including quality control (modeling number of defects produced in a process), ecology (number of organisms per unit area), and queueing theory. Gibbons (1987b) used the Poisson distribution to model the number of detected compounds per scan of the 32 volatile organic priority pollutants (VOC), and also to model the distribution of chemical concentration (in ppb).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Berthouex, P.M., and I. Hau. (1991). Difficulties Related to Using Extreme Percentiles for Water Quality Regulations. Research Journal of the Water Pollution Control Federation 63(6), 873–879.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, Chapter 3.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gibbons, R.D. (1987b). Statistical Models for the Analysis of Volatile Organic Compounds in Waste Disposal Sites. Ground Water 25, 572-580.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 4.
Pearson, E.S., and H.O. Hartley, eds. (1970). Biometrika Tables for Statisticians, Volume 1. Cambridge Universtiy Press, New York, p.81.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
epois
, Poisson, estimate.object
.
# Generate 20 observations from a Poisson distribution with parameter # lambda=2. The true 90'th percentile of this distribution is 4 (actually, # 4 is the p'th quantile for any value of p between 0.86 and 0.947). # Here we will use eqpois to estimate the 90'th percentile and construct a # two-sided 95% confidence interval for this percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rpois(20, lambda = 2) eqpois(dat, p = 0.9, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 1.8 # #Estimation Method: mle/mme/mvue # #Estimated Quantile(s): 90'th %ile = 4 # #Quantile Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 3 # UCL = 5 # Clean up #--------- rm(dat)
Estimate quantiles of a uniform distribution.
equnif(x, p = 0.5, method = "mle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a uniform distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
The possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function equnif
returns estimated quantiles as well as
estimates of the location and scale parameters.
Quantiles are estimated by 1) estimating the location and scale parameters by
calling eunif
, and then 2) calling the function
qunif
and using the estimated values for
location and scale.
If x
is a numeric vector, equnif
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, equnif
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The uniform distribution (also called the rectangular
distribution) with parameters min
and max
takes on values on the
real line between min
and max
with equal probability. It has been
used to represent the distribution of round-off errors in tabulated values. Another
important application is the probability integral transform: for any continuous random variable, the value of its cumulative distribution function (cdf), evaluated at that random variable, follows a uniform distribution with parameters min=0 and max=1.
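A minimal base R sketch of the probability integral transform mentioned above:

# Values of a continuous cdf evaluated at draws from that distribution
# behave like Uniform(0, 1) observations.
set.seed(123)
x <- rnorm(1000, mean = 10, sd = 3)    # any continuous distribution
u <- pnorm(x, mean = 10, sd = 3)       # cdf evaluated at the observations
summary(u)                             # roughly uniform on (0, 1)
ks.test(u, "punif")                    # consistent with uniformity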
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
eunif
, Uniform, estimate.object
.
# Generate 20 observations from a uniform distribution with parameters # min=-2 and max=3, then estimate the parameters via maximum likelihood # and estimate the 90th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- runif(20, min = -2, max = 3) equnif(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Uniform # #Estimated Parameter(s): min = -1.574529 # max = 2.837006 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 2.395852 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 #---------- # Clean up rm(dat)
Estimate quantiles of a Weibull distribution.
eqweibull(x, p = 0.5, method = "mle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Weibull distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqweibull
returns estimated quantiles as well as
estimates of the shape and scale parameters.
Quantiles are estimated by 1) estimating the shape and scale parameters by
calling eweibull
, and then 2) calling the function
qweibull
and using the estimated values for
shape and scale.
If x
is a numeric vector, eqweibull
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqweibull
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The Weibull distribution is named after the Swedish physicist Waloddi Weibull, who used this distribution to model breaking strengths of materials. The Weibull distribution has been extensively applied in the fields of reliability and quality control.
The exponential distribution is a special case of the Weibull distribution: a Weibull random variable with parameters shape=1 and scale=beta is equivalent to an exponential random variable with parameter rate=1/beta.
The Weibull distribution is related to the Type I extreme value (Gumbel) distribution as follows: if X is a random variable from a Weibull distribution with parameters shape=alpha and scale=beta, then -log(X) is a random variable from an extreme value distribution with parameters location=-log(beta) and scale=1/alpha.
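These relationships can be checked by comparing quantiles. A minimal sketch (the second check assumes qevd is the EnvStats quantile function for the extreme value distribution, with location and scale arguments):

# 1) Weibull(shape = 1, scale = b) equals Exponential(rate = 1/b):
b <- 3
p <- c(0.1, 0.5, 0.9)
qweibull(p, shape = 1, scale = b)
qexp(p, rate = 1/b)
# 2) If X ~ Weibull(shape = a, scale = b), then -log(X) follows an extreme
#    value distribution with location = -log(b) and scale = 1/a
#    (note -log() is decreasing, so p maps to 1 - p):
a <- 2
-log(qweibull(1 - p, shape = a, scale = b))
qevd(p, location = -log(b), scale = 1/a)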
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
eweibull
, Weibull, Exponential,
EVD, estimate.object
.
# Generate 20 observations from a Weibull distribution with parameters # shape=2 and scale=3, then estimate the parameters via maximum likelihood, # and estimate the 90'th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rweibull(20, shape = 2, scale = 3) eqweibull(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Weibull # #Estimated Parameter(s): shape = 2.673098 # scale = 3.047762 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 4.163755 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a zero-modified lognormal distribution or a zero-modified lognormal distribution (alternative parameterization).
eqzmlnorm(x, p = 0.5, method = "mvue", digits = 0) eqzmlnormAlt(x, p = 0.5, method = "mvue", digits = 0)
x |
a numeric vector of non-negative observations, or an object resulting from a call to an estimating function that assumes a zero-modified lognormal distribution (e.g., ezmlnorm or ezmlnormAlt). |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimation. The only possible value is
"mvue" (the minimum variance unbiased estimator; the default). |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The functions eqzmlnorm
and eqzmlnormAlt
return estimated quantiles
as well as estimates of the distribution parameters.
Quantiles are estimated by:
estimating the distribution parameters by calling ezmlnorm
or
ezmlnormAlt
, and then
calling the function qzmlnorm
or
qzmlnormAlt
and using the estimated
distribution parameters.
If x
is a numeric vector, eqzmlnorm
and eqzmlnormAlt
return a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqzmlnorm
and
eqzmlnormAlt
return a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The zero-modified lognormal (delta) distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit” (the nondetects are assumed equal to 0). See, for example, Gilliom and Helsel (1986), Owen and DeRouen (1980), and Gibbons et al. (2009, Chapter 12). USEPA (2009, Chapter 15) recommends this strategy only in specific situations, and Helsel (2012, Chapter 1) strongly discourages this approach to dealing with non-detects.
A variation of the zero-modified lognormal (delta) distribution is the zero-modified normal distribution, in which a normal distribution is mixed with a positive probability mass at 0.
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901–908.
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special reference to its uses in economics). Cambridge University Press, London. pp.94-99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, pp.47–51.
Gibbons, RD., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707–719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
ezmlnorm
, Zero-Modified Lognormal,
ezmlnormAlt
,
Zero-Modified Lognormal (Alternative Parameterization),
Zero-Modified Normal, Lognormal.
# Generate 100 observations from a zero-modified lognormal (delta) # distribution with mean=2, cv=1, and p.zero=0.5, then estimate the # parameters and also the 80'th and 90'th percentiles. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rzmlnormAlt(100, mean = 2, cv = 1, p.zero = 0.5) eqzmlnormAlt(dat, p = c(0.8, 0.9)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Lognormal (Delta) # #Estimated Parameter(s): mean = 1.9604561 # cv = 0.9169411 # p.zero = 0.4500000 # mean.zmlnorm = 1.0782508 # cv.zmlnorm = 1.5307175 # #Estimation Method: mvue # #Estimated Quantile(s): 80'th %ile = 1.897451 # 90'th %ile = 2.937976 # #Quantile Estimation Method: Quantile(s) Based on # mvue Estimators # #Data: dat # #Sample Size: 100 #---------- # Compare the estimated quatiles with the true quantiles qzmlnormAlt(mean = 2, cv = 1, p.zero = 0.5, p = c(0.8, 0.9)) #[1] 1.746299 2.849858 #---------- # Clean up rm(dat)
Estimate quantiles of a zero-modified normal distribution.
eqzmnorm(x, p = 0.5, method = "mvue", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a zero-modified normal distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
Currently, the only possible
value is "mvue" (the minimum variance unbiased estimator; the default). |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqzmnorm
returns estimated quantiles as well as
estimates of the distribution parameters.
Quantiles are estimated by 1) estimating the distribution parameters by
calling ezmnorm
, and then 2) calling the function
qzmnorm
and using the estimated values for
the distribution parameters.
If x
is a numeric vector, eqzmnorm
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqzmnorm
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The zero-modified normal distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit”. See, for example USEPA (1992c, pp.27-34). In most cases, however, the zero-modified lognormal (delta) distribution will be more appropriate, since chemical concentrations are bounded below at 0 (e.g., Gilliom and Helsel, 1986; Owen and DeRouen, 1980).
Once you estimate the parameters of the zero-modified normal distribution, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with a confidence interval.
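A brief sketch of how this might be done (this assumes that ezmnorm accepts ci and conf.level arguments and stores the limits in the interval component of the returned object; see the help file for ezmnorm):

# Estimate the zero-modified normal parameters and request a confidence
# interval for the mean.
set.seed(250)
dat <- rzmnorm(100, mean = 4, sd = 2, p.zero = 0.5)
est <- ezmnorm(dat, ci = TRUE, conf.level = 0.95)
est$interval$limits    # lower and upper confidence limits for the mean
rm(dat, est)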
One way to try to assess whether a
zero-modified lognormal (delta),
zero-modified normal, censored normal, or
censored lognormal is the best model for the data is to construct both
censored and detects-only probability plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901–908.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707–719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
ZeroModifiedNormal, Normal,
ezmlnorm
, ZeroModifiedLognormal, estimate.object
.
# Generate 100 observations from a zero-modified normal distribution # with mean=4, sd=2, and p.zero=0.5, then estimate the parameters and # the 80th and 90th percentiles. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rzmnorm(100, mean = 4, sd = 2, p.zero = 0.5) eqzmnorm(dat, p = c(0.8, 0.9)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Normal # #Estimated Parameter(s): mean = 4.037732 # sd = 1.917004 # p.zero = 0.450000 # mean.zmnorm = 2.220753 # sd.zmnorm = 2.465829 # #Estimation Method: mvue # #Estimated Quantile(s): 80'th %ile = 4.706298 # 90'th %ile = 5.779250 # #Quantile Estimation Method: Quantile(s) Based on # mvue Estimators # #Data: dat # #Sample Size: 100 #---------- # Compare the estimated quantiles with the true quantiles qzmnorm(mean = 4, sd = 2, p.zero = 0.5, p = c(0.8, 0.9)) #[1] 4.506694 5.683242 #---------- # Clean up rm(dat)
Plot pointwise error bars given their upper and lower limits.
The errorBar
function is a modified version of the S function
error.bar
. The EnvStats function errorBar
includes the
additional arguments draw.lower
, draw.upper
, gap.size
,
bar.ends.size
, and col
to determine whether both the lower and
upper error bars are drawn and to control the size of the gaps, the size of the bar
ends, and the color of the bars.
errorBar(x, y = NULL, lower, upper, incr = TRUE, draw.lower = TRUE, draw.upper = TRUE, bar.ends = TRUE, gap = TRUE, add = FALSE, horizontal = FALSE, gap.size = 0.75, bar.ends.size = 1, col = 1, ..., xlab = deparse(substitute(x)), xlim, ylim)
x , y
|
coordinates of points. The coordinates can be given by two vector arguments or by a single vector. Missing values (NA) are allowed. |
lower |
pointwise lower limits of the error bars. This may be a single number or a vector
the same length as |
upper |
pointwise upper limits of the error bars. This may be a single number or a vector the
same length as |
incr |
logical scalar indicating whether the values in |
draw.lower |
logical scalar indicating whether to draw the lower error bar.
The default is |
draw.upper |
logical scalar indicating whether to draw the upper error bar.
The default is |
bar.ends |
logical scalar indicating whether flat bars should be drawn at the endpoints. The
default is |
gap |
logical scalar indicating whether gaps should be left around the points to emphasize
their locations. The default is |
add |
logical scalar indicating whether error bars should be added to the current plot.
If |
horizontal |
logical scalar indicating whether the error bars should be oriented horizontally
( |
gap.size |
numeric scalar controlling the width of the gap. |
bar.ends.size |
numeric scalar controlling the length of the bar ends. |
col |
numeric or character vector indicating the color(s) of the bars. |
xlab , xlim , ylim , ...
|
additional graphical parameters (see |
errorBar
creates a plot of y
versus x
with pointwise error bars.
errorBar
invisibly returns a list with the following components:
group.centers |
numeric vector of values on the group axis (the |
group.stats |
a matrix with the number of rows equal to the number of groups and three columns indicating the group location parameter (Center), the lower limit for the error bar (Lower), and the upper limit for the error bar (Upper). |
Authors of S (for code for error.bar
in S).
Steven P. Millard ([email protected])
Cleveland, W.S. (1994). The Elements of Graphing Data. Hobart Press, Summit, New Jersey.
plot
, segments
, pointwise
,
stripChart
.
# The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. # # Using the log-transformed data, create # # 1. A dynamite plot (bar plot showing mean plus 1 SE) # # 2. A confidence interval plot. TcCB.mat <- summaryStats(TcCB ~ Area, data = EPA.94b.tccb.df, se = TRUE, ci = TRUE) Means <- TcCB.mat[, "Mean"] SEs <- TcCB.mat[, "SE"] LCLs <- TcCB.mat[, "95%.LCL"] UCLs <- TcCB.mat[, "95%.UCL"] # Dynamite Plot #-------------- dev.new() group.centers <- barplot(Means, col = c("red", "blue"), ylim = range(0, Means, Means + SEs), ylab = "TcCB (ppb)", main = "Dynamite Plot for TcCB Data") errorBar(x = as.vector(group.centers), y = Means, lower = SEs, draw.lower = FALSE, gap = FALSE, col = c("red", "blue"), add = TRUE) # Confidence Interval Plot #------------------------- xlim <- par("usr")[1:2] dev.new() errorBar(x = as.vector(group.centers), y = Means, lower = LCLs, upper = UCLs, incr = FALSE, gap = FALSE, col = c("red", "blue"), xlim = xlim, xaxt = "n", xlab = "", ylab = "TcCB (ppb)", main = "Confidence Interval Plot for TcCB Data") axis(1, at = group.centers, labels = dimnames(TcCB.mat)[[1]]) # Clean up #--------- rm(TcCB.mat, Means, SEs, LCLs, UCLs, group.centers, xlim) graphics.off()
Objects of S3 class "estimate"
are returned by any of the
EnvStats functions that estimate the parameters or quantiles of a
probability distribution and optionally construct confidence,
prediction, or tolerance intervals based on a sample of data
assumed to come from that distribution.
Objects of S3 class "estimate"
are lists that contain
information about the estimated distribution parameters,
quantiles, and intervals. The names of the EnvStats
functions that produce objects of class "estimate"
have the following forms:
Form of Function Name | Result |
e abb |
Parameter Estimation |
eq abb |
Quantile Estimation |
predInt Abb |
Prediction Interval |
tolInt Abb |
Tolerance Interval |
where abb denotes the abbreviation of the name of a
probability distribution (see the help file for
Distribution.df
for a list of available probability
distributions and their abbreviations), and Abb denotes the
same thing as abb except the first letter of the abbreviation
for the probability distribution is capitalized.
See the help files Estimating Distribution Parameters and Estimating Distribution Quantiles for lists of functions that estimate distribution parameters and quantiles. See the help files Prediction Intervals and Tolerance Intervals for lists of functions that create prediction and tolerance intervals.
For example:
The function enorm
returns an object of class
"estimate"
(a list) with information about the estimated
mean and standard deviation of the assumed normal (Gaussian)
distribution, as well as an optional confidence interval for
the mean.
The function eqnorm
returns a list of class
"estimate"
with information about the estimated mean and
standard deviation of the assumed normal distribution, the
estimated user-specified quantile(s), and an optional confidence
interval for a single quantile.
The function predIntNorm
returns a list of class
"estimate"
with information about the estimated mean and
standard deviation of the assumed normal distribution, along with a
prediction interval for a user-specified number of future
observations (or means, medians, or sums).
The function tolIntNorm
returns a list of class
"estimate"
with information about the estimated mean and
standard deviation of the assumed normal distribution, along with a
tolerance interval.
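As a minimal sketch of this naming convention (the simulated vector dat below is hypothetical and used only for illustration), each of the following calls returns an object of class "estimate" for the normal distribution:

# Simulated data, for illustration only
set.seed(47)
dat <- rnorm(25, mean = 10, sd = 2)

enorm(dat)            # e + abb:       parameter estimation (mean and sd)
eqnorm(dat, p = 0.9)  # eq + abb:      estimate of the 90th percentile
predIntNorm(dat)      # predInt + Abb: prediction interval for a future observation
tolIntNorm(dat)       # tolInt + Abb:  tolerance interval

class(enorm(dat))
#[1] "estimate"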
Required Components
The following components must be included in a legitimate list of
class "estimate"
.
distribution |
character string indicating the name of the
assumed distribution (this equals |
sample.size |
numeric scalar indicating the sample size used to estimate the parameters or quantiles. |
data.name |
character string indicating the name of the data object used to compute the estimated parameters or quantiles. |
bad.obs |
numeric scalar indicating the number of missing ( |
Optional Components
The following components may optionally be included in a legitimate
list of class "estimate"
.
parameters |
(parametric estimation only) a numeric vector with a names attribute containing the names and values of the estimated distribution parameters. |
n.param.est |
(parametric estimation only) a scalar indicating the number of distribution parameters estimated. |
method |
(parametric estimation only) a character string indicating the method used to compute the estimated parameters. |
quantiles |
a numeric vector of estimated quantiles. |
quantile.method |
a character string indicating the method of quantile estimation. |
interval |
a list of class |
All lists of class "intervalEstimate"
contain the following
component:
name |
a character string indicating the kind of interval.
Possible values are: |
The number and names of the other components in a list of class
"intervalEstimate"
depend on the kind of interval it is.
These components may include:
parameter |
a character string indicating the parameter for
which the interval is constructed (e.g., |
limits |
a numeric vector containing the lower and upper bounds of the interval. |
type |
the type of interval (i.e., |
method |
the method used to construct the interval
(e.g., |
conf.level |
the confidence level associated with the interval. |
sample.size |
the sample size associated with the interval. |
dof |
(parametric intervals only) the degrees of freedom associated with the interval. |
limit.ranks |
(nonparametric intervals only) the rank(s) of the order statistic(s) used to construct the interval. |
m |
(prediction intervals only) the total number of future
observations ( |
k |
(prediction intervals only) the minimum number of future
observations |
n.mean |
(prediction intervals only) the sample size associated with the future averages that should be contained in the interval. |
n.median |
(prediction intervals only) the sample size associated with the future medians that should be contained in the interval. |
n.sum |
(Poisson prediction intervals only) the sample size associated with the future sums that should be contained in the interval. |
rule |
(simultaneous prediction intervals only) the rule used to construct the simultaneous prediction interval. |
delta.over.sigma |
(simultaneous prediction intervals only) numeric
scalar indicating the ratio |
Generic functions that have methods for objects of class
"estimate"
include: print
.
Since objects of class "estimate"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
Estimating Distribution Parameters, Estimating Distribution Quantiles,
Distribution.df
, Prediction Intervals,
Tolerance Intervals, estimateCensored.object
.
# Create an object of class "estimate", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) estimate.obj <- enorm(dat, ci = TRUE) mode(estimate.obj) #[1] "list" class(estimate.obj) #[1] "estimate" names(estimate.obj) #[1] "distribution" "sample.size" "parameters" #[4] "n.param.est" "method" "data.name" #[7] "bad.obs" "interval" names(estimate.obj$interval) #[1] "name" "parameter" "limits" #[4] "type" "method" "conf.level" #[7] "sample.size" "dof" estimate.obj #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 2.308798 # UCL = 3.413523 #---------- # Extract the confidence limits for the mean estimate.obj$interval$limits # LCL UCL #2.308798 3.413523 #---------- # Clean up rm(dat, estimate.obj)
# Create an object of class "estimate", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) estimate.obj <- enorm(dat, ci = TRUE) mode(estimate.obj) #[1] "list" class(estimate.obj) #[1] "estimate" names(estimate.obj) #[1] "distribution" "sample.size" "parameters" #[4] "n.param.est" "method" "data.name" #[7] "bad.obs" "interval" names(estimate.obj$interval) #[1] "name" "parameter" "limits" #[4] "type" "method" "conf.level" #[7] "sample.size" "dof" estimate.obj #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 2.308798 # UCL = 3.413523 #---------- # Extract the confidence limits for the mean estimate.obj$interval$limits # LCL UCL #2.308798 3.413523 #---------- # Clean up rm(dat, estimate.obj)
Objects of S3 class "estimateCensored"
are returned by any of the
EnvStats functions that estimate the parameters or quantiles of a
probability distribution and optionally construct confidence,
prediction, or tolerance intervals based on a sample of censored
data assumed to come from that distribution.
Objects of S3 class "estimateCensored"
are lists that contain
information about the estimated distribution parameters,
quantiles, and (if present) intervals, as well as the censoring side,
censoring levels and percentage of censored observations.
The names of the EnvStats
functions that produce objects of class "estimateCensored"
have the following forms:
Form of Function Name | Result |
e abbCensored |
Parameter Estimation |
eq abbCensored |
Quantile Estimation |
predInt AbbCensored |
Prediction Interval |
tolInt AbbCensored |
Tolerance Interval |
where abb denotes the abbreviation of the name of a
probability distribution (see the help file for
Distribution.df
for a list of available probability
distributions and their abbreviations), and Abb denotes the
same thing as abb except the first letter of the abbreviation
for the probability distribution is capitalized.
See the sections Estimating Distribution Parameters, Estimating Distribution Quantiles, and Prediction and Tolerance Intervals in the help file EnvStats Functions for Censored Data for a list of functions that estimate distribution parameters, estimate distribution quantiles, create prediction intervals, or create tolerance intervals using censored data.
For example:
The function enormCensored
returns an object of class
"estimateCensored"
(a list) with information about the estimated
mean and standard deviation of the assumed normal (Gaussian)
distribution, information about the amount and side of censoring, and also an
optional confidence interval for the mean.
The function eqnormCensored
returns a list of class
"estimateCensored"
with information about the estimated mean and
standard deviation of the assumed normal distribution, information about the
amount and side of censoring, the
estimated user-specified quantile(s), and an optional confidence
interval for a single quantile.
The function tolIntNormCensored
returns a list of class
"estimateCensored"
with information about the estimated mean and
standard deviation of the assumed normal distribution, information about the amount
and side of censoring, and the computed tolerance interval.
Required Components
The following components must be included in a legitimate list of
class "estimateCensored"
.
distribution |
character string indicating the name of the
assumed distribution (this equals |
sample.size |
numeric scalar indicating the sample size used to estimate the parameters or quantiles. |
censoring.side |
character string indicating whether the data are left- or right-censored. |
censoring.levels |
numeric scalar or vector indicating the censoring level(s). |
percent.censored |
numeric scalar indicating the percent of non-missing observations that are censored. |
data.name |
character string indicating the name of the data object used to compute the estimated parameters or quantiles. |
censoring.name |
character string indicating the name of the data object used to identify which values are censored. |
bad.obs |
numeric scalar indicating the number of missing ( |
Optional Components
The following components may optionally be included in a legitimate
list of class "estimateCensored"
.
parameters |
(parametric estimation only) a numeric vector with a names attribute containing the names and values of the estimated distribution parameters. |
n.param.est |
(parametric estimation only) a scalar indicating the number of distribution parameters estimated. |
method |
(parametric estimation only) a character string indicating the method used to compute the estimated parameters. |
quantiles |
a numeric vector of estimated quantiles. |
quantile.method |
a character string indicating the method of quantile estimation. |
interval |
a list of class |
All lists of class "intervalEstimateCensored"
contain the following
component:
name |
a character string indicating the kind of interval.
Possible values are: |
The number and names of the other components in a list of class
"intervalEstimateCensored"
depend on the kind of interval it is.
These components may include:
parameter |
a character string indicating the parameter for
which the interval is constructed (e.g., |
limits |
a numeric vector containing the lower and upper bounds of the interval. |
type |
the type of interval (i.e., |
method |
the method used to construct the interval
(e.g., |
conf.level |
the confidence level associated with the interval. |
sample.size |
the sample size associated with the interval. |
dof |
(parametric intervals only) the degrees of freedom associated with the interval. |
limit.ranks |
(nonparametric intervals only) the rank(s) of the order statistic(s) used to construct the interval. |
m |
(prediction intervals only) the total number of future
observations ( |
k |
(prediction intervals only) the minimum number of future
observations |
n.mean |
(prediction intervals only) the sample size associated with the future averages that should be contained in the interval. |
n.median |
(prediction intervals only) the sample size associated with the future medians that should be contained in the interval. |
n.sum |
(Poisson prediction intervals only) the sample size associated with the future sums that should be contained in the interval. |
rule |
(simultaneous prediction intervals only) the rule used to construct the simultaneous prediction interval. |
delta.over.sigma |
(simultaneous prediction intervals only) numeric
scalar indicating the ratio |
Generic functions that have methods for objects of class
"estimateCensored"
include: print
.
Since objects of class "estimateCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
EnvStats Functions for Censored Data,
Distribution.df
, estimate.object
.
# Create an object of class "estimateCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 100, sd = 20) censored <- dat < 90 dat[censored] <- 90 estimateCensored.obj <- enormCensored(dat, censored, ci = TRUE) mode(estimateCensored.obj) #[1] "list" class(estimateCensored.obj) #[1] "estimateCensored" names(estimateCensored.obj) # [1] "distribution" "sample.size" "censoring.side" "censoring.levels" # [5] "percent.censored" "parameters" "n.param.est" "method" # [9] "data.name" "censoring.name" "bad.obs" "interval" #[13] "var.cov.params" names(estimateCensored.obj$interval) #[1] "name" "parameter" "limits" "type" "method" "conf.level" estimateCensored.obj #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 90 # #Estimated Parameter(s): mean = 96.52796 # sd = 14.62275 # #Estimation Method: MLE # #Data: dat # #Censoring Variable: censored # #Sample Size: 20 # #Percent Censored: 25% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 88.82415 # UCL = 103.27604 #---------- # Extract the confidence limits for the mean estimateCensored.obj$interval$limits # LCL UCL # 91.7801 103.7839 #---------- # Clean up rm(dat, censored, estimateCensored.obj)
# Create an object of class "estimateCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 100, sd = 20) censored <- dat < 90 dat[censored] <- 90 estimateCensored.obj <- enormCensored(dat, censored, ci = TRUE) mode(estimateCensored.obj) #[1] "list" class(estimateCensored.obj) #[1] "estimateCensored" names(estimateCensored.obj) # [1] "distribution" "sample.size" "censoring.side" "censoring.levels" # [5] "percent.censored" "parameters" "n.param.est" "method" # [9] "data.name" "censoring.name" "bad.obs" "interval" #[13] "var.cov.params" names(estimateCensored.obj$interval) #[1] "name" "parameter" "limits" "type" "method" "conf.level" estimateCensored.obj #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 90 # #Estimated Parameter(s): mean = 96.52796 # sd = 14.62275 # #Estimation Method: MLE # #Data: dat # #Censoring Variable: censored # #Sample Size: 20 # #Percent Censored: 25% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 88.82415 # UCL = 103.27604 #---------- # Extract the confidence limits for the mean estimateCensored.obj$interval$limits # LCL UCL # 91.7801 103.7839 #---------- # Clean up rm(dat, censored, estimateCensored.obj)
Explanation of Euler's Constant.
Euler's Constant, here denoted $\gamma$, is a real-valued number that can be defined in several ways. Johnson et al. (1992, p. 5) use the definition:

$$\gamma = \lim_{n \to \infty} \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} - \log(n) \right)$$

and note that it can also be expressed as

$$\gamma = -\psi(1)$$

where $\psi()$ is the digamma function (Johnson et al., 1992, p. 8).
The value of Euler's Constant, to 10 decimal places, is 0.5772156649.
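In R, Euler's constant can be verified directly from the digamma function:

print(-digamma(1), digits = 11)
#[1] 0.5772156649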
The expression for the mean of a
Type I extreme value (Gumbel) distribution involves Euler's
constant; hence Euler's constant is used to compute the method of moments
estimators for this distribution (see eevd
).
Steven P. Millard ([email protected])
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.4-8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Extreme Value Distribution, eevd
.
Estimate the minimum and maximum parameters of a uniform distribution.
eunif(x, method = "mle")
eunif(x, method = "mle")
x |
numeric vector of observations. Missing ( |
method |
character string specifying the method of estimation. The possible values are "mle" (maximum likelihood; the default), "mme" (method of moments), and "mmue" (method of moments based on the unbiased estimator of variance). See the DETAILS section for more information. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a uniform distribution with parameters min=$a$ and max=$b$. Also, let $x_{(i)}$ denote the $i$'th order statistic.
Estimation

Maximum Likelihood Estimation (method="mle")
The maximum likelihood estimators (mle's) of $a$ and $b$ are given by (Johnson et al., 1995, p. 286):

$$\hat{a}_{mle} = x_{(1)} = \min(x_1, \ldots, x_n) \;\;\;\; (1)$$

$$\hat{b}_{mle} = x_{(n)} = \max(x_1, \ldots, x_n) \;\;\;\; (2)$$

Method of Moments Estimation (method="mme")
The method of moments estimators (mme's) of $a$ and $b$ are given by (Forbes et al., 2011):

$$\hat{a}_{mme} = \bar{x} - \sqrt{3} \, s_m \;\;\;\; (3)$$

$$\hat{b}_{mme} = \bar{x} + \sqrt{3} \, s_m \;\;\;\; (4)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\; (5)$$

$$s_m^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\; (6)$$

Method of Moments Estimation Based on the Unbiased Estimator of Variance (method="mmue")
The method of moments estimators based on the unbiased estimator of variance are exactly the same as the method of moments estimators given in equations (3)-(6) above, except that the method of moments estimator of variance in equation (6) is replaced with the unbiased estimator of variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\; (7)$$

where $\bar{x}$ is defined in equation (5).
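As a check on the formulas above, the following minimal sketch computes the three sets of estimators by hand and compares them with eunif(), using the same simulated data as in the Examples section below:

set.seed(250)
dat <- runif(20, min = -2, max = 3)

# MLE: sample minimum and maximum
c(min(dat), max(dat))

# MME: sample mean +/- sqrt(3) times the method-of-moments standard deviation
s.m <- sqrt(mean((dat - mean(dat))^2))
c(mean(dat) - sqrt(3) * s.m, mean(dat) + sqrt(3) * s.m)

# MMUE: same, but using the unbiased standard deviation sd()
c(mean(dat) - sqrt(3) * sd(dat), mean(dat) + sqrt(3) * sd(dat))

# Compare with eunif()
eunif(dat, method = "mme")$parameters
eunif(dat, method = "mmue")$parameters

rm(dat, s.m)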
a list of class "estimate"
containing the estimated parameters and other
information. See estimate.object
for details.
The uniform distribution (also called the rectangular
distribution) with parameters min
and max
takes on values on the
real line between min
and max
with equal probability. It has been
used to represent the distribution of round-off errors in tabulated values. Another
important application is that the distribution of the cumulative distribution
function (cdf) of any kind of continuous random variable follows a uniform
distribution with parameters min=0
and max=1
.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
# Generate 20 observations from a uniform distribution with parameters # min=-2 and max=3, then estimate the parameters via maximum likelihood. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- runif(20, min = -2, max = 3) eunif(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Uniform # #Estimated Parameter(s): min = -1.574529 # max = 2.837006 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 #---------- # Compare the three methods of estimation: eunif(dat, method = "mle")$parameters # min max #-1.574529 2.837006 eunif(dat, method = "mme")$parameters # min max #-1.988462 2.650737 eunif(dat, method = "mmue")$parameters # min max #-2.048721 2.710996 #---------- # Clean up #--------- rm(dat)
Density, distribution function, quantile function, and random generation for the (largest) extreme value distribution.
devd(x, location = 0, scale = 1) pevd(q, location = 0, scale = 1) qevd(p, location = 0, scale = 1) revd(n, location = 0, scale = 1)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
location |
vector of location parameters. |
scale |
vector of positive scale parameters. |
Let $X$ be an extreme value random variable with parameters location=$\eta$ and scale=$\theta$. The density function of $X$ is given by:

$$f(x; \eta, \theta) = \frac{1}{\theta} \, e^{-(x-\eta)/\theta} \, \exp\left[-e^{-(x-\eta)/\theta}\right]$$

where $-\infty < x < \infty$, $-\infty < \eta < \infty$, and $\theta > 0$.

The cumulative distribution function of $X$ is given by:

$$F(x; \eta, \theta) = \exp\left[-e^{-(x-\eta)/\theta}\right]$$

The $p$'th quantile of $X$ is given by:

$$x_p = \eta - \theta \, \log[-\log(p)]$$
The mode, mean, variance, skew, and kurtosis of $X$ are given by:

$$Mode(X) = \eta$$

$$E(X) = \eta + \gamma \, \theta$$

$$Var(X) = \frac{\pi^2 \theta^2}{6}$$

$$Skew(X) \approx 1.139547$$

$$Kurtosis(X) = 5.4$$

where $\gamma \approx 0.5772157$ denotes Euler's constant, which is equivalent to -digamma(1).
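A minimal sketch comparing the theoretical mean and variance above with simulated values (the choice location = 5 and scale = 2 is arbitrary):

set.seed(20)
x <- revd(10000, location = 5, scale = 2)
c(theoretical.mean = 5 + 2 * (-digamma(1)), simulated.mean = mean(x))
c(theoretical.var  = (pi^2 / 6) * 2^2,      simulated.var  = var(x))
rm(x)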
density (devd
), probability (pevd
), quantile (qevd
), or
random sample (revd
) for the extreme value distribution with
location parameter(s) determined by location
and scale
parameter(s) determined by scale
.
There are three families of extreme value distributions. The one described here is the Type I, also called the Gumbel extreme value distribution or simply Gumbel distribution. The name “extreme value” comes from the fact that this distribution is the limiting distribution (as $n$ approaches infinity) of the greatest value among $n$ independent random variables each having the same continuous distribution.

The Gumbel extreme value distribution is related to the exponential distribution as follows. Let $Y$ be an exponential random variable with parameter rate=$\lambda$. Then $X = -\log(Y)$ has an extreme value distribution with parameters location=$\log(\lambda)$ and scale=$1$.
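A minimal sketch checking this relationship by simulation (the value rate = 2 is arbitrary):

set.seed(20)
rate <- 2
y <- -log(rexp(100000, rate = rate))
q <- c(-1, 0, 1, 2)
rbind(empirical   = ecdf(y)(q),
      theoretical = pevd(q, location = log(rate), scale = 1))
rm(rate, y, q)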
The distribution described above and used by devd, pevd, qevd, and revd is the largest extreme value distribution. The smallest extreme value distribution is the limiting distribution (as $n$ approaches infinity) of the smallest value among $n$ independent random variables each having the same continuous distribution. If $X$ has a largest extreme value distribution with parameters location=$\eta$ and scale=$\theta$, then $-X$ has a smallest extreme value distribution with parameters location=$-\eta$ and scale=$\theta$. The smallest extreme value distribution is related to the Weibull distribution as follows. Let $W$ be a Weibull random variable with parameters shape=$\alpha$ and scale=$\beta$. Then $Y = \log(W)$ has a smallest extreme value distribution with parameters location=$\log(\beta)$ and scale=$1/\alpha$.
The extreme value distribution has been used extensively to model the distribution of streamflow, flooding, rainfall, temperature, wind speed, and other meteorological variables, as well as material strength and life data.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
eevd
, GEVD
,
Probability Distributions and Random Numbers.
# Density of an extreme value distribution with location=0, scale=1, # evaluated at 0.5: devd(.5) #[1] 0.3307043 #---------- # The cdf of an extreme value distribution with location=1, scale=2, # evaluated at 0.5: pevd(.5, 1, 2) #[1] 0.2769203 #---------- # The 25'th percentile of an extreme value distribution with # location=-2, scale=0.5: qevd(.25, -2, 0.5) #[1] -2.163317 #---------- # Random sample of 4 observations from an extreme value distribution with # location=5, scale=2. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) revd(4, 5, 2) #[1] 9.070406 7.669139 4.511481 5.903675
Compute the expected values of order statistics for a random sample from a standard normal distribution.
evNormOrdStats(n = 1, method = "royston", lower = -9, inc = 0.025, warn = TRUE, alpha = 3/8, nmc = 2000, seed = 47, approximate = NULL) evNormOrdStatsScalar(r = 1, n = 1, method = "royston", lower = -9, inc = 0.025, warn = TRUE, alpha = 3/8, nmc = 2000, conf.level = 0.95, seed = 47, approximate = NULL)
evNormOrdStats(n = 1, method = "royston", lower = -9, inc = 0.025, warn = TRUE, alpha = 3/8, nmc = 2000, seed = 47, approximate = NULL) evNormOrdStatsScalar(r = 1, n = 1, method = "royston", lower = -9, inc = 0.025, warn = TRUE, alpha = 3/8, nmc = 2000, conf.level = 0.95, seed = 47, approximate = NULL)
n |
positive integer indicating the sample size. |
r |
positive integer between 1 and n specifying which order statistic to compute the expected value for. |
method |
character string indicating what method to use. The possible values are "royston" (the default), "blom", and "mc".
See the DETAILS section below. |
lower |
numeric scalar |
inc |
numeric scalar between |
warn |
logical scalar indicating whether to issue a warning when
|
alpha |
numeric scalar between 0 and 0.5 that determines the constant used when |
nmc |
integer |
conf.level |
numeric scalar between 0 and 1 denoting the confidence level of
the confidence interval for the expected value of the normal
order statistic when |
seed |
integer between |
approximate |
logical scalar included for backwards compatibility with versions of
EnvStats prior to version 2.3.0.
When |
Let $\underline{z} = (z_1, z_2, \ldots, z_n)$ denote a vector of $n$ observations from a normal distribution with parameters mean=0 and sd=1. That is, $\underline{z}$ denotes a vector of $n$ observations from a standard normal distribution. Let $z_{(r)}$ denote the $r$'th order statistic of $\underline{z}$, for $r = 1, 2, \ldots, n$. The probability density function of $z_{(r)}$ is given by:

$$f_{r,n}(t) = \frac{n!}{(r-1)! \, (n-r)!} \, [\Phi(t)]^{r-1} \, [1 - \Phi(t)]^{n-r} \, \phi(t) \;\;\;\; (1)$$

where $\Phi()$ and $\phi()$ denote the cumulative distribution function and probability density function of the standard normal distribution, respectively (Johnson et al., 1994, p. 93). Thus, the expected value of $z_{(r)}$ is given by:

$$E(r, n) = E[z_{(r)}] = \int_{-\infty}^{\infty} t \, f_{r,n}(t) \, dt \;\;\;\; (2)$$

It can be shown that if $n$ is odd, then

$$E\left( \frac{n+1}{2}, \, n \right) = 0 \;\;\;\; (3)$$

Also, for all values of $n$,

$$E(r, n) = -E(n-r+1, \, n) \;\;\;\; (4)$$

The function evNormOrdStatsScalar computes the value of $E(r, n)$ for user-specified values of $r$ and $n$.

The function evNormOrdStats computes the values of $E(r, n)$ for all values of $r$ (i.e., for $r = 1, 2, \ldots, n$) for a user-specified value of $n$.
Exact Method Based on Royston's Approximation to the Integral (method="royston")

When method="royston", the integral in Equation (2) above is approximated by computing the value of the integrand between the values of lower and -lower using increments of inc, then summing these values and multiplying by inc. In particular, the integrand is restructured in terms of logarithms to avoid numerical overflow:

$$t \, f_{r,n}(t) = t \, \exp\left\{ \log(n!) - \log[(r-1)!] - \log[(n-r)!] + (r-1)\log[\Phi(t)] + (n-r)\log[1-\Phi(t)] + \log[\phi(t)] \right\}$$

By default, as per Royston (1982), the integrand is evaluated between -9 and 9 in increments of 0.025. The approximation is computed this way for values of $r$ between $1$ and $\lfloor n/2 \rfloor$, where $\lfloor n/2 \rfloor$ denotes the floor of $n/2$. If $r > \lfloor n/2 \rfloor$, then the approximation is computed for $E(n-r+1, n)$ and Equation (4) is used.

Note that Equation (1) in Royston (1982) differs from Equations (1) and (2) above because Royston's paper is based on the $r$'th largest value rather than the $r$'th order statistic (i.e., the $r$'th smallest value).

Royston (1982) states that this algorithm “is accurate to at least seven decimal places on a 36-bit machine,” that it has been validated up to a sample size of $n = 2000$, and that the accuracy for larger sample sizes may be improved by reducing the value of the argument inc. Note that making inc smaller will increase the computation time.
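The same integral can also be approximated with R's built-in integrate() function. The helper below is a hypothetical sketch (it is not the algorithm used by evNormOrdStatsScalar) that evaluates Equation (2) directly:

# Hypothetical helper: numerically integrate Equation (2) for E(r, n)
ev.norm.os <- function(r, n) {
  integrand <- function(t) {
    t * exp(lfactorial(n) - lfactorial(r - 1) - lfactorial(n - r)) *
      pnorm(t)^(r - 1) * (1 - pnorm(t))^(n - r) * dnorm(t)
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

ev.norm.os(r = 1, n = 10)
# approximately -1.5388; compare with evNormOrdStatsScalar(r = 1, n = 10)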
Approximation Based on Blom's Method (method="blom")

When method="blom", the following approximation to $E(r, n)$, proposed by Blom (1958, pp. 68-75), is used:

$$E(r, n) \approx \Phi^{-1}\left( \frac{r - \alpha}{n - 2\alpha + 1} \right)$$

By default, $\alpha = 3/8 = 0.375$. This approximation is quite accurate; for example, for $n = 10$ the Blom scores agree with the exact values to about two decimal places (compare the two calls to evNormOrdStats in the Examples section below).

Harter (1961) discusses appropriate values of $\alpha$ for various sample sizes $n$ and values of $r$.
Approximation Based on Monte Carlo Simulation (method="mc")

When method="mc", Monte Carlo simulation is used to estimate the expected value of the $r$'th order statistic. That is, $N$ = nmc trials are run in which, for each trial, a random sample of $n$ standard normal observations is generated and the $r$'th order statistic is computed. Then, the average value of this order statistic over all $N$ trials is computed, along with a confidence interval for the expected value, assuming an approximately normal distribution for the mean of the order statistic (the confidence interval is computed by supplying the simulated values of the $r$'th order statistic to the function enorm).

NOTE: This method has not been optimized for large sample sizes (i.e., large values of the argument n) and/or a large number of Monte Carlo trials (i.e., large values of the argument nmc) and may take a long time to execute in these cases.
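A minimal sketch of this Monte Carlo approach for r = 1 and n = 10, using 2000 trials (the default value of nmc); it will not reproduce the exact value returned by evNormOrdStatsScalar because the random number streams differ:

set.seed(47)
sims <- replicate(2000, min(rnorm(10)))
mean(sims)              # Monte Carlo estimate of E(1, 10)
sd(sims) / sqrt(2000)   # approximate standard error of that estimate
rm(sims)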
For evNormOrdStats
: a numeric vector of length n
containing the
expected values of all the order statistics for a random sample of n
standard normal deviates.
For evNormOrdStatsScalar
: a numeric scalar containing the expected value
of the r
'th order statistic from a random sample of n
standard
normal deviates. When method="mc"
, the returned object also has a
cont.int
attribute that contains the 95
and a nmc
attribute indicating the number of Monte Carlo trials run.
The expected values of normal order statistics are used to construct normal
quantile-quantile (Q-Q) plots (see qqPlot
) and to compute
goodness-of-fit statistics (see gofTest
). Usually, however,
approximations are used instead of exact values. The functions
evNormOrdStats
and evNormOrdStatsScalar
have been included mainly
because evNormOrdStatsScalar
is called by elnorm3
and
predIntNparSimultaneousTestPower
.
Steven P. Millard ([email protected])
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Harter, H. L. (1961). Expected Values of Normal Order Statistics. Biometrika 48, 151–165.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, pp. 93–99.
Royston, J.P. (1982). Algorithm AS 177. Expected Normal Order Statistics (Exact and Approximate). Applied Statistics 31, 161–165.
Normal, ppoints
, elnorm3
,
predIntNparSimultaneousTestPower
, gofTest
,
qqPlot
.
# Compute the expected value of the minimum for a random sample of size 10 # from a standard normal distribution: # Based on method="royston" #-------------------------- evNormOrdStatsScalar(r = 1, n = 10) #[1] -1.538753 # Based on method="blom" #----------------------- evNormOrdStatsScalar(r = 1, n = 10, method = "blom") #[1] -1.546635 # Based on method="mc" with 10,000 Monte Carlo trials #---------------------------------------------------- evNormOrdStatsScalar(r = 1, n = 10, method = "mc", nmc = 10000) #[1] -1.544318 #attr(,"confint") # 95%LCL 95%UCL #-1.555838 -1.532797 #attr(,"nmc") #[1] 10000 #==================== # Compute the expected values of all of the order statistics # for a random sample of size 10 from a standard normal distribution # based on Royston's (1982) method: #-------------------------------------------------------------------- evNormOrdStats(10) #[1] -1.5387527 -1.0013570 -0.6560591 -0.3757647 -0.1226678 #[6] 0.1226678 0.3757647 0.6560591 1.0013570 1.5387527 # Compare the above with Blom (1958) scores: #------------------------------------------- evNormOrdStats(10, method = "blom") #[1] -1.5466353 -1.0004905 -0.6554235 -0.3754618 -0.1225808 #[6] 0.1225808 0.3754618 0.6554235 1.0004905 1.5466353
Estimate the shape and scale parameters of a Weibull distribution.
eweibull(x, method = "mle")
eweibull(x, method = "mle")
x |
numeric vector of observations. Missing ( |
method |
character string specifying the method of estimation. Possible values are "mle" (maximum likelihood; the default), "mme" (method of moments), and "mmue" (method of moments based on the unbiased estimator of variance). See the DETAILS section for more information. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a Weibull distribution with parameters shape=$\alpha$ and scale=$\beta$.

Estimation

Maximum Likelihood Estimation (method="mle")
The maximum likelihood estimators (mle's) of $\alpha$ and $\beta$ are the solutions of the simultaneous equations (Forbes et al., 2011):

$$\frac{\sum_{i=1}^n x_i^{\hat{\alpha}} \log(x_i)}{\sum_{i=1}^n x_i^{\hat{\alpha}}} - \frac{1}{\hat{\alpha}} = \frac{1}{n} \sum_{i=1}^n \log(x_i) \;\;\;\; (1)$$

$$\hat{\beta} = \left[ \frac{1}{n} \sum_{i=1}^n x_i^{\hat{\alpha}} \right]^{1/\hat{\alpha}} \;\;\;\; (2)$$

Method of Moments Estimation (method="mme")
The method of moments estimator (mme) of $\alpha$ is computed by solving the equation:

$$\frac{s_m}{\bar{x}} = \left\{ \frac{\Gamma(1 + 2/\hat{\alpha})}{[\Gamma(1 + 1/\hat{\alpha})]^2} - 1 \right\}^{1/2} \;\;\;\; (3)$$

and the method of moments estimator (mme) of $\beta$ is then computed as:

$$\hat{\beta} = \frac{\bar{x}}{\Gamma(1 + 1/\hat{\alpha})} \;\;\;\; (4)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\; (5)$$

$$s_m^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\; (6)$$

and $\Gamma()$ denotes the gamma function.

Method of Moments Estimation Based on the Unbiased Estimator of Variance (method="mmue")
The method of moments estimators based on the unbiased estimator of variance are exactly the same as the method of moments estimators given in equations (3)-(6) above, except that the method of moments estimator of variance in equation (6) is replaced with the unbiased estimator of variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\; (7)$$
a list of class "estimate"
containing the estimated parameters and other
information. See estimate.object
for details.
The Weibull distribution is named after the Swedish physicist Waloddi Weibull, who used this distribution to model breaking strengths of materials. The Weibull distribution has been extensively applied in the fields of reliability and quality control.
The exponential distribution is a special case of the Weibull distribution: a Weibull random variable with parameters shape=$1$ and scale=$\beta$ is equivalent to an exponential random variable with parameter rate=$1/\beta$.
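A minimal sketch of this special case, comparing the two cumulative distribution functions at a few quantiles (the value scale = 3 is arbitrary):

q <- c(0.5, 1, 2, 5)
rbind(weibull     = pweibull(q, shape = 1, scale = 3),
      exponential = pexp(q, rate = 1/3))
rm(q)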
The Weibull distribution is related to the Type I extreme value (Gumbel) distribution as follows: if $X$ is a random variable from a Weibull distribution with parameters shape=$\alpha$ and scale=$\beta$, then $Y = -\log(X)$ is a random variable from an extreme value distribution with parameters location=$-\log(\beta)$ and scale=$1/\alpha$.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Weibull, Exponential, EVD,
estimate.object
.
# Generate 20 observations from a Weibull distribution with parameters # shape=2 and scale=3, then estimate the parameters via maximum likelihood. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rweibull(20, shape = 2, scale = 3) eweibull(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Weibull # #Estimated Parameter(s): shape = 2.673098 # scale = 3.047762 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 #---------- # Use the same data as in previous example, and compute the method of # moments estimators based on the unbiased estimator of variance: eweibull(dat, method = "mmue") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Weibull # #Estimated Parameter(s): shape = 2.528377 # scale = 3.052507 # #Estimation Method: mmue # #Data: dat # #Sample Size: 20 #---------- # Clean up #--------- rm(dat)
Estimate the parameters of a zero-modified lognormal distribution or a zero-modified lognormal distribution (alternative parameterization), and optionally construct a confidence interval for the mean.
ezmlnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95) ezmlnormAlt(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
ezmlnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95) ezmlnormAlt(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
x |
numeric vector of observations. Missing ( |
method |
character string specifying the method of estimation. The only possible value is "mvue" (minimum variance unbiased; the default). See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence
interval for the mean. The only possible value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a zero-modified lognormal distribution with parameters meanlog=$\mu$, sdlog=$\sigma$, and p.zero=$p$. Alternatively, let $\underline{x}$ denote a vector of $n$ observations from a zero-modified lognormal distribution (alternative parameterization) with parameters mean=$\theta$, cv=$\tau$, and p.zero=$p$.

Let $r$ denote the number of observations in $\underline{x}$ that are equal to 0, and order the observations so that $x_1, x_2, \ldots, x_r$ denote the $r$ zero observations and $x_{r+1}, x_{r+2}, \ldots, x_n$ denote the $n-r$ non-zero observations.

Note that $\theta$ is not the mean of the zero-modified lognormal distribution; it is the mean of the lognormal part of the distribution. Similarly, $\tau$ is not the coefficient of variation of the zero-modified lognormal distribution; it is the coefficient of variation of the lognormal part of the distribution.

Let $\gamma$, $\delta$, and $\omega$ denote the mean, standard deviation, and coefficient of variation of the overall zero-modified lognormal (delta) distribution. Let $\eta$ denote the standard deviation of the lognormal part of the distribution, so that $\eta = \theta \tau$. Aitchison (1955) shows that:

$$\gamma = (1 - p) \, \theta \;\;\;\; (1)$$

$$\delta^2 = (1 - p) \, \eta^2 + p \, (1 - p) \, \theta^2 \;\;\;\; (2)$$

so that

$$\omega = \frac{\delta}{\gamma} = \frac{\sqrt{(1-p) \, \eta^2 + p \, (1-p) \, \theta^2}}{(1-p) \, \theta} \;\;\;\; (3)$$
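A minimal sketch checking equations (1) and (2) by simulation, using the same parameter values as in the Examples section below (mean = 2, cv = 1, p.zero = 0.5):

set.seed(250)
x <- rzmlnormAlt(100000, mean = 2, cv = 1, p.zero = 0.5)
p <- 0.5; theta <- 2; tau <- 1; eta <- tau * theta
c(theoretical.mean = (1 - p) * theta, simulated.mean = mean(x))
c(theoretical.var  = (1 - p) * eta^2 + p * (1 - p) * theta^2,
  simulated.var    = var(x))
rm(x, p, theta, tau, eta)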
Estimation

Minimum Variance Unbiased Estimation (method="mvue")

Aitchison (1955) shows that the minimum variance unbiased estimators (mvue's) of $\gamma$ and $\delta^2$ are piecewise functions of $r$, the number of zero observations, and of the sample mean and sample variance of the log-transformed non-zero observations; see Aitchison (1955), Aitchison and Brown (1957, pp. 94-99), and Owen and DeRouen (1980) for the explicit formulas. Note that when $r = n-1$ or $r = n$, the estimator of $\gamma$ is simply the sample mean for all observations (including zero values), and the estimator of $\delta^2$ is simply the sample variance for all observations.

The mvue of $\gamma$ is unbiased, and its asymptotic variance is given in Aitchison and Brown (1957, p. 99) and Owen and DeRouen (1980).
Confidence Intervals

Based on Normal Approximation (ci.method="normal.approx")

An approximate $(1-\alpha)100\%$ confidence interval for $\gamma$ is constructed based on the assumption that the estimator of $\gamma$ is approximately normally distributed. Thus, an approximate two-sided $(1-\alpha)100\%$ confidence interval for $\gamma$ is constructed as:

$$[\, \hat{\gamma} - t(n-2, 1-\alpha/2) \, \hat{\sigma}_{\hat{\gamma}}, \;\; \hat{\gamma} + t(n-2, 1-\alpha/2) \, \hat{\sigma}_{\hat{\gamma}} \,]$$

where $t(\nu, q)$ is the $q$'th quantile of Student's t-distribution with $\nu$ degrees of freedom, and $\hat{\sigma}_{\hat{\gamma}}$ is the estimated standard deviation of the mvue of $\gamma$, computed by replacing the values of $\mu$, $\sigma$, and $p$ in the formula for the asymptotic variance with their estimated values and taking the square root.

Note that there must be at least 3 non-missing observations ($n \ge 3$) and at least one observation must be non-zero ($r \le n-1$) in order to construct a confidence interval.

One-sided confidence intervals are computed in a similar fashion.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
For the function ezmlnorm
, the component called parameters
is a
numeric vector with the following estimated parameters:
Parameter Name | Explanation |
meanlog |
mean of the log of the lognormal part of the distribution. |
sdlog |
standard deviation of the log of the lognormal part of the distribution. |
p.zero |
probability that an observation will be 0. |
mean.zmlnorm |
mean of the overall zero-modified lognormal (delta) distribution. |
sd.zmlnorm |
standard deviation of the overall zero-modified lognormal (delta) distribution. |
For the function ezmlnormAlt
, the component called parameters
is a
numeric vector with the following estimated parameters:
Parameter Name | Explanation |
mean |
mean of the lognormal part of the distribution. |
cv |
coefficient of variation of the lognormal part of the distribution. |
p.zero |
probability that an observation will be 0. |
mean.zmlnorm |
mean of the overall zero-modified lognormal (delta) distribution. |
cv.zmlnorm |
coefficient of variation of the overall zero-modified lognormal (delta) distribution. |
The zero-modified lognormal (delta) distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit” (the nondetects are assumed equal to 0). See, for example, Gilliom and Helsel (1986), Owen and DeRouen (1980), and Gibbons et al. (2009, Chapter 12). USEPA (2009, Chapter 15) recommends this strategy only in specific situations, and Helsel (2012, Chapter 1) strongly discourages this approach to dealing with non-detects.
A variation of the zero-modified lognormal (delta) distribution is the zero-modified normal distribution, in which a normal distribution is mixed with a positive probability mass at 0.
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
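A minimal sketch of this comparison, using the zinc data from USEPA (1992c) that appear elsewhere in this manual (data frame EPA.92c.zinc.df) and assuming a lognormal model:

dat      <- EPA.92c.zinc.df$Zinc
censored <- EPA.92c.zinc.df$Censored

# Censored probability plot
qqPlotCensored(dat, censored, distribution = "lnorm", add.line = TRUE,
  main = "Censored Probability Plot for Zinc")

# Detects-only probability plot for the same data
qqPlot(dat[!censored], distribution = "lnorm", add.line = TRUE,
  main = "Detects-Only Probability Plot for Zinc")

rm(dat, censored)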
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901–908.
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special reference to its uses in economics). Cambridge University Press, London. pp.94-99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, pp.47–51.
Gibbons, RD., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707–719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zero-Modified Lognormal, Zero-Modified Normal, Lognormal.
# Generate 100 observations from a zero-modified lognormal (delta) # distribution with mean=2, cv=1, and p.zero=0.5, then estimate the # parameters. According to equations (1) and (3) above, the overall mean # is mean.zmlnorm=1 and the overall cv is cv.zmlnorm=sqrt(3). # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rzmlnormAlt(100, mean = 2, cv = 1, p.zero = 0.5) ezmlnormAlt(dat, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Lognormal (Delta) # #Estimated Parameter(s): mean = 1.9604561 # cv = 0.9169411 # p.zero = 0.4500000 # mean.zmlnorm = 1.0782508 # cv.zmlnorm = 1.5307175 # #Estimation Method: mvue # #Data: dat # #Sample Size: 100 # #Confidence Interval for: mean.zmlnorm # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.748134 # UCL = 1.408368 #---------- # Clean up rm(dat)
Estimate the mean and standard deviation of a zero-modified normal distribution, and optionally construct a confidence interval for the mean.
ezmnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
ezmnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Currently, the only possible value is "mvue" (minimum variance unbiased; the default). See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. Currently the only possible value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a zero-modified normal distribution with parameters mean=$\mu$, sd=$\sigma$, and p.zero=$p$. Let $r$ denote the number of observations in $\underline{x}$ that are equal to 0, and order the observations so that $x_1, x_2, \ldots, x_r$ denote the $r$ zero observations, and $x_{r+1}, x_{r+2}, \ldots, x_n$ denote the $n-r$ non-zero observations.

Note that $\mu$ is not the mean of the zero-modified normal distribution; it is the mean of the normal part of the distribution. Similarly, $\sigma$ is not the standard deviation of the zero-modified normal distribution; it is the standard deviation of the normal part of the distribution.

Let $\gamma$ and $\delta$ denote the mean and standard deviation of the overall zero-modified normal distribution. Aitchison (1955) shows that:

$$\gamma = (1 - p) \, \mu \;\;\;\; (1)$$

$$\delta^2 = (1 - p) \, \sigma^2 + p \, (1 - p) \, \mu^2 \;\;\;\; (2)$$
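A minimal sketch checking equations (1) and (2) by simulation, using the same parameter values as in the Examples section below (mean = 4, sd = 2, p.zero = 0.5):

set.seed(250)
x <- rzmnorm(100000, mean = 4, sd = 2, p.zero = 0.5)
p <- 0.5; mu <- 4; sigma <- 2
c(theoretical.mean = (1 - p) * mu, simulated.mean = mean(x))
c(theoretical.var  = (1 - p) * sigma^2 + p * (1 - p) * mu^2,
  simulated.var    = var(x))
rm(x, p, mu, sigma)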
Estimation

Minimum Variance Unbiased Estimation (method="mvue")

Aitchison (1955) shows that the minimum variance unbiased estimators (mvue's) of $\gamma$ and $\delta^2$ are piecewise functions of $r$, the number of zero observations, and of the following quantities:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$

$$\bar{x}_1 = \frac{1}{n-r} \sum_{i=r+1}^n x_i$$

$$s_1^2 = \frac{1}{n-r-1} \sum_{i=r+1}^n (x_i - \bar{x}_1)^2$$

Note that $\bar{x}$ is the sample mean of all observations (including 0 values), $\bar{x}_1$ is the sample mean of all non-zero observations, and $s_1^2$ is the sample variance of all non-zero observations. Also note that in the special cases in which none or all of the observations are zero, the estimator of $\delta^2$ is the sample variance for all observations (including 0 values).
Confidence Intervals

Based on Normal Approximation (ci.method="normal.approx")
An approximate (1 − α)100% confidence interval for γ is constructed based on the assumption that the estimator of γ is approximately normally distributed. Aitchison (1955) shows that the variance of the mvue of γ can be estimated by δ̂²/n. Thus, an approximate two-sided (1 − α)100% confidence interval for γ is constructed as:

[ γ̂ − t(n − 2, 1 − α/2) δ̂/√n ,  γ̂ + t(n − 2, 1 − α/2) δ̂/√n ]

where t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom.
One-sided confidence intervals are computed in a similar fashion.
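To make the estimation steps concrete, the following minimal sketch (not part of the original help file) computes the quantities in equations (3)-(7) "by hand" and compares them with the output of ezmnorm(). The simulated data and seed mirror the Examples section below.

# Simulate zero-modified normal data (same seed as the Examples section)
set.seed(250)
dat <- rzmnorm(100, mean = 4, sd = 2, p.zero = 0.5)

n <- length(dat)
r <- sum(dat == 0)                      # number of zero observations
x.bar      <- mean(dat)                 # equation (5): mean of all observations
x.bar.star <- mean(dat[dat != 0])       # equation (6): mean of non-zero observations
s2.star    <- var(dat[dat != 0])        # equation (7): variance of non-zero observations

gamma.hat  <- x.bar                                          # equation (3)
delta2.hat <- (n - r - 1)/(n - 1) * s2.star +
              r * (n - r) / (n * (n - 1)) * x.bar.star^2     # equation (4)

c(mean.zmnorm = gamma.hat, sd.zmnorm = sqrt(delta2.hat))
# These values should match the mean.zmnorm and sd.zmnorm components of
# ezmnorm(dat)$parameters shown in the Examples section below.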
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The component called parameters
is a numeric vector with the following
estimated parameters:
Parameter Name | Explanation |
mean |
mean of the normal (Gaussian) part of the distribution. |
sd |
standard deviation of the normal (Gaussian) part of the distribution. |
p.zero |
probability that an observation will be 0. |
mean.zmnorm |
mean of the overall zero-modified normal distribution. |
sd.zmnorm |
standard deviation of the overall zero-modified normal distribution. |
The zero-modified normal distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit”. See, for example USEPA (1992c, pp.27-34). In most cases, however, the zero-modified lognormal (delta) distribution will be more appropriate, since chemical concentrations are bounded below at 0 (e.g., Gilliom and Helsel, 1986; Owen and DeRouen, 1980).
Once you estimate the parameters of the zero-modified normal distribution, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with a confidence interval.
One way to try to assess whether a
zero-modified lognormal (delta),
zero-modified normal, censored normal, or
censored lognormal is the best model for the data is to construct both
censored and detects-only probability plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901–908.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707–719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
ZeroModifiedNormal, Normal,
ezmlnorm
, ZeroModifiedLognormal, estimate.object
.
# Generate 100 observations from a zero-modified normal distribution # with mean=4, sd=2, and p.zero=0.5, then estimate the parameters. # According to equations (1) and (2) above, the overall mean is # mean.zmnorm=2 and the overall standard deviation is sd.zmnorm=sqrt(6). # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rzmnorm(100, mean = 4, sd = 2, p.zero = 0.5) ezmnorm(dat, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Normal # #Estimated Parameter(s): mean = 4.037732 # sd = 1.917004 # p.zero = 0.450000 # mean.zmnorm = 2.220753 # sd.zmnorm = 2.465829 # #Estimation Method: mvue # #Data: dat # #Sample Size: 100 # #Confidence Interval for: mean.zmnorm # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 1.731417 # UCL = 2.710088 #---------- # Following Example 9 on page 34 of USEPA (1992c), compute an # estimate of the mean of the zinc data, assuming a # zero-modified normal distribution. The data are stored in # EPA.92c.zinc.df. head(EPA.92c.zinc.df) # Zinc.orig Zinc Censored Sample Well #1 <7 7.00 TRUE 1 1 #2 11.41 11.41 FALSE 2 1 #3 <7 7.00 TRUE 3 1 #4 <7 7.00 TRUE 4 1 #5 <7 7.00 TRUE 5 1 #6 10.00 10.00 FALSE 6 1 New.Zinc <- EPA.92c.zinc.df$Zinc New.Zinc[EPA.92c.zinc.df$Censored] <- 0 ezmnorm(New.Zinc, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Normal # #Estimated Parameter(s): mean = 11.891000 # sd = 1.594523 # p.zero = 0.500000 # mean.zmnorm = 5.945500 # sd.zmnorm = 6.123235 # #Estimation Method: mvue # #Data: New.Zinc # #Sample Size: 40 # #Confidence Interval for: mean.zmnorm # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 3.985545 # UCL = 7.905455 #---------- # Clean up rm(dat, New.Zinc)
Hyperlink list of EnvStats functions by category.
The EnvStats functions listed below are useful for performing calibration and inverse prediction to determine the concentration of a chemical based on a machine signal.
Function Name | Description |
anovaPE |
Compute lack-of-fit and pure error ANOVA table for a |
linear model. | |
calibrate |
Fit a calibration line or curve. |
detectionLimitCalibrate |
Determine detection limit based on using a calibration |
line (or curve) and inverse regression. | |
inversePredictCalibrate |
Predict concentration using a calibration line (or curve) |
and inverse regression. | |
pointwise |
Pointwise confidence limits for predictions. |
predict.lm |
Predict method for linear model fits. |
The EnvStats functions listed below are useful for dealing with Type I censored data.
Data Transformations
Function Name | Description |
boxcoxCensored |
Compute values of an objective for Box-Cox Power |
transformations, or compute optimal transformation, | |
for Type I censored data. | |
print.boxcoxCensored |
Print an object of class "boxcoxCensored" . |
plot.boxcoxCensored |
Plot an object of class "boxcoxCensored" . |
Estimating Distribution Parameters
Function Name | Description |
egammaCensored |
Estimate shape and scale parameters for a gamma distribution |
based on Type I censored data. | |
egammaAltCensored |
Estimate mean and CV for a gamma distribution |
based on Type I censored data. | |
elnormCensored |
Estimate parameters for a lognormal distribution (log-scale) |
based on Type I censored data. | |
elnormAltCensored |
Estimate parameters for a lognormal distribution (original scale) |
based on Type I censored data. | |
enormCensored |
Estimate parameters for a Normal distribution based on Type I |
censored data. | |
epoisCensored |
Estimate parameter for a Poisson distribution based on Type I |
censored data. | |
enparCensored |
Estimate the mean and standard deviation nonparametrically. |
gpqCiNormSinglyCensored |
Generate the generalized pivotal quantity used to construct a |
confidence interval for the mean of a Normal distribution based | |
on Type I singly censored data. | |
gpqCiNormMultiplyCensored |
Generate the generalized pivotal quantity used to construct a |
confidence interval for the mean of a Normal distribution based | |
on Type I multiply censored data. | |
print.estimateCensored |
Print an object of class "estimateCensored" . |
Estimating Distribution Quantiles
Function Name | Description |
eqlnormCensored |
Estimate quantiles of a Lognormal distribution (log-scale) |
based on Type I censored data, and optionally construct | |
a confidence interval for a quantile. | |
eqnormCensored |
Estimate quantiles of a Normal distribution |
based on Type I censored data, and optionally construct | |
a confidence interval for a quantile. | |
All of the functions for computing quantiles (and associated confidence intervals) for complete (uncensored)
data are listed in the help file Estimating Distribution Quantiles. All of these functions, with
the exception of eqnpar
, will accept an object of class
"estimateCensored"
. Thus, you may estimate
quantiles (and construct approximate confidence intervals) for any distribution for which:
There exists a function to estimate distribution parameters using censored data (see the section Estimating Distribution Parameters above).
There exists a function to estimate quantiles for that distribution based on complete data (see the help file Estimating Distribution Quantiles).
Nonparametric estimates of quantiles (and associated confidence intervals) can be constructed from censored
data as long as the order statistics used in the results are above all left-censored observations or below
all right-censored observations. See the help file for eqnpar
for more information and
examples.
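For example, the following minimal sketch (the synthetic data and detection limit of 8 are assumptions of this illustration, not part of the help files) passes an "estimateCensored" object produced by enormCensored to eqnorm to estimate a quantile from Type I left-censored data:

# Simulate data and censor values below an assumed detection limit of 8
set.seed(123)
x <- rnorm(30, mean = 10, sd = 2)
censored <- x < 8
x[censored] <- 8

# Estimate the normal parameters from the censored data, then pass the
# resulting "estimateCensored" object to eqnorm()
est.cen <- enormCensored(x, censored)
eqnorm(est.cen, p = 0.9)    # estimated 90th percentile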
Goodness-of-Fit Tests
Function Name | Description |
gofTestCensored |
Perform a goodness-of-fit test based on Type I left- or |
right-censored data. | |
print.gofCensored |
Print an object of class "gofCensored" . |
plot.gofCensored |
Plot an object of class "gofCensored" . |
Hypothesis Tests
Function Name | Description |
twoSampleLinearRankTestCensored |
Perform two-sample linear rank tests based on |
censored data. | |
print.htestCensored |
Printing method for object of class |
"htestCensored" . |
|
Plotting Probability Distributions
Function Name | Description |
cdfCompareCensored |
Plot two cumulative distribution functions based on Type I |
censored data. | |
ecdfPlotCensored |
Plot an empirical cumulative distribution function based on |
Type I censored data. | |
ppointsCensored |
Compute plotting positions for Type I censored data. |
qqPlotCensored |
Produce quantile-quantile (Q-Q) plots, also called probability |
plots, based on Type I censored data. | |
Prediction and Tolerance Intervals
Function Name | Description |
gpqTolIntNormSinglyCensored |
Generate the generalized pivotal quantity used to construct a |
tolerance interval for a Normal distribution based | |
on Type I singly censored data. | |
gpqTolIntNormMultiplyCensored |
Generate the generalized pivotal quantity used to construct a |
tolerance interval for a Normal distribution based | |
on Type I multiply censored data. | |
tolIntLnormCensored |
Tolerance interval for a lognormal distribution (log-scale) |
based on Type I censored data. | |
tolIntNormCensored |
Tolerance interval for a Normal distribution based on Type I |
censored data. | |
All of the functions for computing prediction and tolerance intervals for complete (uncensored)
data are listed in the help files Prediction Intervals and Tolerance Intervals.
All of these functions, with the exceptions of predIntNpar
and tolIntNpar
,
will accept an object of class "estimateCensored"
. Thus, you
may construct approximate prediction or tolerance intervals for any distribution for which:
There exists a function to estimate distribution parameters using censored data (see the section Estimating Distribution Parameters above).
There exists a function to create a prediction or tolerance interval for that distribution based on complete data (see the help files Prediction Intervals and Tolerance Intervals).
Nonparametric prediction and tolerance intervals can be constructed from censored
data as long as the order statistics used in the results are above all left-censored observations or below
all right-censored observations. See the help files for predIntNpar
,
predIntNparSimultaneous
, and tolIntNpar
for more information and examples.
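As an illustration, the following minimal sketch (reusing the same assumed synthetic censored data as in the quantile example above) supplies an "estimateCensored" object to tolIntNorm to obtain an approximate upper tolerance limit:

set.seed(123)
x <- rnorm(30, mean = 10, sd = 2)
censored <- x < 8                      # assumed detection limit of 8
x[censored] <- 8
est.cen <- enormCensored(x, censored)  # parameter estimates from censored data
tolIntNorm(est.cen, coverage = 0.95, ti.type = "upper", conf.level = 0.95)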
The EnvStats functions listed below are useful for deciding on data transformations.
Function Name | Description |
boxcox |
Compute values of an objective for Box-Cox transformations, or |
compute optimal transformation based on raw observations | |
or residuals from a linear model. | |
boxcoxTransform |
Apply a Box-Cox Power transformation to a set of data. |
plot.boxcox |
Plotting method for an object of class "boxcox" . |
plot.boxcoxLm |
Plotting method for an object of class "boxcoxLm" . |
print.boxcox |
Printing method for an object of class "boxcox" . |
print.boxcoxLm |
Printing method for an object of class "boxcoxLm" . |
The EnvStats functions listed below are useful for estimating distribution parameters and optionally constructing confidence intervals.
Function Name | Description |
ebeta |
Estimate parameters of a Beta distribution |
ebinom |
Estimate parameter of a Binomial distribution |
eexp |
Estimate parameter of an Exponential distribution |
eevd |
Estimate parameters of an Extreme Value distribution |
egamma |
Estimate shape and scale parameters of a Gamma distribution |
egammaAlt |
Estimate mean and CV parameters of a Gamma distribution |
egevd |
Estimate parameters of a Generalized Extreme Value distribution |
egeom |
Estimate parameter of a Geometric distribution |
ehyper |
Estimate parameter of a Hypergeometric distribution |
elogis |
Estimate parameters of a Logistic distribution |
elnorm |
Estimate parameters of a Lognormal distribution (log-scale) |
elnormAlt |
Estimate parameters of a Lognormal distribution (original scale) |
elnorm3 |
Estimate parameters of a Three-Parameter Lognormal distribution |
enbinom |
Estimate parameter of a Negative Binomial distribution |
enorm |
Estimate parameters of a Normal distribution |
epareto |
Estimate parameters of a Pareto distribution |
epois |
Estimate parameter of a Poisson distribution |
eunif |
Estimate parameters of a Uniform distribution |
eweibull |
Estimate parameters of a Weibull distribution |
ezmlnorm |
Estimate parameters of a Zero-Modified Lognormal (Delta) |
distribution (log-Scale) | |
ezmlnormAlt |
Estimate parameters of a Zero-Modified Lognormal (Delta) |
distribution (original Scale) | |
ezmnorm |
Estimate parameters of a Zero-Modified Normal distribution |
The EnvStats functions listed below are useful for estimating distribution quantiles and, for some functions, optionally constructing confidence intervals for a quantile.
Function Name | Description |
eqbeta |
Estimate quantiles of a Beta distribution. |
eqbinom |
Estimate quantiles of a Binomial distribution. |
eqexp |
Estimate quantiles of an Exponential distribution. |
eqevd |
Estimate quantiles of an Extreme Value distribution. |
eqgamma |
Estimate quantiles of a Gamma distribution |
using the Shape and Scale Parameterization, and optionally | |
construct a confidence interval for a quantile. | |
eqgammaAlt |
Estimate quantiles of a Gamma distribution |
using the mean and CV Parameterization, and optionally | |
construct a confidence interval for a quantile. | |
eqgevd |
Estimate quantiles of a Generalized Extreme Value distribution. |
eqgeom |
Estimate quantiles of a Geometric distribution. |
eqhyper |
Estimate quantiles of a Hypergeometric distribution. |
eqlogis |
Estimate quantiles of a Logistic distribution. |
eqlnorm |
Estimate quantiles of a Lognormal distribution (log-scale), |
and optionally construct a confidence interval for a quantile. | |
eqlnorm3 |
Estimate quantiles of a Three-Parameter Lognormal distribution. |
eqnbinom |
Estimate quantiles of a Negative Binomial distribution. |
eqnorm |
Estimate quantiles of a Normal distribution, |
and optionally construct a confidence interval for a quantile. | |
eqpareto |
Estimate quantiles of a Pareto distribution. |
eqpois |
Estimate quantiles of a Poisson distribution, |
and optionally construct a confidence interval for a quantile. | |
equnif |
Estimate quantiles of a Uniform distribution. |
eqweibull |
Estimate quantiles of a Weibull distribution. |
eqzmlnorm |
Estimate quantiles of a Zero-Modified Lognormal (Delta) |
distribution (log-scale). | |
eqzmlnormAlt |
Estimate quantiles of a Zero-Modified Lognormal (Delta) |
distribution (original scale). | |
eqzmnorm |
Estimate quantiles of a Zero-Modified Normal distribution. |
The EnvStats functions listed below are useful for performing goodness-of-fit tests for user-specified probability distributions.
Goodness-of-Fit Tests
Function Name | Description |
gofTest |
Perform a goodness-of-fit test for a specified probability distribution. |
The resulting object is of class "gof" unless the test is the |
|
two-sample Kolmogorov-Smirnov test, in which case the resulting | |
object is of class "gofTwoSample" . |
|
plot.gof |
S3 class method for plotting an object of class "gof" . |
print.gof |
S3 class method for printing an object of class "gof" . |
plot.gofTwoSample |
S3 class method for plotting an object of class "gofTwoSample" . |
print.gofTwoSample |
S3 class method for printing an object of class "gofTwoSample" . |
gofGroupTest |
Perform a goodness-of-fit test to determine whether data in a set of groups |
appear to all come from the same probability distribution | |
(with possibly different parameters for each group). | |
The resulting object is of class "gofGroup" . |
|
plot.gofGroup |
S3 class method for plotting an object of class "gofGroup" . |
print.gofGroup |
S3 class method for printing an object of class "gofGroup" . |
Tests for Outliers
Function Name | Description |
rosnerTest |
Perform Rosner's test for outliers assuming a normal (Gaussian) distribution. |
print.gofOutlier |
S3 class method for printing an object of class "gofOutlier" . |
Choose a Distribution
Function Name | Description |
distChoose |
Choose best fitting distribution based on goodness-of-fit tests. |
print.distChoose |
S3 class method for printing an object of class "distChoose" . |
The EnvStats functions listed below are useful for performing hypothesis tests not already built into R. See Power and Sample Size Calculations for a list of functions you can use to perform power and sample size calculations based on various hypothesis tests.
For goodness-of-fit tests, see Goodness-of-Fit Tests.
Function Name | Description |
chenTTest |
Chen's modified one-sided t-test for skewed |
distributions. | |
kendallTrendTest |
Nonparametric test for monotonic trend |
based on Kendall's tau statistic (and | |
optional confidence interval for slope). | |
kendallSeasonalTrendTest |
Nonparametric test for monotonic trend |
within each season based on Kendall's tau | |
statistic (and optional confidence interval | |
for slope). | |
oneSamplePermutationTest |
Fisher's one-sample randomization |
(permutation) test for location. | |
quantileTest |
Two-sample rank test to detect a shift in |
a proportion of the “treated” population. | |
quantileTestPValue |
Compute p-value associated with a specified |
combination of m, n, r, and k |
for the quantile test. |
Useful for determining r and k for a |
given significance level α. |
serialCorrelationTest |
Test for the presence of serial correlation. |
signTest |
One- or paired-sample sign test on the |
median. | |
twoSampleLinearRankTest |
Two-sample linear rank test to detect a |
shift in the “treated” population. | |
twoSamplePermutationTestLocation |
Two-sample or paired-sample randomization |
(permutation) test for location. | |
twoSamplePermutationTestProportion |
Randomization (permutation) test to compare |
two proportions (Fisher's exact test). | |
varTest |
One-sample test on variance or two-sample |
test to compare variances. | |
varGroupTest |
Test for homogeneity of variance among two |
or more groups. | |
zTestGevdShape |
Estimate the shape parameter of a |
Generalized Extreme Value distribution and | |
test the null hypothesis that the true | |
value is equal to 0. | |
The EnvStats functions listed below are useful for performing Monte Carlo simulations and risk assessment.
Function Name | Description |
Empirical | Empirical distribution based on a set of observations. |
simulateVector |
Simulate a vector of random numbers from a specified theoretical |
probability distribution or empirical probability distribution | |
using either Latin hypercube sampling or simple random sampling. | |
simulateMvMatrix |
Simulate a multivariate matrix of random numbers from specified |
theoretical probability distributions and/or empirical probability | |
distributions based on a specified rank correlation matrix, using | |
either Latin hypercube sampling or simple random sampling. | |
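A minimal sketch of simulateVector, drawing a Latin hypercube sample from a lognormal distribution (the argument names and the "LHS" option shown here are assumptions of this illustration; see the simulateVector help file for the definitive interface):

set.seed(47)
simulateVector(10, distribution = "lnorm",
               param.list = list(meanlog = 0, sdlog = 1),
               sample.method = "LHS")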
The EnvStats functions listed below are useful for plotting probability distributions.
Function Name | Description |
cdfCompare |
Plot two cumulative distribution functions with the same x-axis |
in order to compare them. | |
cdfPlot |
Plot a cumulative distribution function. |
ecdfPlot |
Plot empirical cumulative distribution function. |
epdfPlot |
Plot empirical probability density function. |
pdfPlot |
Plot probability density function. |
qqPlot |
Produce a quantile-quantile (Q-Q) plot, also called a probability plot. |
qqPlotGestalt |
Plot several Q-Q plots from the same distribution in order to |
develop a Gestalt of Q-Q plots for that distribution. | |
The EnvStats functions listed below are useful for creating plots with the ggplot2 package.
Function Name | Description |
geom_stripchart |
Adaptation of the EnvStats function stripChart , |
used to create a strip plot using functions from the package | |
ggplot2. | |
stat_n_text |
Add text indicating the sample size |
to a ggplot2 plot. | |
stat_mean_sd_text |
Add text indicating the mean and standard deviation |
to a ggplot2 plot. | |
stat_median_iqr_text |
Add text indicating the median and interquartile range |
to a ggplot2 plot. | |
stat_test_text |
Add text indicating the results of a hypothesis test |
comparing groups to a ggplot2 plot. | |
The EnvStats functions listed below are useful for power and sample size calculations.
Confidence Intervals
Function Name | Description |
ciTableProp |
Confidence intervals for binomial proportion, or |
difference between two proportions, following Bacchetti (2010) | |
ciBinomHalfWidth |
Compute the half-width of a confidence interval for a |
Binomial proportion or the difference between two proportions. | |
ciBinomN |
Compute the sample size necessary to achieve a specified |
half-width of a confidence interval for a Binomial proportion or | |
the difference between two proportions. | |
plotCiBinomDesign |
Create plots for a sampling design based on a confidence interval |
for a Binomial proportion or the difference between two proportions. | |
ciTableMean |
Confidence intervals for mean of normal distribution, or |
difference between two means, following Bacchetti (2010) | |
ciNormHalfWidth |
Compute the half-width of a confidence interval for the mean of a |
Normal distribution or the difference between two means. | |
ciNormN |
Compute the sample size necessary to achieve a specified half-width |
of a confidence interval for the mean of a Normal distribution or | |
the difference between two means. | |
plotCiNormDesign |
Create plots for a sampling design based on a confidence interval |
for the mean of a Normal distribution or the difference between | |
two means. | |
ciNparConfLevel |
Compute the confidence level associated with a nonparametric |
confidence interval for a percentile. | |
ciNparN |
Compute the sample size necessary to achieve a specified |
confidence level for a nonparametric confidence interval for | |
a percentile. | |
plotCiNparDesign |
Create plots for a sampling design based on a nonparametric |
confidence interval for a percentile. |
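For example, a minimal sketch (the numbers below are arbitrary assumptions) using two of the design functions listed above:

# Half-width of a 95% confidence interval for a normal mean with n = 20,
# assuming an estimated standard deviation of 2
ciNormHalfWidth(n.or.n1 = 20, sigma.hat = 2)

# Sample size needed to achieve a half-width of 1 under the same assumptions
ciNormN(half.width = 1, sigma.hat = 2)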
Hypothesis Tests
Function Name | Description |
aovN |
Compute the sample sizes necessary to achieve a |
specified power for a one-way fixed-effects analysis | |
of variance test. | |
aovPower |
Compute the power of a one-way fixed-effects analysis of |
variance test. | |
plotAovDesign |
Create plots for a sampling design based on a one-way |
analysis of variance. | |
propTestN |
Compute the sample size necessary to achieve a specified |
power for a one- or two-sample proportion test. | |
propTestPower |
Compute the power of a one- or two-sample proportion test. |
propTestMdd |
Compute the minimal detectable difference associated with |
a one- or two-sample proportion test. | |
plotPropTestDesign |
Create plots involving sample size, power, difference, and |
significance level for a one- or two-sample proportion test. | |
tTestAlpha |
Compute the Type I Error associated with specified values for |
for power, sample size(s), and scaled MDD for a one- or | |
two-sample t-test. | |
tTestN |
Compute the sample size necessary to achieve a specified |
power for a one- or two-sample t-test. | |
tTestPower |
Compute the power of a one- or two-sample t-test. |
tTestScaledMdd |
Compute the scaled minimal detectable difference |
associated with a one- or two-sample t-test. | |
plotTTestDesign |
Create plots for a sampling design based on a one- or |
two-sample t-test. | |
tTestLnormAltN |
Compute the sample size necessary to achieve a specified |
power for a one- or two-sample t-test, assuming lognormal | |
data. | |
tTestLnormAltPower |
Compute the power of a one- or two-sample t-test, assuming |
lognormal data. | |
tTestLnormAltRatioOfMeans |
Compute the minimal or maximal detectable ratio of means |
associated with a one- or two-sample t-test, assuming | |
lognormal data. | |
plotTTestLnormAltDesign |
Create plots for a sampling design based on a one- or |
two-sample t-test, assuming lognormal data. | |
linearTrendTestN |
Compute the sample size necessary to achieve a specified |
power for a t-test for linear trend. | |
linearTrendTestPower |
Compute the power of a t-test for linear trend. |
linearTrendTestScaledMds |
Compute the scaled minimal detectable slope for a t-test |
for linear trend. | |
plotLinearTrendTestDesign |
Create plots for a sampling design based on a t-test for |
linear trend. | |
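As a quick illustration (the effect size, power, and significance level below are arbitrary assumptions), tTestN returns the sample size per group for a two-sample t-test:

# Sample size for a two-sample t-test to detect a scaled difference of
# 1 standard deviation with 90% power at a 5% significance level
tTestN(delta.over.sigma = 1, alpha = 0.05, power = 0.9,
       sample.type = "two.sample")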
Prediction Intervals
Normal Distribution Prediction Intervals
Function Name | Description |
predIntNormHalfWidth |
Compute the half-width of a prediction |
interval for a normal distribution. | |
predIntNormK |
Compute the required value of K for |
a prediction interval for a Normal | |
distribution. | |
predIntNormN |
Compute the sample size necessary to |
achieve a specified half-width for a | |
prediction interval for a Normal | |
distribution. | |
plotPredIntNormDesign |
Create plots for a sampling design |
based on the width of a prediction | |
interval for a Normal distribution. | |
predIntNormTestPower |
Compute the probability that at least |
one future observation (or mean) | |
falls outside a prediction interval | |
for a Normal distribution. | |
plotPredIntNormTestPowerCurve |
Create plots for a sampling |
design based on a prediction interval | |
for a Normal distribution. | |
predIntNormSimultaneousTestPower |
Compute the probability that at |
least one set of future observations | |
(or means) violates the given rule | |
based on a simultaneous prediction | |
interval for a Normal distribution. | |
plotPredIntNormSimultaneousTestPowerCurve |
Create plots for a sampling design |
based on a simultaneous prediction | |
interval for a Normal distribution. | |
Lognormal Distribution Prediction Intervals
Function Name | Description |
predIntLnormAltTestPower |
Compute the probability that at least |
one future observation (or geometric | |
mean) falls outside a prediction | |
interval for a lognormal distribution. | |
plotPredIntLnormAltTestPowerCurve |
Create plots for a sampling design |
based on a prediction interval for a | |
lognormal distribution. | |
predIntLnormAltSimultaneousTestPower |
Compute the probability that at least |
one set of future observations (or | |
geometric means) violates the given | |
rule based on a simultaneous | |
prediction interval for a lognormal | |
distribution. | |
plotPredIntLnormAltSimultaneousTestPowerCurve |
Create plots for a sampling design |
based on a simultaneous prediction | |
interval for a lognormal distribution. | |
Nonparametric Prediction Intervals
Function Name | Description |
predIntNparConfLevel |
Compute the confidence level associated with |
a nonparametric prediction interval. | |
predIntNparN |
Compute the required sample size to achieve |
a specified confidence level for a | |
nonparametric prediction interval. | |
plotPredIntNparDesign |
Create plots for a sampling design based on |
the confidence level and sample size of a | |
nonparametric prediction interval. | |
predIntNparSimultaneousConfLevel |
Compute the confidence level associated with |
a simultaneous nonparametric prediction | |
interval. | |
predIntNparSimultaneousN |
Compute the required sample size for a |
simultaneous nonparametric prediction | |
interval. | |
plotPredIntNparSimultaneousDesign |
Create plots for a sampling design based on |
a simultaneous nonparametric prediction | |
interval. | |
predIntNparSimultaneousTestPower |
Compute the probability that at least one |
set of future observations violates the | |
given rule based on a nonparametric | |
simultaneous prediction interval. | |
plotPredIntNparSimultaneousTestPowerCurve |
Create plots for a sampling design based on |
a simultaneous nonparametric prediction | |
interval. | |
Tolerance Intervals
Function Name | Description |
tolIntNormHalfWidth |
Compute the half-width of a tolerance |
interval for a normal distribution. | |
tolIntNormK |
Compute the required value of K for a |
tolerance interval for a Normal distribution. | |
tolIntNormN |
Compute the sample size necessary to achieve a |
specified half-width for a tolerance interval | |
for a Normal distribution. | |
plotTolIntNormDesign |
Create plots for a sampling design based on a |
tolerance interval for a Normal distribution. | |
tolIntNparConfLevel |
Compute the confidence level associated with a |
nonparametric tolerance interval for a specified | |
sample size and coverage. | |
tolIntNparCoverage |
Compute the coverage associated with a |
nonparametric tolerance interval for a specified | |
sample size and confidence level. | |
tolIntNparN |
Compute the sample size required for a nonparametric |
tolerance interval with a specified coverage and | |
confidence level. | |
plotTolIntNparDesign |
Create plots for a sampling design based on a |
nonparametric tolerance interval. | |
The EnvStats functions listed below are useful for computing prediction intervals and simultaneous prediction intervals. See Power and Sample Size for a list of functions useful for computing power and sample size for a design based on a prediction interval width, or a design based on a hypothesis test for future observations falling outside of a prediction interval.
Function Name | Description |
predIntGamma , |
Prediction interval for the next k |
predIntGammaAlt |
observations or next set of k means for a |
Gamma distribution. | |
predIntGammaSimultaneous , |
Construct a simultaneous prediction interval for the |
predIntGammaAltSimultaneous |
next r sampling occasions based on a |
Gamma distribution. | |
predIntLnorm , |
Prediction interval for the next k |
predIntLnormAlt |
observations or k geometric means from a |
Lognormal distribution. | |
predIntLnormSimultaneous , |
Construct a simultaneous prediction interval for the |
predIntLnormAltSimultaneous |
next r sampling occasions based on a |
Lognormal distribution. | |
predIntNorm |
Prediction interval for the next k observations |
or k means from a Normal (Gaussian) distribution. |
predIntNormK |
Compute the value of K for a prediction interval |
for a Normal distribution. | |
predIntNormSimultaneous |
Construct a simultaneous prediction interval for the |
next r sampling occasions based on a |
|
Normal distribution. | |
predIntNormSimultaneousK |
Compute the value of K for a simultaneous |
prediction interval for the next r sampling |
|
occasions based on a Normal distribution. | |
predIntNpar |
Nonparametric prediction interval for the next k |
of m observations. |
|
predIntNparSimultaneous |
Construct a nonparametric simultaneous prediction |
interval for the next r sampling occasions. |
|
predIntPois |
Prediction interval for the next k observations |
or k sums from a Poisson distribution. |
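For example, a minimal sketch (simulated background data are an assumption of this illustration) of predIntNorm from the table above:

# 95% prediction interval for the next single observation from a
# normal distribution, based on a background sample of size 25
set.seed(20)
x <- rnorm(25, mean = 10, sd = 2)
predIntNorm(x, k = 1, pi.type = "two-sided", conf.level = 0.95)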
The EnvStats functions listed below are printing and plotting methods for various S3 classes.
Printing Methods
Function Name | Description |
print.boxcox |
Print an object that inherits from class "boxcox" . |
print.boxcoxCensored |
Print an object that inherits from class |
"boxcoxCensored" . |
|
print.boxcoxLm |
Print an object that inherits from class "boxcoxLm" . |
print.estimate |
Print an object that inherits from class "estimate" . |
print.estimateCensored |
Print an object that inherits from class |
"estimateCensored" . |
|
print.gof |
Print an object that inherits from class "gof" . |
print.gofCensored |
Print an object that inherits from class "gofCensored" . |
print.gofGroup |
Print an object that inherits from class "gofGroup" . |
print.gofTwoSample |
Print an object that inherits from class |
"gofTwoSample" . |
|
print.htest |
Print an object that inherits from class "htest" . |
print.htestCensored |
Print an object that inherits from class |
"htestCensored" . |
|
print.permutationTest |
Print an object that inherits from class |
"permutationTest" . |
|
print.summaryStats |
Print an object that inherits from class |
"summaryStats" . |
|
Plotting Methods
Function Name | Description |
plot.boxcox |
Plot an object that inherits from class "boxcox" . |
plot.boxcoxCensored |
Plot an object that inherits from class "boxcoxCensored" . |
plot.boxcoxLm |
Plot an object that inherits from class "boxcoxLm" . |
plot.gof |
Plot an object that inherits from class "gof" . |
plot.gofCensored |
Plot an object that inherits from class "gofCensored" . |
plot.gofGroup |
Plot an object that inherits from class "gofGroup" . |
plot.gofTwoSample |
Plot an object that inherits from class "gofTwoSample" . |
plot.permutationTest |
Plot an object that inherits from class "permutationTest" . |
Listed below are all of the probability distributions available in R and EnvStats. Distributions with a description in bold are new ones that are part of EnvStats. For each distribution, there are functions for generating: values for the probability density function, values for the cumulative distribution function, quantiles, and random numbers.
The data frame Distribution.df
contains information about
all of these probability distributions.
Distribution Abbreviation | Description |
beta |
Beta distribution. |
binom |
Binomial distribution. |
cauchy |
Cauchy distribution. |
chi |
Chi distribution. |
chisq |
Chi-squared distribution. |
exp |
Exponential distribution. |
evd |
Extreme value distribution. |
f |
F-distribution. |
gamma |
Gamma distribution. |
gammaAlt |
Gamma distribution parameterized with mean and CV. |
gevd |
Generalized extreme value distribution. |
geom |
Geometric distribution. |
hyper |
Hypergeometric distribution. |
logis |
Logistic distribution. |
lnorm |
Lognormal distribution. |
lnormAlt |
Lognormal distribution parameterized with mean and CV. |
lnormMix |
Mixture of two lognormal distributions. |
lnormMixAlt |
Mixture of two lognormal distributions |
parameterized by their means and CVs. | |
lnorm3 |
Three-parameter lognormal distribution. |
lnormTrunc |
Truncated lognormal distribution. |
lnormTruncAlt |
Truncated lognormal distribution |
parameterized by mean and CV. | |
nbinom |
Negative binomial distribution. |
norm |
Normal distribution. |
normMix |
Mixture of two normal distributions. |
normTrunc |
Truncated normal distribution. |
pareto |
Pareto distribution. |
pois |
Poisson distribution. |
t |
Student's t-distribution. |
tri |
Triangular distribution. |
unif |
Uniform distribution. |
weibull |
Weibull distribution. |
wilcox |
Wilcoxon rank sum distribution. |
zmlnorm |
Zero-modified lognormal (delta) distribution. |
zmlnormAlt |
Zero-modified lognormal (delta) distribution |
parameterized with mean and CV. | |
zmnorm |
Zero-modified normal distribution. |
In addition, the functions evNormOrdStats
and
evNormOrdStatsScalar
compute expected values of order statistics
from a standard normal distribution.
The EnvStats functions listed below create summary statistics and plots.
Summary Statistics
R comes with several functions for computing summary statistics, including
mean
, var
, median
, range
,
quantile
, and summary
. The following functions in
EnvStats complement these R functions.
Function Name | Description |
cv |
Coefficient of variation |
geoMean |
Geometric mean |
geoSD |
Geometric standard deviation |
iqr |
Interquartile range |
kurtosis |
Kurtosis |
lMoment |
L-moments |
pwMoment |
Probability-weighted moments |
skewness |
Skew |
summaryFull |
Extensive summary statistics |
summaryStats |
Summary statistics |
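A minimal sketch using the built-in mtcars data (an assumption of this illustration, not an EnvStats data set) for a few of the functions listed above:

cv(mtcars$mpg)             # coefficient of variation
geoMean(mtcars$mpg)        # geometric mean
summaryStats(mtcars$mpg)   # summary statistics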
Summary Plots
R comes with several functions for creating plots to summarize data, including
hist
, barplot
, boxplot
,
dotchart
, stripchart
, and numerous others.
The help file Plotting Probability Distributions lists several EnvStats functions useful for producing summary plots as well.
In addition, the EnvStats function stripChart
is a modification
of stripchart
that allows you to include summary statistics
on the plot itself.
Finally, the help file Plotting Using ggplot2 lists
several EnvStats functions for adding information to plots produced with the
ggplot
function, including the function geom_stripchart
,
which is an adaptation of the EnvStats function stripChart
.
The EnvStats functions listed below are useful for computing tolerance intervals. See Power and Sample Size for a list of functions useful for computing power and sample size for a design based on a tolerance interval width.
Function Name | Description |
tolIntGamma , |
Tolerance interval for a Gamma distribution. |
tolIntGammaAlt |
|
tolIntLnorm , |
Tolerance interval for a lognormal distribution. |
tolIntLnormAlt |
|
tolIntNorm |
Tolerance interval for a Normal (Gaussian) distribution. |
tolIntNormK |
Compute the constant K for a Normal (Gaussian) |
tolerance interval. | |
tolIntNpar |
Nonparametric tolerance interval. |
tolIntPois |
Tolerance interval for a Poisson distribution. |
Density, distribution function, quantile function, and random generation
for the gamma distribution with parameters mean
and cv
.
dgammaAlt(x, mean, cv = 1, log = FALSE) pgammaAlt(q, mean, cv = 1, lower.tail = TRUE, log.p = FALSE) qgammaAlt(p, mean, cv = 1, lower.tail = TRUE, log.p = FALSE) rgammaAlt(n, mean, cv = 1)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If length(n) is larger than 1, then length(n) random values are returned. |
mean |
vector of (positive) means of the distribution of the random variable. |
cv |
vector of (positive) coefficients of variation of the random variable. |
log , log.p
|
logical; if TRUE, probabilities/densities p are returned as log(p). |
lower.tail |
logical; if TRUE (the default), probabilities are P[X <= x], otherwise P[X > x]. |
Let X be a random variable with a gamma distribution with parameters shape=α and scale=β. The relationship between these parameters and the mean (mean=μ) and coefficient of variation (cv=τ) of this distribution is given by:

μ = α β

τ = α^(−1/2)

or, equivalently,

α = τ^(−2)

β = μ / α
Thus, the functions dgammaAlt, pgammaAlt, qgammaAlt, and rgammaAlt call the R functions dgamma, pgamma, qgamma, and rgamma, respectively, using the values for the shape and scale parameters given by: shape <- cv^-2, scale <- mean/shape.
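This relationship can be checked directly; the following minimal sketch reproduces the density value shown in the Examples section below:

mean.val <- 10; cv.val <- 2
shape <- cv.val^-2                           # shape = cv^(-2)
scale <- mean.val/shape                      # scale = mean/shape
dgammaAlt(7, mean = mean.val, cv = cv.val)   # 0.02139335
dgamma(7, shape = shape, scale = scale)      # same value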
dgammaAlt gives the density, pgammaAlt gives the distribution function, qgammaAlt gives the quantile function, and rgammaAlt generates random deviates.

Invalid arguments will result in return value NaN, with a warning.
The gamma distribution takes values on the positive real line. Special cases of
the gamma are the exponential distribution and the
chi-square distribution. Applications of the gamma include
life testing, statistical ecology, queuing theory, inventory control and
precipitation processes. A gamma distribution starts to resemble a normal
distribution as the shape parameter tends to infinity or
the cv parameter
tends to 0.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) discourage using the assumption of a lognormal distribution for some types of environmental data and recommend instead assessing whether the data appear to fit a gamma distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
GammaDist, egammaAlt
,
Probability Distributions and Random Numbers.
# Density of a gamma distribution with parameters mean=10 and cv=2, # evaluated at 7: dgammaAlt(7, mean = 10, cv = 2) #[1] 0.02139335 #---------- # The cdf of a gamma distribution with parameters mean=10 and cv=2, # evaluated at 12: pgammaAlt(12, mean = 10, cv = 2) #[1] 0.7713307 #---------- # The 25'th percentile of a gamma distribution with parameters # mean=10 and cv=2: qgammaAlt(0.25, mean = 10, cv = 2) #[1] 0.1056871 #---------- # A random sample of 4 numbers from a gamma distribution with # parameters mean=10 and cv=2. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(10) rgammaAlt(4, mean = 10, cv = 2) #[1] 3.772004230 1.889028078 0.002987823 8.179824976
geom_stripchart
is an adaptation of the EnvStats function
stripChart
and is used to create a strip plot
using functions from the package ggplot2.
geom_stripchart
produces one-dimensional scatter plots (also called dot plots),
along with text indicating sample size and estimates of location (mean or median) and scale
(standard deviation or interquartile range), as well as confidence intervals
for the population location parameter, and results of a hypothesis
test comparing group locations.
geom_stripchart(..., seed = 47, paired = FALSE, paired.lines = paired, group = NULL, x.nudge = if (paired && paired.lines) c(-0.3, 0.3) else 0.3, text.box = FALSE, location = "mean", ci = "normal", digits = 1, digit.type = "round", nsmall = ifelse(digit.type == "round", digits, 0), jitter.params = list(), point.params = list(), line.params = list(), location.params = list(), errorbar.params = list(), n.text = TRUE, n.text.box = text.box, n.text.params = list(), location.scale.text = TRUE, location.scale.text.box = text.box, location.scale.text.params = list(), test.text = FALSE, test.text.box = text.box, test = ifelse(location == "mean", "parametric", "nonparametric"), test.text.params = list())
... |
Arguments that can be passed on |
seed |
For the case of non-paired data, the argument |
paired |
For the case of two groups, a logical scalar indicating whether the data
should be considered to be paired. The default value is NOTE: if the argument |
paired.lines |
For the case when there are two groups and the observations are paired
(i.e., |
group |
For the case when there are two groups and the observations are paired
(i.e., |
x.nudge |
A numeric scalar indicating the amount to move the estimates of location and
confidence interval lines on the |
text.box |
A logical scalar indicating whether to surround text indicating sample size,
location/scale estimates, and test results with text boxes (i.e.,
whether to use |
location |
A character string indicating whether to display the mean for each group |
ci |
For the case when NOTE: For the case when |
digits |
Integer indicating the number of digits to use for displaying text indicating the
location and scale estimates and, for the case of one or two groups,
the number of digits to use for displaying text indicating the confidence interval
associated with the test of hypothesis. When For location/scale estimates, you can override the value of this argument by
including a component named |
digit.type |
Character string indicating whether the For location/scale estimates, you can override the value of this argument by
including a component named |
nsmall |
Integer passed to the function |
jitter.params |
A list containing arguments to the function This argument is ignored when there are two groups and both |
point.params |
For the case when there are two groups and both |
line.params |
For the case when there are two groups and both |
location.params |
A list containing arguments to the function |
errorbar.params |
A list containing arguments to the function |
n.text |
A logical scalar indicating whether to display the sample size for each group.
The default is |
n.text.box |
A logical scalar indicating whether to surround the text indicating the sample size for
each group with a text box (i.e., whether to use |
n.text.params |
A list containing arguments to the function |
location.scale.text |
A logical scalar indicating whether to display text indicating the location and scale
for each group. The default is |
location.scale.text.box |
A logical scalar indicating whether to surround the text indicating the
location and scale for each group with a text box (i.e., whether to use
|
location.scale.text.params |
A list containing arguments to the function
|
test.text |
A logical scalar indicating whether to display the results of the hypothesis test
comparing groups. The default is |
test.text.box |
A logical scalar indicating whether to surround the text indicating the
results of the hypothesis test comparing groups with a text box
(i.e., whether to use |
test |
A character string indicating whether to use a standard parametric test
( |
test.text.params |
A list containing arguments to the function |
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html and Chapter 12 of Wickham (2016) for information on how to create a new geom.
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
stat_n_text
, stat_mean_sd_text
,
stat_median_iqr_text
, stat_test_text
,
geom_jitter
, geom_point
,
geom_line
, stat_summary
,
geom_text
, geom_label
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #========== #--------------------- # 3 Independent Groups #--------------------- # Example 1: # Using the built-in data frame mtcars, # create a stipchart of miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) p + geom_stripchart() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== ## Not run: # Example 2: # Repeat Example 1, but include the results of the # standard parametric analysis of variance. #------------------------------------------------- dev.new() p + geom_stripchart(test.text = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 3: # Using Example 2, show explicitly the layering # process that geom_stripchart is using. # # This plot should look identical to the previous one. #----------------------------------------------------- set.seed(47) dev.new() p + theme(legend.position = "none") + geom_jitter(pch = 1, width = 0.15, height = 0) + stat_summary(fun.y = "mean", geom = "point", size = 2, position = position_nudge(x = 0.3)) + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", size = 0.75, width = 0.075, position = position_nudge(x = 0.3)) + stat_n_text() + stat_mean_sd_text() + stat_test_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 4: # Repeat Example 2, but put all text in a text box. #-------------------------------------------------- dev.new() p + geom_stripchart(text.box = TRUE, test.text = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 5: # Repeat Example 2, but put just the test results # in a text box. #------------------------------------------------ dev.new() p + geom_stripchart(test.text = TRUE, test.text.box = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 6: # Repeat Example 2, but: # 1) plot the median and IQR instead of the mean and the 95 # 2) show text for the median and IQR, and # 3) use the nonparametric test to compare groups. # # Note that following what the ggplot2 stat_summary function # does when you specify a "confidence interval" for the # median (i.e., when you call stat_summary with the arguments # geom="errorbar" and fun.data="median_hilow"), the displayed # error bars show intervals based on estimated quuantiles. # By default, stat_summary with the arguments # geom="errorbar" and fun.data="median_hilow" displays # error bars using the 2.5'th and 97.5'th percentiles. # The function geom_stripchart, however, by default # displays error bars using the 25'th and 75'th percentiles # (see the explanation for the argument ci above). #------------------------------------------------------------ dev.new() p + geom_stripchart(location = "median", test.text = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Clean up #--------- graphics.off() rm(p) #======================================== #--------------------- # 2 Independent Groups #--------------------- # Example 7: # Repeat Example 2, but use only the groups with # 4 and 8 cylinders. 
#----------------------------------------------- dev.new() p <- ggplot(subset(mtcars, cyl %in% c(4, 8)), aes(x = factor(cyl), y = mpg, color = cyl)) p + geom_stripchart(test.text = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 8: # Repeat Example 7, but # 1) facet by transmission type # 2) make the text smaller # 3) put the text for the test results in a text box # and make them blue. dev.new() p + geom_stripchart(test.text = TRUE, test.text.box = TRUE, n.text.params = list(size = 3), location.scale.text.params = list(size = 3), test.text.params = list(size = 3, color = "blue")) + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Clean up #--------- graphics.off() rm(p) #======================================== #--------------------- # 2 Independent Groups #--------------------- # Example 9: # The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. # # First create one-dimensional scatterplots to compare the # TcCB concentrations between the areas and use a nonparametric # test to test for a difference between areas. dev.new() p <- ggplot(EPA.94b.tccb.df, aes(x = Area, y = TcCB, color = Area)) p + geom_stripchart(location = "median", test.text = TRUE) + labs(y = "TcCB (ppb)") #========== # Example 10: # Now log-transform the TcCB data and use a parametric test # to compare the areas. dev.new() p <- ggplot(EPA.94b.tccb.df, aes(x = Area, y = log10(TcCB), color = Area)) p + geom_stripchart(test.text = TRUE) + labs(y = "log10[ TcCB (ppb) ]") #========== # Example 11: # Repeat Example 10, but allow the variances to differ # between Areas. #----------------------------------------------------- dev.new() p + geom_stripchart(test.text = TRUE, test.text.params = list(test.arg.list = list(var.equal=FALSE))) + labs(y = "log10[ TcCB (ppb) ]") #========== # Clean up #--------- graphics.off() rm(p) #======================================== #-------------------- # Paired Observations #-------------------- # Example 12: # The data frame ACE.13.TCE.df contians paired observations of # trichloroethylene (TCE; mg/L) at 10 groundwater monitoring wells # before and after remediation. # # Create one-dimensional scatterplots to compare TCE concentrations # before and after remediation and use a paired t-test to # test for a difference between periods. ACE.13.TCE.df # TCE.mg.per.L Well Period #1 20.900 1 Before #2 9.170 2 Before #3 5.960 3 Before #... ...... .. ...... #18 0.520 8 After #19 3.060 9 After #20 1.900 10 After dev.new() p <- ggplot(ACE.13.TCE.df, aes(x = Period, y = TCE.mg.per.L, color = Period)) p + geom_stripchart(paired = TRUE, group = "Well", test.text = TRUE) + labs(y = "TCE (mg/L)") #========== # Example 13: # Repeat Example 11, but use a one-sided alternative since # remediation should decrease TCE concentration. 
#--------------------------------------------------------- dev.new() p + geom_stripchart(paired = TRUE, group = "Well", test.text = TRUE, test.text.params = list(test.arg.list = list(alternative="less"))) + labs(y = "TCE (mg/L)") #========== # Clean up #--------- graphics.off() rm(p) #======================================== #---------------------------------------- # Paired Observations, Nonparametric Test #---------------------------------------- # Example 14: # The data frame Helsel.Hirsch.02.Mayfly.df contains paired counts # of mayfly nymphs above and below industrial outfalls in 12 streams. # # Create one-dimensional scatterplots to compare the # counts between locations and use a nonparametric test # to compare counts above and below the outfalls. Helsel.Hirsch.02.Mayfly.df # Mayfly.Count Stream Location #1 12 1 Above #2 15 2 Above #3 11 3 Above #... ... .. ..... #22 60 10 Below #23 53 11 Below #24 124 12 Below dev.new() p <- ggplot(Helsel.Hirsch.02.Mayfly.df, aes(x = Location, y = Mayfly.Count, color = Location)) p + geom_stripchart(location = "median", paired = TRUE, group = "Stream", test.text = TRUE) + labs(y = "Number of Mayfly Nymphs") #========== # Clean up #--------- graphics.off() rm(p) ## End(Not run)
Compute the sample geometric mean.
geoMean(x, na.rm = FALSE)
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from |
If x
contains any non-positive values (values less than or equal to 0),
geoMean
returns NA
and issues a warning.
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ observations from some distribution. The sample geometric mean is a measure of central tendency. It is defined as:

$$\bar{x}_G = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$$

that is, it is the $n$'th root of the product of all $n$ observations.

An equivalent way to define the geometric mean is by:

$$\bar{x}_G = e^{\bar{y}}$$

where

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad y_i = \log(x_i), \;\; i = 1, 2, \ldots, n$$

That is, the sample geometric mean is the antilog of the sample mean of the log-transformed observations.
The geometric mean is only defined for positive observations. It can be shown that the geometric mean is less than or equal to the sample arithmetic mean with equality only when all of the observations are the same value.
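As a quick numerical illustration of the two equivalent definitions above, the following sketch (with a made-up vector x) computes the n'th root of the product and the antilog of the mean log; both should agree with geoMean(x):

x <- c(2, 4, 8, 16)        # hypothetical positive observations
prod(x)^(1/length(x))      # n'th root of the product of the observations
exp(mean(log(x)))          # antilog of the mean of the log-transformed observations
# geoMean(x) should return the same value (here, about 5.657)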
A numeric scalar – the sample geometric mean.
The geometric mean is sometimes used to average ratios and percent changes
(Zar, 2010). For the lognormal distribution, the geometric mean is the
maximum likelihood estimator of the median of the distribution,
although it is sometimes used incorrectly to estimate the mean of the
distribution (see the NOTE section in the help file for elnormAlt
).
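A small simulation sketch of that point (the parameter values below are made up for illustration): for a lognormal sample, the geometric mean targets the median exp(meanlog), not the mean.

set.seed(1)
x <- rlnorm(10000, meanlog = 1, sdlog = 0.75)
geoMean(x)   # should be close to exp(1) = 2.72, the true median
mean(x)      # noticeably larger; the true mean is exp(1 + 0.75^2/2), about 3.60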
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
geoSD
, summaryFull
, Summary Statistics
,
mean
, median
.
# Generate 20 observations from a lognormal distribution with parameters # mean=10 and cv=2, and compute the mean, median, and geometric mean. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 2) mean(dat) #[1] 5.339273 median(dat) #[1] 3.692091 geoMean(dat) #[1] 4.095127 #---------- # Clean up rm(dat)
Compute the sample geometric standard deviation.
geoSD(x, na.rm = FALSE, sqrt.unbiased = TRUE)
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from |
sqrt.unbiased |
logical scalar specifying what method to use to compute the sample standard
deviation of the log-transformed observations. If |
If x
contains any non-positive values (values less than or equal to 0),
geoSD
returns NA
and issues a warning.
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ observations from some distribution. The sample geometric standard deviation is a measure of variability. It is defined as:

$$s_G = e^{s_y} \;\;\;\;\;\; (1)$$

where

$$s_y = \left[ \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}, \qquad y_i = \log(x_i), \;\; i = 1, 2, \ldots, n \;\;\;\;\;\; (2)$$

That is, the sample geometric standard deviation is the antilog of the sample standard deviation of the log-transformed observations.

The sample standard deviation of the log-transformed observations shown in Equation (2) is the square root of the unbiased estimator of variance. (Note that this estimator of standard deviation is not an unbiased estimator.) Sometimes, the square root of the method of moments estimator of variance is used instead:

$$s_{y,m} = \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2} \;\;\;\;\;\; (3)$$

This is the estimator used in Equation (1) when sqrt.unbiased=FALSE.
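The following sketch (made-up data) spells out Equations (1)-(3): by default the geometric standard deviation is the antilog of the usual sample standard deviation of the logs, while sqrt.unbiased=FALSE uses the method-of-moments denominator n instead of n-1:

x <- c(2, 4, 8, 16)                             # hypothetical positive observations
y <- log(x)
exp(sd(y))                                      # should match geoSD(x)
exp(sqrt(sum((y - mean(y))^2) / length(y)))     # should match geoSD(x, sqrt.unbiased = FALSE)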
A numeric scalar – the sample geometric standard deviation.
The geometric standard deviation is only defined for positive observations. It is usually computed only for observations that are assumed to have come from a lognormal distribution.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Leidel, N.A., K.A. Busch, and J.R. Lynch. (1977). Occupational Exposure Sampling Strategy Manual. U.S. Department of Health, Education, and Welfare, Public Health Service, Center for Disease Control, National Institute for Occupational Safety and Health, Cincinnati, Ohio 45226, January, 1977, pp.102–103.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
geoMean
, Lognormal, elnorm
,
summaryFull
, Summary Statistics
.
# Generate 2000 observations from a lognormal distribution with parameters # mean=10 and cv=1, which implies the standard deviation (on the original # scale) is 10. Compute the mean, geometric mean, standard deviation, # and geometric standard deviation. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(2000, mean = 10, cv = 1) mean(dat) #[1] 10.23417 geoMean(dat) #[1] 7.160154 sd(dat) #[1] 9.786493 geoSD(dat) #[1] 2.334358 #---------- # Clean up rm(dat)
Density, distribution function, quantile function, and random generation for the generalized extreme value distribution.
dgevd(x, location = 0, scale = 1, shape = 0) pgevd(q, location = 0, scale = 1, shape = 0) qgevd(p, location = 0, scale = 1, shape = 0) rgevd(n, location = 0, scale = 1, shape = 0)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
location |
vector of location parameters. |
scale |
vector of positive scale parameters. |
shape |
vector of shape parameters. |
Let $X$ be a generalized extreme value random variable with parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$. When the shape parameter $\kappa = 0$, the generalized extreme value distribution reduces to the extreme value distribution. When the shape parameter $\kappa \ne 0$, the cumulative distribution function of $X$ is given by:

$$F(x; \eta, \theta, \kappa) = \exp\left\{ -\left[ 1 - \frac{\kappa (x - \eta)}{\theta} \right]^{1/\kappa} \right\}$$

where $-\infty < \eta, \kappa < \infty$ and $\theta > 0$.

When $\kappa > 0$, the range of $x$ is:

$$-\infty < x \le \eta + \theta / \kappa$$

and when $\kappa < 0$ the range of $x$ is:

$$\eta + \theta / \kappa \le x < \infty$$

The $p$'th quantile of $X$ is given by:

$$x_p = \eta + \frac{\theta \left\{ 1 - [-\log(p)]^{\kappa} \right\}}{\kappa}$$
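The closed-form cdf and quantile expressions above can be checked numerically against pgevd and qgevd; the parameter values below are taken from the examples later in this entry, so the results should agree with the example output (about 0.2795905 and -0.4895683, respectively):

eta <- 1; theta <- 2; kappa <- 0.25
x <- 0.5
exp(-(1 - kappa * (x - eta) / theta)^(1 / kappa))        # closed-form cdf
pgevd(x, location = eta, scale = theta, shape = kappa)   # should agree

p <- 0.9; eta <- -2; theta <- 0.5; kappa <- -0.25
eta + theta * (1 - (-log(p))^kappa) / kappa              # closed-form quantile
qgevd(p, location = eta, scale = theta, shape = kappa)   # should agree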
density (dgevd), probability (pgevd), quantile (qgevd), or random sample (rgevd) for the generalized extreme value distribution with
location parameter(s) determined by location
, scale parameter(s)
determined by scale
, and shape parameter(s) determined by shape
.
Two-parameter extreme value distributions (EVD) have been applied extensively since the 1930's to several fields of study, including the distributions of hydrological and meteorological variables, human lifetimes, and strength of materials. The three-parameter generalized extreme value distribution (GEVD) was introduced by Jenkinson (1955) to model annual maximum and minimum values of meteorological events. Since then, it has been used extensively in the hydrological and meteorological fields.
The three families of EVDs are all special kinds of GEVDs. When the shape parameter $\kappa = 0$, the GEVD reduces to the Type I extreme value (Gumbel) distribution. (The function zTestGevdShape allows you to test the null hypothesis that the shape parameter is equal to 0.) When $\kappa < 0$, the GEVD is the same as the Type II extreme value distribution, and when $\kappa > 0$ it is the same as the Type III extreme value distribution.
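A minimal sketch of the Type I special case (a made-up grid of quantiles, assuming shape = 0 is handled directly by pgevd): with shape = 0 the GEVD cdf should reduce to the Gumbel form exp(-exp(-(x - location)/scale)).

x <- seq(-2, 4, by = 1)
pgevd(x, location = 0, scale = 1, shape = 0)   # GEVD with shape = 0
exp(-exp(-x))                                  # Gumbel (Type I extreme value) cdf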
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Jenkinson, A.F. (1955). The Frequency Distribution of the Annual Maximum (or Minimum) of Meteorological Events. Quarterly Journal of the Royal Meteorological Society, 81, 158–171.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
egevd
, zTestGevdShape
, EVD
,
Probability Distributions and Random Numbers.
# Density of a generalized extreme value distribution with # location=0, scale=1, and shape=0, evaluated at 0.5: dgevd(.5) #[1] 0.3307043 #---------- # The cdf of a generalized extreme value distribution with # location=1, scale=2, and shape=0.25, evaluated at 0.5: pgevd(.5, 1, 2, 0.25) #[1] 0.2795905 #---------- # The 90'th percentile of a generalized extreme value distribution with # location=-2, scale=0.5, and shape=-0.25: qgevd(.9, -2, 0.5, -0.25) #[1] -0.4895683 #---------- # Random sample of 4 observations from a generalized extreme value # distribution with location=5, scale=2, and shape=1. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) rgevd(4, 5, 2, 1) #[1] 6.738692 6.473457 4.446649 5.727085
Alkalinity concentrations (mg/L) in groundwater.
data(Gibbons.et.al.09.Alkilinity.vec)
A numeric vector with 27 elements.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley & Sons, Hoboken. Table 5.5, p. 107.
Vinyl chloride concentrations (µg/L) in groundwater from upgradient
background monitoring wells.
data(Gibbons.et.al.09.Vinyl.Chloride.vec)
A numeric vector with 34 elements.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley & Sons, Hoboken. Table 4.3, p. 87.
Objects of S3 class "gof"
are returned by the EnvStats function
gofTest
when just the x
argument is supplied.
Objects of S3 class "gof"
are lists that contain
information about the assumed distribution, the estimated or
user-supplied distribution parameters, and the test statistic
and p-value.
Required Components
The following components must be included in a legitimate list of
class "gof"
.
distribution |
a character string indicating the name of the
assumed distribution (see |
dist.abb |
a character string containing the abbreviated name
of the distribution (see |
distribution.parameters |
a numeric vector with a names attribute containing the names and values of the estimated or user-supplied distribution parameters associated with the assumed distribution. |
n.param.est |
a scalar indicating the number of distribution
parameters estimated prior to performing the goodness-of-fit
test. The value of this component will be |
estimation.method |
a character string indicating the method
used to compute the estimated parameters. The value of this
component will depend on the available estimation methods
(see |
statistic |
a numeric scalar with a names attribute containing the name and value of the goodness-of-fit statistic. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit test. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the |
z.value |
(except when |
p.value |
numeric scalar containing the p-value associated with the goodness-of-fit statistic. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the
goodness-of-fit test (e.g., |
data |
numeric vector containing the data actually used for the goodness-of-fit test (i.e., the original data without any missing or infinite values). |
data.name |
character string indicating the name of the data object used for the goodness-of-fit test. |
bad.obs |
numeric scalar indicating the number of missing ( |
NOTE: when the function gofTest
is called with
both arguments x
and y
and test="ks"
, it
returns an object of class "gofTwoSample"
.
No specific parametric distribution is assumed, so the value of the component
distribution
is "Equal"
and the following components
are omitted: dist.abb
, distribution.parameters
,
n.param.est
, estimation.method
, and z.value
.
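A short sketch of that case (simulated data; illustrative only): calling gofTest with both y and x performs the two-sample Kolmogorov-Smirnov test, and the result is of class "gofTwoSample" rather than "gof".

set.seed(10)
y1 <- rnorm(20)
y2 <- rnorm(20, mean = 1)
two.samp <- gofTest(y = y1, x = y2, test = "ks")
class(two.samp)   # should be "gofTwoSample"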
Optional Components
The following components are included in the result of
calling gofTest
with the argument test="chisq"
and may be used by the function
plot.gof
:
cut.points |
numeric vector containing the cutpoints used to define the cells. |
counts |
numeric vector containing the observed number of counts for each cell. |
expected |
numeric vector containing the expected number of counts for each cell. |
X2.components |
numeric vector containing the contribution of each cell to the chi-square statistic. |
Generic functions that have methods for objects of class
"gof"
include: print
, plot
.
Since objects of class "gof"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
gofTest
, print.gof
, plot.gof
,
Goodness-of-Fit Tests,
Distribution.df
, gofCensored.object
.
# Create an object of class "gof", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) gof.obj <- gofTest(dat) mode(gof.obj) #[1] "list" class(gof.obj) #[1] "gof" names(gof.obj) # [1] "distribution" "dist.abb" # [3] "distribution.parameters" "n.param.est" # [5] "estimation.method" "statistic" # [7] "sample.size" "parameters" # [9] "z.value" "p.value" #[11] "alternative" "method" #[13] "data" "data.name" #[15] "bad.obs" gof.obj #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Normal # #Estimated Parameter(s): mean = 2.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Test Statistic: W = 0.9640724 # #Test Statistic Parameter: n = 20 # #P-value: 0.6279872 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. #========== # Extract the p-value #-------------------- gof.obj$p.value #[1] 0.6279872 #========== # Plot the results of the test #----------------------------- dev.new() plot(gof.obj) #========== # Clean up #--------- rm(dat, gof.obj) graphics.off()
Objects of S3 class "gofCensored"
are returned by the EnvStats function
gofTestCensored
.
Objects of S3 class "gofCensored"
are lists that contain
information about the assumed distribution, the amount of censoring,
the estimated or user-supplied distribution parameters, and the test
statistic and p-value.
Required Components
The following components must be included in a legitimate list of
class "gofCensored"
.
distribution |
a character string indicating the name of the
assumed distribution (see |
dist.abb |
a character string containing the abbreviated name
of the distribution (see |
distribution.parameters |
a numeric vector with a names attribute containing the names and values of the estimated or user-supplied distribution parameters associated with the assumed distribution. |
n.param.est |
a scalar indicating the number of distribution
parameters estimated prior to performing the goodness-of-fit
test. The value of this component will be |
estimation.method |
a character string indicating the method
used to compute the estimated parameters. The value of this
component will depend on the available estimation methods
(see |
statistic |
a numeric scalar with a names attribute containing the name and value of the goodness-of-fit statistic. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit test. |
censoring.side |
character string indicating whether the data are left- or right-censored. |
censoring.levels |
numeric scalar or vector indicating the censoring level(s). |
percent.censored |
numeric scalar indicating the percent of non-missing observations that are censored. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the |
z.value |
(except when |
p.value |
numeric scalar containing the p-value associated with the goodness-of-fit statistic. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the
goodness-of-fit test (e.g., |
data.name |
character string indicating the name of the data object used for the goodness-of-fit test. |
censored |
logical vector indicating which observations are censored. |
censoring.name |
character string indicating the name of the object used to indicate the censoring. |
Optional Components
The following components are included when the argument keep.data
is
set to TRUE
in the call to the function producing the
object of class "gofCensored"
.
data |
numeric vector containing the data actually used for the goodness-of-fit test (i.e., the original data without any missing or infinite values). |
censored |
logical vector indicating the censoring status of the data actually used for the goodness-of-fit test. |
The following component is included when the data object
contains missing (NA
), undefined (NaN
) and/or infinite
(Inf
, -Inf
) values.
bad.obs |
numeric scalar indicating the number of missing ( |
Generic functions that have methods for objects of class
"gofCensored"
include: print
, plot
.
Since objects of class "gofCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
gofTestCensored
, print.gofCensored
,
plot.gofCensored
,
Censored Data,
Goodness-of-Fit Tests,
Distribution.df
, gof.object
.
# Create an object of class "gofCensored", then print it out. #------------------------------------------------------------ gofCensored.obj <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf")) mode(gofCensored.obj) #[1] "list" class(gofCensored.obj) #[1] "gofCensored" names(gofCensored.obj) # [1] "distribution" "dist.abb" # [3] "distribution.parameters" "n.param.est" # [5] "estimation.method" "statistic" # [7] "sample.size" "censoring.side" # [9] "censoring.levels" "percent.censored" #[11] "parameters" "z.value" #[13] "p.value" "alternative" #[15] "method" "data" #[17] "data.name" "censored" #[19] "censoring.name" "bad.obs" gofCensored.obj #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Francia GOF # (Multiply Censored Data) # #Hypothesized Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 15.23508 # sd = 30.62812 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Test Statistic: W = 0.8368016 # #Test Statistic Parameters: N = 25.00 # DELTA = 0.24 # #P-value: 0.004662658 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. #========== # Extract the p-value #-------------------- gofCensored.obj$p.value #[1] 0.004662658 #========== # Plot the results of the test #----------------------------- dev.new() plot(gofCensored.obj) #========== # Clean up #--------- rm(gofCensored.obj) graphics.off()
Objects of S3 class "gofGroup"
are returned by the EnvStats function
gofGroupTest
.
Objects of S3 class "gofGroup"
are lists that contain
information about the assumed distribution, the estimated or
user-supplied distribution parameters, and the test statistic
and p-value.
Required Components
The following components must be included in a legitimate list of
class "gofGroup"
.
distribution |
a character string indicating the name of the
assumed distribution (see |
dist.abb |
a character string containing the abbreviated name
of the distribution (see |
statistic |
a numeric scalar with a names attribute containing the name and value of the goodness-of-fit statistic. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit test. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the |
p.value |
numeric scalar containing the p-value associated with the goodness-of-fit statistic. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the
goodness-of-fit test (e.g., |
data.name |
character string indicating the name of the data object used for the goodness-of-fit test. |
grouping.variable |
character string indicating the name of the variable defining the groups. |
bad.obs |
numeric vector indicating the number of missing ( |
n.groups |
numeric scalar containing the number of groups. |
group.names |
character vector containing the levels of the grouping variable, i.e., the names of each of the groups. |
group.scores |
numeric vector containing the individual statistics for each group. |
Optional Component
The following component is included when gofGroupTest
is
called with a formula for the first argument and a data
argument.
parent.of.data |
character string indicating the name of the object supplied
in the |
Generic functions that have methods for objects of class
"gofGroup"
include: print
, plot
.
Since objects of class "gofGroup"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
gofGroupTest
, print.gofGroup
, plot.gofGroup
,
Goodness-of-Fit Tests,
Distribution.df
.
# Create an object of class "gofGroup", then print it out. # Example 10-4 of USEPA (2009, page 10-20) gives an example of # simultaneously testing the assumption of normality for nickel # concentrations (ppb) in groundwater collected at 4 monitoring # wells over 5 months. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. gofGroup.obj <- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df) mode(gofGroup.obj) #[1] "list" class(gofGroup.obj) #[1] "gofGroup" names(gofGroup.obj) # [1] "distribution" "dist.abb" "statistic" # [4] "sample.size" "parameters" "p.value" # [7] "alternative" "method" "data.name" #[10] "grouping.variable" "parent.of.data" "bad.obs" #[13] "n.groups" "group.names" "group.scores" gofGroup.obj #Results of Group Goodness-of-Fit Test #------------------------------------- # #Test Method: Wilk-Shapiro GOF (Normal Scores) # #Hypothesized Distribution: Normal # #Data: Nickel.ppb # #Grouping Variable: Well # #Data Source: EPA.09.Ex.10.1.nickel.df # #Number of Groups: 4 # #Sample Sizes: Well.1 = 5 # Well.2 = 5 # Well.3 = 5 # Well.4 = 5 # #Test Statistic: z (G) = -3.658696 # #P-values for #Individual Tests: Well.1 = 0.03510747 # Well.2 = 0.02385344 # Well.3 = 0.01120775 # Well.4 = 0.10681461 # #P-value for #Group Test: 0.0001267509 # #Alternative Hypothesis: At least one group # does not come from a # Normal Distribution. #========== # Extract the p-values #--------------------- gofGroup.obj$p.value # Well.1 Well.2 Well.3 Well.4 z (G) #0.0351074733 0.0238534406 0.0112077511 0.1068146088 0.0001267509 #========== # Plot the results of the test #----------------------------- dev.new() plot(gofGroup.obj) #========== # Clean up #--------- rm(gofGroup.obj) graphics.off()
Perform a goodness-of-fit test to determine whether data in a set of groups appear to all come from the same probability distribution (with possibly different parameters for each group).
gofGroupTest(object, ...) ## S3 method for class 'formula' gofGroupTest(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: gofGroupTest(object, group, test = "sw", distribution = "norm", est.arg.list = NULL, n.classes = NULL, cut.points = NULL, param.list = NULL, estimate.params = ifelse(is.null(param.list), TRUE, FALSE), n.param.est = NULL, correct = NULL, digits = .Options$digits, exact = NULL, ws.method = "normal scores", data.name = NULL, group.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...) ## S3 method for class 'data.frame' gofGroupTest(object, ...) ## S3 method for class 'matrix' gofGroupTest(object, ...) ## S3 method for class 'list' gofGroupTest(object, ...)
object |
an object containing data for 2 or more groups to be compared to the
hypothesized distribution specified by |
data |
when |
subset |
when |
na.action |
when |
group |
when |
test |
character string defining which goodness-of-fit test to perform on each group.
Possible values are:
|
distribution |
a character string denoting the distribution abbreviation. See the help file for
When When When When When |
est.arg.list |
a list of arguments to be passed to the function estimating the distribution parameters
for each group of observations.
For example, if When When When When |
n.classes |
for the case when |
cut.points |
for the case when |
param.list |
for the case when |
estimate.params |
for the case when |
n.param.est |
for the case when |
correct |
for the case when |
digits |
a scalar indicating how many significant digits to print out for the parameters
associated with the hypothesized distribution. The default value is |
exact |
for the case when |
ws.method |
character string indicating which method to use when performing the
Wilk-Shapiro test for a Uniform [0,1] distribution
on the p-values from the goodness-of-fit tests on each group. Possible values
are NOTE: In the case where you are testing whether each group comes from a
Uniform [0,1] distribution (i.e., when you set
|
data.name |
character string indicating the name of the data used for the goodness-of-fit tests.
The default value is |
group.name |
character string indicating the name of the data used to create the groups.
The default value is |
parent.of.data |
character string indicating the source of the data used for the goodness-of-fit tests. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the goodness-of-fit test. |
The function gofGroupTest
performs a goodness-of-fit test for each group of
data by calling the function gofTest
. Using the p-values from these
goodness-of-fit tests, it then calls the function gofTest
with the
argument test="ws"
to test whether the p-values appear to come from a
Uniform [0,1] distribution.
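The sketch below retraces this two-stage procedure by hand for the nickel data used in the examples; the per-group call and the distribution = "unif" argument in the second stage are assumptions based on the description above, not a copy of the function's internal code.

# Stage 1: goodness-of-fit test within each group (well)
p.vals <- with(EPA.09.Ex.10.1.nickel.df,
    sapply(split(Nickel.ppb, Well), function(z) gofTest(z, test = "sw")$p.value))

# Stage 2: Wilk-Shapiro test that the per-group p-values look Uniform [0,1]
gofTest(p.vals, test = "ws", distribution = "unif")$p.value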
a list of class "gofGroup"
containing the results of the group goodness-of-fit test.
Objects of class "gofGroup"
have special printing and plotting methods.
See the help file for gofGroup.object
for details.
The Wilk-Shapiro (1968) tests for a Uniform [0, 1] distribution were introduced in the context
of testing whether several independent samples all come from normal distributions, with
possibly different means and variances. The function gofGroupTest
extends
this idea to allow you to test whether several independent samples come from the same
distribution (e.g., gamma, extreme value, etc.), with possibly different parameters.
Examples of simultaneously assessing whether several groups come from the same distribution are given in USEPA (2009) and Gibbons et al. (2009).
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlot
).
Steven P. Millard ([email protected])
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-17.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
gofTest
, gofGroup.object
, print.gofGroup
,
plot.gofGroup
, qqPlot
.
# Example 10-4 of USEPA (2009, page 10-20) gives an example of # simultaneously testing the assumption of normality for nickel # concentrations (ppb) in groundwater collected at 4 monitoring # wells over 5 months. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #4 8 Well.1 56.0 #5 10 Well.1 8.7 #6 1 Well.2 19.0 #7 3 Well.2 81.5 #8 6 Well.2 331.0 #9 8 Well.2 14.0 #10 10 Well.2 64.4 #11 1 Well.3 39.0 #12 3 Well.3 151.0 #13 6 Well.3 27.0 #14 8 Well.3 21.4 #15 10 Well.3 578.0 #16 1 Well.4 3.1 #17 3 Well.4 942.0 #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Test for a normal distribution at each well: #-------------------------------------------- gofGroup.list <- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df) gofGroup.list #Results of Group Goodness-of-Fit Test #------------------------------------- # #Test Method: Wilk-Shapiro GOF (Normal Scores) # #Hypothesized Distribution: Normal # #Data: Nickel.ppb # #Grouping Variable: Well # #Data Source: EPA.09.Ex.10.1.nickel.df # #Number of Groups: 4 # #Sample Sizes: Well.1 = 5 # Well.2 = 5 # Well.3 = 5 # Well.4 = 5 # #Test Statistic: z (G) = -3.658696 # #P-values for #Individual Tests: Well.1 = 0.03510747 # Well.2 = 0.02385344 # Well.3 = 0.01120775 # Well.4 = 0.10681461 # #P-value for #Group Test: 0.0001267509 # #Alternative Hypothesis: At least one group # does not come from a # Normal Distribution. dev.new() plot(gofGroup.list) #---------- # Test for a lognormal distribution at each well: #----------------------------------------------- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df, dist = "lnorm") #Results of Group Goodness-of-Fit Test #------------------------------------- # #Test Method: Wilk-Shapiro GOF (Normal Scores) # #Hypothesized Distribution: Lognormal # #Data: Nickel.ppb # #Grouping Variable: Well # #Data Source: EPA.09.Ex.10.1.nickel.df # #Number of Groups: 4 # #Sample Sizes: Well.1 = 5 # Well.2 = 5 # Well.3 = 5 # Well.4 = 5 # #Test Statistic: z (G) = 0.2401720 # #P-values for #Individual Tests: Well.1 = 0.6898164 # Well.2 = 0.6700394 # Well.3 = 0.3208299 # Well.4 = 0.5041375 # #P-value for #Group Test: 0.5949015 # #Alternative Hypothesis: At least one group # does not come from a # Lognormal Distribution. #---------- # Clean up rm(gofGroup.list) graphics.off()
Objects of S3 class "gofOutlier"
are returned by the EnvStats function
rosnerTest
.
Objects of S3 class "gofOutlier"
are lists that contain
information about the assumed distribution, the test statistics,
the Type I error level, and the number of outliers detected.
Required Components
The following components must be included in a legitimate list of
class "gofOutlier"
.
distribution |
a character string indicating the name of the
assumed distribution (see |
statistic |
a numeric vector with a names attribute containing the names and values of the outlier test statistic for each outlier tested. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the outlier test. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the |
alpha |
numeric scalar indicating the Type I error level. |
crit.value |
numeric vector containing the critical values associated with the test for each outlier. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the outlier test. |
data |
numeric vector containing the data actually used for the outlier test (i.e., the original data without any missing or infinite values). |
data.name |
character string indicating the name of the data object used for the goodness-of-fit test. |
all.stats |
data frame containing all of the results of the test. |
Optional Components
The following component is included when the data object
contains missing (NA
), undefined (NaN
) and/or infinite
(Inf
, -Inf
) values.
bad.obs |
numeric scalar indicating the number of missing ( |
Generic functions that have methods for objects of class
"gofOutlier"
include: print
.
Since objects of class "gofOutlier"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
rosnerTest
, print.gofOutlier
,
Goodness-of-Fit Tests.
# Create an object of class "gofOutlier", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- c(rnorm(30, mean = 3, sd = 2), rnorm(3, mean = 10, sd = 1)) gofOutlier.obj <- rosnerTest(dat, k = 4) mode(gofOutlier.obj) #[1] "list" class(gofOutlier.obj) #[1] "gofOutlier" names(gofOutlier.obj) # [1] "distribution" "statistic" "sample.size" "parameters" # [5] "alpha" "crit.value" "n.outliers" "alternative" # [9] "method" "data" "data.name" "bad.obs" #[13] "all.stats" gofOutlier.obj #Results of Outlier Test #------------------------- # #Test Method: Rosner's Test for Outliers # #Hypothesized Distribution: Normal # #Data: dat # #Sample Size: 33 # #Test Statistics: R.1 = 2.848514 # R.2 = 3.086875 # R.3 = 3.033044 # R.4 = 2.380235 # #Test Statistic Parameter: k = 4 # #Alternative Hypothesis: Up to 4 observations are not # from the same Distribution. # #Type I Error: 5% # #Number of Outliers Detected: 3 # # i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier #1 0 3.549744 2.531011 10.7593656 33 2.848514 2.951949 TRUE #2 1 3.324444 2.209872 10.1460427 31 3.086875 2.938048 TRUE #3 2 3.104392 1.856109 8.7340527 32 3.033044 2.923571 TRUE #4 3 2.916737 1.560335 -0.7972275 25 2.380235 2.908473 FALSE #========== # Extract the data frame with all the test results #------------------------------------------------- gofOutlier.obj$all.stats # i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier #1 0 3.549744 2.531011 10.7593656 33 2.848514 2.951949 TRUE #2 1 3.324444 2.209872 10.1460427 31 3.086875 2.938048 TRUE #3 2 3.104392 1.856109 8.7340527 32 3.033044 2.923571 TRUE #4 3 2.916737 1.560335 -0.7972275 25 2.380235 2.908473 FALSE #========== # Clean up #--------- rm(dat, gofOutlier.obj)
Perform a goodness-of-fit test to determine whether a data set appears to come from a specified probability distribution or if two data sets appear to come from the same distribution.
gofTest(y, ...) ## S3 method for class 'formula' gofTest(y, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: gofTest(y, x = NULL, test = ifelse(is.null(x), "sw", "ks"), distribution = "norm", est.arg.list = NULL, alternative = "two.sided", n.classes = NULL, cut.points = NULL, param.list = NULL, estimate.params = ifelse(is.null(param.list), TRUE, FALSE), n.param.est = NULL, correct = NULL, digits = .Options$digits, exact = NULL, ws.method = "normal scores", warn = TRUE, keep.data = TRUE, data.name = NULL, data.name.x = NULL, parent.of.data = NULL, subset.expression = NULL, ...)
y |
an object containing data for the goodness-of-fit test. In the default
method, the argument |
data |
specifies an optional data frame, list or environment (or object coercible
by |
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
x |
numeric vector of values for the first sample in the case of a two-sample
Kolmogorov-Smirnov goodness-of-fit test ( |
test |
character string defining which goodness-of-fit test to perform. Possible values are:
When the argument |
distribution |
a character string denoting the distribution abbreviation. See the help file for
When When When When When When |
est.arg.list |
a list of arguments to be passed to the function estimating the distribution parameters.
For example, if When When When When |
alternative |
for the case when |
n.classes |
for the case when |
cut.points |
for the case when |
param.list |
for the case when |
estimate.params |
for the case when |
n.param.est |
for the case when |
correct |
for the case when |
digits |
for the case when |
exact |
for the case when |
ws.method |
for the case when |
warn |
logical scalar indicating whether to print a warning message when
observations with |
keep.data |
logical scalar indicating whether to return the data used for the goodness-of-fit test.
The default value is |
data.name |
character string indicating the name of the data used for argument |
data.name.x |
character string indicating the name of the data used for argument |
parent.of.data |
character string indicating the source of the data used for the goodness-of-fit test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the goodness-of-fit test. |
Shapiro-Wilk Goodness-of-Fit Test (test="sw"
).
The Shapiro-Wilk goodness-of-fit test (Shapiro and Wilk, 1965; Royston, 1992a)
is one of the most commonly used goodness-of-fit tests for normality.
You can use it to test the following hypothesized distributions:
Normal, Lognormal, Three-Parameter Lognormal,
Zero-Modified Normal, or
Zero-Modified Lognormal (Delta).
In addition, you can also use it to test the null hypothesis of any
continuous distribution that is available (see the help file for
Distribution.df
, and see explanation below).
Shapiro-Wilk W-Statistic and P-Value for Testing Normality
Let $X$ denote a random variable with cumulative distribution function (cdf) $F$. Suppose we want to test the null hypothesis that $F$ is the cdf of a normal (Gaussian) distribution with some arbitrary mean $\mu$ and standard deviation $\sigma$ against the alternative hypothesis that $F$ is the cdf of some other distribution. The table below shows the random variable for which $F$ is the assumed cdf, given the value of the argument distribution.

| Value of distribution | Distribution Name | Random Variable for which $F$ is the cdf |
|---|---|---|
| "norm" | Normal | $X$ |
| "lnorm" | Lognormal (Log-space) | $\log(X)$ |
| "lnormAlt" | Lognormal (Untransformed) | $\log(X)$ |
| "lnorm3" | Three-Parameter Lognormal | $\log(X - \gamma)$ |
| "zmnorm" | Zero-Modified Normal | $X \mid X > 0$ |
| "zmlnorm" | Zero-Modified Lognormal (Log-space) | $\log(X) \mid X > 0$ |
| "zmlnormAlt" | Zero-Modified Lognormal (Untransformed) | $\log(X) \mid X > 0$ |

Note that for the three-parameter lognormal distribution, the symbol $\gamma$ denotes the threshold parameter.
Let $\underline{x} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ denote the vector of $n$ ordered observations assumed to come from a normal distribution.

The Shapiro-Wilk W-Statistic

Shapiro and Wilk (1965) introduced the following statistic to test the null hypothesis that $F$ is the cdf of a normal distribution:

$$W = \frac{\left[ \sum_{i=1}^n a_i x_{(i)} \right]^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (1)$$

where the quantity $a_i$ is the $i$'th element of the vector $\underline{a}$ defined by:

$$\underline{a} = \frac{\underline{m}^T V^{-1}}{\left[ \underline{m}^T V^{-1} V^{-1} \underline{m} \right]^{1/2}} \;\;\;\;\;\; (2)$$

where $T$ denotes the transpose operator, and $\underline{m}$ is the vector of expected values and $V$ is the variance-covariance matrix of the order statistics of a random sample of size $n$ from a standard normal distribution. That is, the values of $\underline{a}$ are the expected values of the standard normal order statistics weighted by their variance-covariance matrix, and normalized so that

$$\underline{a}^T \underline{a} = 1 \;\;\;\;\;\; (3)$$

It can be shown that the coefficients $a_i$ are antisymmetric, that is,

$$a_i = -a_{n-i+1} \;\;\;\;\;\; (4)$$

and for odd $n$,

$$a_{(n+1)/2} = 0 \;\;\;\;\;\; (5)$$

Now because

$$\sum_{i=1}^n a_i = 0 \;\;\;\;\;\; (6)$$

and

$$\sum_{i=1}^n a_i^2 = 1 \;\;\;\;\;\; (7)$$

the $W$-statistic in Equation (1) is the same as the square of the sample product-moment correlation between the vectors $\underline{a}$ and $\underline{x}$:

$$W = r(\underline{a}, \underline{x})^2 \;\;\;\;\;\; (8)$$

where

$$r(\underline{x}, \underline{y}) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2 \right]^{1/2}} \;\;\;\;\;\; (9)$$

(see the R help file for cor).

The Shapiro-Wilk $W$-statistic is also simply the ratio of two estimators of variance, and can be rewritten as

$$W = \frac{\hat{\sigma}^2_{BLUE}}{\hat{\sigma}^2_{MVUE}} \;\;\;\;\;\; (10)$$

where the numerator is the square of the best linear unbiased estimate (BLUE) of the standard deviation, and the denominator is the minimum variance unbiased estimator (MVUE) of the variance:

$$\hat{\sigma}_{BLUE} = \frac{\sum_{i=1}^n a_i x_{(i)}}{\sqrt{n-1}} \;\;\;\;\;\; (11)$$

$$\hat{\sigma}^2_{MVUE} = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} \;\;\;\;\;\; (12)$$
Small values of $W$ indicate the null hypothesis is probably not true. Shapiro and Wilk (1965) computed the values of the coefficients $\underline{a}$ and the percentage points for $W$ (based on smoothing the empirical null distribution of $W$) for sample sizes up to 50. Computation of the $W$-statistic for larger sample sizes can be cumbersome, since computation of the coefficients $\underline{a}$ requires storage of at least $n(n+1)/2$ reals followed by an $n \times n$ matrix inversion (Royston, 1992a).
The Shapiro-Francia W'-Statistic

Shapiro and Francia (1972) introduced a modification of the $W$-test that depends only on the expected values of the order statistics ($\underline{m}$) and not on the variance-covariance matrix ($V$):

$$W' = \frac{\left[ \sum_{i=1}^n b_i x_{(i)} \right]^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (13)$$

where the quantity $b_i$ is the $i$'th element of the vector $\underline{b}$ defined by:

$$\underline{b} = \frac{\underline{m}}{\left[ \underline{m}^T \underline{m} \right]^{1/2}} \;\;\;\;\;\; (14)$$

Several authors, including Ryan and Joiner (1973), Filliben (1975), and Weisberg and Bingham (1975), note that the $W'$-statistic is intuitively appealing because it is the squared Pearson correlation coefficient associated with a normal probability plot. That is, it is the squared correlation between the ordered sample values $x_{(i)}$ and the expected normal order statistics $m_i$:

$$W' = r(\underline{m}, \underline{x})^2 \;\;\;\;\;\; (15)$$

Shapiro and Francia (1972) present a table of empirical percentage points for $W'$ based on a Monte Carlo simulation. It can be shown that the asymptotic null distributions of $W$ and $W'$ are identical, but convergence is very slow (Verrill and Johnson, 1988).
The Weisberg-Bingham Approximation to the W'-Statistic

Weisberg and Bingham (1975) introduced an approximation of the Shapiro-Francia $W'$-statistic that is easier to compute. They suggested using Blom scores (Blom, 1958, pp. 68–75) to approximate the elements of $\underline{m}$:

$$\tilde{W}' = \frac{\left[ \sum_{i=1}^n c_i x_{(i)} \right]^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (16)$$

where the quantity $c_i$ is the $i$'th element of the vector $\underline{c}$ defined by:

$$\underline{c} = \frac{\tilde{\underline{m}}}{\left[ \tilde{\underline{m}}^T \tilde{\underline{m}} \right]^{1/2}} \;\;\;\;\;\; (17)$$

and

$$\tilde{m}_i = \Phi^{-1}\!\left( \frac{i - 3/8}{n + 1/4} \right) \;\;\;\;\;\; (18)$$

and $\Phi$ denotes the standard normal cdf. That is, the values of the elements of $\underline{m}$ in Equation (14) are replaced with their estimates based on the usual plotting positions for a normal distribution.
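The Weisberg-Bingham approximation is easy to compute directly. The following is a minimal sketch (not part of the package examples), assuming only base R; ppoints(n, a = 3/8) produces the Blom plotting positions $(i - 3/8)/(n + 1/4)$ used in Equation (18).

# Minimal sketch: Weisberg-Bingham approximation to the Shapiro-Francia W'-statistic
set.seed(47)
x <- rnorm(20, mean = 3, sd = 2)

n <- length(x)
m.tilde <- qnorm(ppoints(n, a = 3/8))   # Blom scores: Phi^{-1}((i - 3/8)/(n + 1/4))
W.tilde <- cor(sort(x), m.tilde)^2      # squared correlation with the ordered data
W.tilde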
Royston's Approximation to the Shapiro-Wilk W-Test

Royston (1992a) presents an approximation for the coefficients $\underline{a}$ necessary to compute the Shapiro-Wilk $W$-statistic, and also a transformation of the $W$-statistic that has approximately a standard normal distribution under the null hypothesis.

Noting that, up to a constant, the components of $\underline{b}$ in Equation (14) and $\underline{c}$ in Equation (17) differ from those of $\underline{a}$ in Equation (2) mainly in the first and last two components, Royston (1992a) used the approximation $\underline{a} \approx \underline{c}$ as the basis for approximating the last two (and hence, by antisymmetry, the first two) components of $\underline{a}$ by polynomial (quintic) regression analysis in $1/\sqrt{n}$; the explicit polynomial coefficients are given in Royston (1992a). The other components are computed as:

$$a_i = \frac{\tilde{m}_i}{\sqrt{\phi}}$$

for $i = 2, \ldots, n-1$ if $n \le 5$, or $i = 3, \ldots, n-2$ if $n \ge 6$, where $\tilde{m}_i$ is the Blom score defined in Equation (18) and $\phi$ is a normalizing constant whose form depends on whether $n \le 5$ or $n \ge 6$ (see Royston, 1992a).

Royston (1992a) found his approximation to $\underline{a}$ to be accurate to the third decimal place over all values of $n$ and selected values of $i$, and also found that critical percentage points of $W$ based on his approximation agreed closely with the exact critical percentage points calculated by Verrill and Johnson (1988).
Transformation of the Null Distribution of W to Normality

In order to compute a p-value associated with a particular value of $W$, Royston (1992a) approximated the distribution of $(1 - W)$ by a three-parameter lognormal distribution for $4 \le n \le 11$, and the upper half of the distribution of $(1 - W)$ by a two-parameter lognormal distribution for $12 \le n \le 2000$. In both cases the result is a standardized statistic $z$ (a logarithmic transformation of $1 - W$, centered and scaled by quantities that are functions of $n$) that is approximately standard normal under the null hypothesis, so the p-value associated with $W$ is given by:

$$p = 1 - \Phi(z)$$

The quantities necessary to compute $z$ for $4 \le n \le 11$ and for $12 \le n \le 2000$ are given in Royston (1992a). For the last approximation when $12 \le n \le 2000$, Royston (1992a) claims this approximation is actually valid for sample sizes up to $n = 5000$.
Modification for the Three-Parameter Lognormal Distribution

When distribution="lnorm3", the function gofTest assumes the vector $\underline{x}$ is a random sample from a three-parameter lognormal distribution. It estimates the threshold parameter via the zero-skewness method (see elnorm3), and then performs the Shapiro-Wilk goodness-of-fit test for normality on $\log(\underline{x} - \hat{\gamma})$, where $\hat{\gamma}$ is the estimated threshold parameter. Because the threshold parameter has to be estimated, however, the p-value associated with the computed z-statistic will tend to be conservative (larger than it should be under the null hypothesis). Royston (1992b) proposed a correction to the z-statistic that accounts for the estimation of the threshold parameter (the explicit formulas, which depend on the sample size and the estimated threshold, are given in Royston, 1992b). Denoting the corrected statistic by $z'$, the p-value associated with this test is then given by:

$$p = 1 - \Phi(z')$$
Testing Goodness-of-Fit for Any Continuous Distribution

The function gofTest extends the Shapiro-Wilk test to test for goodness-of-fit for any continuous distribution by using the idea of Chen and Balakrishnan (1995), who proposed a general purpose approximate goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality. The function gofTest modifies the approach of Chen and Balakrishnan (1995) by using the same first 2 steps, and then applying the Shapiro-Wilk test (a minimal R sketch of these steps follows the list):

1. Let $\underline{x} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ denote the vector of $n$ ordered observations. Compute cumulative probabilities for each $x_{(i)}$ based on the cumulative distribution function for the hypothesized distribution. That is, compute $p_i = F(x_{(i)}, \hat{\theta})$, where $F(x, \theta)$ denotes the hypothesized cumulative distribution function with parameter(s) $\theta$, and $\hat{\theta}$ denotes the estimated parameter(s).

2. Compute standard normal deviates based on the computed cumulative probabilities: $y_{(i)} = \Phi^{-1}(p_i)$.

3. Perform the Shapiro-Wilk goodness-of-fit test on the $y_{(i)}$'s.
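To make the two preliminary steps concrete, here is a minimal sketch (not from the package) of the Chen-Balakrishnan transform for a hypothesized gamma distribution. The gamma parameter estimates below are simple method-of-moments values used purely for illustration; gofTest itself estimates parameters with whatever method is implied by est.arg.list.

# Minimal sketch of the Chen & Balakrishnan (1995) transform followed by shapiro.test().
set.seed(47)
dat <- rgamma(20, shape = 2, scale = 3)

shape.hat <- mean(dat)^2 / var(dat)      # method-of-moments estimates (illustrative only)
scale.hat <- var(dat) / mean(dat)

p <- pgamma(sort(dat), shape = shape.hat, scale = scale.hat)  # step 1: p_i = F(x_(i), theta-hat)
y <- qnorm(p)                                                 # step 2: standard normal deviates
shapiro.test(y)                                               # step 3: Shapiro-Wilk on the y's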
Shapiro-Francia Goodness-of-Fit Test (test="sf"
).
The Shapiro-Francia goodness-of-fit test (Shapiro and Francia, 1972;
Weisberg and Bingham, 1975; Royston, 1992c) is also one of the most commonly
used goodness-of-fit tests for normality. You can use it to test the following
hypothesized distributions:
Normal, Lognormal, Zero-Modified Normal,
or Zero-Modified Lognormal (Delta). In addition,
you can also use it to test the null hypothesis of any continuous distribution
that is available (see the help file for Distribution.df
). See the
section Testing Goodness-of-Fit for Any Continuous Distribution above for
an explanation of how this is done.
Royston's Transformation of the Shapiro-Francia W'-Statistic to Normality

Equation (13) above gives the formula for the Shapiro-Francia W'-statistic, and Equation (16) above gives the formula for the Weisberg-Bingham approximation to the W'-statistic (denoted $\tilde{W}'$). Royston (1992c) presents an algorithm to transform the $\tilde{W}'$-statistic so that its null distribution is approximately a standard normal. For $5 \le n \le 5000$, Royston (1992c) approximates the distribution of $(1 - \tilde{W}')$ by a lognormal distribution. Setting

$$z = \frac{\log(1 - \tilde{W}') - \mu}{\sigma}$$

the p-value associated with $\tilde{W}'$ is given by:

$$p = 1 - \Phi(z)$$

The quantities $\mu$ and $\sigma$ (both functions of the sample size $n$) necessary to compute $z$ are given in Royston (1992c).
Testing Goodness-of-Fit for Any Continuous Distribution

The function gofTest extends the Shapiro-Francia test to test for goodness-of-fit for any continuous distribution by using the idea of Chen and Balakrishnan (1995), who proposed a general purpose approximate goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality. The function gofTest modifies the approach of Chen and Balakrishnan (1995) by using the same first 2 steps, and then applying the Shapiro-Francia test:

1. Let $\underline{x} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ denote the vector of $n$ ordered observations. Compute cumulative probabilities for each $x_{(i)}$ based on the cumulative distribution function for the hypothesized distribution. That is, compute $p_i = F(x_{(i)}, \hat{\theta})$, where $F(x, \theta)$ denotes the hypothesized cumulative distribution function with parameter(s) $\theta$, and $\hat{\theta}$ denotes the estimated parameter(s).

2. Compute standard normal deviates based on the computed cumulative probabilities: $y_{(i)} = \Phi^{-1}(p_i)$.

3. Perform the Shapiro-Francia goodness-of-fit test on the $y_{(i)}$'s.
Probability Plot Correlation Coefficient (PPCC) Goodness-of-Fit Test (test="ppcc"
).
The PPCC goodness-of-fit test (Filliben, 1975; Looney and Gulledge, 1985) can be
used to test the following hypothesized distributions:
Normal, Lognormal,
Zero-Modified Normal, or
Zero-Modified Lognormal (Delta). In addition,
you can also use it to test the null hypothesis of any continuous distribution that
is available (see the help file for Distribution.df
).
Filliben (1975) proposed using the correlation coefficient $r$ from a normal probability plot to perform a goodness-of-fit test for normality, and he provided a table of critical values for $r$ under the null hypothesis of normality for sample sizes between 3 and 100. Vogel (1986) provided an additional table for sample sizes between 100 and 10,000.
Looney and Gulledge (1985) investigated the characteristics of Filliben's
probability plot correlation coefficient (PPCC) test using the plotting position
formulas given in Filliben (1975), as well as three other plotting position
formulas: Hazen plotting positions, Weibull plotting positions, and Blom plotting
positions (see the help file for qqPlot
for an explanation of these
plotting positions). They concluded that the PPCC test based on Blom plotting
positions performs slightly better than tests based on other plotting positions, and
they provide a table of empirical percentage points for the distribution of
based on Blom plotting positions.
The function gofTest
computes the PPCC test statistic using Blom
plotting positions. It can be shown that the square of this statistic is
equivalent to the Weisberg-Bingham Approximation to the Shapiro-Francia
W'-Test (Weisberg and Bingham, 1975; Royston, 1993). Thus the PPCC
goodness-of-fit test is equivalent to the Shapiro-Francia goodness-of-fit test.
Anderson-Darling Goodness-of-Fit Test (test="ad"
).
The Anderson-Darling goodness-of-fit test (Stephens, 1986a; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="ad"
, the function gofTest
calls the function
ad.test
in the package nortest. Documentation from that
package is as follows:
The Anderson-Darling test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is:

$$A = -n - \frac{1}{n} \sum_{i=1}^{n} [2i - 1]\left[ \ln(p_{(i)}) + \ln(1 - p_{(n-i+1)}) \right]$$

where $p_{(i)} = \Phi\!\left( [x_{(i)} - \bar{x}]/s \right)$. Here, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\bar{x}$ and $s$ are the mean and standard deviation of the data values. The p-value is computed from the modified statistic $Z = A \,(1 + 0.75/n + 2.25/n^2)$ according to Table 4.9 in Stephens [(1986a)].
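Because the documentation above states that gofTest simply wraps nortest::ad.test here, one quick way to see the connection is to call the nortest function directly. This sketch assumes the nortest package is installed; the agreement with gofTest is an expectation implied by the wrapping, not a result reproduced here.

# Minimal sketch (assumes the nortest package is installed)
library(nortest)
set.seed(47)
dat <- rnorm(25, mean = 10, sd = 2)
ad.test(dat)                            # Anderson-Darling test from nortest
# EnvStats::gofTest(dat, test = "ad")   # documented to call ad.test() internally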
Cramer-von Mises Goodness-of-Fit Test (test="cvm"
).
The Cramer-von Mises goodness-of-fit test (Stephens, 1986a; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="cvm"
, the function gofTest
calls the function
cvm.test
in the package nortest. Documentation from that
package is as follows:
The Cramer-von Mises test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is:

$$W = \frac{1}{12n} + \sum_{i=1}^{n} \left( p_{(i)} - \frac{2i - 1}{2n} \right)^2$$

where $p_{(i)} = \Phi\!\left( [x_{(i)} - \bar{x}]/s \right)$. Here, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\bar{x}$ and $s$ are the mean and standard deviation of the data values. The p-value is computed from the modified statistic $Z = W \,(1 + 0.5/n)$ according to Table 4.9 in Stephens [(1986a)].
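As a sanity check on the formula above, the Cramer-von Mises statistic can be computed in a few lines of base R. The comparison with nortest::cvm.test is an assumption based on the wrapping described above.

# Minimal sketch: Cramer-von Mises statistic computed by hand
set.seed(47)
dat <- rnorm(25, mean = 10, sd = 2)

x <- sort(dat)
n <- length(x)
p <- pnorm((x - mean(x)) / sd(x))
W <- 1/(12 * n) + sum((p - (2 * seq_len(n) - 1) / (2 * n))^2)
W
# nortest::cvm.test(dat)$statistic   # should match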
Lilliefors Goodness-of-Fit Test (test="lillie"
).
The Lilliefors goodness-of-fit test (Stephens, 1974; Dallal and Wilkinson, 1986; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="lillie"
, the function gofTest
calls the function
lillie.test
in the package nortest. Documentation from that
package is as follows:
The Lilliefors (Kolmogorov-Smirnov) test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is the maximal absolute difference between the empirical and hypothesized cumulative distribution functions. It may be computed as $D = \max\{D^{+}, D^{-}\}$ with

$$D^{+} = \max_{i=1,\ldots,n} \left\{ \frac{i}{n} - p_{(i)} \right\}, \;\;\;\;\;\; D^{-} = \max_{i=1,\ldots,n} \left\{ p_{(i)} - \frac{i-1}{n} \right\}$$

where $p_{(i)} = \Phi\!\left( [x_{(i)} - \bar{x}]/s \right)$. Here, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\bar{x}$ and $s$ are the mean and standard deviation of the data values. The p-value is computed from the Dallal-Wilkinson (1986) formula, which is claimed to be only reliable when the p-value is smaller than 0.1. If the Dallal-Wilkinson p-value turns out to be greater than 0.1, then the p-value is computed from the distribution of the modified statistic $Z = D \left( \sqrt{n} - 0.01 + 0.85/\sqrt{n} \right)$; see Stephens (1974), the actual p-value formula being obtained by a simulation and approximation process.
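The same kind of hand computation works for the Lilliefors statistic; again, agreement with nortest::lillie.test is the expectation implied by the wrapper described above rather than something demonstrated here.

# Minimal sketch: Lilliefors D statistic computed by hand
set.seed(47)
dat <- rnorm(25, mean = 10, sd = 2)

x <- sort(dat)
n <- length(x)
p <- pnorm((x - mean(x)) / sd(x))
D.plus  <- max(seq_len(n) / n - p)
D.minus <- max(p - (seq_len(n) - 1) / n)
D <- max(D.plus, D.minus)
D
# nortest::lillie.test(dat)$statistic   # should match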
Zero-Skew Goodness-of-Fit Test (test="skew"
).
The Zero-skew goodness-of-fit test (D'Agostino, 1970) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="skew"
, the function gofTest
tests the null hypothesis
that the skew of the distribution is 0:
where
and the quantity denotes the
'th moment about the mean
(also called the
'th central moment). The quantity
is called the coefficient of skewness, and is estimated by:
where
denotes the 'th sample central moment.
The possible alternative hypotheses are:
which correspond to alternative="two-sided"
, alternative="less"
, and alternative="greater"
, respectively.
To test the null hypothesis of zero skew, D'Agostino (1970) derived an
approximation to the distribution of under the null hypothesis of
zero-skew, assuming the observations comprise a random sample from a normal
(Gaussian) distribution. Based on D'Agostino's approximation, the statistic
shown below is assumed to follow a standard normal distribution and is
used to compute the p-value associated with the test of
:
where
When the sample size is at least 150, a simpler approximation may be
used in which
in Equation (61) is assumed to follow a standard normal
distribution and is used to compute the p-value associated with the hypothesis
test.
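The sample skewness coefficient at the heart of this test is easy to compute directly. The sketch below uses the published D'Agostino (1970) normalizing transformation; whether gofTest uses exactly these intermediate quantities is an assumption, so treat this as an illustration of the idea rather than a re-implementation of the function.

# Minimal sketch: sample skewness and D'Agostino's (1970) transformation to normality.
# Follows the published formulas; not guaranteed to match gofTest() line for line.
set.seed(47)
x <- rnorm(50)
n <- length(x)

m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
sqrt.b1 <- m3 / m2^(3/2)                     # sample coefficient of skewness

Y  <- sqrt.b1 * sqrt((n + 1) * (n + 3) / (6 * (n - 2)))
b2 <- 3 * (n^2 + 27 * n - 70) * (n + 1) * (n + 3) /
      ((n - 2) * (n + 5) * (n + 7) * (n + 9))
W2 <- -1 + sqrt(2 * (b2 - 1))
delta <- 1 / sqrt(log(sqrt(W2)))
alpha <- sqrt(2 / (W2 - 1))
Z <- delta * log(Y / alpha + sqrt((Y / alpha)^2 + 1))

2 * pnorm(-abs(Z))                           # two-sided p-value under H0: zero skew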
Kolmogorov-Smirnov Goodness-of-Fit Test (test="ks"
).
When test="ks"
, the function gofTest
calls the R function
ks.test
to compute the test statistic and p-value. Note that for the one-sample case, the distribution parameters should be pre-specified and not estimated from the data. If the distribution parameters are estimated from the data, you will receive a warning that in this case the test is very conservative (Type I error smaller than assumed; high Type II error).
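For instance, a one-sample Kolmogorov-Smirnov test against a fully specified gamma cdf can be run directly with base R's ks.test; the corresponding gofTest call (shown in the examples further below) is expected to give the same statistic.

# Minimal sketch: one-sample K-S test against a fully specified gamma distribution
set.seed(47)
dat <- rgamma(20, shape = 2, scale = 3)
ks.test(dat, "pgamma", shape = 2, scale = 3)
# gofTest(dat, test = "ks", distribution = "gamma",
#         param.list = list(shape = 2, scale = 3))   # equivalent call (see examples below)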
ProUCL Kolmogorov-Smirnov Goodness-of-Fit Test for Gamma (test="proucl.ks.gamma"
).
When test="proucl.ks.gamma"
, the function gofTest
calls the R function
ks.test
to compute the Kolmogorov-Smirnov test statistic based on the
maximum likelihood estimates of the shape and scale parameters (see egamma
).
The p-value is computed based on the simulated critical values given in
ProUCL.Crit.Vals.for.KS.Test.for.Gamma.array
(USEPA, 2015).
The sample size must be between 5 and 1000, and the value of the maximum likelihood
estimate of the shape parameter must be between 0.025 and 50. The critical value
for the test statistic is computed using the simulated critical values and
linear interpolation.
ProUCL Anderson-Darling Goodness-of-Fit Test for Gamma (test="proucl.ad.gamma"
).
When test="proucl.ad.gamma"
, the function gofTest
computes the
Anderson-Darling test statistic (Stephens, 1986a, p.101) based on the
maximum likelihood estimates of the shape and scale parameters (see egamma
).
The p-value is computed based on the simulated critical values given in
ProUCL.Crit.Vals.for.AD.Test.for.Gamma.array
(USEPA, 2015).
The sample size must be between 5 and 1000, and the value of the maximum likelihood
estimate of the shape parameter must be between 0.025 and 50. The critical value
for the test statistic is computed using the simulated critical values and
linear interpolation.
Chi-Squared Goodness-of-Fit Test (test="chisq"
).
The method used by gofTest
is a modification of what is used for chisq.test
.
If the hypothesized distribution function is completely specified, the degrees of freedom are $m - 1$, where $m$ denotes the number of classes. If any parameters are estimated, the degrees of freedom depend on the method of estimation. The function gofTest follows the convention of computing degrees of freedom as $m - 1 - k$, where $k$ is the number of parameters estimated. It can be shown that if the parameters are estimated by maximum likelihood, the degrees of freedom are bounded between $m - 1 - k$ and $m - 1$. Therefore, especially when the sample size is small, it is important to compare the test statistic to the chi-squared distribution with both $m - 1 - k$ and $m - 1$ degrees of freedom. See Kendall and Stuart (1991, Chapter 30) for a more complete discussion.
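A quick way to act on this advice is to recompute the p-value at both degrees-of-freedom bounds. The statistic and class count below are borrowed from the chi-squared example later in this help file (statistic = 1.2, m = 4 classes, k = 2 estimated parameters) purely for illustration.

# Minimal sketch: compare the chi-square GOF statistic to both df bounds
stat <- 1.2
m <- 4
k <- 2
pchisq(stat, df = m - 1 - k, lower.tail = FALSE)   # df = 1 (parameters estimated)
pchisq(stat, df = m - 1,     lower.tail = FALSE)   # df = 3 (completely specified)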
The distribution theory of chi-square statistics is a large sample theory.
The expected cell counts are assumed to be at least moderately large.
As a rule of thumb, each expected cell count should be at least 5. Although authors have found this rule
to be conservative (especially when the class probabilities are not too different
from each other), the user should regard p-values with caution when expected cell
counts are small.
Wilk-Shapiro Goodness-of-Fit Test for Uniform [0, 1] Distribution (test="ws"
).
Wilk and Shapiro (1968) suggested this test in the context of jointly testing several independent samples for normality simultaneously. If $P_1, P_2, \ldots, P_n$ denote the p-values associated with the tests for normality of $n$ independent samples, then under the null hypothesis that all $n$ samples come from a normal distribution, the p-values are a random sample of $n$ observations from a Uniform [0,1] distribution, that is, a Uniform distribution with minimum 0 and maximum 1. Wilk and Shapiro (1968) suggested two different methods for testing whether the p-values come from a Uniform [0, 1] distribution (a short numerical sketch of both statistics follows these descriptions):

Test Based on Normal Scores. Under the null hypothesis, the normal scores

$$G_i = \Phi^{-1}(P_i), \;\;\;\;\;\; i = 1, 2, \ldots, n$$

are a random sample of $n$ observations from a standard normal distribution. Wilk and Shapiro (1968) denote the $i$'th normal score by $G_i$ and note that under the null hypothesis, the quantity $G$ defined as

$$G = \frac{1}{\sqrt{n}} \sum_{i=1}^n G_i$$

has a standard normal distribution. Wilk and Shapiro (1968) were interested in the alternative hypothesis that some of the $n$ independent samples did not come from a normal distribution and hence would be associated with smaller p-values than expected under the null hypothesis, which translates to the alternative that the cdf for the distribution of the p-values is greater than the cdf of a Uniform [0, 1] distribution (alternative="greater"). In terms of the test statistic $G$, this alternative hypothesis would tend to make $G$ smaller than expected, so the p-value is given by $p = \Phi(G)$. For the one-sided lower alternative that the cdf for the distribution of p-values is less than the cdf for a Uniform [0, 1] distribution, the p-value is given by $p = 1 - \Phi(G)$.
Test Based on Chi-Square Scores. Under the null hypothesis, the chi-square scores

$$C_i = -2 \log(P_i), \;\;\;\;\;\; i = 1, 2, \ldots, n$$

are a random sample of $n$ observations from a chi-square distribution with 2 degrees of freedom (Fisher, 1950). Wilk and Shapiro (1968) denote the $i$'th chi-square score by $C_i$ and note that under the null hypothesis, the quantity $C$ defined as

$$C = \sum_{i=1}^n C_i$$

has a chi-square distribution with $2n$ degrees of freedom. Wilk and Shapiro (1968) were interested in the alternative hypothesis that some of the $n$ independent samples did not come from a normal distribution and hence would be associated with smaller p-values than expected under the null hypothesis, which translates to the alternative that the cdf for the distribution of the p-values is greater than the cdf of a Uniform [0, 1] distribution (alternative="greater"). In terms of the test statistic $C$, this alternative hypothesis would tend to make $C$ larger than expected, so the p-value is given by

$$p = 1 - F_{2n}(C)$$

where $F_{\nu}$ denotes the cumulative distribution function of the chi-square distribution with $\nu$ degrees of freedom. For the one-sided lower alternative that the cdf for the distribution of p-values is less than the cdf for a Uniform [0, 1] distribution, the p-value is given by $p = F_{2n}(C)$.
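For concreteness, both combination statistics can be computed from a vector of per-sample p-values in a couple of lines; the p-values below are made up purely for illustration, and the symbols G and C follow the notation used above.

# Minimal sketch of the two Wilk-Shapiro combination statistics, computed from
# hypothetical p-values for 5 independent normality tests (values are made up).
pvals <- c(0.82, 0.45, 0.10, 0.03, 0.66)
k <- length(pvals)

G <- sum(qnorm(pvals)) / sqrt(k)                     # normal-scores statistic
p.G <- pnorm(G)                                      # p-value, alternative = "greater"

C <- -2 * sum(log(pvals))                            # chi-square-scores (Fisher) statistic
p.C <- pchisq(C, df = 2 * k, lower.tail = FALSE)     # p-value, alternative = "greater"

c(G = G, p.G = p.G, C = C, p.C = p.C)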
a list of class "gof"
containing the results of the goodness-of-fit test, unless
the two-sample
Kolmogorov-Smirnov test is used, in which case the value is a list of
class "gofTwoSample"
. Objects of class "gof"
and "gofTwoSample"
have special printing and plotting methods. See the help files for gof.object
and gofTwoSample.object
for details.
The Shapiro-Wilk test (Shapiro and Wilk, 1965) and the Shapiro-Francia test (Shapiro and Francia, 1972) are probably the two most commonly used hypothesis tests to test departures from normality. The Shapiro-Wilk test is most powerful at detecting short-tailed (platykurtic) and skewed distributions, and least powerful against symmetric, moderately long-tailed (leptokurtic) distributions. Conversely, the Shapiro-Francia test is more powerful against symmetric long-tailed distributions and less powerful against short-tailed distributions (Royston, 1992b; 1993). In general, the Shapiro-Wilk and Shapiro-Francia tests outperform the Anderson-Darling test, which in turn outperforms the Cramer-von Mises test, which in turn outperforms the Lilliefors test (Stephens, 1986a; Razali and Wah, 2011; Romao et al., 2010).
The zero-skew goodness-of-fit test for normality is one of several tests that have
been proposed to test the assumption of a normal distribution (D'Agostino, 1986b).
This test has been included mainly because it is called by elnorm3
.
Usually, the Shapiro-Wilk or Shapiro-Francia test is preferred to this test, unless
the direction of the alternative to normality (e.g., positive skew) is known
(D'Agostino, 1986b, pp. 405–406).
Kolmogorov (1933) introduced a goodness-of-fit test to test the hypothesis that a random sample of $n$ observations $\underline{x}$ comes from a specific hypothesized distribution with cumulative distribution function $H$. This test is now usually called the one-sample Kolmogorov-Smirnov goodness-of-fit test. Smirnov (1939) introduced a goodness-of-fit test to test the hypothesis that a random sample of $n$ observations $\underline{x}$ comes from the same distribution as a random sample of $m$ observations $\underline{y}$. This test is now usually called the two-sample Kolmogorov-Smirnov goodness-of-fit test. Both tests are based on the maximum
vertical distance between two cumulative distribution functions. For the one-sample problem
with a small sample size, the Kolmogorov-Smirnov test may be preferred over the chi-squared
goodness-of-fit test since the KS-test is exact, while the chi-squared test is based on
an asymptotic approximation.
The chi-squared test, introduced by Pearson in 1900, is the oldest and best known goodness-of-fit test. The idea is to reduce the goodness-of-fit problem to a multinomial setting by comparing the observed cell counts with their expected values under the null hypothesis. Grouping the data sacrifices information, especially if the hypothesized distribution is continuous. On the other hand, chi-squared tests can be applied to any type of variable: continuous, discrete, or a combination of these.
The Wilk-Shapiro (1968) tests for a Uniform [0, 1] distribution were introduced in the context
of testing whether several independent samples all come from normal distributions, with
possibly different means and variances. The function gofGroupTest
extends
this idea to allow you to test whether several independent samples come from the same
distribution (e.g., gamma, extreme value, etc.), with possibly different parameters.
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlot
).
Steven P. Millard ([email protected])
Juergen Gross and Uwe Ligges (authors of the Anderson-Darling, Cramer-von Mises, and Lilliefors tests called from the package nortest).
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of g1.
Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality.
Empirical Results for the Distributions of b2 and √b1.
Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of √b1.
Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.
Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell' Istituto Italiano degle Attuari 4, 83-91.
Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). doi:10.18637/jss.v008.i18.
Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, pp.63-95.
Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.
Razali, N.M., and Y.B. Wah. (2011). Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors, and Anderson-Darling Tests. Journal of Statistical Modeling and Analytics 2(1), 21–33.
Romao, X., Delgado, R., and A. Costa. (2010). An Empirical Power Comparison of Univariate Goodness-of-Fit Tests for Normality. Journal of Statistical Computation and Simulation 80(5), 545–591.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.
Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.
Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.
Stephens, M.A. (1974). EDF Statistics for Goodness of Fit and Some Comparisons. Journal of the American Statistical Association 69, 730-737.
Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
Thode Jr., H.C. (2002). Testing for Normality. Marcel Dekker, New York.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
rosnerTest
, gof.object
, print.gof
,
plot.gof
,
shapiro.test
, ks.test
, chisq.test
,
Normal, Lognormal, Lognormal3,
Zero-Modified Normal, Zero-Modified Lognormal (Delta),
enorm
, elnorm
, elnormAlt
,
elnorm3
, ezmnorm
, ezmlnorm
,
ezmlnormAlt
, qqPlot
.
# Generate 20 observations from a gamma distribution with # parameters shape = 2 and scale = 3 then run various # goodness-of-fit tests. # (Note: the call to set.seed lets you reproduce this example.) set.seed(47) dat <- rgamma(20, shape = 2, scale = 3) # Shapiro-Wilk generalized goodness-of-fit test #---------------------------------------------- gof.list <- gofTest(dat, distribution = "gamma") gof.list #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Test Statistic: W = 0.9834958 # #Test Statistic Parameter: n = 20 # #P-value: 0.970903 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. dev.new() plot(gof.list) #---------- # Redo the example above, but use the bias-corrected mle gofTest(dat, distribution = "gamma", est.arg.list = list(method = "bcmle")) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 1.656376 # scale = 4.676680 # #Estimation Method: bcmle # #Data: dat # #Sample Size: 20 # #Test Statistic: W = 0.9834346 # #Test Statistic Parameter: n = 20 # #P-value: 0.9704046 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. #---------- # Komogorov-Smirnov goodness-of-fit test (pre-specified parameters) #------------------------------------------------------------------ gofTest(dat, test = "ks", distribution = "gamma", param.list = list(shape = 2, scale = 3)) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Kolmogorov-Smirnov GOF # #Hypothesized Distribution: Gamma(shape = 2, scale = 3) # #Data: dat # #Sample Size: 20 # #Test Statistic: ks = 0.2313878 # #Test Statistic Parameter: n = 20 # #P-value: 0.2005083 # #Alternative Hypothesis: True cdf does not equal the # Gamma(shape = 2, scale = 3) # Distribution. #---------- # ProUCL Version of Komogorov-Smirnov goodness-of-fit test # for a Gamma Distribution (estimated parameters) #--------------------------------------------------------- gofTest(dat, test = "proucl.ks.gamma", distribution = "gamma") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: ProUCL Kolmogorov-Smirnov Gamma GOF # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: MLE # #Data: dat # #Sample Size: 20 # #Test Statistic: D = 0.0988692 # #Test Statistic Parameter: n = 20 # #Critical Values: D.0.01 = 0.228 # D.0.05 = 0.196 # D.0.10 = 0.180 # #P-value: >= 0.10 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. #---------- # Chi-squared goodness-of-fit test (estimated parameters) #-------------------------------------------------------- gofTest(dat, test = "chisq", distribution = "gamma", n.classes = 4) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Chi-square GOF # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Test Statistic: Chi-square = 1.2 # #Test Statistic Parameter: df = 1 # #P-value: 0.2733217 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. 
#---------- # Clean up rm(dat, gof.list) graphics.off() #-------------------------------------------------------------------- # Example 10-2 of USEPA (2009, page 10-14) gives an example of # using the Shapiro-Wilk test to test the assumption of normality # for nickel concentrations (ppb) in groundwater collected over # 4 years. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #4 8 Well.1 56.0 #5 10 Well.1 8.7 #6 1 Well.2 19.0 #7 3 Well.2 81.5 #8 6 Well.2 331.0 #9 8 Well.2 14.0 #10 10 Well.2 64.4 #11 1 Well.3 39.0 #12 3 Well.3 151.0 #13 6 Well.3 27.0 #14 8 Well.3 21.4 #15 10 Well.3 578.0 #16 1 Well.4 3.1 #17 3 Well.4 942.0 #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Test for a normal distribution: #-------------------------------- gof.list <- gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df) gof.list #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Normal # #Estimated Parameter(s): mean = 169.5250 # sd = 259.7175 # #Estimation Method: mvue # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Statistic: W = 0.6788888 # #Test Statistic Parameter: n = 20 # #P-value: 2.17927e-05 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. dev.new() plot(gof.list) #---------- # Test for a lognormal distribution: #----------------------------------- gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df, dist = "lnorm") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): meanlog = 3.918529 # sdlog = 1.801404 # #Estimation Method: mvue # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Statistic: W = 0.978946 # #Test Statistic Parameter: n = 20 # #P-value: 0.9197735 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. #---------- # Test for a lognormal distribution, but use the # Mean and CV parameterization: #----------------------------------------------- gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df, dist = "lnormAlt") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): mean = 213.415628 # cv = 2.809377 # #Estimation Method: mvue # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Statistic: W = 0.978946 # #Test Statistic Parameter: n = 20 # #P-value: 0.9197735 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. #---------- # Clean up rm(gof.list) graphics.off() #--------------------------------------------------------------------------- # Generate 20 observations from a normal distribution with mean=3 and sd=2, and # generate 10 observaions from a normal distribution with mean=2 and sd=2 then # test whether these sets of observations come from the same distribution. # (Note: the call to set.seed simply allows you to reproduce this example.) 
set.seed(300) dat1 <- rnorm(20, mean = 3, sd = 2) dat2 <- rnorm(10, mean = 1, sd = 2) gofTest(x = dat1, y = dat2, test = "ks") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: 2-Sample K-S GOF # #Hypothesized Distribution: Equal # #Data: x = dat1 # y = dat2 # #Sample Sizes: n.x = 20 # n.y = 10 # #Test Statistic: ks = 0.7 # #Test Statistic Parameters: n = 20 # m = 10 # #P-value: 0.001669561 # #Alternative Hypothesis: The cdf of 'dat1' does not equal # the cdf of 'dat2'. #---------- # Clean up rm(dat1, dat2)
Perform a goodness-of-fit test to determine whether a data set appears to come from a normal distribution, lognormal distribution, or lognormal distribution (alternative parameterization) based on a sample of data that has been subjected to Type I or Type II censoring.
gofTestCensored(x, censored, censoring.side = "left", test = "sf", distribution = "norm", est.arg.list = NULL, prob.method = "hirsch-stedinger", plot.pos.con = 0.375, keep.data = TRUE, data.name = NULL, censoring.name = NULL)
gofTestCensored(x, censored, censoring.side = "left", test = "sf", distribution = "norm", est.arg.list = NULL, prob.method = "hirsch-stedinger", plot.pos.con = 0.375, keep.data = TRUE, data.name = NULL, censoring.name = NULL)
x |
numeric vector of observations.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
test |
character string defining which goodness-of-fit test to perform. Possible values are:
The Shapiro-Wilk test is only available for singly censored data. See the DETAILS section for more information. |
distribution |
a character string denoting the abbreviation of the assumed distribution.
Only continuous distributions are allowed. See the help file for
The results for the goodness-of-fit test are
identical for Also, the results for the goodness-of-fit test are
identical for |
est.arg.list |
a list of arguments to be passed to the function estimating the distribution
parameters. For example, if |
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities) when The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant to use when |
keep.data |
logical scalar indicating whether to return the original data. The
default value is |
data.name |
optional character string indicating the name for the data used for argument |
censoring.name |
optional character string indicating the name for the data used for argument |
Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a vector of $N$ observations from some distribution with cdf $F$. Suppose we want to test the null hypothesis that $F$ is the cdf of a normal (Gaussian) distribution with some arbitrary mean $\mu$ and standard deviation $\sigma$ against the alternative hypothesis that $F$ is the cdf of some other distribution. The table below shows the random variable for which $F$ is the assumed cdf, given the value of the argument distribution.

| Value of distribution | Distribution Name | Random Variable for which $F$ is the cdf |
|---|---|---|
| "norm" | Normal | $X$ |
| "lnorm" | Lognormal (Log-space) | $\log(X)$ |
| "lnormAlt" | Lognormal (Untransformed) | $\log(X)$ |
Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c = N - n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \;\;\;\;\;\; k \ge 1$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^k c_j = c$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.

Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.

Note that for singly left-censored data:

$$x_{(1)} = x_{(2)} = \cdots = x_{(c)} = T$$

and for singly right-censored data:

$$x_{(n+1)} = x_{(n+2)} = \cdots = x_{(N)} = T$$

Finally, let $\Omega$ (omega) denote the set of $n$ subscripts in the “ordered” sample that correspond to uncensored observations.
Shapiro-Wilk Goodness-of-Fit Test for Singly Censored Data (test="sw"
)
Equation (8) in the help file for gofTest
shows that for the case of
complete ordered data , the Shapiro-Wilk
-statistic is the same as
the square of the sample product-moment correlation between the vectors
and
:
where
and is defined by:
where denotes the transpose operator, and
is the vector
of expected values and
is the variance-covariance matrix of the order
statistics of a random sample of size
from a standard normal distribution.
That is, the values of
are the expected values of the standard
normal order statistics weighted by their variance-covariance matrix, and
normalized so that
Computing Shapiro-Wilk W-Statistic for Singly Censored Data
For the case of singly censored data, following Smith and Bain (1976) and
Verrill and Johnson (1988), Royston (1993) generalizes the Shapiro-Wilk
-statistic to:
where for left singly-censored data:
and for right singly-censored data:
Just like the function gofTest
,
when test="sw"
, the function gofTestCensored
uses Royston's (1992a)
approximation for the coefficients (see the help file for
gofTest
).
Computing P-Values for the Shapiro-Wilk Test

Verrill and Johnson (1988) show that the asymptotic distribution of the statistic shown above is normal, but the rate of convergence is “surprisingly slow” even for complete samples. They provide a table of empirical percentiles of the distribution for the $W$-statistic shown above for several sample sizes and percentages of censoring. Based on the tables given in Verrill and Johnson (1988), Royston (1993) approximated the 90'th, 95'th, and 99'th percentiles of the distribution of the z-statistic computed from the $W$-statistic. (The distribution of this z-statistic is assumed to be normal, but not necessarily a standard normal.) Denote these percentiles by $z_{0.90}$, $z_{0.95}$, and $z_{0.99}$. The true mean and standard deviation of the z-statistic are estimated by the intercept and slope, respectively, from the linear regression of $z_{\alpha}$ on $\Phi^{-1}(\alpha)$ for $\alpha$ = 0.9, 0.95, and 0.99, where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. The p-value associated with this test is then computed as:

$$p = 1 - \Phi\!\left( \frac{z - \hat{\mu}_z}{\hat{\sigma}_z} \right)$$

Note: Verrill and Johnson (1988) produced their tables based on Type II censoring. Royston's (1993) approximation to the p-value of these tests, however, should be fairly accurate for Type I censored data as well.
Testing Goodness-of-Fit for Any Continuous Distribution

The function gofTestCensored extends the Shapiro-Wilk test that accounts for censoring to test for goodness-of-fit for any continuous distribution by using the idea of Chen and Balakrishnan (1995), who proposed a general purpose approximate goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality. The function gofTestCensored modifies the approach of Chen and Balakrishnan (1995) by using the same first 2 steps, and then applying the Shapiro-Wilk test that accounts for censoring:

1. Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the vector of $N$ ordered observations, ignoring censoring status. Compute cumulative probabilities for each $x_{(i)}$ based on the cumulative distribution function for the hypothesized distribution. That is, compute $p_i = F(x_{(i)}, \hat{\theta})$, where $F(x, \theta)$ denotes the hypothesized cumulative distribution function with parameter(s) $\theta$, and $\hat{\theta}$ denotes the estimated parameter(s) using an estimation method that accounts for censoring (e.g., assuming a Gamma distribution with alternative parameterization, call the function egammaAltCensored).

2. Compute standard normal deviates based on the computed cumulative probabilities: $y_{(i)} = \Phi^{-1}(p_i)$.

3. Perform the Shapiro-Wilk goodness-of-fit test (that accounts for censoring) on the $y_{(i)}$'s.
Shapiro-Francia Goodness-of-Fit Test (test="sf"
)
Equation (15) in the help file for gofTest
shows that for the complete
ordered data , the Shapiro-Francia
-statistic is the
same as the squared Pearson correlation coefficient associated with a normal
probability plot.
Computing Shapiro-Francia W'-Statistic for Censored Data
For the case of singly censored data, following Smith and Bain (1976) and
Verrill and Johnson (1988), Royston (1993) extends the computation of the
Weisberg-Bingham Approximation to the -statistic to the case of singly
censored data:
where for left singly-censored data:
and for right singly-censored data:
and is defined as:
where
and denotes the standard normal cdf. Note: Do not confuse the elements
of the vector
with the scalar
which denotes the number
of censored observations. We use
here to be consistent with the
notation in the help file for
gofTest
.
Just like the function gofTest
,
when test="sf"
, the function gofTestCensored
uses Royston's (1992a)
approximation for the coefficients (see the help file for
gofTest
).
In general, the Shapiro-Francia test statistic can be extended to multiply
censored data using Equation (14) with defined as
the orderd values of
associated with uncensored observations, and
defined as the ordered values of
associated with uncensored observations:
and where the plotting positions in Equation (20) are replaced with any of the
plotting positions available in ppointsCensored
(see the description for the argument prob.method
).
Computing P-Values for the Shapiro-Francia Test
Verrill and Johnson (1988) show that the asymptotic distribution of the statistic
in Equation (14) above is normal, but the rate of convergence is
“surprisingly slow” even for complete samples. They provide a table of
empirical percentiles of the distribution for the -statistic shown
in Equation (14) above for several sample sizes and percentages of censoring.
As for the Shapiro-Wilk test, based on the tables given in Verrill and Johnson (1988), Royston (1993) approximated the 90'th, 95'th, and 99'th percentiles of the distribution of the z-statistic computed from the $W'$-statistic. (The distribution of this z-statistic is assumed to be normal, but not necessarily a standard normal.) Denote these percentiles by $z_{0.90}$, $z_{0.95}$, and $z_{0.99}$. The true mean and standard deviation of the z-statistic are estimated by the intercept and slope, respectively, from the linear regression of $z_p$ on $\Phi^{-1}(p)$ for $p$ = 0.9, 0.95, and 0.99, where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. The p-value associated with this test is then computed as:
$p = 1 - \Phi[(z - \hat{\mu}_z) / \hat{\sigma}_z]$
where $\hat{\mu}_z$ and $\hat{\sigma}_z$ denote the estimated mean and standard deviation of the z-statistic.
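A minimal sketch of this percentile-regression idea in base R; the tabulated percentile values and the observed z-statistic below are made up for illustration and are not Royston's values.

```r
# Sketch: estimate the mean and sd of the (assumed normal) z-statistic from three
# tabulated percentiles via linear regression, then compute an approximate p-value.
p <- c(0.90, 0.95, 0.99)
z.perc <- c(1.5, 2.0, 2.9)          # hypothetical tabulated percentiles of z
fit <- lm(z.perc ~ qnorm(p))        # intercept estimates the mean, slope the sd
mu.z <- coef(fit)[1]
sd.z <- coef(fit)[2]
z.obs <- 2.2                        # hypothetical observed z-statistic
1 - pnorm((z.obs - mu.z) / sd.z)    # approximate p-value
```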
Note: Verrill and Johnson (1988) produced their tables based on Type II censoring.
Royston's (1993) approximation to the p-value of these tests, however, should be
fairly accurate for Type I censored data as well, although this is an area that
requires further investigation.
Testing Goodness-of-Fit for Any Continuous Distribution
The function gofTestCensored
extends the Shapiro-Francia test that
accounts for censoring to test for goodness-of-fit for any continuous
distribution by using the idea of Chen and Balakrishnan (1995),
who proposed a general purpose approximate goodness-of-fit test based on
the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality.
The function gofTestCensored
modifies the approach of
Chen and Balakrishnan (1995) by using the same first 2 steps, and then
applying the Shapiro-Francia test that accounts for censoring:
1. Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote the vector of $n$ ordered observations, ignoring censoring status.
2. Compute cumulative probabilities for each $x_i$ based on the cumulative distribution function for the hypothesized distribution. That is, compute $p_i = F(x_i, \hat{\theta})$, where $F(x, \theta)$ denotes the hypothesized cumulative distribution function with parameter(s) $\theta$, and $\hat{\theta}$ denotes the estimated parameter(s) using an estimation method that accounts for censoring (e.g., assuming a Gamma distribution with alternative parameterization, call the function egammaAltCensored).
3. Compute standard normal deviates based on the computed cumulative probabilities: $y_i = \Phi^{-1}(p_i)$.
4. Perform the Shapiro-Francia goodness-of-fit test (that accounts for censoring) on the $y_i$'s.
Probability Plot Correlation Coefficient (PPCC) Goodness-of-Fit Test (test="ppcc"
)
The function gofTestCensored
computes the PPCC test statistic using Blom
plotting positions. It can be shown that the square of this statistic is equivalent
to the Weisberg-Bingham approximation to the Shapiro-Francia $W'$-test
(Weisberg and Bingham, 1975; Royston, 1993). Thus the PPCC goodness-of-fit test
is equivalent to the Shapiro-Francia goodness-of-fit test.
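For complete (uncensored) data the PPCC statistic is simply the correlation between the ordered observations and the normal quantiles evaluated at Blom plotting positions, as in this minimal sketch:

```r
# Sketch: PPCC for complete data using Blom plotting positions (i - 3/8)/(n + 1/4).
set.seed(2)
x <- rnorm(20)
pp <- ppoints(20, a = 3/8)      # Blom plotting positions
r  <- cor(sort(x), qnorm(pp))   # probability plot correlation coefficient
r^2                             # approximately the Shapiro-Francia W' statistic
```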
a list of class "gofCensored"
containing the results of the goodness-of-fit
test. See the help files for gofCensored.object
for details.
The Shapiro-Wilk test (Shapiro and Wilk, 1965) and the Shapiro-Francia test (Shapiro and Francia, 1972) are probably the two most commonly used hypothesis tests to test departures from normality. The Shapiro-Wilk test is most powerful at detecting short-tailed (platykurtic) and skewed distributions, and least powerful against symmetric, moderately long-tailed (leptokurtic) distributions. Conversely, the Shapiro-Francia test is more powerful against symmetric long-tailed distributions and less powerful against short-tailed distributions (Royston, 1992b; 1993).
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of g1.
Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality. Empirical Results for the Distributions of b2 and √b1. Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of √b1. Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
gofTest
, gofCensored.object
,
print.gofCensored
, plot.gofCensored
,
shapiro.test
, Normal, Lognormal,
enormCensored
, elnormCensored
,
elnormAltCensored
, qqPlotCensored
.
# Generate 30 observations from a gamma distribution with # parameters mean=10 and cv=1 and then censor observations less than 5. # Then test the hypothesis that these data came from a gamma # distribution using the Shapiro-Wilk test. # # The p-value for the complete data is p = 0.86, while # the p-value for the censored data is p = 0.52. # (Note: the call to set.seed lets you reproduce this example.) set.seed(598) dat <- sort(rgammaAlt(30, mean = 10, cv = 1)) dat # [1] 0.5313509 1.4741833 1.9936208 2.7980636 3.4509840 # [6] 3.7987348 4.5542952 5.5207531 5.5253596 5.7177872 #[11] 5.7513827 9.1086375 9.8444090 10.6247123 10.9304922 #[16] 11.7925398 13.3432689 13.9562777 14.6029065 15.0563342 #[21] 15.8730642 16.0039936 16.6910715 17.0288922 17.8507891 #[26] 19.1105522 20.2657141 26.3815970 30.2912797 42.8726101 dat.censored <- dat censored <- dat.censored < 5 dat.censored[censored] <- 5 # Results for complete data: #--------------------------- gofTest(dat, test = "sw", dist = "gammaAlt") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): mean = 12.4248552 # cv = 0.7901752 # #Estimation Method: MLE # #Data: dat # #Sample Size: 30 # #Test Statistic: W = 0.981471 # #Test Statistic Parameter: n = 30 # #P-value: 0.8631802 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. # Results for censored data: #--------------------------- gof.list <- gofTestCensored(dat.censored, censored, test = "sw", distribution = "gammaAlt") gof.list #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Wilk GOF # (Singly Censored Data) # Based on Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Censoring Side: left # #Censoring Level(s): 5 # #Estimated Parameter(s): mean = 12.4911448 # cv = 0.7617343 # #Estimation Method: MLE # #Data: dat.censored # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 23.3% # #Test Statistic: W = 0.9613711 # #Test Statistic Parameters: N = 30.0000000 # DELTA = 0.2333333 # #P-value: 0.522329 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. # Plot the results for the censored data #--------------------------------------- dev.new() plot(gof.list) #========== # Continue the above example, but now test the hypothesis that # these data came from a lognormal distribution # (alternative parameterization) using the Shapiro-Wilk test. # # The p-value for the complete data is p = 0.056, while # the p-value for the censored data is p = 0.11. # Results for complete data: #--------------------------- gofTest(dat, test = "sw", dist = "lnormAlt") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): mean = 13.757239 # cv = 1.148872 # #Estimation Method: mvue # #Data: dat # #Sample Size: 30 # #Test Statistic: W = 0.9322226 # #Test Statistic Parameter: n = 30 # #P-value: 0.05626823 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. 
# Results for censored data: #--------------------------- gof.list <- gofTestCensored(dat.censored, censored, test = "sw", distribution = "lnormAlt") gof.list #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Wilk GOF # (Singly Censored Data) # #Hypothesized Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 5 # #Estimated Parameter(s): mean = 13.0382221 # cv = 0.9129512 # #Estimation Method: MLE # #Data: dat.censored # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 23.3% # #Test Statistic: W = 0.9292406 # #Test Statistic Parameters: N = 30.0000000 # DELTA = 0.2333333 # #P-value: 0.114511 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. # Plot the results for the censored data #--------------------------------------- dev.new() plot(gof.list) #---------- # Redo the above example, but specify the quasi-minimum variance # unbiased estimator of the mean. Note that the method of # estimating the parameters has no effect on the goodness-of-fit # test (see the DETAILS section above). gofTestCensored(dat.censored, censored, test = "sw", distribution = "lnormAlt", est.arg.list = list(method = "qmvue")) #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Wilk GOF # (Singly Censored Data) # #Hypothesized Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 5 # #Estimated Parameter(s): mean = 12.8722749 # cv = 0.8712549 # #Estimation Method: Quasi-MVUE # #Data: dat.censored # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 23.3% # #Test Statistic: W = 0.9292406 # #Test Statistic Parameters: N = 30.0000000 # DELTA = 0.2333333 # #P-value: 0.114511 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. #---------- # Clean up rm(dat, dat.censored, censored, gof.list) graphics.off() #========== # Check the assumption that the silver data stored in Helsel.Cohn.88.silver.df # follows a lognormal distribution and plot the goodness-of-fit test results. # Note that the small p-value and the shape of the Q-Q plot # (an inverted S-shape) suggests that the log transformation is not quite strong # enough to "bring in" the tails (i.e., the log-transformed silver data has tails # that are slightly too long relative to a normal distribution). # Helsel and Cohn (1988, p.2002) note that the gross outlier of 560 mg/L tends to # make the shape of the data resemble a gamma distribution. dum.list <- with(Helsel.Cohn.88.silver.df, gofTestCensored(Ag, Censored, test = "sf", dist = "lnorm")) dum.list #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Francia GOF # (Multiply Censored Data) # #Hypothesized Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 0.1 0.2 0.3 0.5 1.0 2.0 2.5 5.0 # 6.0 10.0 20.0 25.0 # #Estimated Parameter(s): meanlog = -1.040572 # sdlog = 2.354847 # #Estimation Method: MLE # #Data: Ag # #Censoring Variable: Censored # #Sample Size: 56 # #Percent Censored: 60.7% # #Test Statistic: W = 0.8957198 # #Test Statistic Parameters: N = 56.0000000 # DELTA = 0.6071429 # #P-value: 0.03490314 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. 
dev.new() plot(dum.list) #---------- # Clean up #--------- rm(dum.list) graphics.off() #========== # Chapter 15 of USEPA (2009) gives several examples of looking # at normal Q-Q plots and estimating the mean and standard deviation # for manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will test whether the data appear to come from a normal # distribution, then we will test to see whether they appear to come # from a lognormal distribution. #-------------------------------------------------------------------- # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now test whether the data appear to come from # a normal distribution. Note that these data # are multiply censored, so we'll use the # Shapiro-Francia test. #---------------------------------------------- gof.normal <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf")) gof.normal #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Francia GOF # (Multiply Censored Data) # #Hypothesized Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 15.23508 # sd = 30.62812 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Test Statistic: W = 0.8368016 # #Test Statistic Parameters: N = 25.00 # DELTA = 0.24 # #P-value: 0.004662658 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. # Plot the results: #------------------ dev.new() plot(gof.normal) #---------- # Now test to see whether the data appear to come from # a lognormal distribuiton. #----------------------------------------------------- gof.lognormal <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf", distribution = "lnorm")) gof.lognormal #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Francia GOF # (Multiply Censored Data) # #Hypothesized Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Test Statistic: W = 0.9864426 # #Test Statistic Parameters: N = 25.00 # DELTA = 0.24 # #P-value: 0.9767731 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. # Plot the results: #------------------ dev.new() plot(gof.lognormal) #---------- # Clean up #--------- rm(gof.normal, gof.lognormal) graphics.off()
Objects of S3 class "gofTwoSample"
are returned by the EnvStats function
gofTest
when both the x
and y
arguments are supplied.
Objects of S3 class "gofTwoSample"
are lists that contain
information about the assumed distribution, the estimated or
user-supplied distribution parameters, and the test statistic
and p-value.
Required Components
The following components must be included in a legitimate list of
class "gofTwoSample"
.
distribution |
a character string with the value "Equal". |
statistic |
a numeric scalar with a names attribute containing the name and value of the goodness-of-fit statistic. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit test. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the statistic component. |
p.value |
numeric scalar containing the p-value associated with the goodness-of-fit statistic. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the goodness-of-fit test. |
data |
a list of length 2 containing the numeric vectors actually used for the goodness-of-fit test (i.e., the original data but with any missing or infinite values removed). |
data.name |
a character vector of length 2 indicating the name of the data
objects supplied in the x and y arguments. |
Optional Component
The following component is included when the arguments x
and/or y
contain missing (NA
), undefined (NaN
) and/or infinite
(Inf
, -Inf
) values.
bad.obs |
numeric vector of length 2 indicating the number of missing (NA), undefined (NaN) and/or infinite (Inf, -Inf) values removed from the x and y arguments prior to performing the goodness-of-fit test. |
Generic functions that have methods for objects of class
"gofTwoSample"
include: print
, plot
.
Since objects of class "gofTwoSample"
are lists, you may extract
their components with the $
and [[
operators.
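A minimal sketch of creating a "gofTwoSample" object and extracting individual components (component names as listed above):

```r
# Sketch: create a "gofTwoSample" object and extract components with $ and [[.
set.seed(47)
gof <- gofTest(x = rnorm(20), y = rnorm(10, mean = 1), test = "ks")
class(gof)         # "gofTwoSample"
gof$statistic      # the two-sample K-S statistic
gof$p.value        # associated p-value
gof[["method"]]    # name of the goodness-of-fit test
```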
Steven P. Millard ([email protected])
print.gofTwoSample
, plot.gofTwoSample
,
Goodness-of-Fit Tests.
# Create an object of class "gofTwoSample", then print it out. # Generate 20 observations from a normal distribution with mean=3 and sd=2, and # generate 10 observaions from a normal distribution with mean=2 and sd=2 then # test whether these sets of observations come from the same distribution. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(300) dat1 <- rnorm(20, mean = 3, sd = 2) dat2 <- rnorm(10, mean = 1, sd = 2) gofTest(x = dat1, y = dat2, test = "ks") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: 2-Sample K-S GOF # #Hypothesized Distribution: Equal # #Data: x = dat1 # y = dat2 # #Sample Sizes: n.x = 20 # n.y = 10 # #Test Statistic: ks = 0.7 # #Test Statistic Parameters: n = 20 # m = 10 # #P-value: 0.001669561 # #Alternative Hypothesis: The cdf of 'dat1' does not equal # the cdf of 'dat2'. #---------- # Clean up rm(dat1, dat2)
Generate a generalized pivotal quantity (GPQ) for a confidence interval for the mean of a Normal distribution based on singly or multiply censored data.
gpqCiNormSinglyCensored(n, n.cen, probs, nmc, method = "mle", censoring.side = "left", seed = NULL, names = TRUE)
gpqCiNormMultiplyCensored(n, cen.index, probs, nmc, method = "mle", censoring.side = "left", seed = NULL, names = TRUE)
n |
positive integer indicating the sample size. |
n.cen |
for the case of singly censored data, a positive integer indicating the number of
censored observations. The value of |
cen.index |
for the case of multiply censored data, a sorted vector of unique integers
indicating the indices of the censored observations when the observations are
“ordered”. The length of |
probs |
numeric vector of values between 0 and 1 indicating the confidence level(s) associated with the GPQ(s). |
nmc |
positive integer indicating the number of Monte Carlo trials to run. |
method |
character string indicating the method to use for parameter estimation. |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are "left" (the default) and "right". |
seed |
positive integer to pass to the function set.seed; specify this to reproduce results. |
names |
a logical scalar passed to |
The functions gpqCiNormSinglyCensored
and gpqCiNormMultiplyCensored
are called by enormCensored
when ci.method="gpq"
. They are
used to construct generalized pivotal quantities to create confidence intervals
for the mean of an assumed normal distribution.
This idea was introduced by Schmee et al. (1985) in the context of Type II singly
censored data. The function
gpqCiNormSinglyCensored
generates GPQs using a modification of
Algorithm 12.1 of Krishnamoorthy and Mathew (2009, p. 329). Algorithm 12.1 is
used to generate GPQs for a tolerance interval. The modified algorithm for
generating GPQs for confidence intervals for the mean is as follows:
1. Generate a random sample of $n$ observations from a standard normal (i.e., N(0,1)) distribution and let $z_{(1)}, z_{(2)}, \ldots, z_{(n)}$ denote the ordered (sorted) observations.
2. Set the smallest n.cen observations as censored.
3. Compute the estimates of $\mu$ and $\sigma$ by calling enormCensored using the method specified by the method argument, and denote these estimates as $\hat{\mu}^*$ and $\hat{\sigma}^*$.
4. Compute the t-like pivotal quantity $\hat{t} = \hat{\mu}^* / \hat{\sigma}^*$.
Repeat steps 1-4 nmc times to produce an empirical distribution of the t-like pivotal quantity.
A two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is then computed as:
$[\hat{\mu} - \hat{t}_{1-\alpha/2}\,\hat{\sigma}, \;\; \hat{\mu} - \hat{t}_{\alpha/2}\,\hat{\sigma}]$
where $\hat{t}_p$ denotes the $p$'th empirical quantile of the nmc generated values of $\hat{t}$.
Schmee et al. (1985) derived this method in the context of Type II singly censored data (for which these limits are exact within Monte Carlo error), but state that according to Regal (1982) this method produces confidence intervals that are close approximations to the correct limits for Type I censored data.
The function
gpqCiNormMultiplyCensored
is an extension of this idea to multiply censored
data. The algorithm is the same as for singly censored data, except
Step 2 changes to:
2. Set the observations whose positions in the ordered sample are given by the argument cen.index as censored.
The functions gpqCiNormSinglyCensored and gpqCiNormMultiplyCensored are computationally intensive and are provided so that users can create their own tables of GPQs.
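A minimal sketch of how the generated GPQ quantiles combine with censored-data estimates of the mean and standard deviation, assuming the interval has the form given above; in practice you would simply call enormCensored with ci=TRUE and ci.method="gpq". The data and the small value of nmc below are for illustration only.

```r
# Sketch: combining GPQ quantiles with censored-data estimates of the mean and sd.
# Assumed interval form: [mu.hat - t(0.975)*sigma.hat, mu.hat - t(0.025)*sigma.hat].
x <- c(8, 8, 8, 8.4, 9.1, 9.7, 10.2, 10.8, 11.3, 12.0, 12.6, 13.1, 14.0, 15.2, 16.8)
censored <- c(TRUE, TRUE, TRUE, rep(FALSE, 12))   # left singly-censored at 8
est <- enormCensored(x, censored, method = "mle")
mu.hat    <- unname(est$parameters["mean"])
sigma.hat <- unname(est$parameters["sd"])
gpq <- gpqCiNormSinglyCensored(n = 15, n.cen = 3, probs = c(0.025, 0.975),
  nmc = 100, seed = 47)                           # nmc kept small for speed
c(LCL = mu.hat - unname(gpq["97.5%"]) * sigma.hat,
  UCL = mu.hat - unname(gpq["2.5%"]) * sigma.hat)
```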
a numeric vector containing the GPQ(s).
Steven P. Millard ([email protected])
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Regal, R. (1982). Applying Order Statistic Censored Normal Confidence Intervals to Time Censored Data. Unpublished manuscript, University of Minnesota, Duluth, Department of Mathematical Sciences.
Schmee, J., D.Gladstein, and W. Nelson. (1985). Confidence Limits for Parameters of a Normal Distribution from Singly Censored Samples, Using Maximum Likelihood. Technometrics 27(2) 119–128.
enormCensored
, estimateCensored.object
.
# Reproduce the entries for n=10 observations with n.cen=6 in Table 4 # of Schmee et al. (1985, p.122). # # Notes: # 1. This table applies to right-censored data, and the # quantity "r" in this table refers to the number of # uncensored observations. # # 2. Passing a value for the argument "seed" simply allows # you to reproduce this example. # NOTE: Here to save computing time for the sake of example, we will specify # just 100 Monte Carlos, whereas Krishnamoorthy and Mathew (2009) # suggest *10,000* Monte Carlos. # Here are the values given in Schmee et al. (1985): Schmee.values <- c(-3.59, -2.60, -1.73, -0.24, 0.43, 0.58, 0.73) probs <- c(0.025, 0.05, 0.1, 0.5, 0.9, 0.95, 0.975) names(Schmee.values) <- paste(probs * 100, "%", sep = "") Schmee.values # 2.5% 5% 10% 50% 90% 95% 97.5% #-3.59 -2.60 -1.73 -0.24 0.43 0.58 0.73 gpqs <- gpqCiNormSinglyCensored(n = 10, n.cen = 6, probs = probs, nmc = 100, censoring.side = "right", seed = 529) round(gpqs, 2) # 2.5% 5% 10% 50% 90% 95% 97.5% #-2.46 -2.03 -1.38 -0.14 0.54 0.65 0.84 # This is what you get if you specify nmc = 1000 with the # same value for seed: #----------------------------------------------- # 2.5% 5% 10% 50% 90% 95% 97.5% #-3.50 -2.49 -1.67 -0.25 0.41 0.57 0.71 # Clean up #--------- rm(Schmee.values, probs, gpqs) #========== # Example of using gpqCiNormMultiplyCensored #------------------------------------------- # Consider the following set of multiply left-censored data: dat <- 12:16 censored <- c(TRUE, FALSE, TRUE, FALSE, FALSE) # Since the data are "ordered" we can identify the indices of the # censored observations in the ordered data as follow: cen.index <- (1:length(dat))[censored] cen.index #[1] 1 3 # Now we can generate a GPQ using gpqCiNormMultiplyCensored. # Here we'll generate a GPQs to use to create a # 95% confidence interval for left-censored data. # NOTE: Here to save computing time for the sake of example, we will specify # just 100 Monte Carlos, whereas Krishnamoorthy and Mathew (2009) # suggest *10,000* Monte Carlos. gpqCiNormMultiplyCensored(n = 5, cen.index = cen.index, probs = c(0.025, 0.975), nmc = 100, seed = 237) # 2.5% 97.5% #-1.315592 1.848513 #---------- # Clean up #--------- rm(dat, censored, cen.index)
Generate a generalized pivotal quantity (GPQ) for a tolerance interval for a Normal distribution based on singly or multiply censored data.
gpqTolIntNormSinglyCensored(n, n.cen, p, probs, nmc, method = "mle", censoring.side = "left", seed = NULL, names = TRUE)
gpqTolIntNormMultiplyCensored(n, cen.index, p, probs, nmc, method = "mle", censoring.side = "left", seed = NULL, names = TRUE)
n |
positive integer indicating the sample size. |
n.cen |
for the case of singly censored data, a positive integer indicating the number of
censored observations. The value of |
cen.index |
for the case of multiply censored data, a sorted vector of unique integers indicating the
indices of the censored observations when the observations are “ordered”.
The length of |
p |
numeric scalar strictly greater than 0 and strictly less than 1 indicating the quantile for which to generate the GPQ(s) (i.e., the coverage associated with a one-sided tolerance interval). |
probs |
numeric vector of values between 0 and 1 indicating the confidence level(s) associated with the GPQ(s). |
nmc |
positive integer indicating the number of Monte Carlo trials to run. |
method |
character string indicating the method to use for parameter estimation. |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right". |
seed |
positive integer to pass to the function set.seed; specify this to reproduce results. |
names |
a logical scalar passed to |
The function gpqTolIntNormSinglyCensored
generates GPQs as described in Algorithm 12.1
of Krishnamoorthy and Mathew (2009, p. 329). The function
gpqTolIntNormMultiplyCensored
is an extension of this idea to multiply censored data.
These functions are called by tolIntNormCensored
when ti.method="gpq"
,
and also by eqnormCensored
when ci=TRUE
and ci.method="gpq"
. See
the help files for these functions for an explanation of GPQs.
Note that technically these are only GPQs if the data are Type II censored. However, Krishnamoorthy and Mathew (2009, p. 328) state that in the case of Type I censored data these quantities should approximate the true GPQs and the results appear to be satisfactory, even for small sample sizes.
The functions gpqTolIntNormSinglyCensored and gpqTolIntNormMultiplyCensored are computationally intensive and are provided so that users can create their own tables of GPQs.
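As a rough sketch, assuming the upper tolerance limit has the usual form mu.hat + k * sigma.hat with k taken from the GPQ (in practice, call tolIntNormCensored with ti.method="gpq" directly); the data and the small value of nmc are illustrative only:

```r
# Sketch: using a GPQ quantile as the k-factor in an approximate upper tolerance
# limit, under the assumed form mu.hat + k * sigma.hat (illustrative data).
x <- c(5, 5, 6.2, 7.9, 8.4, 9.1, 10.3, 11.8, 13.0, 15.6)
censored <- c(TRUE, TRUE, rep(FALSE, 8))          # left singly-censored at 5
est <- enormCensored(x, censored, method = "mle")
k <- gpqTolIntNormSinglyCensored(n = 10, n.cen = 2, p = 0.90, probs = 0.95,
  nmc = 100, seed = 42)                           # nmc kept small for speed
unname(est$parameters["mean"] + k * est$parameters["sd"])  # ~90% coverage, 95% confidence
```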
a numeric vector containing the GPQ(s).
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
tolIntNormCensored
, eqnormCensored
,
enormCensored
, estimateCensored.object
.
# Reproduce the entries for n=10 observations with n.cen=1 in Table 12.2 # of Krishnamoorthy and Mathew (2009, p.331). # # (Note: passing a value for the argument "seed" simply allows you to # reproduce this example.) # # NOTE: Here to save computing time for the sake of example, we will specify # just 100 Monte Carlos, whereas Krishnamoorthy and Mathew (2009) # suggest *10,000* Monte Carlos. gpqTolIntNormSinglyCensored(n = 10, n.cen = 1, p = 0.05, probs = 0.05, nmc = 100, seed = 529) # 5% #-3.483403 gpqTolIntNormSinglyCensored(n = 10, n.cen = 1, p = 0.1, probs = 0.05, nmc = 100, seed = 497) # 5% #-2.66705 gpqTolIntNormSinglyCensored(n = 10, n.cen = 1, p = 0.9, probs = 0.95, nmc = 100, seed = 623) # 95% #2.478654 gpqTolIntNormSinglyCensored(n = 10, n.cen = 1, p = 0.95, probs = 0.95, nmc = 100, seed = 623) # 95% #3.108452 #========== # Example of using gpqTolIntNormMultiplyCensored #----------------------------------------------- # Consider the following set of multiply left-censored data: dat <- 12:16 censored <- c(TRUE, FALSE, TRUE, FALSE, FALSE) # Since the data are "ordered" we can identify the indices of the # censored observations in the ordered data as follow: cen.index <- (1:length(dat))[censored] cen.index #[1] 1 3 # Now we can generate a GPQ using gpqTolIntNormMultiplyCensored. # Here we'll generate a GPQ corresponding to an upper tolerance # interval with coverage 90% with 95% confidence for # left-censored data. # NOTE: Here to save computing time for the sake of example, we will specify # just 100 Monte Carlos, whereas Krishnamoorthy and Mathew (2009) # suggest *10,000* Monte Carlos. gpqTolIntNormMultiplyCensored(n = 5, cen.index = cen.index, p = 0.9, probs = 0.95, nmc = 100, seed = 237) # 95% #3.952052 #========== # Clean up #--------- rm(dat, censored, cen.index)
These data are the results of an experiment in which different groups of rats were exposed to different concentration levels of ethylene thiourea (ETU), which is a decomposition product of a certain class of fungicides that can be found in treated foods (Graham et al., 1975; Rodricks, 1992, p.133). In this experiment, the outcome of concern was the number of rats that developed thyroid tumors.
Graham.et.al.75.etu.df
A data frame with 6 observations on the following 4 variables.
dose
a numeric vector of dose (ppm/day) of ETU.
tumors
a numeric vector indicating number of rats that developed thyroid tumors.
n
a numeric vector indicating the number of rats in the dose group.
proportion
a numeric vector indicating proportion of rats that developed thyroid tumors.
Graham, S.L., K.J. Davis, W.H. Hansen, and C.H. Graham. (1975). Effects of Prolonged Ethylene Thiourea Ingestion on the Thyroid of the Rat. Food and Cosmetics Toxicology, 13(5), 493–499.
Rodricks, J.V. (1992). Calculated Risks: The Toxicity and Human Health Risks of Chemicals in Our Environment. Cambridge University Press, New York, p.133.
Adjusted alpha levels to compute confidence intervals for the mean of a gamma distribution, as presented in Table 2 of Grice and Bain (1980).
data("Grice.Bain.80.mat")
data("Grice.Bain.80.mat")
A matrix of dimensions 5 by 7, with the first dimension indicating the sample size (between 5 and Inf), and the second dimension indicating the assumed significance level associated with the confidence interval (between 0.005 and 0.25). The assumed confidence level is 1 - assumed significance level.
See Grice and Bain (1980) and the help file for egamma
for more information. The data in this matrix are used when
the function egamma
is called with ci.method="chisq.adj"
.
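For example, a minimal sketch (with simulated data) of the call that makes use of these adjusted alpha levels:

```r
# Sketch: egamma looks up the adjusted alpha levels in Grice.Bain.80.mat when
# ci.method = "chisq.adj" is requested (simulated data for illustration only).
set.seed(10)
x <- rgamma(20, shape = 3, scale = 2)
egamma(x, ci = TRUE, ci.method = "chisq.adj")
```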
Grice, J.V., and L.J. Bain. (1980). Inferences Concerning the Mean of the Gamma Distribution. Journal of the American Statistical Association 75, 929-933.
USEPA. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
# Look at Grice.Bain.80.mat Grice.Bain.80.mat # alpha.eq.005 alpha.eq.01 alpha.eq.025 alpha.eq.05 alpha.eq.075 #n.eq.5 0.0000 0.0000 0.0010 0.0086 0.0234 #n.eq.10 0.0003 0.0015 0.0086 0.0267 0.0486 #n.eq.20 0.0017 0.0046 0.0159 0.0380 0.0619 #n.eq.40 0.0030 0.0070 0.0203 0.0440 0.0685 #n.eq.Inf 0.0050 0.0100 0.0250 0.0500 0.0750 # alpha.eq.10 alpha.eq.25 #n.eq.5 0.0432 0.2038 #n.eq.10 0.0724 0.2294 #n.eq.20 0.0866 0.2403 #n.eq.40 0.0934 0.2453 #n.eq.Inf 0.1000 0.2500
Made up multiply left-censored data. There are 9 observations out of a total of 18
that are reported as <DL, where DL denotes a detection limit. There are
2 distinct detection limits.
Helsel.Cohn.88.app.b.df
A data frame with 18 observations on the following 3 variables.
Conc.orig
a character vector of original observations
Conc
a numeric vector of observations with censored values coded to censoring levels
Censored
a logical vector indicating which values are censored
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004, Appendix B.
Silver concentrations (mg/L) from an interlab comparison. There are 34 observations out of a total of 56 that are reported as <DL, where DL denotes a detection limit. There are 12 distinct detection limits.
Helsel.Cohn.88.silver.df
A data frame with 56 observations on the following 4 variables.
Ag.orig
a character vector of original silver concentrations (mg/L)
Ag
a numeric vector with nondetects coded to the detection limit
Censored
a logical vector indicating which observations are censored
log.Ag
the natural logarithm of Ag
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004.
Janzer, V.J. (1986). Report of the U.S. Geological Survey's Analytical Evaluation Program–Standard Reference Water Samples M6, M94, T95, N16, P8, and SED3. Technical Report, Branch of Quality Assurance, U.S. Geological Survey, Arvada, CO.
Counts of mayfly nymphs at low flow in 12 small streams. In each stream, counts were recorded above and below industrial outfalls.
data(Helsel.Hirsch.02.Mayfly.df)
A data frame with 24 observations on the following 3 variables.
Mayfly.Count
Number of mayfly nymphs counted
Stream
a factor indicating the stream number
Location
a factor indicating the location of the count (above vs. below)
Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources Research. Techniques of Water Resources Investigations, Book 4, Chapter A3. U.S. Geological Survey, 139–140. https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf.
Detailed abstract of the manuscript:
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the
Generalized Extreme-Value Distribution by the Method of Probability-Weighted
Moments. Technometrics 27(3), 251–261.
Abstract
Hosking et al. (1985) use the method of probability-weighted moments,
introduced by Greenwood et al. (1979), to estimate the parameters of the
generalized extreme value distribution (GEVD) with parameters
location=$\eta$, scale=$\theta$, and shape=$\kappa$. Hosking et al. (1985) derive the asymptotic
distributions of the probability-weighted moment estimators (PWME), and compare
the asymptotic and small-sample statistical properties (via computer simulation)
of the PWME with maximum likelihood estimators (MLE) and Jenkinson's (1969)
method of sextiles estimators (JSE). They also compare the statistical
properties of quantile estimators (which are based on the distribution parameter
estimators). Finally, they derive a test of the null hypothesis that the
shape parameter is zero, and assess its performance via computer simulation.
Hosking et al. (1985) note that when $\kappa \le -1$, the moments and
probability-weighted moments of the GEVD do not exist. They also note that in
practice the shape parameter usually lies between -1/2 and 1/2.
Hosking et al. (1985) found that the asymptotic efficiency of the PWME (the limit as the sample size approaches infinity of the ratio of the variance of the MLE divided by the variance of the PWME) tends to 0 as the shape parameter approaches 1/2 or -1/2. For values of $\kappa$ within the range $[-0.2, 0.2]$, however, the efficiency of the estimator of the location parameter is close to 100%, and the efficiencies of the estimators of the scale and shape parameters are greater than 70%. Hosking et al. (1985) found that the asymptotic efficiency of the PWME is poor for $\kappa$ outside the range $[-0.2, 0.2]$.
For the small sample results, Hosking et al. (1985) considered several possible
forms of the PWME (see equations (8)-(10) below). The best overall results were
given by the plotting-position PWME defined by equations (9) and (10) with plotting positions $p_{i:n} = (i - 0.35)/n$.
Small sample results for estimating the parameters show that in many cases all three methods give almost identical results; in other cases the results for the different estimators are a bit different, but not dramatically so. The MLE tends to be slightly less biased than the other two methods. For estimating the shape parameter, the MLE has a slightly larger standard deviation, and the PWME consistently has the smallest standard deviation.
Small sample results for estimating large quantiles show that in many cases all three methods are comparable; in other cases the PWME and JSE are comparable and in general have much smaller standard deviations than the MLE. All three methods are very inaccurate for estimating large quantiles in small samples.
Hosking et al. (1985) derive a test of the null hypothesis $H_0: \kappa = 0$ based on the PWME of $\kappa$. The test is performed by computing the statistic:
$z = \hat{\kappa}_{pwme} / \sqrt{0.5633 / n}$
and comparing it to a standard normal distribution (see zTestGevdShape). Based on computer simulations using the plotting-position PWME, they found that a sample size of $n \ge 25$ ensures an adequate normal approximation. They also found this test has power comparable to the modified likelihood-ratio test, which was found by Hosking (1984) to be the best overall test of $H_0: \kappa = 0$ of the thirteen tests he considered.
More Details
Probability-Weighted Moments and Parameters of the GEVD
The definition of a probability-weighted moment, introduced by Greenwood et al. (1979), is as follows. Let $X$ denote a random variable with cdf $F$, and let $x(p)$ denote the $p$'th quantile of the distribution. Then the $(i, j, k)$'th probability-weighted moment is given by:
$M(i, j, k) = E\{X^i [F(X)]^j [1 - F(X)]^k\} = \int_0^1 [x(p)]^i \, p^j (1 - p)^k \, dp$
where $i$, $j$, and $k$ are real numbers. Hosking et al. (1985) set
$\beta_j = M(1, j, 0)$
and Greenwood et al. (1979) show that
$\beta_j = \frac{1}{j+1} E\left[X_{(j+1):(j+1)}\right]$
where $E[X_{(j+1):(j+1)}]$ denotes the expected value of the $(j+1)$'th order statistic (i.e., the maximum) in a sample of size $j+1$. Hosking et al. (1985) show that if $X$ has a GEVD with parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$, where $\kappa \ne 0$, then
$\beta_j = \frac{1}{j+1}\left\{\eta + \frac{\theta\left[1 - (j+1)^{-\kappa}\,\Gamma(1+\kappa)\right]}{\kappa}\right\}$
for $\kappa > -1$, where $\Gamma$ denotes the gamma function. Thus, equations (6)-(8) express the first three probability-weighted moments $\beta_0$, $\beta_1$, and $\beta_2$ in terms of the distribution parameters $\eta$, $\theta$, and $\kappa$.
Estimating Distribution Parameters
Using the results of Landwehr et al. (1979), Hosking et al. (1985) show that, given a random sample of $n$ values from some arbitrary distribution, an unbiased, distribution-free, and parameter-free estimator of the probability-weighted moment $\beta_j$ defined above is given by:
$b_j = \frac{1}{n} \sum_{i=1}^{n} \frac{(i-1)(i-2)\cdots(i-j)}{(n-1)(n-2)\cdots(n-j)} \, x_{(i)}$
where the quantity $x_{(i)}$ denotes the $i$'th order statistic in the random sample of size $n$. Hosking et al. (1985) note that this estimator is closely related to U-statistics (Hoeffding, 1948; Lehmann, 1975, pp. 362-371).
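A minimal sketch of equation (9) written directly in R; EnvStats's pwMoment function can be used in practice, and the helper below is purely illustrative.

```r
# Sketch: the unbiased probability-weighted moment estimator b_j of equation (9).
pwm.unbiased <- function(x, j) {
  x <- sort(x)
  n <- length(x)
  # weight on the i'th order statistic: (i-1)(i-2)...(i-j) / [(n-1)(n-2)...(n-j)]
  w <- sapply(seq_len(n), function(i) prod((i - seq_len(j)) / (n - seq_len(j))))
  mean(w * x)
}
set.seed(5)
x <- rgevd(40, location = 10, scale = 2, shape = 0.1)
c(b0 = pwm.unbiased(x, 0), b1 = pwm.unbiased(x, 1), b2 = pwm.unbiased(x, 2))
```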
An alternative “plotting position” estimator is given by:
$\tilde{b}_j = \frac{1}{n} \sum_{i=1}^{n} (p_{i:n})^j \, x_{(i)}$
where $p_{i:n}$ denotes the plotting position of the $i$'th order statistic in the random sample of size $n$, that is, a distribution-free estimate of the cdf of $X$ evaluated at the $i$'th order statistic. Typically, plotting positions have the form:
$p_{i:n} = \frac{i + a}{n + b}$
for suitable constants $a$ and $b$. For this form of plotting position, the plotting-position estimators in (10) are asymptotically equivalent to the U-statistic estimators in (9).
Although the unbiased and plotting position estimators are asymptotically equivalent (Hosking, 1990), Hosking and Wallis (1995) recommend using the unbiased estimator for almost all applications because of its superior performance in small and moderate samples.
Using equations (6)-(8) above, i.e., the three equations involving $\beta_0$, $\beta_1$, and $\beta_2$, Hosking et al. (1985) define the probability-weighted moment estimators of $\eta$, $\theta$, and $\kappa$ as the solutions to these three simultaneous equations, with the values of the probability-weighted moments replaced by their estimated values (using either the unbiased or plotting-position estimators in (9) and (10) above). Hosking et al. (1985) note that the third equation (equation (8)) must be solved iteratively for the PWME of $\kappa$. Using the unbiased estimators of the PWMEs to solve for $\kappa$, the PWMEs of $\theta$ and $\eta$ are given by:
$\hat{\theta} = \frac{(2 b_1 - b_0)\,\hat{\kappa}}{\Gamma(1 + \hat{\kappa})\,(1 - 2^{-\hat{\kappa}})}, \qquad \hat{\eta} = b_0 + \frac{\hat{\theta}\,[\Gamma(1 + \hat{\kappa}) - 1]}{\hat{\kappa}}$
Hosking et al. (1985) show that when the unbiased estimates of the probability-weighted moments are used, the estimates of $\theta$ and $\kappa$ satisfy the feasibility criteria almost surely.
Hosking et al. (1985) show that the asymptotic distribution of the PWME is
multivariate normal with mean equal to (η, θ, κ), and they
derive the formula for the asymptotic variance-covariance matrix as:

V = G V_b G'                                              (15)

where V_b
denotes the variance-covariance matrix of the estimators of the probability-weighted
moments defined in either equation (9) or (10) above (recall that these two
estimators are asymptotically equivalent), and the matrix G is defined by:

G_ij = ∂φ_i / ∂β_(j-1),   i, j = 1, 2, 3                  (16)

where (φ_1, φ_2, φ_3) = (η, θ, κ). Hosking et al. (1985) provide the necessary formulas
in Appendix C of their manuscript. Note that there is a typographical error in
equation (C.11) (Jon Hosking, personal communication, 1996), which affects one
quantity in the second line of that equation.

The matrix G in equation (16) is not easily computed. Its inverse, however,
is easy to compute and then can be inverted numerically (Jon Hosking, 1996,
personal communication). The inverse of G
is given by:

[ G^(-1) ]_ij = ∂β_(i-1) / ∂φ_j,   i, j = 1, 2, 3

and by equation (5) above these partial derivatives can be written in closed form.
Estimating Distribution Quantiles
If X has a GEVD with parameters
location = η, scale = θ, and shape = κ, where κ ≠ 0,
then the p'th quantile of the distribution is given by:

x_p = η + (θ/κ) [ 1 - (-log(p))^κ ],   0 < p < 1

Given estimated values of the location, scale, and shape
parameters, the p'th quantile of the distribution is estimated as:

x̂_p = η̂ + (θ̂/κ̂) [ 1 - (-log(p))^κ̂ ]
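As an illustration (not part of the original help file, and assuming the EnvStats functions rgevd, egevd, and qgevd behave as documented), the following sketch estimates the GEVD parameters by probability-weighted moments and then plugs the estimates into the quantile formula above to estimate the 99'th percentile:

set.seed(479)
x <- rgevd(60, location = 10, scale = 2, shape = 0.2)

# Probability-weighted moment estimates of location, scale, and shape
est <- egevd(x, method = "pwme")
est$parameters

# Estimated 99th percentile based on the estimated parameters
qgevd(0.99, location = est$parameters["location"],
      scale = est$parameters["scale"],
      shape = est$parameters["shape"])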
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hoeffding, W. (1948). A Class of Statistics with Asymptotically Normal Distribution. Annals of Mathematical Statistics 19, 293–325.
Hosking, J.R.M. (1985). Algorithm AS 215: Maximum-Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 34(3), 301–310.
Hosking, J.R.M. (1990). L-Moments: Analysis and Estimation of
Distributions Using Linear Combinations of Order Statistics. Journal of
the Royal Statistical Society, Series B 52(1), 105–124.
Hosking, J.R.M., and J.R. Wallis (1995). A Comparison of Unbiased and
Plotting-Position Estimators of L Moments. Water Resources
Research 31(8), 2019–2025.
Jenkinson, A.F. (1969). Statistics of Extremes. Technical Note 98, World Meteorological Office, Geneva.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.4-8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, 457pp.
Generalized Extreme Value Distribution, egevd
.
This class of objects is returned by functions that perform hypothesis tests
(e.g., the R function t.test
, the EnvStats function
kendallSeasonalTrendTest
, etc.).
Objects of class "htest"
are lists that contain information about the null
and alternative hypotheses, the estimated distribution parameters, the test statistic,
the p-value, and (optionally) confidence intervals for distribution parameters.
Objects of S3 class "htest"
are returned by any of the
EnvStats functions that perform hypothesis tests as listed
here: Hypothesis Tests.
(Note that functions that perform goodness-of-fit tests
return objects of class "gof"
or "gofTwoSample"
.)
Objects of class "htest"
generated by EnvStats functions may
contain additional components called
estimation.method
(method used to estimate the population parameter(s)),
sample.size
, and
bad.obs
(number of missing (NA
), undefined (NaN
), or infinite
(Inf
, -Inf
) values removed prior to performing the hypothesis test),
and interval
(a list with information about a confidence, prediction, or
tolerance interval).
Required Components
The following components must be included in a legitimate list of
class "htest"
.
null.value |
numeric vector containing the value(s) of the population parameter(s) specified by
the null hypothesis. This vector has a names attribute describing its element(s). |
alternative |
character string indicating the alternative hypothesis (the value of the input
argument alternative). |
method |
character string giving the name of the test used. |
estimate |
numeric vector containing the value(s) of the estimated population parameter(s)
involved in the null hypothesis. This vector has a names attribute describing its element(s). |
data.name |
character string containing the actual name(s) of the input data. |
statistic |
numeric scalar containing the value of the test statistic, with a
names attribute indicating the name of the test statistic. |
parameters |
numeric vector containing the parameter(s) associated with the null distribution of
the test statistic. This vector has a names attribute describing its element(s). |
p.value |
numeric scalar containing the p-value for the test under the null hypothesis. |
Optional Components
The following component may optionally be included in an object
of class "htest" generated by R functions that test hypotheses:
conf.int |
numeric vector of length 2 containing lower and upper confidence limits for the
estimated population parameter. This vector has an attribute called
conf.level indicating the confidence level associated with the interval. |
The following components may be included in objects of class "htest"
generated by EnvStats functions:
sample.size |
numeric scalar containing the number of non-missing observations in the sample used for the hypothesis test. |
estimation.method |
character string containing the method used to compute the estimated distribution
parameter(s). The value of this component will depend on the available estimation
methods (see the help file for the relevant estimation function). |
bad.obs |
the number of missing (NA), undefined (NaN), or infinite (Inf, -Inf) values removed prior to performing the hypothesis test. |
interval |
a list containing information about a confidence, prediction, or tolerance interval. |
Generic functions that have methods for objects of class
"htest"
include: print
.
Since objects of class "htest"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
print.htest
, Hypothesis Tests.
# Create an object of class "htest", then print it out. #------------------------------------------------------ htest.obj <- chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30) mode(htest.obj) #[1] "list" class(htest.obj) #[1] "htest" names(htest.obj) # [1] "statistic" "parameters" "p.value" "estimate" # [5] "null.value" "alternative" "method" "sample.size" # [9] "data.name" "bad.obs" "interval" htest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mean = 30 # #Alternative Hypothesis: True mean is greater than 30 # #Test Name: One-sample t-Test # Modified for # Positively-Skewed Distributions # (Chen, 1995) # #Estimated Parameter(s): mean = 34.566667 # sd = 27.330598 # skew = 2.365778 # #Data: EPA.02d.Ex.9.mg.per.L.vec # #Sample Size: 60 # #Test Statistic: t = 1.574075 # #Test Statistic Parameter: df = 59 # #P-values: z = 0.05773508 # t = 0.06040889 # Avg. of z and t = 0.05907199 # #Confidence Interval for: mean # #Confidence Interval Method: Based on z # #Confidence Interval Type: Lower # #Confidence Level: 95% # #Confidence Interval: LCL = 29.82 # UCL = Inf #========== # Extract the test statistic #--------------------------- htest.obj$statistic # t #1.574075 #========== # Clean up #--------- rm(htest.obj)
# Create an object of class "htest", then print it out. #------------------------------------------------------ htest.obj <- chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30) mode(htest.obj) #[1] "list" class(htest.obj) #[1] "htest" names(htest.obj) # [1] "statistic" "parameters" "p.value" "estimate" # [5] "null.value" "alternative" "method" "sample.size" # [9] "data.name" "bad.obs" "interval" htest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mean = 30 # #Alternative Hypothesis: True mean is greater than 30 # #Test Name: One-sample t-Test # Modified for # Positively-Skewed Distributions # (Chen, 1995) # #Estimated Parameter(s): mean = 34.566667 # sd = 27.330598 # skew = 2.365778 # #Data: EPA.02d.Ex.9.mg.per.L.vec # #Sample Size: 60 # #Test Statistic: t = 1.574075 # #Test Statistic Parameter: df = 59 # #P-values: z = 0.05773508 # t = 0.06040889 # Avg. of z and t = 0.05907199 # #Confidence Interval for: mean # #Confidence Interval Method: Based on z # #Confidence Interval Type: Lower # #Confidence Level: 95% # #Confidence Interval: LCL = 29.82 # UCL = Inf #========== # Extract the test statistic #--------------------------- htest.obj$statistic # t #1.574075 #========== # Clean up #--------- rm(htest.obj)
This class of objects is returned by EnvStats functions that perform
hypothesis tests based on censored data.
Objects of class "htestCensored"
are lists that contain information about
the null and alternative hypotheses, the censoring side, the censoring levels,
the percentage of observations that are censored,
the estimated distribution parameters (if applicable), the test statistic,
the p-value, and (optionally, if applicable)
confidence intervals for distribution parameters.
Objects of S3 class "htestCensored"
are returned by
the functions listed in the section Hypothesis Tests
in the help file
EnvStats Functions for Censored Data.
Currently, the only function listed is
twoSampleLinearRankTestCensored
.
Required Components
The following components must be included in a legitimate list of
class "htestCensored"
.
statistic |
numeric scalar containing the value of the test statistic, with a
names attribute indicating the name of the test statistic. |
parameters |
numeric vector containing the parameter(s) associated with the null distribution of
the test statistic. This vector has a names attribute describing its element(s). |
p.value |
numeric scalar containing the p-value for the test under the null hypothesis. |
null.value |
numeric vector containing the value(s) of the population parameter(s) specified by
the null hypothesis. This vector has a names attribute describing its element(s). |
alternative |
character string indicating the alternative hypothesis (the value of the input
argument alternative). |
method |
character string giving the name of the test used. |
sample.size |
numeric scalar containing the number of non-missing observations in the sample used for the hypothesis test. |
data.name |
character string containing the actual name(s) of the input data. |
bad.obs |
the number of missing (NA), undefined (NaN), or infinite (Inf, -Inf) values removed prior to performing the hypothesis test. |
censoring.side |
character string indicating whether the data are left- or right-censored. |
censoring.name |
character string indicating the name of the data object used to identify which values are censored. |
censoring.levels |
numeric scalar or vector indicating the censoring level(s). |
percent.censored |
numeric scalar indicating the percent of non-missing observations that are censored. |
Optional Components
The following components may optionally be included in an object
of class "htestCensored":
estimate |
numeric vector containing the value(s) of the estimated population parameter(s)
involved in the null hypothesis. This vector has a names attribute describing its element(s). |
estimation.method |
character string containing the method used to compute the estimated distribution
parameter(s). The value of this component will depend on the available estimation
methods (see the help file for the relevant estimation function). |
interval |
a list containing information about a confidence, prediction, or tolerance interval. |
Generic functions that have methods for objects of class
"htestCensored"
include: print
.
Since objects of class "htestCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
print.htestCensored
, Censored Data.
# Create an object of class "htestCensored", then print it out. #-------------------------------------------------------------- htestCensored.obj <- with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater")) mode(htestCensored.obj) #[1] "list" class(htestCensored.obj) #[1] "htest" names(htestCensored.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "censoring.side" #[13] "censoring.name" "censoring.levels" "percent.censored" htestCensored.obj #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) > Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Tarone-Ware Test # with Hypergeometric Variance # #Censoring Side: left # #Data: x = PCE.ppb[Well.type == "Compliance"] # y = PCE.ppb[Well.type == "Background"] # #Censoring Variable: x = Censored[Well.type == "Compliance"] # y = Censored[Well.type == "Background"] # #Sample Sizes: nx = 8 # ny = 6 # #Percent Censored: x = 12.5% # y = 50.0% # #Test Statistics: nu = 8.458912 # var.nu = 20.912407 # z = 1.849748 # #P-value: 0.03217495 #========== # Extract the test statistics #---------------------------- htestCensored.obj$statistic # nu var.nu z # 8.458912 20.912407 1.849748 #========== # Clean up #--------- rm(htestCensored.obj)
# Create an object of class "htestCensored", then print it out. #-------------------------------------------------------------- htestCensored.obj <- with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater")) mode(htestCensored.obj) #[1] "list" class(htestCensored.obj) #[1] "htest" names(htestCensored.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "censoring.side" #[13] "censoring.name" "censoring.levels" "percent.censored" htestCensored.obj #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) > Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Tarone-Ware Test # with Hypergeometric Variance # #Censoring Side: left # #Data: x = PCE.ppb[Well.type == "Compliance"] # y = PCE.ppb[Well.type == "Background"] # #Censoring Variable: x = Censored[Well.type == "Compliance"] # y = Censored[Well.type == "Background"] # #Sample Sizes: nx = 8 # ny = 6 # #Percent Censored: x = 12.5% # y = 50.0% # #Test Statistics: nu = 8.458912 # var.nu = 20.912407 # z = 1.849748 # #P-value: 0.03217495 #========== # Extract the test statistics #---------------------------- htestCensored.obj$statistic # nu var.nu z # 8.458912 20.912407 1.849748 #========== # Clean up #--------- rm(htestCensored.obj)
Predict concentration using a calibration line (or curve) and inverse regression.
inversePredictCalibrate(object, obs.y = NULL, n.points = ifelse(is.null(obs.y), 100, length(obs.y)), intervals = FALSE, coverage = 0.99, simultaneous = FALSE, individual = FALSE, trace = FALSE)
object |
an object of class "calibrate" that is the result of calling the function calibrate. |
obs.y |
optional numeric vector of observed values for the machine signal.
The default value is obs.y=NULL. |
n.points |
optional integer indicating the number of points at which to predict concentrations
(i.e., perform inverse regression). The default value is
n.points=ifelse(is.null(obs.y), 100, length(obs.y)). |
intervals |
optional logical scalar indicating whether to compute confidence intervals for
the predicted concentrations. The default value is intervals=FALSE. |
coverage |
optional numeric scalar between 0 and 1 indicating the confidence level associated with
the confidence intervals for the predicted concentrations.
The default value is coverage=0.99. |
simultaneous |
optional logical scalar indicating whether to base the confidence intervals
for the predicted values on simultaneous or non-simultaneous prediction limits.
The default value is simultaneous=FALSE. |
individual |
optional logical scalar indicating whether to base the confidence intervals for the predicted values
on prediction limits for the mean (individual=FALSE) or prediction limits for individual
observations (individual=TRUE). The default value is individual=FALSE. |
trace |
optional logical scalar indicating whether to print out (trace) the progress of
the inverse prediction for each of the specified values of obs.y.
The default value is trace=FALSE. |
A simple and frequently used calibration model is a straight line:

Y = β0 + β1 X + ε

where the response variable Y denotes the signal of the machine, the
predictor variable X denotes the true concentration in the physical
sample, and the error term ε is assumed to follow a normal distribution with
mean 0. Note that the average value of the signal for a blank (X = 0)
is the intercept β0. Other possible calibration models include higher order
polynomial models such as a quadratic or cubic model.
In a typical setup, a small number of samples with known
concentrations are measured and the signal is recorded. A sample with no
chemical in it, called a blank, is also measured. (You have to be careful
to define exactly what you mean by a “blank.” A blank could mean
a container from the lab that has nothing in it but is prepared in a similar
fashion to containers with actual samples in them. Or it could mean a
field blank: the container was taken out to the field and subjected to the
same process that all other containers were subjected to, except a physical
sample of soil or water was not placed in the container.) Usually,
replicate measures at the same known concentrations are taken.
(The term “replicate” must be well defined to distinguish between for
example the same physical samples that are measured more than once vs. two
different physical samples of the same known concentration.)
The function calibrate
initially fits a linear calibration
line or curve. Once the calibration line is fit, samples with unknown
concentrations are measured and their signals are recorded. In order to
produce estimated concentrations, you have to use inverse regression to
map the signals to the estimated concentrations. We can quantify the
uncertainty in the estimated concentration by combining inverse regression
with prediction limits for the signal Y.
A numeric matrix containing the results of the inverse calibration.
The first two columns are labeled obs.y
and pred.x
containing
the values of the argument obs.y
and the predicted values of x
(the concentration), respectively. If intervals=TRUE
, then the matrix also
contains the columns lpl.x
and upl.x
corresponding to the lower and
upper prediction limits for x
. Also, if intervals=TRUE
, then the
matrix has the attributes coverage
(the value of the argument coverage
)
and simultaneous
(the value of the argument simultaneous
).
Almost always the process of determining the concentration of a chemical in
a soil, water, or air sample involves using some kind of machine that
produces a signal, and this signal is related to the concentration of the
chemical in the physical sample. The process of relating the machine signal
to the concentration of the chemical is called calibration
(see calibrate
). Once calibration has been performed,
estimated concentrations in physical samples with unknown concentrations
are computed using inverse regression. The uncertainty in the process used
to estimate the concentration may be quantified with decision, detection,
and quantitation limits.
In practice, only the point estimate of concentration is reported (along
with a possible qualifier), without confidence bounds for the true
concentration . This is most unfortunate because it gives the
impression that there is no error associated with the reported concentration.
Indeed, both the International Organization for Standardization (ISO) and
the International Union of Pure and Applied Chemistry (IUPAC) recommend
always reporting both the estimated concentration and the uncertainty
associated with this estimate (Currie, 1997).
Steven P. Millard ([email protected])
Currie, L.A. (1997). Detection: International Update, and Some Emerging Di-Lemmas Involving Calibration, the Blank, and Multiple Detection Decisions. Chemometrics and Intelligent Laboratory Systems 37, 151–181.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 3 and p.335.
Hubaux, A., and G. Vos. (1970). Decision and Detection Limits for Linear Calibration Curves. Analytical Chemistry 42, 849–855.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.562–575.
pointwise
, calibrate
, detectionLimitCalibrate
, lm
# The data frame EPA.97.cadmium.111.df contains calibration data # for cadmium at mass 111 (ng/L) that appeared in # Gibbons et al. (1997b) and were provided to them by the U.S. EPA. # Here we # 1. Display a plot of these data along with the fitted calibration # line and 99% non-simultaneous prediction limits. # 2. Then based on an observed signal of 60 from a sample with # unknown concentration, we use the calibration line to estimate # the true concentration and use the prediction limits to compute # confidence bounds for the true concentration. # An observed signal of 60 results in an estimated value of cadmium # of 59.97 ng/L and a confidence interval of [53.83, 66.15]. # See Millard and Neerchal (2001, pp.566-569) for more details on # this example. Cadmium <- EPA.97.cadmium.111.df$Cadmium Spike <- EPA.97.cadmium.111.df$Spike calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) newdata <- data.frame(Spike = seq(min(Spike), max(Spike), length.out = 100)) pred.list <- predict(calibrate.list, newdata = newdata, se.fit = TRUE) pointwise.list <- pointwise(pred.list, coverage = 0.99, individual = TRUE) plot(Spike, Cadmium, ylim = c(min(pointwise.list$lower), max(pointwise.list$upper)), xlab = "True Concentration (ng/L)", ylab = "Observed Concentration (ng/L)") abline(calibrate.list, lwd=2) lines(newdata$Spike, pointwise.list$lower, lty=8, lwd=2) lines(newdata$Spike, pointwise.list$upper, lty=8, lwd=2) title(paste("Calibration Line and 99% Prediction Limits", "for US EPA Cadmium 111 Data", sep = "\n")) # Now estimate the true concentration based on # an observed signal of 60 ng/L. inversePredictCalibrate(calibrate.list, obs.y = 60, intervals = TRUE, coverage = 0.99, individual = TRUE) # obs.y pred.x lpl.x upl.x #[1,] 60 59.97301 53.8301 66.15422 #attr(, "coverage"): #[1] 0.99 #attr(, "simultaneous"): #[1] FALSE rm(Cadmium, Spike, calibrate.list, newdata, pred.list, pointwise.list)
Compute the interquartile range for a set of data.
iqr(x, na.rm = FALSE)
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from x prior to
computing the interquartile range. The default value is na.rm=FALSE. |
Let x_1, x_2, ..., x_n denote a random sample of n observations from
some distribution associated with a random variable X. The sample
interquartile range is defined as:

IQR = x̂_0.75 - x̂_0.25

where x_p denotes the p'th quantile of the distribution and
x̂_p denotes the estimate of this quantile (i.e., the sample
p'th quantile).
See the R help file for quantile
for information on how sample
quantiles are computed.
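For example, the following sketch (not part of the original help file, assuming EnvStats is attached so that iqr is available) shows that the sample interquartile range is simply the difference between the sample 75'th and 25'th percentiles as computed by quantile with its default settings:

set.seed(528)
x <- rlnorm(30, meanlog = 2, sdlog = 1)

iqr(x)
diff(as.vector(quantile(x, probs = c(0.25, 0.75))))
# The two values should agree.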
A numeric scalar – the interquartile range.
The interquartile range is a robust estimate of the spread of the
distribution. It is the length of the box (the distance between the lower and
upper quartiles) in a boxplot (see the R help file for boxplot).
For a normal distribution with standard deviation σ it can be shown that:

IQR = 2 z_0.75 σ ≈ 1.349 σ

where z_0.75 denotes the 75'th percentile of the standard normal distribution.
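A quick numerical check of this relationship (again a sketch, not part of the original help file):

2 * qnorm(0.75)
#[1] 1.34898

set.seed(321)
iqr(rnorm(100000, mean = 0, sd = 2)) / 2
# Should be close to 1.349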
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hirsch, R.M., D.R. Helsel, T.A. Cohn, and E.J. Gilroy. (1993). Statistical Analysis of Hydrologic Data. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 17, pp.5–7.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
Summary Statistics, summaryFull
,
var
, sd
.
# Generate 20 observations from a normal distribution with parameters # mean=10 and sd=2, and compute the standard deviation and # interquartile range. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rnorm(20, mean=10, sd=2) sd(dat) #[1] 1.180226 iqr(dat) #[1] 1.489932 #---------- # Repeat the last example, but add a couple of large "outliers" to the # data. Note that the estimated standard deviation is greatly affected # by the outliers, while the interquartile range is not. summaryStats(dat, quartiles = TRUE) # N Mean SD Median Min Max 1st Qu. 3rd Qu. #dat 20 9.8612 1.1802 9.6978 7.6042 11.8756 9.1618 10.6517 new.dat <- c(dat, 20, 50) sd(dat) #[1] 1.180226 sd(new.dat) #[1] 8.79796 iqr(dat) #[1] 1.489932 iqr(new.dat) #[1] 1.851472 #---------- # Clean up rm(dat, new.dat)
Perform a nonparametric test for a monotonic trend within each season based on Kendall's tau statistic, and optionally compute a confidence interval for the slope across all seasons.
kendallSeasonalTrendTest(y, ...) ## S3 method for class 'formula' kendallSeasonalTrendTest(y, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: kendallSeasonalTrendTest(y, season, year, alternative = "two.sided", correct = TRUE, ci.slope = TRUE, conf.level = 0.95, independent.obs = TRUE, data.name = NULL, season.name = NULL, year.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...) ## S3 method for class 'data.frame' kendallSeasonalTrendTest(y, ...) ## S3 method for class 'matrix' kendallSeasonalTrendTest(y, ...)
y |
an object containing data for the trend test. In the default method,
the argument |
data |
specifies an optional data frame, list or environment (or object coercible by
|
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
season |
numeric or character vector or a factor indicating the seasons in which the observations in
|
year |
numeric vector indicating the years in which the observations in |
alternative |
character string indicating the kind of alternative hypothesis. The
possible values are |
correct |
logical scalar indicating whether to use the correction for continuity in
computing the |
ci.slope |
logical scalar indicating whether to compute a confidence interval for the
slope. The default value is |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated
with the confidence interval for the slope. The default value is
|
independent.obs |
logical scalar indicating whether to assume the observations in |
data.name |
character string indicating the name of the data used for the trend test.
The default value is |
season.name |
character string indicating the name of the data used for the season.
The default value is |
year.name |
character string indicating the name of the data used for the year.
The default value is |
parent.of.data |
character string indicating the source of the data used for the trend test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the test for trend. |
Hirsch et al. (1982) introduced a modification of Kendall's test for trend
(see kendallTrendTest
) that allows for seasonality in observations collected over time.
They call this test the seasonal Kendall test. Their test is appropriate for testing for
trend in each season when the trend is always in the same direction across all seasons.
van Belle and Hughes (1984) introduced a heterogeneity test for trend which is appropriate for testing
for trend in any direction in any season. Hirsch and Slack (1984) proposed an extension to the seasonal
Kendall test that allows for serial dependence in the observations. The function
kendallSeasonalTrendTest
includes all of these tests, as well as an extension of the
van Belle-Hughes heterogeneity test to the case of serial dependence.
Testing for Trend Assuming Serial Independence
The Model
Assume observations are taken over two or more years, and assume a single year
can be divided into two or more seasons. Let p denote the number of seasons.
Let X and Y denote two continuous random variables with some joint
(bivariate) distribution (which may differ from season to season). Let n_j
denote the number of bivariate observations taken in the j'th season (over two
or more years) (j = 1, 2, ..., p), let

(X_{1j}, Y_{1j}), (X_{2j}, Y_{2j}), ..., (X_{n_j j}, Y_{n_j j})

denote the bivariate observations from this distribution for season j,
assume these bivariate observations are mutually independent, and let τ_j
denote the value of Kendall's tau for that season (see kendallTrendTest).
Also, assume all of the observations are independent.
The function kendallSeasonalTrendTest assumes that the X values always
denote the year in which the observation was taken. Note that within any season,
the X values need not be unique. That is, there may be more than one
observation within the same year within the same season. In this case, the
argument y must be a numeric vector, and you must supply the additional
arguments season and year.
If there is only one observation per season per year (missing values allowed), it is
usually easiest to supply the argument y as an n × p matrix or
data frame, where n denotes the number of years. In this case

n_1 = n_2 = ... = n_p = n                                            (2)

and

X_{ij} = i,   i = 1, 2, ..., n;  j = 1, 2, ..., p                    (3)

so if Y denotes the n × p matrix of observed Y's and X denotes the
n × p matrix of the X's, then

Y = [ Y_{11}  Y_{12}  ...  Y_{1p}
      Y_{21}  Y_{22}  ...  Y_{2p}                                    (4)
        .       .            .
      Y_{n1}  Y_{n2}  ...  Y_{np} ]

X = [ 1  1  ...  1
      2  2  ...  2                                                   (5)
      .  .       .
      n  n  ...  n ]
The null hypothesis is that within each season the X and Y random
variables are independent; that is, within each season there is no trend in the
observations over time. This null hypothesis can be expressed as:

H0: τ_1 = τ_2 = ... = τ_p = 0                                        (6)

The Seasonal Kendall Test for Trend

Hirsch et al.'s (1982) extension of Kendall's tau statistic to test the null
hypothesis (6) is based on simply summing together the Kendall S-statistics
for each season and computing the following statistic:

z = S' / sqrt(Var(S'))                                               (7)

or, using the correction for continuity,

z = [ S' - sign(S') ] / sqrt(Var(S'))                                (8)

where

S' = Σ_{j=1}^{p} S_j                                                 (9)

S_j = Σ_{i<k} sign[ (X_{kj} - X_{ij}) (Y_{kj} - Y_{ij}) ]            (10)

and sign() denotes the sign function:

sign(x) =  1   if x > 0
           0   if x = 0                                              (11)
          -1   if x < 0
Note that the quantity defined in Equation (10) is simply the Kendall S-statistic for
season j (j = 1, 2, ..., p) (see Equation (3) in the help file for
kendallTrendTest).

For each season, if the predictor variables (the X's) are strictly increasing
(e.g., Equation (3) above), then Equation (10) simplifies to

S_j = Σ_{i<k} sign(Y_{kj} - Y_{ij})

Under the null hypothesis (6), the quantity z defined in Equation (7) or (8)
is approximately distributed as a standard normal random variable.

Note that there may be missing values in the observations, so let n_j
denote the number of (X, Y) pairs without missing values for season j.

The statistic S' in Equation (9) has mean and variance given by:

E(S') = Σ_{j=1}^{p} E(S_j)                                           (12)

Var(S') = Σ_{j=1}^{p} Var(S_j) + Σ_{g≠h} Cov(S_g, S_h)               (13)

        = Σ_{j=1}^{p} σ_j² + Σ_{g≠h} σ_gh                            (14)

Since all the observations are assumed to be mutually independent,

σ_gh = Cov(S_g, S_h) = 0,   g ≠ h,  g, h = 1, 2, ..., p              (15)

Furthermore, under the null hypothesis (6),

E(S_j) = 0,   j = 1, 2, ..., p                                       (16)

and, in the case of no tied observations,

σ_j² = Var(S_j) = n_j (n_j - 1) (2 n_j + 5) / 18                     (17)

for j = 1, 2, ..., p (see equation (7) in the help file for
kendallTrendTest).
In the case of tied observations,

σ_j² = Var(S_j) =
    [ n_j (n_j - 1)(2 n_j + 5) - Σ_t t(t - 1)(2t + 5) - Σ_u u(u - 1)(2u + 5) ] / 18
    + [ Σ_t t(t - 1)(t - 2) ] [ Σ_u u(u - 1)(u - 2) ] / [ 9 n_j (n_j - 1)(n_j - 2) ]
    + [ Σ_t t(t - 1) ] [ Σ_u u(u - 1) ] / [ 2 n_j (n_j - 1) ]                         (18)

where the sums over t are taken over the tied groups in the X observations for
season j, with t denoting the size of a tied group in the X observations, and the
sums over u are taken over the tied groups in the Y observations for season j, with
u denoting the size of a tied group in the Y observations
(see Equation (9) in the help file for kendallTrendTest).
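To see how Equations (7), (9), and (17) fit together, here is a small sketch (not part of the original help file; it assumes EnvStats is attached, one observation per season per year, and no ties or missing values). It computes the Kendall S-statistic and its null variance within each season, sums them, and forms the z-statistic without the continuity correction; the result can be compared with the output of kendallSeasonalTrendTest, which by default applies the continuity correction:

kendall.S <- function(y) {
  # S = sum over all pairs i < k of sign(y[k] - y[i]), with time order 1:length(y)
  n <- length(y)
  sum(sign(outer(y, y, "-"))[lower.tri(matrix(0, n, n))])
}

set.seed(15)
n.years <- 10
p <- 4
# One observation per season per year, with an increasing trend
y.mat <- matrix(rnorm(n.years * p), ncol = p) + 0.3 * (1:n.years)

S.seasonal   <- apply(y.mat, 2, kendall.S)
var.seasonal <- rep(n.years * (n.years - 1) * (2 * n.years + 5) / 18, p)

sum(S.seasonal) / sqrt(sum(var.seasonal))   # z without continuity correction

kendallSeasonalTrendTest(y.mat)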
Estimating τ, Slope, and Intercept

The function kendallSeasonalTrendTest returns estimated values of
Kendall's τ, the slope, and the intercept for each season, as well as a
single estimate for each of these three quantities combined over all seasons.
The overall estimate of τ is the weighted average of the p seasonal
τ's:

τ̂ = [ Σ_{j=1}^{p} n_j τ̂_j ] / [ Σ_{j=1}^{p} n_j ]                   (19)

where

τ̂_j = 2 S_j / [ n_j (n_j - 1) ]                                      (20)

(see Equation (2) in the help file for kendallTrendTest).

We can compute the estimated slope for season j as:

β̂_{1j} = Median{ (Y_{kj} - Y_{ij}) / (X_{kj} - X_{ij}) :  X_{kj} ≠ X_{ij} }          (21)

for j = 1, 2, ..., p. The overall estimate of slope, however, is
not the median of these p estimates of slope; instead,
following Hirsch et al. (1982, p.117), the overall estimate of slope is the median
of all two-point slopes computed within each season:

β̂_1 = Median{ (Y_{kj} - Y_{ij}) / (X_{kj} - X_{ij}) :  X_{kj} ≠ X_{ij},  j = 1, 2, ..., p }   (22)

(see Equation (15) in the help file for kendallTrendTest).
The overall estimate of intercept is the median of the p seasonal estimates of
intercept:

β̂_0 = Median( β̂_{01}, β̂_{02}, ..., β̂_{0p} )                         (23)

where

β̂_{0j} = Median(Y_{.j}) - β̂_{1j} Median(X_{.j})                      (24)

and Median(X_{.j}) and Median(Y_{.j}) denote the sample medians of the X's
and Y's, respectively, for season j (see Equation (16) in the help file for
kendallTrendTest).
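The following sketch (not part of the original help file; the helper seasonal.sen.slope is hypothetical and assumes one observation per season per year with no missing values) computes the overall slope of Equation (22) by brute force and can be compared with the slope component of the estimate element returned by kendallSeasonalTrendTest:

seasonal.sen.slope <- function(y.mat) {
  # y.mat: n.years x n.seasons matrix; the X's are the year indices 1, 2, ..., n.years.
  # Collect every within-season two-point slope (Y[k, j] - Y[i, j]) / (k - i), i < k.
  n <- nrow(y.mat)
  pairs <- combn(n, 2)
  slopes <- apply(y.mat, 2, function(y)
    (y[pairs[2, ]] - y[pairs[1, ]]) / (pairs[2, ] - pairs[1, ]))
  median(slopes)
}

set.seed(7)
y.mat <- matrix(rnorm(10 * 4), ncol = 4) + 0.5 * (1:10)

seasonal.sen.slope(y.mat)
kendallSeasonalTrendTest(y.mat)$estimate["slope"]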
Confidence Interval for the Slope
Gilbert (1987, p.227-228) extends his method of computing a confidence interval for
the slope to the case of seasonal observations. Let N' denote the number of
defined two-point estimated slopes that are used in Equation (22) above and let

β̂_{1(1)} ≤ β̂_{1(2)} ≤ ... ≤ β̂_{1(N')}

denote the N' ordered slopes. For Gilbert's (1987) method, a
100(1 - α)% two-sided confidence interval for the true over-all
slope across all seasons is given by:

[ β̂_{1(M1)} ,  β̂_{1(M2+1)} ]                                         (25)

where

M1 = (N' - C_α) / 2                                                   (26)

M2 = (N' + C_α) / 2                                                   (27)

C_α = z_{1-α/2} sqrt( Var(S') )                                       (28)

Var(S') is defined in Equation (14), and z_{1-α/2} denotes the
(1 - α/2)'th quantile of the standard normal distribution.
One-sided confidence intervals may be computed in a similar fashion.

Usually the quantities M1 and M2 will not be integers.
Gilbert (1987, p.219) suggests interpolating between adjacent values in this case,
which is what the function kendallSeasonalTrendTest does.
The Van Belle-Hughes Heterogeneity Test for Trend
The seasonal Kendall test described above is appropriate for testing the null
hypothesis (6) against the alternative hypothesis of a trend in at least one season.
All of the trends in each season should be in the same direction.
The seasonal Kendall test is not appropriate for testing for trend when there are
trends in a positive direction in one or more seasons and also negative trends in
one or more seasons. For example, for the following set of observations, the
seasonal Kendall statistic is 0 with an associated two-sided p-value of 1,
even though there is clearly a positive trend in season 1 and a negative trend in
season 2.
Year | Season 1 | Season 2 |
1 | 5 | 8 |
2 | 6 | 7 |
3 | 7 | 6 |
4 | 8 | 5 |
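A quick sketch (not part of the original help file, assuming EnvStats is attached) reproduces this behavior with the data in the table above:

y.mat <- cbind(Season.1 = c(5, 6, 7, 8), Season.2 = c(8, 7, 6, 5))
kendallSeasonalTrendTest(y.mat)
# The z (trend) statistic should be 0 with a two-sided p-value of 1, while the
# chi-square heterogeneity statistic should be large relative to a chi-square
# distribution with 1 degree of freedom.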
Van Belle and Hughes (1984) suggest using the following statistic to test for heterogeneity in trend prior to applying the seasonal Kendall test:

χ²_het = Σ_{j=1}^{p} Z_j² - p Z̄²                                     (29)

where

Z_j = S_j / sqrt( Var(S_j) )                                          (30)

Z̄ = (1/p) Σ_{j=1}^{p} Z_j                                            (31)

Under the null hypothesis (6), the statistic defined in Equation (29) is
approximately distributed as a chi-square random variable with (p - 1)
degrees of freedom. Note that the continuity correction is not used to
compute the Z_j's defined in Equation (30) since using it results in an
unacceptably conservative test (van Belle and Hughes, 1984, p.132). Van Belle and
Hughes (1984) actually call the statistic in (29) a homogeneous chi-square statistic.
Here it is called a heterogeneous chi-square statistic after the alternative
hypothesis it is meant to test.
Van Belle and Hughes (1984) imply that the heterogeneity statistic defined in Equation (29) may be used to test the null hypothesis:

H0: τ_1 = τ_2 = ... = τ_p = τ                                         (32)

where τ is some arbitrary number between -1 and 1. For this case, however,
the distribution of the test statistic in Equation (29) is unknown since it depends
on the unknown value of τ (Equations (16)-(18) above assume τ = 0
and are not correct if τ ≠ 0). The heterogeneity
chi-square statistic of Equation (29) may be assumed to be approximately
distributed as chi-square with (p - 1) degrees of freedom under the null
hypothesis (32), but further study is needed to determine how well this
approximation works.
Testing for Trend Assuming Serial Dependence
The Model
Assume the same model as for the case of serial independence, except now the
observed Y's are not assumed to be independent of one another, but are
allowed to be serially correlated over time. Furthermore, assume one observation
per season per year (Equations (2)-(5) above).
The Seasonal Kendall Test for Trend Modified for Serial Dependence
Hirsch and Slack (1984) introduced a modification of the seasonal Kendall test that
is robust against serial dependence (in terms of Type I error) except when the
observations have a very strong long-term persistence (very large autocorrelation) or
when the sample sizes are small (e.g., 5 years of monthly data). This modification
is based on a multivariate test introduced by Dietz and Killeen (1981).
In the case of serial dependence, Equation (15) is no longer true, so an estimate of
the correct value of σ_gh must be used to compute Var(S') in
Equation (14). Let R denote the n × p matrix of ranks for the Y
observations (Equation (4) above), where the Y's are ranked within
season:

R = [ R_{11}  R_{12}  ...  R_{1p}
      R_{21}  R_{22}  ...  R_{2p}                                     (33)
        .       .            .
      R_{n1}  R_{n2}  ...  R_{np} ]

where

R_{ij} = [ n_j + 1 + Σ_{k=1}^{n} sign(Y_{ij} - Y_{kj}) ] / 2          (34)

(terms involving missing values are set equal to 0),
the sign function is defined in Equation (11) above, and as before n_j denotes
the number of (X, Y) pairs without missing values for season j. Note that
by this definition, missing values are assigned the mid-rank of the non-missing
values.
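Here is a small sketch of the within-season ranking rule in Equation (34) (not part of the original help file; the helper hs.rank is hypothetical). It illustrates that non-missing values receive their ordinary mid-ranks among the non-missing values, while missing values receive the average rank (n_j + 1)/2:

hs.rank <- function(y) {
  # R_i = [n + 1 + sum_k sign(y[i] - y[k])] / 2, where n is the number of
  # non-missing values and terms involving missing values count as 0.
  n <- sum(!is.na(y))
  sgn <- sign(outer(y, y, "-"))
  sgn[is.na(sgn)] <- 0
  (n + 1 + rowSums(sgn)) / 2
}

y <- c(3.2, NA, 5.1, 4.7, NA, 5.1)
hs.rank(y)
# Non-missing values get ranks 1, 3.5, 2, 3.5 (mid-ranks for the tied 5.1's);
# the two missing values get the mid-rank (4 + 1)/2 = 2.5.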
Hirsch and Slack (1984) suggest using the following formula, given by Dietz and Killeen (1981), in the case where there are no missing values:

σ̂_gh = [ K_gh + 4 Σ_{i=1}^{n} R_{ig} R_{ih} - n (n + 1)² ] / 3        (35)

where

K_gh = Σ_{1 ≤ i < j ≤ n} sign[ (Y_{jg} - Y_{ig}) (Y_{jh} - Y_{ih}) ]   (36)

Note that the quantity defined in Equation (36) is Kendall's tau statistic for season
g vs. season h.

For the case of missing values, Hirsch and Slack (1984) derive the following modification of Equation (35):

σ̂_gh = [ K_gh + 4 Σ_{i=1}^{n} R_{ig} R_{ih} - n (n_g + 1)(n_h + 1) ] / 3   (37)
Technically, the estimates in Equations (35) and (37) are not correct estimators of
covariance, and Equations (17) and (18) are not correct estimators of variance,
because the model Dietz and Killeen (1981) use assumes that observations within the
rows of Y (Equation (4) above) may be correlated, but observations between
rows are independent. Serial dependence induces correlation between all of the
Y's. In most cases, however, the serial dependence shows an exponential decay
in correlation across time and so these estimates work fairly well (see more
discussion in the Note section below).
Estimates and Confidence Intervals

The seasonal and over-all estimates of τ, slope, and intercept are computed
using the same methods as in the case of serial independence. Also, the method for
computing the confidence interval for the slope is the same as in the case of serial
independence. Note that the serial dependence is accounted for in the term
Var(S') in Equation (28).
The Van Belle-Hughes Heterogeneity Test for Trend Modified for Serial Dependence
Like its counterpart in the case of serial independence, the seasonal Kendall test
modified for serial dependence described above is appropriate for testing the null
hypothesis (6) against the alternative hypothesis of a trend in at least one season.
All of the trends in each season should be in the same direction.
The modified seasonal Kendall test is not appropriate for testing for trend when there are trends in a positive direction in one or more seasons and also negative trends in one or more seasons. This section describes a modification of the van Belle-Hughes heterogeneity test for trend in the presence of serial dependence.
Let S denote the p × 1 vector of Kendall S-statistics for
each season:

S = ( S_1, S_2, ..., S_p )'

The distribution of S is approximately multivariate normal with mean vector

μ = ( μ_1, μ_2, ..., μ_p )'

and variance-covariance matrix

Σ_S = [ σ_gh ],   g, h = 1, 2, ..., p

where

μ_j = E(S_j),   σ_jj = σ_j² = Var(S_j),   σ_gh = Cov(S_g, S_h) for g ≠ h
Define the p × p diagonal matrix D as

D = diag( 2 / [n_1(n_1 - 1)],  2 / [n_2(n_2 - 1)],  ...,  2 / [n_p(n_p - 1)] )
Then the vector of the p seasonal estimates of τ can be written as:

τ̂ = ( τ̂_1, τ̂_2, ..., τ̂_p )' = D S                                   (43)
so the distribution of the vector in Equation (43) is approximately multivariate normal with

E(τ̂) = D μ

Var(τ̂) = D Σ_S D'

where ' denotes the transpose operator.
Let C denote the (p - 1) × p contrast matrix

C = [ 1  -I ]

where 1 denotes a column vector of (p - 1) ones and I denotes the
(p - 1) × (p - 1) identity matrix. That is,

C = [ 1  -1   0  ...   0
      1   0  -1  ...   0
      .              .
      1   0   0  ...  -1 ]
Then the null hypothesis (32) is equivalent to the null hypothesis:

H0: C τ = 0                                                           (47)

Based on theory for samples from a multivariate normal distribution (Johnson and Wichern, 2007), under the null hypothesis (47) the quantity

χ² = ( C τ̂ )' [ C V̂ C' ]^(-1) ( C τ̂ )                                (48)

has approximately a chi-square distribution with (p - 1) degrees of freedom for
“large” values of seasonal sample sizes, where

V̂ = D Σ̂_S D'                                                         (49)

The estimate of Σ_S in Equation (49) can be computed using the same formulas
that are used for the modified seasonal Kendall test (i.e., Equation (35) or (37)
for the off-diagonal elements and Equation (17) or (18) for the diagonal elements).
As previously noted, the formulas for the variances are actually only valid if
τ = 0 and there is no correlation between the rows of Y. The same is
true of the formulas for the covariances. More work is needed to determine the
goodness of the chi-square approximation for the test statistic in (48). The
pseudo-heterogeneity test statistic of Equation (48), however, should provide some
guidance as to whether the null hypothesis (32) (or equivalently (47)) appears to be
true.
A list of class "htest"
containing the results of the hypothesis
test. See the help file for htest.object
for details.
In addition, the following components are part of the list returned by kendallSeasonalTrendTest
:
seasonal.S |
numeric vector. The value of the Kendall S-statistic for each season. |
var.seasonal.S |
numeric vector. The variance of the Kendall S-statistic for each season.
This component only appears when independent.obs=TRUE. |
var.cov.seasonal.S |
numeric matrix. The estimated variance-covariance matrix of the Kendall
S-statistics for each season. This component only appears when independent.obs=FALSE. |
seasonal.estimates |
numeric matrix. The estimated Kendall's tau, slope, and intercept for each season. |
Kendall's test for independence or trend is a nonparametric test. No assumptions are made about the
distribution of the X and Y variables. Hirsch et al. (1982) introduced the seasonal
Kendall test to test for trend within each season. They note that Kendall's test for trend is easy to
compute, even in the presence of missing values, and can also be used with censored values.
van Belle and Hughes (1984) note that the seasonal Kendall test introduced by Hirsch et al. (1982) is similar to a multivariate extension of the sign test proposed by Jonckheere (1954). Jonckheere's test statistic is based on the unweighted sum of the seasonal tau statistics, while Hirsch et al.'s test is based on the weighted sum (weighted by number of observations within a season) of the seasonal tau statistics.
van Belle and Hughes (1984) also note that Kendall's test for trend is slightly less powerful than the test based on Spearman's rho, but it converges to normality faster. Also, Bradley (1968, p.288) shows that for the case of a linear model with normal (Gaussian) errors, the asymptotic relative efficiency of Kendall's test for trend versus the parametric test for a zero slope is 0.98.
Based on the work of Dietz and Killeen (1981), Hirsch and Slack (1984) describe a modified version of the
seasonal Kendall test that allows for serial dependence in the observations. They performed a Monte Carlo
study to determine the empirical significance level and power of this modified test vs. the test that
assumes independent observations and found a trade-off between power and the correct significance level.
For the numbers of seasons and years they considered, they found the modified test gave correct significance levels
as long as the lag-one autocorrelation was 0.6 or less, while the original test that assumes independent
observations yielded highly inflated significance levels. On the other hand, if in fact the observations
are serially independent, the original test is more powerful than the modified test.
Hirsch and Slack (1984) also looked at the performance of the test for trend introduced by
Dietz and Killeen (1981), which is a weighted sums of squares of the seasonal Kendall S-statistics,
where the matrix of weights is the inverse of the covariance matrix. The Dietz-Killeen test statistic,
unlike the one proposed by Hirsch and Slack (1984), tests for trend in either direction in any season,
and is asymptotically distributed as a chi-square random variable with p (the number of seasons)
degrees of freedom. Hirsch and Slack (1984), however, found that the test based on this statistic is
quite conservative (i.e., the significance level is much smaller than the assumed significance level)
and has poor power even for moderate sample sizes. The chi-square approximation becomes reasonably
close only for relatively large sample sizes, with the required number of years increasing with the number of seasons.
Lettenmaier (1988) notes the poor power of the test proposed by Dietz and Killeen (1981) and states the poor power apparently results from an upward bias in the estimated variance of the statistic, which can be traced to the inversion of the estimated covariance matrix. He suggests an alternative test statistic (to test trend in either direction in any season) that is the sum of the squares of the scaled seasonal Kendall S-statistics (scaled by their standard deviations). Note that this test statistic ignores information about the covariance between the seasonal Kendall S-statistics, although its distribution depends on these covariances. In the case of no serial dependence, Lettenmaier's test statistic is exactly the same as the Dietz-Killeen test statistic. In the case of serial dependence, Lettenmaier (1988) notes his test statistic is a quadratic form of a multivariate normal random variable and therefore all the moments of this random variable are easily computed. Lettenmaier (1988) approximates the distribution of his test statistic as a scaled non-central chi-square distribution (with fractional degrees of freedom). Based on extensive Monte Carlo studies, Lettenmaier (1988) shows that for the case when the trend is the same in all seasons, the seasonal Kendall's test of Hirsch and Slack (1984) is superior to his test and far superior to the Dietz-Killeen test. The power of Lettenmaier's test approached that of the seasonal Kendall test for large trend magnitudes.
Steven P. Millard ([email protected])
Bradley, J.V. (1968). Distribution-Free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, pp.256-272.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 16.
Helsel, D.R. and R.M. Hirsch. (1988). Discussion of Applicability of the t-test for Detecting Trends in Water Quality Variables. Water Resources Bulletin 24(1), 201-204.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, NY.
Helsel, D.R., and R. M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. Available on-line at https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf.
Hirsch, R.M., J.R. Slack, and R.A. Smith. (1982). Techniques of Trend Analysis for Monthly Water Quality Data. Water Resources Research 18(1), 107-121.
Hirsch, R.M. and J.R. Slack. (1984). A Nonparametric Trend Test for Seasonal Data with Serial Dependence. Water Resources Research 20(6), 727-732.
Hirsch, R.M., R.B. Alexander, and R.A. Smith. (1991). Selection of Methods for the Detection and Estimation of Trends in Water Quality. Water Resources Research 27(5), 803-813.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods, Second Edition. John Wiley and Sons, New York.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ.
Kendall, M.G. (1938). A New Measure of Rank Correlation. Biometrika 30, 81-93.
Kendall, M.G. (1975). Rank Correlation Methods. Charles Griffin, London.
Mann, H.B. (1945). Nonparametric Tests Against Trend. Econometrica 13, 245-259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Sen, P.K. (1968). Estimates of the Regression Coefficient Based on Kendall's Tau. Journal of the American Statistical Association 63, 1379-1389.
Theil, H. (1950). A Rank-Invariant Method of Linear and Polynomial Regression Analysis, I-III. Proc. Kon. Ned. Akad. v. Wetensch. A. 53, 386-392, 521-525, 1397-1412.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
van Belle, G., and J.P. Hughes. (1984). Nonparametric Tests for Trend in Water Quality. Water Resources Research 20(1), 127-136.
kendallTrendTest
, htest.object
, cor.test
.
# Reproduce Example 14-10 on page 14-38 of USEPA (2009). This example # tests for trend in analyte concentrations (ppm) collected monthly # between 1983 and 1985. head(EPA.09.Ex.14.8.df) # Month Year Unadj.Conc Adj.Conc #1 January 1983 1.99 2.11 #2 February 1983 2.10 2.14 #3 March 1983 2.12 2.10 #4 April 1983 2.12 2.13 #5 May 1983 2.11 2.12 #6 June 1983 2.15 2.12 tail(EPA.09.Ex.14.8.df) # Month Year Unadj.Conc Adj.Conc #31 July 1985 2.31 2.23 #32 August 1985 2.32 2.24 #33 September 1985 2.28 2.23 #34 October 1985 2.22 2.24 #35 November 1985 2.19 2.25 #36 December 1985 2.22 2.23 # Plot the data #-------------- Unadj.Conc <- EPA.09.Ex.14.8.df$Unadj.Conc Adj.Conc <- EPA.09.Ex.14.8.df$Adj.Conc Month <- EPA.09.Ex.14.8.df$Month Year <- EPA.09.Ex.14.8.df$Year Time <- paste(substring(Month, 1, 3), Year - 1900, sep = "-") n <- length(Unadj.Conc) Three.Yr.Mean <- mean(Unadj.Conc) dev.new() par(mar = c(7, 4, 3, 1) + 0.1, cex.lab = 1.25) plot(1:n, Unadj.Conc, type = "n", xaxt = "n", xlab = "Time (Month)", ylab = "ANALYTE CONCENTRATION (mg/L)", main = "Figure 14-15. Seasonal Time Series Over a Three Year Period", cex.main = 1.1) axis(1, at = 1:n, labels = rep("", n)) at <- rep(c(1, 5, 9), 3) + rep(c(0, 12, 24), each = 3) axis(1, at = at, labels = Time[at]) points(1:n, Unadj.Conc, pch = 0, type = "o", lwd = 2) points(1:n, Adj.Conc, pch = 3, type = "o", col = 8, lwd = 2) abline(h = Three.Yr.Mean, lwd = 2) legend("topleft", c("Unadjusted", "Adjusted", "3-Year Mean"), bty = "n", pch = c(0, 3, -1), lty = c(1, 1, 1), lwd = 2, col = c(1, 8, 1), inset = c(0.05, 0.01)) # Perform the seasonal Kendall trend test #---------------------------------------- kendallSeasonalTrendTest(Unadj.Conc ~ Month + Year, data = EPA.09.Ex.14.8.df) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: All 12 values of tau = 0 # #Alternative Hypothesis: The seasonal taus are not all equal # (Chi-Square Heterogeneity Test) # At least one seasonal tau != 0 # and all non-zero tau's have the # same sign (z Trend Test) # #Test Name: Seasonal Kendall Test for Trend # (with continuity correction) # #Estimated Parameter(s): tau = 0.9722222 # slope = 0.0600000 # intercept = -131.7350000 # #Estimation Method: tau: Weighted Average of # Seasonal Estimates # slope: Hirsch et al.'s # Modification of # Thiel/Sen Estimator # intercept: Median of # Seasonal Estimates # #Data: y = Unadj.Conc # season = Month # year = Year # #Data Source: EPA.09.Ex.14.8.df # #Sample Sizes: January = 3 # February = 3 # March = 3 # April = 3 # May = 3 # June = 3 # July = 3 # August = 3 # September = 3 # October = 3 # November = 3 # December = 3 # Total = 36 # #Test Statistics: Chi-Square (Het) = 0.1071882 # z (Trend) = 5.1849514 # #Test Statistic Parameter: df = 11 # #P-values: Chi-Square (Het) = 1.000000e+00 # z (Trend) = 2.160712e-07 # #Confidence Interval for: slope # #Confidence Interval Method: Gilbert's Modification of # Theil/Sen Method # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.05786914 # UCL = 0.07213086 #========== # Clean up #--------- rm(Unadj.Conc, Adj.Conc, Month, Year, Time, n, Three.Yr.Mean, at) graphics.off()
Perform a nonparametric test for a monotonic trend based on Kendall's tau statistic, and optionally compute a confidence interval for the slope.
kendallTrendTest(y, ...) ## S3 method for class 'formula' kendallTrendTest(y, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: kendallTrendTest(y, x = seq(along = y), alternative = "two.sided", correct = TRUE, ci.slope = TRUE, conf.level = 0.95, warn = TRUE, data.name = NULL, data.name.x = NULL, parent.of.data = NULL, subset.expression = NULL, ...)
y |
an object containing data for the trend test. In the default method,
the argument |
data |
specifies an optional data frame, list or environment (or object coercible by
|
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
x |
numeric vector of "predictor" values. The length of |
alternative |
character string indicating the kind of alternative hypothesis. The
possible values are |
correct |
logical scalar indicating whether to use the correction for continuity in
computing the |
ci.slope |
logical scalar indicating whether to compute a confidence interval for the
slope. The default value is |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated
with the confidence interval for the slope. The default value is
|
warn |
logical scalar indicating whether to print a warning message when
|
data.name |
character string indicating the name of the data used for the trend test.
The default value is |
data.name.x |
character string indicating the name of the data used for the predictor variable x.
If |
parent.of.data |
character string indicating the source of the data used for the trend test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the test for trend. |
kendallTrendTest
performs Kendall's nonparametric test for a monotonic trend,
which is a special case of the test for independence based on Kendall's tau statistic
(see cor.test
). The slope is estimated using the method of Theil (1950) and
Sen (1968). When ci.slope=TRUE
, the confidence interval for the slope is
computed using Gilbert's (1987) Modification of the Theil/Sen Method.
Kendall's test for a monotonic trend is a special case of the test for independence
based on Kendall's tau statistic. The first section below explains the general case
of testing for independence. The second section explains the special case of
testing for monotonic trend. The last section explains how a simple linear
regression model is a special case of a monotonic trend and how the slope may be
estimated.
The General Case of Testing for Independence
Definition of Kendall's Tau
Let X and Y denote two continuous random variables with some joint (bivariate) distribution. Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) denote a set of n bivariate observations from this distribution, and assume these bivariate observations are mutually independent. Kendall (1938, 1975) proposed a test for the hypothesis that the X and Y random variables are independent based on estimating the following quantity:

tau = 2 * Pr[(X_2 - X_1)(Y_2 - Y_1) > 0] - 1        (1)
The quantity tau in Equation (1) is called Kendall's tau, although this term is more often applied to the estimate of tau (see Equation (2) below). If X and Y are independent, then tau = 0. Furthermore, for most distributions of interest, if tau = 0 then the random variables X and Y are independent. (It can be shown that there exist some distributions for which tau = 0 and the random variables X and Y are not independent; see Hollander and Wolfe (1999, p.364).)
Note that Kendall's tau is similar to a correlation coefficient in that -1 <= tau <= 1. If X and Y always vary in the same direction, that is, if X_1 < X_2 always implies Y_1 < Y_2, then tau = 1. If X and Y always vary in the opposite direction, that is, if X_1 < X_2 always implies Y_1 > Y_2, then tau = -1. If tau > 0, this indicates X and Y are positively associated. If tau < 0, this indicates X and Y are negatively associated.
Estimating Kendall's Tau

The quantity tau in Equation (1) can be estimated by:

tau_hat = 2S / [n(n-1)]        (2)

where

S = sum_{i=1}^{n-1} sum_{j=i+1}^{n} sign[(X_j - X_i)(Y_j - Y_i)]        (3)

and sign() denotes the sign function:

sign(x) = -1 if x < 0,  0 if x = 0,  1 if x > 0        (4)
(Hollander and Wolfe, 1999, Chapter 8; Conover, 1980, pp.256–260; Gilbert, 1987, Chapter 16; Helsel and Hirsch, 1992, pp.212–216; Gibbons et al., 2009, Chapter 11). The quantity defined in Equation (2) is called Kendall's rank correlation coefficient or more often Kendall's tau.
Note that the quantity S defined in Equation (3) is equal to the number of concordant pairs minus the number of discordant pairs. Hollander and Wolfe (1999, p.364) and Conover (1980, p.257) each use a different symbol for this statistic.
Testing the Null Hypothesis of Independence

The null hypothesis H0: tau = 0 can be tested using the statistic S defined in Equation (3) above. Tables of the distribution of S for small samples are given in Hollander and Wolfe (1999), Conover (1980, pp.458–459), Gilbert (1987, p.272), Helsel and Hirsch (1992, p.469), and Gibbons et al. (2009, p.210). The function kendallTrendTest uses the large sample approximation to the distribution of S under the null hypothesis, which is given by:

z = [S - E(S)] / sqrt(Var(S))        (5)

where

E(S) = 0        (6)

Var(S) = n(n-1)(2n+5) / 18        (7)

Under the null hypothesis, the quantity z defined in Equation (5) is approximately distributed as a standard normal random variable. Both Kendall (1975) and Mann (1945) show that the normal approximation is excellent even for small samples, provided that the following continuity correction is used:

z = [S - sign(S)] / sqrt(Var(S))        (8)
The function kendallTrendTest
performs the usual one-sample z-test using
the statistic computed in Equation (8) or Equation (5). The argument
correct
determines which equation is used to compute the z-statistic.
By default, correct=TRUE
so Equation (8) is used.
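Below is a minimal sketch (not part of the package; simulated data) that computes S, the no-ties variance from Equation (7), and the continuity-corrected z from Equation (8) by hand, and that should agree with the output of kendallTrendTest:

set.seed(47)
y <- cumsum(rnorm(15, mean = 0.2))   # hypothetical series with an upward drift
x <- seq_along(y)
ij <- combn(length(y), 2)
S <- sum(sign((y[ij[2, ]] - y[ij[1, ]]) * (x[ij[2, ]] - x[ij[1, ]])))  # Equation (3)
n <- length(y)
var.S <- n * (n - 1) * (2 * n + 5) / 18        # Equation (7), no ties
z <- (S - sign(S)) / sqrt(var.S)               # Equation (8), continuity correction
2 * pnorm(-abs(z))                             # two-sided p-value
# Compare with: kendallTrendTest(y)$p.value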
In the case of tied observations in either the observed X's and/or the observed Y's, the formula for the variance of S given in Equation (7) must be modified as follows:

Var(S) = n(n-1)(2n+5)/18
         - sum_{i=1}^{g} t_i (t_i - 1)(2 t_i + 5) / 18
         - sum_{j=1}^{h} u_j (u_j - 1)(2 u_j + 5) / 18
         + [sum_{i=1}^{g} t_i (t_i - 1)(t_i - 2)] [sum_{j=1}^{h} u_j (u_j - 1)(u_j - 2)] / [9 n(n-1)(n-2)]
         + [sum_{i=1}^{g} t_i (t_i - 1)] [sum_{j=1}^{h} u_j (u_j - 1)] / [2 n(n-1)]        (9)

where g is the number of tied groups in the X observations, t_i is the size of the i'th tied group in the X observations, h is the number of tied groups in the Y observations, and u_j is the size of the j'th tied group in the Y observations. In the case of no ties in either the X or Y observations, Equation (9) reduces to Equation (7).
The Special Case of Testing for Monotonic Trend

Often in environmental sampling, observations are taken periodically over time (Hirsch et al., 1982; van Belle and Hughes, 1984; Hirsch and Slack, 1984). In this case, the random variables Y_1, Y_2, ..., Y_n can be thought of as representing the observations, and the variables X_1, X_2, ..., X_n are no longer random but represent the time at which the i'th observation was taken. If the observations are equally spaced over time, then it is useful to make the simplification X_i = i for i = 1, 2, ..., n. This is in fact the default value of the argument x for the function kendallTrendTest.
In the case where the X's represent time and are all distinct, the test for independence between X and Y is equivalent to testing for a monotonic trend in Y, and the test statistic S simplifies to:

S = sum_{i=1}^{n-1} sum_{j=i+1}^{n} sign(Y_j - Y_i)        (10)

Also, the formula for the variance of S in the presence of ties (under the null hypothesis H0: tau = 0) simplifies to:

Var(S) = n(n-1)(2n+5)/18 - sum_{j=1}^{h} u_j (u_j - 1)(2 u_j + 5) / 18        (11)

A form of the test statistic S in Equation (10) was introduced by Mann (1945).
The Special Case of a Simple Linear Model: Estimating the Slope

Consider the simple linear regression model

Y_i = beta_0 + beta_1 X_i + epsilon_i,   i = 1, 2, ..., n        (12)

where beta_0 denotes the intercept, beta_1 denotes the slope, and the epsilon_i's are assumed to be independent and identically distributed random variables from the same distribution. This is a special case of dependence between the X's and the Y's, and the null hypothesis of a zero slope can be tested using Kendall's test statistic S (Equation (3) or (10) above) and the associated z-statistic (Equation (5) or (8) above) (Hollander and Wolfe, 1999, pp.415–420).
Theil (1950) proposed the following nonparametric estimator of the slope:

beta_1_hat = Median[ (Y_j - Y_i) / (X_j - X_i) ;  i < j ]        (13)

Note that the computation of the estimated slope involves computing

N = n(n-1)/2        (14)

“two-point” estimated slopes (assuming no tied X values), and taking the median of these N values.

Sen (1968) generalized this estimator to the case where there are possibly tied observations in the X's. In this case, Sen simply ignores the two-point estimated slopes where the X's are tied and computes the median based on the remaining two-point estimated slopes. That is, Sen's estimator is given by:

beta_1_hat = Median[ (Y_j - Y_i) / (X_j - X_i) ;  i < j, X_i != X_j ]        (15)

(Hollander and Wolfe, 1999, pp.421–422).
Conover (1980, p.267) suggests the following estimator for the intercept:

beta_0_hat = Y_median - beta_1_hat * X_median        (16)

where X_median and Y_median denote the sample medians of the X's and Y's, respectively. With these estimators of slope and intercept, the estimated regression line passes through the point (X_median, Y_median).
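The following minimal sketch (simulated data, not from the package examples) computes the Theil/Sen slope of Equation (15) and Conover's intercept of Equation (16) directly, for comparison with the estimate component returned by kendallTrendTest:

set.seed(23)
x <- 1:12
y <- 0.5 * x + rnorm(12)
ij <- combn(length(y), 2)
two.point.slopes <- (y[ij[2, ]] - y[ij[1, ]]) / (x[ij[2, ]] - x[ij[1, ]])
slope.hat <- median(two.point.slopes)                 # Theil/Sen estimator
intercept.hat <- median(y) - slope.hat * median(x)    # Conover's estimator
c(slope = slope.hat, intercept = intercept.hat)
# Compare with: kendallTrendTest(y ~ x)$estimate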
NOTE: The function kendallTrendTest always returns estimates of slope and intercept assuming a linear model (Equation (12)), while the p-value is based on Kendall's tau, which is testing for the broader alternative of any kind of dependence between the X's and Y's.
Confidence Interval for the Slope

Theil (1950) and Sen (1968) proposed methods to compute a confidence interval for the true slope, assuming the linear model of Equation (12) (see Hollander and Wolfe, 1999, pp.421-422). Gilbert (1987, p.218) illustrates a simpler method than the one given by Sen (1968) that is based on a normal approximation. Gilbert's (1987) method is an extension of the one given in Hollander and Wolfe (1999, p.424) that allows for ties and/or multiple observations per time period. This method is valid for fairly small sample sizes unless there are several tied observations.

Let N' denote the number of defined two-point estimated slopes that are used in Equation (15) above (if there are no tied X values then N' = N), and let beta_1_(1), beta_1_(2), ..., beta_1_(N') denote the N' ordered slopes. For Gilbert's (1987) method, a 100(1 - alpha)% two-sided confidence interval for the true slope runs from the M1'th largest slope to the M2'th largest slope, where

M1 = (N' - C_alpha) / 2,   M2 = (N' + C_alpha) / 2 + 1,   C_alpha = z_{1 - alpha/2} * sqrt(Var(S))

Var(S) is defined in Equations (7), (9), or (11), and z_p denotes the p'th quantile of the standard normal distribution. One-sided confidence intervals may be computed in a similar fashion.

Usually the quantities M1 and M2 will not be integers. Gilbert (1987, p.219) suggests interpolating between adjacent values in this case, which is what the function kendallTrendTest does.
A list of class "htest"
containing the results of the hypothesis
test. See the help file for htest.object
for details.
In addition, the following components are part of the list returned by
kendallTrendTest
:
S |
The value of the Kendall S-statistic. |
var.S |
The variance of the Kendall S-statistic. |
slopes |
A numeric vector of all possible two-point slope estimates.
This component is used by the function |
Kendall's test for independence or trend is a nonparametric test. No assumptions are made about the distribution of the X and Y variables. Hirsch et al. (1982) introduced the "seasonal Kendall test" to test for trend within each season. They note that Kendall's test for trend is easy to compute, even in the presence of missing values, and can also be used with censored values.
van Belle and Hughes (1984) note that Kendall's test for trend is slightly less powerful than the test based on Spearman's rho, but it converges to normality faster. Also, Bradley (1968, p.288) shows that for the case of a linear model with normal (Gaussian) errors, the asymptotic relative efficiency of Kendall's test for trend versus the parametric test for a zero slope is 0.98.
The results of the function kendallTrendTest
are similar to the
results of the built-in R function cor.test
with the
argument method="kendall"
except that cor.test
1) computes exact p-values when the number of pairs is less than 50 and
there are no ties, and 2) does not return a confidence interval for
the slope.
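A quick sketch (simulated data, not from the package) of this comparison; because there are fewer than 50 pairs and no ties, cor.test reports an exact p-value, so the two p-values differ slightly:

set.seed(5)
y <- 0.1 * (1:20) + rnorm(20)
kendallTrendTest(y)$p.value
cor.test(1:20, y, method = "kendall")$p.value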
Steven P. Millard ([email protected])
Bradley, J.V. (1968). Distribution-Free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, pp.256-272.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 16.
Helsel, D.R. and R.M. Hirsch. (1988). Discussion of Applicability of the t-test for Detecting Trends in Water Quality Variables. Water Resources Bulletin 24(1), 201-204.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, NY.
Helsel, D.R., and R. M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. Available on-line at https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf.
Hirsch, R.M., J.R. Slack, and R.A. Smith. (1982). Techniques of Trend Analysis for Monthly Water Quality Data. Water Resources Research 18(1), 107-121.
Hirsch, R.M. and J.R. Slack. (1984). A Nonparametric Trend Test for Seasonal Data with Serial Dependence. Water Resources Research 20(6), 727-732.
Hirsch, R.M., R.B. Alexander, and R.A. Smith. (1991). Selection of Methods for the Detection and Estimation of Trends in Water Quality. Water Resources Research 27(5), 803-813.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods, Second Edition. John Wiley and Sons, New York.
Kendall, M.G. (1938). A New Measure of Rank Correlation. Biometrika 30, 81-93.
Kendall, M.G. (1975). Rank Correlation Methods. Charles Griffin, London.
Mann, H.B. (1945). Nonparametric Tests Against Trend. Econometrica 13, 245-259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Sen, P.K. (1968). Estimates of the Regression Coefficient Based on Kendall's Tau. Journal of the American Statistical Association 63, 1379-1389.
Theil, H. (1950). A Rank-Invariant Method of Linear and Polynomial Regression Analysis, I-III. Proc. Kon. Ned. Akad. v. Wetensch. A. 53, 386-392, 521-525, 1397-1412.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
van Belle, G., and J.P. Hughes. (1984). Nonparametric Tests for Trend in Water Quality. Water Resources Research 20(1), 127-136.
cor.test
, kendallSeasonalTrendTest
, htest.object
.
# Reproduce Example 17-6 on page 17-33 of USEPA (2009). This example # tests for trend in sulfate concentrations (ppm) collected at various # months between 1989 and 1996. head(EPA.09.Ex.17.6.sulfate.df) # Sample.No Year Month Sampling.Date Date Sulfate.ppm #1 1 89 6 89.6 1989-06-01 480 #2 2 89 8 89.8 1989-08-01 450 #3 3 90 1 90.1 1990-01-01 490 #4 4 90 3 90.3 1990-03-01 520 #5 5 90 6 90.6 1990-06-01 485 #6 6 90 8 90.8 1990-08-01 510 # Plot the data #-------------- dev.new() with(EPA.09.Ex.17.6.sulfate.df, plot(Sampling.Date, Sulfate.ppm, pch = 15, ylim = c(400, 900), xlab = "Sampling Date", ylab = "Sulfate Conc (ppm)", main = "Figure 17-6. Time Series Plot of \nSulfate Concentrations (ppm)") ) Sulfate.fit <- lm(Sulfate.ppm ~ Sampling.Date, data = EPA.09.Ex.17.6.sulfate.df) abline(Sulfate.fit, lty = 2) # Perform the Kendall test for trend #----------------------------------- kendallTrendTest(Sulfate.ppm ~ Sampling.Date, data = EPA.09.Ex.17.6.sulfate.df) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: tau = 0 # #Alternative Hypothesis: True tau is not equal to 0 # #Test Name: Kendall's Test for Trend # (with continuity correction) # #Estimated Parameter(s): tau = 0.7667984 # slope = 26.6666667 # intercept = -1909.3333333 # #Estimation Method: slope: Theil/Sen Estimator # intercept: Conover's Estimator # #Data: y = Sulfate.ppm # x = Sampling.Date # #Data Source: EPA.09.Ex.17.6.sulfate.df # #Sample Size: 23 # #Test Statistic: z = 5.107322 # #P-value: 3.267574e-07 # #Confidence Interval for: slope # #Confidence Interval Method: Gilbert's Modification # of Theil/Sen Method # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 20.00000 # UCL = 35.71182 # Clean up #--------- rm(Sulfate.fit) graphics.off()
Compute the sample coefficient of kurtosis or excess kurtosis.
kurtosis(x, na.rm = FALSE, method = "fisher", l.moment.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), excess = TRUE)
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from |
method |
character string specifying what method to use to compute the sample coefficient
of kurtosis. The possible values are
|
l.moment.method |
character string specifying what method to use to compute the
|
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for
the plotting positions when |
excess |
logical scalar indicating whether to compute the kurtosis ( |
Let x_1, x_2, ..., x_n denote a random sample of n observations from some distribution with mean mu and standard deviation sigma.
Product Moment Coefficient of Kurtosis (method="moment" or method="fisher")

The coefficient of kurtosis of a distribution is the fourth standardized moment about the mean:

beta_2 = mu_4 / sigma^4

where

mu_r = E[ (X - mu)^r ]

denotes the r'th moment about the mean (central moment). The coefficient of excess kurtosis is defined as:

beta_2 - 3
For a normal distribution, the coefficient of kurtosis is 3 and the coefficient of excess kurtosis is 0. Distributions with kurtosis less than 3 (excess kurtosis less than 0) are called platykurtic: they have shorter tails than a normal distribution. Distributions with kurtosis greater than 3 (excess kurtosis greater than 0) are called leptokurtic: they have heavier tails than a normal distribution.
When method="moment"
, the coefficient of kurtosis is estimated using the
method of moments estimator for the fourth central moment and and the method of
moments estimator for the variance:
where
This form of estimation should be used when resampling (bootstrap or jackknife).
When method="fisher"
, the coefficient of kurtosis is estimated using the
unbiased estimator for the fourth central moment (Serfling, 1980, p.73) and the
unbiased estimator for the variance.
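As a quick hedged illustration (simulated data), the method-of-moments estimate above can be computed directly from the sample central moments and compared with kurtosis(..., method = "moment"):

set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 1)
m2 <- mean((x - mean(x))^2)   # second sample central moment
m4 <- mean((x - mean(x))^4)   # fourth sample central moment
m4 / m2^2                     # coefficient of kurtosis
m4 / m2^2 - 3                 # coefficient of excess kurtosis
# Compare with: kurtosis(x, method = "moment", excess = FALSE) and kurtosis(x, method = "moment")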
L-Moment Coefficient of Kurtosis (method="l.moments")

Hosking (1990) defines the L-moment analog of the coefficient of kurtosis as:

tau_4 = lambda_4 / lambda_2

that is, the fourth L-moment divided by the second L-moment. He shows that this quantity lies in the interval (-1, 1).

When l.moment.method="unbiased", the L-kurtosis is estimated by:

t_4 = l_4 / l_2

that is, the unbiased estimator of the fourth L-moment divided by the unbiased estimator of the second L-moment. When l.moment.method="plotting.position", the L-kurtosis is estimated by:

t_4~ = l_4~ / l_2~

that is, the plotting-position estimator of the fourth L-moment divided by the plotting-position estimator of the second L-moment.

See the help file for lMoment for more information on estimating L-moments.
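A brief hedged sketch (simulated data) showing that the sample L-kurtosis is simply the ratio of the fourth to the second sample L-moment:

set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 1)
lMoment(x, 4) / lMoment(x, 2)
# Should equal: kurtosis(x, method = "l.moment", excess = FALSE)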
A numeric scalar – the sample coefficient of kurtosis or excess kurtosis.
Traditionally, the coefficient of kurtosis has been estimated using product
moment estimators. Sometimes an estimate of kurtosis is used in a
goodness-of-fit test for normality (D'Agostino and Stephens, 1986).
Hosking (1990) introduced the idea of L-moments and L-kurtosis. Vogel and Fennessey (1993) argue that L-moment ratios should replace product moment ratios because of their superior performance (they are nearly unbiased and better for discriminating between distributions). They compare product moment diagrams with L-moment diagrams. Hosking and Wallis (1995) recommend using unbiased estimators of L-moments (vs. plotting-position estimators) for almost all applications.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Vogel, R.M., and N.M. Fennessey. (1993). L Moment Diagrams Should Replace Product Moment Diagrams. Water Resources Research 29(6), 1745–1752.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
var
, sd
, cv
,
skewness
, summaryFull
,
Summary Statistics.
# Generate 20 observations from a lognormal distribution with parameters # mean=10 and cv=1, and estimate the coefficient of kurtosis and # coefficient of excess kurtosis. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 1) # Compute standard kurtosis first #-------------------------------- kurtosis(dat, excess = FALSE) #[1] 2.964612 kurtosis(dat, method = "moment", excess = FALSE) #[1] 2.687146 kurtosis(dat, method = "l.moment", excess = FALSE) #[1] 0.1444779 # Now compute excess kurtosis #---------------------------- kurtosis(dat) #[1] -0.0353876 kurtosis(dat, method = "moment") #[1] -0.3128536 kurtosis(dat, method = "l.moment") #[1] -2.855522 #---------- # Clean up rm(dat)
Lin and Evans (1980) reported fecal coliform measures (organisms per 100 ml) from the
Illinois River taken between 1971 and 1976. The object Lin.Evans.80.df
is a
small subset of these data that were reported by Helsel and Hirsch (1992, p.162).
Lin.Evans.80.df
A data frame with 24 observations on the following 2 variables.
Fecal.Coliform
a numeric vector of fecal coliform measure (organisms per 100 ml).
Season
an ordered factor indicating the season of collection
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, p.162.
Lin, S.D., and R.L. Evans. (1980). Coliforms and fecal streptococcus in the Illinois River at Peoria, 1971-1976. Illinois State Water Survey Report of Investigations No. 93. Urbana, IL, 28pp.
Compute the sample size necessary to achieve a specified power for a t-test for linear trend, given the scaled slope and significance level.
linearTrendTestN(slope.over.sigma, alpha = 0.05, power = 0.95, alternative = "two.sided", approx = FALSE, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
slope.over.sigma |
numeric vector specifying the ratio of the true slope to the standard deviation of
the error terms ( |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
round.up |
logical scalar indicating whether to round the computed sample size(s) up to the next whole number. The default value is
|
n.max |
positive integer greater than 2 indicating the maximum sample size.
The default value is |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
If the arguments slope.over.sigma
, alpha
, and power
are not
all the same length, they are replicated to be the same length as the length of
the longest argument.
Formulas for the power of the t-test of linear trend for specified values of
the sample size, scaled slope, and Type I error level are given in
the help file for linearTrendTestPower
. The function
linearTrendTestN
uses the uniroot
search algorithm to
determine the required sample size(s) for specified values of the power,
scaled slope, and Type I error level.
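As a quick hedged check (values not taken from the package documentation), the sample size returned by linearTrendTestN can be plugged back into linearTrendTestPower, which should then report at least the requested power:

n <- linearTrendTestN(slope.over.sigma = 0.1, power = 0.9)
n
linearTrendTestPower(n = n, slope.over.sigma = 0.1)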
a numeric vector of sample sizes.
See the help file for linearTrendTestPower
.
Steven P. Millard ([email protected])
See the help file for linearTrendTestPower
.
linearTrendTestPower
, linearTrendTestScaledMds
,
plotLinearTrendTestDesign
, lm
, summary.lm
, kendallTrendTest
,
Power and Sample Size, Normal, t.test
.
# Look at how the required sample size for the t-test for zero slope # increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 linearTrendTestN(slope.over.sigma = 0.1, power = seq(0.5, 0.9, by = 0.1)) #[1] 18 19 21 22 25 #---------- # Repeat the last example, but compute the sample size based on the approximate # power instead of the exact: linearTrendTestN(slope.over.sigma = 0.1, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) #[1] 18 19 21 22 25 #========== # Look at how the required sample size for the t-test for zero slope decreases # with increasing scaled slope: seq(0.05, 0.2, by = 0.05) #[1] 0.05 0.10 0.15 0.20 linearTrendTestN(slope.over.sigma = seq(0.05, 0.2, by = 0.05)) #[1] 41 26 20 17 #========== # Look at how the required sample size for the t-test for zero slope decreases # with increasing values of Type I error: linearTrendTestN(slope.over.sigma = 0.1, alpha = c(0.001, 0.01, 0.05, 0.1)) #[1] 33 29 26 25
Compute the power of a parametric test for linear trend, given the sample size or predictor variable values, scaled slope, and significance level.
linearTrendTestPower(n, x = lapply(n, seq), slope.over.sigma = 0, alpha = 0.05, alternative = "two.sided", approx = FALSE)
n |
numeric vector of sample sizes. All values of |
x |
numeric vector of predictor variable values, or a list in which each component is
a numeric vector of predictor variable values. Usually, the predictor variable is
time (e.g., days, months, quarters, etc.). The default value is
|
slope.over.sigma |
numeric vector specifying the ratio of the true slope to the standard deviation of
the error terms ( |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
If the argument x
is a vector, it is converted into a list with one
component. If the arguments n
, x
, slope.over.sigma
, and
alpha
are not all the same length, they are replicated to be the same
length as the length of the longest argument.
Basic Model

Consider the simple linear regression model

Y = beta_0 + beta_1 X + epsilon        (1)

where X denotes the predictor variable (observed without error), beta_0 denotes the intercept, beta_1 denotes the slope, and the error term epsilon is assumed to be a random variable from a normal distribution with mean 0 and standard deviation sigma. Let (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) denote n independent observed pairs from the model (1).
Often in environmental data analysis, we are interested in determining whether there
is a trend in some indicator variable over time. In this case, the predictor
variable is time (e.g., day, month, quarter, year, etc.), and the
values of the response variable
represent measurements taken over time.
The slope then represents the change in the average of the response variable per
one unit of time.
When the argument x
is a numeric vector, it represents the
values of the predictor variable. When the argument
x
is a
list, each component of x
is a numeric vector that represents a set values
of the predictor variable (and the number of elements may vary by component).
By default, the argument x
is a list for which the i'th component is simply
the integers from 1 to the value of the i'th element of the argument n
,
representing, for example, Day 1, Day2, ..., Day n[i]
.
In the discussion that follows, be sure not to confuse the intercept and slope coefficients beta_0 and beta_1 with the Type II error of the hypothesis test, which is denoted by beta.
Estimation of Coefficients and Confidence Interval for Slope

The standard least-squares estimators of the slope and intercept are given by:

beta_1_hat = S_xy / S_xx        (3)

beta_0_hat = y_bar - beta_1_hat * x_bar

where

S_xy = sum_{i=1}^{n} (x_i - x_bar)(y_i - y_bar),   S_xx = sum_{i=1}^{n} (x_i - x_bar)^2

(Draper and Smith, 1998, p.25; Zar, 2010, pp.332-334; Berthouex and Brown, 2002, p.297; Helsel and Hirsch, 1992, p.226). The estimator of slope in Equation (3) has a normal distribution with mean equal to the true slope, and variance given by:

Var(beta_1_hat) = sigma^2 / S_xx

(Draper and Smith, 1998, p.35; Zar, 2010, p.341; Berthouex and Brown, 2002, p.299; Helsel and Hirsch, 1992, p.227). Thus, a two-sided 100(1 - alpha)% confidence interval for the slope is given by:

[ beta_1_hat - t_{n-2}(1 - alpha/2) * sigma_hat_{beta_1},  beta_1_hat + t_{n-2}(1 - alpha/2) * sigma_hat_{beta_1} ]

where

sigma_hat_{beta_1} = sigma_hat / sqrt(S_xx)

and t_{nu}(p) denotes the p'th quantile of Student's t-distribution with nu degrees of freedom (Draper and Smith, 1998, p.36; Zar, 2010, p.343; Berthouex and Brown, 2002, p.300; Helsel and Hirsch, 1992, p.240).
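A small hedged sketch (simulated data, assumed variable names) of these estimators and the slope confidence interval, obtained in R with lm and confint:

set.seed(12)
x <- 1:20
y <- 2 + 0.3 * x + rnorm(20)
fit <- lm(y ~ x)
coef(summary(fit))["x", ]        # slope estimate, standard error, t statistic, p-value
confint(fit, "x", level = 0.95)  # two-sided 95% confidence interval for the slope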
Testing for a Non-Zero Slope

Consider the null hypothesis of a zero slope coefficient:

H0: beta_1 = 0        (14)

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

Ha: beta_1 > 0        (15)

the lower one-sided alternative (alternative="less"):

Ha: beta_1 < 0        (16)

and the two-sided alternative (alternative="two.sided"):

Ha: beta_1 != 0        (17)

The test of the null hypothesis (14) versus any of the three alternatives (15)-(17) is based on the Student t-statistic:

t = beta_1_hat / sigma_hat_{beta_1} = beta_1_hat / (sigma_hat / sqrt(S_xx))        (18)

Under the null hypothesis (14), the t-statistic in (18) follows a Student's t-distribution with n - 2 degrees of freedom (Draper and Smith, 1998, p.36; Zar, 2010, p.341; Helsel and Hirsch, 1992, pp.238-239).
The formula for the power of the test of a zero slope depends on which alternative is being tested. The two subsections below describe exact and approximate formulas for the power of the test. Note that none of the equations for the power of the t-test requires knowledge of the values beta_1 or sigma (the population standard deviation of the error terms), only the ratio beta_1 / sigma. The argument slope.over.sigma is this ratio, and it is referred to as the “scaled slope”.
Exact Power Calculations (approx=FALSE)

This subsection describes the exact formulas for the power of the t-test for a zero slope.

Upper one-sided alternative (alternative="greater")

The standard Student's t-test rejects the null hypothesis (14) in favor of the upper alternative hypothesis (15) at level alpha if

t >= t_{nu}(1 - alpha)

where nu = n - 2 and, as noted previously, t_{nu}(p) denotes the p'th quantile of Student's t-distribution with nu degrees of freedom. The power of this test, denoted by 1 - beta, where beta denotes the probability of a Type II error, is given by:

1 - beta = 1 - Pr[ t_{nu, Delta} <= t_{nu}(1 - alpha) ]        (21)

where

Delta = sqrt(S_xx) * (beta_1 / sigma)        (22)

and t_{nu, Delta} denotes a non-central Student's t-random variable with nu degrees of freedom and non-centrality parameter Delta, and Pr[ t_{nu, Delta} <= x ] denotes the cumulative distribution function of this random variable evaluated at x (Johnson et al., 1995, pp.508-510).

Note that when the predictor variable X represents equally-spaced measures of time (e.g., days, months, quarters, etc.) and x_i = i for i = 1, 2, ..., n, then the non-centrality parameter in Equation (22) becomes:

Delta = (beta_1 / sigma) * sqrt( n (n - 1) (n + 1) / 12 )
Lower one-sided alternative (alternative="less"
)
The standard Student's t-test rejects the null hypothesis (1) in favor of the
lower alternative hypothesis (3) at level- if
and the power of this test is given by:
Two-sided alternative (alternative="two.sided"
)
The standard Student's t-test rejects the null hypothesis (14) in favor of the
two-sided alternative hypothesis (17) at level- if
and the power of this test is given by:
The power of the t-test given in Equation (28) can also be expressed in terms of the cumulative distribution function of the non-central F-distribution as follows. Let F_{1, nu, Delta^2} denote a non-central F random variable with 1 and nu degrees of freedom and non-centrality parameter Delta^2, and let H(x; 1, nu, Delta^2) denote the cumulative distribution function of this random variable evaluated at x. Also, let F_{1, nu}(p) denote the p'th quantile of the central F-distribution with 1 and nu degrees of freedom. It can be shown that

( t_{nu, Delta} )^2 =(d)= F_{1, nu, Delta^2}

where =(d)= denotes “equal in distribution”. Thus, it follows that

[ t_{nu}(1 - alpha/2) ]^2 = F_{1, nu}(1 - alpha)

so the formula for the power of the t-test given in Equation (28) can also be written as:

1 - beta = 1 - H[ F_{1, nu}(1 - alpha); 1, nu, Delta^2 ]
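The following hedged sketch (assumed notation, simulated design) evaluates the exact two-sided power of Equation (28) directly with the non-central t cumulative distribution function in R; it should agree with linearTrendTestPower(x = 1:10, slope.over.sigma = 0.1):

x <- 1:10
slope.over.sigma <- 0.1
Delta <- slope.over.sigma * sqrt(sum((x - mean(x))^2))  # non-centrality parameter, Equation (22)
nu <- length(x) - 2
t.crit <- qt(1 - 0.05/2, nu)
1 - pt(t.crit, nu, ncp = Delta) + pt(-t.crit, nu, ncp = Delta)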
Approximate Power Calculations (approx=TRUE)

Zar (2010, pp.115–118) presents an approximation to the power for the t-test given in Equations (21), (26), and (28) above. His approximation to the power can be derived by using the approximation

Delta ≈ t_{nu}(1 - alpha) + t_{nu}(1 - beta)

where ≈ denotes “approximately equal to”. Zar's approximation can be summarized in terms of the cumulative distribution function of the non-central t-distribution as follows:

Pr[ t_{nu, Delta} <= x ] ≈ F(x - Delta; nu)

where F(y; nu) denotes the cumulative distribution function of the central Student's t-distribution with nu degrees of freedom evaluated at y.

The following three subsections explicitly derive the approximation to the power of the t-test for each of the three alternative hypotheses.

Upper one-sided alternative (alternative="greater")

The power for the upper one-sided alternative (15) given in Equation (21) can be approximated as:

1 - beta ≈ 1 - F[ t_{nu}(1 - alpha) - Delta; nu ] = Pr[ t_{nu} >= t_{nu}(1 - alpha) - Delta ]

where t_{nu} denotes a central Student's t-random variable with nu degrees of freedom.

Lower one-sided alternative (alternative="less")

The power for the lower one-sided alternative (16) given in Equation (26) can be approximated as:

1 - beta ≈ F[ t_{nu}(alpha) - Delta; nu ] = Pr[ t_{nu} <= -t_{nu}(1 - alpha) - Delta ]

Two-sided alternative (alternative="two.sided")

The power for the two-sided alternative (17) given in Equation (28) can be approximated as:

1 - beta ≈ 1 - F[ t_{nu}(1 - alpha/2) - Delta; nu ] + F[ -t_{nu}(1 - alpha/2) - Delta; nu ]
a numeric vector of powers.
Often in environmental data analysis, we are interested in determining whether
there is a trend in some indicator variable over time. In this case, the predictor
variable is time (e.g., day, month, quarter, year, etc.), and the
values of the response variable represent measurements taken over time. The slope
then represents the change in the average of the response variable per one unit of
time.
You can use the parametric model (1) to model your data, then use the R function
lm
to fit the regression coefficients and the summary.lm
function to perform a test for the significance of the slope coefficient. The
function linearTrendTestPower
computes the power of this t-test, given a
fixed value of the sample size, scaled slope, and significance level.
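For instance, a brief hedged sketch (simulated data, assumed variable names) of fitting model (1) with lm and reading off the t-test for a zero slope from the coefficients table of summary.lm:

set.seed(3)
dat <- data.frame(time = 1:24)
dat$y <- 5 + 0.2 * dat$time + rnorm(24)
fit <- lm(y ~ time, data = dat)
summary(fit)$coefficients["time", ]   # slope, std. error, t value, Pr(>|t|)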
You can also use Kendall's nonparametric test for trend
if you don't want to assume the error terms are normally distributed. When the
errors are truly normally distributed, the asymptotic relative efficiency of
Kendall's test for trend versus the parametric t-test for a zero slope is 0.98,
and Kendall's test can be more powerful than the parametric t-test when the errors
are not normally distributed. Thus the function linearTrendTestPower
can
also be used to estimate the power of Kendall's test for trend.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled slope if one of the objectives of the sampling program is to determine
whether a trend is occurring. The functions linearTrendTestPower
,
linearTrendTestN
, linearTrendTestScaledMds
, and plotLinearTrendTestDesign
can be used to investigate these
relationships.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 1.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 9.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 28, 31
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
linearTrendTestN
, linearTrendTestScaledMds
,
plotLinearTrendTestDesign
, lm
, summary.lm
, kendallTrendTest
,
Power and Sample Size, Normal, t.test
.
# Look at how the power of the t-test for zero slope increases with increasing # sample size: seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 power <- linearTrendTestPower(n = seq(5, 30, by = 5), slope.over.sigma = 0.1) round(power, 2) #[1] 0.06 0.13 0.34 0.68 0.93 1.00 #---------- # Repeat the last example, but compute the approximate power instead of the # exact: power <- linearTrendTestPower(n = seq(5, 30, by = 5), slope.over.sigma = 0.1, approx = TRUE) round(power, 2) #[1] 0.05 0.11 0.32 0.68 0.93 0.99 #---------- # Look at how the power of the t-test for zero slope increases with increasing # scaled slope: seq(0.05, 0.2, by = 0.05) #[1] 0.05 0.10 0.15 0.20 power <- linearTrendTestPower(15, slope.over.sigma = seq(0.05, 0.2, by = 0.05)) round(power, 2) #[1] 0.12 0.34 0.64 0.87 #---------- # Look at how the power of the t-test for zero slope increases with increasing # values of Type I error: power <- linearTrendTestPower(20, slope.over.sigma = 0.1, alpha = c(0.001, 0.01, 0.05, 0.1)) round(power, 2) #[1] 0.14 0.41 0.68 0.80 #---------- # Show that for a simple regression model, you get a greater power of detecting # a non-zero slope if you take all the observations at two endpoints, rather than # spreading the observations evenly between two endpoints. # (Note: This design usually cannot work with environmental monitoring data taken # over time since usually observations taken close together in time are not # independent.) linearTrendTestPower(x = 1:10, slope.over.sigma = 0.1) #[1] 0.1265976 linearTrendTestPower(x = c(rep(1, 5), rep(10, 5)), slope.over.sigma = 0.1) #[1] 0.2413823 #========== # Clean up #--------- rm(power)
Compute the scaled minimal detectable slope associated with a t-test for linear trend, given the sample size or predictor variable values, power, and significance level.
linearTrendTestScaledMds(n, x = lapply(n, seq), alpha = 0.05, power = 0.95, alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, tol = 1e-07, maxiter = 1000)
n |
numeric vector of sample sizes. All values of |
x |
numeric vector of predictor variable values, or a list in which each component is
a numeric vector of predictor variable values. Usually, the predictor variable is
time (e.g., days, months, quarters, etc.). The default value is
|
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
two.sided.direction |
character string indicating the direction (positive or negative) for the
scaled minimal detectable slope when |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
If the argument x
is a vector, it is converted into a list with one
component. If the arguments n
, x
, alpha
, and
power
are not all the same length, they are replicated to be the same
length as the length of the longest argument.
Formulas for the power of the t-test of linear trend for specified values of
the sample size, scaled slope, and Type I error level are given in
the help file for linearTrendTestPower
. The function
linearTrendTestScaledMds
uses the uniroot
search algorithm to
determine the minimal detectable scaled slope for specified values of the power,
sample size, and Type I error level.
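As a quick hedged check (values not taken from the package documentation), plugging the returned scaled minimal detectable slope back into linearTrendTestPower should reproduce (approximately) the requested power:

mds <- linearTrendTestScaledMds(n = 12, power = 0.9)
mds
linearTrendTestPower(n = 12, slope.over.sigma = mds)   # should be about 0.9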
numeric vector of computed scaled minimal detectable slopes. When
alternative="less"
, or alternative="two.sided"
and
two.sided.direction="less"
, the computed slopes are negative. Otherwise,
the slopes are positive.
See the help file for linearTrendTestPower
.
Steven P. Millard ([email protected])
See the help file for linearTrendTestPower
.
linearTrendTestPower
, linearTrendTestN
,
plotLinearTrendTestDesign
, lm
,
summary.lm
, kendallTrendTest
,
Power and Sample Size, Normal, t.test
.
# Look at how the scaled minimal detectable slope for the t-test for linear # trend increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 scaled.mds <- linearTrendTestScaledMds(n = 10, power = seq(0.5, 0.9, by = 0.1)) round(scaled.mds, 2) #[1] 0.25 0.28 0.31 0.35 0.41 #---------- # Repeat the last example, but compute the scaled minimal detectable slopes # based on the approximate power instead of the exact: scaled.mds <- linearTrendTestScaledMds(n = 10, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) round(scaled.mds, 2) #[1] 0.25 0.28 0.31 0.35 0.41 #========== # Look at how the scaled minimal detectable slope for the t-test for linear trend # decreases with increasing sample size: seq(10, 50, by = 10) #[1] 10 20 30 40 50 scaled.mds <- linearTrendTestScaledMds(seq(10, 50, by = 10), alternative = "greater") round(scaled.mds, 2) #[1] 0.40 0.13 0.07 0.05 0.03 #========== # Look at how the scaled minimal detectable slope for the t-test for linear trend # decreases with increasing values of Type I error: scaled.mds <- linearTrendTestScaledMds(10, alpha = c(0.001, 0.01, 0.05, 0.1), alternative="greater") round(scaled.mds, 2) #[1] 0.76 0.53 0.40 0.34 #---------- # Repeat the last example, but compute the scaled minimal detectable slopes # based on the approximate power instead of the exact: scaled.mds <- linearTrendTestScaledMds(10, alpha = c(0.001, 0.01, 0.05, 0.1), alternative="greater", approx = TRUE) round(scaled.mds, 2) #[1] 0.70 0.52 0.41 0.36 #========== # Clean up #--------- rm(scaled.mds)
L-Moments

Estimate the r'th L-moment from a random sample.
lMoment(x, r = 1, method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), na.rm = FALSE)
x |
numeric vector of observations. |
r |
positive integer specifying the order of the moment. |
method |
character string specifying what method to use to compute the
|
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
na.rm |
logical scalar indicating whether to remove missing values from |
Definitions: L-Moments and L-Moment Ratios

The definition of an L-moment given by Hosking (1990) is as follows. Let X denote a random variable with cdf F, and let x(p) denote the p'th quantile of the distribution. Furthermore, let

x_{1:n} <= x_{2:n} <= ... <= x_{n:n}

denote the order statistics of a random sample of size n drawn from the distribution of X. Then the r'th L-moment is given by:

lambda_r = (1/r) * sum_{k=0}^{r-1} (-1)^k * choose(r-1, k) * E[ X_{r-k:r} ]

for r = 1, 2, 3, ....

Hosking (1990) shows that the above equation can be rewritten as:

lambda_r = integral_0^1 x(u) * P*_{r-1}(u) du

where

P*_r(u) = sum_{k=0}^{r} p*_{r,k} * u^k,    p*_{r,k} = (-1)^{r-k} * choose(r, k) * choose(r+k, k)

The first four L-moments are given by:

lambda_1 = E[X]
lambda_2 = (1/2) E[ X_{2:2} - X_{1:2} ]
lambda_3 = (1/3) E[ X_{3:3} - 2 X_{2:3} + X_{1:3} ]
lambda_4 = (1/4) E[ X_{4:4} - 3 X_{3:4} + 3 X_{2:4} - X_{1:4} ]

Thus, the first L-moment is a measure of location, and the second L-moment is a measure of scale.

Hosking (1990) defines the L-moment ratios of X to be:

tau_r = lambda_r / lambda_2

for r = 3, 4, 5, .... He shows that for a non-degenerate random variable with a finite mean, these quantities lie in the interval (-1, 1). The quantity tau_3 is the L-moment analog of the coefficient of skewness, and the quantity tau_4 is the L-moment analog of the coefficient of kurtosis. Hosking (1990) also defines an L-moment analog of the coefficient of variation (denoted the L-CV) as:

tau = lambda_2 / lambda_1

He shows that for a positive-valued random variable, the L-CV lies in the interval (0, 1).
Relationship Between L-Moments and Probability-Weighted Moments

Hosking (1990) and Hosking and Wallis (1995) show that L-moments can be written as linear combinations of probability-weighted moments:

lambda_{r+1} = sum_{k=0}^{r} p*_{r,k} * beta_k

where

beta_k = E{ X [F(X)]^k }

See the help file for pwMoment for more information on probability-weighted moments.
Estimating L-Moments

The two commonly used methods for estimating L-moments are the “unbiased” method based on U-statistics (Hoeffding, 1948; Lehmann, 1975, pp. 362-371), and the “plotting-position” method. Hosking and Wallis (1995) recommend using the unbiased method for almost all applications.

Unbiased Estimators (method="unbiased")

Using the relationship between L-moments and probability-weighted moments explained above, the unbiased estimator of the r'th L-moment is based on unbiased estimators of the probability-weighted moments and is given by:

l_r = sum_{k=0}^{r-1} p*_{r-1,k} * b_k

where

b_k = (1/n) * sum_{i=k+1}^{n} x_{i:n} * [ (i-1)(i-2)...(i-k) ] / [ (n-1)(n-2)...(n-k) ]

Plotting-Position Estimators (method="plotting.position")

Using the relationship between L-moments and probability-weighted moments explained above, the plotting-position estimator of the r'th L-moment is based on the plotting-position estimators of the probability-weighted moments and is given by:

l~_r = sum_{k=0}^{r-1} p*_{r-1,k} * b~_k

where

b~_k = (1/n) * sum_{i=1}^{n} [ p_{i:n} ]^k * x_{i:n}

and p_{i:n} denotes the plotting position of the i'th order statistic in the random sample of size n, that is, a distribution-free estimate of the cdf of X evaluated at the value of the i'th order statistic. Typically, plotting positions have the form:

p_{i:n} = (i - a) / (n + b)

where a and b are constants (the defaults are a = 0.35 and b = 0; see the argument plot.pos.cons). For this form of plotting position, the plotting-position estimators are asymptotically equivalent to their unbiased estimator counterparts.
Estimating L-Moment Ratios

L-moment ratios are estimated by simply replacing the population L-moments with the estimated L-moments. The estimated ratios based on the unbiased estimators are given by:

t_r = l_r / l_2

and the estimated ratios based on the plotting-position estimators are given by:

t~_r = l~_r / l~_2

In particular, the L-moment skew is estimated by:

t_3 = l_3 / l_2

or

t~_3 = l~_3 / l~_2

and the L-moment kurtosis is estimated by:

t_4 = l_4 / l_2

or

t~_4 = l~_4 / l~_2

Similarly, the L-moment coefficient of variation can be estimated using the unbiased L-moment estimators:

t = l_2 / l_1

or using the plotting-position L-moment estimators:

t~ = l~_2 / l~_1
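A short hedged sketch (simulated data) of these estimators using lMoment: the first sample L-moment equals the sample mean, and the ratios above give the sample L-skewness and L-kurtosis.

set.seed(250)
x <- rgevd(20, location = 10, scale = 2, shape = 0.25)
lMoment(x, 1)                   # equals mean(x)
lMoment(x, 3) / lMoment(x, 2)   # sample L-skewness (t_3)
lMoment(x, 4) / lMoment(x, 2)   # sample L-kurtosis (t_4)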
A numeric scalar – the value of the r'th L-moment as defined by Hosking (1990).
Hosking (1990) introduced the idea of L-moments, which are expectations of certain linear combinations of order statistics, as the basis of a general theory of describing theoretical probability distributions, computing summary statistics from observed data, estimating distribution parameters and quantiles, and performing hypothesis tests. The theory of L-moments parallels the theory of conventional moments. L-moments have several advantages over conventional moments, including:

L-moments can characterize a wider range of distributions because they always exist as long as the distribution has a finite mean.

L-moments are estimated by linear combinations of order statistics, so estimators based on L-moments are more robust to the presence of outliers than estimators based on conventional moments.

Based on the author's and others' experience, L-moment estimators are less biased and approximate their asymptotic distribution more closely in finite samples than estimators based on conventional moments.

L-moment estimators are sometimes more efficient (smaller RMSE) than even maximum likelihood estimators for small samples.

Hosking (1990) presents a table with formulas for the L-moments of common probability distributions. Articles that illustrate the use of L-moments include Fill and Stedinger (1995), Hosking and Wallis (1995), and Vogel and Fennessey (1993).

Hosking (1990) and Hosking and Wallis (1995) show the relationship between probability-weighted moments and L-moments.
Steven P. Millard ([email protected])
Fill, H.D., and J.R. Stedinger. (1995). Moment and Probability Plot
Correlation Coefficient Goodness-of-Fit Tests for the Gumbel Distribution and
Impact of Autocorrelation. Water Resources Research 31(1), 225–229.
Hosking, J.R.M. (1990). L-Moments: Analysis and Estimation of Distributions Using Linear Combinations of Order Statistics. Journal of the Royal Statistical Society, Series B 52(1), 105–124.
Hosking, J.R.M., and J.R. Wallis (1995). A Comparison of Unbiased and
Plotting-Position Estimators of Moments. Water Resources Research
31(8), 2019–2025.
Vogel, R.M., and N.M. Fennessey. (1993). L Moment Diagrams Should Replace Product Moment Diagrams. Water Resources Research 29(6), 1745–1752.
cv
, skewness
, kurtosis
,
pwMoment
.
# Generate 20 observations from a generalized extreme value distribution # with parameters location=10, scale=2, and shape=.25, then compute the # first four L-moments. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgevd(20, location = 10, scale = 2, shape = 0.25) lMoment(dat) #[1] 10.59556 lMoment(dat, 2) #[1] 1.0014 lMoment(dat, 3) #[1] 0.1681165 lMoment(dat, 4) #[1] 0.08732692 #---------- # Now compute some L-moments based on the plotting-position estimators: lMoment(dat, method = "plotting.position") #[1] 10.59556 lMoment(dat, 2, method = "plotting.position") #[1] 1.110264 lMoment(dat, 3, method="plotting.position", plot.pos.cons = c(.325,1)) #[1] -0.4430792 #---------- # Clean up #--------- rm(dat)
Density, distribution function, quantile function, and random generation
for the three-parameter lognormal distribution with parameters meanlog
,
sdlog
, and threshold
.
dlnorm3(x, meanlog = 0, sdlog = 1, threshold = 0) plnorm3(q, meanlog = 0, sdlog = 1, threshold = 0) qlnorm3(p, meanlog = 0, sdlog = 1, threshold = 0) rlnorm3(n, meanlog = 0, sdlog = 1, threshold = 0)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
meanlog |
vector of means of the distribution of the random variable on the log scale.
The default is |
sdlog |
vector of (positive) standard deviations of the random variable on the log scale.
The default is |
threshold |
vector of thresholds of the random variable on the log scale. The default
is |
The three-parameter lognormal distribution is simply the usual two-parameter lognormal distribution with a location shift.
Let X be a random variable with a three-parameter lognormal distribution
with parameters meanlog=μ, sdlog=σ, and threshold=γ. Then the random variable
Y = X − γ has a (two-parameter) lognormal distribution with parameters
meanlog=μ and sdlog=σ. Thus:
dlnorm3 calls dlnorm using the arguments x = x - threshold, meanlog = meanlog, sdlog = sdlog.

plnorm3 calls plnorm using the arguments q = q - threshold, meanlog = meanlog, sdlog = sdlog.

qlnorm3 calls qlnorm using the arguments p = p, meanlog = meanlog, sdlog = sdlog, and then adds the argument threshold to the result.

rlnorm3 calls rlnorm using the arguments n = n, meanlog = meanlog, sdlog = sdlog, and then adds the argument threshold to the result.
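As a concrete illustration of these relationships, here is a minimal sketch (it assumes the EnvStats package is installed and loaded) checking that dlnorm3() and qlnorm3() agree with the base R functions dlnorm() and qlnorm() after shifting by the threshold:

library(EnvStats)

meanlog <- 1; sdlog <- 2; threshold <- 10

# Density: dlnorm3() at x should equal dlnorm() at x - threshold
dlnorm3(15, meanlog, sdlog, threshold)
dlnorm(15 - threshold, meanlog, sdlog)

# Quantile: qlnorm3() should equal qlnorm() plus the threshold
qlnorm3(0.5, meanlog, sdlog, threshold)
qlnorm(0.5, meanlog, sdlog) + threshold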
The threshold parameter affects only the location of the
three-parameter lognormal distribution; it has no effect on the variance
or the shape of the distribution.
Denote the mean, variance, and coefficient of variation of Y = X − γ by:

E(Y) = θ,   Var(Y) = η²,   CV(Y) = τ = η/θ

Then the mean, variance, and coefficient of variation of X are given by:

E(X) = γ + θ,   Var(X) = η²,   CV(X) = η / (γ + θ)

The relationships between the parameters μ, σ, θ, η, and τ are as follows:

θ = exp(μ + σ²/2)
η² = θ² (ω − 1)
τ = sqrt(ω − 1)

where

ω = exp(σ²)

Since quantiles of a distribution are preserved under monotonic transformations,
the median of X is:

Median(X) = γ + exp(μ)
dlnorm3
gives the density, plnorm3
gives the distribution function,
qlnorm3
gives the quantile function, and rlnorm3
generates random
deviates.
The two-parameter lognormal distribution is the distribution of a random variable whose logarithm is normally distributed. The two major characteristics of the two-parameter lognormal distribution are that it is bounded below at 0, and it is skewed to the right. The three-parameter lognormal distribution is a generalization of the two-parameter lognormal distribution in which the distribution is shifted so that the threshold parameter is some arbitrary number, not necessarily 0.
The three-parameter lognormal distribution was introduced by Wicksell (1917) in a study of the distribution of ages at first marriage. Both the two- and three-parameter lognormal distributions have been used in a variety of fields, including economics and business, industry, biology, ecology, atmospheric science, and geology (Crow and Shimizu, 1988). Royston (1992) has discussed the application of the three-parameter lognormal distribution in the field of medicine.
The two-parameter lognormal distribution is often used to characterize chemical concentrations in the environment. Ott (1990) has shown mathematically how a series of successive random dilutions gives rise to a distribution that can be approximated by a two-parameter lognormal distribution.
The three-parameter lognormal distribution starts to resemble a normal
distribution as the parameter σ (the standard deviation of log(X − γ))
tends to 0.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, 176pp.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, 387pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Ott, W.R. (1990). A Physical Explanation of the Lognormality of Pollutant Concentrations. Journal of the Air and Waste Management Association 40, 1378–1383.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 9.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897–912.
Wicksell, S.D. (1917). On Logarithmic Correlation with an Application to the Distribution of Ages at First Marriage. Medd. Lunds. Astr. Obs. 84, 1–21.
Lognormal, elnorm3, Probability Distributions and Random Numbers.
# Density of the three-parameter lognormal distribution with
# parameters meanlog=1, sdlog=2, and threshold=10, evaluated at 10.5:

dlnorm3(10.5, 1, 2, 10)
#[1] 0.278794

#----------

# The cdf of the three-parameter lognormal distribution with
# parameters meanlog=2, sdlog=3, and threshold=5, evaluated at 9:

plnorm3(9, 2, 3, 5)
#[1] 0.4189546

#----------

# The median of the three-parameter lognormal distribution with
# parameters meanlog=2, sdlog=3, and threshold=20:

qlnorm3(0.5, 2, 3, 20)
#[1] 27.38906

#----------

# Random sample of 3 observations from the three-parameter lognormal
# distribution with parameters meanlog=2, sdlog=1, and threshold=-5.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnorm3(3, 2, 1, -5)
#[1] 18.6339749 -0.8873173 39.0561521
Density, distribution function, quantile function, and random generation
for the lognormal distribution with parameters mean and cv.
dlnormAlt(x, mean = exp(1/2), cv = sqrt(exp(1) - 1), log = FALSE)
plnormAlt(q, mean = exp(1/2), cv = sqrt(exp(1) - 1), lower.tail = TRUE, log.p = FALSE)
qlnormAlt(p, mean = exp(1/2), cv = sqrt(exp(1) - 1), lower.tail = TRUE, log.p = FALSE)
rlnormAlt(n, mean = exp(1/2), cv = sqrt(exp(1) - 1))
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean |
vector of (positive) means of the distribution of the random variable. |
cv |
vector of (positive) coefficients of variation of the random variable. |
log , log.p
|
logical; if |
lower.tail |
logical; if |
Let X be a random variable with a lognormal distribution with parameters
meanlog=μ and sdlog=σ. That is, μ and σ denote the mean and standard deviation
of the random variable on the log scale. The relationship between these
parameters and the mean (mean=θ) and coefficient of variation (cv=τ) of the
distribution on the original scale is given by:

θ = exp(μ + σ²/2)
τ = sqrt(exp(σ²) − 1)

μ = log(θ / sqrt(τ² + 1))
σ = sqrt(log(τ² + 1))
Thus, the functions dlnormAlt, plnormAlt, qlnormAlt, and rlnormAlt call the
R functions dlnorm, plnorm, qlnorm, and rlnorm, respectively, using the
following values for the meanlog and sdlog parameters:

sdlog <- sqrt(log(1 + cv^2))
meanlog <- log(mean) - (sdlog^2)/2
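As a quick check of this parameterization, here is a minimal sketch (it assumes the EnvStats package is installed and loaded) that converts the mean and cv to meanlog and sdlog by hand and compares dlnormAlt() with the base R function dlnorm():

library(EnvStats)

mn <- 10     # mean on the original scale
tau <- 1     # coefficient of variation on the original scale

sdlog <- sqrt(log(1 + tau^2))
meanlog <- log(mn) - sdlog^2 / 2

# Both calls should return the same density value
dlnormAlt(5, mean = mn, cv = tau)
dlnorm(5, meanlog = meanlog, sdlog = sdlog)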
dlnormAlt
gives the density, plnormAlt
gives the distribution function,
qlnormAlt
gives the quantile function, and rlnormAlt
generates random
deviates.
The two-parameter lognormal distribution is the distribution of a random variable whose logarithm is normally distributed. The two major characteristics of the lognormal distribution are that it is bounded below at 0, and it is skewed to the right.
Because the empirical distribution of many variables is inherently positive and skewed to the right (e.g., size of organisms, amount of rainfall, size of income, etc.), the lognormal distribution has been widely applied in several fields, including economics, business, industry, biology, ecology, atmospheric science, and geology (Aitchison and Brown, 1957; Crow and Shimizu, 1988).
Gibrat (1930) derived the lognormal distribution from theoretical assumptions, calling it the "law of proportionate effect"; Kapteyn (1903) had earlier described a machine that was its mechanical equivalent. The basic idea follows from the Central Limit Theorem: the sum of several independent random variables tends toward a normal distribution regardless of the underlying distributions, so the logarithm of a product of several independent positive random variables (which is a sum of logarithms) tends toward a normal distribution, and the product itself therefore tends to look like a lognormal distribution.
The lognormal distribution is often used to characterize chemical concentrations in the environment. Ott (1990) has shown mathematically how a series of successive random dilutions gives rise to a distribution that can be approximated by a lognormal distribution.
A lognormal distribution starts to resemble a normal distribution as the
parameter σ (the standard deviation of the log of the distribution)
tends to 0.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) discourage using the assumption of a lognormal distribution for some types of environmental data and recommend instead assessing whether the data appear to fit a gamma distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Limpert, E., W.A. Stahel, and M. Abbt. (2001). Log-Normal Distributions Across the Sciences: Keys and Clues. BioScience 51, 341–352.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Lognormal, elnormAlt, Probability Distributions and Random Numbers.
# Density of the lognormal distribution with parameters
# mean=10 and cv=1, evaluated at 5:

dlnormAlt(5, mean = 10, cv = 1)
#[1] 0.08788173

#----------

# The cdf of the lognormal distribution with parameters mean=2 and cv=3,
# evaluated at 4:

plnormAlt(4, 2, 3)
#[1] 0.8879132

#----------

# The median of the lognormal distribution with parameters
# mean=10 and cv=1:

qlnormAlt(0.5, mean = 10, cv = 1)
#[1] 7.071068

#----------

# Random sample of 3 observations from a lognormal distribution with
# parameters mean=10 and cv=1.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnormAlt(3, mean = 10, cv = 1)
#[1] 18.615797  4.341402 31.265293
Density, distribution function, quantile function, and random generation
for a mixture of two lognormal distributions with parameters
meanlog1, sdlog1, meanlog2, sdlog2, and p.mix.
dlnormMix(x, meanlog1 = 0, sdlog1 = 1, meanlog2 = 0, sdlog2 = 1, p.mix = 0.5)
plnormMix(q, meanlog1 = 0, sdlog1 = 1, meanlog2 = 0, sdlog2 = 1, p.mix = 0.5)
qlnormMix(p, meanlog1 = 0, sdlog1 = 1, meanlog2 = 0, sdlog2 = 1, p.mix = 0.5)
rlnormMix(n, meanlog1 = 0, sdlog1 = 1, meanlog2 = 0, sdlog2 = 1, p.mix = 0.5)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
meanlog1 |
vector of means of the first lognormal random variable on the log scale.
The default is |
sdlog1 |
vector of standard deviations of the first lognormal random variable on
the log scale. The default is |
meanlog2 |
vector of means of the second lognormal random variable on the log scale.
The default is |
sdlog2 |
vector of standard deviations of the second lognormal random variable on
the log scale. The default is |
p.mix |
vector of probabilities between 0 and 1 indicating the mixing proportion.
For |
Let f(x; μ, σ) denote the density of a lognormal random variable with
parameters meanlog=μ and sdlog=σ. The density, h(x), of a lognormal mixture
random variable with parameters meanlog1=μ₁, sdlog1=σ₁, meanlog2=μ₂,
sdlog2=σ₂, and p.mix=p is given by:

h(x) = (1 − p) f(x; μ₁, σ₁) + p f(x; μ₂, σ₂)
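As a minimal sketch of this definition (it assumes the EnvStats package is installed and loaded), the mixture density can be reproduced as a weighted sum of two base R dlnorm() densities, with p.mix weighting the second component:

library(EnvStats)

x <- 1.5
p.mix <- 0.2

# Both expressions should return the same density value
dlnormMix(x, meanlog1 = 0, sdlog1 = 1, meanlog2 = 2, sdlog2 = 3, p.mix = p.mix)
(1 - p.mix) * dlnorm(x, 0, 1) + p.mix * dlnorm(x, 2, 3)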
dlnormMix
gives the density, plnormMix
gives the distribution function,
qlnormMix
gives the quantile function, and rlnormMix
generates random
deviates.
A lognormal mixture distribution is often used to model positive-valued data
that appear to be “contaminated”; that is, most of the values appear to
come from a single lognormal distribution, but a few “outliers” are
apparent. In this case, the value of meanlog2
would be larger than the
value of meanlog1
, and the mixing proportion p.mix
would be fairly
close to 0 (e.g., p.mix=0.1
). The value of the second standard deviation
(sdlog2
) may or may not be the same as the value for the first
(sdlog1
).
Steven P. Millard ([email protected])
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.53-54, and Chapter 8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Lognormal, NormalMix, Probability Distributions and Random Numbers.
# Density of a lognormal mixture with parameters meanlog1=0, sdlog1=1,
# meanlog2=2, sdlog2=3, p.mix=0.5, evaluated at 1.5:

dlnormMix(1.5, meanlog1 = 0, sdlog1 = 1, meanlog2 = 2, sdlog2 = 3, p.mix = 0.5)
#[1] 0.1609746

#----------

# The cdf of a lognormal mixture with parameters meanlog1=0, sdlog1=1,
# meanlog2=2, sdlog2=3, p.mix=0.2, evaluated at 4:

plnormMix(4, 0, 1, 2, 3, 0.2)
#[1] 0.8175281

#----------

# The median of a lognormal mixture with parameters meanlog1=0, sdlog1=1,
# meanlog2=2, sdlog2=3, p.mix=0.2:

qlnormMix(0.5, 0, 1, 2, 3, 0.2)
#[1] 1.156891

#----------

# Random sample of 3 observations from a lognormal mixture with
# parameters meanlog1=0, sdlog1=1, meanlog2=2, sdlog2=3, p.mix=0.2.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnormMix(3, 0, 1, 2, 3, 0.2)
#[1] 0.08975283 1.07591103 7.85482514
Density, distribution function, quantile function, and random generation
for a mixture of two lognormal distributions with parameters
mean1, cv1, mean2, cv2, and p.mix.
dlnormMixAlt(x, mean1 = exp(1/2), cv1 = sqrt(exp(1) - 1),
    mean2 = exp(1/2), cv2 = sqrt(exp(1) - 1), p.mix = 0.5)
plnormMixAlt(q, mean1 = exp(1/2), cv1 = sqrt(exp(1) - 1),
    mean2 = exp(1/2), cv2 = sqrt(exp(1) - 1), p.mix = 0.5)
qlnormMixAlt(p, mean1 = exp(1/2), cv1 = sqrt(exp(1) - 1),
    mean2 = exp(1/2), cv2 = sqrt(exp(1) - 1), p.mix = 0.5)
rlnormMixAlt(n, mean1 = exp(1/2), cv1 = sqrt(exp(1) - 1),
    mean2 = exp(1/2), cv2 = sqrt(exp(1) - 1), p.mix = 0.5)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean1 |
vector of means of the first lognormal random variable. The default is |
cv1 |
vector of coefficients of variation of the first lognormal random variable.
The default is |
mean2 |
vector of means of the second lognormal random variable. The default is |
cv2 |
vector of coefficients of variation of the second lognormal random variable.
The default is |
p.mix |
vector of probabilities between 0 and 1 indicating the mixing proportion.
For |
Let f(x; θ, τ) denote the density of a lognormal random variable with
parameters mean=θ and cv=τ. The density, h(x), of a lognormal mixture
random variable with parameters mean1=θ₁, cv1=τ₁, mean2=θ₂, cv2=τ₂, and
p.mix=p is given by:

h(x) = (1 − p) f(x; θ₁, τ₁) + p f(x; θ₂, τ₂)
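Here is a minimal sketch of this definition (it assumes the EnvStats package is installed and loaded), expressing the mixture density as a weighted sum of two dlnormAlt() densities, with p.mix weighting the second component:

library(EnvStats)

x <- 1.5
p.mix <- 0.2

# Both expressions should return the same density value
dlnormMixAlt(x, mean1 = 2, cv1 = 3, mean2 = 4, cv2 = 5, p.mix = p.mix)
(1 - p.mix) * dlnormAlt(x, mean = 2, cv = 3) + p.mix * dlnormAlt(x, mean = 4, cv = 5)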
The default values for mean1
and cv1
correspond to a
lognormal distribution with parameters
meanlog=0
and sdlog=1
. Similarly for the default values
of mean2
and cv2
.
dlnormMixAlt
gives the density, plnormMixAlt
gives the distribution
function, qlnormMixAlt
gives the quantile function, and
rlnormMixAlt
generates random deviates.
A lognormal mixture distribution is often used to model positive-valued data
that appear to be “contaminated”; that is, most of the values appear to
come from a single lognormal distribution, but a few “outliers” are
apparent. In this case, the value of mean2
would be larger than the
value of mean1
, and the mixing proportion p.mix
would be fairly
close to 0 (e.g., p.mix=0.1
).
Steven P. Millard ([email protected])
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.53-54, and Chapter 8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
LognormalAlt, LognormalMix, Lognormal, NormalMix, Probability Distributions and Random Numbers.
# Density of a lognormal mixture with parameters mean1=2, cv1=3,
# mean2=4, cv2=5, p.mix=0.5, evaluated at 1.5:

dlnormMixAlt(1.5, mean1 = 2, cv1 = 3, mean2 = 4, cv2 = 5, p.mix = 0.5)
#[1] 0.1436045

#----------

# The cdf of a lognormal mixture with parameters mean1=2, cv1=3,
# mean2=4, cv2=5, p.mix=0.5, evaluated at 1.5:

plnormMixAlt(1.5, mean1 = 2, cv1 = 3, mean2 = 4, cv2 = 5, p.mix = 0.5)
#[1] 0.6778064

#----------

# The median of a lognormal mixture with parameters mean1=2, cv1=3,
# mean2=4, cv2=5, p.mix=0.5:

qlnormMixAlt(0.5, 2, 3, 4, 5, 0.5)
#[1] 0.6978355

#----------

# Random sample of 3 observations from a lognormal mixture with
# parameters mean1=2, cv1=3, mean2=4, cv2=5, p.mix=0.5.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnormMixAlt(3, 2, 3, 4, 5, 0.5)
#[1]  0.70672151 14.43226313  0.05521329
Density, distribution function, quantile function, and random generation
for the truncated lognormal distribution with parameters meanlog, sdlog,
min, and max.
dlnormTrunc(x, meanlog = 0, sdlog = 1, min = 0, max = Inf)
plnormTrunc(q, meanlog = 0, sdlog = 1, min = 0, max = Inf)
qlnormTrunc(p, meanlog = 0, sdlog = 1, min = 0, max = Inf)
rlnormTrunc(n, meanlog = 0, sdlog = 1, min = 0, max = Inf)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
meanlog |
vector of means of the distribution of the non-truncated random variable
on the log scale.
The default is |
sdlog |
vector of (positive) standard deviations of the non-truncated random variable
on the log scale.
The default is |
min |
vector of minimum values for truncation on the left. The default value is
|
max |
vector of maximum values for truncation on the right. The default value is
|
See the help file for the lognormal distribution for information about the density and cdf of a lognormal distribution.
Probability Density and Cumulative Distribution Function
Let X denote a random variable with density function f(x) and cumulative
distribution function F(x), and let Y denote the truncated version of X,
where Y is truncated below at min=a and above at max=b. Then the density
function of Y, denoted g(y), is given by:

g(y) = f(y) / [F(b) − F(a)],   for a ≤ y ≤ b

and the cdf of Y, denoted G(y), is given by:

G(y) = 0                                for y < a
G(y) = [F(y) − F(a)] / [F(b) − F(a)]    for a ≤ y ≤ b
G(y) = 1                                for y > b

Quantiles

The p'th quantile y_p of Y is given by:

y_p = a                                  for p = 0
y_p = F⁻¹{F(a) + p [F(b) − F(a)]}        for 0 < p < 1
y_p = b                                  for p = 1

Random Numbers

Random numbers are generated using the inverse transformation method:

y = G⁻¹(u)

where u is a random deviate from a Uniform(0, 1) distribution.
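As a minimal sketch of these formulas (it assumes the EnvStats package is installed and loaded), the truncated density rescales the untruncated density by the probability mass between min and max, and quantiles come from the inverse-cdf relationship:

library(EnvStats)

meanlog <- 1; sdlog <- 0.75; lo <- 0; hi <- 10

# Density: rescale the untruncated density by the mass between lo and hi
dlnormTrunc(2, meanlog, sdlog, lo, hi)
dlnorm(2, meanlog, sdlog) / (plnorm(hi, meanlog, sdlog) - plnorm(lo, meanlog, sdlog))

# Quantile: apply the untruncated quantile function to a rescaled probability
p <- 0.5
qlnormTrunc(p, meanlog, sdlog, lo, hi)
qlnorm(plnorm(lo, meanlog, sdlog) +
       p * (plnorm(hi, meanlog, sdlog) - plnorm(lo, meanlog, sdlog)),
       meanlog, sdlog)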
dlnormTrunc
gives the density, plnormTrunc
gives the distribution function,
qlnormTrunc
gives the quantile function, and rlnormTrunc
generates random
deviates.
A truncated lognormal distribution is sometimes used as an input distribution for probabilistic risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, Chapter 2.
Lognormal, Probability Distributions and Random Numbers.
# Density of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10, evaluated at 2 and 4: dlnormTrunc(c(2, 4), 1, 0.75, 0, 10) #[1] 0.2551219 0.1214676 #---------- # The cdf of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10, evaluated at 2 and 4: plnormTrunc(c(2, 4), 1, 0.75, 0, 10) #[1] 0.3558867 0.7266934 #---------- # The median of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10: qlnormTrunc(.5, 1, 0.75, 0, 10) #[1] 2.614945 #---------- # A random sample of 3 observations from a truncated lognormal distribution # with parameters meanlog=1, sdlog=0.75, min=0, max=10. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) rlnormTrunc(3, 1, 0.75, 0, 10) #[1] 5.754805 4.372218 1.706815
# Density of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10, evaluated at 2 and 4: dlnormTrunc(c(2, 4), 1, 0.75, 0, 10) #[1] 0.2551219 0.1214676 #---------- # The cdf of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10, evaluated at 2 and 4: plnormTrunc(c(2, 4), 1, 0.75, 0, 10) #[1] 0.3558867 0.7266934 #---------- # The median of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10: qlnormTrunc(.5, 1, 0.75, 0, 10) #[1] 2.614945 #---------- # A random sample of 3 observations from a truncated lognormal distribution # with parameters meanlog=1, sdlog=0.75, min=0, max=10. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) rlnormTrunc(3, 1, 0.75, 0, 10) #[1] 5.754805 4.372218 1.706815
Density, distribution function, quantile function, and random generation
for the truncated lognormal distribution with parameters mean, cv,
min, and max.
dlnormTruncAlt(x, mean = exp(1/2), cv = sqrt(exp(1) - 1), min = 0, max = Inf)
plnormTruncAlt(q, mean = exp(1/2), cv = sqrt(exp(1) - 1), min = 0, max = Inf)
qlnormTruncAlt(p, mean = exp(1/2), cv = sqrt(exp(1) - 1), min = 0, max = Inf)
rlnormTruncAlt(n, mean = exp(1/2), cv = sqrt(exp(1) - 1), min = 0, max = Inf)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean |
vector of means of the distribution of the non-truncated random variable.
The default is |
cv |
vector of (positive) coefficients of variation of the non-truncated random variable.
The default is |
min |
vector of minimum values for truncation on the left. The default value is
|
max |
vector of maximum values for truncation on the right. The default value is
|
See the help file for LognormalAlt for information about the density and cdf of a lognormal distribution with this alternative parameterization.
Let X denote a random variable with density function f(x) and cumulative
distribution function F(x), and let Y denote the truncated version of X,
where Y is truncated below at min=a and above at max=b. Then the density
function of Y, denoted g(y), is given by:

g(y) = f(y) / [F(b) − F(a)],   for a ≤ y ≤ b

and the cdf of Y, denoted G(y), is given by:

G(y) = 0                                for y < a
G(y) = [F(y) − F(a)] / [F(b) − F(a)]    for a ≤ y ≤ b
G(y) = 1                                for y > b

The p'th quantile y_p of Y is given by:

y_p = a                                  for p = 0
y_p = F⁻¹{F(a) + p [F(b) − F(a)]}        for 0 < p < 1
y_p = b                                  for p = 1

Random numbers are generated using the inverse transformation method:

y = G⁻¹(u)

where u is a random deviate from a Uniform(0, 1) distribution.
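Here is a minimal sketch relating this parameterization to the meanlog/sdlog one (it assumes the EnvStats package is installed and loaded): converting mean and cv of the non-truncated distribution to meanlog and sdlog and calling dlnormTrunc() should reproduce dlnormTruncAlt():

library(EnvStats)

mn <- 10; tau <- 1          # mean and cv of the *non-truncated* distribution
sdlog <- sqrt(log(1 + tau^2))
meanlog <- log(mn) - sdlog^2 / 2

# Both calls should return the same density value
dlnormTruncAlt(12, mean = mn, cv = tau, min = 0, max = 20)
dlnormTrunc(12, meanlog = meanlog, sdlog = sdlog, min = 0, max = 20)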
dlnormTruncAlt
gives the density, plnormTruncAlt
gives the distribution function,
qlnormTruncAlt
gives the quantile function, and rlnormTruncAlt
generates random
deviates.
A truncated lognormal distribution is sometimes used as an input distribution for probabilistic risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, Chapter 2.
LognormalAlt, Probability Distributions and Random Numbers.
# Density of a truncated lognormal distribution with parameters
# mean=10, cv=1, min=0, max=20, evaluated at 2 and 12:

dlnormTruncAlt(c(2, 12), 10, 1, 0, 20)
#[1] 0.08480874 0.03649884

#----------

# The cdf of a truncated lognormal distribution with parameters
# mean=10, cv=1, min=0, max=20, evaluated at 2 and 12:

plnormTruncAlt(c(2, 12), 10, 1, 0, 20)
#[1] 0.07230627 0.82467603

#----------

# The median of a truncated lognormal distribution with parameters
# mean=10, cv=1, min=0, max=20:

qlnormTruncAlt(.5, 10, 1, 0, 20)
#[1] 6.329505

#----------

# A random sample of 3 observations from a truncated lognormal distribution
# with parameters mean=10, cv=1, min=0, max=20.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnormTruncAlt(3, 10, 1, 0, 20)
#[1]  6.685391 17.445387 18.543553
Given a data frame or matrix in long format, convert it to wide format based on
the levels of two variables in the data frame. This is a simplified version of
the R function reshape with the argument direction="wide".
longToWide(x, data.var, row.var, col.var,
    row.labels = levels(factor(x[, row.var])),
    col.labels = levels(factor(x[, col.var])),
    paste.row.name = FALSE, paste.col.name = FALSE, sep = ".",
    check.names = FALSE, ...)
x |
data frame or matrix to convert to wide format. Must have at least 3 columns corresponding to the data variable, row variable, and column variable, respectively. |
data.var |
character string or numeric scalar indicating column variable name in |
row.var |
character string or numeric scalar indicating column variable name in |
col.var |
character string or numeric scalar indicating column variable name in |
row.labels |
optional character vector indicating labels to use for rows. The default value is the levels
of the variable indicated by |
col.labels |
optional character vector indicating labels to use for columns. The default value is the levels
of the variable indicated by |
paste.row.name |
logical scalar indicating whether to paste the name of the variable used to define the row names
(i.e., the value of |
paste.col.name |
logical scalar indicating whether to paste the name of the variable used to define the column names
(i.e., the value of |
sep |
character string separator used when |
check.names |
argument to |
... |
other arguments to |
The combination of values in x[, row.var] and x[, col.var] must yield
n unique values, where n is the number of rows in x.
longToWide
returns a matrix when x
is a matrix and a data frame when x
is a data frame. The number of rows is equal to the number of
unique values in x[, row.var]
and the number of columns is equal to the number of
unique values in x[, col.var]
.
Steven P. Millard ([email protected]), based on a template from Phil Dixon.
EPA.09.Ex.10.1.nickel.df
#   Month   Well Nickel.ppb
#1      1 Well.1       58.8
#2      3 Well.1        1.0
#3      6 Well.1      262.0
#4      8 Well.1       56.0
#5     10 Well.1        8.7
#6      1 Well.2       19.0
#7      3 Well.2       81.5
#8      6 Well.2      331.0
#9      8 Well.2       14.0
#10    10 Well.2       64.4
#11     1 Well.3       39.0
#12     3 Well.3      151.0
#13     6 Well.3       27.0
#14     8 Well.3       21.4
#15    10 Well.3      578.0
#16     1 Well.4        3.1
#17     3 Well.4      942.0
#18     6 Well.4       85.6
#19     8 Well.4       10.0
#20    10 Well.4      637.0

longToWide(EPA.09.Ex.10.1.nickel.df, "Nickel.ppb", "Month", "Well",
    paste.row.name = TRUE)
#         Well.1 Well.2 Well.3 Well.4
#Month.1    58.8   19.0   39.0    3.1
#Month.3     1.0   81.5  151.0  942.0
#Month.6   262.0  331.0   27.0   85.6
#Month.8    56.0   14.0   21.4   10.0
#Month.10    8.7   64.4  578.0  637.0
Copper and zinc concentrations (mg/L) in shallow ground water from two geological
zones (Alluvial Fan and Basin-Trough) in the San Joaquin Valley, CA. There are 68
samples from the Alluvial Fan zone and 50 from the Basin-Trough zone. Some
observations are reported as <DL, where DL denotes a detection limit. There
are multiple detection limits for both the copper and zinc data in each of the
geological zones.
Millard.Deverel.88.df
A data frame with 118 observations on the following 8 variables.
Cu.orig
a character vector of original copper concentrations (mg/L)
Cu
a numeric vector of copper concentrations with nondetects coded to their detection limit
Cu.censored
a logical vector indicating which copper concentrations are censored
Zn.orig
a character vector of original zinc concentrations (mg/L)
Zn
a numeric vector of zinc concentrations with nondetects coded to their detection limit
Zn.censored
a logical vector indicating which zinc concentrations are censored
Zone
a factor indicating the zone (alluvial fan vs. basin trough)
Location
a numeric vector indicating the sampling location
Millard, S.P., and S.J. Deverel. (1988). Nonparametric Statistical Methods for Comparing Two Sites Based on Data With Multiple Nondetect Limits. Water Resources Research, 24(12), 2087-2098.
Deverel, S.J., R.J. Gilliom, R. Fujii, J.A. Izbicki, and J.C. Fields. (1984). Areal Distribution of Selenium and Other Inorganic Constituents in Shallow Ground Water of the San Luis Drain Service Area, San Joaquin, California: A Preliminary Study. U.S. Geological Survey Water Resources Investigative Report 84-4319.
Artificial 1,2,3,4-Tetrachlorobenzene (TcCB) concentrations with censored values;
based on the reference area data stored in EPA.94b.tccb.df
. The data
frame EPA.94b.tccb.df
contains TcCB concentrations (ppb) in soil samples
at a reference area and a cleanup area. The data frame Modified.TcCB.df
contains a modified version of the data from the reference area.
For this data set, the concentrations of TcCB less than 0.5 ppb have been recoded as
<0.5
.
Modified.TcCB.df
A data frame with 47 observations on the following 3 variables.
TcCB.orig
a character vector of original TcCB concentrations (ppb)
TcCB
a numeric vector with censored observations set to their detection level
Censored
a logical vector indicating which observations are censored
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, p.595.
USEPA. (1994b). Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media. EPA/230-R-94-004. Office of Policy, Planning, and Evaluation, U.S. Environmental Protection Agency, Washington, D.C.
Show the NEWS file of the EnvStats package.
newsEnvStats()
The function newsEnvStats
displays the contents of the EnvStats NEWS file in a
separate text window. You can also access the NEWS file with the command
news(package="EnvStats")
, which returns the contents of the file to the R command
window.
None.
Steven P. Millard ([email protected])
news.
Air lead levels collected by the National Institute for Occupational Safety and Health (NIOSH) at 15 different areas within the Alma American Labs, Fairplay, CO, for health hazard evaluation (HETA 89-052) on February 23, 1989.
NIOSH.89.air.lead.vec
A numeric vector with 15 elements containing air lead concentrations (µg/m³).
Krishnamoorthy, K., T. Matthew, and G. Ramachandran. (2006). Generalized P-Values and Confidence Intervals: A Novel Approach for Analyzing Lognormally Distributed Exposure Data. Journal of Occupational and Environmental Hygiene, 3, 642–650.
Zou, G.Y., C.Y. Huo, and J. Taleban. (2009). Simple Confidence Intervals for Lognormal Means and their Differences with Environmental Applications. Environmetrics, 20, 172–180.
Density, distribution function, quantile function, and random generation
for a mixture of two normal distributions with parameters
mean1, sd1, mean2, sd2, and p.mix.
dnormMix(x, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1, p.mix = 0.5)
pnormMix(q, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1, p.mix = 0.5)
qnormMix(p, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1, p.mix = 0.5)
rnormMix(n, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1, p.mix = 0.5)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean1 |
vector of means of the first normal random variable.
The default is |
sd1 |
vector of standard deviations of the first normal random variable.
The default is |
mean2 |
vector of means of the second normal random variable.
The default is |
sd2 |
vector of standard deviations of the second normal random variable.
The default is |
p.mix |
vector of probabilities between 0 and 1 indicating the mixing proportion.
For |
Let f(x; μ, σ) denote the density of a normal random variable with
parameters mean=μ and sd=σ. The density, h(x), of a normal mixture random
variable with parameters mean1=μ₁, sd1=σ₁, mean2=μ₂, sd2=σ₂, and p.mix=p is
given by:

h(x) = (1 − p) f(x; μ₁, σ₁) + p f(x; μ₂, σ₂)
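As a minimal sketch of this definition (it assumes the EnvStats package is installed and loaded), the mixture cdf is the corresponding weighted sum of two base R pnorm() values, so the cdf example below can be reproduced by hand:

library(EnvStats)

# Both expressions should return the same probability
pnormMix(15, 10, 2, 20, 2, 0.1)
(1 - 0.1) * pnorm(15, 10, 2) + 0.1 * pnorm(15, 20, 2)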
dnormMix
gives the density, pnormMix
gives the distribution function,
qnormMix
gives the quantile function, and rnormMix
generates random
deviates.
A normal mixture distribution is sometimes used to model data
that appear to be “contaminated”; that is, most of the values appear to
come from a single normal distribution, but a few “outliers” are
apparent. In this case, the value of mean2
would be larger than the
value of mean1
, and the mixing proportion p.mix
would be fairly
close to 0 (e.g., p.mix=0.1
). The value of the second standard deviation
(sd2
) may or may not be the same as the value for the first
(sd1
).
Another application of the normal mixture distribution is to bi-modal data; that is, data exhibiting two modes.
Steven P. Millard ([email protected])
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.53-54, and Chapter 8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Normal, LognormalMix, Probability Distributions and Random Numbers.
# Density of a normal mixture with parameters mean1=0, sd1=1,
# mean2=4, sd2=2, p.mix=0.5, evaluated at 1.5:

dnormMix(1.5, mean2=4, sd2=2)
#[1] 0.1104211

#----------

# The cdf of a normal mixture with parameters mean1=10, sd1=2,
# mean2=20, sd2=2, p.mix=0.1, evaluated at 15:

pnormMix(15, 10, 2, 20, 2, 0.1)
#[1] 0.8950323

#----------

# The median of a normal mixture with parameters mean1=10, sd1=2,
# mean2=20, sd2=2, p.mix=0.1:

qnormMix(0.5, 10, 2, 20, 2, 0.1)
#[1] 10.27942

#----------

# Random sample of 3 observations from a normal mixture with
# parameters mean1=0, sd1=1, mean2=4, sd2=2, p.mix=0.5.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rnormMix(3, mean2=4, sd2=2)
#[1] 0.07316778 2.06112801 1.05953620
Density, distribution function, quantile function, and random generation
for the truncated normal distribution with parameters mean, sd,
min, and max.
dnormTrunc(x, mean = 0, sd = 1, min = -Inf, max = Inf)
pnormTrunc(q, mean = 0, sd = 1, min = -Inf, max = Inf)
qnormTrunc(p, mean = 0, sd = 1, min = -Inf, max = Inf)
rnormTrunc(n, mean = 0, sd = 1, min = -Inf, max = Inf)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean |
vector of means of the distribution of the non-truncated random variable.
The default is |
sd |
vector of (positive) standard deviations of the non-truncated random variable.
The default is |
min |
vector of minimum values for truncation on the left. The default value is
|
max |
vector of maximum values for truncation on the right. The default value is
|
See the help file for the normal distribution for information about the density and cdf of a normal distribution.
Probability Density and Cumulative Distribution Function
Let X denote a random variable with density function f(x) and cumulative
distribution function F(x), and let Y denote the truncated version of X,
where Y is truncated below at min=a and above at max=b. Then the density
function of Y, denoted g(y), is given by:

g(y) = f(y) / [F(b) − F(a)],   for a ≤ y ≤ b

and the cdf of Y, denoted G(y), is given by:

G(y) = 0                                for y < a
G(y) = [F(y) − F(a)] / [F(b) − F(a)]    for a ≤ y ≤ b
G(y) = 1                                for y > b

Quantiles

The p'th quantile y_p of Y is given by:

y_p = a                                  for p = 0
y_p = F⁻¹{F(a) + p [F(b) − F(a)]}        for 0 < p < 1
y_p = b                                  for p = 1

Random Numbers

Random numbers are generated using the inverse transformation method:

y = G⁻¹(u)

where u is a random deviate from a Uniform(0, 1) distribution.

Mean and Variance

The expected value of a truncated normal random variable with parameters
mean=μ, sd=σ, min=a, and max=b is given by:

E(Y) = μ + σ [φ(a') − φ(b')] / [Φ(b') − Φ(a')]

where φ and Φ denote the density and cdf of the standard normal distribution,
and a' = (a − μ)/σ, b' = (b − μ)/σ
(Johnson et al., 1994, p.156; Schneider, 1986, p.17).

The variance of this random variable is given by:

Var(Y) = σ² { 1 + [a' φ(a') − b' φ(b')] / [Φ(b') − Φ(a')]
              − ( [φ(a') − φ(b')] / [Φ(b') − Φ(a')] )² }

(Johnson et al., 1994, p.158; Schneider, 1986, p.17).
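Here is a minimal sketch (it assumes the EnvStats package is installed and loaded) that checks the expected-value formula against a direct numerical integration of y * g(y) using dnormTrunc():

library(EnvStats)

mu <- 10; sigma <- 2; a <- 8; b <- 13
za <- (a - mu) / sigma
zb <- (b - mu) / sigma

# Closed-form mean of the truncated normal
mu + sigma * (dnorm(za) - dnorm(zb)) / (pnorm(zb) - pnorm(za))

# Numerical check: integrate y * g(y) over [a, b]
integrate(function(y) y * dnormTrunc(y, mu, sigma, a, b), lower = a, upper = b)$value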
dnormTrunc
gives the density, pnormTrunc
gives the distribution function,
qnormTrunc
gives the quantile function, and rnormTrunc
generates random
deviates.
A truncated normal distribution is sometimes used as an input distribution for probabilistic risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, Chapter 2.
Normal, Probability Distributions and Random Numbers.
# Density of a truncated normal distribution with parameters
# mean=10, sd=2, min=8, max=13, evaluated at 10 and 11.5:

dnormTrunc(c(10, 11.5), 10, 2, 8, 13)
#[1] 0.2575358 0.1943982

#----------

# The cdf of a truncated normal distribution with parameters
# mean=10, sd=2, min=8, max=13, evaluated at 10 and 11.5:

pnormTrunc(c(10, 11.5), 10, 2, 8, 13)
#[1] 0.4407078 0.7936573

#----------

# The median of a truncated normal distribution with parameters
# mean=10, sd=2, min=8, max=13:

qnormTrunc(.5, 10, 2, 8, 13)
#[1] 10.23074

#----------

# A random sample of 3 observations from a truncated normal distribution
# with parameters mean=10, sd=2, min=8, max=13.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rnormTrunc(3, 10, 2, 8, 13)
#[1] 11.975223 11.373711  9.361258
Ammonium (NH4) concentration (mg/L) in precipitation measured at
Olympic National Park, Hoh Ranger Station (WA14), weekly or every other week
from January 6, 2009 through December 20, 2011.
Olympic.NH4.df
A data frame with 102 observations on the following 6 variables.
Date.On
Start of collection period. Date on which the sample bucket was installed on the collector.
Date.Off
End of collection period. Date on which the sample bucket was removed from the collector.
Week
a numeric vector indicating the cumulative week number starting from January 1, 2009.
NH4.Orig.mg.per.L
a character vector of the original NH4
concentrations reported either as the observed value or less than some
detection limit. For values reported as less than a detection limit,
the value reported is the actual limit of detection or, in the case of a
diluted sample, the product of the detection limit value and the
dilution factor.
NH4.mg.per.L
a numeric vector of NH4 concentrations with
non-detects coded to their detection limit.
Censored
a logical vector indicating which observations are censored.
Olympic National Park-Hoh Ranger Station (WA14)
Jefferson County, Washington
Latitude: 47.8597
Longitude: -123.9325
Elevation: 182 meters
Owl Mountain
Olympic National Park
NPS-Air Resources Division
National Atmospheric Deposition Program, National Trends Network (NADP/NTN).
https://nadp.slh.wisc.edu/sites/ntn-WA14/
Perform Fisher's one-sample randomization (permutation) test for location.
oneSamplePermutationTest(x, alternative = "two.sided", mu = 0,
    exact = FALSE, n.permutations = 5000, seed = NULL, ...)
x |
numeric vector of observations.
Missing ( |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
mu |
numeric scalar indicating the hypothesized value of the mean.
The default value is |
exact |
logical scalar indicating whether to perform the exact permutation test
(i.e., enumerate all possible permutations) or simply sample from the permutation
distribution. The default value is |
n.permutations |
integer indicating how many times to sample from the permutation distribution when
|
seed |
positive integer to pass to the R function |
... |
arguments that can be supplied to the |
Randomization Tests
In 1935, R.A. Fisher introduced the idea of a randomization test
(Manly, 2007, p. 107; Efron and Tibshirani, 1993, Chapter 15), which is based on
trying to answer the question: “Did the observed pattern happen by chance,
or does the pattern indicate the null hypothesis is not true?” A randomization
test works by simply enumerating all of the possible outcomes under the null
hypothesis, then seeing where the observed outcome fits in. A randomization test
is also called a permutation test, because it involves permuting the
observations during the enumeration procedure (Manly, 2007, p. 3).
In the past, randomization tests have not been used as extensively as they are now
because of the “large” computing resources needed to enumerate all of the
possible outcomes, especially for large sample sizes. The advent of more powerful
personal computers and software has allowed randomization tests to become much
easier to perform. Depending on the sample size, however, it may still be too
time consuming to enumerate all possible outcomes. In this case, the randomization
test can still be performed by sampling from the randomization distribution, and
comparing the observed outcome to this sampled permutation distribution.
Fisher's One-Sample Randomization Test for Location
Let x = (x₁, x₂, …, xₙ) be a vector of n independent
and identically distributed (i.i.d.) observations from some symmetric distribution
with mean μ. Consider the test of the null hypothesis that the mean μ
is equal to some specified value μ₀:

H₀: μ = μ₀        (1)

The three possible alternative hypotheses are the upper one-sided alternative
(alternative="greater")

Hₐ: μ > μ₀        (2)

the lower one-sided alternative (alternative="less")

Hₐ: μ < μ₀        (3)

and the two-sided alternative

Hₐ: μ ≠ μ₀        (4)

To perform the test of the null hypothesis (1) versus any of the three alternatives (2)-(4), Fisher proposed using the test statistic

T = Σ yᵢ  (i = 1, 2, …, n)

where

yᵢ = xᵢ − μ₀

(Manly, 2007, p. 112). The test assumes all of the observations come from the
same distribution that is symmetric about the true population mean
(hence the mean is the same as the median for this distribution).

Under the null hypothesis, the yᵢ's are equally likely to be positive or
negative. Therefore, the permutation distribution of the test statistic T
consists of enumerating all possible ways of permuting the signs of the
yᵢ's and computing the resulting sums. For n observations, there are 2ⁿ
possible permutations of the signs, because each observation can either
be positive or negative.

For a one-sided upper alternative hypothesis (Equation (2)), the p-value is computed
as the proportion of sums in the permutation distribution that are greater than or
equal to the observed sum T. For a one-sided lower alternative hypothesis
(Equation (3)), the p-value is computed as the proportion of sums in the permutation
distribution that are less than or equal to the observed sum T. For a
two-sided alternative hypothesis (Equation (4)), the p-value is computed by using
the permutation distribution of the absolute value of T (i.e., |T|)
and computing the proportion of values in this permutation distribution that are
greater than or equal to the observed value of |T|.
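The following minimal sketch (with made-up data, for illustration only; it does not use EnvStats) enumerates the exact permutation distribution described above for a small sample and computes the upper one-sided p-value:

# Hypothetical data and hypothesized mean (for illustration only)
x <- c(3.2, 5.1, 7.4, 6.0, 4.8)
mu0 <- 4

y <- x - mu0
n <- length(y)

# All 2^n assignments of signs to the y_i's
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))

# Permutation distribution of the test statistic T = sum(y_i)
perm.sums <- as.vector(signs %*% y)

# Upper one-sided p-value: proportion of permuted sums >= observed sum
obs <- sum(y)
mean(perm.sums >= obs)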
Confidence Intervals Based on Permutation Tests
Based on the relationship between hypothesis tests and confidence intervals, it is
possible to construct a two-sided or one-sided (1 − α)100% confidence
interval for the mean μ based on the one-sample permutation test by finding
the values of μ₀ that correspond to obtaining a p-value of α
(Manly, 2007, pp. 18–20, 113). A confidence interval based on the bootstrap,
however, will yield a similar type of confidence interval
(Efron and Tibshirani, 1993, p. 214); see the help file for
boot in the R package boot.
A list of class "permutationTest"
containing the results of the hypothesis
test. See the help file for permutationTest.object
for details.
A frequent question in environmental statistics is “Is the concentration of chemical X greater than Y units?”. For example, in groundwater assessment (compliance) monitoring at hazardous and solid waste sites, the concentration of a chemical in the groundwater at a downgradient well must be compared to a groundwater protection standard (GWPS). If the concentration is “above” the GWPS, then the site enters corrective action monitoring. As another example, soil screening at a Superfund site involves comparing the concentration of a chemical in the soil with a pre-determined soil screening level (SSL). If the concentration is “above” the SSL, then further investigation and possible remedial action is required. Determining what it means for the chemical concentration to be “above” a GWPS or an SSL is a policy decision: it may mean that the mean of the concentration distribution is above the standard, that the median is above it, that the 95th percentile is above it, or something else. Often, the first interpretation is used.
Hypothesis tests you can use to perform tests of location include: Student's t-test, Fisher's randomization test, the Wilcoxon signed rank test, Chen's modified t-test, the sign test, and a test based on a bootstrap confidence interval. For a discussion comparing the performance of these tests, see Millard and Neerchal (2001, pp.408-409).
Steven P. Millard ([email protected])
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, pp.224–227.
Manly, B.F.J. (2007). Randomization, Bootstrap and Monte Carlo Methods in Biology. Third Edition. Chapman & Hall, New York, pp.112-113.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.404–406.
permutationTest.object, Hypothesis Tests, boot.
# Generate 10 observations from a logistic distribution with parameters # location=7 and scale=2, and test the null hypothesis that the true mean # is equal to 5 against the alternative that the true mean is greater than 5. # Use the exact permutation distribution. # (Note: the call to set.seed() allows you to reproduce this example). set.seed(23) dat <- rlogis(10, location = 7, scale = 2) test.list <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) # Print the results of the test #------------------------------ test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 # Plot the results of the test #----------------------------- dev.new() plot(test.list) #========== # The guidance document "Supplemental Guidance to RAGS: Calculating the # Concentration Term" (USEPA, 1992d) contains an example of 15 observations # of chromium concentrations (mg/kg) which are assumed to come from a # lognormal distribution. These data are stored in the vector # EPA.92d.chromium.vec. Here, we will use the permutation test to test # the null hypothesis that the mean (median) of the log-transformed chromium # concentrations is less than or equal to log(100 mg/kg) vs. the alternative # that it is greater than log(100 mg/kg). Note that we *cannot* use the # permutation test to test a hypothesis about the mean on the original scale # because the data are not assumed to be symmetric about some mean, they are # assumed to come from a lognormal distribution. # # We will sample from the permutation distribution. # (Note: setting the argument seed=542 allows you to reproduce this example). test.list <- oneSamplePermutationTest(log(EPA.92d.chromium.vec), mu = log(100), alternative = "greater", seed = 542) test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 4.60517 # #Alternative Hypothesis: True Mean (Median) is greater than 4.60517 # #Test Name: One-Sample Permutation Test # (Based on Sampling # Permutation Distribution # 5000 Times) # #Estimated Parameter(s): Mean = 4.378636 # #Data: log(EPA.92d.chromium.vec) # #Sample Size: 15 # #Test Statistic: Sum(x - 4.60517) = -3.398017 # #P-value: 0.7598 # Plot the results of the test #----------------------------- dev.new() plot(test.list) #---------- # Clean up #--------- rm(test.list) graphics.off()
Ozone concentrations in 41 U.S. cities based on daily maxima collected between June and August 1974.
Ozone.NE.df
A data frame with 41 observations on the following 5 variables.
Median
median of the daily maximum ozone concentrations (ppb).
Quartile
upper quartile (i.e., 75th percentile) of the daily maximum ozone concentrations (ppb).
City
a factor indicating the city
Longitude
negative longitude of the city
Latitude
latitude of the city
Cleveland, W.S., Kleiner, B., McRae, J.E., Warner, J.L., and Pasceri, P.E. (1975). The Analysis of Ground-Level Ozone Data from New Jersey, New York, Connecticut, and Massachusetts: Data Quality Assessment and Temporal and Geographical Properties. Bell Laboratories Memorandum.
The original data were collected by the New Jersey Department of Environmental Protection, the New York State Department of Environmental Protection, the Boyce Thompson Institute (Yonkers, for New York data), the Connecticut Department of Environmental Protection, and the Massachusetts Department of Public Health.
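As a brief illustration (not part of the original help file), the sketch below plots the median daily maximum ozone concentration against latitude for the 41 cities, using only the documented columns of Ozone.NE.df.

# Median daily maximum ozone (ppb) versus latitude for the 41 cities
with(Ozone.NE.df,
  plot(Latitude, Median, pch = 16,
    xlab = "Latitude (degrees N)",
    ylab = "Median Daily Maximum Ozone (ppb)",
    main = "Ozone.NE.df: Median Ozone vs. Latitude"))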
summary(Ozone.NE.df) # Median Quartile City Longitude # Min. : 34.00 Min. : 48.00 Asbury Park: 1 Min. :-74.71 # 1st Qu.: 58.00 1st Qu.: 79.75 Babylon : 1 1st Qu.:-73.74 # Median : 65.00 Median : 90.00 Bayonne : 1 Median :-73.17 # Mean : 68.15 Mean : 95.10 Boston : 1 Mean :-72.94 # 3rd Qu.: 80.00 3rd Qu.:112.25 Bridgeport : 1 3rd Qu.:-72.08 # Max. :100.00 Max. :145.00 Cambridge : 1 Max. :-71.05 # NA's : 1.00 (Other) :35 # Latitude # Min. :40.22 # 1st Qu.:40.97 # Median :41.56 # Mean :41.60 # 3rd Qu.:42.25 # Max. :43.32
Density, distribution function, quantile function, and random generation
for the Pareto distribution with parameters location
and shape
.
dpareto(x, location, shape = 1) ppareto(q, location, shape = 1) qpareto(p, location, shape = 1) rpareto(n, location, shape = 1)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
location |
vector of (positive) location parameters. |
shape |
vector of (positive) shape parameters. The default is |
Let $X$ be a Pareto random variable with parameters location=$\eta$ and shape=$\theta$.
The density function of $X$ is given by:

$$f(x; \eta, \theta) = \frac{\theta \eta^{\theta}}{x^{\theta + 1}}, \quad x \ge \eta, \; \eta > 0, \; \theta > 0$$

The cumulative distribution function of $X$ is given by:

$$F(x; \eta, \theta) = 1 - \left( \frac{\eta}{x} \right)^{\theta}$$

and the $p$'th quantile of $X$ is given by:

$$x_p = \eta (1 - p)^{-1/\theta}, \quad 0 \le p \le 1$$

The mode, mean, median, variance, and coefficient of variation of $X$ are given by:

$$Mode(X) = \eta$$

$$E(X) = \frac{\theta \eta}{\theta - 1}, \quad \theta > 1$$

$$Median(X) = \eta \, 2^{1/\theta}$$

$$Var(X) = \frac{\theta \eta^2}{(\theta - 1)^2 (\theta - 2)}, \quad \theta > 2$$

$$CV(X) = \left[ \theta (\theta - 2) \right]^{-1/2}, \quad \theta > 2$$
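As a quick numerical check of these formulas (a sketch not in the original help file), the code below compares the closed-form density, cdf, and quantile expressions with the values returned by dpareto, ppareto, and qpareto for location = 2 and shape = 3 (values chosen only for illustration).

eta <- 2; theta <- 3; x <- 4; p <- 0.25

# Density: theta * eta^theta / x^(theta + 1)
c(formula = theta * eta^theta / x^(theta + 1),
  dpareto = dpareto(x, location = eta, shape = theta))

# CDF: 1 - (eta / x)^theta
c(formula = 1 - (eta / x)^theta,
  ppareto = ppareto(x, location = eta, shape = theta))

# Quantile: eta * (1 - p)^(-1 / theta)
c(formula = eta * (1 - p)^(-1 / theta),
  qpareto = qpareto(p, location = eta, shape = theta))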
dpareto
gives the density, ppareto
gives the distribution function,
qpareto
gives the quantile function, and rpareto
generates random
deviates.
The Pareto distribution is named after Vilfredo Pareto (1848-1923), a professor
of economics. It is derived from Pareto's law, which states that the number of
persons having income $\ge x$ is given by:

$$N = A x^{-\theta}$$

where $A$ denotes Pareto's constant and $\theta$ is the shape parameter for the
probability distribution.

The Pareto distribution takes values on the positive real line. All values must be
larger than the “location” parameter $\eta$, which is really a threshold
parameter. There are three kinds of Pareto distributions. The one described here
is the Pareto distribution of the first kind. Stable Pareto distributions have
$0 < \theta < 2$. Note that the $r$'th moment of $X$ only exists if $r < \theta$.
The Pareto distribution is related to the
exponential distribution and
logistic distribution as follows.
Let $X$ denote a Pareto random variable with location=$\eta$ and shape=$\theta$.
Then $\log(X/\eta)$ has an exponential distribution with parameter rate=$\theta$,
and $-\log\{ (X/\eta)^{\theta} - 1 \}$ has a logistic distribution with parameters
location=$0$ and scale=$1$.
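The exponential relationship can be checked empirically; a minimal sketch (assuming location = 3 and shape = 2, chosen only for illustration):

# If X is Pareto(location = eta, shape = theta), then log(X / eta)
# should be exponential with rate = theta.
set.seed(47)
x <- rpareto(10000, location = 3, shape = 2)
y <- log(x / 3)
mean(y)   # should be close to 1 / rate = 0.5
qqplot(qexp(ppoints(length(y)), rate = 2), y,
  xlab = "Exponential(rate = 2) Quantiles",
  ylab = "Quantiles of log(X / 3)")
abline(0, 1)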
The Pareto distribution has a very long right-hand tail. It is often applied in the study of socioeconomic data, including the distribution of income, firm size, population, and stock price fluctuations.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
epareto
, eqpareto
, Exponential,
Probability Distributions and Random Numbers.
# Density of a Pareto distribution with parameters location=1 and shape=1, # evaluated at 2, 3 and 4: dpareto(2:4, 1, 1) #[1] 0.2500000 0.1111111 0.0625000 #---------- # The cdf of a Pareto distribution with parameters location=2 and shape=1, # evaluated at 3, 4, and 5: ppareto(3:5, 2, 1) #[1] 0.3333333 0.5000000 0.6000000 #---------- # The 25'th percentile of a Pareto distribution with parameters # location=1 and shape=1: qpareto(0.25, 1, 1) #[1] 1.333333 #---------- # A random sample of 4 numbers from a Pareto distribution with parameters # location=3 and shape=2. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(10) rpareto(4, 3, 2) #[1] 4.274728 3.603148 3.962862 5.415322
Produce a probability density function (pdf) plot for a user-specified distribution.
pdfPlot(distribution = "norm", param.list = list(mean = 0, sd = 1), left.tail.cutoff = ifelse(is.finite(supp.min), 0, 0.001), right.tail.cutoff = ifelse(is.finite(supp.max), 0, 0.001), plot.it = TRUE, add = FALSE, n.points = 1000, pdf.col = "black", pdf.lwd = 3 * par("cex"), pdf.lty = 1, curve.fill = !add, curve.fill.col = "cyan", x.ticks.at.all.x.max = 15, hist.col = ifelse(add, "black", "cyan"), density = 5, digits = .Options$digits, ..., type = "l", main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
pdfPlot(distribution = "norm", param.list = list(mean = 0, sd = 1), left.tail.cutoff = ifelse(is.finite(supp.min), 0, 0.001), right.tail.cutoff = ifelse(is.finite(supp.max), 0, 0.001), plot.it = TRUE, add = FALSE, n.points = 1000, pdf.col = "black", pdf.lwd = 3 * par("cex"), pdf.lty = 1, curve.fill = !add, curve.fill.col = "cyan", x.ticks.at.all.x.max = 15, hist.col = ifelse(add, "black", "cyan"), density = 5, digits = .Options$digits, ..., type = "l", main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
distribution |
a character string denoting the distribution abbreviation. The default value is
|
param.list |
a list with values for the parameters of the distribution. The default value is
|
left.tail.cutoff |
a numeric scalar indicating what proportion of the left-tail of the probability
distribution to omit from the plot. For densities with a finite support minimum
(e.g., Lognormal) the default value is |
right.tail.cutoff |
a scalar indicating what proportion of the right-tail of the probability
distribution to omit from the plot. For densities with a finite support maximum
(e.g., Binomial) the default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the probability density curve to the
existing plot ( |
n.points |
a numeric scalar specifying at how many evenly-spaced points the probability
density function will be evaluated. The default value is |
pdf.col |
for continuous distributions, a numeric scalar or character string determining
the color of the pdf line in the plot.
The default value is |
pdf.lwd |
for continuous distributions, a numeric scalar determining the width of the pdf
line in the plot.
The default value is |
pdf.lty |
for continuous distributions, a numeric scalar determining the line type of
the pdf line in the plot.
The default value is |
curve.fill |
for continuous distributions, a logical value indicating whether to fill in
the area below the probability density curve with the color specified by
|
curve.fill.col |
for continuous distributions, when |
x.ticks.at.all.x.max |
a numeric scalar indicating the maximum number of ticks marks on the |
hist.col |
for discrete distributions, a numeric scalar or character string indicating
what color to use to fill in the histogram if |
density |
for discrete distributions, a scalar indicting the density of line shading for
the histogram when |
digits |
a scalar indicating how many significant digits to print for the distribution
parameters. The default value is |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters. See |
The probability density function (pdf) of a random variable $X$,
usually denoted $f(x)$, is defined as:

$$f(x) = \frac{dF(x)}{dx} \quad\quad (1)$$

where $F(x)$ is the cumulative distribution function (cdf) of $X$.
That is, $f(x)$ is the derivative of the cdf $F(x)$ with respect to $x$
(where this derivative exists).

For discrete distributions, the probability density function is simply:

$$f(x) = Pr(X = x) \quad\quad (2)$$

In this case, $f(x)$ is sometimes called the probability function or
probability mass function.

The probability that the random variable $X$ takes on a value in the interval
$[a, b]$ is simply the (Lebesgue) integral of the pdf evaluated between
$a$ and $b$. That is,

$$Pr(a \le X \le b) = \int_a^b f(x) \, dx \quad\quad (3)$$

For discrete distributions, Equation (3) translates to summing up the probabilities of all values in this interval:

$$Pr(a \le X \le b) = \sum_{a \le x \le b} Pr(X = x) \quad\quad (4)$$
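As a brief numerical illustration of Equation (3) (not part of the original help file), the following sketch uses base R's dnorm, pnorm, and integrate to show that integrating the standard normal pdf over an interval reproduces the difference of the cdf values:

a <- -1; b <- 2

# Left-hand side of Equation (3): integrate the pdf from a to b
integrate(dnorm, lower = a, upper = b)$value
#[1] 0.8185946

# Equivalent expression in terms of the cdf: F(b) - F(a)
pnorm(b) - pnorm(a)
#[1] 0.8185946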
A probability density function (pdf) plot plots the values of the pdf against quantiles of the specified distribution. Theoretical pdf plots are sometimes plotted along with empirical pdf plots (density plots), histograms or bar graphs to visually assess whether data have a particular distribution.
pdfPlot
invisibly returns a list giving coordinates of the points
that have been or would have been plotted:
Quantiles |
The quantiles used for the plot. |
Probability.Densities |
The values of the pdf associated with the quantiles. |
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Distribution.df
, epdfPlot
, cdfPlot
.
# Plot the pdf of the standard normal distribution #------------------------------------------------- dev.new() pdfPlot() #========== # Plot the pdf of the standard normal distribution # and a N(2, 2) distribution on the sample plot. #------------------------------------------------- dev.new() pdfPlot(param.list = list(mean=2, sd=2), curve.fill = FALSE, ylim = c(0, dnorm(0)), main = "") pdfPlot(add = TRUE, pdf.col = "red") legend("topright", legend = c("N(2,2)", "N(0,1)"), col = c("black", "red"), lwd = 3 * par("cex")) title("PDF Plots for Two Normal Distributions") #========== # Clean up #--------- graphics.off()
This class of objects is returned by functions that perform permutation tests.
Objects of class "permutationTest"
are lists that contain information about
the null and alternative hypotheses, the estimated distribution parameters, the
test statistic and the p-value. They also contain the permutation distribution
of the statistic (or a sample of the permutation distribution).
Objects of S3 class "permutationTest"
are returned by any of the
EnvStats functions that perform permutation tests. Currently, these are:
oneSamplePermutationTest
, twoSamplePermutationTestLocation
, and
twoSamplePermutationTestProportion
.
A legitimate list of class "permutationTest"
includes the components
listed in the help file for htest.object
. In addition, the following
components must be included in a legitimate list of class "permutationTest"
:
Required Components
The following components must be included in a legitimate list of
class "permutationTest"
.
stat.dist |
numeric vector containing values of the statistic for the permutation distribution.
When |
exact |
logical scalar indicating whether the exact permutation distribution was used for
the test ( |
Optional Components
The following components may optionally be included in an object of
class "permutationTest"
:
seed |
integer or vector of integers indicating the seed that was used for sampling the
permutation distribution. This component is present only if |
prob.stat.dist |
numeric vector containing the probabilities associated with each element of
the component |
Generic functions that have methods for objects of class
"permutationTest"
include: print
, plot
.
Since objects of class "permutationTest"
are lists, you may extract
their components with the $
and [[
operators.
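To make the list structure concrete, here is a minimal sketch (reusing the call from the examples below and the documented components "statistic" and "stat.dist") showing how the one-sided p-value can be recovered from the stored permutation distribution; the recomputation is an illustration and should agree with the stored p.value component.

set.seed(23)
dat <- rlogis(10, location = 7, scale = 2)
perm.obj <- oneSamplePermutationTest(dat, mu = 5,
  alternative = "greater", exact = TRUE)

# Observed statistic and full (exact) permutation distribution
obs.stat <- perm.obj$statistic
perm.dist <- perm.obj$stat.dist

# For alternative = "greater", the p-value is the proportion of the
# permutation distribution at least as large as the observed statistic.
mean(perm.dist >= obs.stat)
perm.obj$p.value

rm(dat, perm.obj, obs.stat, perm.dist)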
Steven P. Millard ([email protected])
print.permutationTest
, plot.permutationTest
,
oneSamplePermutationTest
, twoSamplePermutationTestLocation
,
twoSamplePermutationTestProportion
, Hypothesis Tests.
# Create an object of class "permutationTest", then print it and plot it. #------------------------------------------------------------------------ set.seed(23) dat <- rlogis(10, location = 7, scale = 2) permutationTest.obj <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) mode(permutationTest.obj) #[1] "list" class(permutationTest.obj) #[1] "permutationTest" names(permutationTest.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "stat.dist" #[13] "exact" #========== # Print the results of the test #------------------------------ permutationTest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 #========== # Plot the results of the test #----------------------------- dev.new() plot(permutationTest.obj) #========== # Extract the test statistic #--------------------------- permutationTest.obj$statistic #Sum(x - 5) # 49.77294 #========== # Clean up #--------- rm(permutationTest.obj) graphics.off()
# Create an object of class "permutationTest", then print it and plot it. #------------------------------------------------------------------------ set.seed(23) dat <- rlogis(10, location = 7, scale = 2) permutationTest.obj <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) mode(permutationTest.obj) #[1] "list" class(permutationTest.obj) #[1] "permutationTest" names(permutationTest.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "stat.dist" #[13] "exact" #========== # Print the results of the test #------------------------------ permutationTest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 #========== # Plot the results of the test #----------------------------- dev.new() plot(permutationTest.obj) #========== # Extract the test statistic #--------------------------- permutationTest.obj$statistic #Sum(x - 5) # 49.77294 #========== # Clean up #--------- rm(permutationTest.obj) graphics.off()
Plot the results of calling the function boxcox
, which returns an
object of class "boxcox"
. Three different kinds of plots are available.
The function plot.boxcox
is automatically called by plot
when given an object of class "boxcox"
. The names of other functions
associated with Box-Cox transformations are listed under Data Transformations.
## S3 method for class 'boxcox' plot(x, plot.type = "Objective vs. lambda", same.window = TRUE, ask = same.window & plot.type != "Ojective vs. lambda", plot.pos.con = 0.375, estimate.params = FALSE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = TRUE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, cex.main = 1.4 * par("cex"), cex.sub = par("cex"), main = NULL, sub = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
points.col |
numeric scalar determining the color of the points in the plot. The default
value is |
The following arguments can be supplied when plot.type="Q-Q Plots"
, plot.type="Tukey M-D Q-Q Plots"
, or plot.type="All"
(supplied to qqPlot
):
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the Q-Q plots and/or Tukey Mean-Difference Q-Q plots.
The default value is |
estimate.params |
logical scalar indicating whether to compute quantiles based on estimating the
distribution parameters ( |
equal.axes |
logical scalar indicating whether to use the same range on the |
add.line |
logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the plot when |
duplicate.points.method |
a character string denoting how to plot points with duplicate |
line.col |
numeric scalar determining the color of the line in the plot. The default value
is |
line.lwd |
numeric scalar determining the width of the line in the plot. The default value
is |
line.lty |
numeric scalar determining the line type (style) of the line in the plot.
The default value is |
digits |
scalar indicating how many significant digits to print for the distribution
parameters and the value of the objective in the sub-title. The default
value is the current setting of |
Graphics parameters:
cex.main , cex.sub , main , sub , xlab , ylab , xlim , ylim , ...
|
graphics parameters; see |
The function plot.boxcox
is a method for the generic function
plot
for the class "boxcox"
(see boxcox.object
).
It can be invoked by calling plot
and giving it an object of
class "boxcox"
as the first argument, or by calling plot.boxcox
directly, regardless of the class of the object given as the first argument
to plot.boxcox
.
Plots associated with Box-Cox transformations are produced on the current graphics device. These can be one or all of the following:
Objective vs. $\lambda$.
Observed Quantiles vs. Normal Quantiles (Q-Q Plot) for the transformed
observations for each of the values of $\lambda$.
Tukey Mean-Difference Q-Q Plots for the transformed observations for each
of the values of $\lambda$.
See the help files for boxcox
and qqPlot
for more
information.
plot.boxcox
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
qqPlot
, boxcox
, boxcox.object
,
print.boxcox
, Data Transformations, plot
.
# Generate 30 observations from a lognormal distribution with # mean=10 and cv=2, call the function boxcox, and then plot # the results. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) # Plot the results based on the PPCC objective #--------------------------------------------- boxcox.list <- boxcox(x) dev.new() plot(boxcox.list) # Look at Q-Q Plots for the candidate values of lambda #----------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Look at Tukey Mean-Difference Q-Q Plots # for the candidate values of lambda #---------------------------------------- plot(boxcox.list, plot.type = "Tukey M-D Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(x, boxcox.list) graphics.off()
Plot the results of calling the function boxcoxCensored
,
which returns an object of class "boxcoxCensored"
. Three different kinds of plots are available.
The function plot.boxcoxCensored
is automatically called by plot
when given an object of class "boxcoxCensored"
.
## S3 method for class 'boxcoxCensored' plot(x, plot.type = "Objective vs. lambda", same.window = TRUE, ask = same.window & plot.type != "Ojective vs. lambda", prob.method = "michael-schucany", plot.pos.con = 0.375, estimate.params = FALSE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = TRUE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, cex.main = 1.4 * par("cex"), cex.sub = par("cex"), main = NULL, sub = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
points.col |
numeric scalar determining the color of the points in the plot. The default
value is |
The following arguments can be supplied when plot.type="Q-Q Plots"
, plot.type="Tukey M-D Q-Q Plots"
, or plot.type="All"
(supplied to qqPlot
):
prob.method |
character string indicating what method to use to compute the plotting positions
for Q-Q plots or Tukey Mean-Difference Q-Q plots.
Possible values are
The This argument is ignored if |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the Q-Q plots and/or Tukey Mean-Difference Q-Q plots.
The default value is |
estimate.params |
logical scalar indicating whether to compute quantiles based on estimating the
distribution parameters ( |
equal.axes |
logical scalar indicating whether to use the same range on the |
add.line |
logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the plot when |
duplicate.points.method |
a character string denoting how to plot points with duplicate |
line.col |
numeric scalar determining the color of the line in the plot. The default value
is |
line.lwd |
numeric scalar determining the width of the line in the plot. The default value
is |
line.lty |
numeric scalar determining the line type (style) of the line in the plot.
The default value is |
digits |
scalar indicating how many significant digits to print for the distribution
parameters and the value of the objective in the sub-title. The default
value is the current setting of |
Graphics parameters:
cex.main , cex.sub , main , sub , xlab , ylab , xlim , ylim , ...
|
graphics parameters; see |
The function plot.boxcoxCensored
is a method for the generic function
plot
for the class "boxcoxCensored"
(see boxcoxCensored.object
).
It can be invoked by calling plot
and giving it an object of
class "boxcoxCensored"
as the first argument, or by calling
plot.boxcoxCensored
directly, regardless of the class of the object given
as the first argument to plot.boxcoxCensored
.
Plots associated with Box-Cox transformations are produced on the current graphics device. These can be one or all of the following:
Objective vs. $\lambda$.
Observed Quantiles vs. Normal Quantiles (Q-Q Plot) for the transformed
observations for each of the values of $\lambda$.
Tukey Mean-Difference Q-Q Plots for the transformed observations for each
of the values of $\lambda$.
See the help files for boxcoxCensored
and qqPlotCensored
for more information.
plot.boxcoxCensored
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
qqPlotCensored
, boxcoxCensored
,
boxcoxCensored.object
, print.boxcoxCensored
,
Data Transformations, plot
.
# Generate 15 observations from a lognormal distribution with # mean=10 and cv=2 and censor the observations less than 2. # Then generate 15 more observations from this distribution and # censor the observations less than 4. # Then call the function boxcoxCensored, and then plot the results. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x.1 <- rlnormAlt(15, mean = 10, cv = 2) censored.1 <- x.1 < 2 x.1[censored.1] <- 2 x.2 <- rlnormAlt(15, mean = 10, cv = 2) censored.2 <- x.2 < 4 x.2[censored.2] <- 4 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) # Plot the results based on the PPCC objective #--------------------------------------------- boxcox.list <- boxcoxCensored(x, censored) dev.new() plot(boxcox.list) # Look at Q-Q Plots for the candidate values of lambda #----------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Look at Tukey Mean-Difference Q-Q Plots # for the candidate values of lambda #---------------------------------------- plot(boxcox.list, plot.type = "Tukey M-D Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list) graphics.off()
Plot the results of calling the function boxcox
when the argument
x
supplied to boxcox
is an object of class "lm"
.
Three different kinds of plots are available.
The function plot.boxcoxLm
is automatically called by plot
when given an object of class "boxcoxLm"
. The names of other functions
associated with Box-Cox transformations are listed under Data Transformations.
## S3 method for class 'boxcoxLm' plot(x, plot.type = "Objective vs. lambda", same.window = TRUE, ask = same.window & plot.type != "Ojective vs. lambda", plot.pos.con = 0.375, estimate.params = FALSE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = TRUE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, cex.main = 1.4 * par("cex"), cex.sub = par("cex"), main = NULL, sub = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
points.col |
numeric scalar determining the color of the points in the plot. The default
value is |
The following arguments can be supplied when plot.type="Q-Q Plots"
, plot.type="Tukey M-D Q-Q Plots"
, or plot.type="All"
(supplied to qqPlot
):
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the Q-Q plots and/or Tukey Mean-Difference Q-Q plots.
The default value is |
estimate.params |
logical scalar indicating whether to compute quantiles based on estimating the
distribution parameters ( |
equal.axes |
logical scalar indicating whether to use the same range on the |
add.line |
logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the plot when |
duplicate.points.method |
a character string denoting how to plot points with duplicate |
line.col |
numeric scalar determining the color of the line in the plot. The default value
is |
line.lwd |
numeric scalar determining the width of the line in the plot. The default value
is |
line.lty |
numeric scalar determining the line type (style) of the line in the plot.
The default value is |
digits |
scalar indicating how many significant digits to print for the distribution
parameters and the value of the objective in the sub-title. The default
value is the current setting of |
Graphics parameters:
cex.main , cex.sub , main , sub , xlab , ylab , xlim , ylim , ...
|
graphics parameters; see |
The function plot.boxcoxLm
is a method for the generic function
plot
for the class "boxcoxLm"
(see boxcoxLm.object
).
It can be invoked by calling plot
and giving it an object of
class "boxcoxLm"
as the first argument, or by calling plot.boxcoxLm
directly, regardless of the class of the object given as the first argument
to plot.boxcoxLm
.
Plots associated with Box-Cox transformations are produced on the current graphics device. These can be one or all of the following:
Objective vs. $\lambda$.
Observed Quantiles vs. Normal Quantiles (Q-Q Plot) for the residuals of
the linear model based on transformed values of the response variable
for each of the values of $\lambda$.
Tukey Mean-Difference Q-Q Plots for the residuals of
the linear model based on transformed values of the response variable
for each of the values of $\lambda$.
See the help files for boxcox
and qqPlot
for more
information.
plot.boxcoxLm
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
qqPlot
, boxcox
, boxcoxLm.object
,
print.boxcoxLm
, Data Transformations, plot
.
# Create an object of class "boxcoxLm", then plot the results. # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll model ozone as a # function of temperature. # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) boxcox.list <- boxcox(ozone.fit) # Plot PPCC vs. lambda based on Q-Q plots of residuals #----------------------------------------------------- dev.new() plot(boxcox.list) # Look at Q-Q plots of residuals for the various transformation #-------------------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Look at Tukey Mean-Difference Q-Q plots of residuals # for the various transformation #----------------------------------------------------- plot(boxcox.list, plot.type = "Tukey M-D Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(ozone.fit, boxcox.list) graphics.off()
# Create an object of class "boxcoxLm", then plot the results. # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll model ozone as a # function of temperature. # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) boxcox.list <- boxcox(ozone.fit) # Plot PPCC vs. lambda based on Q-Q plots of residuals #----------------------------------------------------- dev.new() plot(boxcox.list) # Look at Q-Q plots of residuals for the various transformation #-------------------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Look at Tukey Mean-Difference Q-Q plots of residuals # for the various transformation #----------------------------------------------------- plot(boxcox.list, plot.type = "Tukey M-D Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(ozone.fit, boxcox.list) graphics.off()
Plot the results of calling the function gofTest
, which returns an
object of class "gof"
when testing the goodness-of-fit of a set of data
to a distribution (i.e., when supplied with the y
argument but not
the x
argument). Five different kinds of plots are available.
The function plot.gof
is automatically called by plot
when given an object of class "gof"
. The names of other functions
associated with goodness-of-fit test are listed under Goodness-of-Fit Tests.
## S3 method for class 'gof' plot(x, plot.type = "Summary", captions = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL, Results = NULL), x.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), y.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), same.window = FALSE, ask = same.window & plot.type == "All", hist.col = "cyan", fitted.pdf.col = "black", fitted.pdf.lwd = 3 * par("cex"), fitted.pdf.lty = 1, plot.pos.con = switch(dist.abb, norm = , lnorm = , lnormAlt = , lnorm3 = 0.375, evd = 0.44, 0.4), ecdf.col = "cyan", fitted.cdf.col = "black", ecdf.lwd = 3 * par("cex"), fitted.cdf.lwd = 3 * par("cex"), ecdf.lty = 1, fitted.cdf.lty = 2, add.line = TRUE, digits = ifelse(plot.type == "Summary", 2, .Options$digits), test.result.font = 1, test.result.cex = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), test.result.mar = c(0, 0, 3, 0) + 0.1, cex.main = ifelse(plot.type == "Summary", 1.2, 1.5) * par("cex"), cex.axis = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), cex.lab = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add.om.title = TRUE, oma = if (plot.type == "Summary" & add.om.title) c(0, 0, 2.5, 0) else c(0, 0, 0, 0), om.title = NULL, om.font = 2, om.cex.main = 1.75 * par("cex"), om.line = 0.5, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
captions |
a list with 1 to 5 components with the names |
x.labels |
a list of 1 to 4 components with the names |
y.labels |
a list of 1 to 4 components with the names |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
digits |
scalar indicating how many significant digits to print for the distribution
parameters. If |
Arguments associated with plot.type="PDFs: Observed and Fitted"
:
hist.col |
a character string or numeric scalar determining the color of the histogram
used to display the distribution of the observed values. The default value is
|
fitted.pdf.col |
a character string or numeric scalar determining the color of the fitted PDF
(which is displayed as a line for continuous distributions and a histogram for
discrete distributions). The default value is |
fitted.pdf.lwd |
numeric scalar determining the width of the line used to display the fitted PDF.
The default value is |
fitted.pdf.lty |
numeric scalar determining the line type used to display the fitted PDF.
The default value is |
Arguments associated with plot.type="CDFs: Observed and Fitted"
:
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the observed (empirical) CDF. The default value of
NOTE: This argument is also used to determine the value of the
plotting position constant for the Q-Q plot ( |
ecdf.col |
a character string or numeric scalar determining the color of the line
used to display the empirical CDF. The default value is
|
fitted.cdf.col |
a character string or numeric scalar determining the color of the line used
to display the fitted CDF. The default value is |
ecdf.lwd |
numeric scalar determining the width of the line used to display the empirical CDF.
The default value is |
fitted.cdf.lwd |
numeric scalar determining the width of the line used to display the fitted CDF.
The default value is |
ecdf.lty |
numeric scalar determining the line type used to display the empirical CDF.
The default value is |
fitted.cdf.lty |
numeric scalar determining the line type used to display the fitted CDF.
The default value is |
Arguments associated with plot.type="Q-Q Plot"
or plot.type="Tukey M-D Q-Q Plot"
:
As explained above, plot.pos.con
is used for these plot types. Also:
add.line |
logical scalar indicating whether to add a line to the plot. If |
Arguments associated with plot.type="Test Results"
test.result.font |
numeric scalar indicating which font to use to print out the test results.
The default value is |
test.result.cex |
numeric scalar indicating the value of |
test.result.mar |
numeric vector indicating the value of |
Arguments associated with plot.type="Summary"
add.om.title |
logical scalar indicating whether to add a title in the outer margin when |
om.title |
character string containing the outer margin title. The default value is |
om.font |
numeric scalar indicating the font to use for the outer margin. The default
value is |
om.cex.main |
numeric scalar indicating the value of |
om.line |
numeric scalar indicating the line to place the outer margin title on. The
default value is |
Graphics parameters:
cex.main , cex.axis , cex.lab , main , xlab , ylab , xlim , ylim , oma , ...
|
additional graphics parameters. See the help file for |
The function plot.gof
is a method for the generic function
plot
for objects that inherit from class "gof"
(see gof.object
).
It can be invoked by calling plot
and giving it an object of
class "gof"
as the first argument, or by calling plot.gof
directly, regardless of the class of the object given as the first argument
to plot.gof
.
Plots associated with the goodness-of-fit test are produced on the current graphics device. These can be one or all of the following:
Observed distribution overlaid with fitted distribution
(plot.type="PDFs: Observed and Fitted"
). See the help files for
hist
and pdfPlot
.
Observed empirical distribution overlaid with fitted cumulative distribution
(plot.type="CDFs: Observed and Fitted"
). See the help file for
cdfCompare
.
Observed quantiles vs. fitted quantiles (Q-Q Plot)
(plot.type="Q-Q Plot"
). See the help file for qqPlot
.
Tukey mean-difference Q-Q plot (plot.type="Tukey M-D Q-Q Plot"
).
See the help file for qqPlot
.
Results of the goodness-of-fit test (plot.type="Test Results"
).
See the help file for print.gof
.
See the help file for gofTest
for more information.
plot.gof
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
gofTest
, gof.object
, print.gof
,
Goodness-of-Fit Tests, plot
.
# Create an object of class "gof" then plot the results. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) gof.obj <- gofTest(dat) # Summary plot (the default) #--------------------------- dev.new() plot(gof.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gof.obj, captions = list(PDFs = "Compare PDFs", CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary") # Just the Q-Q Plot #------------------ dev.new() plot(gof.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gof.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(dat, gof.obj) graphics.off()
# Create an object of class "gof" then plot the results. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) gof.obj <- gofTest(dat) # Summary plot (the default) #--------------------------- dev.new() plot(gof.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gof.obj, captions = list(PDFs = "Compare PDFs", CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary") # Just the Q-Q Plot #------------------ dev.new() plot(gof.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gof.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(dat, gof.obj) graphics.off()
Plot the results of calling the function gofTestCensored
, which returns
an object of class "gofCensored"
when testing the goodness-of-fit of a set of
data to a distribution. Five different kinds of plots are available.
The function plot.gofCensored
is automatically called by plot
when given an object of class "gofCensored"
.
## S3 method for class 'gofCensored' plot(x, plot.type = "Summary", captions = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL, Results = NULL), x.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), y.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), same.window = FALSE, ask = same.window & plot.type == "All", hist.col = "cyan", fitted.pdf.col = "black", fitted.pdf.lwd = 3 * par("cex"), fitted.pdf.lty = 1, prob.method = "michael-schucany", plot.pos.con = 0.375, ecdf.col = "cyan", fitted.cdf.col = "black", ecdf.lwd = 3 * par("cex"), fitted.cdf.lwd = 3 * par("cex"), ecdf.lty = 1, fitted.cdf.lty = 2, add.line = TRUE, digits = ifelse(plot.type == "Summary", 2, .Options$digits), test.result.font = 1, test.result.cex = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), test.result.mar = c(0, 0, 3, 0) + 0.1, cex.main = ifelse(plot.type == "Summary", 1.2, 1.5) * par("cex"), cex.axis = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), cex.lab = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add.om.title = TRUE, oma = if (plot.type == "Summary" & add.om.title) c(0, 0, 4, 0) else c(0, 0, 0, 0), om.title = NULL, om.font = 2, om.cex.main = 1.5 * par("cex"), om.line = 0, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
captions |
a list with 1 to 5 components with the names |
x.labels |
a list of 1 to 4 components with the names |
y.labels |
a list of 1 to 4 components with the names |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
digits |
scalar indicating how many significant digits to print for the distribution
parameters. If |
Arguments associated with plot.type="PDFs: Observed and Fitted"
:
hist.col |
a character string or numeric scalar determining the color of the histogram
used to display the distribution of the observed values. The default value is
|
fitted.pdf.col |
a character string or numeric scalar determining the color of the fitted PDF
(which is displayed as a line for continuous distributions and a histogram for
discrete distributions). The default value is |
fitted.pdf.lwd |
numeric scalar determining the width of the line used to display the fitted PDF.
The default value is |
fitted.pdf.lty |
numeric scalar determining the line type used to display the fitted PDF.
The default value is |
Arguments associated with plot.type="CDFs: Observed and Fitted"
:
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities). Possible values are: The default value is The NOTE: This argument is also used to determine the plotting position method
for the Q-Q plot ( |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the observed (empirical) CDF. The default value is
This argument is used only if NOTE: This argument is also used to determine the value of the
plotting position constant for the Q-Q plot ( |
ecdf.col |
a character string or numeric scalar determining the color of the line
used to display the empirical CDF. The default value is
|
fitted.cdf.col |
a character string or numeric scalar determining the color of the line used
to display the fitted CDF. The default value is |
ecdf.lwd |
numeric scalar determining the width of the line used to display the empirical CDF.
The default value is |
fitted.cdf.lwd |
numeric scalar determining the width of the line used to display the fitted CDF.
The default value is |
ecdf.lty |
numeric scalar determining the line type used to display the empirical CDF.
The default value is |
fitted.cdf.lty |
numeric scalar determining the line type used to display the fitted CDF.
The default value is |
Arguments associated with plot.type="Q-Q Plot"
or plot.type="Tukey M-D Q-Q Plot"
:
As explained above, prob.method
and plot.pos.con
are used for these plot
types. Also:
add.line |
logical scalar indicating whether to add a line to the plot. If |
Arguments associated with plot.type="Test Results"
test.result.font |
numeric scalar indicating which font to use to print out the test results.
The default value is |
test.result.cex |
numeric scalar indicating the value of |
test.result.mar |
numeric vector indicating the value of |
Arguments associated with plot.type="Summary"
add.om.title |
logical scalar indicating whether to add a title in the outer margin when |
om.title |
character string containing the outer margin title. The default value is |
om.font |
numeric scalar indicating the font to use for the outer margin. The default
value is |
om.cex.main |
numeric scalar indicating the value of |
om.line |
numeric scalar indicating the line to place the outer margin title on. The
default value is |
Graphics parameters:
cex.main , cex.axis , cex.lab , main , xlab , ylab , xlim , ylim , oma , ...
|
additional graphics parameters. See the help file for |
The function plot.gofCensored
is a method for the generic function
plot
for objects that inherit from the class "gofCensored"
(see gofCensored.object
).
It can be invoked by calling plot
and giving it an object of
class "gofCensored"
as the first argument, or by calling
plot.gofCensored
directly, regardless of the class of the object given
as the first argument to plot.gofCensored
.
Plots associated with the goodness-of-fit test are produced on the current graphics device. These can be one or all of the following:
Observed distribution overlaid with fitted distribution
(plot.type="PDFs: Observed and Fitted"
). See the help files for
hist
and pdfPlot
. Note: This kind of
plot is only available for singly-censored data.
Observed empirical distribution overlaid with fitted cumulative distribution
(plot.type="CDFs: Observed and Fitted"
). See the help file for
cdfCompareCensored
.
Observed quantiles vs. fitted quantiles (Q-Q Plot)
(plot.type="Q-Q Plot"
). See the help file for qqPlotCensored
.
Tukey mean-difference Q-Q plot (plot.type="Tukey M-D Q-Q Plot"
).
See the help file for qqPlotCensored
.
Results of the goodness-of-fit test (plot.type="Test Results"
).
See the help file for print.gofCensored
.
See the help file for gofTestCensored
for more information.
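As a brief sketch of selecting one of these plot types (using plot.type and the CDF display arguments documented above, and continuing with the gof.cens object created earlier):

# Compare the observed (empirical) and fitted CDFs with custom colors
plot(gof.cens, plot.type = "CDFs: Observed and Fitted",
  ecdf.col = "blue", fitted.cdf.col = "red",
  ecdf.lwd = 2, fitted.cdf.lwd = 2)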
plot.gofCensored
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
gofTestCensored
, gofCensored.object
,
print.gofCensored
, Censored Data, plot
.
# Create an object of class "gofCensored", then plot the results. #---------------------------------------------------------------- gofCensored.obj <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf")) mode(gofCensored.obj) #[1] "list" class(gofCensored.obj) #[1] "gofCensored" # Summary plot (the default) #--------------------------- dev.new() plot(gofCensored.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gofCensored.obj, captions = list(CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary") # Just the Q-Q Plot #------------------ dev.new() plot(gofCensored.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gofCensored.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(gofCensored.obj) graphics.off()
# Create an object of class "gofCensored", then plot the results. #---------------------------------------------------------------- gofCensored.obj <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf")) mode(gofCensored.obj) #[1] "list" class(gofCensored.obj) #[1] "gofCensored" # Summary plot (the default) #--------------------------- dev.new() plot(gofCensored.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gofCensored.obj, captions = list(CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary") # Just the Q-Q Plot #------------------ dev.new() plot(gofCensored.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gofCensored.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(gofCensored.obj) graphics.off()
Plot the results of calling the function gofGroupTest
,
which returns an object of class "gofGroup"
when performing a
goodness-of-fit test to determine whether data in a set of
groups appear to all come from the same probability distribution
(with possibly different parameters for each group).
Five different kinds of plots are available.
The function plot.gofGroup
is automatically called by plot
when given an object of class "gofGroup"
. The names of other functions
associated with goodness-of-fit tests are listed under Goodness-of-Fit Tests.
## S3 method for class 'gofGroup' plot(x, plot.type = "Summary", captions = list(QQ = NULL, MDQQ = NULL, ScoresQQ = NULL, ScoresMDQQ = NULL, Results = NULL), x.labels = list(QQ = NULL, MDQQ = NULL, ScoresQQ = NULL, ScoresMDQQ = NULL), y.labels = list(QQ = NULL, MDQQ = NULL, ScoresQQ = NULL, ScoresMDQQ = NULL), same.window = FALSE, ask = same.window & plot.type == "All", add.line = TRUE, digits = ifelse(plot.type == "Summary", 2, .Options$digits), test.result.font = 1, test.result.cex = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), test.result.mar = c(0, 0, 3, 0) + 0.1, individual.p.values = FALSE, cex.main = ifelse(plot.type == "Summary", 1.2, 1.5) * par("cex"), cex.axis = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), cex.lab = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add.om.title = TRUE, oma = if (plot.type == "Summary" & add.om.title) c(0, 0, 5, 0) else c(0, 0, 0, 0), om.title = NULL, om.font = 2, om.cex.main = 1.5 * par("cex"), om.line = 1, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
captions |
a list with 1 to 5 components with the names |
x.labels |
a list of 1 to 4 components with the names |
y.labels |
a list of 1 to 4 components with the names |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
add.line |
logical scalar indicating whether to add a line to the plot. If |
Arguments associated with plot.type="Test Results"
digits |
scalar indicating how many significant digits to print for the test results
when |
individual.p.values |
logical scalar indicating whether to display the p-values associated with
each individual group. The default value is |
test.result.font |
numeric scalar indicating which font to use to print out the test results.
The default value is |
test.result.cex |
numeric scalar indicating the value of |
test.result.mar |
numeric vector indicating the value of |
Arguments associated with plot.type="Summary"
add.om.title |
logical scalar indicating whether to add a title in the outer margin when |
om.title |
character string containing the outer margin title. The default value is |
om.font |
numeric scalar indicating the font to use for the outer margin. The default
value is |
om.cex.main |
numeric scalar indicating the value of |
om.line |
numeric scalar indicating the line to place the outer margin title on. The
default value is |
Graphics parameters:
cex.main , cex.axis , cex.lab , main , xlab , ylab , xlim , ylim , oma , ...
|
additional graphics parameters. See the help file for |
The function plot.gofGroup
is a method for the generic function
plot
for the class "gofGroup"
(see
gofGroup.object
).
It can be invoked by calling plot
and giving it an object of
class "gofGroup"
as the first argument, or by calling
plot.gofGroup
directly, regardless of the class of the object given
as the first argument to plot.gofGroup
.
Plots associated with the goodness-of-fit test are produced on the current graphics device. These can be one or all of the following:
plot.type="Q-Q Plot"
.
Q-Q Plot of observed p-values vs. quantiles from a
Uniform [0,1] distribution.
See the help file for qqPlot
.
plot.type="Tukey M-D Q-Q Plot"
.
Tukey mean-difference Q-Q plot for observed p-values and
quantiles from a Uniform [0,1] distribution.
See the help file for qqPlot
.
plot.type="Scores Q-Q Plot"
.
Q-Q Plot of Normal scores vs. quantiles from a
Normal(0,1) distribution or
Q-Q Plot of Chisquare scores vs. quantiles from a
Chisquare distribution with 2 degrees of freedom.
See the help file for qqPlot
.
plot.type="Scores Tukey M-D Q-Q Plot"
.
Tukey mean-difference Q-Q plot based on Normal scores or
Chisquare scores.
See the help file for qqPlot
.
Results of the goodness-of-fit test (plot.type="Test Results"
).
See the help file for print.gofGroup
.
See the help file for gofGroupTest
for more information.
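A short sketch of requesting individual plot types by name (the plot.type values are those documented above; the nickel data set is the one used in the Examples below):

gofGroup.obj <- gofGroupTest(Nickel.ppb ~ Well,
  data = EPA.09.Ex.10.1.nickel.df)

# Q-Q plot of the group-wise p-values against Uniform(0,1) quantiles
plot(gofGroup.obj, plot.type = "Q-Q Plot")

# Q-Q plot based on the group scores (normal or chi-square)
plot(gofGroup.obj, plot.type = "Scores Q-Q Plot")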
plot.gofGroup
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
gofGroupTest
, gofGroup.object
,
print.gofGroup
,
Goodness-of-Fit Tests, plot
.
# Create an object of class "gofGroup" then plot it. # Example 10-4 of USEPA (2009, page 10-20) gives an example of # simultaneously testing the assumption of normality for nickel # concentrations (ppb) in groundwater collected at 4 monitoring # wells over 5 months. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #... #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Test for a normal distribution at each well: #-------------------------------------------- gofGroup.obj <- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df) dev.new() plot(gofGroup.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gofGroup.obj, captions = list(QQ = "Q-Q Plot", ScoresQQ = "Scores Q-Q Plot", Results = "Results"), om.title = "Summary Plot") # Just the Q-Q Plot #------------------ dev.new() plot(gofGroup.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gofGroup.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(gofGroup.obj) graphics.off()
# Create an object of class "gofGroup" then plot it. # Example 10-4 of USEPA (2009, page 10-20) gives an example of # simultaneously testing the assumption of normality for nickel # concentrations (ppb) in groundwater collected at 4 monitoring # wells over 5 months. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #... #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Test for a normal distribution at each well: #-------------------------------------------- gofGroup.obj <- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df) dev.new() plot(gofGroup.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gofGroup.obj, captions = list(QQ = "Q-Q Plot", ScoresQQ = "Scores Q-Q Plot", Results = "Results"), om.title = "Summary Plot") # Just the Q-Q Plot #------------------ dev.new() plot(gofGroup.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gofGroup.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(gofGroup.obj) graphics.off()
Plot the results of calling the function gofTest
to compare
two samples. gofTest
returns an object of class "gofTwoSample"
when supplied with both the arguments y
and x
.
plot.gofTwoSample
provides five different kinds of plots.
The function plot.gofTwoSample
is automatically called by plot
when given an object of class "gofTwoSample"
. The names of other functions
associated with goodness-of-fit tests are listed under Goodness-of-Fit Tests.
## S3 method for class 'gofTwoSample' plot(x, plot.type = "Summary", captions = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL, Results = NULL), x.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), y.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), same.window = FALSE, ask = same.window & plot.type == "All", x.points.col = "blue", y.points.col = "black", points.pch = 1, jitter.points = TRUE, discrete = FALSE, plot.pos.con = 0.375, x.ecdf.col = "blue", y.ecdf.col = "black", x.ecdf.lwd = 3 * par("cex"), y.ecdf.lwd = 3 * par("cex"), x.ecdf.lty = 1, y.ecdf.lty = 4, add.line = TRUE, digits = ifelse(plot.type == "Summary", 2, .Options$digits), test.result.font = 1, test.result.cex = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), test.result.mar = c(0, 0, 3, 0) + 0.1, cex.main = ifelse(plot.type == "Summary", 1.2, 1.5) * par("cex"), cex.axis = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), cex.lab = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add.om.title = TRUE, oma = if (plot.type == "Summary" & add.om.title) c(0, 0, 4, 0) else c(0, 0, 0, 0), om.title = NULL, om.font = 2, om.cex.main = 1.5 * par("cex"), om.line = 0, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
captions |
a list with 1 to 5 components with the names |
x.labels |
a list of 1 to 4 components with the names |
y.labels |
a list of 1 to 4 components with the names |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
Arguments associated with plot.type="PDFs: Observed"
:
x.points.col |
a character string or numeric scalar determining the color of the plotting symbol
used to display the distribution of the observed |
y.points.col |
a character string or numeric scalar determining the color of the plotting symbol
used to display the distribution of the observed |
points.pch |
a character string or numeric scalar determining the plotting symbol
used to display the distribution of the observed |
jitter.points |
logical scalar indicating whether to jitter the points in the strip chart.
The default value is |
Arguments associated with plot.type="CDFs: Observed"
:
discrete |
logical scalar indicating whether the two distributions are considered to be
discrete ( |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the observed (empirical) CDFs. The default value
is .... NOTE: This argument is also used to determine the value of the
plotting position constant for the Q-Q plot ( |
x.ecdf.col |
a character string or numeric scalar determining the color of the line
used to display the empirical CDF for the |
y.ecdf.col |
a character string or numeric scalar determining the color of the line
used to display the empirical CDF for the |
x.ecdf.lwd |
numeric scalar determining the width of the line used to display the empirical CDF
for the |
y.ecdf.lwd |
numeric scalar determining the width of the line used to display the empirical CDF
for the |
x.ecdf.lty |
numeric scalar determining the line type used to display the empirical CDF for the
|
y.ecdf.lty |
numeric scalar determining the line type used to display the empirical CDF for the
|
Arguments associated with plot.type="Q-Q Plot"
or plot.type="Tukey M-D Q-Q Plot"
:
As explained above, plot.pos.con
is used for these plot types. Also:
add.line |
logical scalar indicating whether to add a line to the plot. If |
Arguments associated with plot.type="Test Results"
digits |
scalar indicating how many significant digits to print for the test results
when |
test.result.font |
numeric scalar indicating which font to use to print out the test results.
The default value is |
test.result.cex |
numeric scalar indicating the value of |
test.result.mar |
numeric vector indicating the value of |
Arguments associated with plot.type="Summary"
add.om.title |
logical scalar indicating whether to add a title in the outer margin when |
om.title |
character string containing the outer margin title. The default value is |
om.font |
numeric scalar indicating the font to use for the outer margin. The default
value is |
om.cex.main |
numeric scalar indicating the value of |
om.line |
numeric scalar indicating the line to place the outer margin title on. The
default value is |
Graphics parameters:
cex.main , cex.axis , cex.lab , main , xlab , ylab , xlim , ylim , oma , ...
|
additional graphics parameters. See the help file for |
The function plot.gofTwoSample
is a method for the generic function
plot
for the class "gofTwoSample"
(see gofTwoSample.object
).
It can be invoked by calling plot
and giving it an object of
class "gofTwoSample"
as the first argument, or by calling
plot.gofTwoSample
directly, regardless of the class of the object given
as the first argument to plot.gofTwoSample
.
Plots associated with the goodness-of-fit test are produced on the current graphics device. These can be one or all of the following:
Observed distributions (plot.type="PDFs: Observed"
).
Observed CDFs (plot.type="CDFs: Observed"
).
See the help file for cdfCompare
.
Q-Q Plot (plot.type="Q-Q Plot"
). See the help file for
qqPlot
.
Tukey mean-difference Q-Q plot (plot.type="Tukey M-D Q-Q Plot"
).
See the help file for qqPlot
.
Results of the goodness-of-fit test (plot.type="Test Results"
).
See the help file for print.gofTwoSample
.
See the help file for gofTest
for more information.
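For illustration, a small sketch that requests the observed-PDFs display using the point arguments documented above (data simulated as in the Examples below):

set.seed(300)
dat1 <- rnorm(20, mean = 3, sd = 2)
dat2 <- rnorm(10, mean = 1, sd = 2)
gof.obj <- gofTest(x = dat1, y = dat2)

# Strip charts of the two observed samples, with jittered, colored points
plot(gof.obj, plot.type = "PDFs: Observed",
  x.points.col = "red", y.points.col = "blue", jitter.points = TRUE)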
plot.gofTwoSample
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
gofTest
, gofTwoSample.object
,
print.gofTwoSample
,
Goodness-of-Fit Tests, plot
.
# Create an object of class "gofTwoSample" then plot the results. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(300) dat1 <- rnorm(20, mean = 3, sd = 2) dat2 <- rnorm(10, mean = 1, sd = 2) gof.obj <- gofTest(x = dat1, y = dat2) # Summary plot (the default) #--------------------------- dev.new() plot(gof.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gof.obj, captions = list(PDFs = "Compare PDFs", CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary Plot") # Just the Q-Q Plot #------------------ dev.new() plot(gof.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gof.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(dat1, dat2, gof.obj) graphics.off()
# Create an object of class "gofTwoSample" then plot the results. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(300) dat1 <- rnorm(20, mean = 3, sd = 2) dat2 <- rnorm(10, mean = 1, sd = 2) gof.obj <- gofTest(x = dat1, y = dat2) # Summary plot (the default) #--------------------------- dev.new() plot(gof.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gof.obj, captions = list(PDFs = "Compare PDFs", CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary Plot") # Just the Q-Q Plot #------------------ dev.new() plot(gof.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gof.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(dat1, dat2, gof.obj) graphics.off()
Plot the results of calling functions that return an object of class
"permutationTest"
. Currently, the EnvStats functions that perform
permutation tests and produce objects of class "permutationTest"
are: oneSamplePermutationTest
,
twoSamplePermutationTestLocation
, and twoSamplePermutationTestProportion
.
The function plot.permutationTest
is automatically called by
plot
when given an object of class "permutationTest"
.
## S3 method for class 'permutationTest' plot(x, hist.col = "cyan", stat.col = "black", stat.lwd = 3 * par("cex"), stat.lty = 1, cex.main = par("cex"), digits = .Options$digits, main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ...)
x |
an object of class |
hist.col |
a character string or numeric scalar determining the color of the histogram
used to display the permutation distribution. The default
value is |
stat.col |
a character string or numeric scalar determining the color of the line indicating
the value of the observed test statistic. The default value is
|
stat.lwd |
numeric scalar determining the width of the line indicating the value of the
observed test statistic. The default value is |
stat.lty |
numeric scalar determining the line type used to display the value of the
observed test statistic. The default value is |
digits |
scalar indicating how many significant digits to print for the distribution
parameters. The default value is |
cex.main , main , xlab , ylab , xlim , ylim , ...
|
graphics parameters. See the help file for |
Produces a plot displaying the permutation distribution (exact=TRUE
) or a
sample of the permutation distribution (exact=FALSE
), and a line indicating
the observed value of the test statistic. The title in the plot includes
information on the data used, null hypothesis, and p-value.
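As a sketch of how the display arguments above can be combined (the test object is built as in the Examples below):

set.seed(23)
dat <- rlogis(10, location = 7, scale = 2)
perm.obj <- oneSamplePermutationTest(dat, mu = 5,
  alternative = "greater", exact = TRUE)

# Histogram of the permutation distribution, with a dashed red line
# marking the observed value of the test statistic
plot(perm.obj, hist.col = "lightgray", stat.col = "red",
  stat.lwd = 2, stat.lty = 2)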
The function plot.permutationTest
is a method for the generic function
plot
for the class "permutationTest"
(see permutationTest.object
). It can be invoked by calling
plot
and giving it an object of
class "permutationTest"
as the first argument, or by calling plot.permutationTest
directly, regardless of the class of the object given
as the first argument to plot.permutationTest
.
plot.permutationTest
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
permutationTest.object
, print.permutationTest
,
oneSamplePermutationTest
, twoSamplePermutationTestLocation
,
twoSamplePermutationTestProportion
,
Hypothesis Tests, plot
.
# Create an object of class "permutationTest", then print it and plot it. # (Note: the call to set.seed() allows you to reproduce this example.) #------------------------------------------------------------------------ set.seed(23) dat <- rlogis(10, location = 7, scale = 2) permutationTest.obj <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) mode(permutationTest.obj) #[1] "list" class(permutationTest.obj) #[1] "permutationTest" names(permutationTest.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "stat.dist" #[13] "exact" #========== # Print the results of the test #------------------------------ permutationTest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 #========== # Plot the results of the test #----------------------------- dev.new() plot(permutationTest.obj) #========== # Extract the test statistic #--------------------------- permutationTest.obj$statistic #Sum(x - 5) # 49.77294 #========== # Clean up #--------- rm(permutationTest.obj) graphics.off()
# Create an object of class "permutationTest", then print it and plot it. # (Note: the call to set.seed() allows you to reproduce this example.) #------------------------------------------------------------------------ set.seed(23) dat <- rlogis(10, location = 7, scale = 2) permutationTest.obj <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) mode(permutationTest.obj) #[1] "list" class(permutationTest.obj) #[1] "permutationTest" names(permutationTest.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "stat.dist" #[13] "exact" #========== # Print the results of the test #------------------------------ permutationTest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 #========== # Plot the results of the test #----------------------------- dev.new() plot(permutationTest.obj) #========== # Extract the test statistic #--------------------------- permutationTest.obj$statistic #Sum(x - 5) # 49.77294 #========== # Clean up #--------- rm(permutationTest.obj) graphics.off()
Create plots involving sample size, power, scaled difference, and significance level for a one-way fixed-effects analysis of variance.
plotAovDesign(x.var = "n", y.var = "power", range.x.var = NULL, n.vec = c(25, 25), mu.vec = c(0, 1), sigma = 1, alpha = 0.05, power = 0.95, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 50, plot.col = 1, plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, main = NULL, xlab = NULL, ylab = NULL, type = "l", ...)
x.var |
character string indicating what variable to use for the x-axis. Possible values are
|
y.var |
character string indicating what variable to use for the y-axis. Possible values are
|
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n.vec |
numeric vector indicating the sample size for each group. The default value is
|
mu.vec |
numeric vector indicating the population mean for each group. The default value is
|
sigma |
numeric scalar indicating the population standard deviation for all groups. The default
value is |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level associated with the
hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power associated with the hypothesis
test. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next largest integer. The default value is FALSE. This argument
is ignored unless |
n.max |
for the case when |
tol |
for the case when |
maxiter |
for the case when |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the existing plot
( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot. There are
|
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for aovPower
and aovN
for information on how to compute the power and sample size for a
one-way fixed-effects analysis of variance.
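A minimal sketch of those two companion functions (argument names are assumed from their help files, which are not reproduced here):

# Power of a one-way ANOVA with 3 groups of 10 observations each
aovPower(n.vec = c(10, 10, 10), mu.vec = c(0, 0.5, 1),
  sigma = 1, alpha = 0.05)

# Per-group sample size needed to achieve 95% power for the same design
aovN(mu.vec = c(0, 0.5, 1), sigma = 1, alpha = 0.05, power = 0.95)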
plotAovDesign
invisibly returns a list with components:
x.var |
x-coordinates of the points that have been or would have been plotted |
y.var |
y-coordinates of the points that have been or would have been plotted |
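For example, the coordinates can be captured without drawing anything by setting plot.it = FALSE (a sketch based on the arguments described above):

# Get the (sample size, power) pairs without producing a plot
pow.curve <- plotAovDesign(x.var = "n", y.var = "power",
  mu.vec = c(0, 0.5, 1), sigma = 1, alpha = 0.05, plot.it = FALSE)
names(pow.curve)
#[1] "x.var" "y.var"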
The normal and lognormal distributions are probably the two most frequently used distributions to model environmental data. Sometimes it is necessary to compare several means to determine whether any are significantly different from each other (e.g., USEPA, 2009, p.6-38). In this case, assuming normally distributed data, you perform a one-way parametric analysis of variance.
In the course of designing a sampling program, an environmental
scientist may wish to determine the relationship between sample
size, Type I error level, power, and differences in means if
one of the objectives of the sampling program is to determine
whether a particular mean differs from a group of means. The
functions aovPower
, aovN
, and
plotAovDesign
can be used to investigate these
relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapter 17.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 27, 29, 30.
Scheffe, H. (1959). The Analysis of Variance. John Wiley and Sons, New York, 477pp.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 10.
# Look at the relationship between power and sample size # for a one-way ANOVA, assuming k=2 groups, group means of # 0 and 1, a population standard deviation of 1, and a # 5% significance level: dev.new() plotAovDesign() #-------------------------------------------------------------------- # Plot power vs. sample size for various levels of significance: dev.new() plotAovDesign(mu.vec = c(0, 0.5, 1), ylim=c(0, 1), main="") plotAovDesign(mu.vec = c(0, 0.5, 1), alpha=0.1, add=TRUE, plot.col=2) plotAovDesign(mu.vec = c(0, 0.5, 1), alpha=0.2, add=TRUE, plot.col=3) legend(35, 0.6, c("20%", "10%", " 5%"), lty=1, lwd = 3, col=3:1, bty = "n") mtext("Power vs. Sample Size for One-Way ANOVA", line = 3, cex = 1.25) mtext(expression(paste("with ", mu, "=(0, 0.5, 1), ", sigma, "=1, and Various Significance Levels", sep="")), line = 1.5, cex = 1.25) #-------------------------------------------------------------------- # The example on pages 5-11 to 5-14 of USEPA (1989b) shows # log-transformed concentrations of lead (mg/L) at two # background wells and four compliance wells, where # observations were taken once per month over four months # (the data are stored in EPA.89b.loglead.df). # Assume the true mean levels at each well are # 3.9, 3.9, 4.5, 4.5, 4.5, and 5, respectively. Plot the # power vs. sample size of a one-way ANOVA to test for mean # differences between wells. Use alpha=0.05, and assume the # true standard deviation is equal to the one estimated # from the data in this example. names(EPA.89b.loglead.df) #[1] "LogLead" "Month" "Well" "Well.type" # Perform the ANOVA and get the estimated sd aov.list <- aov(LogLead ~ Well, data=EPA.89b.loglead.df) summary(aov.list) # Df Sum Sq Mean Sq F value Pr(>F) #Well 5 5.7447 1.14895 3.3469 0.02599 * #Residuals 18 6.1791 0.34328 #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '' 1 # Now create the plot dev.new() plotAovDesign(range.x.var = c(2, 20), mu.vec = c(3.9,3.9,4.5,4.5,4.5,5), sigma=sqrt(0.34), ylim = c(0, 1), digits=2) # Clean up #--------- rm(aov.list) graphics.off()
Create plots for a sampling design based on a confidence interval for a binomial proportion or the difference between two proportions.
plotCiBinomDesign(x.var = "n", y.var = "half.width", range.x.var = NULL, n.or.n1 = 25, p.hat.or.p1.hat = 0.5, n2 = n.or.n1, p2.hat = 0.4, ratio = 1, half.width = 0.05, conf.level = 0.95, sample.type = "one.sample", ci.method = "score", correct = TRUE, warn = TRUE, n.or.n1.min = 2, n.or.n1.max = 10000, tol.half.width = 0.005, tol.p.hat = 0.005, maxiter = 10000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = 1, plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, main = NULL, xlab = NULL, ylab = NULL, type = "l", ...)
x.var |
character string indicating what variable to use for the x-axis. Possible values are
|
y.var |
character string indicating what variable to use for the y-axis. Possible values are
|
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n.or.n1 |
numeric scalar indicating the sample size. The default value is |
p.hat.or.p1.hat |
numeric scalar indicating an estimated proportion. |
n2 |
numeric scalar indicating the sample size for group 2. The default value is the value of |
p2.hat |
numeric scalar indicating the estimated proportion for group 2.
Missing ( |
ratio |
numeric vector indicating the ratio of sample size in group 2 to sample size in group 1 ( |
half.width |
positive numeric scalar indicating the half-width of the confidence interval.
The default value is |
conf.level |
a numeric scalar between 0 and 1 indicating the confidence level associated with the confidence intervals.
The default value is |
sample.type |
character string indicating whether this is a one-sample or two-sample confidence interval.
When |
ci.method |
character string indicating which method to use to construct the confidence interval.
Possible values are |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning when |
n.or.n1.min |
for the case when |
n.or.n1.max |
for the case when |
tol.half.width |
for the case when |
tol.p.hat |
for the case when |
maxiter |
for the case when |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see description of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot
( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for ciBinomHalfWidth
and ciBinomN
for information on how to compute a one-sample confidence interval for
a single binomial proportion or a two-sample confidence interval for the difference between
two proportions, how the half-width is computed when other quantities are fixed, and how
the sample size is computed when other quantities are fixed.
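A minimal sketch of those two companion functions (argument names follow this help file; the values are illustrative only):

# Half-width of a 95% one-sample CI for a proportion, n = 25, p-hat = 0.5
ciBinomHalfWidth(n.or.n1 = 25, p.hat.or.p1.hat = 0.5, conf.level = 0.95)

# Sample size needed so that the half-width is roughly 0.05
ciBinomN(half.width = 0.05, p.hat.or.p1.hat = 0.5, conf.level = 0.95)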
plotCiBinomDesign
invisibly returns a list with components:
x.var |
x-coordinates of the points that have been or would have been plotted |
y.var |
y-coordinates of the points that have been or would have been plotted |
The binomial distribution is used to model processes with binary
(Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any
one trial is independent of any other trial, and that the probability of “success”, p, is the same on each trial. A binomial discrete random variable X is the number of “successes” in n independent trials. A special case of the binomial distribution occurs when n = 1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model
the proportion of times a chemical concentration exceeds a set standard in a given period of time
(e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a
background well (e.g., USEPA, 1989b, Chapter 8, p.3-7). (However, USEPA 2009, p.8-27
recommends using the Wilcoxon rank sum test (wilcox.test
) instead of
comparing proportions.)
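For instance, a hedged sketch of estimating the proportion of detects at the background wells from the cadmium data used in the Examples below (ebinom is described in its own help file):

# Proportion of detects (non-censored values) at the background wells,
# with a 95% confidence interval
with(EPA.89b.cadmium.df,
  ebinom(as.numeric(!Censored[Well.type == "Background"]), ci = TRUE))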
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives of
the sampling program is to produce confidence intervals. The functions ciBinomHalfWidth
,
ciBinomN
, and plotCiBinomDesign
can be used to investigate these
relationships for the case of binomial proportions.
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Newcombe, R.G. (1998b). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
ciBinomHalfWidth
, ciBinomN
,
ebinom
, binom.test
, prop.test
,
par
.
# Look at the relationship between half-width and sample size # for a one-sample confidence interval for a binomial proportion, # assuming an estimated proportion of 0.5 and a confidence level of # 95%. The jigsaw appearance of the plot is the result of using the # score method: dev.new() plotCiBinomDesign() #---------- # Redo the example above, but use the traditional (and inaccurate) # Wald method. dev.new() plotCiBinomDesign(ci.method = "Wald") #-------------------------------------------------------------------- # Plot sample size vs. the estimated proportion for various half-widths, # using a 95% confidence level and the adjusted Wald method: # NOTE: This example takes several seconds to run so it has been # commented out. Simply remove the pound signs (#) from in front # of the R commands to run it. #dev.new() #plotCiBinomDesign(x.var = "p.hat", y.var = "n", # half.width = 0.04, ylim = c(0, 600), main = "", # xlab = expression(hat(p))) # #plotCiBinomDesign(x.var = "p.hat", y.var = "n", # half.width = 0.05, add = TRUE, plot.col = 2) # #plotCiBinomDesign(x.var = "p.hat", y.var = "n", # half.width = 0.06, add = TRUE, plot.col = 3) # #legend(0.5, 150, paste("Half-Width =", c(0.04, 0.05, 0.06)), # lty = rep(1, 3), lwd = rep(2, 3), col=1:3, bty = "n") # #mtext(expression(paste("Sample Size vs. ", hat(p), # " for Confidence Interval for p")), line = 2.5, cex = 1.25) #mtext("with Confidence=95% and Various Values of Half-Width", # line = 1.5, cex = 1.25) #mtext(paste("CI Method = Score Normal Approximation", # "with Continuity Correction"), line = 0.5) #-------------------------------------------------------------------- # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), # look at the relationship between half-width and sample size # for a 95% confidence interval for the difference between the # proportion of detects at the background and compliance wells. # Use the estimated proportion of detects from the original data. # (The data are stored in EPA.89b.cadmium.df.) # Assume equal sample sizes at each well. EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background # .......................................... #86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance p.hat.back <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Background"])) p.hat.back #[1] 0.3333333 p.hat.comp <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Compliance"])) p.hat.comp #[1] 0.375 dev.new() plotCiBinomDesign(p.hat.or.p1.hat = p.hat.back, p2.hat = p.hat.comp, digits=3) #========== # Clean up #--------- rm(p.hat.back, p.hat.comp) graphics.off()
Create plots involving sample size, half-width, estimated standard deviation, and confidence level for a confidence interval for the mean of a normal distribution or the difference between two means.
plotCiNormDesign(x.var = "n", y.var = "half.width", range.x.var = NULL, n.or.n1 = 25, n2 = n.or.n1, half.width = sigma.hat/2, sigma.hat = 1, conf.level = 0.95, sample.type = ifelse(missing(n2), "one.sample", "two.sample"), round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, main = NULL, xlab = NULL, ylab = NULL, type = "l", ...)
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n.or.n1 |
numeric scalar indicating the sample size. The default value is |
n2 |
numeric scalar indicating the sample size for group 2.
The default value is the value of |
half.width |
positive numeric scalar indicating the half-width of the confidence interval.
The default value is |
sigma.hat |
positive numeric scalar specifying the estimated standard deviation.
The default value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the confidence interval.
The default value is |
sample.type |
character string indicating whether this is a one-sample or two-sample confidence interval. |
round.up |
logical scalar indicating whether to round up the computed sample sizes to the next largest integer.
The default value is |
n.max |
for the case when |
tol |
for the case when |
maxiter |
for the case when |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for ciNormHalfWidth
and ciNormN
for information on how to compute a one-sample confidence interval for the mean of
a normal distribution or a two-sample confidence interval for the difference between
two means, how the half-width is computed when other quantities are fixed, and how the
sample size is computed when other quantities are fixed.
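A minimal sketch of those two companion functions (argument names follow this help file; the values are illustrative only):

# Half-width of a 95% CI for a single mean, n = 25, estimated SD = 1
ciNormHalfWidth(n.or.n1 = 25, sigma.hat = 1, conf.level = 0.95)

# Sample size needed so that the half-width is at most 0.5
ciNormN(half.width = 0.5, sigma.hat = 1, conf.level = 0.95)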
plotCiNormDesign
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
The normal distribution and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with confidence intervals.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives
of the sampling program is to produce confidence intervals. The functions
ciNormHalfWidth
, ciNormN
, and plotCiNormDesign
can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-3.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapters 7 and 8.
ciNormHalfWidth
, ciNormN
, Normal
,
enorm
, t.test
,
Estimating Distribution Parameters.
# Look at the relationship between half-width and sample size # for a one-sample confidence interval for the mean, assuming # an estimated standard deviation of 1 and a confidence level of 95%. dev.new() plotCiNormDesign() #-------------------------------------------------------------------- # Plot sample size vs. the estimated standard deviation for # various levels of confidence, using a half-width of 0.5. dev.new() plotCiNormDesign(x.var = "sigma.hat", y.var = "n", main = "") plotCiNormDesign(x.var = "sigma.hat", y.var = "n", conf.level = 0.9, add = TRUE, plot.col = 2) plotCiNormDesign(x.var = "sigma.hat", y.var = "n", conf.level = 0.8, add = TRUE, plot.col = 3) legend(0.25, 60, c("95%", "90%", "80%"), lty = 1, lwd = 3, col = 1:3) mtext("Sample Size vs. Estimated SD for Confidence Interval for Mean", font = 2, cex = 1.25, line = 2.75) mtext("with Half-Width=0.5 and Various Confidence Levels", font = 2, cex = 1.25, line = 1.25) #-------------------------------------------------------------------- # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), # look at the relationship between half-width and sample size for a # 95% confidence interval for the mean level of Aldicarb at the # first compliance well. Use the estimated standard deviation from # the first four months of data. # (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #... mu.hat <- with(EPA.09.Ex.21.1.aldicarb.df, mean(Aldicarb.ppb[Well=="Well.1"])) mu.hat #[1] 23.1 sigma.hat <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well=="Well.1"])) sigma.hat #[1] 4.93491 dev.new() plotCiNormDesign(sigma.hat = sigma.hat, digits = 2, range.x.var = c(2, 25)) #========== # Clean up #--------- rm(mu.hat, sigma.hat) graphics.off()
Create plots involving sample size, quantile, and confidence level for a nonparametric confidence interval for a quantile.
plotCiNparDesign(x.var = "n", y.var = "conf.level", range.x.var = NULL, n = 25, p = 0.5, conf.level = 0.95, ci.type = "two.sided", lcl.rank = ifelse(ci.type == "upper", 0, 1), n.plus.one.minus.ucl.rank = ifelse(ci.type == "lower", 0, 1), plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is
|
p |
numeric scalar specifying the quantile. The value of this argument must be
between 0 and 1. The default value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the confidence interval.
The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
lcl.rank , n.plus.one.minus.ucl.rank
|
numeric vectors of non-negative integers indicating the ranks of the
order statistics that are used for the lower and upper bounds of the
confidence interval for the specified quantile(s). When |
plot.it |
a logical scalar indicating whether to create a plot or add to the
existing plot (see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for eqnpar
, ciNparConfLevel
,
and ciNparN
for information on how to compute a
nonparametric confidence interval for a quantile, how the confidence level
is computed when other quantities are fixed, and how the sample size is
computed when other quantities are fixed.
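For orientation, here is a minimal sketch (not part of the original examples) of the per-point computations that underlie this plot, assuming ciNparConfLevel and ciNparN accept the arguments shown with their default order-statistic ranks:
# Confidence level achieved by a two-sided nonparametric confidence interval
# for the median based on the extreme order statistics of n = 25 observations:
ciNparConfLevel(n = 25, p = 0.5, ci.type = "two.sided")
# Smallest sample size giving at least 95% confidence for a two-sided
# nonparametric confidence interval for the 90th percentile:
ciNparN(p = 0.9, ci.type = "two.sided", conf.level = 0.95)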
plotCiNparDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that
have been or would have been plotted.
See the help file for eqnpar
.
Steven P. Millard ([email protected])
See the help file for eqnpar
.
eqnpar
, ciNparConfLevel
,
ciNparN
.
# Look at the relationship between confidence level and sample size for # a two-sided nonparametric confidence interval for the 90'th percentile. dev.new() plotCiNparDesign(p = 0.9) #---------- # Plot sample size vs. quantile for various levels of confidence: dev.new() plotCiNparDesign(x.var = "p", y.var = "n", range.x.var = c(0.8, 0.95), ylim = c(0, 60), main = "") plotCiNparDesign(x.var = "p", y.var = "n", conf.level = 0.9, add = TRUE, plot.col = 2, plot.lty = 2) plotCiNparDesign(x.var = "p", y.var = "n", conf.level = 0.8, add = TRUE, plot.col = 3, plot.lty = 3) legend("topleft", c("95%", "90%", "80%"), lty = 1:3, col = 1:3, lwd = 3 * par('cex'), bty = 'n') title(main = paste("Sample Size vs. Quantile for ", "Nonparametric CI for \nQuantile, with ", "Various Confidence Levels", sep="")) #========== # Clean up #--------- graphics.off()
Create plots involving sample size, power, scaled difference, and significance level for a t-test for linear trend.
plotLinearTrendTestDesign(x.var = "n", y.var = "power", range.x.var = NULL, n = 12, slope.over.sigma = switch(alternative, greater = 0.1, less = -0.1, two.sided = ifelse(two.sided.direction == "greater", 0.1, -0.1)), alpha = 0.05, power = 0.95, alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = ifelse(x.var == "n", diff(range.x.var) + 1, 50), plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is
|
slope.over.sigma |
numeric scalar specifying the ratio of the true slope ( |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level associated
with the hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power associated with the
hypothesis test. The default value is |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
two.sided.direction |
character string indicating the direction (positive or negative) for the
scaled minimal detectable slope when |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed
sample size(s) to the next largest integer. The default value is
|
n.max |
for the case when |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
plot.it |
a logical scalar indicating whether to create a new plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for linearTrendTestPower
,
linearTrendTestN
, and linearTrendTestScaledMds
for
information on how to compute the power, sample size, or scaled minimal detectable
slope for a t-test for linear trend.
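As a minimal sketch (not part of the original examples), and assuming the companion functions accept the arguments shown, each point on the plot corresponds to a call such as:
# Power of the t-test for linear trend based on n = 12 equally spaced
# observations, a scaled slope of 0.1 per time unit, and a 5% significance level:
linearTrendTestPower(n = 12, slope.over.sigma = 0.1, alpha = 0.05)
# Sample size needed to achieve 95% power for the same scaled slope:
linearTrendTestN(slope.over.sigma = 0.1, alpha = 0.05, power = 0.95)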
plotLinearTrendTestDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that have
been or would have been plotted.
See the help files for linearTrendTestPower
.
Steven P. Millard ([email protected])
See the help file for linearTrendTestPower
.
linearTrendTestPower
, linearTrendTestN
,
linearTrendTestScaledMds
.
# Look at the relationship between power and sample size for the t-test for # linear trend, assuming a scaled slope of 0.1 and a 5% significance level: dev.new() plotLinearTrendTestDesign() #========== # Plot sample size vs. the scaled minimal detectable slope for various # levels of power, using a 5% significance level: dev.new() plotLinearTrendTestDesign(x.var = "slope.over.sigma", y.var = "n", ylim = c(0, 30), main = "") plotLinearTrendTestDesign(x.var = "slope.over.sigma", y.var = "n", power = 0.9, add = TRUE, plot.col = "red") plotLinearTrendTestDesign(x.var = "slope.over.sigma", y.var = "n", power = 0.8, add = TRUE, plot.col = "blue") legend("topright", c("95%", "90%", "80%"), lty = 1, bty = "n", lwd = 3 * par("cex"), col = c("black", "red", "blue")) title(main = paste("Sample Size vs. Scaled Slope for t-Test for Linear Trend", "with Alpha=0.05 and Various Powers", sep="\n")) #========== # Clean up #--------- graphics.off()
Plot power vs. the ratio of means for a sampling design for a test based on a
simultaneous prediction interval for a lognormal distribution.
plotPredIntLnormAltSimultaneousTestPowerCurve(n = 8, df = n - 1, n.geomean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", cv = 1, range.ratio.of.means = c(1, 5), pi.type = "upper", conf.level = 0.95, r.shifted = r, K.tol = .Machine$double.eps^(1/2), integrate.args.list = NULL, plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer greater than 2 indicating the sample size upon which
the prediction interval is based. The default value is |
df |
positive integer indicating the degrees of freedom associated with
the sample size. The default value is |
n.geomean |
positive integer specifying the sample size associated with the future geometric
mean(s). The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
cv |
positive value specifying the coefficient of variation for
both the population that was sampled to construct the prediction interval and
the population that will be sampled to produce the future observations. The
default value is |
range.ratio.of.means |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
r.shifted |
positive integer between |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
integrate.args.list |
a list of arguments to supply to the |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntLnormAltSimultaneousTestPower
for
information on how to compute the power of a hypothesis test for the difference
between two means of lognormal distributions based on a simultaneous prediction
interval for a lognormal distribution.
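The following minimal sketch (not part of the original examples) shows the single-point power computation that each point on the curve is based on, assuming predIntLnormAltSimultaneousTestPower accepts the arguments shown:
# Power of a 1-of-3 resampling plan based on n = 25 background samples,
# r = 2 future sampling occasions, and cv = 1, when the mean at the
# compliance well is 4 times the background mean:
predIntLnormAltSimultaneousTestPower(n = 25, k = 1, m = 3, r = 2,
  rule = "k.of.m", ratio.of.means = 4, cv = 1, pi.type = "upper",
  conf.level = 0.95)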
plotPredIntLnormAltSimultaneousTestPowerCurve
invisibly returns a list with
components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
ratio of means if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntLnormAltSimultaneousTestPower
and plotPredIntLnormAltSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of lognormally-distributed
observations.
Steven P. Millard ([email protected])
See the help file for predIntNormSimultaneous
.
predIntLnormAltSimultaneousTestPower
,
predIntLnormAltSimultaneous
, predIntLnormAlt
,
predIntLnormAltTestPower
, Prediction Intervals,
LognormalAlt.
# USEPA (2009) contains an example on page 19-23 that involves monitoring # nw=100 compliance wells at a large facility with minimal natural spatial # variation every 6 months for nc=20 separate chemicals. # There are n=25 background measurements for each chemical to use to create # simultaneous prediction intervals. We would like to determine which kind of # resampling plan based on normal distribution simultaneous prediction intervals to # use (1-of-m, 1-of-m based on means, or Modified California) in order to have # adequate power of detecting an increase in chemical concentration at any of the # 100 wells while at the same time maintaining a site-wide false positive rate # (SWFPR) of 10% per year over all 4,000 comparisons # (100 wells x 20 chemicals x semi-annual sampling). # The function predIntNormSimultaneousTestPower includes the argument "r" # that is the number of future sampling occasions (r=2 in this case because # we are performing semi-annual sampling), so to compute the individual test # Type I error level alpha.test (and thus the individual test confidence level), # we only need to worry about the number of wells (100) and the number of # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)). The individual # confidence level is simply 1-alpha.test. Plugging in 0.1 for alpha, # 100 for nw, and 20 for nc yields an individual test confidence level of # 1-alpha.test = 0.9999473. nc <- 20 nw <- 100 conf.level <- (1 - 0.1)^(1 / (nc * nw)) conf.level #[1] 0.9999473 # The help file for predIntNormSimultaneousTestPower shows how to # create the results below for various sampling plans: # Rule k m N.Mean K Power Total.Samples #1 k.of.m 1 2 1 3.16 0.39 2 #2 k.of.m 1 3 1 2.33 0.65 3 #3 k.of.m 1 4 1 1.83 0.81 4 #4 Modified.CA 1 4 1 2.57 0.71 4 #5 k.of.m 1 1 2 3.62 0.41 2 #6 k.of.m 1 2 2 2.33 0.85 4 #7 k.of.m 1 1 3 2.99 0.71 3 # The above table shows the K-multipliers for each prediction interval, along with # the power of detecting a change in concentration of three standard deviations at # any of the 100 wells during the course of a year, for each of the sampling # strategies considered. The last three rows of the table correspond to sampling # strategies that involve using the mean of two or three observations. # Here we will create a variation of this example based on # using a lognormal distribution and plotting power versus ratio of the # means assuming cv=1. # Here is the power curve for the 1-of-4 sampling strategy: dev.new() plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", conf.level = conf.level, ylim = c(0, 1), main = "") title(main = paste("Power Curves for 1-of-4 Sampling Strategy Based on 25 Background", "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) mtext("Assuming Lognormal Data with CV=1", line = 0) #---------- # Here are the power curves for the first four sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. 
#dev.new() #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, # rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, ylim = c(0, 1), main = "") #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 3, r = 2, # rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, add = TRUE, plot.col = "red", plot.lty = 2) #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, r = 2, # rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, add = TRUE, plot.col = "blue", plot.lty = 3) #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, r = 2, rule="Modified.CA", # range.ratio.of.means = c(1, 10), pi.type = "upper", conf.level = conf.level, # add = TRUE, plot.col = "green3", plot.lty = 4) #legend("topleft", c("1-of-4", "Modified CA", "1-of-3", "1-of-2"), # col = c("black", "green3", "red", "blue"), lty = c(1, 4, 2, 3), # lwd = 3 * par("cex"), bty = "n") #title(main = paste("Power Curves for 4 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #mtext("Assuming Lognormal Data with CV=1", line = 0) #---------- # Here are the power curves for the last 3 sampling strategies: # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. #dev.new() #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, n.geomean = 2, # r = 2, rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, ylim = c(0, 1), main = "") #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.geomean = 2, # r = 2, rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, add = TRUE, plot.col = "red", plot.lty = 2) #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.geomean = 3, # r = 2, rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, add = TRUE, plot.col = "blue", plot.lty = 3) #legend("topleft", c("1-of-2, Order 2", "1-of-1, Order 3", "1-of-1, Order 2"), # col = c("black", "blue", "red"), lty = c(1, 3, 2), lwd = 3 * par("cex"), # bty="n") #title(main = paste("Power Curves for 3 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #mtext("Assuming Lognormal Data with CV=1", line = 0) #========== # Clean up #--------- rm(nc, nw, conf.level) graphics.off()
Plot power vs. the ratio of means for a sampling design for a test based on a
prediction interval for a lognormal distribution.
plotPredIntLnormAltTestPowerCurve(n = 8, df = n - 1, n.geomean = 1, k = 1, cv = 1, range.ratio.of.means = c(1, 5), pi.type = "upper", conf.level = 0.95, plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer greater than 2 indicating the sample size upon which
the prediction interval is based. The default value is |
df |
positive integer indicating the degrees of freedom associated with
the sample size. The default value is |
n.geomean |
positive integer specifying the sample size associated with the future geometric
mean(s). The default value is |
k |
positive integer specifying the number of future observations that the
prediction interval should contain with confidence level |
cv |
positive value specifying the coefficient of variation for both the population
that was sampled to construct the prediction interval and the population that
will be sampled to produce the future observations. The default value is
|
range.ratio.of.means |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntLnormAltTestPower
for information on how to
compute the power of a hypothesis test for the ratio of two means of lognormal
distributions based on a prediction interval for a lognormal distribution.
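A single point on such a curve can be computed directly; the call below is a minimal sketch (not part of the original examples) and assumes predIntLnormAltTestPower accepts the arguments shown:
# Power of detecting a four-fold increase in the mean with k = 1 future
# observation, n = 8 background observations, cv = 1, and a 95% upper
# prediction limit:
predIntLnormAltTestPower(n = 8, k = 1, ratio.of.means = 4, cv = 1,
  pi.type = "upper", conf.level = 0.95)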
plotPredIntLnormAltTestPowerCurve
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help files for predIntNormTestPower
.
Steven P. Millard ([email protected])
See the help files for predIntNormTestPower
and
tTestLnormAltPower
.
predIntLnormAltTestPower
, predIntLnormAlt
, predIntNorm
, predIntNormK
, plotPredIntNormTestPowerCurve
, predIntLnormAltSimultaneous
, predIntLnormAltSimultaneousTestPower
,
Prediction Intervals, LognormalAlt.
# Plot power vs. ratio of means for k=1 future observation for # various sample sizes using a 5% significance level and assuming cv=1. dev.new() plotPredIntLnormAltTestPowerCurve(n = 8, k = 1, range.ratio.of.means=c(1, 10), ylim = c(0, 1), main = "") plotPredIntLnormAltTestPowerCurve(n = 16, k = 1, range.ratio.of.means = c(1, 10), add = TRUE, plot.col = "red") plotPredIntLnormAltTestPowerCurve(n = 32, k = 1, range.ratio.of.means=c(1, 10), add = TRUE, plot.col = "blue") legend("topleft", c("n=32", "n=16", "n=8"), lty = 1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Ratio of Means for Upper Prediction Interval", "with k=1, Confidence=95%, and Various Sample Sizes", sep="\n")) mtext("Assuming a Lognormal Distribution with CV = 1", line = 0) #========== ## Not run: # Pages 6-16 to 6-17 of USEPA (2009) present EPA Reference Power Curves (ERPC) # for groundwater monitoring: # # "Since effect sizes discussed in the next section often cannot or have not been # quantified, the Unified Guidance recommends using the ERPC as a suitable basis # of comparison for proposed testing procedures. Each reference power curve # corresponds to one of three typical yearly statistical evaluation schedules - # quarterly, semi-annual, or annual - and represents the cumulative power # achievable during a single year at one well-constituent pair by a 99 # (normal) prediction limit based on n = 10 background measurements and one new # measurement from the compliance well. # # Here we will create a variation of Figure 6-3 on page 6-17 based on # using a lognormal distribution and plotting power versus ratio of the # means assuming cv=1. dev.new() plotPredIntLnormAltTestPowerCurve(n = 10, k = 1, cv = 1, conf.level = 0.99, range.ratio.of.means = c(1, 10), ylim = c(0, 1), main="") plotPredIntLnormAltTestPowerCurve(n = 10, k = 2, cv = 1, conf.level = 0.99, range.ratio.of.means = c(1, 10), add = TRUE, plot.col = "red", plot.lty = 2) plotPredIntLnormAltTestPowerCurve(n = 10, k = 4, cv = 1, conf.level = 0.99, range.ratio.of.means = c(1, 10), add = TRUE, plot.col = "blue", plot.lty = 3) legend("topleft", c("Quarterly", "Semi-Annual", "Annual"), lty = 3:1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Ratio of Means for Upper Prediction Interval with", "n=10, Confidence=99%, and Various Sampling Frequencies", sep="\n")) mtext("Assuming a Lognormal Distribution with CV = 1", line = 0) ## End(Not run) #========== # Clean up #--------- graphics.off()
Create plots involving sample size, number of future observations, half-width,
estimated standard deviation, and confidence level for a prediction interval for
the next k observations from a normal distribution.
plotPredIntNormDesign(x.var = "n", y.var = "half.width", range.x.var = NULL, n = 25, k = 1, n.mean = 1, half.width = 4 * sigma.hat, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n |
positive integer greater than 1 indicating the sample size upon
which the prediction interval is based. The default value is |
k |
positive integer specifying the number of future observations
or averages the prediction interval should contain with confidence level
|
n.mean |
positive integer specifying the sample size associated with the |
half.width |
positive scalar indicating the half-widths of the prediction interval.
The default value is |
sigma.hat |
numeric scalar specifying the value of the estimated standard deviation.
The default value is |
method |
character string specifying the method to use if the number of future observations
( |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
round.up |
for the case when |
n.max |
for the case when |
tol |
numeric scalar indicating the tolerance to use in the |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for predIntNorm
, predIntNormK
,
predIntNormHalfWidth
, and predIntNormN
for
information on how to compute a prediction interval for the next
observations or averages from a normal distribution, how the half-width is
computed when other quantities are fixed, and how the
sample size is computed when other quantities are fixed.
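For orientation, here is a minimal sketch (not part of the original examples) of the two helper computations the plot is built on, assuming predIntNormHalfWidth and predIntNormN accept the arguments shown:
# Half-width of a 95% prediction interval for k = 1 future observation
# based on n = 25 observations and an estimated standard deviation of 1:
predIntNormHalfWidth(n = 25, k = 1, sigma.hat = 1, conf.level = 0.95)
# Sample size needed to achieve a half-width of 3 under the same settings:
predIntNormN(half.width = 3, k = 1, sigma.hat = 1, conf.level = 0.95)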
plotPredIntNormDesign
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for predIntNorm
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, confidence level, and half-width
if one of the objectives of the sampling program is to produce prediction intervals.
The functions predIntNormHalfWidth
, predIntNormN
, and
plotPredIntNormDesign
can be used to investigate these relationships for the
case of normally-distributed observations.
Steven P. Millard ([email protected])
See the help file for predIntNorm
.
predIntNorm
, predIntNormK
,
predIntNormHalfWidth
, predIntNormN
,
Normal
.
# Look at the relationship between half-width and sample size for a # prediction interval for k=1 future observation, assuming an estimated # standard deviation of 1 and a confidence level of 95%: dev.new() plotPredIntNormDesign() #========== # Plot sample size vs. the estimated standard deviation for various levels # of confidence, using a half-width of 4: dev.new() plotPredIntNormDesign(x.var = "sigma.hat", y.var = "n", range.x.var = c(1, 2), ylim = c(0, 90), main = "") plotPredIntNormDesign(x.var = "sigma.hat", y.var = "n", range.x.var = c(1, 2), conf.level = 0.9, add = TRUE, plot.col = "red") plotPredIntNormDesign(x.var = "sigma.hat", y.var = "n", range.x.var = c(1, 2), conf.level = 0.8, add = TRUE, plot.col = "blue") legend("topleft", c("95%", "90%", "80%"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Sample Size vs. Sigma Hat for Prediction Interval for", "k=1 Future Obs, Half-Width=4, and Various Confidence Levels", sep = "\n")) #========== # The data frame EPA.92c.arsenic3.df contains arsenic concentrations (ppb) # collected quarterly for 3 years at a background well and quarterly for # 2 years at a compliance well. Using the data from the background well, # plot the relationship between half-width and sample size for a two-sided # 90% prediction interval for k=4 future observations. EPA.92c.arsenic3.df # Arsenic Year Well.type #1 12.6 1 Background #2 30.8 1 Background #3 52.0 1 Background #... #18 3.8 5 Compliance #19 2.6 5 Compliance #20 51.9 5 Compliance mu.hat <- with(EPA.92c.arsenic3.df, mean(Arsenic[Well.type=="Background"])) mu.hat #[1] 27.51667 sigma.hat <- with(EPA.92c.arsenic3.df, sd(Arsenic[Well.type=="Background"])) sigma.hat #[1] 17.10119 dev.new() plotPredIntNormDesign(x.var = "n", y.var = "half.width", range.x.var = c(4, 50), k = 4, sigma.hat = sigma.hat, conf.level = 0.9) #========== # Clean up #--------- rm(mu.hat, sigma.hat) graphics.off()
Plot power vs. the scaled minimal detectable difference for a sampling design for a
test based on a simultaneous prediction interval for a normal distribution.
plotPredIntNormSimultaneousTestPowerCurve(n = 8, df = n - 1, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", range.delta.over.sigma = c(0, 5), pi.type = "upper", conf.level = 0.95, r.shifted = r, K.tol = .Machine$double.eps^(1/2), integrate.args.list = NULL, plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer greater than 2 indicating the sample size upon which
the prediction interval is based. The default value is |
df |
positive integer indicating the degrees of freedom associated with
the sample size. The default value is |
n.mean |
positive integer specifying the sample size associated with the future average(s).
The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
range.delta.over.sigma |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
r.shifted |
positive integer between |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
integrate.args.list |
a list of arguments to supply to the |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNormSimultaneousTestPower
for
information on how to compute the power of a hypothesis test for the difference
between two means of normal distributions based on a simultaneous prediction
interval for a normal distribution.
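The following minimal sketch (not part of the original examples) shows the single-point power computation that each point on the curve is based on, assuming predIntNormSimultaneousTestPower accepts the arguments shown:
# Power of a 1-of-3 resampling plan based on n = 25 background samples and
# r = 2 future sampling occasions when the true mean at the compliance well
# has shifted upward by 3 standard deviations:
predIntNormSimultaneousTestPower(n = 25, k = 1, m = 3, r = 2,
  rule = "k.of.m", delta.over.sigma = 3, pi.type = "upper",
  conf.level = 0.95)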
plotPredIntNormSimultaneousTestPowerCurve
invisibly returns a list with
components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNormSimultaneousTestPower
and plotPredIntNormSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
Steven P. Millard ([email protected])
See the help file for predIntNormSimultaneous
.
predIntNormSimultaneous
, predIntNormSimultaneousK
,
predIntNormSimultaneousTestPower
,
predIntNorm
, predIntNormK
,
predIntNormTestPower
, Prediction Intervals,
Normal.
# USEPA (2009) contains an example on page 19-23 that involves monitoring # nw=100 compliance wells at a large facility with minimal natural spatial # variation every 6 months for nc=20 separate chemicals. # There are n=25 background measurements for each chemical to use to create # simultaneous prediction intervals. We would like to determine which kind of # resampling plan based on normal distribution simultaneous prediction intervals to # use (1-of-m, 1-of-m based on means, or Modified California) in order to have # adequate power of detecting an increase in chemical concentration at any of the # 100 wells while at the same time maintaining a site-wide false positive rate # (SWFPR) of 10% per year over all 4,000 comparisons # (100 wells x 20 chemicals x semi-annual sampling). # The function predIntNormSimultaneousTestPower includes the argument "r" # that is the number of future sampling occasions (r=2 in this case because # we are performing semi-annual sampling), so to compute the individual test # Type I error level alpha.test (and thus the individual test confidence level), # we only need to worry about the number of wells (100) and the number of # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)). The individual # confidence level is simply 1-alpha.test. Plugging in 0.1 for alpha, # 100 for nw, and 20 for nc yields an individual test confidence level of # 1-alpha.test = 0.9999473. nc <- 20 nw <- 100 conf.level <- (1 - 0.1)^(1 / (nc * nw)) conf.level #[1] 0.9999473 # The help file for predIntNormSimultaneousTestPower shows how to # create the results below for various sampling plans: # Rule k m N.Mean K Power Total.Samples #1 k.of.m 1 2 1 3.16 0.39 2 #2 k.of.m 1 3 1 2.33 0.65 3 #3 k.of.m 1 4 1 1.83 0.81 4 #4 Modified.CA 1 4 1 2.57 0.71 4 #5 k.of.m 1 1 2 3.62 0.41 2 #6 k.of.m 1 2 2 2.33 0.85 4 #7 k.of.m 1 1 3 2.99 0.71 3 # The above table shows the K-multipliers for each prediction interval, along with # the power of detecting a change in concentration of three standard deviations at # any of the 100 wells during the course of a year, for each of the sampling # strategies considered. The last three rows of the table correspond to sampling # strategies that involve using the mean of two or three observations. # Here is the power curve for the 1-of-4 sampling strategy: dev.new() plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, xlab = "SD Units Above Background", main = "") title(main = paste( "Power Curve for 1-of-4 Sampling Strategy Based on 25 Background", "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #---------- # Here are the power curves for the first four sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. 
#dev.new() #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, # xlab = "SD Units Above Background", main = "") #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 3, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "red", plot.lty = 2) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "blue", plot.lty = 3) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, r = 2, rule="Modified.CA", # pi.type = "upper", conf.level = conf.level, add = TRUE, plot.col = "green3", # plot.lty = 4) #legend(0, 1, c("1-of-4", "Modified CA", "1-of-3", "1-of-2"), # col = c("black", "green3", "red", "blue"), lty = c(1, 4, 2, 3), # lwd = 3 * par("cex"), bty = "n") #title(main = paste("Power Curves for 4 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #---------- # Here are the power curves for the last 3 sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. #dev.new() #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, n.mean = 2, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, # xlab = "SD Units Above Background", main = "") #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.mean = 2, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "red", plot.lty = 2) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.mean = 3, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "blue", plot.lty = 3) #legend(0, 1, c("1-of-2, Order 2", "1-of-1, Order 3", "1-of-1, Order 2"), # col = c("black", "blue", "red"), lty = c(1, 3, 2), lwd = 3 * par("cex"), # bty="n") #title(main = paste("Power Curves for 3 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #========== # Clean up #--------- rm(nc, nw, conf.level) graphics.off()
# USEPA (2009) contains an example on page 19-23 that involves monitoring # nw=100 compliance wells at a large facility with minimal natural spatial # variation every 6 months for nc=20 separate chemicals. # There are n=25 background measurements for each chemical to use to create # simultaneous prediction intervals. We would like to determine which kind of # resampling plan based on normal distribution simultaneous prediction intervals to # use (1-of-m, 1-of-m based on means, or Modified California) in order to have # adequate power of detecting an increase in chemical concentration at any of the # 100 wells while at the same time maintaining a site-wide false positive rate # (SWFPR) of 10% per year over all 4,000 comparisons # (100 wells x 20 chemicals x semi-annual sampling). # The function predIntNormSimultaneousTestPower includes the argument "r" # that is the number of future sampling occasions (r=2 in this case because # we are performing semi-annual sampling), so to compute the individual test # Type I error level alpha.test (and thus the individual test confidence level), # we only need to worry about the number of wells (100) and the number of # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)). The individual # confidence level is simply 1-alpha.test. Plugging in 0.1 for alpha, # 100 for nw, and 20 for nc yields an individual test confidence level of # 1-alpha.test = 0.9999473. nc <- 20 nw <- 100 conf.level <- (1 - 0.1)^(1 / (nc * nw)) conf.level #[1] 0.9999473 # The help file for predIntNormSimultaneousTestPower shows how to # create the results below for various sampling plans: # Rule k m N.Mean K Power Total.Samples #1 k.of.m 1 2 1 3.16 0.39 2 #2 k.of.m 1 3 1 2.33 0.65 3 #3 k.of.m 1 4 1 1.83 0.81 4 #4 Modified.CA 1 4 1 2.57 0.71 4 #5 k.of.m 1 1 2 3.62 0.41 2 #6 k.of.m 1 2 2 2.33 0.85 4 #7 k.of.m 1 1 3 2.99 0.71 3 # The above table shows the K-multipliers for each prediction interval, along with # the power of detecting a change in concentration of three standard deviations at # any of the 100 wells during the course of a year, for each of the sampling # strategies considered. The last three rows of the table correspond to sampling # strategies that involve using the mean of two or three observations. # Here is the power curve for the 1-of-4 sampling strategy: dev.new() plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, xlab = "SD Units Above Background", main = "") title(main = paste( "Power Curve for 1-of-4 Sampling Strategy Based on 25 Background", "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #---------- # Here are the power curves for the first four sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. 
#dev.new() #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, # xlab = "SD Units Above Background", main = "") #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 3, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "red", plot.lty = 2) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "blue", plot.lty = 3) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, r = 2, rule="Modified.CA", # pi.type = "upper", conf.level = conf.level, add = TRUE, plot.col = "green3", # plot.lty = 4) #legend(0, 1, c("1-of-4", "Modified CA", "1-of-3", "1-of-2"), # col = c("black", "green3", "red", "blue"), lty = c(1, 4, 2, 3), # lwd = 3 * par("cex"), bty = "n") #title(main = paste("Power Curves for 4 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #---------- # Here are the power curves for the last 3 sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. #dev.new() #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, n.mean = 2, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, # xlab = "SD Units Above Background", main = "") #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.mean = 2, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "red", plot.lty = 2) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.mean = 3, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "blue", plot.lty = 3) #legend(0, 1, c("1-of-2, Order 2", "1-of-1, Order 3", "1-of-1, Order 2"), # col = c("black", "blue", "red"), lty = c(1, 3, 2), lwd = 3 * par("cex"), # bty="n") #title(main = paste("Power Curves for 3 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #========== # Clean up #--------- rm(nc, nw, conf.level) graphics.off()
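As a rough cross-check of the table above, the K-multiplier and power for a single plan can also be computed directly. The following is a minimal sketch, assuming the delta.over.sigma argument and the other argument names documented for predIntNormSimultaneousK and predIntNormSimultaneousTestPower; it should approximately reproduce the 1-of-4 row (K = 1.83, power = 0.81).

nc <- 20
nw <- 100
conf.level <- (1 - 0.1)^(1 / (nc * nw))

# K-multiplier for the 1-of-4 plan based on n = 25 background samples
# and r = 2 future sampling occasions (semi-annual sampling)
predIntNormSimultaneousK(n = 25, k = 1, m = 4, r = 2, rule = "k.of.m",
  pi.type = "upper", conf.level = conf.level)

# Power to detect an increase of 3 standard deviations above background
predIntNormSimultaneousTestPower(n = 25, k = 1, m = 4, r = 2,
  rule = "k.of.m", delta.over.sigma = 3, pi.type = "upper",
  conf.level = conf.level)

rm(nc, nw, conf.level)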
Plot power vs. delta/sigma (scaled minimal detectable difference) for a
sampling design for a test based on a prediction interval for a normal distribution.
plotPredIntNormTestPowerCurve(n = 8, df = n - 1, n.mean = 1, k = 1, range.delta.over.sigma = c(0, 5), pi.type = "upper", conf.level = 0.95, plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer greater than 2 indicating the sample size upon which
the prediction interval is based. The default value is |
df |
positive integer indicating the degrees of freedom associated with
the sample size. The default value is |
n.mean |
positive integer specifying the sample size associated with the future average(s).
The default value is |
k |
positive integer specifying the number of future observations that the
prediction interval should contain with confidence level |
range.delta.over.sigma |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNormTestPower
for information on how to
compute the power of a hypothesis test for the difference between two means of
normal distributions based on a prediction interval for a normal distribution.
plotPredIntNormTestPowerCurve
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help files for predIntNorm
and
predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNormTestPower
and plotPredIntNormTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations. In the case of a simple shift between the two means, the test based
on a prediction interval is not as powerful as the two-sample t-test. However, the
test based on a prediction interval is more efficient at detecting a shift in the
tail.
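Because plotPredIntNormTestPowerCurve invisibly returns the plotted coordinates, individual power values can also be obtained without drawing anything by setting plot.it = FALSE. The following is a minimal sketch using only arguments shown in the Usage section above; the interpolated power at a scaled difference of 3 is approximate because it is based on the 20 plotted points.

# Compute, but do not plot, the power curve based on n = 10 background
# measurements and a 99% upper prediction limit
pow <- plotPredIntNormTestPowerCurve(n = 10, k = 1, conf.level = 0.99,
  plot.it = FALSE)

# Approximate power at delta/sigma = 3, interpolated from the curve
approx(pow$x.var, pow$y.var, xout = 3)$y

rm(pow)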
Steven P. Millard ([email protected])
See the help files for predIntNorm
and
predIntNormSimultaneous
.
predIntNorm
, predIntNormK
,
predIntNormTestPower
, predIntNormSimultaneous
, predIntNormSimultaneousK
,
predIntNormSimultaneousTestPower
, Prediction Intervals,
Normal.
# Pages 6-16 to 6-17 of USEPA (2009) present EPA Reference Power Curves (ERPC) # for groundwater monitoring: # # "Since effect sizes discussed in the next section often cannot or have not been # quantified, the Unified Guidance recommends using the ERPC as a suitable basis # of comparison for proposed testing procedures. Each reference power curve # corresponds to one of three typical yearly statistical evaluation schedules - # quarterly, semi-annual, or annual - and represents the cumulative power # achievable during a single year at one well-constituent pair by a 99% upper # (normal) prediction limit based on n = 10 background measurements and one new # measurement from the compliance well." # # Here we will reproduce Figure 6-3 on page 6-17. dev.new() plotPredIntNormTestPowerCurve(n = 10, k = 1, conf.level = 0.99, ylim = c(0, 1), main="") plotPredIntNormTestPowerCurve(n = 10, k = 2, conf.level = 0.99, add = TRUE, plot.col = "red", plot.lty = 2) plotPredIntNormTestPowerCurve(n = 10, k = 4, conf.level = 0.99, add = TRUE, plot.col = "blue", plot.lty = 3) legend("topleft", c("Quarterly", "Semi-Annual", "Annual"), lty = 3:1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Delta/Sigma for Upper Prediction Interval with", "n=10, Confidence=99%, and Various Sampling Frequencies", sep="\n")) #========== ## Not run: # Plot power vs. scaled minimal detectable difference for various sample sizes # using a 5% significance level. dev.new() plotPredIntNormTestPowerCurve(n = 8, k = 1, ylim = c(0, 1), main="") plotPredIntNormTestPowerCurve(n = 16, k = 1, add = TRUE, plot.col = "red") plotPredIntNormTestPowerCurve(n = 32, k = 1, add = TRUE, plot.col = "blue") legend("bottomright", c("n=32", "n=16", "n=8"), lty = 1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Delta/Sigma for Upper Prediction Interval with", "k=1, Confidence=95%, and Various Sample Sizes", sep="\n")) #========== # Clean up #--------- graphics.off() ## End(Not run)
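Single points on the reference power curves above can also be computed directly with predIntNormTestPower. The following is a minimal sketch, assuming the delta.over.sigma argument documented for that function.

# Power at a true upward shift of 3 standard deviations for the
# annual (k = 1) and quarterly (k = 4) evaluation schedules
predIntNormTestPower(n = 10, k = 1, delta.over.sigma = 3,
  pi.type = "upper", conf.level = 0.99)
predIntNormTestPower(n = 10, k = 4, delta.over.sigma = 3,
  pi.type = "upper", conf.level = 0.99)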
Create plots involving sample size (n), number of future observations
(m), minimum number of future observations the interval should contain
(k), and confidence level ((1-alpha)100%) for a nonparametric prediction
interval.
plotPredIntNparDesign(x.var = "n", y.var = "conf.level", range.x.var = NULL, n = max(25, lpl.rank + n.plus.one.minus.upl.rank + 1), k = 1, m = ifelse(x.var == "k", ceiling(max.x), 1), conf.level = 0.95, pi.type = "two.sided", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), n.max = 5000, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is |
k |
positive integer specifying the minimum number of future observations out of |
m |
positive integer specifying the number of future observations. The default value is
|
conf.level |
numeric scalar between 0 and 1 indicating the confidence level
associated with the prediction interval. The default value is
|
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
lpl.rank |
non-negative integer indicating the rank of the order statistic to use for
the lower bound of the prediction interval. If |
n.plus.one.minus.upl.rank |
non-negative integer related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of
|
n.max |
for the case when |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
plot.it |
a logical scalar indicating whether to create a plot or add to the
existing plot (see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNpar
, predIntNparConfLevel
,
and predIntNparN
for information on how to compute a
nonparametric prediction interval, how the confidence level
is computed when other quantities are fixed, and how the sample size is
computed when other quantities are fixed.
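As a small numerical check, the quantities that this function plots can be computed one at a time. The following is a minimal sketch, assuming the argument names documented for predIntNparConfLevel and predIntNparN.

# Confidence level of a two-sided nonparametric prediction interval for
# the next m = 1 observation, based on the minimum and maximum of
# n = 25 observations; this should equal (n - 1)/(n + 1) = 24/26
predIntNparConfLevel(n = 25, k = 1, m = 1, pi.type = "two.sided")

# Smallest sample size giving at least 95% confidence for the same interval
predIntNparN(k = 1, m = 1, pi.type = "two.sided", conf.level = 0.95)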
plotPredIntNparDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that
have been or would have been plotted.
See the help file for predIntNpar
.
Steven P. Millard ([email protected])
See the help file for predIntNpar
.
predIntNpar
, predIntNparConfLevel
,
predIntNparN
.
# Look at the relationship between confidence level and sample size for a # two-sided nonparametric prediction interval for the next m=1 future observation. dev.new() plotPredIntNparDesign() #========== # Plot confidence level vs. sample size for various values of number of # future observations (m): dev.new() plotPredIntNparDesign(k = 1, m = 1, ylim = c(0, 1), main = "") plotPredIntNparDesign(k = 2, m = 2, add = TRUE, plot.col = "red") plotPredIntNparDesign(k = 3, m = 3, add = TRUE, plot.col = "blue") legend("bottomright", c("m=1", "m=2", "m=3"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Confidence Level vs. Sample Size for Nonparametric PI", "with Various Values of m", sep="\n")) #========== # Example 18-3 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # 4 future observations of trichloroethylene (TCE) at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.3.TCE.df. # There are 6 monthly observations of TCE (ppb) at 3 background wells, # and 4 monthly observations of TCE at a compliance well. # # Modify this example by creating a plot to look at confidence level versus # sample size (i.e., number of observations at the background wells) for # predicting the next m = 4 future observations when constructing a one-sided # upper prediction interval based on the maximum value. dev.new() plotPredIntNparDesign(k = 4, m = 4, pi.type = "upper") #========== # Clean up #--------- graphics.off()
Create plots involving sample size (n), number of future observations
(m), minimum number of future observations the interval should contain
(k), number of future sampling occasions (r), and confidence level
for a simultaneous nonparametric prediction interval.
plotPredIntNparSimultaneousDesign(x.var = "n", y.var = "conf.level", range.x.var = NULL, n = max(25, lpl.rank + n.plus.one.minus.upl.rank + 1), n.median = 1, k = 1, m = ifelse(x.var == "k", ceiling(max.x), 1), r = 2, rule = "k.of.m", conf.level = 0.95, pi.type = "upper", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), n.max = 5000, maxiter = 1000, integrate.args.list = NULL, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is |
n.median |
positive odd integer specifying the sample size associated with the future medians.
The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
conf.level |
numeric scalar between 0 and 1 indicating the confidence level
associated with the prediction interval. The default value is
|
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
lpl.rank |
non-negative integer indicating the rank of the order statistic to use for
the lower bound of the prediction interval. If |
n.plus.one.minus.upl.rank |
non-negative integer related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of
|
n.max |
numeric scalar indicating the maximum sample size to consider when |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
integrate.args.list |
list of arguments to supply to the |
plot.it |
a logical scalar indicating whether to create a plot or add to the
existing plot (see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNparSimultaneous
,
predIntNparSimultaneousConfLevel
, and predIntNparSimultaneousN
for information on how to compute a
simultaneous nonparametric prediction interval, how the confidence level
is computed when other quantities are fixed, and how the sample size is
computed when other quantities are fixed.
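The quantities plotted by this function can likewise be computed one at a time. A minimal sketch, assuming the argument names documented for predIntNparSimultaneousConfLevel and predIntNparSimultaneousN:

# Confidence level of an upper simultaneous nonparametric prediction
# interval based on the maximum of n = 20 background observations,
# using the 1-of-3 rule at each of r = 20 future sampling occasions
predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 3, r = 20,
  rule = "k.of.m", pi.type = "upper")

# Sample size needed to achieve 95% confidence for the same design
predIntNparSimultaneousN(k = 1, m = 3, r = 20, rule = "k.of.m",
  pi.type = "upper", conf.level = 0.95)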
plotPredIntNparSimultaneousDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that
have been or would have been plotted.
See the help file for predIntNparSimultaneous
.
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
predIntNparSimultaneous
,
predIntNparSimultaneousConfLevel
, predIntNparSimultaneousN
,
predIntNparSimultaneousTestPower
,
predIntNpar
, tolIntNpar
.
# For the 1-of-3 rule with r=20 future sampling occasions, look at the # relationship between confidence level and sample size for a one-sided # upper simultaneous nonparametric prediction interval. dev.new() plotPredIntNparSimultaneousDesign(k = 1, m = 3, r = 20, range.x.var = c(2, 20)) #========== # Plot confidence level vs. sample size for various values of number of # future sampling occasions (r): dev.new() plotPredIntNparSimultaneousDesign(m = 3, r = 10, rule = "CA", ylim = c(0, 1), main = "") plotPredIntNparSimultaneousDesign(m = 3, r = 20, rule = "CA", add = TRUE, plot.col = "red") plotPredIntNparSimultaneousDesign(m = 3, r = 30, rule = "CA", add = TRUE, plot.col = "blue") legend("bottomright", c("r=10", "r=20", "r=30"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Confidence Level vs. Sample Size for Simultaneous", "Nonparametric PI with Various Values of r", sep="\n")) #========== # Modifying Example 19-5 of USEPA (2009, p. 19-33), plot confidence level # versus sample size (number of background observations required) for # a 1-of-3 plan assuming r = 10 compliance wells (future sampling occasions). dev.new() plotPredIntNparSimultaneousDesign(k = 1, m = 3, r = 10, rule = "k.of.m") #========== # Clean up #--------- graphics.off()
Plot power vs. delta/sigma (scaled minimal detectable difference) for a
sampling design for a test based on a nonparametric simultaneous prediction
interval. The power is based on assuming the true distribution of the
observations is normal.
plotPredIntNparSimultaneousTestPowerCurve(n = 8, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "upper", r.shifted = r, integrate.args.list = NULL, method = "approx", NMC = 100, range.delta.over.sigma = c(0, 5), plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer specifying the sample sizes. |
n.median |
positive odd integer specifying the sample size associated with the
future medians. The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
non-negative integer indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
non-negative integer related to the rank of the order statistic to use for
the upper
bound of the prediction interval. A value of |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
r.shifted |
integer between |
integrate.args.list |
list of arguments to supply to the |
method |
character string indicating what method to use to compute the power. The possible
values are |
NMC |
positive integer indicating the number of Monte Carlo trials to run when |
range.delta.over.sigma |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNparSimultaneousTestPower
for
information on how to compute the power of a hypothesis test for the difference
between two means of normal distributions based on a nonparametric simultaneous
prediction interval.
plotPredIntNparSimultaneousTestPowerCurve
invisibly returns a list with
components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for predIntNparSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNparSimultaneousTestPower
and plotPredIntNparSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
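A single point on one of these power curves can also be computed directly. The following is a minimal sketch, assuming the delta.over.sigma argument and the other argument names documented for predIntNparSimultaneousTestPower.

# Approximate power of the 1-of-4 rule, with the 3rd largest of n = 20
# background observations as the upper prediction limit, to detect an
# upward shift of 3 standard deviations at 1 of r = 10 future sampling
# occasions (normal observations assumed)
predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = 10,
  rule = "k.of.m", n.plus.one.minus.upl.rank = 3, pi.type = "upper",
  r.shifted = 1, delta.over.sigma = 3, method = "approx")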
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
Gansecki, M. (2009). Using the Optimal Rank Values Calculator. US Environmental Protection Agency, Region 8, March 10, 2009.
predIntNparSimultaneousTestPower
,
predIntNparSimultaneous
,
predIntNparSimultaneousN
,
predIntNparSimultaneousConfLevel
,
plotPredIntNparSimultaneousDesign
,
predIntNpar
, tolIntNpar
.
# Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year (i.e., r = 10 * 1 = 10). # Here we will reproduce Figure 19-2 on page 19-35. This figure plots the # power of the nonparametric simultaneous prediction interval for 6 different # plans: # Rule Median.n k m Order.Statistic Achieved.alpha BG.Limit #1) k.of.m 1 1 3 Max 0.0055 0.28 #2) k.of.m 1 1 4 Max 0.0009 0.28 #3) Modified.CA 1 1 4 Max 0.0140 0.28 #4) k.of.m 3 1 2 Max 0.0060 0.28 #5) k.of.m 1 1 4 2nd 0.0046 0.25 #6) k.of.m 1 1 4 3rd 0.0135 0.24 # Here is the power curve for the 1-of-4 sampling strategy. dev.new() plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 4, r = 10, rule = "k.of.m", n.plus.one.minus.upl.rank = 3, pi.type = "upper", r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), main = "") title(main = paste( "Power Curve for Nonparametric 1-of-4 Sampling Strategy Based on", "20 Background Samples, SWFPR=10%, and 10 Future Sampling Occasions", sep = "\n"), cex.main = 1.1) #---------- # Here are the power curves for all 6 sampling strategies. # Because these take several seconds to create, here we have commented out # the R commands. To run this example, just remove the pound signs (#) from # in front of the R commands.
#dev.new() #plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 4, r = 10, # rule = "k.of.m", n.plus.one.minus.upl.rank = 3, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), main = "") #plotPredIntNparSimultaneousTestPowerCurve(n = 20, n.median = 3, k = 1, m = 2, # r = 10, rule = "k.of.m", n.plus.one.minus.upl.rank = 1, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), # add = TRUE, plot.col = 2, plot.lty = 2) #plotPredIntNparSimultaneousTestPowerCurve(n = 20, r = 10, rule = "Modified.CA", # n.plus.one.minus.upl.rank = 1, pi.type = "upper", r.shifted = 1, # method = "approx", range.delta.over.sigma = c(0, 5), add = TRUE, # plot.col = 3, plot.lty = 3) #plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 4, r = 10, # rule = "k.of.m", n.plus.one.minus.upl.rank = 2, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), # add = TRUE, plot.col = 4, plot.lty = 4) #plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 3, r = 10, # rule = "k.of.m", n.plus.one.minus.upl.rank = 1, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), # add = TRUE, plot.col = 5, plot.lty = 5) #plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 4, r = 10, # rule = "k.of.m", n.plus.one.minus.upl.rank = 1, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), # add = TRUE, plot.col = 6, plot.lty = 6) #legend("topleft", legend = c("1-of-4, 3rd", "1-of-2, Max, Median", "Mod CA", # "1-of-4, 2nd", "1-of-3, Max", "1-of-4, Max"), lwd = 3 * par("cex"), # col = 1:6, lty = 1:6, bty = "n") #title(main = "Figure 19-2. Comparison of Full Power Curves") #========== # Clean up #--------- graphics.off()
Create plots involving sample size, power, difference, and significance level for a one- or two-sample proportion test.
plotPropTestDesign(x.var = "n", y.var = "power", range.x.var = NULL, n.or.n1 = 25, n2 = n.or.n1, ratio = 1, p.or.p1 = switch(alternative, greater = 0.6, less = 0.4, two.sided = ifelse(two.sided.direction == "greater", 0.6, 0.4)), p0.or.p2 = 0.5, alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2) || !missing(ratio), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = TRUE, correct = sample.type == "two.sample", round.up = FALSE, warn = TRUE, n.min = 2, n.max = 10000, tol.alpha = 0.1 * alpha, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 50, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n.or.n1 |
numeric scalar indicating the sample size. The default value is
|
n2 |
numeric scalar indicating the sample size for group 2. The default value
is the value of |
ratio |
numeric vector indicating the ratio of sample size in group 2 to sample
size in group 1 |
p.or.p1 |
numeric vector of proportions. When |
p0.or.p2 |
numeric vector of proportions. When |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level associated
with the hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power associated with the
hypothesis test. The default value is |
sample.type |
character string indicating whether the design is based on a one-sample or
two-sample proportion test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible
values are |
two.sided.direction |
character string indicating the direction (positive or negative) for the minimal
detectable difference when |
approx |
logical scalar indicating whether to compute the power, sample size, or minimal
detectable difference based on the normal approximation to the binomial distribution.
The default value is |
correct |
logical scalar indicating whether to use the continuity correction when |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next smallest integer. The default value is |
warn |
logical scalar indicating whether to issue a warning. The default value is |
n.min |
integer relevant to the case when |
n.max |
integer relevant to the case when |
tol.alpha |
numeric vector relevant to the case when |
tol |
numeric scalar relevant to the case when |
maxiter |
integer relevant to the case when |
plot.it |
a logical scalar indicating whether to create a new plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for propTestPower
, propTestN
, and
propTestMdd
for information on how to compute the power, sample size,
or minimal detectable difference for a one- or two-sample proportion test.
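The quantities plotted by this function can be computed individually as well. A minimal sketch, assuming the argument names documented for propTestPower and propTestN:

# Power of a one-sided, one-sample proportion test of H0: p <= 0.1
# versus H1: p > 0.1 based on n = 40 observations, assuming the true
# proportion is 0.3 and a 5% Type I error level
propTestPower(n.or.n1 = 40, p.or.p1 = 0.3, p0.or.p2 = 0.1, alpha = 0.05,
  sample.type = "one.sample", alternative = "greater")

# Sample size needed for 90% power under the same assumptions
propTestN(p.or.p1 = 0.3, p0.or.p2 = 0.1, alpha = 0.05, power = 0.9,
  sample.type = "one.sample", alternative = "greater")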
plotPropTestDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that have
been or would have been plotted.
See the help files for propTestPower
, propTestN
, and
propTestMdd
.
Steven P. Millard ([email protected])
See the help files for propTestPower
, propTestN
, and
propTestMdd
.
propTestPower
, propTestN
,
propTestMdd
, Binomial,
binom.test
, prop.test
.
# Look at the relationship between power and sample size for a # one-sample proportion test, assuming the true proportion is 0.6, the # hypothesized proportion is 0.5, and a 5% significance level. # Compute the power based on the normal approximation to the binomial # distribution. dev.new() plotPropTestDesign() #---------- # For a two-sample proportion test, plot sample size vs. the minimal detectable # difference for various levels of power, using a 5% significance level and a # two-sided alternative: dev.new() plotPropTestDesign(x.var = "delta", y.var = "n", sample.type = "two", ylim = c(0, 2800), main="") plotPropTestDesign(x.var = "delta", y.var = "n", sample.type = "two", power = 0.9, add = TRUE, plot.col = "red") plotPropTestDesign(x.var = "delta", y.var = "n", sample.type = "two", power = 0.8, add = TRUE, plot.col = "blue") legend("topright", c("95%", "90%", "80%"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Sample Size vs. Minimal Detectable Difference for Two-Sample", "Proportion Test with p2=0.5, Alpha=0.05 and Various Powers", sep = "\n")) #========== # Example 22-3 on page 22-20 of USEPA (2009) involves determining whether more than # 10% of chlorine gas containers are stored at pressures above a compliance limit. # We want to test the one-sided null hypothesis that 10% or fewer of the containers # are stored at pressures greater than the compliance limit versus the alternative # that more than 10% are stored at pressures greater than the compliance limit. # We want to have at least 90% power of detecting a true proportion of 30% or # greater, using a 5% Type I error level. # Here we will modify this example and create a plot of power versus # sample size for various assumed minimal detectable differences, # using a 5% Type I error level. dev.new() plotPropTestDesign(x.var = "n", y.var = "power", sample.type = "one", alternative = "greater", p0.or.p2 = 0.1, p.or.p1 = 0.25, range.x.var = c(20, 50), ylim = c(0.6, 1), main = "") plotPropTestDesign(x.var = "n", y.var = "power", sample.type = "one", alternative = "greater", p0.or.p2 = 0.1, p.or.p1 = 0.3, range.x.var = c(20, 50), add = TRUE, plot.col = "red") plotPropTestDesign(x.var = "n", y.var = "power", sample.type = "one", alternative = "greater", p0.or.p2 = 0.1, p.or.p1 = 0.35, range.x.var = c(20, 50), add = TRUE, plot.col = "blue") legend("bottomright", c("p=0.35", "p=0.3", "p=0.25"), lty = 1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Sample Size for One-Sided One-Sample Proportion", "Test with p0=0.1, Alpha=0.05 and Various Detectable Differences", sep = "\n")) #========== # Clean up #--------- graphics.off()
Create plots involving sample size, half-width, estimated standard deviation, coverage, and confidence level for a tolerance interval for a normal distribution.
plotTolIntNormDesign(x.var = "n", y.var = "half.width", range.x.var = NULL, n = 25, half.width = ifelse(x.var == "sigma.hat", 3 * max.x, 3 * sigma.hat), sigma.hat = 1, coverage = 0.95, conf.level = 0.95, cov.type = "content", round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = 1, plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis. Possible values
are |
y.var |
character string indicating what variable to use for the y-axis. Possible values
are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n |
positive integer greater than 1 indicating the sample size upon
which the tolerance interval is based. The default value is |
half.width |
positive scalar indicating the half-width of the prediction interval.
The default value depends on the value of |
sigma.hat |
numeric scalar specifying the value of the estimated standard deviation.
The default value is |
coverage |
numeric scalar between 0 and 1 indicating the desired coverage of the
tolerance interval. The default value is |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
tolerance interval. The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval. The
possible values are |
round.up |
for the case when |
n.max |
for the case when |
tol |
for the case when |
maxiter |
for the case when |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for tolIntNorm
, tolIntNormK
,
tolIntNormHalfWidth
, and tolIntNormN
for information
on how to compute a tolerance interval for a normal distribution, how the
half-width is computed when other quantities are fixed, and how the sample size
is computed when other quantities are fixed.
plotTolIntNormDesign
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for tolIntNorm
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, confidence level, and half-width
if one of the objectives of the sampling program is to produce tolerance intervals.
The functions tolIntNormHalfWidth
, tolIntNormN
, and
plotTolIntNormDesign
can be used to investigate these relationships for the
case of normally-distributed observations.
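The half-width and sample-size calculations that underlie these plots can also be run individually. A minimal sketch, assuming the argument names documented for tolIntNormHalfWidth and tolIntNormN:

# Expected half-width of a two-sided 95%-content, 95%-confidence
# tolerance interval based on n = 25 observations and an estimated
# standard deviation of 1
tolIntNormHalfWidth(n = 25, sigma.hat = 1, coverage = 0.95,
  conf.level = 0.95)

# Sample size needed to achieve a half-width of 2.5 under the same
# assumptions
tolIntNormN(half.width = 2.5, sigma.hat = 1, coverage = 0.95,
  conf.level = 0.95)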
Steven P. Millard ([email protected])
See the help file for tolIntNorm
.
tolIntNorm
, tolIntNormK
,
tolIntNormN
, plotTolIntNormDesign
,
Normal
.
# Look at the relationship between half-width and sample size for a # 95% beta-content tolerance interval, assuming an estimated standard # deviation of 1 and a confidence level of 95%: dev.new() plotTolIntNormDesign() #========== # Plot half-width vs. coverage for various levels of confidence: dev.new() plotTolIntNormDesign(x.var = "coverage", y.var = "half.width", ylim = c(0, 3.5), main="") plotTolIntNormDesign(x.var = "coverage", y.var = "half.width", conf.level = 0.9, add = TRUE, plot.col = "red") plotTolIntNormDesign(x.var = "coverage", y.var = "half.width", conf.level = 0.8, add = TRUE, plot.col = "blue") legend("topleft", c("95%", "90%", "80%"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Half-Width vs. Coverage for Tolerance Interval", "with Sigma Hat=1 and Various Confidence Levels", sep = "\n")) #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. # Here we will first take the log of the data and then estimate the # standard deviation based on the two background wells. We will use this # estimate of standard deviation to plot the half-widths of # future tolerance intervals on the log-scale for various sample sizes. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 summary.stats <- summaryStats(log(Chrysene.ppb) ~ Well.type, data = EPA.09.Ex.17.3.chrysene.df) summary.stats # N Mean SD Median Min Max #Background 8 2.5086 0.6279 2.4359 1.7405 3.6687 #Compliance 12 3.4173 0.4361 3.4111 2.7081 4.2195 sigma.hat <- summary.stats["Background", "SD"] sigma.hat #[1] 0.6279 dev.new() plotTolIntNormDesign(x.var = "n", y.var = "half.width", range.x.var = c(5, 40), sigma.hat = sigma.hat, cex.main = 1) #========== # Clean up #--------- rm(summary.stats, sigma.hat) graphics.off()
Create plots involving sample size (n), coverage, and confidence level for a nonparametric tolerance interval.
plotTolIntNparDesign(x.var = "n", y.var = "conf.level", range.x.var = NULL, n = 25, coverage = 0.95, conf.level = 0.95, ti.type = "two.sided", cov.type = "content", ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis. Possible values are
|
y.var |
character string indicating what variable to use for the y-axis. Possible values are
|
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot. The
default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is |
coverage |
numeric scalar between 0 and 1 specifying the coverage of the tolerance interval. The default
value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance interval.
The default value is |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ltl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
plot.it |
a logical scalar indicating whether to create a plot or add to the
existing plot (see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for tolIntNpar
, tolIntNparConfLevel
,
tolIntNparCoverage
, and tolIntNparN
for information on how
to compute a nonparametric tolerance interval, how the confidence level
is computed when other quantities are fixed, how the coverage is computed when other
quantities are fixed, and how the sample size is computed when other quantities are fixed.
plotTolIntNparDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that
have been or would have been plotted.
See the help file for tolIntNpar
.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, coverage, and confidence level if one of the objectives of
the sampling program is to produce tolerance intervals. The functions
tolIntNparN
, tolIntNparCoverage
, tolIntNparConfLevel
, and
plotTolIntNparDesign
can be used to investigate these relationships for
constructing nonparametric tolerance intervals.
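For example, a minimal sketch of the two most common design calculations (it assumes the EnvStats package is attached; the numerical comments follow from the fact that, for an upper limit based on the sample maximum, the confidence level is 1 - coverage^n):

# Sample size needed for a one-sided upper nonparametric tolerance limit
# (the sample maximum) with 95% coverage and 95% confidence:
tolIntNparN(coverage = 0.95, conf.level = 0.95, ti.type = "upper")
# smallest n with 1 - 0.95^n >= 0.95, i.e., n = 59

# Confidence level actually achieved by n = 24 observations:
tolIntNparConfLevel(n = 24, coverage = 0.95, ti.type = "upper")
# 1 - 0.95^24, roughly 0.71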
Steven P. Millard ([email protected])
See the help file for tolIntNpar
.
tolIntNpar
, tolIntNparConfLevel
, tolIntNparCoverage
,
tolIntNparN
.
# Look at the relationship between confidence level and sample size for a two-sided # nonparametric tolerance interval. dev.new() plotTolIntNparDesign() #========== # Plot confidence level vs. sample size for various values of coverage: dev.new() plotTolIntNparDesign(coverage = 0.7, ylim = c(0,1), main = "") plotTolIntNparDesign(coverage = 0.8, add = TRUE, plot.col = "red") plotTolIntNparDesign(coverage = 0.9, add = TRUE, plot.col = "blue") legend("bottomright", c("coverage = 70%", "coverage = 80%", "coverage = 90%"), lty=1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Confidence Level vs. Sample Size for Nonparametric TI", "with Various Levels of Coverage", sep = "\n")) #========== # Example 17-4 on page 17-21 of USEPA (2009) uses copper concentrations (ppb) from 3 # background wells to set an upper limit for 2 compliance wells. There are 6 observations # per well, and the maximum value from the 3 wells is set to the 95% confidence upper # tolerance limit, and we need to determine the coverage of this tolerance interval. tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper") #[1] 0.8826538 # Here we will modify the example and look at confidence level versus coverage for # a set sample size of n = 24. dev.new() plotTolIntNparDesign(x.var = "coverage", y.var = "conf.level", n = 24, ti.type = "upper") #========== # Clean up #--------- graphics.off()
Create plots involving sample size, power, scaled difference, and significance level for a one- or two-sample t-test.
plotTTestDesign(x.var = "n", y.var = "power", range.x.var = NULL, n.or.n1 = 25, n2 = n.or.n1, delta.over.sigma = switch(alternative, greater = 0.5, less = -0.5, two.sided = ifelse(two.sided.direction == "greater", 0.5, -0.5)), alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 50, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n.or.n1 |
numeric scalar indicating the sample size. The default value is
|
n2 |
numeric scalar indicating the sample size for group 2. The default value
is the value of |
delta.over.sigma |
numeric scalar specifying the ratio of the true difference ( |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level associated
with the hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power associated with the
hypothesis test. The default value is |
sample.type |
character string indicating whether the design is based on a one-sample or
two-sample t-test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible
values are |
two.sided.direction |
character string indicating the direction (positive or negative) for the scaled
minimal detectable difference when |
approx |
logical scalar indicating whether to compute the power based on an approximation
to the non-central t-distribution. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next smallest integer. The default value is |
n.max |
for the case when |
tol |
numeric scalar relevant to the case when |
maxiter |
numeric scalar relevant to the case when |
plot.it |
a logical scalar indicating whether to create a new plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for tTestPower
, tTestN
, and
tTestScaledMdd
for information on how to compute the power,
sample size, or scaled minimal detectable difference for a one- or two-sample
t-test.
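As a small sketch of how these pieces fit together (assuming EnvStats is attached; plotTTestDesign essentially evaluates these functions over a grid of x-values):

# Sample size per group for a two-sample t-test to detect a scaled
# difference (delta/sigma) of 0.5 at a 5% significance level with 95% power:
tTestN(delta.over.sigma = 0.5, alpha = 0.05, power = 0.95,
  sample.type = "two.sample")
# roughly 105 per group

# Power achieved by n = 25 per group for the same scaled difference:
tTestPower(n.or.n1 = 25, delta.over.sigma = 0.5, alpha = 0.05,
  sample.type = "two.sample")
# roughly 0.4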
plotTTestDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that have
been or would have been plotted.
See the help files for tTestPower
, tTestN
, and
tTestScaledMdd
.
Steven P. Millard ([email protected])
See the help files for tTestPower
, tTestN
, and
tTestScaledMdd
.
tTestPower
, tTestN
,
tTestScaledMdd
, t.test
.
# Look at the relationship between power and sample size for a two-sample t-test, # assuming a scaled difference of 0.5 and a 5% significance level: dev.new() plotTTestDesign(sample.type = "two") #---------- # For a two-sample t-test, plot sample size vs. the scaled minimal detectable # difference for various levels of power, using a 5% significance level: dev.new() plotTTestDesign(x.var = "delta.over.sigma", y.var = "n", sample.type = "two", ylim = c(0, 110), main="") plotTTestDesign(x.var = "delta.over.sigma", y.var = "n", sample.type = "two", power = 0.9, add = TRUE, plot.col = "red") plotTTestDesign(x.var = "delta.over.sigma", y.var = "n", sample.type = "two", power = 0.8, add = TRUE, plot.col = "blue") legend("topright", c("95%", "90%", "80%"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Sample Size vs. Scaled Difference for", "Two-Sample t-Test, with Alpha=0.05 and Various Powers", sep="\n")) #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), look at # power versus scaled minimal detectable difference for various sample # sizes in the context of the problem of using a one-sample t-test to # compare the mean for the well with the MCL of 7 ppb. Use alpha = 0.01, # assume an upper one-sided alternative (i.e., compliance well mean larger # than 7 ppb). dev.new() plotTTestDesign(x.var = "delta.over.sigma", y.var = "power", range.x.var = c(0.5, 2), n.or.n1 = 8, alpha = 0.01, alternative = "greater", ylim = c(0, 1), main = "") plotTTestDesign(x.var = "delta.over.sigma", y.var = "power", range.x.var = c(0.5, 2), n.or.n1 = 6, alpha = 0.01, alternative = "greater", add = TRUE, plot.col = "red") plotTTestDesign(x.var = "delta.over.sigma", y.var = "power", range.x.var = c(0.5, 2), n.or.n1 = 4, alpha = 0.01, alternative = "greater", add = TRUE, plot.col = "blue") legend("topleft", paste("N =", c(8, 6, 4)), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Power vs. Scaled Difference for One-Sample t-Test", "with Alpha=0.01 and Various Sample Sizes", sep="\n")) #========== # Clean up #--------- graphics.off()
Create plots involving sample size, power, ratio of means, coefficient of variation, and significance level for a one- or two-sample t-test, assuming lognormal data.
plotTTestLnormAltDesign(x.var = "n", y.var = "power", range.x.var = NULL, n.or.n1 = 25, n2 = n.or.n1, ratio.of.means = switch(alternative, greater = 2, less = 0.5, two.sided = ifelse(two.sided.direction == "greater", 2, 0.5)), cv = 1, alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 50, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of
|
n.or.n1 |
numeric scalar indicating the sample size. The default value is
|
n2 |
numeric scalar indicating the sample size for group 2. The default value
is the value of |
ratio.of.means |
numeric scalar specifying the ratio of the first mean to the second mean. When
When |
cv |
numeric scalar: a positive value specifying the coefficient of
variation. When |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
two.sided.direction |
character string indicating the direction (greater than 1 or less than 1) for the
detectable ratio of means when |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed
sample size(s) to the next smallest integer. The default value is
|
n.max |
for the case when |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
plot.it |
a logical scalar indicating whether to create a new plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for tTestLnormAltPower
,
tTestLnormAltN
, and tTestLnormAltRatioOfMeans
for
information on how to compute the power, sample size, or ratio of means for a
one- or two-sample t-test assuming lognormal data.
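A brief sketch of the underlying calls (assuming EnvStats is attached; output is omitted since the exact values depend on whether the exact or approximate power computation is used):

# Sample size per group for a two-sample test on lognormal data to detect
# a doubling of the mean (ratio.of.means = 2) when the coefficient of
# variation is 1, using a 5% significance level and 90% power:
tTestLnormAltN(ratio.of.means = 2, cv = 1, alpha = 0.05, power = 0.9,
  sample.type = "two.sample")

# Power of a one-sample test with n = 10 for the same ratio and cv:
tTestLnormAltPower(n.or.n1 = 10, ratio.of.means = 2, cv = 1, alpha = 0.05,
  sample.type = "one.sample", alternative = "greater")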
plotTTestLnormAltDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that have
been or would have been plotted.
See the help files for tTestLnormAltPower
,
tTestLnormAltN
, and tTestLnormAltRatioOfMeans
.
Steven P. Millard ([email protected])
See the help files for tTestLnormAltPower
,
tTestLnormAltN
, and tTestLnormAltRatioOfMeans
.
tTestLnormAltPower
, tTestLnormAltN
,
tTestLnormAltRatioOfMeans
, t.test
.
# Look at the relationship between power and sample size for a two-sample t-test, # assuming lognormal data, a ratio of means of 2, a coefficient of variation # of 1, and a 5% significance level: dev.new() plotTTestLnormAltDesign(sample.type = "two") #---------- # For a two-sample t-test based on lognormal data, plot sample size vs. the # minimal detectable ratio for various levels of power, assuming a coefficient # of variation of 1 and using a 5% significance level: dev.new() plotTTestLnormAltDesign(x.var = "ratio.of.means", y.var = "n", range.x.var = c(1.5, 2), sample.type = "two", ylim = c(20, 120), main="") plotTTestLnormAltDesign(x.var = "ratio.of.means", y.var = "n", range.x.var = c(1.5, 2), sample.type="two", power = 0.9, add = TRUE, plot.col = "red") plotTTestLnormAltDesign(x.var = "ratio.of.means", y.var = "n", range.x.var = c(1.5, 2), sample.type="two", power = 0.8, add = TRUE, plot.col = "blue") legend("topright", c("95%", "90%", "80%"), lty=1, lwd = 3*par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Sample Size vs. Ratio of Lognormal Means for", "Two-Sample t-Test, with CV=1, Alpha=0.05 and Various Powers", sep="\n")) #========== # The guidance document Soil Screening Guidance: Technical Background Document # (USEPA, 1996c, Part 4) discusses sampling design and sample size calculations # for studies to determine whether the soil at a potentially contaminated site # needs to be investigated for possible remedial action. Let 'theta' denote the # average concentration of the chemical of concern. The guidance document # establishes the following goals for the decision rule (USEPA, 1996c, p.87): # # Pr[Decide Don't Investigate | theta > 2 * SSL] = 0.05 # # Pr[Decide to Investigate | theta <= (SSL/2)] = 0.2 # # where SSL denotes the pre-established soil screening level. # # These goals translate into a Type I error of 0.2 for the null hypothesis # # H0: [theta / (SSL/2)] <= 1 # # and a power of 95% for the specific alternative hypothesis # # Ha: [theta / (SSL/2)] = 4 # # Assuming a lognormal distribution, a coefficient of variation of 2, and the above # values for Type I error and power, create a performance goal diagram # (USEPA, 1996c, p.89) showing the power of a one-sample test versus the minimal # detectable ratio of theta/(SSL/2) when the sample size is 6 and the exact power # calculations are used. dev.new() plotTTestLnormAltDesign(x.var = "ratio.of.means", y.var = "power", range.x.var = c(1, 5), n.or.n1 = 6, cv = 2, alpha = 0.2, alternative = "greater", approx = FALSE, ylim = c(0.2, 1), xlab = "theta / (SSL/2)") #========== # Clean up #--------- graphics.off()
Computes pointwise confidence limits for predictions computed by the function
predict
.
pointwise(results.predict, coverage = 0.99, simultaneous = FALSE, individual = FALSE)
results.predict |
output from a call to |
coverage |
optional numeric scalar between 0 and 1 indicating the confidence level associated with the confidence limits.
The default value is |
simultaneous |
optional logical scalar indicating whether to base the confidence limits for the
predicted values on simultaneous or non-simultaneous prediction limits.
The default value is |
individual |
optional logical scalar indicating whether to base the confidence intervals for the
predicted values on prediction limits for the mean ( |
This function computes pointwise confidence limits for predictions computed by the
function predict
. The limits are computed at those points specified by the argument
newdata
of predict
.
The predict
function is a generic function with methods for several
different classes. The function pointwise
was part of the S language.
The modifications to pointwise
in the package EnvStats involve confidence
limits for predictions for a linear model (i.e., an object of class "lm"
).
Confidence Limits for a Predicted Mean Value (individual=FALSE
).
Consider a standard linear model with one or more predictor variables.
Often, one of the major goals of regression analysis is to predict a future
value of the response variable given known values of the predictor variables.
The equations for the predicted mean value of the response given
fixed values of the predictor variables as well as the equation for a
two-sided (1-alpha)100% confidence interval for the mean value of the
response can be found in Draper and Smith (1998, p.80) and
Millard and Neerchal (2001, p.547).
Technically, this formula is a confidence interval for the mean of
the response for one set of fixed values of the predictor variables and
corresponds to the case when simultaneous=FALSE
. To create simultaneous
confidence intervals over the range of the predictor variables,
the critical t-value in the equation has to be replaced with a critical
F-value and the modified formula is given in Draper and Smith (1998, p. 83),
Miller (1981a, p. 111), and Millard and Neerchal (2001, p. 547).
This formula is used in the case when simultaneous=TRUE
.
Confidence Limits for a Predicted Individual Value (individual=TRUE
).
In the above section we discussed how to create a confidence interval for
the mean of the response given fixed values for the predictor variables.
If instead we want to create a prediction interval for a single
future observation of the response variable, the formula is given in
Miller (1981a, p. 115) and Millard and Neerchal (2001, p. 551).
Technically, this formula is a prediction interval for a single future
observation for one set of fixed values of the predictor variables and
corresponds to the case when simultaneous=FALSE
. Miller (1981a, p. 115)
gives a formula for simultaneous prediction intervals for future
observations. If we are interested in creating an interval that will
encompass all possible future observations over the range of the
predictor variables with some specified probability, however, we need to
create simultaneous tolerance intervals. A formula for such an interval
was developed by Lieberman and Miller (1963) and is given in
Miller (1981a, p. 124). This formula is used in the case when
simultaneous=TRUE
.
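Since with simultaneous=FALSE these are the standard regression intervals, a quick sketch (using the Air.df data set from the example below) shows how they line up with the intervals returned by predict.lm; the agreement is expected rather than guaranteed, so it is worth checking on your own fit:

# Non-simultaneous limits from pointwise() versus predict.lm() intervals
fit <- lm(ozone ~ temperature, data = Air.df)
new.df <- data.frame(temperature = c(70, 90))
pred <- predict(fit, newdata = new.df, se.fit = TRUE)

# Limits for the mean response (individual = FALSE) ...
pointwise(pred, coverage = 0.95)
# ... should match the usual confidence interval for the mean:
predict(fit, newdata = new.df, interval = "confidence", level = 0.95)

# Limits for a single future observation (individual = TRUE) ...
pointwise(pred, coverage = 0.95, individual = TRUE)
# ... should match the usual prediction interval:
predict(fit, newdata = new.df, interval = "prediction", level = 0.95)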
a list with the following components:
upper |
upper limits of pointwise confidence intervals. |
fit |
surface values. This is the same as the component |
lower |
lower limits of pointwise confidence intervals. |
The function pointwise
is called by the functions
detectionLimitCalibrate
and inversePredictCalibrate
, which are used in calibration.
Almost always the process of determining the concentration of a chemical in
a soil, water, or air sample involves using some kind of machine that
produces a signal, and this signal is related to the concentration of the
chemical in the physical sample. The process of relating the machine signal
to the concentration of the chemical is called calibration
(see calibrate
). Once calibration has been performed,
estimated concentrations in physical samples with unknown concentrations
are computed using inverse regression. The uncertainty in the process used
to estimate the concentration may be quantified with decision, detection,
and quantitation limits.
In practice, only the point estimate of concentration is reported (along
with a possible qualifier), without confidence bounds for the true
concentration. This is most unfortunate because it gives the
impression that there is no error associated with the reported concentration.
Indeed, both the International Organization for Standardization (ISO) and
the International Union of Pure and Applied Chemistry (IUPAC) recommend
always reporting both the estimated concentration and the uncertainty
associated with this estimate (Currie, 1997).
Authors of S (for code for pointwise
in S).
Steven P. Millard (for modification to allow the arguments simultaneous and individual) ([email protected]).
Chambers, J.M., and Hastie, T.J., eds. (1992). Statistical Models in S. Chapman and Hall/CRC, Boca Raton, FL.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 3.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.546-553.
Miller, R.G. (1981a). Simultaneous Statistical Inference. Springer-Verlag, New York, pp.111, 124.
predict
, predict.lm
,
lm
, calibrate
,
inversePredictCalibrate
, detectionLimitCalibrate
.
# Using the data in the built-in data frame Air.df, # fit the cube root of ozone as a function of temperature. # Then compute predicted values for ozone at 70 and 90 # degrees F, and compute 95% confidence intervals for the # mean value of ozone at these temperatures. # First create the lm object #--------------------------- ozone.fit <- lm(ozone ~ temperature, data = Air.df) # Now get predicted values and CIs at 70 and 90 degrees #------------------------------------------------------ predict.list <- predict(ozone.fit, newdata = data.frame(temperature = c(70, 90)), se.fit = TRUE) pointwise(predict.list, coverage = 0.95) # $upper # 1 2 # 2.839145 4.278533 # $fit # 1 2 # 2.697810 4.101808 # $lower # 1 2 # 2.556475 3.925082 #-------------------------------------------------------------------- # Continuing with the above example, create a scatterplot of ozone # vs. temperature, and add the fitted line along with simultaneous # 95% confidence bands. x <- Air.df$temperature y <- Air.df$ozone dev.new() plot(x, y, xlab="Temperature (degrees F)", ylab = expression(sqrt("Ozone (ppb)", 3))) abline(ozone.fit, lwd = 2) new.x <- seq(min(x), max(x), length=100) predict.ozone <- predict(ozone.fit, newdata = data.frame(temperature = new.x), se.fit = TRUE) ci.ozone <- pointwise(predict.ozone, coverage=0.95, simultaneous=TRUE) lines(new.x, ci.ozone$lower, lty=2, lwd = 2, col = 2) lines(new.x, ci.ozone$upper, lty=2, lwd = 2, col = 2) title(main=paste("Cube Root Ozone vs. Temperature with Fitted Line", "and Simultaneous 95% Confidence Bands", sep="\n")) #-------------------------------------------------------------------- # Redo the last example by creating non-simultaneous # confidence bounds and prediction bounds as well. dev.new() plot(x, y, xlab = "Temperature (degrees F)", ylab = expression(sqrt("Ozone (ppb)", 3))) abline(ozone.fit, lwd = 2) new.x <- seq(min(x), max(x), length=100) predict.ozone <- predict(ozone.fit, newdata = data.frame(temperature = new.x), se.fit = TRUE) ci.ozone <- pointwise(predict.ozone, coverage=0.95) lines(new.x, ci.ozone$lower, lty=2, col = 2, lwd = 2) lines(new.x, ci.ozone$upper, lty=2, col = 2, lwd = 2) pi.ozone <- pointwise(predict.ozone, coverage = 0.95, individual = TRUE) lines(new.x, pi.ozone$lower, lty=4, col = 4, lwd = 2) lines(new.x, pi.ozone$upper, lty=4, col = 4, lwd = 2) title(main=paste("Cube Root Ozone vs. Temperature with Fitted Line", "and 95% Confidence and Prediction Bands", sep="\n")) #-------------------------------------------------------------------- # Clean up rm(predict.list, ozone.fit, x, y, new.x, predict.ozone, ci.ozone, pi.ozone)
Returns a list of “ordered” observations and associated plotting positions based on Type I left-censored or right-censored data. These plotting positions may be used to construct empirical cumulative distribution plots or quantile-quantile plots, or to estimate distribution parameters.
ppointsCensored(x, censored, censoring.side = "left", prob.method = "michael-schucany", plot.pos.con = 0.375)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities). Possible values are "kaplan-meier", "modified kaplan-meier" (only available when censoring.side="left"), "nelson" (only available when censoring.side="right"), "michael-schucany" (the default), and "hirsch-stedinger". |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is |
Methods for computing plotting positions for complete data sets
(no censored observations) are discussed in D'Agostino, R.B. (1986a) and
Cleveland (1993). For data sets with censored observations, these methods
must be modified. The function ppointsCensored
allows you to compute
plotting positions based on any of the following methods:
Product-limit method of Kaplan and Meier (1958) (prob.method="kaplan-meier"
).
Hazard plotting method of Nelson (1972) (prob.method="nelson"
).
Generalization of the product-limit method due to Michael and Schucany (1986)
(prob.method="michael-schucany"
) (the default).
Generalization of the product-limit method due to Hirsch and Stedinger (1987)
(prob.method="hirsch-stedinger"
).
Let x_1, x_2, ..., x_N denote a random sample of N observations from some distribution. Assume n (0 < n < N) of these observations are known and c (c = N - n) of these observations are all censored below (left-censored) or all censored above (right-censored) at k fixed censoring levels

T_1, T_2, ..., T_k;   k >= 1     (1)

For the case when k >= 2, the data are said to be Type I multiply censored. For the case when k = 1, set T = T_1. If the data are left-censored and all n known observations are greater than or equal to T, or if the data are right-censored and all n known observations are less than or equal to T, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let c_j denote the number of observations censored below or above censoring level T_j for j = 1, 2, ..., k, so that

c_1 + c_2 + ... + c_k = c     (2)

Let x_(1), x_(2), ..., x_(N) denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first. Note that in this case the quantity x_(i) does not necessarily represent the i'th “largest” observation from the (unknown) complete sample.

Finally, let Omega denote the set of n subscripts in the “ordered” sample that correspond to uncensored observations.
Product-Limit Method of Kaplan and Meier (prob.method="kaplan-meier"
)
For complete data sets (no censored observations), the empirical probabilities estimator of the cumulative distribution function evaluated at the i'th ordered observation x_(i) is given by (D'Agostino, 1986a, p.8):

phat_i = Fhat[x_(i)] = #{x_j <= x_(i)} / N = i / N     (3)

where #{x_j <= x_(i)} denotes the number of observations less than or equal to x_(i) (see the help file for ecdfPlot).
Kaplan and Meier (1958) extended this method of computing the empirical cdf to
the case of right-censored data.
Right-Censored Data (censoring.side="right"
)
Let S(t) denote the survival function evaluated at t, that is:

S(t) = 1 - F(t) = Pr(X > t)     (4)
Kaplan and Meier (1958) show that a nonparametric estimate of the survival function at the i'th ordered observation that is not censored (i.e., i in Omega), is given by:

Shat[x_(i)] = Prod{j in Omega, j <= i} (n_j - d_j) / n_j     (5)

where n_j is the number of observations (uncensored or censored) with values greater than or equal to x_(j), and d_j denotes the number of uncensored observations exactly equal to x_(j) (if there are no tied uncensored observations then d_j will equal 1 for all values of j).
(See also Lee and Wang, 2003, pp. 64–69; Michael and Schucany, 1986). By convention,
the estimate of the survival function at a censored observation is set equal to
the estimated value of the survival function at the largest uncensored observation
less than or equal to that censoring level. If there are no uncensored observations
less than or equal to a particular censoring level, the estimate of the survival
function is set to 1 for that censoring level.
Thus the Kaplan-Meier plotting position at the i'th ordered observation that is not censored (i.e., i in Omega), is given by:

phat_i = Fhat[x_(i)] = 1 - Shat[x_(i)]     (6)
The plotting position for a censored observation is set equal to the plotting position associated with the largest uncensored observation less than or equal to that censoring level. If there are no uncensored observations less than or equal to a particular censoring level, the plotting position is set to 0 for that censoring level.
Note that for complete data sets, Equation (6) reduces to Equation (3).
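As a small illustration, the following sketch uses a made-up right-censored data set (hypothetical values, chosen only so the arithmetic is easy to follow) together with a hand calculation based on Equations (5) and (6); the call to ppointsCensored should reproduce the hand-computed values:

# Hypothetical right-censored data (a "*" marks a censored value):
#   3, 4, 4*, 5, 6*, 8
x        <- c(3, 4, 4, 5, 6, 8)
censored <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

ppointsCensored(x, censored, censoring.side = "right",
  prob.method = "kaplan-meier")

# Hand check with Equations (5) and (6):
#   Shat(3) = 5/6, Shat(4) = (5/6)(4/5) = 2/3, Shat(5) = (2/3)(2/3) = 4/9, Shat(8) = 0,
# so the plotting positions for the uncensored values are
#   1/6, 1/3, 5/9, 1 (about 0.167, 0.333, 0.556, 1.000),
# and each censored value takes the plotting position of the largest
# uncensored value less than or equal to its censoring level.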
Left-Censored Data (censoring.side="left"
)
Gillespie et al. (2010) give formulas for the Kaplan-Meier estimator for the case of left-censoring (censoring.side="left"). In this case, the plotting position for the i'th ordered observation, assuming it is not censored, is computed as:

phat_i = Fhat[x_(i)] = Prod{j in Omega, j > i} (n_j - d_j) / n_j     (7)

where n_j is the number of observations (uncensored or censored) with values less than or equal to x_(j), and d_j denotes the number of uncensored observations exactly equal to x_(j) (if there are no tied uncensored observations then d_j will equal 1 for all values of j).
The plotting position is equal to 1 for the largest uncensored order statistic.
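As a check on Equation (7), the following sketch reproduces by hand the plotting position of the smallest uncensored value (3.3) in the manganese data set used in the Examples section below; in that ordered sample of N = 25 values, the uncensored order statistics above 3.3 occur at positions j = 8, 9, ..., 25 with n_j = j and d_j = 1:

prod((8:25 - 1) / (8:25))
#[1] 0.28
# This matches the value returned by ppointsCensored() with
# prob.method = "kaplan-meier" for the observation 3.3
# (see the Examples section).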
Note that for complete data sets, Equation (7) reduces to Equation (3).
Modified Kaplan-Meier Method (prob.method="modified kaplan-meier"
)
(Left-Censored Data Only.) For left-censored data, the modified Kaplan-Meier
method is the same as the Kaplan-Meier method, except that for the largest
uncensored order statistic, the plotting position is not set to 1 but rather is
set equal to the Blom plotting position (N - 3/8)/(N + 1/4). This method
is useful, for example, when creating Quantile-Quantile plots.
Hazard Plotting Method of Nelson (prob.method="nelson"
)
(Right-Censored Data Only.) For right-censored data, Equation (5) can be re-written as:

Shat[x_(i)] = Prod{j in Omega, j <= i} (N - j) / (N - j + 1)     (8)

Nelson (1972) proposed the following formula for plotting positions for the uncensored observations in the context of estimating the hazard function (see Michael and Schucany, 1986, p.469):

phat_i = 1 - exp[ - Sum{j in Omega, j <= i} 1 / (N - j + 1) ]     (9)
See Lee and Wang (2003) for more information about the hazard function.
As for the Kaplan and Meier (1958) method, the plotting position for a censored
observation is set equal to the plotting position associated with the largest
uncensored observation less than or equal to that censoring level. If there are
no uncensored observations less than or equal to a particular censoring level,
the plotting position is set to 0 for that censoring level.
Generalization of Product-Limit Method, Michael and Schucany
(prob.method="michael-schucany"
)
For complete data sets, the disadvantage of using Equation (3) above to define plotting positions is that it implies the largest observed value is the maximum possible value of the distribution (the 100'th percentile). This may be satisfactory if the underlying distribution is known to be discrete, but it is usually not satisfactory if the underlying distribution is known to be continuous.

A more frequently used formula for plotting positions for complete data sets is given by:

phat_i = (i - a) / (N - 2a + 1)     (10)

where 0 <= a <= 1 (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8,25). The value of a is usually chosen so that the plotting positions are approximately unbiased (i.e., approximate the mean of their distribution) or else approximate the median value of their distribution (see the help file for ecdfPlot). Michael and Schucany (1986) extended this method for both left- and right-censored data sets.
Right-Censored Data (censoring.side="right"
)
For right-censored data sets, the plotting positions for the uncensored observations are computed as:

phat_i = 1 - [(N - a + 1) / (N - 2a + 1)] * Prod{j in Omega, j <= i} (N - j - a + 1) / (N - j - a + 2)     (11)

Note that the plotting positions proposed by Herd (1960) and Johnson (1964) are a special case of Equation (11) with a = 0. Equation (11) reduces to Equation (10)
in the case of complete data sets. Note that unlike the Kaplan-Meier method,
plotting positions associated with tied uncensored observations are not the same
(just as in the case for complete data using Equation (10)).
As for the Kaplan and Meier (1958) method, for right-censored data the plotting
position for a censored observation is set equal to the plotting position associated
with the largest uncensored observation less than or equal to that censoring level.
If there are no uncensored observations less than or equal to a particular censoring
level, the plotting position is set to 0 for that censoring level.
Left-Censored Data (censoring.side="left"
)
For left-censored data sets the plotting positions are computed as:

phat_i = [(N - a + 1) / (N - 2a + 1)] * Prod{j in Omega, j >= i} (j - a) / (j - a + 1)     (12)
Equation (12) reduces to Equation (10) in the case of complete data sets. Note that unlike the Kaplan-Meier method, plotting positions associated with tied uncensored observations are not the same (just as in the case for complete data using Equation (10)).
For left-censored data, the plotting position for a censored observation is set
equal to the plotting position associated with the smallest uncensored observation
greater than or equal to that censoring level. If there are no uncensored
observations greater than or equal to a particular censoring level, the plotting
position is set to 1 for that censoring level.
Generalization of Product-Limit Method, Hirsch and Stedinger
(prob.method="hirsch-stedinger"
)
Hirsch and Stedinger (1987) use a slightly different approach than Kaplan and Meier
(1958) and Michael and Schucany (1986) to derive a nonparametric estimate of the
survival function (probability of exceedance) in the context of left-censored data.
First they estimate the value of the survival function at each of the censoring
levels. The value of the survival function for an uncensored observation between
two adjacent censoring levels is then computed by linear interpolation (in the form
of a plotting position). See also Helsel and Cohn (1988).
The discussion below presents an extension of the method of Hirsch and Stedinger (1987) to the case of right-censored data, and then presents the original derivation due to Hirsch and Stedinger (1987) for left-censored data.
Right-Censored Data (censoring.side="right"
)
For right-censored data, the survival function is estimated as follows. For each censoring level T_j, the value of the survival function at T_j is written in terms of its value at the adjacent censoring level and the conditional probability of an observation falling between the two levels. This conditional probability is estimated by the method of moments as A_j / (A_j + B_j), where A_j denotes the number of uncensored observations lying between the two adjacent censoring levels and B_j denotes the number of observations (censored or uncensored) known to exceed the upper of the two censoring levels. Substituting these estimates yields a set of equations that is solved iteratively for the estimated value of the survival function at each censoring level.

Once the values of the survival function at the censoring levels are computed, the plotting positions for the uncensored observations lying between two adjacent censoring levels are obtained by linear interpolation between the estimated values of the survival function at those levels, using the plotting position constant a and the rank of each observation among the uncensored observations in that interval. (Tied observations are given distinct ranks.)

For the observations censored at censoring level T_j, the plotting positions are computed in an analogous fashion, using the rank of each observation among the observations censored at that censoring level. Note that all the observations censored at the same censoring level are given distinct ranks, even though there is no way to distinguish between them. The formulas parallel those of the original left-censored derivation of Hirsch and Stedinger (1987) given below.
Left-Censored Data (censoring.side="left"
)
For left-censored data, Hirsch and Stedinger (1987) modify the definition of the survival function to

S(t) = Pr(X >= t)

which, for continuous distributions, is identical to the function defined in Equation (4).

Hirsch and Stedinger (1987) estimate the value of this survival function at each censoring level using the recursion

Shat(T_j) = Shat(T_(j+1)) + [A_j / (A_j + B_j)] [1 - Shat(T_(j+1))]

computed for j = k, k-1, ..., 1, with the conventions T_0 = -Inf, Shat(T_0) = 1, T_(k+1) = Inf, and Shat(T_(k+1)) = 0. Here

A_j = number of uncensored observations with values in the interval [T_j, T_(j+1)),
B_j = number of observations (censored or uncensored) known to be less than T_j (i.e., uncensored observations less than T_j plus observations censored at censoring levels less than or equal to T_j),

so that A_j / (A_j + B_j) is the method of moments estimator of the conditional probability Pr(X >= T_j | X < T_(j+1)).

Once the values of the survival function at the censoring levels are computed, the plotting position for an uncensored observation lying in the interval [T_j, T_(j+1)) (j = 0, 1, ..., k) is computed as

phat = [1 - Shat(T_j)] + [Shat(T_j) - Shat(T_(j+1))] (r - a) / (A_j - 2a + 1)

where a denotes the plotting position constant (0 <= a <= 1) and r denotes the rank of the observation among the A_j uncensored observations in that interval. (Tied observations are given distinct ranks.)

For the c_j observations censored at censoring level T_j, the plotting positions are computed as

phat = [1 - Shat(T_j)] (r - a) / (c_j - 2a + 1)

where r denotes the rank of the observation among the c_j observations censored at censoring level T_j. Note that all the observations censored at the same censoring level are given distinct ranks,
even though there is no way to distinguish between them.
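The following sketch traces the recursion and the interpolation formulas by hand for the Helsel and Cohn (1988) Appendix B data used in the Examples section (plotting position constant a = 0); the values agree with the output of ppointsCensored with prob.method="hirsch-stedinger" shown there:

# Helsel and Cohn (1988) Appendix B data: 6 values censored at <1,
# 3 values censored at <10, uncensored values 3, 7, 9 between the
# censoring levels and 12, 15, 20, 27, 33, 50 above them.
# Censoring levels: T1 = 1, T2 = 10.

# Estimated survival function S(t) = Pr[X >= t] at the censoring levels:
S.T2 <- 0 + (6 / (6 + 12)) * (1 - 0)      # A2 = 6 uncensored >= 10, B2 = 12 obs < 10
S.T1 <- S.T2 + (3 / (3 + 6)) * (1 - S.T2) # A1 = 3 uncensored in [1, 10), B1 = 6 obs < 1
c(S.T1, S.T2)
#[1] 0.5555556 0.3333333

# Plotting positions (a = 0) for the uncensored values 3, 7, 9 in [T1, T2):
(1 - S.T1) + (S.T1 - S.T2) * (1:3) / (3 + 1)
#[1] 0.5000000 0.5555556 0.6111111

# Plotting positions for the three values censored at <10:
(1 - S.T2) * (1:3) / (3 + 1)
#[1] 0.1666667 0.3333333 0.5000000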
ppointsCensored
returns a list with the following components:
Order.Statistics |
numeric vector of the “ordered” observations. |
Cumulative.Probabilities |
numeric vector of the associated plotting positions. |
Censored |
logical vector indicating which of the ordered observations are censored. |
Censoring.Side |
character string indicating whether the data are left- or right-censored.
This is same value as the argument |
Prob.Method |
character string indicating what method was used to compute the plotting positions.
This is the same value as the argument |
Optional Component (only present when prob.method="michael-schucany"
or prob.method="hirsch-stedinger"
):
Plot.Pos.Con |
numeric scalar containing the value of the plotting position constant that was used.
This is the same as the argument |
For censored data sets, plotting positions may be used to construct empirical cumulative distribution
plots (see ecdfPlotCensored
), construct quantile-quantile plots
(see qqPlotCensored
), or to estimate distribution parameters
(see FcnsByCatCensoredData
).
The function survfit
in the R package
survival computes the survival function for right-censored, left-censored, or
interval-censored data. Calling survfit
with
type="kaplan-meier"
will produce similar results to calling
ppointsCensored
with prob.method="kaplan-meier"
. Also, calling
survfit
with type="fh2"
will produce similar results
to calling ppointsCensored
with prob.method="nelson"
.
Helsel and Cohn (1988, p.2001) found very little effect of changing the value of the plotting position constant when using the method of Hirsch and Stedinger (1987) to compute plotting positions for multiply left-censored data. In general, there will be very little difference between plotting positions computed by the different methods except in the case of very small samples and a large amount of censoring.
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley and Sons, New York.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
ppoints
, ecdfPlot
, qqPlot
,
ecdfPlotCensored
, qqPlotCensored
,
survfit
.
# Generate 20 observations from a normal distribution with mean=20 and sd=5, # censor all observations less than 18, then compute plotting positions for # this data set. Compare the plotting positions to the plotting positions # for the uncensored data set. Note that the plotting positions for the # censored data set start at the first ordered uncensored observation and # that for values of x > 18 the plotting positions for the two data sets are # exactly the same. This is because there is only one censoring level and # no uncensored observations fall below the censored observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(333) x <- rnorm(20, mean=20, sd=5) censored <- x < 18 censored # [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE #[13] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE sum(censored) #[1] 7 new.x <- x new.x[censored] <- 18 round(sort(new.x),1) # [1] 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.1 18.7 19.6 20.2 20.3 20.6 21.4 #[15] 21.8 21.8 23.2 26.2 26.8 29.7 p.list <- ppointsCensored(new.x, censored) p.list #$Order.Statistics # [1] 18.00000 18.00000 18.00000 18.00000 18.00000 18.00000 18.00000 18.09771 # [9] 18.65418 19.58594 20.21931 20.26851 20.55296 21.38869 21.76359 21.82364 #[17] 23.16804 26.16527 26.84336 29.67340 # #$Cumulative.Probabilities # [1] 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 # [8] 0.3765432 0.4259259 0.4753086 0.5246914 0.5740741 0.6234568 0.6728395 #[15] 0.7222222 0.7716049 0.8209877 0.8703704 0.9197531 0.9691358 # #$Censored # [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE #[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE # #$Censoring.Side #[1] "left" # #$Prob.Method #[1] "michael-schucany" # #$Plot.Pos.Con #[1] 0.375 #---------- # Round off plotting positions to two decimal places # and compare to plotting positions that ignore censoring #-------------------------------------------------------- round(p.list$Cum, 2) # [1] 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.43 0.48 0.52 0.57 0.62 0.67 #[15] 0.72 0.77 0.82 0.87 0.92 0.97 round(ppoints(x, a=0.375), 2) # [1] 0.03 0.08 0.13 0.18 0.23 0.28 0.33 0.38 0.43 0.48 0.52 0.57 0.62 0.67 #[15] 0.72 0.77 0.82 0.87 0.92 0.97 #---------- # Clean up #--------- rm(x, censored, new.x, p.list) #---------------------------------------------------------------------------- # Reproduce the example in Appendix B of Helsel and Cohn (1988). The data # are stored in Helsel.Cohn.88.appb.df. This data frame contains 18 # observations, of which 9 are censored below one of 2 distinct censoring # levels. Helsel.Cohn.88.app.b.df # Conc.orig Conc Censored #1 <1 1 TRUE #2 <1 1 TRUE #... #17 33 33 FALSE #18 50 50 FALSE p.list <- with(Helsel.Cohn.88.app.b.df, ppointsCensored(Conc, Censored, prob.method="hirsch-stedinger", plot.pos.con=0)) lapply(p.list[1:2], round, 3) #$Order.Statistics # [1] 1 1 1 1 1 1 3 7 9 10 10 10 12 15 20 27 33 50 # #$Cumulative.Probabilities # [1] 0.063 0.127 0.190 0.254 0.317 0.381 0.500 0.556 0.611 0.167 0.333 0.500 #[13] 0.714 0.762 0.810 0.857 0.905 0.952 # Clean up #--------- rm(p.list) #---------------------------------------------------------------------------- # Example 15-1 of USEPA (2009, page 15-10) gives an example of # computing plotting positions based on censored manganese # concentrations (ppb) in groundwater collected at 5 monitoring # wells. The data for this example are stored in # EPA.09.Ex.15.1.manganese.df. 
EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #4 4 Well.1 21.6 21.6 FALSE #5 5 Well.1 <2 2.0 TRUE #... #21 1 Well.5 17.9 17.9 FALSE #22 2 Well.5 22.7 22.7 FALSE #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE p.list.EPA <- with(EPA.09.Ex.15.1.manganese.df, ppointsCensored(Manganese.ppb, Censored, prob.method = "kaplan-meier")) data.frame(Mn = p.list.EPA$Order.Statistics, Censored = p.list.EPA$Censored, CDF = p.list.EPA$Cumulative.Probabilities) # Mn Censored CDF #1 2.0 TRUE 0.21 #2 2.0 TRUE 0.21 #3 2.0 TRUE 0.21 #4 3.3 FALSE 0.28 #5 5.0 TRUE 0.28 #6 5.0 TRUE 0.28 #7 5.0 TRUE 0.28 #8 5.3 FALSE 0.32 #9 6.3 FALSE 0.36 #10 7.7 FALSE 0.40 #11 8.4 FALSE 0.44 #12 9.5 FALSE 0.48 #13 10.0 FALSE 0.52 #14 11.9 FALSE 0.56 #15 12.1 FALSE 0.60 #16 12.6 FALSE 0.64 #17 16.9 FALSE 0.68 #18 17.9 FALSE 0.72 #19 21.6 FALSE 0.76 #20 22.7 FALSE 0.80 #21 34.5 FALSE 0.84 #22 45.9 FALSE 0.88 #23 53.6 FALSE 0.92 #24 77.2 FALSE 0.96 #25 106.3 FALSE 1.00 #---------- # Clean up #--------- rm(p.list.EPA)
The EnvStats function predict
is a generic function for
predictions from the results of various model fitting functions.
The function invokes particular methods
which
depend on the class
of the first argument.
The EnvStats function predict.default
simply calls the R generic
function predict
.
The EnvStats functions predict
and predict.default
have been
created in order to comply with CRAN policies, because EnvStats contains a
modified version of the R function predict.lm
.
predict(object, ...)

## Default S3 method:
predict(object, ...)
object |
a model object for which prediction is desired. |
... |
Further arguments passed to or from other methods. See the R help file for predict. |
See the R help file for predict
.
See the R help file for predict
.
R Development Core Team (for code for R version of predict).
Steven P. Millard (for EnvStats version of predict.default; [email protected]).
Chambers, J.M., and Hastie, T.J., eds. (1992). Statistical Models in S. Chapman and Hall/CRC, Boca Raton, FL.
R help file for predict, predict.lm.
# Using the data from the built-in data frame Air.df, # fit the cube-root of ozone as a function of temperature, # then compute predicted values for ozone at 70 and 90 degrees F, # along with the standard errors of these predicted values. # First look at the data #----------------------- with(Air.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Cube-Root Ozone (ppb)")) # Now create the lm object #------------------------- ozone.fit <- lm(ozone ~ temperature, data = Air.df) # Now get predicted values and CIs at 70 and 90 degrees. # Note the presence of the last component called n.coefs. #-------------------------------------------------------- predict.list <- predict(ozone.fit, newdata = data.frame(temperature = c(70, 90)), se.fit = TRUE) predict.list #$fit # 1 2 #2.697810 4.101808 # #$se.fit # 1 2 #0.07134554 0.08921071 # #$df #[1] 114 # #$residual.scale #[1] 0.5903046 # #$n.coefs #[1] 2 #---------- #Continuing with the above example, create a scatterplot of # cube-root ozone vs. temperature, and add the fitted line # along with simultaneous 95% confidence bands. with(Air.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Cube-Root Ozone (ppb)")) abline(ozone.fit, lwd = 3, col = "blue") new.temp <- with(Air.df, seq(min(temperature), max(temperature), length = 100)) predict.list <- predict(ozone.fit, newdata = data.frame(temperature = new.temp), se.fit = TRUE) ci.ozone <- pointwise(predict.list, coverage = 0.95, simultaneous = TRUE) lines(new.temp, ci.ozone$lower, lty = 2, lwd = 3, col = "magenta") lines(new.temp, ci.ozone$upper, lty = 2, lwd = 3, col = "magenta") title(main=paste("Scatterplot of Cube-Root Ozone vs. Temperature", "with Fitted Line and Simultaneous 95% Confidence Bands", sep="\n")) #---------- # Clean up #--------- rm(ozone.fit, predict.list, new.temp, ci.ozone) graphics.off()
The function predict.lm
in EnvStats is a modified version
of the built-in R function predict.lm
.
The only modification is that for the EnvStats function predict.lm
,
if se.fit=TRUE
, the list returned includes a component called
n.coefs
. The component n.coefs
is used by the function
pointwise
to create simultaneous confidence or prediction limits.
## S3 method for class 'lm'
predict(object, ...)
object |
Object of class inheriting from "lm". |
... |
Further arguments passed to the R function predict.lm. |
See the R help file for predict.lm
.
The function predict.lm
in EnvStats is a modified version
of the built-in R function predict.lm
.
The only modification is that for the EnvStats function predict.lm
,
if se.fit=TRUE
, the list returned includes a component called
n.coefs
. The component n.coefs
is used by the function
pointwise
to create simultaneous confidence or prediction limits.
See the R help file for predict.lm
.
The only modification is that for the EnvStats function predict.lm
,
if se.fit=TRUE
, the list returned includes a component called
n.coefs
, i.e., the function returns a list with the following components:
fit |
vector or matrix as above |
se.fit |
standard error of predicted means |
residual.scale |
residual standard deviations |
df |
degrees of freedom for residual |
n.coefs |
numeric scalar denoting the number of predictor variables used in the model |
R Development Core Team (for code for R version of predict.lm
).
Steven P. Millard (for modification to add component n.coefs
; [email protected])
Chambers, J.M., and Hastie, T.J., eds. (1992). Statistical Models in S. Chapman and Hall/CRC, Boca Raton, FL.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 3.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.546-553.
Miller, R.G. (1981a). Simultaneous Statistical Inference. Springer-Verlag, New York, pp.111, 124.
Help file for R function predict, Help file for R function predict.lm, lm, calibrate, inversePredictCalibrate, detectionLimitCalibrate.
# Using the data from the built-in data frame Air.df, # fit the cube-root of ozone as a function of temperature, # then compute predicted values for ozone at 70 and 90 degrees F, # along with the standard errors of these predicted values. # First look at the data #----------------------- with(Air.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Cube-Root Ozone (ppb)")) # Now create the lm object #------------------------- ozone.fit <- lm(ozone ~ temperature, data = Air.df) # Now get predicted values and CIs at 70 and 90 degrees. # Note the presence of the last component called n.coefs. #-------------------------------------------------------- predict.list <- predict(ozone.fit, newdata = data.frame(temperature = c(70, 90)), se.fit = TRUE) predict.list #$fit # 1 2 #2.697810 4.101808 # #$se.fit # 1 2 #0.07134554 0.08921071 # #$df #[1] 114 # #$residual.scale #[1] 0.5903046 # #$n.coefs #[1] 2 #---------- #Continuing with the above example, create a scatterplot of # cube-root ozone vs. temperature, and add the fitted line # along with simultaneous 95% confidence bands. with(Air.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Cube-Root Ozone (ppb)")) abline(ozone.fit, lwd = 3, col = "blue") new.temp <- with(Air.df, seq(min(temperature), max(temperature), length = 100)) predict.list <- predict(ozone.fit, newdata = data.frame(temperature = new.temp), se.fit = TRUE) ci.ozone <- pointwise(predict.list, coverage = 0.95, simultaneous = TRUE) lines(new.temp, ci.ozone$lower, lty = 2, lwd = 3, col = "magenta") lines(new.temp, ci.ozone$upper, lty = 2, lwd = 3, col = "magenta") title(main=paste("Scatterplot of Cube-Root Ozone vs. Temperature", "with Fitted Line and Simultaneous 95% Confidence Bands", sep="\n")) #---------- # Clean up #--------- rm(ozone.fit, predict.list, new.temp, ci.ozone) graphics.off()
Construct a prediction interval for the next k observations or
next set of k transformed means for a gamma distribution.
predIntGamma(x, n.transmean = 1, k = 1, method = "Bonferroni",
    pi.type = "two-sided", conf.level = 0.95, est.method = "mle",
    normal.approx.transform = "kulkarni.powar")

predIntGammaAlt(x, n.transmean = 1, k = 1, method = "Bonferroni",
    pi.type = "two-sided", conf.level = 0.95, est.method = "mle",
    normal.approx.transform = "kulkarni.powar")
x |
numeric vector of non-negative observations. Missing ( |
n.transmean |
positive integer specifying the sample size associated with the |
k |
positive integer specifying the number of future observations or means
the prediction interval should contain with confidence level |
method |
character string specifying the method to use if the number of future observations
or averages ( |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the prediction
interval. The default value is |
est.method |
character string specifying the method of estimation for the shape and scale
distribution parameters. The possible values are
|
normal.approx.transform |
character string indicating which power transformation to use.
Possible values are |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
The function predIntGamma
returns a prediction interval as well as
estimates of the shape and scale parameters.
The function predIntGammaAlt
returns a prediction interval as well as
estimates of the mean and coefficient of variation.
Following Krishnamoorthy et al. (2008), the prediction interval is computed by:
using a power transformation on the original data to induce approximate normality,
calling predIntNorm
with the transformed data to
compute the prediction interval, and then
back-transforming the interval to create a prediction interval on the original scale (a rough sketch of these steps is given below).
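For illustration only, here is a minimal "by hand" version of these three steps, using the cube-root transformation for concreteness. This is a sketch under that assumption, not the exact internal algorithm of predIntGamma, which by default chooses the power via the Kulkarni-Powar method described below.

# Sketch only: gamma prediction interval "by hand" via a normalizing
# power transformation.
set.seed(250)
dat <- rgamma(20, shape = 3, scale = 2)
# Step 1: transform to approximate normality (cube root, Wilson-Hilferty).
# Step 2: prediction interval for the next observation on the transformed scale.
pi.trans <- predIntNorm(dat^(1/3), k = 1, pi.type = "two-sided",
    conf.level = 0.95)
# Step 3: back-transform the limits to the original scale.
# (EnvStats would additionally reset a negative lower limit to 0.)
pi.trans$interval$limits^3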
The argument normal.approx.transform
determines which transformation is used.
The value normal.approx.transform="cube.root"
uses
the cube root transformation suggested by Wilson and Hilferty (1931) and used by
Krishnamoorthy et al. (2008) and Singh et al. (2010b), and the value
normal.approx.transform="fourth.root"
uses the fourth root transformation suggested
by Hawkins and Wixley (1986) and used by Singh et al. (2010b).
The default value normal.approx.transform="kulkarni.powar"
uses the "Optimum Power Normal Approximation Method" of Kulkarni and Powar (2010).
The "optimum" power is determined by:
|
if |
|
if |
where denotes the estimate of the shape parameter. Although
Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to
determine the power
, for the functions
predIntGamma
and predIntGammaAlt
the power is based on whatever estimate of
shape is used (e.g.,
est.method="mle"
, est.method="bcmle"
, etc.).
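For example, a small helper implementing the piecewise rule quoted above (a sketch only; the function name kulkarni.powar.power is hypothetical and is not an exported EnvStats function):

# Hypothetical helper: Kulkarni-Powar "optimum" normalizing power as a
# function of the estimated shape parameter, using the piecewise rule above.
kulkarni.powar.power <- function(shape) {
  ifelse(shape > 1.5, 0.246,
    -0.0705 - 0.178 * shape + 0.475 * sqrt(shape))
}
kulkarni.powar.power(2.203862)  # 0.246, matching the Examples below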
When the argument n.transmean
is larger than 1 (i.e., you are
constructing a prediction interval for future means, not just single
observations), in order to properly compare a future mean with the
prediction limits, you must follow these steps:
Take the observations that will be used to compute the mean and
transform them by raising them to the power given by the value in the component
interval$normal.transform.power
(see the section VALUE below).
Compute the mean of the transformed observations.
Take the mean computed in step 2 above and raise it to the inverse of the power originally used to transform the observations (these steps are sketched in the short example below).
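For example (a minimal sketch using hypothetical simulated data, separate from the Examples section):

# Sketch of steps 1-3: compare a future mean of 2 observations to an
# upper prediction limit for transformed means of order 2.
set.seed(100)
dat <- rgamma(20, shape = 3, scale = 2)
pi.obj <- predIntGamma(dat, n.transmean = 2, k = 1, pi.type = "upper")
new.obs <- rgamma(2, shape = 3, scale = 2)     # two future observations
p <- pi.obj$interval$normal.transform.power    # power used for the transformation
trans.obs <- new.obs^p                         # step 1: transform the future observations
back.mean <- mean(trans.obs)^(1/p)             # steps 2-3: mean, then back-transform
back.mean > pi.obj$interval$limits["UPL"]      # TRUE would signal an exceedance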
A list of class "estimate"
containing the estimated parameters,
the prediction interval, and other information. See estimate.object
for details.
In addition to the usual components contained in an object of class
"estimate"
, the returned value also includes two additional
components within the "interval"
component:
n.transmean |
the value of |
normal.transform.power |
the value of the power used to transform the original data to approximate normality. |
It is possible for the lower prediction limit based on the transformed data to be less than 0. In this case, the lower prediction limit on the original scale is set to 0 and a warning is issued stating that the normal approximation is not accurate in this case.
The gamma distribution takes values on the positive real line. Special cases of the gamma are the exponential distribution and the chi-square distributions. Applications of the gamma include life testing, statistical ecology, queuing theory, inventory control, and precipitation processes. A gamma distribution starts to resemble a normal distribution as the shape parameter a tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
Prediction intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973), and are often discussed in the context of linear regression (Draper and Smith, 1998; Zar, 2010). Prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities. References that discuss prediction intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Evans, M., N. Hastings, and B. Peacock. (1993). Statistical Distributions. Second Edition. John Wiley and Sons, New York, Chapter 18.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Krishnamoorthy K., T. Mathew, and S. Mukherjee. (2008). Normal-Based Methods for a Gamma Distribution: Prediction and Tolerance Intervals and Stress-Strength Reliability. Technometrics, 50(1), 69–78.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, New Jersey.
GammaDist, GammaAlt, estimate.object, egamma, predIntNorm, tolIntGamma.
# Generate 20 observations from a gamma distribution with parameters # shape=3 and scale=2, then create a prediciton interval for the # next observation. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rgamma(20, shape = 3, scale = 2) predIntGamma(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.203862 # scale = 2.174928 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Normal Transform Power: 0.246 # #Prediction Interval Type: two-sided # #Confidence Level: 95% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.5371569 # UPL = 15.5313783 #-------------------------------------------------------------------- # Using the same data as in the previous example, create an upper # one-sided prediction interval for the next three averages of # order 2 (i.e., each mean is based on 2 future observations), and # use the bias-corrected estimate of shape. pred.list <- predIntGamma(dat, n.transmean = 2, k = 3, pi.type = "upper", est.method = "bcmle") pred.list #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 1.906616 # scale = 2.514005 # #Estimation Method: bcmle # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Bonferroni using # Kulkarni & Powar (2010) # transformation to Normality # based on bcmle of 'shape' # #Normal Transform Power: 0.246 # #Prediction Interval Type: upper # #Confidence Level: 95% # #Number of Future #Transformed Means: 3 # #Sample Size for #Transformed Means: 2 # #Prediction Interval: LPL = 0.00000 # UPL = 12.17404 #-------------------------------------------------------------------- # Continuing with the above example, assume the distribution shifts # in the future to a gamma distribution with shape = 5 and scale = 2. # Create 6 future observations from this distribution, and create 3 # means by pairing the observations sequentially. Note we must first # transform these observations using the power 0.246, then compute the # means based on the transformed data, and then transform the means # back to the original scale and compare them to the upper prediction # limit of 12.17 set.seed(427) new.dat <- rgamma(6, shape = 5, scale = 2) p <- pred.list$interval$normal.transform.power p #[1] 0.246 new.dat.trans <- new.dat^p means.trans <- c(mean(new.dat.trans[1:2]), mean(new.dat.trans[3:4]), mean(new.dat.trans[5:6])) means <- means.trans^(1/p) means #[1] 11.74214 17.05299 11.65272 any(means > pred.list$interval$limits["UPL"]) #[1] TRUE #---------- # Clean up rm(dat, pred.list, new.dat, p, new.dat.trans, means.trans, means) #-------------------------------------------------------------------- # Reproduce part of the example on page 73 of # Krishnamoorthy et al. (2008), which uses alkalinity concentrations # reported in Gibbons (1994) and Gibbons et al. (2009) to construct a # one-sided upper 90% prediction limit. 
predIntGamma(Gibbons.et.al.09.Alkilinity.vec, pi.type = "upper", conf.level = 0.9, normal.approx.transform = "cube.root") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 9.375013 # scale = 6.202461 # #Estimation Method: mle # #Data: Gibbons.et.al.09.Alkilinity.vec # #Sample Size: 27 # #Prediction Interval Method: exact using # Wilson & Hilferty (1931) cube-root # transformation to Normality # #Normal Transform Power: 0.3333333 # #Prediction Interval Type: upper # #Confidence Level: 90% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.0000 # UPL = 85.3495
Estimate the shape and scale parameters for a
gamma distribution,
or estimate the mean and coefficient of variation for a
gamma distribution (alternative parameterization),
and construct a simultaneous prediction interval for the next r sampling
occasions, based on one of three possible rules: k-of-m, California, or
Modified California.
predIntGammaSimultaneous(x, n.transmean = 1, k = 1, m = 2, r = 1,
    rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper",
    conf.level = 0.95, K.tol = 1e-07, est.method = "mle",
    normal.approx.transform = "kulkarni.powar")

predIntGammaAltSimultaneous(x, n.transmean = 1, k = 1, m = 2, r = 1,
    rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper",
    conf.level = 0.95, K.tol = 1e-07, est.method = "mle",
    normal.approx.transform = "kulkarni.powar")
x |
numeric vector of non-negative observations. Missing ( |
n.transmean |
positive integer specifying the sample size associated with future transformed
means (see the DETAILS section for an explanation of what the transformation is).
The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
transformed means) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
delta.over.sigma |
numeric scalar indicating the ratio |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
est.method |
character string specifying the method of estimation for the shape and scale
distribution parameters. The possible values are
|
normal.approx.transform |
character string indicating which power transformation to use.
Possible values are |
The function predIntGammaSimultaneous
returns a simultaneous prediction
interval as well as estimates of the shape and scale parameters.
The function predIntGammaAltSimultaneous
returns a simultaneous prediction
interval as well as estimates of the mean and coefficient of variation.
Following Krishnamoorthy et al. (2008), the simultaneous prediction interval is computed by:
using a power transformation on the original data to induce approximate normality,
calling predIntNormSimultaneous
with the transformed data to
compute the simultaneous prediction interval, and then
back-transforming the interval to create a simultaneous prediction interval on the original scale.
The argument normal.approx.transform
determines which transformation is used.
The value normal.approx.transform="cube.root"
uses
the cube root transformation suggested by Wilson and Hilferty (1931) and used by
Krishnamoorthy et al. (2008) and Singh et al. (2010b), and the value
normal.approx.transform="fourth.root"
uses the fourth root transformation
suggested by Hawkins and Wixley (1986) and used by Singh et al. (2010b).
The default value normal.approx.transform="kulkarni.powar"
uses the "Optimum Power Normal Approximation Method" of Kulkarni and Powar (2010).
The "optimum" power is determined by:
|
if |
|
if |
where denotes the estimate of the shape parameter. Although
Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to
determine the power
, for the functions
predIntGammaSimultaneous
and predIntGammaAltSimultaneous
the power
is based on whatever estimate of shape is used
(e.g.,
est.method="mle"
, est.method="bcmle"
, etc.).
When the argument n.transmean
is larger than 1 (i.e., you are
constructing a prediction interval for future means, not just single
observations), in order to properly compare a future mean with the
prediction limits, you must follow these steps:
Take the observations that will be used to compute the mean and
transform them by raising them to the power given by the value in the
component interval$normal.transform.power
(see the section VALUE below).
Compute the mean of the transformed observations.
Take the mean computed in step 2 above and raise it to the inverse of the power originally used to transform the observations.
A list of class "estimate"
containing the estimated parameters,
the simultaneous prediction interval, and other information.
See estimate.object
for details.
In addition to the usual components contained in an object of class
"estimate"
, the returned value also includes two additional
components within the "interval"
component:
n.transmean |
the value of |
normal.transform.power |
the value of the power used to transform the original data to approximate normality. |
It is possible for the lower prediction limit based on the transformed data to be less than 0. In this case, the lower prediction limit on the original scale is set to 0 and a warning is issued stating that the normal approximation is not accurate in this case.
The Gamma Distribution
The gamma distribution takes values on the positive real line.
Special cases of the gamma are the exponential distribution and
the chi-square distributions. Applications of the gamma include
life testing, statistical ecology, queuing theory, inventory control, and precipitation
processes. A gamma distribution starts to resemble a normal distribution as the
shape parameter a tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly
recommend against using a lognormal model for environmental data and recommend trying a
gamma distribution instead.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at
hazardous and solid waste facilities is the requirement of testing several wells and
several constituents at each well on each sampling occasion. This is an obvious
multiple comparisons problem, and the naive approach of using a standard t-test at
a conventional α-level (e.g., 0.05 or 0.01) for each test leads to a
very high probability of at least one significant result on each sampling occasion,
when in fact no contamination has occurred. This problem was pointed out years ago
by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of
controlling the facility-wide false positive rate (FWFPR) while maintaining adequate
power to detect contamination in the groundwater. Because of the ubiquitous presence
of spatial variability, it is usually best to use simultaneous prediction intervals
at each well (Davis, 1998a). That is, by constructing prediction intervals based on
background (pre-landfill) data on each well, and comparing future observations at a
well to the prediction interval for that particular well. In each of these cases,
the individual α-level at each well is equal to the FWFPR divided by the
product of the number of wells and constituents.
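For example (a small sketch with assumed numbers of wells and constituents; the same quantities appear in the Examples below):

# Per-test significance level needed to hold the facility-wide false
# positive rate (FWFPR) at 10% across 50 wells and 10 constituents.
FWFPR <- 0.1
n.wells <- 50
n.constituents <- 10
FWFPR / (n.wells * n.constituents)                 # Bonferroni-style: 0.0002
1 - (1 - FWFPR)^(1 / (n.wells * n.constituents))   # exact:            ~0.00021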
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the
1-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule
For the k-of-m rule, Davis and McNichols (1987) give tables with
“optimal” choices of k (in terms of best power for a given overall
confidence level) for selected values of m, r, and n. They found
that the optimal ratios of k to m (i.e., k/m) are generally small,
in the range of 15-50%.
The California Rule
The California rule was mandated in that state for groundwater monitoring at waste
disposal facilities when resampling verification is part of the statistical program
(Barclay's Code of California Regulations, 1991). The California code mandates a
“California” rule with m ≥ 3. The motivation for this rule may have
been a desire to have a majority of the observations in bounds (Davis, 1998a). For
example, for a k-of-m rule with k = 1 and m = 3, a monitoring
location will pass if the first observation is out of bounds, the second resample
is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations
are out of bounds. For the California rule with m = 3, either the first
observation must be in bounds, or the next 2 observations must be in bounds in order
for the monitoring location to pass.
Davis (1998a) states that if the FWFPR is kept constant, then the California rule
offers little increased power compared to the k-of-m rule, and can
actually decrease the power of detecting contamination.
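To make the pass/fail logic of these rules concrete, here is a hypothetical illustration (the helper functions below are for exposition only and are not part of EnvStats):

# Pass/fail decisions for one monitoring location, given a logical vector
# of the initial observation plus resamples (TRUE = in bounds, i.e.,
# below the upper prediction limit).
passes.1.of.3 <- function(in.bounds) any(in.bounds[1:3])                  # k = 1, m = 3
passes.CA.3 <- function(in.bounds) in.bounds[1] || all(in.bounds[2:3])    # California, m = 3

obs <- c(FALSE, FALSE, TRUE)   # only the final resample is in bounds
passes.1.of.3(obs)  # TRUE:  one in-bounds result suffices under 1-of-3
passes.CA.3(obs)    # FALSE: California needs the first result, or both resamples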
The Modified California Rule
The Modified California Rule was proposed as a compromise between a 1-of-m
rule and the California rule. For a given FWFPR, the Modified California rule
achieves better power than the California rule, and still requires at least as many
observations in bounds as out of bounds, unlike a 1-of-m rule.
Different Notations Between Different References
For the k-of-m rule described in this help file, both
Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable
p instead of k to represent the minimum number
of future observations the interval should contain on each of the r sampling
occasions.
Gibbons et al. (2009, Chapter 1) presents extensive lists of the value of
K for both k-of-m rules and California rules. Gibbons et al.'s
notation reverses the meaning of k and r compared to the notation used
in this help file. That is, in Gibbons et al.'s notation, k represents the
number of future sampling occasions or monitoring wells, and r represents the
minimum number of observations the interval should contain on each sampling occasion.
Steven P. Millard ([email protected])
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Lognormal Population on Each of r Future Occasions.
Technometrics 29, 359–370.
Evans, M., N. Hastings, and B. Peacock. (1993). Statistical Distributions. Second Edition. John Wiley and Sons, New York, Chapter 18.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least p Out of m Future Observations From a Lognormal Population.
Technometrics 19, 167–177.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Lognormal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Lognormal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at Least m Out of k Future Observations.
Technometrics 15, 897–914.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Krishnamoorthy K., T. Mathew, and S. Mukherjee. (2008). Normal-Based Methods for a Gamma Distribution: Prediction and Tolerance Intervals and Stress-Strength Reliability. Technometrics, 50(1), 69–78.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
GammaDist, GammaAlt, predIntNorm, predIntNormSimultaneous, predIntNormSimultaneousTestPower, tolIntGamma, egamma, egammaAlt, estimate.object.
# Generate 8 observations from a gamma distribution with parameters # mean=10 and cv=1, then use predIntGammaAltSimultaneous to estimate the # mean and coefficient of variation of the true distribution and construct an # upper 95% prediction interval to contain at least 1 out of the next # 3 observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(479) dat <- rgammaAlt(8, mean = 10, cv = 1) predIntGammaAltSimultaneous(dat, k = 1, m = 3) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): mean = 13.875825 # cv = 1.049504 # #Estimation Method: MLE # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact using # Kulkarni & Powar (2010) # transformation to Normality # based on MLE of 'shape' # #Normal Transform Power: 0.2204908 # #Prediction Interval Type: upper # #Confidence Level: 95% # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = 0.00000 # UPL = 15.87101 #---------- # Compare the 95% 1-of-3 upper prediction limit to the California and # Modified California upper prediction limits. Note that the upper # prediction limit for the Modified California rule is between the limit # for the 1-of-3 rule and the limit for the California rule. predIntGammaAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #15.87101 predIntGammaAltSimultaneous(dat, m = 3, rule = "CA")$interval$limits["UPL"] # UPL #34.11499 predIntGammaAltSimultaneous(dat, rule = "Modified.CA")$interval$limits["UPL"] # UPL #22.58809 #---------- # Show how the upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. # Here, we'll use the 1-of-3 rule. predIntGammaAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #15.87101 predIntGammaAltSimultaneous(dat, k = 1, m = 3, r = 10)$interval$limits["UPL"] # UPL #37.86825 #---------- # Compare the upper simultaneous prediction limit for the 1-of-3 rule # based on individual observations versus based on transformed means of # order 4. predIntGammaAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #15.87101 predIntGammaAltSimultaneous(dat, n.transmean = 4, k = 1, m = 3)$interval$limits["UPL"] # UPL #14.76528 #========== # Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an # upper simultaneous prediction limit for the 1-of-3 rule for # r = 2 future sampling occasions. The data for this example are # stored in EPA.09.Ex.19.1.sulfate.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 25 to use to construct the prediction limit. # There are 50 compliance wells and we will monitor 10 different # constituents at each well at each of the r=2 future sampling # occasions. To determine the confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the individual Type I Error level at each well to # 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. 
Thus, the confidence level is given by: nc <- 10 nw <- 50 SWFPR <- 0.1 conf.level <- (1 - SWFPR)^(1 / (nc * nw)) conf.level #[1] 0.9997893 #---------- # Look at the data: names(EPA.09.Ex.19.1.sulfate.df) #[1] "Well" "Month" "Day" #[4] "Year" "Date" "Sulfate.mg.per.l" #[7] "log.Sulfate.mg.per.l" EPA.09.Ex.19.1.sulfate.df[, c("Well", "Date", "Sulfate.mg.per.l", "log.Sulfate.mg.per.l")] # Well Date Sulfate.mg.per.l log.Sulfate.mg.per.l #1 GW-01 1999-07-08 63.0 4.143135 #2 GW-01 1999-09-12 51.0 3.931826 #3 GW-01 1999-10-16 60.0 4.094345 #4 GW-01 1999-11-02 86.0 4.454347 #5 GW-04 1999-07-09 104.0 4.644391 #6 GW-04 1999-09-14 102.0 4.624973 #7 GW-04 1999-10-12 84.0 4.430817 #8 GW-04 1999-11-15 72.0 4.276666 #9 GW-08 1997-10-12 31.0 3.433987 #10 GW-08 1997-11-16 84.0 4.430817 #11 GW-08 1998-01-28 65.0 4.174387 #12 GW-08 1999-04-20 41.0 3.713572 #13 GW-08 2002-06-04 51.8 3.947390 #14 GW-08 2002-09-16 57.5 4.051785 #15 GW-08 2002-12-02 66.8 4.201703 #16 GW-08 2003-03-24 87.1 4.467057 #17 GW-09 1997-10-16 59.0 4.077537 #18 GW-09 1998-01-28 85.0 4.442651 #19 GW-09 1998-04-12 75.0 4.317488 #20 GW-09 1998-07-12 99.0 4.595120 #21 GW-09 2000-01-30 75.8 4.328098 #22 GW-09 2000-04-24 82.5 4.412798 #23 GW-09 2000-10-24 85.5 4.448516 #24 GW-09 2002-12-01 188.0 5.236442 #25 GW-09 2003-03-24 150.0 5.010635 # The EPA guidance document constructs the upper simultaneous # prediction limit for the 1-of-3 plan assuming a lognormal # distribution for the sulfate data. Here we will compare # the value of the limit based on assuming a lognormal distribution # versus assuming a gamma distribution. Sulfate <- EPA.09.Ex.19.1.sulfate.df$Sulfate.mg.per.l pred.int.list.lnorm <- predIntLnormSimultaneous(x = Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) pred.int.list.gamma <- predIntGammaSimultaneous(x = Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) pred.int.list.lnorm$interval$limits["UPL"] # UPL #159.5497 pred.int.list.gamma$interval$limits["UPL"] # UPL #153.3232 #========== # Cleanup #-------- rm(dat, nc, nw, SWFPR, conf.level, Sulfate, pred.int.list.lnorm, pred.int.list.gamma)
Estimate the mean and standard deviation on the log-scale for a lognormal distribution, or estimate the mean and coefficient of variation for a lognormal distribution (alternative parameterization), and construct a prediction interval for the next k observations or next k geometric means.
predIntLnorm(x, n.geomean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95) predIntLnormAlt(x, n.geomean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95, est.arg.list = NULL)
x |
For For If |
n.geomean |
positive integer specifying the sample size associated with the |
k |
positive integer specifying the number of future observations or geometric means the
prediction interval should contain with confidence level |
method |
character string specifying the method to use if the number of future observations
( |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
est.arg.list |
for |
The function predIntLnorm
returns a prediction interval as well as
estimates of the meanlog and sdlog parameters.
The function predIntLnormAlt
returns a prediction interval as well as
estimates of the mean and coefficient of variation.
A prediction interval for a lognormal distribution is constructed by taking the
natural logarithm of the observations and constructing a prediction interval
based on the normal (Gaussian) distribution by calling predIntNorm
.
These prediction limits are then exponentiated to produce a prediction interval on
the original scale of the data.
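As a quick illustration of this construction (a sketch of my own, not part of the package examples; the simulated data are arbitrary), the limits returned by predIntLnorm can be reproduced by applying predIntNorm to the log-transformed data and exponentiating the result:

set.seed(23)
x <- rlnorm(15, meanlog = 1, sdlog = 0.5)

# Two-sided 95% prediction limits computed directly on the lognormal scale
predIntLnorm(x, k = 1, pi.type = "two-sided", conf.level = 0.95)$interval$limits

# Same limits via the log scale: normal limits, then exponentiate
exp(predIntNorm(log(x), k = 1, pi.type = "two-sided",
  conf.level = 0.95)$interval$limits)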
If x
is a numeric vector, a list of class
"estimate"
containing the estimated parameters, the prediction interval,
and other information. See the help file for estimate.object
for details.
If x
is the result of calling an estimation function,
predIntLnorm
returns a list whose class is the same as x
.
The list contains the same components as x
, as well as a component called
interval
containing the prediction interval information.
If x
already has a component called interval
, this component is
replaced with the prediction interval information.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Dunnett, C.W. (1955). A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50, 1096-1121.
Dunnett, C.W. (1964). New Tables for Multiple Comparisons with a Control. Biometrics 20, 482-491.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Lognormal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Lognormal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York.
Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. (available on-line at: https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf).
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
elnorm
, elnormAlt
,
predIntNorm
, predIntNormK
,
predIntLnormSimultaneous
, predIntLnormAltSimultaneous
,
tolIntLnorm
, tolIntLnormAlt
,
Lognormal, estimate.object
.
# Generate 20 observations from a lognormal distribution with parameters # meanlog=0 and sdlog=1. The exact two-sided 90% prediction interval for # k=1 future observation is given by: [exp(-1.645), exp(1.645)] = [0.1930, 5.181]. # Use predIntLnorm to estimate the distribution parameters, and construct a # two-sided 90% prediction interval. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(47) dat <- rlnorm(20, meanlog = 0, sdlog = 1) predIntLnorm(dat, conf = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.1035722 # sdlog = 0.9106429 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: two-sided # #Confidence Level: 90% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.1795898 # UPL = 4.5264399 #---------- # Repeat the above example, but do it in two steps. # First create a list called est.list containing information about the # estimated parameters, then create the prediction interval. est.list <- elnorm(dat) est.list #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.1035722 # sdlog = 0.9106429 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 predIntLnorm(est.list, conf = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.1035722 # sdlog = 0.9106429 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: two-sided # #Confidence Level: 90% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.1795898 # UPL = 4.5264399 #---------- # Using the same data from the first example, create a one-sided # upper 99% prediction limit for the next 3 geometric means of order 2 # (i.e., each of the 3 future geometric means is based on a sample size # of 2 future observations). predIntLnorm(dat, n.geomean = 2, k = 3, conf.level = 0.99, pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.1035722 # sdlog = 0.9106429 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Bonferroni # #Prediction Interval Type: upper # #Confidence Level: 99% # #Number of Future #Geometric Means: 3 # #Sample Size for #Geometric Means: 2 # #Prediction Interval: LPL = 0.000000 # UPL = 7.047571 #---------- # Compare the result above that is based on the Bonferroni method # with the exact method predIntLnorm(dat, n.geomean = 2, k = 3, conf.level = 0.99, pi.type = "upper", method = "exact")$interval$limits["UPL"] # UPL #7.00316 #---------- # Clean up rm(dat, est.list) #-------------------------------------------------------------------- # Example 18-2 of USEPA (2009, p.18-15) shows how to construct a 99% # upper prediction interval for the log-scale mean of 4 future observations # (future mean of order 4) assuming a lognormal distribution based on # chrysene concentrations (ppb) in groundwater at 2 background wells. # Data were collected once per month over 4 months at the 2 background # wells, and also at a compliance well. 
# The question to be answered is whether there is evidence of # contamination at the compliance well. # Here we will follow the example, but look at the geometric mean # instead of the log-scale mean. #---------- # The data for this example are stored in EPA.09.Ex.18.2.chrysene.df. EPA.09.Ex.18.2.chrysene.df # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 6.9 #2 2 Well.1 Background 27.3 #3 3 Well.1 Background 10.8 #4 4 Well.1 Background 8.9 #5 1 Well.2 Background 15.1 #6 2 Well.2 Background 7.2 #7 3 Well.2 Background 48.4 #8 4 Well.2 Background 7.8 #9 1 Well.3 Compliance 68.0 #10 2 Well.3 Compliance 48.9 #11 3 Well.3 Compliance 30.1 #12 4 Well.3 Compliance 38.1 Chrysene.bkgd <- with(EPA.09.Ex.18.2.chrysene.df, Chrysene.ppb[Well.type == "Background"]) Chrysene.cmpl <- with(EPA.09.Ex.18.2.chrysene.df, Chrysene.ppb[Well.type == "Compliance"]) #---------- # A Shapiro-Wilks goodness-of-fit test for normality indicates # we should reject the assumption of normality and assume a # lognormal distribution for the background well data: gofTest(Chrysene.bkgd) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Normal # #Estimated Parameter(s): mean = 16.55000 # sd = 14.54441 # #Estimation Method: mvue # #Data: Chrysene.bkgd # #Sample Size: 8 # #Test Statistic: W = 0.7289006 # #Test Statistic Parameter: n = 8 # #P-value: 0.004759859 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. gofTest(Chrysene.bkgd, dist = "lnorm") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.5533006 # sdlog = 0.7060038 # #Estimation Method: mvue # #Data: Chrysene.bkgd # #Sample Size: 8 # #Test Statistic: W = 0.8546352 # #Test Statistic Parameter: n = 8 # #P-value: 0.1061057 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. #---------- # Here is the one-sided 99% upper prediction limit for # a geometric mean based on 4 future observations: predIntLnorm(Chrysene.bkgd, n.geomean = 4, k = 1, conf.level = 0.99, pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.5533006 # sdlog = 0.7060038 # #Estimation Method: mvue # #Data: Chrysene.bkgd # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99% # #Number of Future #Geometric Means: 1 # #Sample Size for #Geometric Means: 4 # #Prediction Interval: LPL = 0.00000 # UPL = 46.96613 UPL <- predIntLnorm(Chrysene.bkgd, n.geomean = 4, k = 1, conf.level = 0.99, pi.type = "upper")$interval$limits["UPL"] UPL # UPL #46.96613 # Is there evidence of contamination at the compliance well? geoMean(Chrysene.cmpl) #[1] 44.19034 # Since the geometric mean at the compliance well is less than # the upper prediction limit, there is no evidence of contamination. #---------- # Cleanup #-------- rm(Chrysene.bkgd, Chrysene.cmpl, UPL)
Compute the probability that at least one set of future observations violates the given rule based on a simultaneous prediction interval for the next r future sampling occasions for a lognormal distribution. The three possible rules are: k-of-m, California, or Modified California.
predIntLnormAltSimultaneousTestPower(n, df = n - 1, n.geomean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", ratio.of.means = 1, cv = 1, pi.type = "upper", conf.level = 0.95, r.shifted = r, K.tol = .Machine$double.eps^0.5, integrate.args.list = NULL)
n |
vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
vector of positive integers indicating the degrees of freedom associated with
the sample size. The default value is |
n.geomean |
positive integer specifying the sample size associated with the future geometric
means.
The default value is |
k |
for the |
m |
vector of positive integers specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
ratio.of.means |
numeric vector specifying the ratio of the mean of the population that will be
sampled to produce the future observations vs. the mean of the population that
was sampled to construct the prediction interval. See the DETAILS section below
for more information. The default value is |
cv |
numeric vector of positive values specifying the coefficient of variation for
both the population that was sampled to construct the prediction interval and
the population that will be sampled to produce the future observations. The
default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
vector of values between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
r.shifted |
vector of positive integers specifying the number of future sampling occasions for
which the mean is shifted. All values must be integers
between |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
integrate.args.list |
a list of arguments to supply to the |
What is a Simultaneous Prediction Interval?

A prediction interval for some population is an interval on the real line constructed so that it will contain k future observations from that population with some specified probability (1-alpha)100%, where 0 < alpha < 1 and k is some pre-specified positive integer. The quantity (1-alpha)100% is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNorm computes a standard prediction interval based on a sample from a normal distribution.

The function predIntLnormAltSimultaneous computes a simultaneous prediction interval (assuming lognormal observations) that will contain a certain number of future observations with probability (1-alpha)100% for each of r future sampling “occasions”, where r is some pre-specified positive integer. The quantity r may refer to r distinct future sampling occasions in time, or it may for example refer to sampling at r distinct locations on one future sampling occasion, assuming that the population standard deviation is the same at all of the r distinct locations.
The function predIntLnormAltSimultaneous
computes a simultaneous
prediction interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of the next m future observations will fall in the prediction interval with probability (1-alpha)100% on each of the r future sampling occasions. If observations are being taken sequentially, for a particular sampling occasion, up to m observations may be taken, but once k of the observations fall within the prediction interval, sampling can stop. Note: When k=m and r=1, the results of predIntNormSimultaneous are equivalent to the results of predIntNorm.
For the California rule (rule="CA"), with probability (1-alpha)100%, for each of the r future sampling occasions, either the first observation will fall in the prediction interval, or else all of the next m-1 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability (1-alpha)100%, for each of the r future sampling occasions, either the first observation will fall in the prediction interval, or else at least 2 out of the next 3 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, up to 3 more observations must be taken.
Computing Power

The function predIntNormSimultaneousTestPower computes the probability that at least one set of future observations or averages will violate the given rule based on a simultaneous prediction interval for the next r future sampling occasions, under the assumption of normally distributed observations, where the population mean for the future observations is allowed to differ from the population mean for the observations used to construct the prediction interval.
The function predIntLnormAltSimultaneousTestPower assumes all observations are from a lognormal distribution. The observations used to construct the prediction interval are assumed to come from a lognormal distribution with mean theta_1 and coefficient of variation tau. The future observations are assumed to come from a lognormal distribution with mean theta_2 and coefficient of variation tau; that is, the means are allowed to differ between the two populations, but not the coefficient of variation.
The function predIntLnormAltSimultaneousTestPower calls the function predIntNormSimultaneousTestPower, with the argument delta.over.sigma given by:

delta/sigma = log(R) / sqrt(log(tau^2 + 1))

where R is given by:

R = theta_2 / theta_1

and corresponds to the argument ratio.of.means for the function predIntLnormAltSimultaneousTestPower, and tau corresponds to the argument cv.
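To make the relationship above concrete, here is a minimal sketch (my own check, not taken from the help file; the particular values of n, k, m, r, ratio.of.means, and cv are arbitrary) showing that the lognormal-scale power equals the normal-theory power once ratio.of.means and cv are converted to delta.over.sigma:

R <- 3      # ratio.of.means
CV <- 1     # cv

# Power computed directly on the lognormal scale
predIntLnormAltSimultaneousTestPower(n = 8, k = 1, m = 3, r = 2,
  ratio.of.means = R, cv = CV)

# Equivalent normal-theory power after converting to delta.over.sigma
predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, r = 2,
  delta.over.sigma = log(R) / sqrt(log(CV^2 + 1)))

# The two values should agree (up to numerical integration error).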
vector of values between 0 and 1 equal to the probability that the rule will be violated.
See the help files for predIntLnormAltSimultaneous
and
predIntNormSimultaneousTestPower
.
Steven P. Millard ([email protected])
See the help file for predIntLnormAltSimultaneous
.
predIntLnormAltSimultaneous
, plotPredIntLnormAltSimultaneousTestPowerCurve
, predIntNormSimultaneous
, plotPredIntNormSimultaneousTestPowerCurve
,
Prediction Intervals, LognormalAlt.
# For the k-of-m rule with n=4, k=1, m=3, and r=1, show how the power increases # as ratio.of.means increases. Assume a 95% upper prediction interval. predIntLnormAltSimultaneousTestPower(n = 4, m = 3, ratio.of.means = 1:3) #[1] 0.0500000 0.2356914 0.4236723 #---------- # Look at how the power increases with sample size for an upper one-sided # prediction interval using the k-of-m rule with k=1, m=3, r=20, # ratio.of.means=4, and a confidence level of 95%. predIntLnormAltSimultaneousTestPower(n = c(4, 8), m = 3, r = 20, ratio.of.means = 4) #[1] 0.4915743 0.8218175 #---------- # Compare the power for the 1-of-3 rule with the power for the California and # Modified California rules, based on a 95% upper prediction interval and # ratio.of.means=4. Assume a sample size of n=8. Note that in this case the # power for the Modified California rule is greater than the power for the # 1-of-3 rule and California rule. predIntLnormAltSimultaneousTestPower(n = 8, k = 1, m = 3, ratio.of.means = 4) #[1] 0.6594845 predIntLnormAltSimultaneousTestPower(n = 8, m = 3, rule = "CA", ratio.of.means = 4) #[1] 0.5864311 predIntLnormAltSimultaneousTestPower(n = 8, rule = "Modified.CA", ratio.of.means = 4) #[1] 0.691135 #---------- # Show how the power for an upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. Here, we'll use the # 1-of-3 rule with n=8 and ratio.of.means=4. predIntLnormAltSimultaneousTestPower(n = 8, k = 1, m = 3, r = c(1, 2, 5, 10), ratio.of.means = 4) #[1] 0.6594845 0.7529576 0.8180814 0.8302302
Compute the probability that at least one out of k future observations (or geometric means) falls outside a prediction interval for k future observations (or geometric means) for a lognormal distribution.
predIntLnormAltTestPower(n, df = n - 1, n.geomean = 1, k = 1, ratio.of.means = 1, cv = 1, pi.type = "upper", conf.level = 0.95)
n |
vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
vector of positive integers indicating the degrees of freedom associated with
the sample size. The default value is |
n.geomean |
positive integer specifying the sample size associated with the future
geometric means. The default value is |
k |
vector of positive integers specifying the number of future observations that the
prediction interval should contain with confidence level |
ratio.of.means |
numeric vector specifying the ratio of the mean of the population that will be
sampled to produce the future observations vs. the mean of the population that
was sampled to construct the prediction interval. See the DETAILS section below
for more information. The default value is |
cv |
numeric vector of positive values specifying the coefficient of variation for
both the population that was sampled to construct the prediction interval and
the population that will be sampled to produce the future observations. The
default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
A prediction interval for some population is an interval on the real line constructed so that it will contain k future observations or averages from that population with some specified probability (1-alpha)100%, where 0 < alpha < 1 and k is some pre-specified positive integer. The quantity (1-alpha)100% is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNorm computes a standard prediction interval based on a sample from a normal distribution.
The function predIntNormTestPower computes the probability that at least one out of k future observations or averages will not be contained in a prediction interval based on the assumption of normally distributed observations, where the population mean for the future observations is allowed to differ from the population mean for the observations used to construct the prediction interval.
The function predIntLnormAltTestPower assumes all observations are from a lognormal distribution. The observations used to construct the prediction interval are assumed to come from a lognormal distribution with mean theta_1 and coefficient of variation tau. The future observations are assumed to come from a lognormal distribution with mean theta_2 and coefficient of variation tau; that is, the means are allowed to differ between the two populations, but not the coefficient of variation.
The function predIntLnormAltTestPower calls the function predIntNormTestPower, with the argument delta.over.sigma given by:

delta/sigma = log(R) / sqrt(log(tau^2 + 1))

where R is given by:

R = theta_2 / theta_1

and corresponds to the argument ratio.of.means for the function predIntLnormAltTestPower, and tau corresponds to the argument cv.
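As with the simultaneous version, the conversion can be checked directly. The following sketch is my own illustration (not from the original examples; the values of n, k, ratio.of.means, and cv are arbitrary):

R <- 2      # ratio.of.means
CV <- 1     # cv

# Power computed on the lognormal scale
predIntLnormAltTestPower(n = 20, k = 1, ratio.of.means = R, cv = CV)

# Normal-theory power using the equivalent delta.over.sigma
predIntNormTestPower(n = 20, k = 1,
  delta.over.sigma = log(R) / sqrt(log(CV^2 + 1)))

# Both calls should return the same probability.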
vector of numbers between 0 and 1 equal to the probability that at least one of the k future observations or geometric means will fall outside the prediction interval.
See the help files for predIntNormTestPower
.
Steven P. Millard ([email protected])
See the help files for predIntNormTestPower
and
tTestLnormAltPower
.
plotPredIntLnormAltTestPowerCurve
,
predIntLnormAlt
,
predIntNorm
, predIntNormK
, plotPredIntNormTestPowerCurve
,
predIntLnormAltSimultaneous
, predIntLnormAltSimultaneousTestPower
, Prediction Intervals,
LognormalAlt.
# Show how the power increases as ratio.of.means increases. Assume a # 95% upper prediction interval. predIntLnormAltTestPower(n = 4, ratio.of.means = 1:3) #[1] 0.0500000 0.1459516 0.2367793 #---------- # Look at how the power increases with sample size for an upper one-sided # prediction interval with k=3, ratio.of.means=4, and a confidence level of 95%. predIntLnormAltTestPower(n = c(4, 8), k = 3, ratio.of.means = 4) #[1] 0.2860952 0.4533567 #---------- # Show how the power for an upper 95% prediction limit increases as the # number of future observations k increases. Here, we'll use n=20 and # ratio.of.means=2. predIntLnormAltTestPower(n = 20, k = 1:3, ratio.of.means = 2) #[1] 0.1945886 0.2189538 0.2321562
Estimate the mean and standard deviation on the log-scale for a lognormal distribution, or estimate the mean and coefficient of variation for a lognormal distribution (alternative parameterization), and construct a simultaneous prediction interval for the next r sampling occasions, based on one of three possible rules: k-of-m, California, or Modified California.
predIntLnormSimultaneous(x, n.geomean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5) predIntLnormAltSimultaneous(x, n.geomean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5, est.arg.list = NULL)
x |
For For If |
n.geomean |
positive integer specifying the sample size associated with future
geometric means. The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
geometric means) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
delta.over.sigma |
numeric scalar indicating the ratio |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
est.arg.list |
a list containing arguments to pass to the function
|
The function predIntLnormSimultaneous
returns a simultaneous prediction
interval as well as estimates of the meanlog and sdlog parameters.
The function predIntLnormAltSimultaneous
returns a prediction interval as
well as estimates of the mean and coefficient of variation.
A simultaneous prediction interval for a lognormal distribution is constructed by
taking the natural logarithm of the observations and constructing a prediction
interval based on the normal (Gaussian) distribution by calling
predIntNormSimultaneous
.
These prediction limits are then exponentiated to produce a prediction interval on
the original scale of the data.
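As a quick illustration of this construction (my own sketch, not part of the package examples; the simulated data and parameter values are arbitrary), the limits returned by predIntLnormSimultaneous can be reproduced by applying predIntNormSimultaneous to the log-transformed data and exponentiating the result:

set.seed(58)
x <- rlnorm(12, meanlog = 2, sdlog = 0.6)

# Simultaneous prediction limit computed directly on the lognormal scale
lims.lnorm <- predIntLnormSimultaneous(x, k = 1, m = 3, r = 2,
  rule = "k.of.m", pi.type = "upper", conf.level = 0.95)$interval$limits

# Same limit via the log scale: normal limits, then exponentiate
lims.norm <- predIntNormSimultaneous(log(x), k = 1, m = 3, r = 2,
  rule = "k.of.m", pi.type = "upper", conf.level = 0.95)$interval$limits

lims.lnorm
exp(lims.norm)   # should match the limits above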
If x
is a numeric vector, predIntLnormSimultaneous
returns a list of
class "estimate"
containing the estimated parameters, the prediction interval,
and other information. See the help file for estimate.object
for details.
If x
is the result of calling an estimation function,
predIntLnormSimultaneous
returns a list whose class is the same as x
.
The list contains the same components as x
, as well as a component called
interval
containing the prediction interval information.
If x
already has a component called interval
, this component is
replaced with the prediction interval information.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at
hazardous and solid waste facilities is the requirement of testing several wells and
several constituents at each well on each sampling occasion. This is an obvious
multiple comparisons problem, and the naive approach of using a standard t-test at
a conventional alpha-level (e.g., 0.05 or 0.01) for each test leads to a
very high probability of at least one significant result on each sampling occasion,
when in fact no contamination has occurred. This problem was pointed out years ago
by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of
controlling the facility-wide false positive rate (FWFPR) while maintaining adequate
power to detect contamination in the groundwater. Because of the ubiquitous presence
of spatial variability, it is usually best to use simultaneous prediction intervals
at each well (Davis, 1998a). That is, by constructing prediction intervals based on
background (pre-landfill) data on each well, and comparing future observations at a
well to the prediction interval for that particular well. In each of these cases,
the individual alpha-level at each well is equal to the FWFPR divided by the
product of the number of wells and constituents.
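As a small numerical illustration (my own, not from the help file), the individual alpha-level implied by a site-wide false positive rate can be computed either by the simple division described above or by the exponent form used in the examples on this page; for small rates the two are nearly identical:

SWFPR <- 0.1
n.wells <- 50
n.constituents <- 10

alpha.divided  <- SWFPR / (n.wells * n.constituents)
alpha.exponent <- 1 - (1 - SWFPR)^(1 / (n.wells * n.constituents))

c(divided = alpha.divided, exponent = alpha.exponent)
# Both are approximately 2e-04, i.e., a per-test confidence level near 0.9998.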
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the 1-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule

For the k-of-m rule, Davis and McNichols (1987) give tables with “optimal” choices of k (in terms of best power for a given overall confidence level) for selected values of m, r, and n. They found that the optimal ratios of k to m (i.e., k/m) are generally small, in the range of 15-50%.
The California Rule

The California rule was mandated in that state for groundwater monitoring at waste disposal facilities when resampling verification is part of the statistical program (Barclay's Code of California Regulations, 1991). The California code mandates a “California” rule with m >= 3. The motivation for this rule may have been a desire to have a majority of the observations in bounds (Davis, 1998a). For example, for a k-of-m rule with k=1 and m=3, a monitoring location will pass if the first observation is out of bounds, the second resample is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations are out of bounds. For the California rule with m=3, either the first observation must be in bounds, or the next 2 observations must be in bounds in order for the monitoring location to pass.
Davis (1998a) states that if the FWFPR is kept constant, then the California rule offers little increased power compared to the 1-of-m rule, and can actually decrease the power of detecting contamination.
The Modified California Rule

The Modified California Rule was proposed as a compromise between a 1-of-m rule and the California rule. For a given FWFPR, the Modified California rule achieves better power than the California rule, and still requires at least as many observations in bounds as out of bounds, unlike a 1-of-m rule.
Different Notations Between Different References

For the k-of-m rule described in this help file, both Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable p instead of k to represent the minimum number of future observations the interval should contain on each of the r sampling occasions.

Gibbons et al. (2009, Chapter 1) presents extensive lists of the value of K for both k-of-m rules and California rules. Gibbons et al.'s notation reverses the meaning of k and r compared to the notation used in this help file. That is, in Gibbons et al.'s notation, k represents the number of future sampling occasions or monitoring wells, and r represents the minimum number of observations the interval should contain on each sampling occasion.
Steven P. Millard ([email protected])
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Lognormal Population on Each of r Future Occasions. Technometrics 29, 359–370.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least p Out of m Future Observations From a Lognormal Population. Technometrics 19, 167–177.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Lognormal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Lognormal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at Least m Out of k Future Observations. Technometrics 15, 897–914.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntLnormAltSimultaneousTestPower
,
predIntNorm
, predIntNormSimultaneous
, predIntNormSimultaneousTestPower
,
tolIntLnorm
, Lognormal, LognormalAlt, estimate.object
, elnorm
, elnormAlt
.
# Generate 8 observations from a lognormal distribution with parameters # mean=10 and cv=1, then use predIntLnormAltSimultaneous to estimate the # mean and coefficient of variation of the true distribution and construct an # upper 95% prediction interval to contain at least 1 out of the next # 3 observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(479) dat <- rlnormAlt(8, mean = 10, cv = 1) predIntLnormAltSimultaneous(dat, k = 1, m = 3) # Compare the 95% 1-of-3 upper prediction limit to the California and # Modified California upper prediction limits. Note that the upper # prediction limit for the Modified California rule is between the limit # for the 1-of-3 rule and the limit for the California rule. predIntLnormAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] predIntLnormAltSimultaneous(dat, m = 3, rule = "CA")$interval$limits["UPL"] predIntLnormAltSimultaneous(dat, rule = "Modified.CA")$interval$limits["UPL"] # Show how the upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. # Here, we'll use the 1-of-3 rule. predIntLnormAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] predIntLnormAltSimultaneous(dat, k = 1, m = 3, r = 10)$interval$limits["UPL"] # Compare the upper simultaneous prediction limit for the 1-of-3 rule # based on individual observations versus based on geometric means of # order 4. predIntLnormAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] predIntLnormAltSimultaneous(dat, n.geomean = 4, k = 1, m = 3)$interval$limits["UPL"] #========== # Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an # upper simultaneous prediction limit for the 1-of-3 rule for # r = 2 future sampling occasions. The data for this example are # stored in EPA.09.Ex.19.1.sulfate.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 25 to use to construct the prediction limit. # There are 50 compliance wells and we will monitor 10 different # constituents at each well at each of the r=2 future sampling # occasions. To determine the confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the individual Type I Error level at each well to # 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the confidence level is given by: nc <- 10 nw <- 50 SWFPR <- 0.1 conf.level <- (1 - SWFPR)^(1 / (nc * nw)) conf.level #---------- # Look at the data: names(EPA.09.Ex.19.1.sulfate.df) EPA.09.Ex.19.1.sulfate.df[, c("Well", "Date", "Sulfate.mg.per.l", "log.Sulfate.mg.per.l")] # Construct the upper simultaneous prediction limit for the # 1-of-3 plan assuming a lognormal distribution for the # sulfate data Sulfate <- EPA.09.Ex.19.1.sulfate.df$Sulfate.mg.per.l predIntLnormSimultaneous(x = Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) #========== # NOTE: Two-sided simultaneous prediction intervals computed using # Versions 2.4.0 - 2.8.1 of EnvStats are *NOT* valid. ## Not run: predIntLnormSimultaneous(x = Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "two-sided", conf.level = conf.level) ## End(Not run)
Estimate the mean and standard deviation of a normal distribution, and construct a prediction interval for the next k observations or next k means.
predIntNorm(x, n.mean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95)
x |
a numeric vector of observations, or an object resulting from a call to an estimating
function that assumes a normal (Gaussian) distribution (e.g., |
n.mean |
positive integer specifying the sample size associated with the |
k |
positive integer specifying the number of future observations or averages the
prediction interval should contain with confidence level |
method |
character string specifying the method to use if the number of future observations
( |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
What is a Prediction Interval?

A prediction interval for some population is an interval on the real line constructed so that it will contain k future observations or averages from that population with some specified probability (1-alpha)100%, where 0 < alpha < 1 and k is some pre-specified positive integer. The quantity (1-alpha)100% is called the confidence coefficient or confidence level associated with the prediction interval.
The Form of a Prediction Interval
Let x = c(x_1, x_2, ..., x_n) denote a vector of n observations from a normal
distribution with parameters mean=μ and sd=σ. Also, let m denote the sample size
associated with the k future averages (i.e., n.mean=m). When m=1, each average is
really just a single observation, so in the rest of this help file the term
“averages” will replace the phrase “observations or averages”.

For a normal distribution, the form of a two-sided (1-α)100% prediction interval is:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k). The symbol K is used here to be consistent with the
notation used for tolerance intervals (see tolIntNorm).

Similarly, the form of a one-sided lower prediction interval is:

    [x̄ - Ks, ∞)

and the form of a one-sided upper prediction interval is:

    (-∞, x̄ + Ks]

The value of K is the same for the lower and upper one-sided intervals,
but differs for one-sided versus two-sided prediction intervals.
The derivation of the constant K is explained in the help file for predIntNormK.
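To make the relationship between the interval and the constant K concrete, here is a
small sketch that computes the two-sided limits by hand from the sample mean, the
sample standard deviation, and the value returned by predIntNormK, and compares them
with the limits returned by predIntNorm. The simulated data, seed, and sample size
are arbitrary illustration values.

# Illustration only: the two-sided limits equal xbar +/- K * s.
set.seed(23)
dat <- rnorm(15, mean = 10, sd = 2)
xbar <- mean(dat)
s <- sd(dat)
K <- predIntNormK(n = length(dat), k = 1, pi.type = "two-sided", conf.level = 0.95)
c(LPL = xbar - K * s, UPL = xbar + K * s)
# Should agree with:
predIntNorm(dat, k = 1, pi.type = "two-sided", conf.level = 0.95)$interval$limits
rm(dat, xbar, s, K)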
A Prediction Interval is a Random Interval
A prediction interval is a random interval; that is, the lower and/or upper bounds
are random variables computed based on sample statistics in the baseline sample.
Prior to taking one specific baseline sample, the probability that the prediction
interval will contain the next k averages is (1-α)100%. Once a specific baseline
sample is taken and the prediction interval based on that sample is computed, the
probability that that prediction interval will contain the next k averages is not
necessarily (1-α)100%, but it should be close.
If an experiment is repeated N times, and for each experiment:

1. A sample is taken and a (1-α)100% prediction interval for k=1 future observation
is computed, and

2. One future observation is generated and compared to the prediction interval,

then the number of prediction intervals that actually contain the future observation
generated in step 2 above is a binomial random variable with parameters size=N and
prob=(1-α).

If, on the other hand, only one baseline sample is taken and only one (1-α)100%
prediction interval for k=1 future observation is computed, then the number of
future observations out of a total of N future observations that will be
contained in that one prediction interval is a binomial random variable with
parameters size=N and prob=(1-α*), where α* depends on the true population
parameters and the computed bounds of the prediction interval.
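The binomial behavior described above can be checked by simulation. The following
sketch repeatedly draws a baseline sample, computes a two-sided 95% prediction
interval for one future observation, and records whether a newly generated future
observation is contained in it. The true mean and standard deviation, the sample
size, and the number of repetitions are arbitrary illustration values.

# Illustration only: count of covered future observations is binomial
# with size = N and prob close to 0.95.
set.seed(42)
N <- 1000
contained <- replicate(N, {
  dat <- rnorm(20, mean = 10, sd = 2)           # baseline sample
  lims <- predIntNorm(dat)$interval$limits      # two-sided 95% PI for k = 1
  x.future <- rnorm(1, mean = 10, sd = 2)       # one future observation
  lims["LPL"] <= x.future && x.future <= lims["UPL"]
})
sum(contained)
mean(contained)   # should be close to 0.95
rm(N, contained)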
If x is a numeric vector, predIntNorm returns a list of class "estimate" containing
the estimated parameters, the prediction interval, and other information. See the
help file for estimate.object for details.

If x is the result of calling an estimation function, predIntNorm returns a list
whose class is the same as x. The list contains the same components as x, as well
as a component called interval containing the prediction interval information. If
x already has a component called interval, this component is replaced with the
prediction interval information.
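For example, the components of the returned object can be extracted by name; the
simulated data and arguments below are arbitrary illustration values.

# Illustration only: extracting pieces of the returned "estimate" object.
set.seed(1)
dat <- rnorm(20, mean = 10, sd = 2)
pi.obj <- predIntNorm(dat, k = 2, pi.type = "upper", conf.level = 0.99)
names(pi.obj)              # components of the "estimate" object
pi.obj$interval$limits     # just the prediction limits (LPL and UPL)
rm(dat, pi.obj)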
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Dunnett, C.W. (1955). A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50, 1096-1121.
Dunnett, C.W. (1964). New Tables for Multiple Comparisons with a Control. Biometrics 20, 482-491.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York.
Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. (available on-line at: https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf).
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNormK
, predIntNormSimultaneous
,
predIntLnorm
, tolIntNorm
,
Normal, estimate.object
, enorm
,
eqnorm
.
# Generate 20 observations from a normal distribution with parameters # mean=10 and sd=2, then create a two-sided 95% prediction interval for # the next observation. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(47) dat <- rnorm(20, mean = 10, sd = 2) predIntNorm(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 9.792856 # sd = 1.821286 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: two-sided # #Confidence Level: 95% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 5.886723 # UPL = 13.698988 #---------- # Using the same data from the last example, create a one-sided # upper 99% prediction limit for the next 3 averages of order 2 # (i.e., each of the 3 future averages is based on a sample size # of 2 future observations). predIntNorm(dat, n.mean = 2, k = 3, conf.level = 0.99, pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 9.792856 # sd = 1.821286 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Bonferroni # #Prediction Interval Type: upper # #Confidence Level: 99% # #Number of Future Averages: 3 # #Sample Size for Averages: 2 # #Prediction Interval: LPL = -Inf # UPL = 13.90537 #---------- # Compare the result above that is based on the Bonferroni method # with the exact method predIntNorm(dat, n.mean = 2, k = 3, conf.level = 0.99, pi.type = "upper", method = "exact")$interval$limits["UPL"] # UPL #13.89272 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 18-1 of USEPA (2009, p.18-9) shows how to construct a 95% # prediction interval for 4 future observations assuming a # normal distribution based on arsenic concentrations (ppb) in # groundwater at a solid waste landfill. There were 4 years of # quarterly monitoring, and years 1-3 are considered background. # The question to be answered is whether there is evidence of # contamination in year 4. # The data for this example is stored in EPA.09.Ex.18.1.arsenic.df. EPA.09.Ex.18.1.arsenic.df # Year Sampling.Period Arsenic.ppb #1 1 Background 12.6 #2 1 Background 30.8 #3 1 Background 52.0 #4 1 Background 28.1 #5 2 Background 33.3 #6 2 Background 44.0 #7 2 Background 3.0 #8 2 Background 12.8 #9 3 Background 58.1 #10 3 Background 12.6 #11 3 Background 17.6 #12 3 Background 25.3 #13 4 Compliance 48.0 #14 4 Compliance 30.3 #15 4 Compliance 42.5 #16 4 Compliance 15.0 As.bkgd <- with(EPA.09.Ex.18.1.arsenic.df, Arsenic.ppb[Sampling.Period == "Background"]) As.cmpl <- with(EPA.09.Ex.18.1.arsenic.df, Arsenic.ppb[Sampling.Period == "Compliance"]) # A Shapiro-Wilks goodness-of-fit test for normality indicates # there is no evidence to reject the assumption of normality # for the background data: gofTest(As.bkgd) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Normal # #Estimated Parameter(s): mean = 27.51667 # sd = 17.10119 # #Estimation Method: mvue # #Data: As.bkgd # #Sample Size: 12 # #Test Statistic: W = 0.94695 # #Test Statistic Parameter: n = 12 # #P-value: 0.5929102 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. 
# Here is the one-sided 95% upper prediction limit: UPL <- predIntNorm(As.bkgd, k = 4, pi.type = "upper")$interval$limits["UPL"] UPL # UPL #73.67237 # Are any of the compliance observations above the prediction limit? any(As.cmpl > UPL) #[1] FALSE #========== # Cleanup #-------- rm(As.bkgd, As.cmpl, UPL)
Half-Width of a Prediction Interval for the Next k Observations from a Normal Distribution
Compute the half-width of a prediction interval for the next k observations
from a normal distribution.
predIntNormHalfWidth(n, df = n - 1, n.mean = 1, k = 1, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95)
predIntNormHalfWidth(n, df = n - 1, n.mean = 1, k = 1, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95)
n |
numeric vector of positive integers greater than 1 indicating the sample size upon
which the prediction interval is based.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. |
df |
numeric vector of positive integers indicating the degrees of freedom associated
with the prediction interval. The default is df=n-1. |
n.mean |
numeric vector of positive integers specifying the sample size associated with
the k future averages. The default value is n.mean=1 (i.e., individual observations). |
k |
numeric vector of positive integers specifying the number of future observations
or averages the prediction interval should contain with confidence level
conf.level. The default value is k=1. |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s).
The default value is sigma.hat=1. |
method |
character string specifying the method to use if the number of future observations
(k) is greater than 1. The possible values are method="Bonferroni" (the default)
and method="exact". |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is conf.level=0.95. |
If the arguments n, k, n.mean, sigma.hat, and conf.level are not all the same
length, they are replicated to be the same length as the length of the longest
argument.
The help files for predIntNorm and predIntNormK
give formulas for a two-sided (1-α)100% prediction interval based on the sample size,
the observed sample mean and sample standard deviation, and specified confidence level.
Specifically, the two-sided prediction interval is given by:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m (see the help file for
predIntNormK). Thus, the half-width of the prediction interval is
given by:

    HW = Ks
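That is, the half-width returned by predIntNormHalfWidth is simply K times the value
supplied for sigma.hat. A short sketch, using arbitrary illustration values for n, k,
the confidence level, and sigma.hat:

# Illustration only: half-width = K * sigma.hat.
n <- 12
sigma.hat <- 17.1
K <- predIntNormK(n = n, k = 4, pi.type = "two-sided", conf.level = 0.9)
K * sigma.hat
# Should agree with:
predIntNormHalfWidth(n = n, k = 4, sigma.hat = sigma.hat, conf.level = 0.9)
rm(n, sigma.hat, K)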
numeric vector of half-widths.
See the help file for predIntNorm
.
Steven P. Millard ([email protected])
See the help file for predIntNorm
.
predIntNorm
, predIntNormK
,
predIntNormN
, plotPredIntNormDesign
.
# Look at how the half-width of a prediction interval increases with # increasing number of future observations: 1:5 #[1] 1 2 3 4 5 hw <- predIntNormHalfWidth(n = 10, k = 1:5) round(hw, 2) #[1] 2.37 2.82 3.08 3.26 3.41 #---------- # Look at how the half-width of a prediction interval decreases with # increasing sample size: 2:5 #[1] 2 3 4 5 hw <- predIntNormHalfWidth(n = 2:5) round(hw, 2) #[1] 15.56 4.97 3.56 3.04 #---------- # Look at how the half-width of a prediction interval increases with # increasing estimated standard deviation for a fixed sample size: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 hw <- predIntNormHalfWidth(n = 10, sigma.hat = seq(0.5, 2, by = 0.5)) round(hw, 2) #[1] 1.19 2.37 3.56 4.75 #---------- # Look at how the half-width of a prediction interval increases with # increasing confidence level for a fixed sample size: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 hw <- predIntNormHalfWidth(n = 5, conf = seq(0.5, 0.9, by = 0.1)) round(hw, 2) #[1] 0.81 1.03 1.30 1.68 2.34 #========== # The data frame EPA.92c.arsenic3.df contains arsenic concentrations (ppb) # collected quarterly for 3 years at a background well and quarterly for # 2 years at a compliance well. Using the data from the background well, compute # the half-width associated with sample sizes of 12 (3 years of quarterly data), # 16 (4 years of quarterly data), and 20 (5 years of quarterly data) for a # two-sided 90% prediction interval for k=4 future observations. EPA.92c.arsenic3.df # Arsenic Year Well.type #1 12.6 1 Background #2 30.8 1 Background #3 52.0 1 Background #... #18 3.8 5 Compliance #19 2.6 5 Compliance #20 51.9 5 Compliance mu.hat <- with(EPA.92c.arsenic3.df, mean(Arsenic[Well.type=="Background"])) mu.hat #[1] 27.51667 sigma.hat <- with(EPA.92c.arsenic3.df, sd(Arsenic[Well.type=="Background"])) sigma.hat #[1] 17.10119 hw <- predIntNormHalfWidth(n = c(12, 16, 20), k = 4, sigma.hat = sigma.hat, conf.level = 0.9) round(hw, 2) #[1] 46.16 43.89 42.64 #========== # Clean up #--------- rm(hw, mu.hat, sigma.hat)
Compute the Value of K for a Prediction Interval for a Normal Distribution
Compute the value of K (the multiplier of the estimated standard deviation) used
to construct a prediction interval for the next k observations or next set of k
means based on data from a normal distribution. The function
predIntNormK
is called by predIntNorm.
predIntNormK(n, df = n - 1, n.mean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95)
predIntNormK(n, df = n - 1, n.mean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95)
n |
a positive integer greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
the degrees of freedom associated with the prediction interval. The default is
df=n-1. |
n.mean |
positive integer specifying the sample size associated with the k future averages.
The default value is n.mean=1 (i.e., individual observations). |
k |
positive integer specifying the number of future observations or averages the
prediction interval should contain with confidence level conf.level.
The default value is k=1. |
method |
character string specifying the method to use if the number of future observations
(k) is greater than 1. The possible values are method="Bonferroni" (the default)
and method="exact". This argument is ignored if k=1. |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are pi.type="two-sided" (the default), pi.type="lower",
and pi.type="upper". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95. |
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations or averages from that population
with some specified probability (1-α)100%, where 0 < α < 1 and k is some
pre-specified positive integer. The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval.
Let x = c(x_1, x_2, ..., x_n) denote a vector of n observations from a normal
distribution with parameters mean=μ and sd=σ. Also, let m denote the sample size
associated with the k future averages (i.e., n.mean=m). When m=1, each average is
really just a single observation, so in the rest of this help file the term
“averages” will replace the phrase “observations or averages”.

For a normal distribution, the form of a two-sided (1-α)100% prediction interval is:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k). The symbol K is used here to be consistent with the
notation used for tolerance intervals (see tolIntNorm).

Similarly, the form of a one-sided lower prediction interval is:

    [x̄ - Ks, ∞)

and the form of a one-sided upper prediction interval is:

    (-∞, x̄ + Ks]

The value of K is the same for the lower and upper one-sided intervals,
but differs for one-sided versus two-sided prediction intervals.
The derivation of the constant K is explained below. The function
predIntNormK computes the value of K and is called by predIntNorm.
The Derivation of K for One Future Observation or Average (k = 1)
Let X denote a random variable from a normal distribution
with parameters mean=μ and sd=σ, and let x_p denote the p'th quantile of X.

A true two-sided (1-α)100% prediction interval for the next k=1 observation of X
is given by:

    [x_{α/2}, x_{1-α/2}] = [μ - z_{1-α/2} σ, μ + z_{1-α/2} σ]

where z_p denotes the p'th quantile of a standard normal distribution.

More generally, a true two-sided (1-α)100% prediction interval for the
next k=1 average based on a sample of size m is given by:

    [μ - z_{1-α/2} σ/√m, μ + z_{1-α/2} σ/√m]

Because the values of μ and σ are unknown, they must be
estimated, and a prediction interval then constructed based on the estimated
values of μ and σ.

For a two-sided prediction interval (pi.type="two-sided"),
the constant K for a (1-α)100% prediction interval for the next k=1
average based on a sample size of m is computed as:

    K = t(n-1, 1-α/2) √(1/m + 1/n)

where t(ν, p) denotes the p'th quantile of the
Student's t-distribution with ν degrees of freedom. For a one-sided prediction
interval (pi.type="lower" or pi.type="upper"), the constant K is computed as:

    K = t(n-1, 1-α) √(1/m + 1/n)

The formulas for these prediction intervals are derived as follows. Let ȳ
denote the future average based on m observations. Then
the quantity ȳ - x̄ has a normal distribution with expectation 0
and variance given by:

    Var(ȳ - x̄) = σ²/m + σ²/n = σ² (1/m + 1/n)

so the quantity

    t = (ȳ - x̄) / [s √(1/m + 1/n)]

has a Student's t-distribution with n-1 degrees of freedom.
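For the case k=1, the closed-form expression for K can be checked directly against
predIntNormK; the values of n, m, and the confidence level below are arbitrary
illustration values.

# Illustration only: K = t(n-1, 1-alpha/2) * sqrt(1/m + 1/n) for k = 1.
n <- 20
m <- 2
conf.level <- 0.95
alpha <- 1 - conf.level
qt(1 - alpha/2, df = n - 1) * sqrt(1/m + 1/n)
# Should agree with:
predIntNormK(n = n, n.mean = m, k = 1, pi.type = "two-sided", conf.level = conf.level)
rm(n, m, conf.level, alpha)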
The Derivation of K for More than One Future Observation or Average (k > 1)
When k > 1, the function predIntNormK allows for two ways to compute K:
an exact method due to Dunnett (1955) (method="exact"), and
an approximate (conservative) method based on the Bonferroni inequality
(method="Bonferroni"; see Miller, 1981a, pp.8, 67-70;
Gibbons et al., 2009, p.4). Each of these methods is explained below.
Exact Method Due to Dunnett (1955) (method="exact")
Dunnett (1955) derived the value of K in the context of the multiple
comparisons problem of comparing several treatment means to one control mean.
In this case K is the solution of an equation that depends on the sample size n,
the number of future observations (averages) k, the sample size associated with
the future averages m, and the confidence level (1-α)100%.

When pi.type="lower" or pi.type="upper", K is the value that satisfies an
integral equation involving the cumulative distribution function and probability
density function of the standard normal distribution and the probability density
function of a chi random variable with n-1 degrees of freedom
(Gupta and Sobel, 1957; Hahn, 1970a). When pi.type="two-sided", K is the value
that satisfies the analogous two-sided equation.
Approximate Method Based on the Bonferroni Inequality (method="Bonferroni")
As shown above, when k=1, the value of K is given by the formulas above for
two-sided or one-sided prediction intervals, respectively. When k > 1,
a conservative way to construct a (1-α)100% prediction
interval for the next k observations or averages is to use a Bonferroni
correction (Miller, 1981a, p.8) and replace α with α/k in those formulas
(Chew, 1968). This value of K will be conservative in that the computed
prediction intervals will be wider than the exact prediction intervals.

Hahn (1969, 1970a) compared the exact values of K with those based on the
Bonferroni inequality for the case of m=1 and found the approximation to be
quite satisfactory except when n is small, k is large, and α is large.
For example, Gibbons (1987a) notes that for a 99% prediction interval
(i.e., α=0.01) for the next k observations, the bias of K is never greater
than 1%, no matter what the value of k, unless the sample size n is very small.
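A short sketch of the Bonferroni approximation for a one-sided upper limit: replace
α with α/k in the one-sided formula for K given above. The values of n, m, k, and
the confidence level are arbitrary illustration values.

# Illustration only: one-sided Bonferroni K for k > 1 future averages.
n <- 20
m <- 2
k <- 3
conf.level <- 0.99
alpha <- 1 - conf.level
qt(1 - alpha/k, df = n - 1) * sqrt(1/m + 1/n)
# Should agree with the Bonferroni method (the default):
predIntNormK(n = n, n.mean = m, k = k, pi.type = "upper", conf.level = conf.level)
# and be slightly larger (more conservative) than the exact method:
predIntNormK(n = n, n.mean = m, k = k, method = "exact", pi.type = "upper",
  conf.level = conf.level)
rm(n, m, k, conf.level, alpha)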
A numeric scalar equal to K, the multiplier of the estimated standard
deviation that is used to construct the prediction interval.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Dunnett, C.W. (1955). A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50, 1096-1121.
Dunnett, C.W. (1964). New Tables for Multiple Comparisons with a Control. Biometrics 20, 482-491.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York.
Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. (available on-line at: https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf).
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNorm
, predIntNormSimultaneous
,
predIntLnorm
, tolIntNorm
,
Normal, estimate.object
, enorm
, eqnorm
.
# Compute the value of K for a two-sided 95% prediction interval # for the next observation given a sample size of n=20. predIntNormK(n = 20) #[1] 2.144711 #-------------------------------------------------------------------- # Compute the value of K for a one-sided upper 99% prediction limit # for the next 3 averages of order 2 (i.e., each of the 3 future # averages is based on a sample size of 2 future observations) given a # sample size of n=20. predIntNormK(n = 20, n.mean = 2, k = 3, pi.type = "upper", conf.level = 0.99) #[1] 2.258026 #---------- # Compare the result above that is based on the Bonferroni method # with the exact method. predIntNormK(n = 20, n.mean = 2, k = 3, method = "exact", pi.type = "upper", conf.level = 0.99) #[1] 2.251084 #-------------------------------------------------------------------- # Example 18-1 of USEPA (2009, p.18-9) shows how to construct a 95% # prediction interval for 4 future observations assuming a # normal distribution based on arsenic concentrations (ppb) in # groundwater at a solid waste landfill. There were 4 years of # quarterly monitoring, and years 1-3 are considered background, # so the sample size for the prediction limit is n = 12, # and the number of future samples is k = 4. predIntNormK(n = 12, k = 4, pi.type = "upper") #[1] 2.698976
Sample Size for a Specified Half-Width of a Prediction Interval for the Next k Observations from a Normal Distribution
Compute the sample size necessary to achieve a specified half-width of a
prediction interval for the next k observations from a normal distribution.
predIntNormN(half.width, n.mean = 1, k = 1, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
predIntNormN(half.width, n.mean = 1, k = 1, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
half.width |
numeric vector of (positive) half-widths.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. |
n.mean |
numeric vector of positive integers specifying the sample size associated with
the k future averages. The default value is n.mean=1 (i.e., individual observations). |
k |
numeric vector of positive integers specifying the number of future observations
or averages the prediction interval should contain with confidence level
conf.level. The default value is k=1. |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s).
The default value is sigma.hat=1. |
method |
character string specifying the method to use if the number of future observations
(k) is greater than 1. The possible values are method="Bonferroni" (the default)
and method="exact". |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is conf.level=0.95. |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next largest integer. The default value is round.up=TRUE. |
n.max |
positive integer greater than 1 indicating the maximum possible sample size. The
default value is n.max=5000. |
tol |
numeric scalar indicating the tolerance to use in the uniroot search algorithm.
The default value is tol=1e-7. |
maxiter |
positive integer indicating the maximum number of iterations to use in the
uniroot search algorithm. The default value is maxiter=1000. |
If the arguments half.width, k, n.mean, sigma.hat, and conf.level are not all the
same length, they are replicated to be the same length as the length of the longest
argument.
The help files for predIntNorm and predIntNormK
give formulas for a two-sided (1-α)100% prediction interval based on the sample size,
the observed sample mean and sample standard deviation, and specified confidence level.
Specifically, the two-sided prediction interval is given by:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m (see the help file for
predIntNormK). Thus, the half-width of the prediction interval is
given by:

    HW = Ks
The function predIntNormN
uses the uniroot
search algorithm to
determine the sample size for specified values of the half-width, number of
observations used to create a single future average, number of future observations or
averages, the sample standard deviation, and the confidence level. Note that
unlike a confidence interval, the half-width of a prediction interval does not
approach 0 as the sample size increases.
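The following sketch illustrates this last point: as the sample size grows, the
half-width levels off near the value based on an infinite sample size, so very small
half-widths cannot be achieved simply by increasing n. The sample sizes and
confidence level are arbitrary illustration values.

# Illustration only: half-widths level off as n increases.
n <- c(5, 10, 50, 100, 1000)
round(predIntNormHalfWidth(n = n, k = 1, sigma.hat = 1, conf.level = 0.95), 3)
# The limiting value for an infinitely large baseline sample:
round(qnorm(0.975), 3)
rm(n)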
numeric vector of sample sizes.
See the help file for predIntNorm
.
Steven P. Millard ([email protected])
See the help file for predIntNorm
.
predIntNorm
, predIntNormK
,
predIntNormHalfWidth
, plotPredIntNormDesign
.
# Look at how the required sample size for a prediction interval increases # with increasing number of future observations: 1:5 #[1] 1 2 3 4 5 predIntNormN(half.width = 3, k = 1:5) #[1] 6 9 11 14 18 #---------- # Look at how the required sample size for a prediction interval decreases # with increasing half-width: 2:5 #[1] 2 3 4 5 predIntNormN(half.width = 2:5) #[1] 86 6 4 3 predIntNormN(2:5, round = FALSE) #[1] 85.567387 5.122911 3.542393 2.987861 #---------- # Look at how the required sample size for a prediction interval increases # with increasing estimated standard deviation for a fixed half-width: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 predIntNormN(half.width = 4, sigma.hat = seq(0.5, 2, by = 0.5)) #[1] 3 4 7 86 #---------- # Look at how the required sample size for a prediction interval increases # with increasing confidence level for a fixed half-width: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 predIntNormN(half.width = 2, conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 2 2 3 4 9 #========== # The data frame EPA.92c.arsenic3.df contains arsenic concentrations (ppb) # collected quarterly for 3 years at a background well and quarterly for # 2 years at a compliance well. Using the data from the background well, # compute the required sample size in order to achieve a half-width of # 2.25, 2.5, or 3 times the estimated standard deviation for a two-sided # 90% prediction interval for k=4 future observations. # # For a half-width of 2.25 standard deviations, the required sample size is 526, # or about 131 years of quarterly observations! For a half-width of 2.5 # standard deviations, the required sample size is 20, or about 5 years of # quarterly observations. For a half-width of 3 standard deviations, the required # sample size is 9, or about 2 years of quarterly observations. EPA.92c.arsenic3.df # Arsenic Year Well.type #1 12.6 1 Background #2 30.8 1 Background #3 52.0 1 Background #... #18 3.8 5 Compliance #19 2.6 5 Compliance #20 51.9 5 Compliance mu.hat <- with(EPA.92c.arsenic3.df, mean(Arsenic[Well.type=="Background"])) mu.hat #[1] 27.51667 sigma.hat <- with(EPA.92c.arsenic3.df, sd(Arsenic[Well.type=="Background"])) sigma.hat #[1] 17.10119 predIntNormN(half.width=c(2.25, 2.5, 3) * sigma.hat, k = 4, sigma.hat = sigma.hat, conf.level = 0.9) #[1] 526 20 9 #========== # Clean up #--------- rm(mu.hat, sigma.hat)
Estimate the mean and standard deviation of a
normal distribution, and construct a simultaneous prediction
interval for the next r sampling “occasions”, based on one of three
possible rules: k-of-m, California, or Modified California.
predIntNormSimultaneous(x, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5)
predIntNormSimultaneous(x, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5)
x |
a numeric vector of observations, or an object resulting from a call to an estimating
function that assumes a normal (Gaussian) distribution (e.g., enorm). |
n.mean |
positive integer specifying the sample size associated with the future averages.
The default value is n.mean=1 (i.e., individual observations). |
k |
for the k-of-m rule (rule="k.of.m"), a positive integer specifying the minimum
number of observations (or averages) out of m observations (or averages), all
obtained on one future sampling “occasion”, that the prediction interval should
contain with confidence level conf.level. The default value is k=1. This argument
is ignored when rule is not equal to "k.of.m". |
m |
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is m=2. |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is r=1. |
rule |
character string specifying which rule to use. The possible values are
"k.of.m" (k-of-m rule; the default), "CA" (California rule), and
"Modified.CA" (Modified California rule). |
delta.over.sigma |
numeric scalar indicating the ratio Δ/σ, where Δ denotes the difference between
the mean of the population that was sampled to construct the prediction interval
and the mean of the population that will be sampled to produce the future
observations, and σ denotes the standard deviation of both populations.
The default value is delta.over.sigma=0. |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are pi.type="upper" (the default) and pi.type="lower". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95. |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute K. The default value is K.tol=.Machine$double.eps^0.5. |
What is a Simultaneous Prediction Interval?
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations from that population
with some specified probability (1-α)100%, where 0 < α < 1 and k is some
pre-specified positive integer. The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function predIntNorm computes a standard prediction
interval based on a sample from a normal distribution.

The function predIntNormSimultaneous computes a simultaneous prediction
interval that will contain a certain number of future observations with probability
(1-α)100% for each of r future sampling “occasions”,
where r is some pre-specified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNormSimultaneous computes a simultaneous prediction
interval based on one of three possible rules:

For the k-of-m rule (rule="k.of.m"), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100% on each of the r future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, the results of
predIntNormSimultaneous are equivalent to the results of predIntNorm.

For the California rule (rule="CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.

For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.

Simultaneous prediction intervals can be extended to using averages (means) in place
of single observations (USEPA, 2009, Chapter 19). That is, you can create a
simultaneous prediction interval
that will contain a specified number of averages (based on which rule you choose) on
each of r future sampling occasions, where each average is based on
n.mean individual observations. For the function predIntNormSimultaneous,
this number is specified by the argument n.mean.
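The stated property of the 1-of-3 rule can be checked by simulation: over repeated
baseline samples, the proportion of future sampling occasions on which at least 1 of
the 3 future observations falls below the upper prediction limit should be close to
the confidence level. The true distribution, sample sizes, and number of simulations
below are arbitrary illustration values, and the simulation may take a little while
to run.

# Illustration only: simulated coverage of the 1-of-3 rule with r = 1.
set.seed(123)
n.sim <- 500
ok <- replicate(n.sim, {
  dat <- rnorm(8, mean = 10, sd = 2)       # baseline sample
  upl <- predIntNormSimultaneous(dat, k = 1, m = 3, r = 1, rule = "k.of.m",
    pi.type = "upper", conf.level = 0.95)$interval$limits["UPL"]
  future <- rnorm(3, mean = 10, sd = 2)    # one future sampling occasion
  sum(future <= upl) >= 1                  # at least 1 of 3 within the limit?
})
mean(ok)   # should be close to 0.95
rm(n.sim, ok)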
The Form of a Prediction Interval for 1 Future Observation
Let x = c(x_1, x_2, ..., x_n) denote a vector of n observations from a normal
distribution with parameters mean=μ and sd=σ. Also, let n.mean denote the
sample size associated with the future averages. When n.mean=1, each average is
really just a single observation, so in the rest of this help file the term
“averages” will replace the phrase “observations or averages”.

For a normal distribution, the form of a two-sided (1-α)100%
prediction interval is:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future sampling occasions r, and the
sample size associated with the future averages (n.mean). Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k) in the k-of-m rule. The symbol K is used here
to be consistent with the notation used for tolerance intervals
(see tolIntNorm).

Similarly, the form of a one-sided lower prediction interval is:

    [x̄ - Ks, ∞)

and the form of a one-sided upper prediction interval is:

    (-∞, x̄ + Ks]

The derivation of the constant K is explained in the help file for
predIntNormK.
The Form of a Simultaneous Prediction Interval
For simultaneous prediction intervals, only lower
(pi.type="lower") and upper (pi.type="upper") prediction
intervals are available. Two-sided simultaneous prediction intervals were
available in Versions 2.4.0 - 2.8.1 of EnvStats, but these prediction
intervals were based on an incorrect algorithm for K.
The one-sided forms given above hold for simultaneous prediction intervals, but the
derivation of the constant K is more difficult, and is explained in the help file for
predIntNormSimultaneousK.
Prediction Intervals are Random Intervals
A prediction interval is a random interval; that is, the lower and/or
upper bounds are random variables computed based on sample statistics in the
baseline sample. Prior to taking one specific baseline sample, the probability
that the prediction interval will perform according to the rule chosen is
(1-α)100%. Once a specific baseline sample is taken and the prediction
interval based on that sample is computed, the probability that that prediction
interval will perform according to the rule chosen is not necessarily
(1-α)100%, but it should be close. See the help file for
predIntNorm for more information.
If x is a numeric vector, predIntNormSimultaneous returns a list of class
"estimate" containing the estimated parameters, the prediction interval, and other
information. See the help file for estimate.object for details.

If x is the result of calling an estimation function, predIntNormSimultaneous
returns a list whose class is the same as x. The list contains the same components
as x, as well as a component called interval containing the prediction interval
information. If x already has a component called interval, this component is
replaced with the prediction interval information.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at
hazardous and solid waste facilities is the requirement of testing several wells and
several constituents at each well on each sampling occasion. This is an obvious
multiple comparisons problem, and the naive approach of using a standard t-test at
a conventional α-level (e.g., 0.05 or 0.01) for each test leads to a
very high probability of at least one significant result on each sampling occasion,
when in fact no contamination has occurred. This problem was pointed out years ago
by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of
controlling the facility-wide false positive rate (FWFPR) while maintaining adequate
power to detect contamination in the groundwater. Because of the ubiquitous presence
of spatial variability, it is usually best to use simultaneous prediction intervals
at each well (Davis, 1998a): prediction intervals are constructed from background
(pre-landfill) data at each well, and future observations at a well are compared to
the prediction interval for that particular well. In this case, the individual
α-level at each well is equal to the FWFPR divided by the
product of the number of wells and constituents.
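For example, with 10 constituents and 50 wells (the values used in Example 19-1
below) and a site-wide false positive rate of 10%, the per-test α-level and the
corresponding per-test confidence level can be computed as follows; the simple
division described above and the (1 - SWFPR)^(1/(number of tests)) formulation used
in the example give nearly identical results.

# Illustration only: translating a site-wide false positive rate into a
# per-test alpha level and confidence level.
SWFPR <- 0.1
nc <- 10     # number of constituents
nw <- 50     # number of wells
n.tests <- nc * nw
# Per-test alpha obtained by dividing the site-wide rate by the number of tests:
SWFPR / n.tests
# Per-test alpha and confidence level based on (1 - SWFPR)^(1/n.tests):
1 - (1 - SWFPR)^(1 / n.tests)
(1 - SWFPR)^(1 / n.tests)
rm(SWFPR, nc, nw, n.tests)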
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the
1-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule
For the k-of-m rule, Davis and McNichols (1987) give tables with
“optimal” choices of k (in terms of best power for a given overall
confidence level) for selected values of m, r, and n. They found
that the optimal ratios of k to m (i.e., k/m) are generally small,
in the range of 15-50%.
The California Rule
The California rule was mandated in that state for groundwater monitoring at waste
disposal facilities when resampling verification is part of the statistical program
(Barclay's Code of California Regulations, 1991). The California code mandates a
“California” rule with m ≥ 3. The motivation for this rule may have
been a desire to have a majority of the observations in bounds (Davis, 1998a). For
example, for a k-of-m rule with k=1 and m=3, a monitoring
location will pass if the first observation is out of bounds, the second resample
is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations
are out of bounds. For the California rule with m=3, either the first
observation must be in bounds, or the next 2 observations must be in bounds in order
for the monitoring location to pass.

Davis (1998a) states that if the FWFPR is kept constant, then the California rule
offers little increased power compared to the k-of-m rule, and can
actually decrease the power of detecting contamination.
The Modified California Rule
The Modified California Rule was proposed as a compromise between a 1-of-m
rule and the California rule. For a given FWFPR, the Modified California rule
achieves better power than the California rule, and still requires at least as many
observations in bounds as out of bounds, unlike a 1-of-m rule.
Different Notations Between Different References
For the k-of-m rule described in this help file, both
Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable
p instead of k to represent the minimum number
of future observations the interval should contain on each of the r sampling
occasions.

Gibbons et al. (2009, Chapter 1) presents extensive lists of the value of
K for both k-of-m rules and California rules. Gibbons et al.'s
notation reverses the meaning of k and r compared to the notation used
in this help file. That is, in Gibbons et al.'s notation, k represents the
number of future sampling occasions or monitoring wells, and r represents the
minimum number of observations the interval should contain on each sampling occasion.
USEPA (2009, Chapter 19) also uses notation that differs from the notation in this
help file for some of these quantities.
Steven P. Millard ([email protected])
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m
Observations from a Normal Population on Each of r Future Occasions.
Technometrics 29, 359–370.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least p
Out of m Future Observations From a Normal Population.
Technometrics 19, 167–177.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at
Least m Out of k Future Observations.
Technometrics 15, 897–914.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNormSimultaneousK
,
predIntNormSimultaneousTestPower
,
predIntNorm
, predIntLnormSimultaneous
, tolIntNorm
,
Normal, estimate.object
, enorm
# Generate 8 observations from a normal distribution with parameters # mean=10 and sd=2, then use predIntNormSimultaneous to estimate the # mean and standard deviation of the true distribution and construct an # upper 95% prediction interval to contain at least 1 out of the next # 3 observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(479) dat <- rnorm(8, mean = 10, sd = 2) predIntNormSimultaneous(dat, k = 1, m = 3) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 10.269773 # sd = 2.210246 # #Estimation Method: mvue # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 95% # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = -Inf # UPL = 11.4021 #---------- # Repeat the above example, but do it in two steps. First create a list called # est.list containing information about the estimated parameters, then create the # prediction interval. est.list <- enorm(dat) est.list #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 10.269773 # sd = 2.210246 # #Estimation Method: mvue # #Data: dat # #Sample Size: 8 predIntNormSimultaneous(est.list, k = 1, m = 3) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 10.269773 # sd = 2.210246 # #Estimation Method: mvue # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 95% # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = -Inf # UPL = 11.4021 #---------- # Compare the 95% 1-of-3 upper prediction interval to the California and # Modified California prediction intervals. Note that the upper prediction # bound for the Modified California rule is between the bound for the # 1-of-3 rule bound and the bound for the California rule. predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #11.4021 predIntNormSimultaneous(dat, m = 3, rule = "CA")$interval$limits["UPL"] # UPL #13.03717 predIntNormSimultaneous(dat, rule = "Modified.CA")$interval$limits["UPL"] # UPL #12.12201 #---------- # Show how the upper bound on an upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. Here, we'll use the # 1-of-3 rule. predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #11.4021 predIntNormSimultaneous(dat, k = 1, m = 3, r = 10)$interval$limits["UPL"] # UPL #13.28234 #---------- # Compare the upper simultaneous prediction limit for the 1-of-3 rule # based on individual observations versus based on means of order 4. predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #11.4021 predIntNormSimultaneous(dat, n.mean = 4, k = 1, m = 3)$interval$limits["UPL"] # UPL #11.26157 #========== # Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an # upper simultaneous prediction limit for the 1-of-3 rule for # r = 2 future sampling occasions. The data for this example are # stored in EPA.09.Ex.19.1.sulfate.df. 
# We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 25 to use to construct the prediction limit. # There are 50 compliance wells and we will monitor 10 different # constituents at each well at each of the r=2 future sampling # occasions. To determine the confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the individual Type I Error level at each well to # 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the confidence level is given by: nc <- 10 nw <- 50 SWFPR <- 0.1 conf.level <- (1 - SWFPR)^(1 / (nc * nw)) conf.level #[1] 0.9997893 #---------- # Look at the data: names(EPA.09.Ex.19.1.sulfate.df) #[1] "Well" "Month" "Day" #[4] "Year" "Date" "Sulfate.mg.per.l" #[7] "log.Sulfate.mg.per.l" EPA.09.Ex.19.1.sulfate.df[, c("Well", "Date", "Sulfate.mg.per.l", "log.Sulfate.mg.per.l")] # Well Date Sulfate.mg.per.l log.Sulfate.mg.per.l #1 GW-01 1999-07-08 63.0 4.143135 #2 GW-01 1999-09-12 51.0 3.931826 #3 GW-01 1999-10-16 60.0 4.094345 #4 GW-01 1999-11-02 86.0 4.454347 #5 GW-04 1999-07-09 104.0 4.644391 #6 GW-04 1999-09-14 102.0 4.624973 #7 GW-04 1999-10-12 84.0 4.430817 #8 GW-04 1999-11-15 72.0 4.276666 #9 GW-08 1997-10-12 31.0 3.433987 #10 GW-08 1997-11-16 84.0 4.430817 #11 GW-08 1998-01-28 65.0 4.174387 #12 GW-08 1999-04-20 41.0 3.713572 #13 GW-08 2002-06-04 51.8 3.947390 #14 GW-08 2002-09-16 57.5 4.051785 #15 GW-08 2002-12-02 66.8 4.201703 #16 GW-08 2003-03-24 87.1 4.467057 #17 GW-09 1997-10-16 59.0 4.077537 #18 GW-09 1998-01-28 85.0 4.442651 #19 GW-09 1998-04-12 75.0 4.317488 #20 GW-09 1998-07-12 99.0 4.595120 #21 GW-09 2000-01-30 75.8 4.328098 #22 GW-09 2000-04-24 82.5 4.412798 #23 GW-09 2000-10-24 85.5 4.448516 #24 GW-09 2002-12-01 188.0 5.236442 #25 GW-09 2003-03-24 150.0 5.010635 # Construct the upper simultaneous prediction limit for the # 1-of-3 plan based on the log-transformed sulfate data log.Sulfate <- EPA.09.Ex.19.1.sulfate.df$log.Sulfate.mg.per.l pred.int.list.log <- predIntNormSimultaneous(x = log.Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) pred.int.list.log #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 4.3156194 # sd = 0.3756697 # #Estimation Method: mvue # #Data: log.Sulfate # #Sample Size: 25 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99.97893% # #Minimum Number of #Future Observations #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Observations #(per Sampling Occasion): 3 # #Number of Future #Sampling Occasions: 2 # #Prediction Interval: LPL = -Inf # UPL = 5.072355 # Now exponentiate the prediction interval to get the limit on # the original scale exp(pred.int.list.log$interval$limits["UPL"]) # UPL #159.5497 #========== ## Not run: # Try to compute a two-sided simultaneous prediction interval: predIntNormSimultaneous(x = log.Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "two-sided", conf.level = conf.level) #Error in predIntNormSimultaneous(x = log.Sulfate, k = 1, m = 3, r = 2, : # Two-sided simultaneous prediction intervals are not currently available. 
# NOTE: Two-sided simultaneous prediction intervals computed using # Versions 2.4.0 - 2.8.1 of EnvStats are *NOT* valid. ## End(Not run) #========== # Cleanup #-------- rm(dat, est.list, nc, nw, SWFPR, conf.level, log.Sulfate, pred.int.list.log)
Compute the Value of K for a Simultaneous Prediction Interval for a Normal Distribution
Compute the value of K (the multiplier of the estimated standard deviation) used
to construct a simultaneous prediction interval based on data from a
normal distribution.
The function
predIntNormSimultaneousK
is called by predIntNormSimultaneous
.
predIntNormSimultaneousK(n, df = n - 1, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5, integrate.args.list = NULL)
n |
a positive integer greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
the degrees of freedom associated with the prediction interval. The default is df=n-1. |
n.mean |
positive integer specifying the sample size associated with the future averages.
The default value is n.mean=1 (i.e., individual observations). |
k |
for the k-of-m rule (rule="k.of.m"), a positive integer specifying the minimum number of observations (or averages), out of m observations (or averages) on one future sampling “occasion”, that the prediction interval should contain. The default value is k=1. |
m |
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is m=2. |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is r=1. |
rule |
character string specifying which rule to use. The possible values are
"k.of.m" (the default), "CA", and "Modified.CA". |
delta.over.sigma |
numeric scalar indicating the ratio Δ/σ, where Δ denotes the difference between the mean of the population that will be sampled to produce the future observations and the mean of the population that was sampled to construct the prediction interval, and σ denotes the common population standard deviation. The default value is delta.over.sigma=0. |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are pi.type="upper" (the default) and pi.type="lower". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95. |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute K. The default value is K.tol=.Machine$double.eps^0.5. |
integrate.args.list |
a list of arguments to supply to the integrate function. The default value is integrate.args.list=NULL. |
What is a Simultaneous Prediction Interval?
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations from that population
with some specified probability (1-α)100%, where 0 < α < 1
and k is some pre-specified positive integer.
The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function
predIntNorm
computes a standard prediction
interval based on a sample from a normal distribution.
The function predIntNormSimultaneous
computes a simultaneous prediction
interval that will contain a certain number of future observations with probability
(1-α)100% for each of r future sampling “occasions”,
where r is some pre-specified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNormSimultaneous
computes a simultaneous prediction
interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100% on each of the r future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, the results of
predIntNormSimultaneous
are equivalent to the results of predIntNorm.
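As a quick numerical sketch of this note (added for illustration; the data mirror the examples below, and the method = "exact" choice for predIntNorm is an assumption made so that both intervals are computed the same way):
set.seed(479)
dat <- rnorm(8, mean = 10, sd = 2)
# 3-of-3 simultaneous rule on a single occasion ...
predIntNormSimultaneous(dat, k = 3, m = 3, r = 1, rule = "k.of.m", pi.type = "upper")$interval$limits["UPL"]
# ... should agree with the standard upper prediction interval for the next 3 observations
predIntNorm(dat, k = 3, method = "exact", pi.type = "upper")$interval$limits["UPL"]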
For the California rule (rule="CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
Simultaneous prediction intervals can be extended to using averages (means) in place
of single observations (USEPA, 2009, Chapter 19). That is, you can create a
simultaneous prediction interval
that will contain a specified number of averages (based on which rule you choose) on
each of r future sampling occasions, where each average is based on w
individual observations. For the functions
predIntNormSimultaneous
and predIntNormSimultaneousK,
the argument n.mean
corresponds to w.
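A brief illustrative sketch of the averages case (added here; it reuses the numbers reported in the examples below rather than introducing new results): with the 8 observations generated in the examples, moving from individual observations to means of 4 observations tightens the 1-of-3 upper limit from 11.4021 to 11.26157, an implied multiplier of roughly (11.26157 - 10.269773)/2.210246, or about 0.45.
# Multiplier for the 1-of-3 rule when each future value is a mean of 4 observations
predIntNormSimultaneousK(n = 8, n.mean = 4, k = 1, m = 3)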
The Form of a Prediction Interval for 1 Future Observation
Let x_1, x_2, …, x_n denote a vector of n
observations from a normal distribution with parameters
mean=μ and sd=σ. Also, let w denote the
sample size associated with the future averages (i.e.,
n.mean=w).
When w=1, each average is really just a single observation, so in the rest of
this help file the term “averages” will sometimes replace the phrase
“observations or averages”.
For a normal distribution, the form of a two-sided
simultaneous prediction interval is:
[xbar - K*s, xbar + K*s]        (1)
where xbar denotes the sample mean:
xbar = (1/n) * sum(x_i)        (2)
s denotes the sample standard deviation:
s^2 = (1/(n-1)) * sum((x_i - xbar)^2)        (3)
and K denotes a constant that depends on the sample size n, the
confidence level, the number of future sampling occasions r, and the
sample size associated with the future averages, w. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k) in the k-of-m rule. The symbol K is used here
to be consistent with the notation used for tolerance intervals
(see tolIntNorm).
Similarly, the form of a one-sided lower prediction interval is:
[xbar - K*s, Inf)        (4)
and the form of a one-sided upper prediction interval is:
(-Inf, xbar + K*s]        (5)
The derivation of the constant K for 1 future observation is
explained in the help file for predIntNormK.
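As a minimal sketch of this connection (an added illustration, not from the original examples; the expectation that the two multipliers coincide follows from the note above about the degenerate case k = m and r = 1):
# Multiplier for a standard upper prediction interval for 1 future observation
predIntNormK(n = 8, k = 1, pi.type = "upper")
# Multiplier for the degenerate simultaneous rule with k = m = 1 and r = 1,
# which should be essentially the same value
predIntNormSimultaneousK(n = 8, k = 1, m = 1, r = 1)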
The Form of a Simultaneous Prediction Interval
For simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper (pi.type="upper"
) prediction
intervals are available. Two-sided simultaneous prediction intervals were
available in Versions 2.4.0 - 2.8.1 of EnvStats, but these prediction
intervals were based on an incorrect algorithm for K.
Equations (4) and (5) above hold for simultaneous prediction intervals, but the
derivation of the constant K is more difficult, and is explained below.
The Derivation of K for Future Observations
First we will show the derivation based on future observations (i.e., w=1,
n.mean=1), and then extend the formulas to future averages.
The Derivation of K for the k-of-m Rule (rule="k.of.m")
For the k-of-m rule (rule="k.of.m") with w=1 (i.e.,
n.mean=1), at least k of the next m future
observations will fall in the prediction interval
with probability (1-α)100% on each of the r future sampling
occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, this kind of simultaneous prediction
interval becomes the same as a standard prediction interval for the next k
observations (see predIntNorm).
For the case when r=1 future sampling occasion, both Hall and Prairie (1973)
and Fertig and Mann (1977) discuss the derivation of K. Davis and McNichols
(1987) extend the derivation to the case where r is a positive integer. They
show that for a one-sided upper prediction interval (pi.type="upper"), the
probability (1-α) that at least k of the next m future observations
will be contained in the interval given in Equation (5) above, for each of r
future sampling occasions, is given by Equation (6), an integral expression
written in terms of the following quantities:
T(x; ν, δ) denotes the cdf of the
non-central Student's t-distribution with parameters df=ν and ncp=δ
evaluated at x;
Φ(x) denotes the cdf of the standard normal distribution evaluated at x;
F(x; a, b) denotes the cdf of the
beta distribution with parameters shape1=a and shape2=b; and
B(a, b) denotes the value of the
beta function with parameters a and b.
The quantity Δ (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity σ (sigma) denotes the population standard deviation of both
of these populations. Usually you assume Δ=0 unless you are interested
in computing the power of the rule to detect a change in means between the
populations (see predIntNormSimultaneousTestPower).
For given values of the confidence level ((1-α)100%), sample size (n),
minimum number of future observations to be contained in the interval per
sampling occasion (k), number of future observations per sampling occasion
(m), and number of future sampling occasions (r), Equation (6) can
be solved for K. The function
predIntNormSimultaneousK
uses the
R function nlminb
to solve Equation (6) for K.
When pi.type="lower", the same value of K is used as when
pi.type="upper", but Equation (4) is used to construct the prediction
interval.
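To make the role of K concrete, here is a small check (added for illustration; it reuses the values already reported in the examples below) showing that the upper limit returned by predIntNormSimultaneous is simply xbar + K*s:
set.seed(479)
dat <- rnorm(8, mean = 10, sd = 2)
K <- predIntNormSimultaneousK(n = 8, k = 1, m = 3)  # 0.5123091 in the examples below
mean(dat) + K * sd(dat)
# This should reproduce the UPL of about 11.4021 reported by
# predIntNormSimultaneous(dat, k = 1, m = 3) in the examples below.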
The Derivation of K for the California Rule (rule="CA")
For the California rule (rule="CA"), with probability (1-α)100%,
for each of the r future sampling occasions, either the first observation will
fall in the prediction interval, or else all of the next m-1 observations will
fall in the prediction interval. That is, if the first observation falls in the
prediction interval then sampling can stop. Otherwise, m-1 more observations
must be taken.
The formula for K is the same as for the k-of-m rule, except that
Equation (6) is replaced by an analogous expression, Equation (7), given by
Davis (1998b).
The Derivation of K for the Modified California Rule (rule="Modified.CA")
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
The formula for K is the same as for the k-of-m rule, except that
Equation (6) is replaced by an analogous expression, Equation (8), given by
Davis (1998b).
The Derivation of K for Future Means
For each of the above rules, if we are interested in using averages instead of
single observations, with w ≥ 1 (i.e., n.mean=w), the first
term in the integral in Equations (6)-(8) that involves the cdf of the
non-central Student's t-distribution is modified to account for the fact that
each future value is the mean of w observations.
A numeric scalar equal to K, the multiplier of the estimated standard
deviation that is used to construct the simultaneous prediction interval.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at
hazardous and solid waste facilities is the requirement of testing several wells and
several constituents at each well on each sampling occasion. This is an obvious
multiple comparisons problem, and the naive approach of using a standard t-test at
a conventional α-level (e.g., 0.05 or 0.01) for each test leads to a
very high probability of at least one significant result on each sampling occasion,
when in fact no contamination has occurred. This problem was pointed out years ago
by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of
controlling the facility-wide false positive rate (FWFPR) while maintaining adequate
power to detect contamination in the groundwater. Because of the ubiquitous presence
of spatial variability, it is usually best to use simultaneous prediction intervals
at each well (Davis, 1998a). That is, prediction intervals are constructed based on
background (pre-landfill) data at each well, and future observations at a
well are compared to the prediction interval for that particular well. In this case,
the individual α-level at each well is equal to the FWFPR divided by the
product of the number of wells and constituents.
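For intuition, here is a tiny numerical sketch (added for illustration; the well and constituent counts are those of Example 19-1 below) comparing the divided-up α with the exact exponent form used in the examples:
nw <- 50; nc <- 10; SWFPR <- 0.1
SWFPR / (nw * nc)                # simple division: 0.0002
1 - (1 - SWFPR)^(1 / (nw * nc))  # exact per-test alpha: about 0.00021
# Either way the per-test confidence level is essentially 0.9998, matching the
# conf.level of 0.9997893 computed in Example 19-1 below.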
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the
k-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule
For the k-of-m rule, Davis and McNichols (1987) give tables with
“optimal” choices of k (in terms of best power for a given overall
confidence level) for selected values of m, r, and n. They found
that the optimal ratios of k to m (i.e., k/m) are generally small,
in the range of 15-50%.
The California Rule
The California rule was mandated in that state for groundwater monitoring at waste
disposal facilities when resampling verification is part of the statistical program
(Barclay's Code of California Regulations, 1991). The California code mandates a
“California” rule with m ≥ 3. The motivation for this rule may have
been a desire to have a majority of the observations in bounds (Davis, 1998a). For
example, for a k-of-m rule with k=1 and m=3, a monitoring
location will pass if the first observation is out of bounds, the second resample
is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations
are out of bounds. For the California rule with m=3, either the first
observation must be in bounds, or the next 2 observations must be in bounds in order
for the monitoring location to pass.
Davis (1998a) states that if the FWFPR is kept constant, then the California rule
offers little increased power compared to the k-of-m rule, and can
actually decrease the power of detecting contamination.
The Modified California Rule
The Modified California Rule was proposed as a compromise between a 1-of-m
rule and the California rule. For a given FWFPR, the Modified California rule
achieves better power than the California rule, and still requires at least as many
observations in bounds as out of bounds, unlike a 1-of-m rule.
Different Notations Between Different References
For the k-of-m rule described in this help file, both
Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable p
instead of k to represent the minimum number
of future observations the interval should contain on each of the r sampling
occasions.
Gibbons et al. (2009, Chapter 1) presents extensive lists of the value of K
for both k-of-m rules and California rules. Gibbons et al.'s
notation reverses the meaning of k and r compared to the notation used
in this help file. That is, in Gibbons et al.'s notation, k represents the
number of future sampling occasions or monitoring wells, and r represents the
minimum number of observations the interval should contain on each sampling occasion.
USEPA (2009, Chapter 19) likewise uses its own symbol in place of r.
Steven P. Millard ([email protected])
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p
of m Observations from a Normal Population on Each of r Future Occasions.
Technometrics 29, 359–370.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least
p Out of m Future Observations From a Normal Population.
Technometrics 19, 167–177.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at
Least m Out of k Future Observations.
Technometrics 15, 897–914.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNormSimultaneous
,
predIntNormSimultaneousTestPower
,
predIntNorm
, predIntNormK
,
predIntLnormSimultaneous
, tolIntNorm
,
Normal, estimate.object
, enorm
# Compute the value of K for an upper 95% simultaneous prediction # interval to contain at least 1 out of the next 3 observations # given a background sample size of n=8. predIntNormSimultaneousK(n = 8, k = 1, m = 3) #[1] 0.5123091 #---------- # Compare the value of K for a 95% 1-of-3 upper prediction interval to # the value for the California and Modified California rules. # Note that the value of K for the Modified California rule is between # the value of K for the 1-of-3 rule and the California rule. predIntNormSimultaneousK(n = 8, k = 1, m = 3) #[1] 0.5123091 predIntNormSimultaneousK(n = 8, m = 3, rule = "CA") #[1] 1.252077 predIntNormSimultaneousK(n = 8, rule = "Modified.CA") #[1] 0.8380233 #---------- # Show how the value of K for an upper 95% simultaneous prediction # limit increases as the number of future sampling occasions r increases. # Here, we'll use the 1-of-3 rule. predIntNormSimultaneousK(n = 8, k = 1, m = 3) #[1] 0.5123091 predIntNormSimultaneousK(n = 8, k = 1, m = 3, r = 10) #[1] 1.363002 #========== # Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an # upper simultaneous prediction limit for the 1-of-3 rule for # r = 2 future sampling occasions. The data for this example are # stored in EPA.09.Ex.19.1.sulfate.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 25 to use to construct the prediction limit. # There are 50 compliance wells and we will monitor 10 different # constituents at each well at each of the r=2 future sampling # occasions. To determine the confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the individual Type I Error level at each well to # 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the confidence level is given by: nc <- 10 nw <- 50 SWFPR <- 0.1 conf.level <- (1 - SWFPR)^(1 / (nc * nw)) conf.level #[1] 0.9997893 #---------- # Compute the value of K for the upper simultaneous prediction # limit for the 1-of-3 plan. predIntNormSimultaneousK(n = 25, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) #[1] 2.014365 #========== ## Not run: # Try to compute K for a two-sided simultaneous prediction interval: predIntNormSimultaneousK(n = 25, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "two-sided", conf.level = conf.level) #Error in predIntNormSimultaneousK(n = 25, k = 1, m = 3, r = 2, rule = "k.of.m", : # Two-sided simultaneous prediction intervals are not currently available. # NOTE: Two-sided simultaneous prediction intervals computed using # Versions 2.4.0 - 2.8.1 of EnvStats are *NOT* valid. ## End(Not run) #========== # Cleanup #-------- rm(nc, nw, SWFPR, conf.level)
Compute the probability that at least one set of future observations violates the
given rule based on a simultaneous prediction interval for the next r future
sampling occasions for a normal distribution. The three possible rules are:
k-of-m, California, or Modified California.
predIntNormSimultaneousTestPower(n, df = n - 1, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, r.shifted = r, K.tol = .Machine$double.eps^0.5, integrate.args.list = NULL)
n |
vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
vector of positive integers indicating the degrees of freedom associated with
the sample size. The default value is df=n-1. |
n.mean |
positive integer specifying the sample size associated with the future averages.
The default value is n.mean=1 (i.e., individual observations). |
k |
for the k-of-m rule (rule="k.of.m"), a vector of positive integers specifying the minimum number of observations (or averages), out of m observations (or averages) on one future sampling “occasion”, that the prediction interval should contain. The default value is k=1. |
m |
vector of positive integers specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is m=2. |
r |
vector of positive integers specifying the number of future sampling “occasions”.
The default value is r=1. |
rule |
character string specifying which rule to use. The possible values are
"k.of.m" (the default), "CA", and "Modified.CA". |
delta.over.sigma |
numeric vector indicating the ratio Δ/σ, where Δ denotes the difference between the mean of the population that will be sampled to produce the future observations and the mean of the population that was sampled to construct the prediction interval, and σ denotes the common population standard deviation. The default value is delta.over.sigma=0. |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are pi.type="upper" (the default) and pi.type="lower". |
conf.level |
vector of values between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95. |
r.shifted |
vector of positive integers specifying the number of future sampling occasions for
which the scaled mean is shifted by delta.over.sigma. The default value is r.shifted=r. |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute K. The default value is K.tol=.Machine$double.eps^0.5. |
integrate.args.list |
a list of arguments to supply to the integrate function. The default value is integrate.args.list=NULL. |
What is a Simultaneous Prediction Interval?
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations from that population
with some specified probability (1-α)100%, where 0 < α < 1
and k is some pre-specified positive integer.
The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function
predIntNorm
computes a standard prediction
interval based on a sample from a normal distribution.
The function predIntNormSimultaneous
computes a simultaneous prediction
interval that will contain a certain number of future observations with probability
(1-α)100% for each of r future sampling “occasions”,
where r is some pre-specified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNormSimultaneous
computes a simultaneous prediction
interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100% on each of the r future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, the results of
predIntNormSimultaneous
are equivalent to the results of
predIntNorm.
For the California rule (rule="CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
Simultaneous prediction intervals can be extended to using averages (means) in place
of single observations (USEPA, 2009, Chapter 19). That is, you can create a
simultaneous prediction interval
that will contain a specified number of averages (based on which rule you choose) on
each of r future sampling occasions, where each average is based on w
individual observations. For the function
predIntNormSimultaneous,
the argument n.mean
corresponds to w.
The Form of a Prediction Interval
Let x_1, x_2, …, x_n denote a vector of n
observations from a normal distribution with parameters
mean=μ and sd=σ. Also, let w denote the
sample size associated with the future averages (i.e.,
n.mean=w).
When w=1, each average is really just a single observation, so in the rest of
this help file the term “averages” will replace the phrase
“observations or averages”.
For a normal distribution, the form of a two-sided
prediction interval is:
[xbar - K*s, xbar + K*s]        (1)
where xbar denotes the sample mean:
xbar = (1/n) * sum(x_i)        (2)
s denotes the sample standard deviation:
s^2 = (1/(n-1)) * sum((x_i - xbar)^2)        (3)
and K denotes a constant that depends on the sample size n, the
confidence level, the number of future sampling occasions r, and the
sample size associated with the future averages, w. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k) in the k-of-m rule. The symbol K is used here
to be consistent with the notation used for tolerance intervals
(see tolIntNorm).
Similarly, the form of a one-sided lower prediction interval is:
[xbar - K*s, Inf)        (4)
and the form of a one-sided upper prediction interval is:
(-Inf, xbar + K*s]        (5)
Note: For simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper
(pi.type="upper"
) prediction
intervals are available.
The derivation of the constant K is explained in the help file for
predIntNormSimultaneousK.
Computing Power
The "power" of the prediction interval is defined as the probability that
at least one set of future observations violates the given rule based on a
simultaneous prediction interval for the next future sampling occasions,
where the population mean for the future observations is allowed to differ from
the population mean for the observations used to construct the prediction interval.
The quantity (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity
(sigma) denotes the population standard deviation of both
of these populations. The argument
delta.over.sigma
corresponds to the
quantity .
Power Based on the k-of-m Rule (rule="k.of.m")
For the k-of-m rule (rule="k.of.m") with w=1 (i.e.,
n.mean=1), at least k of the next m future
observations will fall in the prediction interval
with probability (1-α)100% on each of the r future sampling
occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, this kind of simultaneous prediction
interval becomes the same as a standard prediction interval for the next k
observations (see predIntNorm).
Davis and McNichols (1987) show that for a one-sided upper prediction interval
(pi.type="upper"), the probability (1-α) that at least k of the
next m future observations will be contained in the interval given in
Equation (5) above, for each of r future sampling occasions, is given by
Equation (6), an integral expression written in terms of the following quantities:
T(x; ν, δ) denotes the cdf of the
non-central Student's t-distribution with parameters df=ν and ncp=δ
evaluated at x;
Φ(x) denotes the cdf of the standard normal distribution evaluated at x;
F(x; a, b) denotes the cdf of the
beta distribution with parameters shape1=a and shape2=b; and
B(a, b) denotes the value of the
beta function with parameters a and b.
The quantity Δ (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity σ (sigma) denotes the population standard deviation of both
of these populations. Usually you assume Δ=0 unless you are interested
in computing the power of the rule to detect a change in means between the
populations, as we are here.
If we are interested in using averages instead of single observations, with
w ≥ 1 (i.e., n.mean=w), the first
term in the integral in Equation (6) that involves the cdf of the
non-central Student's t-distribution is modified to account for the fact that
each future value is the mean of w observations.
For a given confidence level (1-α)100%, the power of the rule to detect
a change in means is:
Power = 1 - p
where p is the probability defined by Equation (6) above, evaluated using the
value of K that corresponds to delta.over.sigma=0. Thus, when the argument
delta.over.sigma=0, the value of p is (1-α) and the power is
simply α (i.e., 1 minus the confidence level). As delta.over.sigma
increases above 0, the power increases.
When pi.type="lower"
, the same value of K
is used as when
pi.type="upper"
, but Equation (4) is used to construct the prediction
interval. Thus, the power increases as delta.over.sigma
decreases below 0.
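As a minimal numerical check of this last point (added for illustration, and consistent with the first example below): when delta.over.sigma = 0 the reported power reduces to the Type I error rate, 1 - conf.level.
# With the default conf.level = 0.95, the power at delta.over.sigma = 0 is 0.05
predIntNormSimultaneousTestPower(n = 4, m = 3, delta.over.sigma = 0)
#[1] 0.05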
Power Based on the California Rule (rule="CA")
For the California rule (rule="CA"), with probability (1-α)100%,
for each of the r future sampling occasions, either the first observation will
fall in the prediction interval, or else all of the next m-1 observations will
fall in the prediction interval. That is, if the first observation falls in the
prediction interval then sampling can stop. Otherwise, m-1 more observations
must be taken.
The derivation of the power is the same as for the k-of-m rule, except
that Equation (6) is replaced by an analogous expression given by Davis (1998b).
Power Based on the Modified California Rule (rule="Modified.CA")
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
The derivation of the power is the same as for the k-of-m rule, except
that Equation (6) is replaced by an analogous expression given by Davis (1998b).
vector of values between 0 and 1 equal to the probability that the rule will be violated.
See the help file for predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNormSimultaneousTestPower
and plotPredIntNormSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
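One rough way to trace such a power curve directly (an added sketch with settings chosen here for illustration; plotPredIntNormSimultaneousTestPowerCurve automates this) is to evaluate the power over a grid of scaled mean shifts and plot the result:
# Power of the 1-of-3 rule (n = 8, r = 1, upper 95% limit) as a function of the
# scaled shift in means
dos <- seq(0, 3, by = 0.5)
pow <- predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, delta.over.sigma = dos)
plot(dos, pow, type = "b", xlab = "delta.over.sigma", ylab = "Power")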
Steven P. Millard ([email protected])
See the help file for predIntNormSimultaneous
.
predIntNormSimultaneous
, predIntNormSimultaneousK
, plotPredIntNormSimultaneousTestPowerCurve
,
predIntNorm
, predIntNormK
, predIntNormTestPower
, Prediction Intervals,
Normal.
# For the k-of-m rule with n=4, k=1, m=3, and r=1, show how the power increases # as delta.over.sigma increases. Assume a 95% upper prediction interval. predIntNormSimultaneousTestPower(n = 4, m = 3, delta.over.sigma = 0:2) #[1] 0.0500000 0.2954156 0.7008558 #---------- # Look at how the power increases with sample size for an upper one-sided # prediction interval using the k-of-m rule with k=1, m=3, r=20, # delta.over.sigma=2, and a confidence level of 95%. predIntNormSimultaneousTestPower(n = c(4, 8), m = 3, r = 20, delta.over.sigma = 2) #[1] 0.6075972 0.9240924 #---------- # Compare the power for the 1-of-3 rule with the power for the California and # Modified California rules, based on a 95% upper prediction interval and # delta.over.sigma=2. Assume a sample size of n=8. Note that in this case the # power for the Modified California rule is greater than the power for the # 1-of-3 rule and California rule. predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, delta.over.sigma = 2) #[1] 0.788171 predIntNormSimultaneousTestPower(n = 8, m = 3, rule = "CA", delta.over.sigma = 2) #[1] 0.7160434 predIntNormSimultaneousTestPower(n = 8, rule = "Modified.CA", delta.over.sigma = 2) #[1] 0.8143687 #---------- # Show how the power for an upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. Here, we'll use the # 1-of-3 rule with n=8 and delta.over.sigma=1. predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, r=c(1, 2, 5, 10), delta.over.sigma = 1) #[1] 0.3492512 0.4032111 0.4503603 0.4633773 #========== # USEPA (2009) contains an example on page 19-23 that involves monitoring # nw=100 compliance wells at a large facility with minimal natural spatial # variation every 6 months for nc=20 separate chemicals. # There are n=25 background measurements for each chemical to use to create # simultaneous prediction intervals. We would like to determine which kind of # resampling plan based on normal distribution simultaneous prediction intervals to # use (1-of-m, 1-of-m based on means, or Modified California) in order to have # adequate power of detecting an increase in chemical concentration at any of the # 100 wells while at the same time maintaining a site-wide false positive rate # (SWFPR) of 10% per year over all 4,000 comparisons # (100 wells x 20 chemicals x semi-annual sampling). # The function predIntNormSimultaneousTestPower includes the argument "r" # that is the number of future sampling occasions (r=2 in this case because # we are performing semi-annual sampling), so to compute the individual test # Type I error level alpha.test (and thus the individual test confidence level), # we only need to worry about the number of wells (100) and the number of # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)). The individual # confidence level is simply 1-alpha.test. Plugging in 0.1 for alpha, # 100 for nw, and 20 for nc yields an individual test confidence level of # 1-alpha.test = 0.9999473. nc <- 20 nw <- 100 conf.level <- (1 - 0.1)^(1 / (nc * nw)) conf.level #[1] 0.9999473 # Now we can compute the power of any particular sampling strategy using # predIntNormSimultaneousTestPower. 
For example, here is the power of # detecting an increase of three standard deviations in concentration using # the prediction interval based on the "1-of-2" resampling rule: predIntNormSimultaneousTestPower(n = 25, k = 1, m = 2, r = 2, rule = "k.of.m", delta.over.sigma = 3, pi.type = "upper", conf.level = conf.level) #[1] 0.3900202 # The following commands will reproduce the table shown in Step 2 on page # 19-23 of USEPA (2009). Because these commands can take more than a few # seconds to execute, we have commented them out here. To run this example, # just remove the pound signs (#) that are in front of R commands. #rule.vec <- c(rep("k.of.m", 3), "Modified.CA", rep("k.of.m", 3)) #m.vec <- c(2, 3, 4, 4, 1, 2, 1) #n.mean.vec <- c(rep(1, 4), 2, 2, 3) #n.scenarios <- length(rule.vec) #K.vec <- numeric(n.scenarios) #Power.vec <- numeric(n.scenarios) #K.vec <- predIntNormSimultaneousK(n = 25, k = 1, m = m.vec, n.mean = n.mean.vec, # r = 2, rule = rule.vec, pi.type = "upper", conf.level = conf.level) #Power.vec <- predIntNormSimultaneousTestPower(n = 25, k = 1, m = m.vec, # n.mean = n.mean.vec, r = 2, rule = rule.vec, delta.over.sigma = 3, # pi.type = "upper", conf.level = conf.level) #Power.df <- data.frame(Rule = rule.vec, k = rep(1, n.scenarios), m = m.vec, # N.Mean = n.mean.vec, K = round(K.vec, 2), Power = round(Power.vec, 2), # Total.Samples = m.vec * n.mean.vec) #Power.df # Rule k m N.Mean K Power Total.Samples #1 k.of.m 1 2 1 3.16 0.39 2 #2 k.of.m 1 3 1 2.33 0.65 3 #3 k.of.m 1 4 1 1.83 0.81 4 #4 Modified.CA 1 4 1 2.57 0.71 4 #5 k.of.m 1 1 2 3.62 0.41 2 #6 k.of.m 1 2 2 2.33 0.85 4 #7 k.of.m 1 1 3 2.99 0.71 3 # The above table shows the K-multipliers for each prediction interval, along with # the power of detecting a change in concentration of three standard deviations at # any of the 100 wells during the course of a year, for each of the sampling # strategies considered. The last three rows of the table correspond to sampling # strategies that involve using the mean of two or three observations. #========== # Clean up #--------- rm(nc, nw, conf.level, rule.vec, m.vec, n.mean.vec, n.scenarios, K.vec, Power.vec, Power.df)
Compute the probability that at least one out of k future observations
(or means) falls outside a prediction interval for k future observations
(or means) for a normal distribution.
predIntNormTestPower(n, df = n - 1, n.mean = 1, k = 1, delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95)
predIntNormTestPower(n, df = n - 1, n.mean = 1, k = 1, delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95)
n |
vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
vector of positive integers indicating the degrees of freedom associated with the sample size. The default value is df=n-1. |
n.mean |
positive integer specifying the sample size associated with the future averages. The default value is n.mean=1 (i.e., individual observations). |
k |
vector of positive integers specifying the number of future observations that the prediction interval should contain with confidence level conf.level. The default value is k=1. |
delta.over.sigma |
vector of numbers indicating the ratio Delta/sigma, where Delta denotes the difference between the means of the sampled and future populations and sigma denotes the common population standard deviation (see DETAILS). The default value is delta.over.sigma=0. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="upper" (the default) and pi.type="lower". |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the prediction interval. The default value is conf.level=0.95. |
What is a Prediction Interval?
A prediction interval for some population is an interval on the real line
constructed so that it will contain k future observations or averages
from that population with some specified probability (1-alpha)100%,
where 0 < alpha < 1 and k is some pre-specified positive integer.
The quantity (1-alpha) is called the confidence coefficient or
confidence level associated with the prediction interval. The function
predIntNorm
computes a standard prediction interval based on a
sample from a normal distribution. The function predIntNormTestPower
computes the probability that at least one out of the k future observations or
averages will not be contained in the prediction interval,
where the population mean for the future observations is allowed to differ from
the population mean for the observations used to construct the prediction interval.
The Form of a Prediction Interval
Let x = (x_1, x_2, ..., x_n) denote a vector of n
observations from a normal distribution with parameters
mean=mu and sd=sigma. Also, let m denote the
sample size associated with the k future averages (i.e.,
n.mean=m).
When m=1, each average is really just a single observation, so in the rest of
this help file the term "averages" will replace the phrase
"observations or averages".
For a normal distribution, the form of a two-sided (1-alpha)100% prediction
interval is:

[xbar - K*s, xbar + K*s]    (1)

where xbar denotes the sample mean:

xbar = (1/n) * sum(x_i, i = 1, ..., n)    (2)

s denotes the sample standard deviation:

s^2 = [1/(n-1)] * sum((x_i - xbar)^2, i = 1, ..., n)    (3)

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k). The symbol K is used here to be consistent with the
notation used for tolerance intervals (see
tolIntNorm
).
Similarly, the form of a one-sided lower prediction interval is:

[xbar - K*s, Inf]    (4)

and the form of a one-sided upper prediction interval is:

[-Inf, xbar + K*s]    (5)

but the value of K differs for one-sided versus two-sided prediction intervals.
The derivation of the constant K is explained in the help file for
predIntNormK
.
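To make the role of the constant K concrete, here is a minimal sketch (not part of the original help file) that computes K with predIntNormK and reconstructs the upper prediction limit by hand; it assumes the EnvStats functions predIntNormK and predIntNorm with their usual default arguments.

# Sketch: rebuild the upper prediction limit from K, the sample mean,
# and the sample standard deviation (assumed defaults of predIntNormK/predIntNorm)
set.seed(47)
x <- rnorm(8, mean = 10, sd = 2)
K <- predIntNormK(n = 8, k = 1, pi.type = "upper", conf.level = 0.95)
mean(x) + K * sd(x)                        # upper prediction limit "by hand"
predIntNorm(x, k = 1, pi.type = "upper")   # should report the same UPL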
Computing Power
The "power" of the prediction interval is defined as the probability that at
least one out of the future observations or averages
will not be contained in the prediction interval, where the population mean
for the future observations is allowed to differ from the population mean for the
observations used to construct the prediction interval. The probability
that all
future observations will be contained in a one-sided upper
prediction interval (
pi.type="upper"
) is given in Equation (6) of the help
file for
predIntNormSimultaneousK
, where and
:
where denotes the cdf of the
non-central Student's t-distribution with parameters
df=
and
ncp=
evaluated at
;
denotes the cdf of the standard normal distribution
evaluated at
; and
denotes the value of the
beta function with parameters
a=
and
b=
.
The quantity (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity
(sigma) denotes the population standard deviation of both
of these populations. Usually you assume
unless you are interested
in computing the power of the rule to detect a change in means between the
populations, as we are here.
If we are interested in using averages instead of single observations, with
(i.e.,
n.mean
), the first
term in the integral in Equation (6) that involves the cdf of the
non-central Student's t-distribution becomes:
For a given confidence level , the power of the rule to detect
a change in means is simply given by:
where is defined in Equation (6) above using the value of
that
corresponds to
. Thus, when the argument
delta.over.sigma=0
, the value of is
and the power is
simply
. As
delta.over.sigma
increases above 0, the power
increases.
When pi.type="lower"
, the same value of K
is used as when
pi.type="upper"
, but Equation (4) is used to construct the prediction
interval. Thus, the power increases as delta.over.sigma
decreases below 0.
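As a quick numerical check (ours, not from the original help file): at delta.over.sigma=0 the power equals the Type I error rate 1 - conf.level, and for a lower prediction limit the power grows as delta.over.sigma moves below 0.

# Power equals 1 - conf.level when there is no shift in means
predIntNormTestPower(n = 10, delta.over.sigma = 0, conf.level = 0.95)
#[1] 0.05
# For a lower prediction limit, power increases as the shift becomes more negative
predIntNormTestPower(n = 10, delta.over.sigma = c(0, -1, -2), pi.type = "lower")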
vector of values between 0 and 1 equal to the probability that at least one of
the k future observations or averages will fall outside the prediction interval.
See the help files for predIntNorm
and
predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNormTestPower
and plotPredIntNormTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations. In the case of a simple shift between the two means, the test based
on a prediction interval is not as powerful as the two-sample t-test. However, the
test based on a prediction interval is more efficient at detecting a shift in the
tail.
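A rough, hedged comparison along these lines (our construction, not from the help file): take n = 8 background observations and k = 8 future observations, and compare the prediction-interval power with the power of a one-sided two-sample t-test with 8 observations per group; for a simple shift of two standard deviations the t-test power should come out higher.

# Prediction-interval test: n = 8 background, k = 8 future observations
predIntNormTestPower(n = 8, k = 8, delta.over.sigma = 2, conf.level = 0.95)
# Two-sample t-test with 8 observations per group and the same shift
power.t.test(n = 8, delta = 2, sd = 1, sig.level = 0.05,
    type = "two.sample", alternative = "one.sided")$power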
Steven P. Millard ([email protected])
See the help files for predIntNorm
and
predIntNormSimultaneous
.
predIntNorm
, predIntNormK
,
plotPredIntNormTestPowerCurve
, predIntNormSimultaneous
,
predIntNormSimultaneousK
,
predIntNormSimultaneousTestPower
, Prediction Intervals,
Normal.
# Show how the power increases as delta.over.sigma increases. # Assume a 95% upper prediction interval. predIntNormTestPower(n = 4, delta.over.sigma = 0:2) #[1] 0.0500000 0.1743014 0.3990892 #---------- # Look at how the power increases with sample size for a one-sided upper # prediction interval with k=3, delta.over.sigma=2, and a confidence level # of 95%. predIntNormTestPower(n = c(4, 8), k = 3, delta.over.sigma = 2) #[1] 0.3578250 0.5752113 #---------- # Show how the power for an upper 95% prediction limit increases as the # number of future observations k increases. Here, we'll use n=20 and # delta.over.sigma=1. predIntNormTestPower(n = 20, k = 1:3, delta.over.sigma = 1) #[1] 0.2408527 0.2751074 0.2936486
Construct a nonparametric prediction interval to contain at least k out of the
next m future observations with probability (1-alpha)100% for a
continuous distribution.
predIntNpar(x, k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "two-sided")
predIntNpar(x, k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "two-sided")
x |
a numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
k |
positive integer specifying the minimum number of future observations out of m that should be contained in the prediction interval. The default value is k=m. |
m |
positive integer specifying the number of future observations. The default value is m=1. |
lpl.rank |
positive integer indicating the rank of the order statistic to use for the lower bound of the prediction interval. If pi.type="two-sided" or pi.type="lower", the default value is lpl.rank=1 (i.e., use the smallest value in x as the lower bound). If pi.type="upper", this argument is set equal to 0 and the lower bound of the prediction interval is lb. |
n.plus.one.minus.upl.rank |
positive integer related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default for pi.type="two-sided" and pi.type="upper") means use the first largest value in x (i.e., the maximum) as the upper bound, a value of 2 means use the second largest value, and so on. If pi.type="lower", this argument is set equal to 0 and the upper bound of the prediction interval is ub. |
lb, ub |
scalars indicating lower and upper bounds on the distribution. By default, lb=-Inf and ub=Inf. If you are constructing a prediction interval for a distribution that you know has a lower bound other than -Inf (e.g., 0), set lb to this value. Similarly for ub. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="two-sided" (the default), pi.type="lower", and pi.type="upper". |
What is a Nonparametric Prediction Interval?
A nonparametric prediction interval for some population is an interval on the
real line constructed so that it will contain at least k of m future
observations from that population with some specified probability
(1-alpha)100%, where 0 < alpha < 1 and k and m are
pre-specified positive integers with k <= m.
The quantity (1-alpha) is called
the confidence coefficient or confidence level associated with the prediction
interval.
The Form of a Nonparametric Prediction Interval
Let x = (x_1, x_2, ..., x_n) denote a vector of n
independent observations from some continuous distribution, and let
x_(i) denote the i'th order statistic in x.
A two-sided nonparametric prediction interval is constructed as:

[x_(u), x_(v)]    (1)

where u and v are positive integers between 1 and n, and u < v. That is,
u denotes the rank of the lower prediction limit, and v
denotes the rank of the upper prediction limit. To make it easier to write
some equations later on, we can also write the prediction interval (1) in a slightly
different way as:

[x_(u), x_(n+1-w)]    (2)

where

w = n + 1 - v    (3)

so that w is a positive integer between 1 and n, and u < n + 1 - w.
In terms of the arguments to the function
predIntNpar
,
the argument lpl.rank
corresponds to u, and the argument
n.plus.one.minus.upl.rank
corresponds to w.
If we allow u = 0 and w = 0
and define lower and upper bounds as:

x_(0) = lb    (4)

x_(n+1) = ub    (5)

then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as:

[x_(u), ub]    (6)

and a one-sided upper nonparametric prediction interval is constructed as:

[lb, x_(n+1-w)]    (7)

Usually, u = 0 or 1, and w = 0 or 1.
Constructing Nonparametric Prediction Intervals for Future Observations
Danziger and Davis (1964) show that the probability that at least k out of
the next m observations will fall in the interval defined in Equation (2)
is given by:

(1 - alpha) = [ sum(j = k, ..., m) choose(u+w+m-j-1, m-j) * choose(n+j-u-w, j) ] / choose(n+m, m)    (8)

(Note that computing a nonparametric prediction interval for the case
k = m = 1 is equivalent to computing a nonparametric beta-expectation
tolerance interval; see
tolIntNpar
).
The Special Case of Using the Minimum and the Maximum
Setting u = w = 1 implies using the smallest and largest observed values as
the prediction limits. In this case, it can be shown that the probability that at
least k out of the next m observations will fall in the interval

[x_(1), x_(n)]    (9)

is given by:

(1 - alpha) = [ sum(j = k, ..., m) (m-j+1) * choose(n+j-2, j) ] / choose(n+m, m)    (10)

Setting k = m in Equation (10), the probability that all of the next m
observations will fall in the interval defined in Equation (9) is given by:

(1 - alpha) = [n * (n-1)] / [(n+m) * (n+m-1)]    (11)

For one-sided prediction limits, the probability that all m future
observations will fall below x_(n) (upper prediction limit;
pi.type="upper") and the probability that all m future observations
will fall above x_(1) (lower prediction limit;
pi.type="lower") are
both given by:

(1 - alpha) = n / (n+m)    (12)
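As a sanity check (ours, not part of the original help file), Equations (11) and (12) can be compared with predIntNparConfLevel, documented later in this package, using sample sizes that appear in the examples below.

# Equation (11): two-sided interval from the minimum and maximum, k = m
n <- 20; m <- 1
n * (n - 1) / ((n + m) * (n + m - 1))
#[1] 0.9047619
predIntNparConfLevel(n = n, m = m, pi.type = "two.sided")
#[1] 0.9047619
# Equation (12): one-sided upper limit at the maximum, k = m
n <- 18; m <- 4
n / (n + m)
#[1] 0.8181818
predIntNparConfLevel(n = n, m = m, pi.type = "upper")
#[1] 0.8181818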
Constructing Nonparametric Prediction Intervals for Future Medians
To construct a nonparametric prediction interval for a future median based on
b future observations, where b is odd, note that this is equivalent to
constructing a nonparametric prediction interval that must hold
at least (b+1)/2 of the next b future observations.
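For example (our illustration, using the numbers from Example 18-4 in the examples below), a one-sided upper prediction interval for the median of 3 future observations based on n = 24 background values has the same confidence level as a 2-of-3 interval for individual observations:

predIntNparConfLevel(n = 24, k = 2, m = 3, pi.type = "upper")
#[1] 0.991453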
a list of class "estimate"
containing the prediction interval and other
information. See the help file for estimate.object
for details.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Normal Population on Each of r Future Occasions. Technometrics 29, 359–370.
Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.
Davis, C.B., and R.J. McNichols. (1994b). Ground Water Monitoring Statistics Update: Part II: Nonparametric Prediction Limits. Ground Water Monitoring and Remediation 14(4), 159–175.
Davis, C.B., and R.J. McNichols. (1999). Simultaneous Nonparametric Prediction Limits (with Discussion). Technometrics 41(2), 89–112.
Gibbons, R.D. (1987a). Statistical Prediction Intervals for the Evaluation of Ground-Water Quality. Ground Water 25, 455–465.
Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.
Gibbons, R.D., and J. Baker. (1991). The Properties of Various Statistical Prediction Intervals for Ground-Water Detection Monitoring. Journal of Environmental Science and Health A26(4), 535–553.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.
Hall, I.J., R.R. Prairie, and C.K. Motlagh. (1975). Non-Parametric Prediction Intervals. Journal of Quality Technology 7(3), 109–114.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
estimate.object
, predIntNparN
,
predIntNparConfLevel
, plotPredIntNparDesign
.
# Generate 20 observations from a lognormal mixture distribution with # parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1. Use # predIntNpar to construct a two-sided prediction interval using the # minimum and maximum observed values. Note that the associated confidence # level is 90%. A larger sample size is required to obtain a larger # confidence level (see the help file for predIntNparN). # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormMixAlt(n = 20, mean1 = 1, cv1 = 0.5, mean2 = 5, cv2 = 1, p.mix = 0.1) predIntNpar(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Exact # #Prediction Interval Type: two-sided # #Confidence Level: 90.47619% # #Prediction Limit Rank(s): 1 20 # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.3647875 # UPL = 1.8173115 #---------- # Repeat the above example, but specify m=5 future observations should be # contained in the prediction interval. Note that the confidence level is # now only 63%. predIntNpar(dat, m = 5) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Exact # #Prediction Interval Type: two-sided # #Confidence Level: 63.33333% # #Prediction Limit Rank(s): 1 20 # #Number of Future Observations: 5 # #Prediction Interval: LPL = 0.3647875 # UPL = 1.8173115 #---------- # Repeat the above example, but specify that a minimum of k=3 observations # out of a total of m=5 future observations should be contained in the # prediction interval. Note that the confidence level is now 98%. predIntNpar(dat, k = 3, m = 5) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Exact # #Prediction Interval Type: two-sided # #Confidence Level: 98.37945% # #Prediction Limit Rank(s): 1 20 # #Minimum Number of #Future Observations #Interval Should Contain: 3 # #Total Number of #Future Observations: 5 # #Prediction Interval: LPL = 0.3647875 # UPL = 1.8173115 #========== # Example 18-3 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # 4 future observations of trichloroethylene (TCE) at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.3.TCE.df. # There are 6 monthly observations of TCE (ppb) at 3 background wells, # and 4 monthly observations of TCE at a compliance well. # Look at the data #----------------- EPA.09.Ex.18.3.TCE.df # Month Well Well.type TCE.ppb.orig TCE.ppb Censored #1 1 BW-1 Background <5 5.0 TRUE #2 2 BW-1 Background <5 5.0 TRUE #3 3 BW-1 Background 8 8.0 FALSE #... #22 4 CW-4 Compliance <5 5.0 TRUE #23 5 CW-4 Compliance 8 8.0 FALSE #24 6 CW-4 Compliance 14 14.0 FALSE longToWide(EPA.09.Ex.18.3.TCE.df, "TCE.ppb.orig", "Month", "Well", paste.row.name = TRUE) # BW-1 BW-2 BW-3 CW-4 #Month.1 <5 7 <5 #Month.2 <5 6.5 <5 #Month.3 8 <5 10.5 7.5 #Month.4 <5 6 <5 <5 #Month.5 9 12 <5 8 #Month.6 10 <5 9 14 # Construct the prediction limit based on the background well data # using the maximum value as the upper prediction limit. 
# Note that since all censored observations are censored at one # censoring level and the censoring level is less than all of the # uncensored observations, we can just supply the censoring level # to predIntNpar. #----------------------------------------------------------------- with(EPA.09.Ex.18.3.TCE.df, predIntNpar(TCE.ppb[Well.type == "Background"], m = 4, pi.type = "upper", lb = 0)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: TCE.ppb[Well.type == "Background"] # #Sample Size: 18 # #Prediction Interval Method: Exact # #Prediction Interval Type: upper # #Confidence Level: 81.81818% # #Prediction Limit Rank(s): 18 # #Number of Future Observations: 4 # #Prediction Interval: LPL = 0 # UPL = 12 # Since the value of 14 ppb for Month 6 at the compliance well exceeds # the upper prediction limit of 12, we might conclude that there is # statistically significant evidence of an increase over background # at CW-4. However, the confidence level associated with this # prediction limit is about 82%, which implies a Type I error level of # 18%. This means there is nearly a one in five chance of a false positive. # Only additional background data and/or use of a retesting strategy # (see predIntNparSimultaneous) would lower the false positive rate. #========== # Example 18-4 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # median of order 3 of xylene at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.4.xylene.df. # There are 8 monthly observations of xylene (ppb) at 3 background wells, # and 3 monthly observations of xylene at a compliance well. # Look at the data #----------------- EPA.09.Ex.18.4.xylene.df # Month Well Well.type Xylene.ppb.orig Xylene.ppb Censored #1 1 Well.1 Background <5 5.0 TRUE #2 2 Well.1 Background <5 5.0 TRUE #3 3 Well.1 Background 7.5 7.5 FALSE #... #30 6 Well.4 Compliance <5 5.0 TRUE #31 7 Well.4 Compliance 7.8 7.8 FALSE #32 8 Well.4 Compliance 10.4 10.4 FALSE longToWide(EPA.09.Ex.18.4.xylene.df, "Xylene.ppb.orig", "Month", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 #Month.1 <5 9.2 <5 #Month.2 <5 <5 5.4 #Month.3 7.5 <5 6.7 #Month.4 <5 6.1 <5 #Month.5 <5 8 <5 #Month.6 <5 5.9 <5 <5 #Month.7 6.4 <5 <5 7.8 #Month.8 6 <5 <5 10.4 # Construct the prediction limit based on the background well data # using the maximum value as the upper prediction limit. # Note that since all censored observations are censored at one # censoring level and the censoring level is less than all of the # uncensored observations, we can just supply the censoring level # to predIntNpar. # # To compute a prediction interval for a median of order 3 (i.e., # a median based on 3 observations), this is equivalent to # constructing a nonparametric prediction interval that must hold # at least 2 of the next 3 future observations.
#----------------------------------------------------------------- with(EPA.09.Ex.18.4.xylene.df, predIntNpar(Xylene.ppb[Well.type == "Background"], k = 2, m = 3, pi.type = "upper", lb = 0)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Xylene.ppb[Well.type == "Background"] # #Sample Size: 24 # #Prediction Interval Method: Exact # #Prediction Interval Type: upper # #Confidence Level: 99.1453% # #Prediction Limit Rank(s): 24 # #Minimum Number of #Future Observations #Interval Should Contain: 2 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = 0.0 # UPL = 9.2 # The Month 8 observation at the Compliance well is 10.4 ppb of Xylene, # which is greater than the upper prediction limit of 9.2 ppb, so # conclude there is evidence of contamination at the # 100% - 99% = 1% Type I Error Level #========== # Cleanup #-------- rm(dat)
Compute the confidence level associated with a nonparametric prediction interval
that should contain at least k out of the next m future observations
for a continuous distribution.
predIntNparConfLevel(n, k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "two.sided")
predIntNparConfLevel(n, k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "two.sided")
n |
vector of positive integers specifying the sample sizes. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. |
k |
vector of positive integers specifying the minimum number of future observations out of m that should be contained in the prediction interval. The default value is k=m. |
m |
vector of positive integers specifying the number of future observations. The default value is m=1. |
lpl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound of the prediction interval. If pi.type="two.sided" or pi.type="lower", the default value is lpl.rank=1 (i.e., use the smallest value in the sample). If pi.type="upper", this argument is set equal to 0. |
n.plus.one.minus.upl.rank |
vector of positive integers related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default for pi.type="two.sided" and pi.type="upper") means use the first largest value in the sample (i.e., the maximum), a value of 2 means use the second largest value, and so on. If pi.type="lower", this argument is set equal to 0. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="two.sided" (the default), pi.type="lower", and pi.type="upper". |
If the arguments n
, k
, m
, lpl.rank
, and
n.plus.one.minus.upl.rank
are not all the same length, they are replicated
to be the same length as the length of the longest argument.
The help file for predIntNpar
explains how nonparametric prediction
intervals are constructed and how the confidence level
associated with the prediction interval is computed based on specified values
for the sample size and the ranks of the order statistics used for
the bounds of the prediction interval.
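For instance, the following small illustration (ours, not from the help file) shows how moving from the largest to the second-largest value as the upper prediction limit (n.plus.one.minus.upl.rank = 2) lowers the confidence level; the first element equals the 0.8181818 that appears in the example below.

predIntNparConfLevel(n = 18, m = 4, pi.type = "upper",
    n.plus.one.minus.upl.rank = 1:2)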
vector of values between 0 and 1 indicating the confidence level associated with the specified nonparametric prediction interval.
See the help file for predIntNpar
.
Steven P. Millard ([email protected])
See the help file for predIntNpar
.
predIntNpar
, predIntNparN
,
plotPredIntNparDesign
.
# Look at how the confidence level of a nonparametric prediction interval # increases with increasing sample size: seq(5, 25, by = 5) #[1] 5 10 15 20 25 round(predIntNparConfLevel(n = seq(5, 25, by = 5)), 2) #[1] 0.67 0.82 0.87 0.90 0.92 #--------- # Look at how the confidence level of a nonparametric prediction interval # decreases as the number of future observations increases: round(predIntNparConfLevel(n = 10, m = 1:5), 2) #[1] 0.82 0.68 0.58 0.49 0.43 #---------- # Look at how the confidence level of a nonparametric prediction interval # decreases with minimum number of observations that must be contained within # the interval (k): round(predIntNparConfLevel(n = 10, k = 1:5, m = 5), 2) #[1] 1.00 0.98 0.92 0.76 0.43 #---------- # Look at how the confidence level of a nonparametric prediction interval # decreases with the rank of the lower prediction limit: round(predIntNparConfLevel(n = 10, lpl.rank = 1:5), 2) #[1] 0.82 0.73 0.64 0.55 0.45 #========== # Example 18-3 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # 4 future observations of trichloroethylene (TCE) at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.3.TCE.df. # There are 6 monthly observations of TCE (ppb) at 3 background wells, # and 4 monthly observations of TCE at a compliance well. # Look at the data #----------------- EPA.09.Ex.18.3.TCE.df # Month Well Well.type TCE.ppb.orig TCE.ppb Censored #1 1 BW-1 Background <5 5.0 TRUE #2 2 BW-1 Background <5 5.0 TRUE #3 3 BW-1 Background 8 8.0 FALSE #... #22 4 CW-4 Compliance <5 5.0 TRUE #23 5 CW-4 Compliance 8 8.0 FALSE #24 6 CW-4 Compliance 14 14.0 FALSE longToWide(EPA.09.Ex.18.3.TCE.df, "TCE.ppb.orig", "Month", "Well", paste.row.name = TRUE) # BW-1 BW-2 BW-3 CW-4 #Month.1 <5 7 <5 #Month.2 <5 6.5 <5 #Month.3 8 <5 10.5 7.5 #Month.4 <5 6 <5 <5 #Month.5 9 12 <5 8 #Month.6 10 <5 9 14 # If we construct the prediction limit based on the background well # data using the maximum value as the upper prediction limit, # the associated confidence level is only 82%. #----------------------------------------------------------------- predIntNparConfLevel(n = 18, m = 4, pi.type = "upper") #[1] 0.8181818 # We would have to collect an additional 18 observations to achieve a # confidence level of at least 90%: predIntNparN(m = 4, pi.type = "upper", conf.level = 0.9) #[1] 36 predIntNparConfLevel(n = 36, m = 4, pi.type = "upper") #[1] 0.9
Compute the sample size necessary for a nonparametric prediction interval to
contain at least k out of the next m future observations with
probability (1-alpha)100% for a continuous distribution.
predIntNparN(k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "two.sided", conf.level = 0.95, n.max = 5000, maxiter = 1000)
predIntNparN(k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "two.sided", conf.level = 0.95, n.max = 5000, maxiter = 1000)
k |
vector of positive integers specifying the minimum number of future observations out of m that should be contained in the prediction interval. The default value is k=m. |
m |
vector of positive integers specifying the number of future observations. The default value is m=1. |
lpl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound of the prediction interval. If pi.type="two.sided" or pi.type="lower", the default value is lpl.rank=1 (i.e., use the smallest value in the sample). If pi.type="upper", this argument is set equal to 0. |
n.plus.one.minus.upl.rank |
vector of positive integers related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default for pi.type="two.sided" and pi.type="upper") means use the first largest value in the sample (i.e., the maximum), a value of 2 means use the second largest value, and so on. If pi.type="lower", this argument is set equal to 0. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="two.sided" (the default), pi.type="lower", and pi.type="upper". |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level associated with the prediction interval. The default value is conf.level=0.95. |
n.max |
positive integer greater than 1 indicating the maximum possible sample size. The default value is n.max=5000. |
maxiter |
positive integer indicating the maximum number of iterations to use in the uniroot search algorithm. The default value is maxiter=1000. |
If the arguments k
, m
, lpl.rank
, and
n.plus.one.minus.upl.rank
are not all the same length, they are replicated
to be the same length as the length of the longest argument.
The function predIntNparN
initially computes the required sample size n
by solving Equation (11) or (12) in the help file for
predIntNpar
for n, depending on the value of the argument
pi.type
. If k < m
,
lpl.rank > 1
(two-sided and lower prediction intervals only), or n.plus.one.minus.upl.rank > 1
(two-sided and upper prediction intervals only),
then this initial value of n is used as the upper bound in a binary search
based on Equation (8) in the help file for
predIntNpar
, which is implemented via the R function uniroot
with the tolerance argument set to 1
.
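For the simplest one-sided case (k = m, prediction limit at the maximum or minimum), Equation (12) can be solved for n directly, giving n >= m * conf.level / (1 - conf.level). This minimal check (ours, not from the help file) reproduces the sample size used in the example below.

m <- 4; conf.level <- 0.9
ceiling(m * conf.level / (1 - conf.level))
#[1] 36
predIntNparN(m = m, pi.type = "upper", conf.level = conf.level)
#[1] 36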
vector of positive integers indicating the required sample size(s) for the specified nonparametric prediction interval(s).
See the help file for predIntNpar
.
Steven P. Millard ([email protected])
See the help file for predIntNpar
.
predIntNpar
, predIntNparConfLevel
,
plotPredIntNparDesign
.
# Look at how the required sample size for a nonparametric prediction interval # increases with increasing confidence level: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 predIntNparN(conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 3 4 6 9 19 #---------- # Look at how the required sample size for a nonparametric prediction interval # increases with number of future observations (m): 1:5 #[1] 1 2 3 4 5 predIntNparN(m = 1:5) #[1] 39 78 116 155 193 #---------- # Look at how the required sample size for a nonparametric prediction interval # increases with minimum number of observations that must be contained within # the interval (k): predIntNparN(k = 1:5, m = 5) #[1] 4 7 13 30 193 #---------- # Look at how the required sample size for a nonparametric prediction interval # increases with the rank of the lower prediction limit: predIntNparN(lpl.rank = 1:5) #[1] 39 59 79 100 119 #========== # Example 18-3 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # 4 future observations of trichloroethylene (TCE) at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.3.TCE.df. # There are 6 monthly observations of TCE (ppb) at 3 background wells, # and 4 monthly observations of TCE at a compliance well. # Look at the data #----------------- EPA.09.Ex.18.3.TCE.df # Month Well Well.type TCE.ppb.orig TCE.ppb Censored #1 1 BW-1 Background <5 5.0 TRUE #2 2 BW-1 Background <5 5.0 TRUE #3 3 BW-1 Background 8 8.0 FALSE #... #22 4 CW-4 Compliance <5 5.0 TRUE #23 5 CW-4 Compliance 8 8.0 FALSE #24 6 CW-4 Compliance 14 14.0 FALSE longToWide(EPA.09.Ex.18.3.TCE.df, "TCE.ppb.orig", "Month", "Well", paste.row.name = TRUE) # BW-1 BW-2 BW-3 CW-4 #Month.1 <5 7 <5 #Month.2 <5 6.5 <5 #Month.3 8 <5 10.5 7.5 #Month.4 <5 6 <5 <5 #Month.5 9 12 <5 8 #Month.6 10 <5 9 14 # If we construct the prediction limit based on the background well # data using the maximum value as the upper prediction limit, # the associated confidence level is only 82%. #----------------------------------------------------------------- predIntNparConfLevel(n = 18, m = 4, pi.type = "upper") #[1] 0.8181818 # We would have to collect an additional 18 observations to achieve a # confidence level of at least 90%: predIntNparN(m = 4, pi.type = "upper", conf.level = 0.9) #[1] 36 predIntNparConfLevel(n = 36, m = 4, pi.type = "upper") #[1] 0.9
Construct a nonparametric simultaneous prediction interval for the next
r sampling "occasions" based on one of three
possible rules: k-of-m, California, or Modified California. The simultaneous
prediction interval assumes the observations come from a continuous distribution.
predIntNparSimultaneous(x, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "upper", integrate.args.list = NULL)
predIntNparSimultaneous(x, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "upper", integrate.args.list = NULL)
x |
a numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
n.median |
positive odd integer specifying the sample size associated with the future medians. The default value is n.median=1 (i.e., individual observations). |
k |
for the k-of-m rule (rule="k.of.m"), positive integer specifying the minimum number of observations (or medians) out of m observations (or medians), all obtained on one future sampling "occasion", that the prediction interval should contain. The default value is k=1. This argument is ignored when rule is not equal to "k.of.m". |
m |
positive integer specifying the maximum number of future observations (or medians) on one future sampling "occasion". The default value is m=2. |
r |
positive integer specifying the number of future sampling "occasions". The default value is r=1. |
rule |
character string specifying which rule to use. The possible values are "k.of.m" (k-of-m rule; the default), "CA" (California rule), and "Modified.CA" (modified California rule). |
lpl.rank |
positive integer indicating the rank of the order statistic to use for the lower bound of the prediction interval. When pi.type="lower", the default value is lpl.rank=1 (i.e., use the smallest value in x as the lower bound). When pi.type="upper" (the default), this argument is set equal to 0 and the lower bound of the prediction interval is lb. |
n.plus.one.minus.upl.rank |
positive integer related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default when pi.type="upper") means use the first largest value in x (i.e., the maximum) as the upper bound, a value of 2 means use the second largest value, and so on. When pi.type="lower", this argument is set equal to 0 and the upper bound of the prediction interval is ub. |
lb, ub |
scalars indicating lower and upper bounds on the distribution. By default, lb=-Inf and ub=Inf. If you are constructing a prediction interval for a distribution that you know has a lower bound other than -Inf (e.g., 0), set lb to this value. Similarly for ub. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="upper" (the default) and pi.type="lower". |
integrate.args.list |
a list of arguments to supply to the integrate function. The default value is integrate.args.list=NULL. |
What is a Nonparametric Simultaneous Prediction Interval?
A nonparametric prediction interval for some population is an interval on the real line
constructed so that it will contain at least k of m
future observations from
that population with some specified probability
(1-alpha)100%, where 0 < alpha < 1
and k and m
are some pre-specified positive integers
with k <= m. The quantity (1-alpha)
is called
the confidence coefficient or confidence level associated with the prediction
interval. The function
predIntNpar
computes a standard
nonparametric prediction interval.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval that will contain a certain number of future observations
with probability (1-alpha)100% for each of r
future sampling
"occasions",
where r
is some pre-specified positive integer. The quantity r
may
refer to
r distinct future sampling occasions in time, or it may for example
refer to sampling at
r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the
r distinct locations.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval based on one of three possible rules:
For the k-of-m
rule (
rule="k.of.m"
), at least k of
the next m
future observations will fall in the prediction
interval with probability
(1-alpha)100% on each of the r
future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m
observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: For this rule, when r = 1
, the results of
predIntNparSimultaneous
are equivalent to the results of predIntNpar
.
For the California rule (rule="CA"
), with probability
(1-alpha)100%, for each of the r
future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m - 1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m - 1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"
), with probability
(1-alpha)100%, for each of the r
future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
Nonparametric simultaneous prediction intervals can be extended to using medians
in place of single observations (USEPA, 2009, Chapter 19). That is, you can
create a nonparametric simultaneous prediction interval that will contain a
specified number of medians (based on which rule you choose) on each of
r future sampling occasions, where each median is based on
b individual
observations. For the function
predIntNparSimultaneous
, the argument
n.median
corresponds to b.
The Form of a Nonparametric Prediction Interval
Let x = (x_1, x_2, ..., x_n) denote a vector of n
independent observations from some continuous distribution, and let
x_(i) denote the i'th order statistic in x.
A two-sided nonparametric prediction interval is constructed as:

[x_(u), x_(v)]    (1)

where u and v are positive integers between 1 and n, and u < v. That is,
u denotes the rank of the lower prediction limit, and v
denotes the rank of the upper prediction limit. To make it easier to write
some equations later on, we can also write the prediction interval (1) in a slightly
different way as:

[x_(u), x_(n+1-w)]    (2)

where

w = n + 1 - v    (3)

so that w is a positive integer between 1 and n, and u < n + 1 - w.
In terms of the arguments to the function
predIntNparSimultaneous
,
the argument lpl.rank
corresponds to u, and the argument
n.plus.one.minus.upl.rank
corresponds to w.
If we allow u = 0 and w = 0
and define lower and upper bounds as:

x_(0) = lb    (4)

x_(n+1) = ub    (5)

then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as:

[x_(u), ub]    (6)

and a one-sided upper nonparametric prediction interval is constructed as:

[lb, x_(n+1-w)]    (7)

Usually, u = 0 or 1, and w = 0 or 1.
Note: For nonparametric simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper (pi.type="upper"
) prediction
intervals are available.
Constructing Nonparametric Simultaneous Prediction Intervals for Future Observations
First we will show how to construct a nonparametric simultaneous prediction interval based on
future observations (i.e., b = 1,
n.median=1
), and then extend the formulas to
future medians.
Simultaneous Prediction Intervals for the k-of-m Rule (rule="k.of.m")
For the k-of-m
rule (
rule="k.of.m"
) with b = 1
(i.e.,
n.median=1
), at least k of the next m
future
observations will fall in the prediction interval
with probability
(1-alpha)100% on each of the r
future sampling
occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m
observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When
r = 1, this kind of simultaneous prediction
interval becomes the same as a standard nonparametric prediction interval
(see
predIntNpar
).
Chou and Owen (1986) developed the theory for nonparametric simultaneous prediction limits
for various rules, including the 1-of-m rule. Their theory, however, does not cover
the California or Modified California rules, and involves an r-fold summation with a
large number of terms. Davis and McNichols (1994b; 1999) extended the results of
Chou and Owen (1986) to include the California and Modified California rule, and developed
algorithms that involve summing far fewer terms.
Davis and McNichols (1999) give formulas for the probabilities associated with the
one-sided upper simultaneous prediction interval shown in Equation (7). For the
k-of-m
rule, the probability that at least k
of the next m
future observations will be contained in the interval given in Equation (7) for each
of r
future sampling occasions is given by:

1 - alpha = Integral from 0 to 1 of { [ sum(i = 0, ..., m-k) choose(k+i-1, i) * v^k * (1-v)^i ]^r * f(v) } dv    (8)

where B denotes a random variable with a beta distribution
with parameters
n+1-w and w, and f()
denotes the pdf of this
distribution. Note that w
denotes the rank of the order statistic used as the
upper prediction limit (i.e.,
n.plus.one.minus.upl.rank=w), and
that w
is usually equal to 1.
Also note that the summation term in Equation (8) corresponds to the cumulative
distribution function of a Negative Binomial distribution
with parameters size=k
and
prob=v
evaluated at
q=m-k
.
When pi.type="lower"
, B denotes a random variable with a
beta distribution with parameters
n+1-u and u. Note that u
denotes the rank of the order statistic used as the
lower prediction limit (i.e.,
lpl.rank=u), and
that u
is usually equal to 1.
Simultaneous Prediction Intervals for the California Rule (rule="CA")
For the California rule (rule="CA"
), with probability (1-alpha)100%,
for each of the r
future sampling occasions, either the first observation will
fall in the prediction interval, or else all of the next m - 1
observations will
fall in the prediction interval. That is, if the first observation falls in the
prediction interval then sampling can stop. Otherwise,
m - 1 more observations
must be taken.
In this case, the probability is given by:

1 - alpha = Integral from 0 to 1 of { [ v + (1-v) * v^(m-1) ]^r * f(v) } dv    (9)

where f() is the same beta pdf as in Equation (8).
Simultaneous Prediction Intervals for the Modified California Rule (rule="Modified.CA")
For the Modified California rule (rule="Modified.CA"
), with probability
(1-alpha)100%, for each of the r
future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
In this case, the probability is given by:

1 - alpha = Integral from 0 to 1 of { [ v + (1-v) * (3*v^2 - 2*v^3) ]^r * f(v) } dv    (10)

where, as in Equation (8), f() denotes the pdf of the beta random variable B.
Davis and McNichols (1999) provide algorithms for computing the probabilities based on expanding
polynomials and the formula for the expected value of a beta random variable. In the discussion
section of Davis and McNichols (1999), however, Vangel points out that numerical integration is
adequate, and this is how these probabilities are computed in the function
predIntNparSimultaneous
.
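The following sketch (ours, not part of the original help file) carries out that numerical integration for the 1-of-3 rule with r = 1, an upper limit at the maximum (w = 1), and n = 20 background observations, so the beta density is that of a Beta(n, 1) random variable; the result should match the 0.9994353 confidence level reported in the first example below.

n <- 20; k <- 1; m <- 3; r <- 1
integrand <- function(v) {
    i <- 0:(m - k)
    # per-occasion probability that at least k of m future values fall below the limit
    p.occasion <- sapply(v, function(vv)
        sum(choose(k + i - 1, i) * vv^k * (1 - vv)^i))
    p.occasion^r * dbeta(v, n, 1)
}
integrate(integrand, lower = 0, upper = 1)$value
#[1] 0.9994353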
Constructing Nonparametric Simultaneous Prediction Intervals for Future Medians
USEPA (2009, Chapter 19; Cameron, 2011) extends nonparametric simultaneous
prediction intervals to testing future medians for the case of the 1-of-1 and
1-of-2 plans for medians of order 3. In general, each of the rules
(k-of-m, California, and Modified California) can be easily
extended to the case of using medians as long as the medians are based on an
odd (as opposed to even) sample size.
For each of the above rules, if we are interested in using medians instead of
single observations (i.e., b > 1;
n.median
= b), and we
force b
to be odd, then a median will be less than a prediction limit
once (b+1)/2
observations are less than the prediction limit. Thus,
Equations (8) - (10) are modified by replacing the per-observation probability v
in the term raised to the power r with:

sum(j = (b+1)/2, ..., b) choose(b, j) * v^j * (1-v)^(b-j)

which is the probability that a median of b future observations falls below the
prediction limit.
a list of class "estimate"
containing the simultaneous prediction interval
and other information. See the help file for estimate.object
for
details.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Cameron, Kirk. (2011). Personal communication, February 16, 2011. MacStat Consulting, Ltd., Colorado Springs, Colorado.
Chew, V. (1968). Simultaneous Prediction Intervals. Technometrics 10(2), 323–331.
Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Normal Population on Each of r Future Occasions. Technometrics 29, 359–370.
Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.
Davis, C.B., and R.J. McNichols. (1994b). Ground Water Monitoring Statistics Update: Part II: Nonparametric Prediction Limits. Ground Water Monitoring and Remediation 14(4), 159–175.
Davis, C.B., and R.J. McNichols. (1999). Simultaneous Nonparametric Prediction Limits (with Discussion). Technometrics 41(2), 89–112.
Gibbons, R.D. (1987a). Statistical Prediction Intervals for the Evaluation of Ground-Water Quality. Ground Water 25, 455–465.
Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.
Gibbons, R.D., and J. Baker. (1991). The Properties of Various Statistical Prediction Intervals for Ground-Water Detection Monitoring. Journal of Environmental Science and Health A26(4), 535–553.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.
Hall, I.J., R.R. Prairie, and C.K. Motlagh. (1975). Non-Parametric Prediction Intervals. Journal of Quality Technology 7(3), 109–114.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNparSimultaneousConfLevel
,
predIntNparSimultaneousN
, plotPredIntNparSimultaneousDesign
,
predIntNparSimultaneousTestPower
,
predIntNpar
, tolIntNpar
,
estimate.object
.
# Generate 20 observations from a lognormal mixture distribution with # parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1. Use # predIntNparSimultaneous to construct an upper one-sided prediction interval # using the maximum observed value using the 1-of-3 rule. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormMixAlt(n = 20, mean1 = 1, cv1 = 0.5, mean2 = 5, cv2 = 1, p.mix = 0.1) predIntNparSimultaneous(dat, k = 1, m = 3, lb = 0) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99.94353% # #Prediction Limit Rank(s): 20 # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = 0.000000 # UPL = 1.817311 #---------- # Compare the confidence levels for the 1-of-3 rule, California Rule, and # Modified California Rule. predIntNparSimultaneous(dat, k = 1, m = 3, lb = 0)$interval$conf.level #[1] 0.9994353 predIntNparSimultaneous(dat, m = 3, rule = "CA", lb = 0)$interval$conf.level #[1] 0.9919066 predIntNparSimultaneous(dat, rule = "Modified.CA", lb = 0)$interval$conf.level #[1] 0.9984943 #========= # Repeat the above example, but create the baseline data using just # n=8 observations and set r to 4 future sampling occasions set.seed(598) dat <- rlnormMixAlt(n = 8, mean1 = 1, cv1 = 0.5, mean2 = 5, cv2 = 1, p.mix = 0.1) predIntNparSimultaneous(dat, k = 1, m = 3, r = 4, lb = 0) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 97.7599% # #Prediction Limit Rank(s): 8 # #Minimum Number of #Future Observations #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Observations #(per Sampling Occasion): 3 # #Number of Future #Sampling Occasions: 4 # #Prediction Interval: LPL = 0.000000 # UPL = 5.683453 #---------- # Compare the confidence levels for the 1-of-3 rule, California Rule, and # Modified California Rule. predIntNparSimultaneous(dat, k = 1, m = 3, r = 4, lb = 0)$interval$conf.level #[1] 0.977599 predIntNparSimultaneous(dat, m = 3, r = 4, rule = "CA", lb = 0)$interval$conf.level #[1] 0.8737798 predIntNparSimultaneous(dat, r = 4, rule = "Modified.CA", lb = 0)$interval$conf.level #[1] 0.9510178 #========== # Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will compute the confidence level associated with two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest background value. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. 
# There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # n = 20 based on 4 Background Wells # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year n <- 20 nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 alpha <- 1 - conf.level alpha #[1] 0.02085164 #---------- # Look at the data: head(EPA.09.Ex.19.5.mercury.df) # Event Well Well.type Mercury.ppb.orig Mercury.ppb Censored #1 1 BG-1 Background 0.21 0.21 FALSE #2 2 BG-1 Background <.2 0.20 TRUE #3 3 BG-1 Background <.2 0.20 TRUE #4 4 BG-1 Background <.2 0.20 TRUE #5 5 BG-1 Background <.2 0.20 TRUE #6 6 BG-1 Background NA FALSE longToWide(EPA.09.Ex.19.5.mercury.df, "Mercury.ppb.orig", "Event", "Well", paste.row.name = TRUE) # BG-1 BG-2 BG-3 BG-4 CW-1 CW-2 #Event.1 0.21 <.2 <.2 <.2 0.22 0.36 #Event.2 <.2 <.2 0.23 0.25 0.2 0.41 #Event.3 <.2 <.2 <.2 0.28 <.2 0.28 #Event.4 <.2 0.21 0.23 <.2 0.25 0.45 #Event.5 <.2 <.2 0.24 <.2 0.24 0.43 #Event.6 <.2 0.54 # Construct the upper simultaneous prediction limit using the 1-of-2 # retesting plan for a median of order 3 based on the background maximum Hg.Back <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb[Well.type == "Background"]) pred.int.1.of.2.med.3 <- predIntNparSimultaneous(Hg.Back, n.median = 3, k = 1, m = 2, r = r, lb = 0) pred.int.1.of.2.med.3 #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Hg.Back # #Sample Size: 20 # #Number NA/NaN/Inf's: 4 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99.40354% # #Prediction Limit Rank(s): 20 # #Minimum Number of #Future Medians #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Medians #(per Sampling Occasion): 2 # #Number of Future #Sampling Occasions: 10 # #Sample Size for Medians: 3 # #Prediction Interval: LPL = 0.00 # UPL = 0.28 # Note that the achieved confidence level of 99.4% is greater than the # required confidence level of 97.9%. # Now determine whether either compliance well indicates evidence of # Mercury contamination. # Compliance Well 1 #------------------ Hg.CW.1 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-1"]) Hg.CW.1 #[1] "0.22" "0.2" "<.2" "0.25" "0.24" "<.2" # The median of the first 3 observations is 0.2, which is less than # the UPL of 0.28, so there is no evidence of contamination. # Compliance Well 2 #------------------ Hg.CW.2 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-2"]) Hg.CW.2 #[1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54" # The median of the first 3 observations is 0.36, so 3 more observations have to # be looked at. 
The median of the second 3 observations is 0.45, which is # larger than the UPL of 0.28, so there is evidence of contamination. #---------- # Now create the upper simultaneous prediction limit using the 1-of-4 plan # on individual observations using the 3rd highest background value. pred.int.1.of.4.3rd <- predIntNparSimultaneous(Hg.Back, k = 1, m = 4, r = r, lb = 0, n.plus.one.minus.upl.rank = 3) pred.int.1.of.4.3rd #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Hg.Back # #Sample Size: 20 # #Number NA/NaN/Inf's: 4 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 98.64909% # #Prediction Limit Rank(s): 18 # #Minimum Number of #Future Observations #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Observations #(per Sampling Occasion): 4 # #Number of Future #Sampling Occasions: 10 # #Prediction Interval: LPL = 0.00 # UPL = 0.24 # Note that the achieved confidence level of 98.6% is greater than the # required confidence level of 97.9%. # Now determine whether either compliance well indicates evidence of # Mercury contamination. # Compliance Well 1 #------------------ Hg.CW.1 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-1"]) Hg.CW.1 #[1] "0.22" "0.2" "<.2" "0.25" "0.24" "<.2" # The first observation is less than the UPL of 0.24, which is less than # the UPL of 0.28, so there is no evidence of contamination. # Compliance Well 2 #------------------ Hg.CW.2 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-2"]) Hg.CW.2 #[1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54" # All of the first 4 observations are greater than the UPL of 0.24, so there # is evidence of contamination. #========== # Cleanup #-------- rm(dat, n, nw, nc, ne, r, conf.level, alpha, Hg.Back, pred.int.1.of.2.med.3, pred.int.1.of.4.3rd, Hg.CW.1, Hg.CW.2)
Compute the confidence level associated with a nonparametric simultaneous prediction interval based on one of three possible rules: k-of-m, California, or Modified California. Observations are assumed to come from a continuous distribution.
predIntNparSimultaneousConfLevel(n, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "upper", integrate.args.list = NULL)
n |
vector of positive integers specifying the sample sizes.
Missing ( |
n.median |
vector of positive odd integers specifying the sample size associated with the
future medians. The default value is |
k |
for the |
m |
vector of positive integers specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
vector of positive integers indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper
bound of the prediction interval. A value of |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
integrate.args.list |
list of arguments to supply to the |
If the arguments n
, k
, m
, r
, lpl.rank
, and
n.plus.one.minus.upl.rank
are not all the same length, they are replicated
to be the same length as the length of the longest argument.
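For example, supplying a vector of sample sizes returns one confidence level per sample size, with the shorter arguments recycled (compare with the examples below, which pass n = seq(5, 25, by = 5)):

# k, m, and r are recycled to match the length of n
predIntNparSimultaneousConfLevel(n = c(10, 20), k = 1, m = 3, r = 20)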
The function predIntNparSimultaneousConfLevel
computes the confidence level
based on Equation (8), (9), or (10) in the help file for
predIntNparSimultaneous
, depending on the value of the argument
rule
.
Note that when rule="k.of.m"
and r=1
, this is equivalent to a
standard nonparametric prediction interval and you can use the function
predIntNparConfLevel
instead.
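As a quick check of this equivalence (a hedged illustration; see the help file for predIntNparConfLevel for its argument defaults), the following two calls should return the same confidence level, namely the value 0.9994353 shown in the predIntNparSimultaneous examples:

predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 3, r = 1)
predIntNparConfLevel(n = 20, k = 1, m = 3, pi.type = "upper")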
vector of values between 0 and 1 indicating the confidence level associated with the specified simultaneous nonparametric prediction interval.
See the help file for predIntNparSimultaneous
.
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
predIntNparSimultaneous
,
predIntNparSimultaneousN
,
plotPredIntNparSimultaneousDesign
,
predIntNparSimultaneousTestPower
,
predIntNpar
, tolIntNpar
.
# For the 1-of-3 rule with r=20 future sampling occasions, look at how the # confidence level of a simultaneous nonparametric prediction interval # increases with increasing sample size: seq(5, 25, by = 5) #[1] 5 10 15 20 25 conf <- predIntNparSimultaneousConfLevel(n = seq(5, 25, by = 5), k = 1, m = 3, r = 20) round(conf, 2) #[1] 0.82 0.95 0.98 0.99 0.99 #---------- # For the 1-of-m rule with r=20 future sampling occasions, look at how the # confidence level of a simultaneous nonparametric prediction interval # increases as the number of future observations increases: 1:5 #[1] 1 2 3 4 5 conf <- predIntNparSimultaneousConfLevel(n = 10, k = 1, m = 1:5, r = 20) round(conf, 2) #[1] 0.33 0.81 0.95 0.98 0.99 #---------- # For the 1-of-3 rule, look at how the confidence level of a simultaneous # nonparametric prediction interval decreases with number of future sampling # occasions (r): seq(5, 20, by = 5) #[1] 5 10 15 20 conf <- predIntNparSimultaneousConfLevel(n = 10, k = 1, m = 3, r = seq(5, 20, by = 5)) round(conf, 2) #[1] 0.98 0.97 0.96 0.95 #---------- # For the 1-of-3 rule with r=20 future sampling occasions, look at how the # confidence level of a simultaneous nonparametric prediction interval # decreases as the rank of the upper prediction limit decreases: conf <- predIntNparSimultaneousConfLevel(n = 10, k = 1, m = 3, r = 20, n.plus.one.minus.upl.rank = 1:5) round(conf, 2) #[1] 0.95 0.82 0.63 0.43 0.25 #---------- # Clean up #--------- rm(conf) #========== # Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will compute the confidence level associated with two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest background value. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. 
Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # n = 20 based on 4 Background Wells # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year n <- 20 nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 # So the required confidence level is 0.98, or 98%. # Now determine the confidence level associated with each plan. # Note that both plans achieve the required confidence level. # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousConfLevel(n = 20, n.median = 3, k = 1, m = 2, r = r) #[1] 0.9940354 # 2) the 1-of-4 plan on individual observations using the 3rd highest # background value. predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3) #[1] 0.9864909 #========== # Cleanup #-------- rm(n, nw, nc, ne, r, conf.level)
Compute the sample size necessary for a nonparametric simultaneous prediction interval to achieve a specified confidence level based on one of three possible rules: k-of-m, California, or Modified California. Observations are assumed to come from a continuous distribution.
predIntNparSimultaneousN(n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "upper", conf.level = 0.95, n.max = 5000, integrate.args.list = NULL, maxiter = 1000)
n.median |
vector of positive odd integers specifying the sample size associated with the
future medians. The default value is |
k |
for the |
m |
vector of positive integers specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
vector of positive integers indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level
associated with the prediction interval. The default value is |
n.max |
numeric scalar indicating the maximum sample size to consider. This argument
is used in the search algorithm to determine the required sample size. The
default value is |
integrate.args.list |
list of arguments to supply to the |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
If the arguments k
, m
, r
, lpl.rank
, and
n.plus.one.minus.upl.rank
are not all the same length, they are replicated
to be the same length as the length of the longest argument.
The function predIntNparSimultaneousN
computes the required sample size
by solving Equation (8), (9), or (10) in the help file for
predIntNparSimultaneous
for n (the sample size), depending on the value of the
argument
rule
.
Note that when rule="k.of.m"
and r=1
, this is equivalent to a
standard nonparametric prediction interval and you can use the function
predIntNparN
instead.
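The following sketch illustrates the idea of the search (the helper name find.n.by.search is hypothetical; it is not the algorithm used internally): increase the sample size until the achieved confidence level, as returned by predIntNparSimultaneousConfLevel, reaches the target:

find.n.by.search <- function(conf.level, k = 1, m = 2, r = 1, rule = "k.of.m",
                             n.max = 5000) {
  for (n in 2:n.max) {
    achieved <- predIntNparSimultaneousConfLevel(n = n, k = k, m = m, r = r,
      rule = rule)
    if (achieved >= conf.level) return(n)
  }
  NA_integer_
}

find.n.by.search(conf.level = 0.95, k = 1, m = 3, r = 20)
#[1] 11

which agrees with predIntNparSimultaneousN(k = 1, m = 3, r = 20) in the examples below.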
vector of positive integers indicating the required sample size(s) for the specified nonparametric simultaneous prediction interval(s).
See the help file for predIntNparSimultaneous
.
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
predIntNparSimultaneous
,
predIntNparSimultaneousConfLevel
, plotPredIntNparSimultaneousDesign
,
predIntNparSimultaneousTestPower
,
predIntNpar
, tolIntNpar
.
# For the 1-of-2 rule, look at how the required sample size for a one-sided # upper simultaneous nonparametric prediction interval for r=20 future # sampling occasions increases with increasing confidence level: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 predIntNparSimultaneousN(r = 20, conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 4 5 7 10 17 #---------- # For the 1-of-m rule, look at how the required sample size for a one-sided # upper simultaneous nonparametric prediction interval decreases with increasing # number of future observations (m), given r=20 future sampling occasions: predIntNparSimultaneousN(k = 1, m = 1:5, r = 20) #[1] 380 26 11 7 5 #---------- # For the 1-of-3 rule, look at how the required sample size for a one-sided # upper simultaneous nonparametric prediction interval increases with number # of future sampling occasions (r): predIntNparSimultaneousN(k = 1, m = 3, r = c(5, 10, 15, 20)) #[1] 7 8 10 11 #---------- # For the 1-of-3 rule, look at how the required sample size for a one-sided # upper simultaneous nonparametric prediction interval increases as the rank # of the upper prediction limit decreases, given r=20 future sampling occasions: predIntNparSimultaneousN(k = 1, m = 3, r = 20, n.plus.one.minus.upl.rank = 1:5) #[1] 11 19 26 34 41 #---------- # Compare the required sample size for r=20 future sampling occasions based # on the 1-of-3 rule, the CA rule with m=3, and the Modified CA rule. predIntNparSimultaneousN(k = 1, m = 3, r = 20, rule = "k.of.m") #[1] 11 predIntNparSimultaneousN(m = 3, r = 20, rule = "CA") #[1] 36 predIntNparSimultaneousN(r = 20, rule = "Modified.CA") #[1] 15 #========== # Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will modify the example to compute the required number of background # observations for two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest background value. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. 
Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 # So the required confidence level is 0.98, or 98%. # Now determine the required number of background observations for each plan. # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousN(n.median = 3, k = 1, m = 2, r = r, conf.level = conf.level) #[1] 14 # 2) the 1-of-4 plan on individual observations using the 3rd highest # background value. predIntNparSimultaneousN(k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3, conf.level = conf.level) #[1] 18 #========== # Cleanup #-------- rm(nw, nc, ne, r, conf.level)
Compute the probability that at least one set of future observations violates the given rule based on a nonparametric simultaneous prediction interval for the next r future sampling occasions. The three possible rules are: k-of-m, California, or Modified California. The probability is based on assuming the true distribution of the observations is normal.
predIntNparSimultaneousTestPower(n, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), delta.over.sigma = 0, pi.type = "upper", r.shifted = r, method = "approx", NMC = 100, ci = FALSE, ci.conf.level = 0.95, integrate.args.list = NULL, evNormOrdStats.method = "royston")
n |
vector of positive integers specifying the sample sizes.
Missing ( |
n.median |
vector of positive odd integers specifying the sample size associated with the
future medians. The default value is |
k |
for the |
m |
vector of positive integers specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
vector of non-negative integers indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
vector of non-negative integers related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of |
delta.over.sigma |
numeric vector indicating the ratio |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
r.shifted |
vector of positive integers specifying the number of future sampling occasions for
which the scaled mean is shifted by |
method |
character string indicating what method to use to compute the power. The possible
values are |
NMC |
positive integer indicating the number of Monte Carlo trials to run when |
ci |
logical scalar indicating whether to compute a confidence interval for the power
when |
ci.conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the
confidence interval for the power. The argument is ignored if |
integrate.args.list |
list of arguments to supply to the |
evNormOrdStats.method |
character string indicating which method to use in the call to
|
What is a Nonparametric Simultaneous Prediction Interval?
A nonparametric prediction interval for some population is an interval on the real line constructed so that it will contain at least k of m future observations from that population with some specified probability (1 - alpha)100%, where 0 < alpha < 1 and k and m are some pre-specified positive integers and k <= m. The quantity (1 - alpha)100% is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNpar computes a standard nonparametric prediction interval.
The function predIntNparSimultaneous computes a nonparametric simultaneous prediction interval that will contain a certain number of future observations with probability (1 - alpha)100% for each of r future sampling “occasions”, where r is some pre-specified positive integer. The quantity r may refer to r distinct future sampling occasions in time, or it may for example refer to sampling at r distinct locations on one future sampling occasion, assuming that the population standard deviation is the same at all of the r distinct locations.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of the next m future observations will fall in the prediction interval with probability (1 - alpha)100% on each of the r future sampling occasions. If observations are being taken sequentially, for a particular sampling occasion, up to m observations may be taken, but once k of the observations fall within the prediction interval, sampling can stop. Note: For this rule, when r = 1, the results of predIntNparSimultaneous are equivalent to the results of predIntNpar.
For the California rule (rule="CA"), with probability (1 - alpha)100%, for each of the r future sampling occasions, either the first observation will fall in the prediction interval, or else all of the next m-1 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability (1 - alpha)100%, for each of the r future sampling occasions, either the first observation will fall in the prediction interval, or else at least 2 out of the next 3 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, up to 3 more observations must be taken.
Nonparametric simultaneous prediction intervals can be extended to using medians in place of single observations (USEPA, 2009, Chapter 19). That is, you can create a nonparametric simultaneous prediction interval that will contain a specified number of medians (based on which rule you choose) on each of r future sampling occasions, where each median is based on b individual observations. For the function predIntNparSimultaneous, the argument n.median corresponds to b.
The Form of a Nonparametric Prediction Interval
Let x = (x_1, x_2, ..., x_n) denote a vector of n independent observations from some continuous distribution, and let x_(i) denote the i'th order statistic in x. A two-sided nonparametric prediction interval is constructed as:

[x_(u), x_(v)]          (1)

where u and v are positive integers between 1 and n, and u < v. That is, u denotes the rank of the lower prediction limit, and v denotes the rank of the upper prediction limit. To make it easier to write some equations later on, we can also write the prediction interval (1) in a slightly different way as:

[x_(u), x_(n+1-w)]          (2)

where

w = n + 1 - v          (3)

so that w is a positive integer between 1 and n, and u < n + 1 - w. In terms of the arguments to the function predIntNparSimultaneous, the argument lpl.rank corresponds to u, and the argument n.plus.one.minus.upl.rank corresponds to w.
If we allow u = 0 and w = 0 and define lower and upper bounds as:

x_(0) = lb          (4)

x_(n+1) = ub          (5)

then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as:

[x_(u), ub]          (6)

and a one-sided upper nonparametric prediction interval is constructed as:

[lb, x_(n+1-w)]          (7)

Usually, lb = 0 or lb = -Inf, and ub = Inf.
Note: For nonparametric simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper (pi.type="upper"
) prediction
intervals are available.
Computing Power
The "power" of the prediction interval is defined as the probability that
at least one set of future observations violates the given rule based on a
simultaneous prediction interval for the next future sampling occasions,
where the population for the future observations is allowed to differ from
the population for the observations used to construct the prediction interval.
For the function predIntNparSimultaneousTestPower
, power is computed assuming
both the background and future the observations come from normal distributions with
the same standard deviation, but the means of the distributions are allowed to differ.
The quantity (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity
(sigma) denotes the population standard deviation of both
of these populations. The argument
delta.over.sigma
corresponds to the
quantity .
Approximate Power (method="approx"
)
Based on Gansecki (2009), the power of a nonparametric simultaneous prediction interval when the underlying observations come from a normal distribution can be approximated by the power of a normal simultaneous prediction interval (see predIntNormSimultaneousTestPower) where the multiplier K is replaced with the expected value of the normal order statistic that corresponds to the rank of the order statistic used for the upper or lower bound of the prediction interval. Gansecki (2009) uses an approximation of the form

K = Phi^-1(p_i)

where Phi denotes the cumulative distribution function of the standard normal distribution, i denotes the rank of the order statistic used as the prediction limit, and p_i denotes a plotting position based on i and the sample size. By default, the value of the argument
evNormOrdStats.method="royston"
, so the function
predIntNparSimultaneousTestPower
uses the exact value of the
expected value of the normal order statistic in the call to
evNormOrdStatsScalar
. You can change the
method of computing the expected value of the normal order statistic by
changing the value of the argument evNormOrdStats.method
.
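For example, with an upper prediction limit set at the maximum of n = 20 background observations, the multiplier used in the approximation is the expected value of the largest of 20 standard normal order statistics (the argument names r for the rank and n for the sample size are assumed here; see the help file for evNormOrdStatsScalar):

# Expected value of the largest order statistic from a sample of 20
# standard normal observations; this plays the role of the multiplier K
# in the power approximation
evNormOrdStatsScalar(r = 20, n = 20)
# approximately 1.87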
Power Based on Monte Carlo Simulation (method="simulate"
)
When method="simulate"
, the power of the nonparametric simultaneous
prediction interval is estimated based on a Monte Carlo simulation. The argument
NMC
determines the number of Monte Carlo trials. If ci=TRUE
, a
confidence interval for the power is created based on the NMC
Monte Carlo
estimates of power.
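A minimal sketch of this idea (independent of the package internals; the helper name sim.power is hypothetical), for the 1-of-4 plan on individual observations with an upper limit at the 3rd highest of n = 20 background values, r = 10 future sampling occasions, and one occasion shifted by delta.over.sigma:

sim.power <- function(n = 20, k = 1, m = 4, r = 10, upl.rank.from.top = 3,
                      delta.over.sigma = 3, r.shifted = 1, NMC = 10000) {
  viol <- replicate(NMC, {
    # Background sample and upper prediction limit (3rd highest background value)
    upl <- sort(rnorm(n), decreasing = TRUE)[upl.rank.from.top]
    shifts <- c(rep(delta.over.sigma, r.shifted), rep(0, r - r.shifted))
    # A sampling occasion violates the rule if fewer than k of its m
    # observations fall at or below the prediction limit
    fails <- vapply(shifts,
      function(d) sum(rnorm(m, mean = d) <= upl) < k, logical(1))
    any(fails)
  })
  mean(viol)
}

set.seed(123)
sim.power()
# Should be roughly comparable to the approximate power of about 0.87 reported
# in the examples below for delta.over.sigma = 3 (Monte Carlo error of a few
# percentage points is expected)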
vector of values between 0 and 1 equal to the probability that the rule will be violated.
See the help file for predIntNparSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNparSimultaneousTestPower
and plotPredIntNparSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
Gansecki, M. (2009). Using the Optimal Rank Values Calculator. US Environmental Protection Agency, Region 8, March 10, 2009.
plotPredIntNparSimultaneousTestPowerCurve
,
predIntNparSimultaneous
, predIntNparSimultaneousN
,
predIntNparSimultaneousConfLevel
, plotPredIntNparSimultaneousDesign
,
predIntNpar
, tolIntNpar
.
# Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will compute the confidence levels and powers associated with # two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest # background value. # Power will be computed assuming a normal distribution and setting # delta.over.sigma equal to 2, 3, and 4. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # n = 20 based on 4 Background Wells # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year n <- 20 nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 # So the required confidence level is 0.98, or 98%. # Now determine the confidence level associated with each plan. # Note that both plans achieve the required confidence level. # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousConfLevel(n = 20, n.median = 3, k = 1, m = 2, r = r) #[1] 0.9940354 # 2) the 1-of-4 plan based on individual observations using the 3rd highest # background value. predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3) #[1] 0.9864909 #------------------------------------------------------------------------------ # Compute approximate power of each plan to detect contamination at just 1 well # assuming true underying distribution of Hg is Normal at all wells and # using delta.over.sigma equal to 2, 3, and 4. #------------------------------------------------------------------------------ # Computer aproximate power for # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousTestPower(n = 20, n.median = 3, k = 1, m = 2, r = r, delta.over.sigma = 2:4, r.shifted = 1) #[1] 0.3953712 0.9129671 0.9983054 # Compute approximate power for # 2) the 1-of-4 plan based on individual observations using the 3rd highest # background value. 
predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3, delta.over.sigma = 2:4, r.shifted = 1) #[1] 0.4367972 0.8694664 0.9888779 #---------- ## Not run: # Compare estimated power using approximation method with estimated power # using Monte Carlo simulation for the 1-of-4 plan based on individual # observations using the 3rd highest background value. predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3, delta.over.sigma = 2:4, r.shifted = 1, method = "simulate", ci = TRUE, NMC = 1000) #[1] 0.437 0.863 0.989 #attr(,"conf.int") # [,1] [,2] [,3] #LCL 0.4111999 0.8451148 0.9835747 #UCL 0.4628001 0.8808852 0.9944253 ## End(Not run) #========== # Cleanup #-------- rm(n, nw, nc, ne, r, conf.level)
Estimate the mean of a Poisson distribution
, and
construct a prediction interval for the next observations or
next set of
sums.
predIntPois(x, k = 1, n.sum = 1, method = "conditional", pi.type = "two-sided", conf.level = 0.95, round.limits = TRUE)
x |
numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Poisson distribution
(i.e., |
k |
positive integer specifying the number of future observations or sums the
prediction interval should contain with confidence level |
n.sum |
positive integer specifying the sample size associated with the |
method |
character string specifying the method to use. The possible values are: See the DETAILS section for more information on these methods. The |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
round.limits |
logical scalar indicating whether to round the computed prediction limits to the
nearest integer. The default value is |
A prediction interval for some population is an interval on the real line constructed so
that it will contain k future observations or averages from that population with
some specified probability (1-α)100%, where 0 < α < 1 and k is some pre-specified
positive integer. The quantity (1-α)100% is called the confidence coefficient or
confidence level associated with the prediction interval.
In the case of a Poisson distribution, we have modified the usual meaning of a
prediction interval and instead construct an interval that will contain k future
observations or k future sums with a certain confidence level.
A prediction interval is a random interval; that is, the lower and/or upper bounds
are random variables computed based on sample statistics in the baseline sample.
Prior to taking one specific baseline sample, the probability that the prediction
interval will contain the next k averages is (1-α)100%. Once a specific baseline
sample is taken and the prediction interval based on that sample is computed, the
probability that that prediction interval will contain the next k averages is not
necessarily (1-α)100%, but it should be close.
If an experiment is repeated N times, and for each experiment:
1. A sample is taken and a (1-α)100% prediction interval for k=1 future observation is computed, and
2. One future observation is generated and compared to the prediction interval,
then the number of prediction intervals that actually contain the future observation
generated in step 2 above is a binomial random variable with parameters size=N and
prob=(1-α) (see Binomial).
If, on the other hand, only one baseline sample is taken and only one prediction
interval for k=1 future observation is computed, then the number of future observations
out of a total of N future observations that will be contained in that one prediction
interval is a binomial random variable with parameters size=N and prob=(1-α*), where
(1-α*) depends on the true population parameters and the computed bounds of the
prediction interval.
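The binomial behavior described above is easy to illustrate with a small simulation. The sketch below is not part of the original help file; the number of experiments, the value of lambda, and the baseline sample size are arbitrary, and the EnvStats function predIntPois described in this help topic is assumed to be available.

# Simulate N experiments: in each one, draw a baseline sample of 20 Poisson
# observations, construct a one-sided upper 95% prediction interval for the
# next single observation, and check whether one new observation falls within
# the interval.  The number of "hits" is approximately binomial with
# size = N and prob = 0.95 (a bit higher in practice because of the
# discreteness of the Poisson distribution).
set.seed(47)
N <- 500
hits <- sapply(1:N, function(i) {
  x <- rpois(20, lambda = 3)
  upl <- predIntPois(x, k = 1, pi.type = "upper")$interval$limits["UPL"]
  rpois(1, lambda = 3) <= upl
})
mean(hits)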
Because of the discrete nature of the Poisson distribution, even if the true mean of
the distribution, λ, were known exactly, the actual confidence level associated with a
prediction limit will usually not be exactly equal to (1-α)100%. For example, for the
Poisson distribution with parameter lambda=2, the interval [0, 4] contains 94.7% of
this distribution and the interval [0, 5] contains 98.3% of this distribution. Thus,
no interval can contain exactly 95% of this distribution, so it is impossible to
construct an exact 95% prediction interval for the next k=1 observation for a Poisson
distribution with parameter lambda=2.
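The coverage figures quoted above can be verified directly in R (this check is not part of the original help file):

ppois(4, lambda = 2)   # approximately 0.947, i.e., [0, 4] covers 94.7%
ppois(5, lambda = 2)   # approximately 0.983, i.e., [0, 5] covers 98.3%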
The Form of a Poisson Prediction Interval
Let x = c(x_1, x_2, ..., x_n) denote a vector of n observations from a Poisson
distribution with parameter lambda=λ. Also, let X denote the sum of these n random
variables, i.e.,
X = x_1 + x_2 + ... + x_n
Finally, let m denote the sample size associated with the k future sums (i.e.,
n.sum=m). When m=1, each sum is really just a single observation, so in the rest of
this help file the term “sums” replaces the phrase “observations or sums”.
Let y = c(y_1, y_2, ..., y_m) denote a vector of m future observations from a Poisson
distribution with parameter lambda=λ*, and set Y equal to the sum of these m random
variables, i.e.,
Y = y_1 + y_2 + ... + y_m
Then Y has a Poisson distribution with parameter lambda=mλ* (Johnson et al., 1992,
p.160). We are interested in constructing a prediction limit for the next value of Y,
or else the next k sums of m Poisson random variables, based on the observed value of
X and assuming λ* = λ.
For a Poisson distribution, the form of a two-sided prediction interval is:
[m(X/n) - K, m(X/n) + K]
where K is a constant that depends on the sample size n, the confidence level
(1-α)100%, the number of future sums k, and the sample size associated with the future
sums m. Do not confuse the constant K (uppercase K) with the number of future sums k
(lowercase k). The symbol K is used here to be consistent with the notation used for
prediction intervals for the normal distribution (see predIntNorm).
Similarly, the form of a one-sided lower prediction interval is:
[m(X/n) - K, ∞]
and the form of a one-sided upper prediction interval is:
[0, m(X/n) + K]
The derivation of the constant K is explained below.
Conditional Distribution (method="conditional")
Nelson (1970) derives a prediction interval for the case k=1 based on the conditional
distribution of Y given X+Y. He notes that the conditional distribution of Y given the
quantity X+Y=w is binomial with parameters size=w and prob=[mλ*/(mλ* + nλ)]
(Johnson et al., 1992, p.161). When k=1, the prediction limits are computed as those
most extreme values of Y that still yield a non-significant test of the hypothesis
H0: λ* = λ, which for the conditional distribution of Y is equivalent to the hypothesis
H0: prob=[m /(m + n)].
Using the relationship between the binomial and F-distribution (see the explanation of
exact confidence intervals in the help file for ebinom), Nelson (1982, p. 203) states
that exact two-sided (1-α)100% prediction limits [LPL, UPL] are the closest integer
solutions to a pair of equations, referred to below as Equations (8) and (9), that
involve quantiles of the F-distribution, where F(df1, df2, p) denotes the p'th quantile
of the F-distribution with df1 and df2 degrees of freedom.
If pi.type="lower", α/2 is replaced with α in Equation (8) for LPL, and UPL is set to ∞.
If pi.type="upper", α/2 is replaced with α in Equation (9) for UPL, and LPL is set to 0.
NOTE: This method is not extended to the case k > 1.
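The conditional binomial relationship that underlies this method can be checked empirically. The following simulation sketch is not part of the original help file; the values of n, m, lambda, and w are arbitrary, and under the null hypothesis the probability of “success” reduces to m/(m + n).

# Compare the simulated distribution of Y given X + Y = w with the binomial
# distribution with size = w and prob = m/(m + n).
set.seed(23)
n <- 20; m <- 1; lambda <- 2
X <- rpois(100000, n * lambda)
Y <- rpois(100000, m * lambda)
w <- 45
Y.cond <- Y[X + Y == w]
rbind(simulated   = sapply(0:3, function(y) mean(Y.cond == y)),
      theoretical = dbinom(0:3, size = w, prob = m / (m + n)))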
Conditional Distribution Approximation Based on Normal Distribution
(method="conditional.approx.normal")
Cox and Hinkley (1974, p.245) derive an approximate prediction interval for the case
k=1. Like Nelson (1970), they note that the conditional distribution of Y given the
quantity X+Y=w is binomial with parameters size=w and prob=[mλ*/(mλ* + nλ)], and that
the hypothesis H0: λ* = λ is equivalent to the hypothesis H0: prob=[m /(m + n)].
Cox and Hinkley (1974, p.245) suggest using the normal approximation to the binomial
distribution (in this case, without the continuity correction; see Zar, 2010,
pp.534-536 for information on the continuity correction associated with the normal
approximation to the binomial distribution). Under the null hypothesis H0: λ* = λ, the
standardized form of Y based on this binomial approximation (the quantity in
Equation (10)) is approximately distributed as a standard normal random variable.
The Case When k = 1
When k=1 and pi.type="two-sided", the prediction limits are computed by solving the
equation that sets the quantity in Equation (10) equal to z(1-α/2), where z(p) denotes
the p'th quantile of the standard normal distribution. In this case, Gibbons (1987b)
notes that the quantity K in Equation (3) above has a closed-form solution in terms of
z(1-α/2), X, n, and m.
When pi.type="lower" or pi.type="upper", K is computed exactly as above, except
z(1-α/2) is replaced with z(1-α).
The Case When k > 1
When k > 1, Gibbons (1987b) suggests using the Bonferroni inequality. That is, the
value of K is computed exactly as for the case k = 1 described above, except that the
Bonferroni value of α is used in place of the usual value of α:
When pi.type="two-sided", α/2 is replaced with α/(2k).
When pi.type="lower" or pi.type="upper", α is replaced with α/k.
Conditional Distribution Approximation Based on Student's t-Distribution
(method="conditional.approx.t")
When method="conditional.approx.t", the exact same procedure is used as when
method="conditional.approx.normal", except that the quantity in Equation (10) is
assumed to follow a Student's t-distribution with n-1 degrees of freedom. Thus, all
occurrences of z(p) are replaced with t(n-1, p), where t(n-1, p) denotes the p'th
quantile of Student's t-distribution with n-1 degrees of freedom.
Normal Approximation (method="normal.approx")
The normal approximation for Poisson prediction limits was given by Nelson (1970;
1982, p.203) and is based on the fact that the mean and variance of a Poisson
distribution are the same (Johnson et al., 1992, p.157), and for “large” values of n
and m, both X and Y are approximately normally distributed.
The Case When k = 1
The quantity Y - m(X/n) is approximately normally distributed with expectation 0 (when
λ* = λ) and variance mλ(1 + m/n), so the quantity
[Y - m(X/n)] / sqrt[m(X/n)(1 + m/n)]
is approximately distributed as a standard normal random variable. The function
predIntPois, however, assumes this quantity is distributed as approximately a Student's
t-distribution with n-1 degrees of freedom.
Thus, following the idea of prediction intervals for a normal distribution (see
predIntNorm), when pi.type="two-sided", the constant K for a (1-α)100% prediction
interval for the next k=1 sum of m observations is computed as:
K = t(n-1, 1-α/2) sqrt[m(X/n)(1 + m/n)]
where t(n-1, p) denotes the p'th quantile of a Student's t-distribution with n-1
degrees of freedom.
Similarly, when pi.type="lower" or pi.type="upper", the constant K is computed as:
K = t(n-1, 1-α) sqrt[m(X/n)(1 + m/n)]
The Case When k > 1
When k > 1, the value of K is computed exactly as for the case k = 1 described above,
except that the Bonferroni value of α is used in place of the usual value of α:
When pi.type="two-sided", α/2 is replaced with α/(2k).
When pi.type="lower" or pi.type="upper", α is replaced with α/k.
Hahn and Nelson (1973, p.182) discuss another method of computing K when k > 1, but
this method is not implemented here.
If x
is a numeric vector, predIntPois
returns a list of class
"estimate"
containing the estimated parameter, the prediction interval,
and other information. See the help file for estimate.object
for details.
If x
is the result of calling an estimation function,
predIntPois
returns a list whose class is the same as x
.
The list contains the same components as x
, as well as a component called
interval
containing the prediction interval information.
If x
already has a component called interval
, this component is
replaced with the prediction interval information.
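A short sketch (not in the original help file) showing how to extract the prediction limits from the returned "estimate" object; the data are simulated.

set.seed(5)
pi.obj <- predIntPois(rpois(25, lambda = 4), pi.type = "upper")
class(pi.obj)             # "estimate"
pi.obj$interval$limits    # named vector with elements LPL and UPL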
Prediction and tolerance intervals have long been applied to quality control and life testing problems. Nelson (1970) notes that his development of confidence and prediction limits for the Poisson distribution is based on well-known results dating back to the 1950s. Hahn and Nelson (1973) review prediction intervals for several distributions, including Poisson prediction intervals. The monograph by Hahn and Meeker (1991) includes a discussion of Poisson prediction intervals.
Gibbons (1987b) uses the Poisson distribution to model the number of detected
compounds per scan of the 32 volatile organic priority pollutants (VOC), and also
to model the distribution of chemical concentration (in ppb), and presents formulas
for prediction and tolerance intervals. The formulas for prediction intervals are
based on Cox and Hinkley (1974, p.245). Gibbons (1987b) only deals with
the case where n.sum=1
.
Gibbons et al. (2009, pp. 72–76) discuss methods for Poisson prediction limits.
Steven P. Millard ([email protected])
Cox, D.R., and D.V. Hinkley. (1974). Theoretical Statistics. Chapman and Hall, New York, pp.242–245.
Gibbons, R.D. (1987b). Statistical Models for the Analysis of Volatile Organic Compounds in Waste Disposal Sites. Ground Water 25, 572–580.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken, pp. 72–76.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 4.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York, pp.8, 76–81.
Nelson, W.R. (1970). Confidence Intervals for the Ratio of Two Poisson Means and Poisson Predictor Intervals. IEEE Transactions on Reliability R-19, 42–49.
Nelson, W.R. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, pp.200–204.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, pp. 585–586.
Poisson
, epois
,
estimate.object
, Prediction Intervals,
tolIntPois
, Estimating Distribution Parameters.
# Generate 20 observations from a Poisson distribution with parameter # lambda=2. The interval [0, 4] contains 94.7% of this distribution and # the interval [0,5] contains 98.3% of this distribution. Thus, because # of the discrete nature of the Poisson distribution, no interval contains # exactly 95% of this distribution. Use predIntPois to estimate the mean # parameter of the true distribution, and construct a one-sided upper # 95% prediction interval for the next single observation from this distribution. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rpois(20, lambda = 2) predIntPois(dat, pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 1.8 # #Estimation Method: mle/mme/mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: conditional # #Prediction Interval Type: upper # #Confidence Level: 95% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0 # UPL = 5 #---------- # Compare results above with the other approximation methods: predIntPois(dat, method = "conditional.approx.normal", pi.type = "upper")$interval$limits #LPL UPL # 0 4 predIntPois(dat, method = "conditional.approx.t", pi.type = "upper")$interval$limits #LPL UPL # 0 4 predIntPois(dat, method = "normal.approx", pi.type = "upper")$interval$limits #LPL UPL # 0 4 #Warning message: #In predIntPois(dat, method = "normal.approx", pi.type = "upper") : # Estimated value of 'lambda' and/or number of future observations # is/are probably too small for the normal approximation to work well. #========== # Using the same data as in the previous example, compute a one-sided # upper 95% prediction limit for k=10 future observations. # Using conditional approximation method based on the normal distribution. predIntPois(dat, k = 10, method = "conditional.approx.normal", pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 1.8 # #Estimation Method: mle/mme/mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: conditional.approx.normal # #Prediction Interval Type: upper # #Confidence Level: 95% # #Number of Future Observations: 10 # #Prediction Interval: LPL = 0 # UPL = 6 # Using method based on approximating conditional distribution with # Student's t-distribution predIntPois(dat, k = 10, method = "conditional.approx.t", pi.type = "upper")$interval$limits #LPL UPL # 0 6 #========== # Repeat the above example, but set k=5 and n.sum=3. Thus, we want a # 95% upper prediction limit for the next 5 sets of sums of 3 observations. predIntPois(dat, k = 5, n.sum = 3, method = "conditional.approx.t", pi.type = "upper")$interval$limits #LPL UPL # 0 12 #========== # Reproduce Example 3.6 in Gibbons et al. (2009, p. 75) # A 32-constituent VOC scan was performed for n=16 upgradient # samples and there were 5 detections out of these 16. We # want to construct a one-sided upper 95% prediction limit # for 20 monitoring wells (so k=20 future observations) based # on these data. # First we need to create a data set that will yield a mean # of 5/16 based on a sample size of 16. Any number of data # sets will do. Here are two possible ones: dat <- c(rep(1, 5), rep(0, 11)) dat <- c(2, rep(1, 3), rep(0, 12)) # Now call predIntPois. Don't round the limits so we can # compare to the example in Gibbons et al. (2009). 
predIntPois(dat, k = 20, method = "conditional.approx.t", pi.type = "upper", round.limits = FALSE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 0.3125 # #Estimation Method: mle/mme/mvue # #Data: dat # #Sample Size: 16 # #Prediction Interval Method: conditional.approx.t # #Prediction Interval Type: upper # #Confidence Level: 95% # #Number of Future Observations: 20 # #Prediction Interval: LPL = 0.000000 # UPL = 2.573258 #========== # Cleanup #-------- rm(dat)
Formats and prints the results of calling the function boxcox
.
This method is automatically called by print
when given an
object of class "boxcox"
. The names of other functions involved in
Box-Cox transformations are listed under Data Transformations.
## S3 method for class 'boxcox' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "boxcox"
method for the generic function print
.
Prints the objective name, the name of the data object used, the sample size,
the values of the powers, and the values of the objective. In the case of
optimization, also prints the range of powers over which the optimization
took place.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
boxcox
, boxcox.object
, plot.boxcox
,
Data Transformations, print
.
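A minimal usage sketch, not taken from the original help file; the data are simulated, and it is assumed that boxcox accepts a numeric vector together with the optimize argument as documented elsewhere in this manual.

# Find the optimal Box-Cox transformation for simulated lognormal data;
# typing the object's name dispatches print.boxcox.
set.seed(12)
boxcox.list <- boxcox(rlnorm(30, meanlog = 1), optimize = TRUE)
boxcox.list   # equivalent to print(boxcox.list)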
Formats and prints the results of calling the function boxcoxCensored
.
This method is automatically called by print
when given an
object of class "boxcoxCensored"
.
## S3 method for class 'boxcoxCensored' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "boxcoxCensored"
method for the generic function
print
.
Prints the objective name, the name of the data object used, the sample size,
the percentage of censored observations, the values of the powers, and the
values of the objective. In the case of
optimization, also prints the range of powers over which the optimization
took place.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
boxcoxCensored
, boxcoxCensored.object
,
plot.boxcoxCensored
,
Data Transformations, print
.
Formats and prints the results of calling the function boxcox
when the argument x
supplied to boxcox
is an object of
class "lm"
. This method is automatically called by print
when given an object of class "boxcoxLm"
. The names of other functions
involved in Box-Cox transformations are listed under Data Transformations.
## S3 method for class 'boxcoxLm' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "boxcoxLm"
method for the generic function print
.
Prints the objective name, the details of the "lm"
object used,
the sample size,
the values of the powers, and the values of the objective. In the case of
optimization, also prints the range of powers over which the optimization
took place.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
boxcox
, boxcoxLm.object
, plot.boxcoxLm
,
Data Transformations, print
.
Formats and prints the results of calling the function distChoose
, which
uses a series of goodness-of-fit tests to choose among candidate distributions.
This method is automatically called by print
when given an
object of class "distChoose"
.
## S3 method for class 'distChoose' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "distChoose"
method for the generic function print
.
Prints the candidate distributions, method used to choose among the candidate distributions,
chosen distribution, Type I error associated with each goodness-of-fit test,
estimated population parameter(s) associated with the chosen distribution,
estimation method, goodness-of-fit test results for each candidate distribution,
and the data name.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
distChoose
, distChoose.object
,
Goodness-of-Fit Tests, print
.
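A minimal usage sketch, not from the original help file; the data are simulated and the default candidate distributions and settings of distChoose are assumed.

# Choose among the default candidate distributions for simulated gamma data;
# typing the object's name dispatches print.distChoose.
set.seed(77)
choose.list <- distChoose(rgamma(40, shape = 2, scale = 3))
choose.list   # equivalent to print(choose.list)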
Formats and prints the results of calling the function distChooseCensored
, which
uses a series of goodness-of-fit tests to choose among candidate distributions based on
censored data.
This method is automatically called by print
when given an
object of class "distChooseCensored"
.
## S3 method for class 'distChooseCensored' print(x, show.cen.levels = TRUE, pct.censored.digits = .Options$digits, ...)
x |
an object of class |
show.cen.levels |
logical scalar indicating whether to print the censoring levels. The default is
|
pct.censored.digits |
numeric scalar indicating the number of significant digits to print for the percent of censored observations. |
... |
arguments that can be supplied to the |
This is the "distChooseCensored"
method for the generic function
print
.
Prints the candidate distributions,
method used to choose among the candidate distributions,
chosen distribution, Type I error associated with each goodness-of-fit test,
estimated population parameter(s) associated with the chosen distribution,
estimation method, goodness-of-fit test results for each candidate distribution,
and the data name and censoring variable.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
distChooseCensored
, distChooseCensored.object
,
Censored Data, print
.
Formats and prints the results of EnvStats functions that estimate
the parameters or quantiles of a probability distribution and optionally
construct confidence, prediction, or tolerance intervals based on a sample
of data assumed to come from that distribution.
This method is automatically called by print
when given an
object of class "estimate"
.
See the help files Estimating Distribution Parameters and Estimating Distribution Quantiles for lists of functions that estimate distribution parameters and quantiles. See the help files Prediction Intervals and Tolerance Intervals for lists of functions that create prediction and tolerance intervals.
## S3 method for class 'estimate' print(x, conf.cov.sig.digits = .Options$digits, limits.sig.digits = .Options$digits, ...)
x |
an object of class |
conf.cov.sig.digits |
numeric scalar indicating the number of significant digits to print for the confidence level or coverage of a confidence, prediction, or tolerance interval. |
limits.sig.digits |
numeric scalar indicating the number of significant digits to print for the upper and lower limits of a confidence, prediction, or tolerance interval. |
... |
arguments that can be supplied to the |
This is the "estimate"
method for the generic function
print
.
Prints estimated parameters and, if present in the object, information regarding
confidence, prediction, or tolerance intervals.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
estimate.object
,
Estimating Distribution Parameters,
Estimating Distribution Quantiles,
Prediction Intervals, Tolerance Intervals,
print
.
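A brief sketch, not part of the original help file, showing how the digit arguments control the printed output; the data are simulated and enorm is the EnvStats estimator for the normal distribution.

# Estimate the mean and standard deviation of a normal distribution, request
# a 95% confidence interval for the mean, and print the result using fewer
# significant digits for the confidence limits.
set.seed(33)
est <- enorm(rnorm(20, mean = 10, sd = 2), ci = TRUE)
print(est, conf.cov.sig.digits = 2, limits.sig.digits = 3)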
Formats and prints the results of EnvStats functions that estimate
the parameters or quantiles of a probability distribution and optionally
construct confidence, prediction, or tolerance intervals based on a sample
of Type I censored data assumed to come from that distribution.
This method is automatically called by print
when given an
object of class "estimateCensored"
.
See the subsections Estimating Distribution Parameters and Estimating Distribution Quantiles in the help file Censored Data for lists of functions that estimate distribution parameters and quantiles based on Type I censored data.
See the subsection Prediction and Tolerance Intervals in the help file Censored Data for lists of functions that create prediction and tolerance intervals.
## S3 method for class 'estimateCensored' print(x, show.cen.levels = TRUE, pct.censored.digits = .Options$digits, conf.cov.sig.digits = .Options$digits, limits.sig.digits = .Options$digits, ...)
x |
an object of class |
show.cen.levels |
logical scalar indicating whether to print the censoring levels. The default is
|
pct.censored.digits |
numeric scalar indicating the number of significant digits to print for the percent of censored observations. |
conf.cov.sig.digits |
numeric scalar indicating the number of significant digits to print for the confidence level or coverage of a confidence, prediction, or tolerance interval. |
limits.sig.digits |
numeric scalar indicating the number of significant digits to print for the upper and lower limits of a confidence, prediction, or tolerance interval. |
... |
arguments that can be supplied to the |
This is the "estimateCensored"
method for the generic function
print
.
Prints estimated parameters and, if present in the object, information regarding
confidence, prediction, or tolerance intervals.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
estimateCensored.object
,
Censored Data, print
.
Formats and prints the results of performing a goodness-of-fit test. This method is
automatically called by print
when given an object of class "gof"
.
The names of the functions that perform goodness-of-fit tests and that produce objects of class
"gof"
are listed under Goodness-of-Fit Tests.
## S3 method for class 'gof' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "gof"
method for the generic function print
.
Prints name of the test, hypothesized distribution, estimated population parameter(s),
estimation method, data name, sample size, value of the test statistic, parameters
associated with the null distribution of the test statistic, p-value associated with the
test statistic, and the alternative hypothesis.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Goodness-of-Fit Tests, gof.object
,
print
.
Formats and prints the results of performing a goodness-of-fit test. This method is
automatically called by print
when given an object of class
"gofCensored"
. Currently, the only function that produces an object of
this class is gofTestCensored
.
## S3 method for class 'gofCensored' print(x, show.cen.levels = TRUE, pct.censored.digits = .Options$digits, ...)
x |
an object of class |
show.cen.levels |
logical scalar indicating whether to print the censoring levels. The default is
|
pct.censored.digits |
numeric scalar indicating the number of significant digits to print for the percent of censored observations. |
... |
arguments that can be supplied to the |
This is the "gofCensored"
method for the generic function
print
.
Prints name of the test, hypothesized distribution, estimated population parameter(s),
estimation method, data name, sample size, censoring information, value of the test
statistic, parameters associated with the null distribution of the test statistic,
p-value associated with the test statistic, and the alternative hypothesis.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Censored Data, gofCensored.object
,
print
.
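A brief sketch, not from the original help file, using simulated data that are artificially left-censored at a single censoring level; gofTestCensored (named above) is called with its default test and distribution.

# Run the default goodness-of-fit test on singly left-censored data and print
# the result without showing the censoring level.
set.seed(8)
x <- rnorm(30, mean = 10, sd = 2)
censored <- x < 8
x[censored] <- 8
gof.cen <- gofTestCensored(x, censored)
print(gof.cen, show.cen.levels = FALSE)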
Formats and prints the results of performing a group goodness-of-fit test.
This method is automatically called by print
when given an
object of class "gofGroup"
. Currently,
the only EnvStats function that performs a group goodness-of-fit test
that produces an object of class "gofGroup"
is gofGroupTest
.
## S3 method for class 'gofGroup' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "gofGroup"
method for the generic function
print
.
See the help file for gofGroup.object for information on the contents of this kind of object.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Goodness-of-Fit Tests,
gofGroup.object
,
print
.
Formats and prints the results of performing a goodness-of-fit test for outliers.
This method is automatically called by print
when given an object of
class "gofOutlier"
. The names of the functions that perform goodness-of-fit tests
for outliers and that produce objects of class "gofOutlier"
are listed under
Tests for Outliers.
## S3 method for class 'gofOutlier' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "gofOutlier"
method for the generic function print
.
Prints name of the test, hypothesized distribution, data name, sample size,
value of the test statistic, parameters associated with the null distribution of the
test statistic, Type I error, critical values, and the alternative hypothesis.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Tests for Outliers, gofOutlier.object
,
print
.
Formats and prints the results of performing a two-sample goodness-of-fit test.
This method is automatically called by print
when given an
object of class "gofTwoSample"
. Currently,
the only EnvStats function that performs a two-sample goodness-of-fit test
that produces an object of class "gofTwoSample"
is gofTest
.
## S3 method for class 'gofTwoSample' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "gofTwoSample"
method for the generic function
print
.
See the help file for gofTwoSample.object for information on the contents of this kind of object.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Goodness-of-Fit Tests, gofTwoSample.object
,
print
.
Formats and prints the results of performing a hypothesis test based on censored data.
This method is automatically called by print
when given an
object of class "htestCensored"
. The names of the EnvStats functions
that perform hypothesis tests based on censored data and that produce objects of
class "htestCensored"
are listed in the section Hypothesis Tests
in the help file
EnvStats Functions for Censored Data.
Currently, the only function listed is
twoSampleLinearRankTestCensored
.
## S3 method for class 'htestCensored' print(x, show.cen.levels = TRUE, ...)
x |
an object of class |
show.cen.levels |
logical scalar indicating whether to print the censoring levels. The default value
is |
... |
arguments that can be supplied to the |
This is the "htestCensored"
method for the generic function
print
.
Prints null and alternative hypotheses, name of the test, censoring side,
estimated population parameter(s) involved in the null hypothesis,
estimation method (if present),
data name, censoring variable, sample size (if present),
percent of observations that are censored,
number of missing observations removed prior to performing the test (if present),
value of the test statistic,
parameters associated with the null distribution of the test statistic,
p-value associated with the test statistic, and confidence interval for the
population parameter (if present).
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Censored Data, htestCensored.object
,
print
.
This is a modification of the R function print.htest
that formats and
prints the results of performing a hypothesis test. This method is
automatically called by the EnvStats generic function print
when
given an object of class "htestEnvStats"
. The names of the EnvStats functions
that perform hypothesis tests and that produce objects of class
"htestEnvStats"
are listed under Hypothesis Tests.
## S3 method for class 'htestEnvStats' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "htestEnvStats"
method for the EnvStats generic function
print
.
Prints null and alternative hypotheses, name of the test, estimated population
parameter(s) involved in the null hypothesis, estimation method (if present),
data name, sample size (if present), number of missing observations removed
prior to performing the test (if present), value of the test statistic,
parameters associated with the null distribution of the test statistic,
p-value associated with the test statistic, and confidence interval for the
population parameter (if present).
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Hypothesis Tests, htest.object
,
print
.
Formats and prints the results of performing a permutation test. This method is
automatically called by print
when given an object of class
"permutationTest"
. Currently, the EnvStats functions that perform
permutation tests and produce objects of class "permutationTest"
are:
oneSamplePermutationTest
,
twoSamplePermutationTestLocation
, and twoSamplePermutationTestProportion
.
## S3 method for class 'permutationTest' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "permutationTest"
method for the generic function
print
. Prints null and alternative hypotheses,
name of the test, estimated population
parameter(s) involved in the null hypothesis, estimation method (if present),
data name, sample size (if present), number of missing observations removed
prior to performing the test (if present), value of the test statistic,
parameters associated with the null distribution of the test statistic,
and p-value associated with the test statistic.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
permutationTest.object
, oneSamplePermutationTest
,
twoSamplePermutationTestLocation
,
twoSamplePermutationTestProportion
, Hypothesis Tests,
print
.
Formats and prints the results of calling summaryStats
or
summaryFull
. This method is automatically called by
print
when given an object of class "summaryStats"
.
## S3 method for class 'summaryStats' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "summaryStats"
method for the generic function
print
. Prints summary statistics.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
summaryStats
, summaryFull
,
summaryStats.object
, print
.
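A one-line usage sketch with simulated data (not part of the original help file).

# summaryStats returns an object of class "summaryStats"; typing the result
# at the prompt dispatches print.summaryStats.
set.seed(3)
summaryStats(rnorm(25, mean = 5))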
Compute the minimal detectable difference associated with a one- or two-sample proportion test, given the sample size, power, and significance level.
propTestMdd(n.or.n1, n2 = n.or.n1, p0.or.p2 = 0.5, alpha = 0.05, power = 0.95, sample.type = "one.sample", alternative = "two.sided", two.sided.direction = "greater", approx = TRUE, correct = sample.type == "two.sample", warn = TRUE, return.exact.list = TRUE, tol = 1e-07, maxiter = 1000)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is |
p0.or.p2 |
numeric vector of proportions. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power associated with
the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis.
The possible values are |
two.sided.direction |
character string indicating the direction (positive or negative) for the
minimal detectable difference when |
approx |
logical scalar indicating whether to compute the power based on the normal
approximation to the binomial distribution. The default value is |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning. The default value is |
return.exact.list |
logical scalar relevant to the case when |
tol |
numeric scalar passed to the |
maxiter |
integer passed to the |
If the arguments n.or.n1
, n2
, p0.or.p2
, alpha
, and
power
are not all the same length, they are replicated to be the same
length as the length of the longest argument.
One-Sample Case (sample.type="one.sample")
The help file for propTestPower gives references that explain how the power of the
one-sample proportion test is computed based on the values of p0 (the hypothesized
value of p, the probability of “success”), p (the true value of p), the sample size n,
and the Type I error level α. The function propTestMdd computes the value of the
minimal detectable difference for specified values of sample size, power, and Type I
error level by calling the uniroot function to perform a search.
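The root search described above can be mimicked directly. The following sketch is not from the original help file; it assumes the EnvStats function propTestPower with the argument names used elsewhere in this manual, and the sample size, baseline proportion, and target power are arbitrary.

# Find the one-sample minimal detectable difference by solving
# power(delta) = target.power with uniroot, then compare the answer with
# the value returned by propTestMdd.
target.power <- 0.8
n <- 50
p0 <- 0.5
f <- function(delta) {
  propTestPower(n.or.n1 = n, p.or.p1 = p0 + delta, p0.or.p2 = p0,
    alternative = "greater") - target.power
}
uniroot(f, interval = c(0.01, 0.45))$root
# Compare with the built-in function:
propTestMdd(n.or.n1 = n, p0.or.p2 = p0, power = target.power,
  alternative = "greater")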
Two-Sample Case (sample.type="two.sample")
The help file for propTestPower gives references that explain how the power of the
two-sample proportion test is computed based on the values of p1 (the value of the
probability of “success” for group 1), p2 (the value of the probability of “success”
for group 2), the sample sizes for groups 1 and 2 (n1 and n2), and the Type I error
level α. The function propTestMdd computes the value of the minimal detectable
difference for specified values of sample size, power, and Type I error level by
calling the uniroot function to perform a search.
Approximate Test (approx=TRUE
).
numeric vector of minimal detectable differences.
Exact Test (approx=FALSE
).
If return.exact.list=FALSE
, propTestMdd
returns a numeric vector of
minimal detectable differences.
If return.exact.list=TRUE
, propTestMdd
returns a list with the
following components:
delta |
numeric vector of minimal detectable differences. |
power |
numeric vector of powers. |
alpha |
numeric vector containing the true significance levels.
Because of the discrete nature of the binomial distribution, the true significance
levels usually do not equal the significance level supplied by the user in the
argument |
q.critical.lower |
numeric vector of lower critical values for rejecting the null
hypothesis. If the observed number of "successes" is less than or equal to these values,
the null hypothesis is rejected. (Not present if |
q.critical.upper |
numeric vector of upper critical values for rejecting the null
hypothesis. If the observed number of "successes" is greater than these values,
the null hypothesis is rejected. (Not present if |
See the help file for propTestPower
.
Steven P. Millard ([email protected])
See the help file for propTestPower
.
propTestPower
, propTestN
,
plotPropTestDesign
, prop.test
, binom.test
.
# Look at how the minimal detectable difference of the one-sample # proportion test increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 mdd <- propTestMdd(n.or.n1 = 50, power = seq(0.5, 0.9, by=0.1)) round(mdd, 2) #[1] 0.14 0.16 0.17 0.19 0.22 #---------- # Repeat the last example, but compute the minimal detectable difference # based on the exact test instead of the approximation. Note that with a # sample size of 50, the largest significance level less than or equal to # 0.05 for the two-sided alternative is 0.03. mdd.list <- propTestMdd(n.or.n1 = 50, power = seq(0.5, 0.9, by = 0.1), approx = FALSE) lapply(mdd.list, round, 2) #$delta #[1] 0.15 0.17 0.18 0.20 0.23 # #$power #[1] 0.5 0.6 0.7 0.8 0.9 # #$alpha #[1] 0.03 0.03 0.03 0.03 0.03 # #$q.critical.lower #[1] 17 17 17 17 17 # #$q.critical.upper #[1] 32 32 32 32 32 #========== # Look at how the minimal detectable difference for the two-sample # proportion test decreases with increasing sample sizes. Note that for # the specified significance level, power, and true proportion in group 2, # no minimal detectable difference is attainable for a sample size of 10 in # each group. seq(10, 50, by=10) #[1] 10 20 30 40 50 propTestMdd(n.or.n1 = seq(10, 50, by = 10), p0.or.p2 = 0.5, sample.type = "two", alternative="greater") #[1] NA 0.4726348 0.4023564 0.3557916 0.3221412 #Warning messages: #1: In propTestMdd(n.or.n1 = seq(10, 50, by = 10), p0.or.p2 = 0.5, # sample.type = "two", : # Elements with a missing value (NA) indicate no attainable minimal detectable # difference for the given values of 'n1', 'n2', 'p2', 'alpha', and 'power' #2: In propTestMdd(n.or.n1 = seq(10, 50, by = 10), p0.or.p2 = 0.5, # sample.type = "two", : # The sample sizes 'n1' and 'n2' are too small, relative to the computed value # of 'p1' and the given value of 'p2', for the normal approximation to work # well for the following element indices: # 2 3 #---------- # Look at how the minimal detectable difference for the two-sample proportion # test decreases with increasing values of Type I error: mdd <- propTestMdd(n.or.n1 = 100, n2 = 120, p0.or.p2 = 0.4, sample.type = "two", alpha = c(0.01, 0.05, 0.1, 0.2)) round(mdd, 2) #[1] 0.29 0.25 0.23 0.20 #---------- # Clean up #--------- rm(mdd, mdd.list) #========== # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), determine the # minimal detectable difference to detect a difference in the proportion of # detects of cadmium between the background and compliance wells. Set the # compliance well to "group 1" and the background well to "group 2". Assume # the true probability of a "detect" at the background well is 1/3, use a # 5% significance level, use 80%, 90%, and 95% power, use the given sample # sizes of 64 observations at the compliance well and 24 observations at the # background well, and use the upper one-sided alternative (probability of a # "detect" at the compliance well is greater than the probability of a "detect" # at the background well). # (The data are stored in EPA.89b.cadmium.df.) # # Note that the minimal detectable difference increases from 0.32 to 0.37 to 0.40 as # the required power increases from 80% to 90% to 95%. Thus, in order to detect a # difference in probability of detection between the compliance and background # wells, the probability of detection at the compliance well must be 0.65, 0.70, # or 0.74 (depending on the required power). 
EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background # .......................................... #86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance p.hat.back <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Background"])) p.hat.back #[1] 0.3333333 p.hat.comp <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Compliance"])) p.hat.comp #[1] 0.375 n.back <- with(EPA.89b.cadmium.df, sum(Well.type == "Background")) n.back #[1] 24 n.comp <- with(EPA.89b.cadmium.df, sum(Well.type == "Compliance")) n.comp #[1] 64 mdd <- propTestMdd(n.or.n1 = n.comp, n2 = n.back, p0.or.p2 = p.hat.back, power = c(.80, .90, .95), sample.type = "two", alternative = "greater") round(mdd, 2) #[1] 0.32 0.37 0.40 round(mdd + p.hat.back, 2) #[1] 0.65 0.70 0.73 #---------- # Clean up #--------- rm(p.hat.back, p.hat.comp, n.back, n.comp, mdd)
Compute the sample size necessary to achieve a specified power for a one- or two-sample proportion test, given the true proportion(s) and significance level.
propTestN(p.or.p1, p0.or.p2, alpha = 0.05, power = 0.95, sample.type = "one.sample", alternative = "two.sided", ratio = 1, approx = TRUE, correct = sample.type == "two.sample", round.up = TRUE, warn = TRUE, return.exact.list = TRUE, n.min = 2, n.max = 10000, tol.alpha = 0.1 * alpha, tol = 1e-7, maxiter = 1000)
p.or.p1 |
numeric vector of proportions. When |
p0.or.p2 |
numeric vector of proportions. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power associated with
the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute sample size based on a one-sample or
two-sample hypothesis test. |
alternative |
character string indicating the kind of alternative hypothesis.
The possible values are |
ratio |
numeric vector indicating the ratio of sample size in group 2 to sample size
in group 1 ( |
approx |
logical scalar indicating whether to compute the sample size based on the normal
approximation to the binomial distribution. The default value is |
correct |
logical scalar indicating whether to use the continuity correction when |
round.up |
logical scalar indicating whether to round up the values of the computed sample size(s)
to the next smallest integer. The default value is |
warn |
logical scalar indicating whether to issue a warning. The default value is |
return.exact.list |
logical scalar relevant to the case when |
n.min |
integer relevant to the case when |
n.max |
integer relevant to the case when |
tol.alpha |
numeric vector relevant to the case when |
tol |
numeric scalar relevant to the case when |
maxiter |
integer relevant to the case when |
If the arguments p.or.p1
, p0.or.p2
, alpha
, power
, ratio
,
and tol.alpha
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
The computed sample size is based on the difference p.or.p1 - p0.or.p2
.
One-Sample Case (sample.type="one.sample"
).
approx=TRUE
. When sample.type="one.sample"
and approx=TRUE
,
sample size is computed based on the test that uses the normal approximation to the
binomial distribution; see the help file for prop.test
.
The formula for this test and the associated power is presented in
standard statistics texts, including Zar (2010, pp. 534-537, 539-541).
These equations can be inverted to solve for the sample size, given a specified power,
significance level, hypothesized proportion, and true proportion.
approx=FALSE
. When sample.type="one.sample"
and approx=FALSE
,
sample size is computed based on the exact binomial test; see the help file for binom.test
.
The formula for this test and its associated power is presented in standard statistics texts,
including Zar (2010, pp. 532-534, 539) and
Millard and Neerchal (2001, pp. 385-386, 504-506). The formula for the power involves
five quantities: the hypothesized proportion (p0), the true proportion (p), the significance level (alpha), the power, and the sample size (n).
In this case the function
propTestN
uses a search algorithm to determine the
required sample size to attain a specified power, given the values of the
hypothesized and true proportions and the significance level.
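As a quick illustration (this snippet is mine and not part of the original examples), the sample size returned by this search can be fed back into propTestPower() with approx = FALSE to confirm that the attained power is at least the requested power:

n.exact <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = 0.8,
    approx = FALSE, return.exact.list = FALSE)
n.exact
propTestPower(n.or.n1 = n.exact, p.or.p1 = 0.7, p0.or.p2 = 0.5,
    approx = FALSE, return.exact.list = FALSE)   # should be >= 0.8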
Two-Sample Case (sample.type="two.sample"
).
When sample.type="two.sample"
, sample size is computed based on the test that uses the
normal approximation to the binomial distribution;
see the help file for prop.test
.
The formula for this test and its associated power is presented in standard statistics texts,
including Zar (2010, pp. 549-550, 552-553) and
Millard and Neerchal (2001, pp. 443-445, 508-510).
These equations can be inverted to solve for the sample size, given a specified power,
significance level, true proportions, and ratio of sample size in group 2 to sample size in
group 1.
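For a rough cross-check (my own example, not from the original documentation), the two-sample result can be compared with the base R function stats::power.prop.test(), which inverts the same normal-approximation power formula but does not apply a continuity correction; correct = FALSE is therefore used in the propTestN() call below.

# Sample size per group from propTestN, without the continuity correction
propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = 0.9,
    sample.type = "two", correct = FALSE)

# Comparable (unrounded) sample size per group from base R
power.prop.test(p1 = 0.7, p2 = 0.5, power = 0.9, sig.level = 0.05)$n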
Approximate Test (approx=TRUE
).
When sample.type="one.sample"
, or sample.type="two.sample"
and ratio=1
(i.e., equal sample sizes for each group), propTestN
returns a numeric vector of sample sizes. When sample.type="two.sample"
and at least one element of ratio
is
greater than 1, propTestN
returns a list with two components called
n1
and n2
, specifying the sample sizes for each group.
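For example (a sketch of mine, not from the original help file), requesting twice as many observations in group 2 as in group 1 returns the per-group sample sizes as a list:

propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = 0.9,
    sample.type = "two", ratio = 2)
# The returned list has components n1 and n2, with n2 roughly twice n1.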
Exact Test (approx=FALSE
).
If return.exact.list=FALSE
, propTestN
returns a numeric vector of sample sizes.
If return.exact.list=TRUE
, propTestN
returns a list with the following components:
n |
numeric vector of sample sizes. |
power |
numeric vector of powers. |
alpha |
numeric vector containing the true significance levels.
Because of the discrete nature of the binomial distribution, the true significance
levels usually do not equal the significance level supplied by the user in the
argument |
q.critical.lower |
numeric vector of lower critical values for rejecting the null
hypothesis. If the observed number of "successes" is less than or equal to these values,
the null hypothesis is rejected. (Not present if |
q.critical.upper |
numeric vector of upper critical values for rejecting the null
hypothesis. If the observed number of "successes" is greater than these values,
the null hypothesis is rejected. (Not present if |
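The relationship between these components can be checked directly from the binomial distribution (a sketch of mine): under the hypothesized proportion, the attained significance level is the probability of a count at or below q.critical.lower plus the probability of a count above q.critical.upper.

out <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = 0.8, approx = FALSE)
with(out, pbinom(q.critical.lower, size = n, prob = 0.5) +
    pbinom(q.critical.upper, size = n, prob = 0.5, lower.tail = FALSE))
out$alpha   # should agree with the value computed above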
The binomial distribution is used to model processes with binary (Yes-No, Success-Failure,
Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent
of any other trial, and that the probability of “success”, p, is the same on each trial. A binomial discrete random variable X is the number of “successes” in n independent trials. A special case of the binomial distribution occurs when n = 1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model the proportion of times a chemical concentration exceeds a set standard in a given period of time (e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a background well (e.g., USEPA, 1989b, Chapter 8, p.3-7).
In the course of designing a sampling program, an environmental scientist may wish to determine the
relationship between sample size, power, significance level, and the difference between the
hypothesized and true proportions if one of the objectives of the sampling program is to
determine whether a proportion differs from a specified level or whether two proportions differ from each other.
The functions propTestPower
, propTestN
, propTestMdd
, and
plotPropTestDesign
can be used to investigate these relationships for the case of
binomial proportions.
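As a small sketch of such an investigation (mine, not from the original examples), the required sample size can be plotted as a function of the desired power:

pow <- seq(0.5, 0.95, by = 0.05)
n <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = pow)
plot(pow, n, type = "b", xlab = "Power", ylab = "Required Sample Size")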
In a study of the two-sample proportion test, Haseman (1978) found that power formulas that do not incorporate the continuity correction tend to underestimate the power. Casagrande, Pike, and Smith (1978) found that formulas that do incorporate the continuity correction provide an excellent approximation.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapter 15.
Casagrande, J.T., M.C. Pike, and P.G. Smith. (1978). An Improved Approximation Formula for Calculating Sample Sizes for Comparing Two Binomial Distributions. Biometrics 34, 483-486.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Haseman, J.K. (1978). Exact Sample Sizes for Use with the Fisher-Irwin Test for 2x2 Tables. Biometrics 34, 106-109.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-Plus. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
propTestPower
, propTestMdd
, plotPropTestDesign
,
prop.test
, binom.test
.
# Look at how the required sample size of the one-sample # proportion test with a two-sided alternative and Type I error # set to 5% increases with increasing power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = seq(0.5, 0.9, by = 0.1)) #[1] 25 31 38 47 62 #---------- # Repeat the last example, but compute the sample size based on # the exact test instead of the approximation. Note that because # we require the actual Type I error (alpha) to be within # 10% of the supplied value of alpha (which is 0.05 by default), # due to the discrete nature of the exact binomial test # we end up with more power then we specified. n.list <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = seq(0.5, 0.9, by = 0.1), approx = FALSE) lapply(n.list, round, 3) #$n #[1] 37 37 44 51 65 # #$power #[1] 0.698 0.698 0.778 0.836 0.910 # #$alpha #[1] 0.047 0.047 0.049 0.049 0.046 # #$q.critical.lower #[1] 12 12 15 18 24 # #$q.critical.upper #[1] 24 24 28 32 40 #---------- # Using the example above, see how the sample size changes # if we allow the Type I error to deviate by more than 10 percent # of the value of alpha (i.e., by more than 0.005). n.list <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = seq(0.5, 0.9, by = 0.1), approx = FALSE, tol.alpha = 0.01) lapply(n.list, round, 3) #$n #[1] 25 35 42 49 65 # #$power #[1] 0.512 0.652 0.743 0.810 0.910 # #$alpha #[1] 0.043 0.041 0.044 0.044 0.046 # #$q.critical.lower #[1] 7 11 14 17 24 # #$q.critical.upper #[1] 17 23 27 31 40 #---------- # Clean up #--------- rm(n.list) #========== # Look at how the required sample size for the two-sample # proportion test decreases with increasing difference between # the two population proportions: seq(0.4, 0.1, by = -0.1) #[1] 0.4 0.3 0.2 0.1 propTestN(p.or.p1 = seq(0.4, 0.1, by = -0.1), p0.or.p2 = 0.5, sample.type = "two") #[1] 661 163 70 36 #Warning message: #In propTestN(p.or.p1 = seq(0.4, 0.1, by = -0.1), p0.or.p2 = 0.5, : # The computed sample sizes 'n1' and 'n2' are too small, # relative to the given values of 'p1' and 'p2', for the normal # approximation to work well for the following element indices: # 4 #---------- # Look at how the required sample size for the two-sample # proportion test decreases with increasing values of Type I error: propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, sample.type = "two", alpha = c(0.001, 0.01, 0.05, 0.1)) #[1] 299 221 163 137 #========== # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), # determine the required sample size to detect a difference in the # proportion of detects of cadmium between the background and # compliance wells. Set the complicance well to "group 1" and # the backgound well to "group 2". Assume the true probability # of a "detect" at the background well is 1/3, set the probability # of a "detect" at the compliance well to 0.4 and 0.5, use a 5% # significance level and 95% power, and use the upper # one-sided alternative (probability of a "detect" at the compliance # well is greater than the probability of a "detect" at the background # well). (The original data are stored in EPA.89b.cadmium.df.) # # Note that the required sample size decreases from about # 1160 at each well to about 200 at each well as the difference in # proportions changes from (0.4 - 1/3) to (0.5 - 1/3), but both of # these sample sizes are enormous compared to the number of samples # usually collected in the field. 
EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background # .......................................... #86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance p.hat.back <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Background"])) p.hat.back #[1] 0.3333333 p.hat.comp <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Compliance"])) p.hat.comp #[1] 0.375 n.back <- with(EPA.89b.cadmium.df, sum(Well.type == "Background")) n.back #[1] 24 n.comp <- with(EPA.89b.cadmium.df, sum(Well.type == "Compliance")) n.comp #[1] 64 propTestN(p.or.p1 = c(0.4, 0.50), p0.or.p2 = p.hat.back, alt="greater", sample.type="two") #[1] 1159 199 #---------- # Clean up #--------- rm(p.hat.back, p.hat.comp, n.back, n.comp)
Compute the power of a one- or two-sample proportion test, given the sample size(s), true proportion(s), and significance level.
propTestPower(n.or.n1, p.or.p1 = 0.5, n2 = n.or.n1, p0.or.p2 = 0.5, alpha = 0.05, sample.type = "one.sample", alternative = "two.sided", approx = TRUE, correct = sample.type == "two.sample", warn = TRUE, return.exact.list = TRUE)
n.or.n1 |
numeric vector of sample sizes. When |
p.or.p1 |
numeric vector of proportions. When |
n2 |
numeric vector of sample sizes for group 2. The default value is |
p0.or.p2 |
numeric vector of proportions. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis.
The possible values are |
approx |
logical scalar indicating whether to compute the power based on the normal
approximation to the binomial distribution. The default value is |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning. The default value is |
return.exact.list |
logical scalar relevant to the case when |
If the arguments n.or.n1
, p.or.p1
, n2
, p0.or.p2
, and
alpha
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
The power is based on the difference p.or.p1 - p0.or.p2
.
One-Sample Case (sample.type="one.sample"
).
approx=TRUE
When sample.type="one.sample"
and approx=TRUE
,
power is computed based on the test that uses the normal approximation to the
binomial distribution; see the help file for prop.test
.
The formula for this test and its associated power is presented in most standard statistics
texts, including Zar (2010, pp. 534-537, 539-541).
approx=FALSE
When sample.type="one.sample"
and approx=FALSE
,
power is computed based on the exact binomial test; see the help file for binom.test
.
The formula for this test and its associated power is presented in most standard statistics
texts, including Zar (2010, pp. 532-534, 539) and
Millard and Neerchal (2001, pp. 385-386, 504-506).
Two-Sample Case (sample.type="two.sample"
).
When sample.type="two.sample"
, power is computed based on the test that uses the
normal approximation to the binomial distribution;
see the help file for prop.test
.
The formula for this test and its associated power is presented in standard statistics texts,
including Zar (2010, pp. 549-550, 552-553) and
Millard and Neerchal (2001, pp. 443-445, 508-510).
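As a rough Monte Carlo check (my own sketch, not part of the original documentation), the power reported by propTestPower() for a two-sample design can be compared with the empirical rejection rate of prop.test() on simulated data:

set.seed(123)
n1 <- 50; n2 <- 50; p1 <- 0.7; p2 <- 0.4
reject <- replicate(2000, {
    x1 <- rbinom(1, size = n1, prob = p1)
    x2 <- rbinom(1, size = n2, prob = p2)
    prop.test(c(x1, x2), c(n1, n2))$p.value < 0.05
})
mean(reject)   # empirical power
propTestPower(n.or.n1 = n1, p.or.p1 = p1, n2 = n2, p0.or.p2 = p2,
    sample.type = "two")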
By default, propTestPower
returns a numeric vector of powers.
For the one-sample proportion test (sample.type="one.sample"
),
when approx=FALSE
and return.exact.list=TRUE
, propTestPower
returns a list with the following components:
power |
numeric vector of powers. |
alpha |
numeric vector containing the true significance levels.
Because of the discrete nature of the binomial distribution, the true significance
levels usually do not equal the significance level supplied by the user in the
argument |
q.critical.lower |
numeric vector of lower critical values for rejecting the null
hypothesis. If the observed number of "successes" is less than or equal to these values,
the null hypothesis is rejected. (Not present if |
q.critical.upper |
numeric vector of upper critical values for rejecting the null
hypothesis. If the observed number of "successes" is greater than these values,
the null hypothesis is rejected. (Not present if |
The binomial distribution is used to model processes with binary (Yes-No, Success-Failure,
Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent
of any other trial, and that the probability of “success”, p, is the same on each trial. A binomial discrete random variable X is the number of “successes” in n independent trials. A special case of the binomial distribution occurs when n = 1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model the proportion of times a chemical concentration exceeds a set standard in a given period of time (e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a background well (e.g., USEPA, 1989b, Chapter 8, p.3-7).
In the course of designing a sampling program, an environmental scientist may wish to determine the
relationship between sample size, power, significance level, and the difference between the
hypothesized and true proportions if one of the objectives of the sampling program is to
determine whether a proportion differs from a specified level or whether two proportions differ from each other.
The functions propTestPower
, propTestN
, propTestMdd
, and
plotPropTestDesign
can be used to investigate these relationships for the case of
binomial proportions.
In a study of the two-sample proportion test, Haseman (1978) found that power formulas that do not incorporate the continuity correction tend to underestimate the power. Casagrande, Pike, and Smith (1978) found that formulas that do incorporate the continuity correction provide an excellent approximation.
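The effect of the continuity correction is easy to see directly (a brief sketch of mine): compute the approximate power with and without it for the same design.

propTestPower(n.or.n1 = 30, p.or.p1 = 0.7, p0.or.p2 = 0.4,
    sample.type = "two", correct = TRUE)
propTestPower(n.or.n1 = 30, p.or.p1 = 0.7, p0.or.p2 = 0.4,
    sample.type = "two", correct = FALSE)
# The corrected formula gives the smaller (more conservative) power estimate.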
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapter 15.
Casagrande, J.T., M.C. Pike, and P.G. Smith. (1978). An Improved Approximation Formula for Calculating Sample Sizes for Comparing Two Binomial Distributions. Biometrics 34, 483-486.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Haseman, J.K. (1978). Exact Sample Sizes for Use with the Fisher-Irwin Test for 2x2 Tables. Biometrics 34, 106-109.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-Plus. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
propTestN
, propTestMdd
, plotPropTestDesign
,
prop.test
, binom.test
.
# Look at how the power of the one-sample proportion test # increases with increasing sample size: seq(20, 50, by=10) #[1] 20 30 40 50 power <- propTestPower(n.or.n1 = seq(20, 50, by=10), p.or.p1 = 0.7, p0 = 0.5) round(power, 2) #[1] 0.43 0.60 0.73 0.83 #---------- # Repeat the last example, but compute the power based on # the exact test instead of the approximation. # Note that the significance level varies with sample size and # never attains the requested level of 0.05. prop.test.list <- propTestPower(n.or.n1 = seq(20, 50, by=10), p.or.p1 = 0.7, p0 = 0.5, approx=FALSE) lapply(prop.test.list, round, 2) #$power: #[1] 0.42 0.59 0.70 0.78 # #$alpha: #[1] 0.04 0.04 0.04 0.03 # #$q.critical.lower: #[1] 5 9 13 17 # #$q.critical.upper: #[1] 14 20 26 32 #========== # Look at how the power of the two-sample proportion test # increases with increasing difference between the two # population proportions: seq(0.5, 0.1, by=-0.1) #[1] 0.5 0.4 0.3 0.2 0.1 power <- propTestPower(30, sample.type = "two", p.or.p1 = seq(0.5, 0.1, by=-0.1)) #Warning message: #In propTestPower(30, sample.type = "two", p.or.p1 = seq(0.5, 0.1, : #The sample sizes 'n1' and 'n2' are too small, relative to the given # values of 'p1' and 'p2', for the normal approximation to work well # for the following element indices: # 5 round(power, 2) #[1] 0.05 0.08 0.26 0.59 0.90 #---------- # Look at how the power of the two-sample proportion test # increases with increasing values of Type I error: power <- propTestPower(30, sample.type = "two", p.or.p1 = 0.7, alpha = c(0.001, 0.01, 0.05, 0.1)) round(power, 2) #[1] 0.02 0.10 0.26 0.37 #========== # Clean up #--------- rm(power, prop.test.list) #========== # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), # determine how adding another 20 observations to the background # well to increase the sample size from 24 to 44 will affect the # power of detecting a difference in the proportion of detects of # cadmium between the background and compliance wells. Set the # compliance well to "group 1" and set the background well to # "group 2". Assume the true probability of a "detect" at the # background well is 1/3, set the probability of a "detect" at the # compliance well to 0.4, use a 5% significance level, and use the # upper one-sided alternative (probability of a "detect" at the # compliance well is greater than the probability of a "detect" at # the background well). # (The original data are stored in EPA.89b.cadmium.df.) # # Note that the power does increase (from 9% to 12%), but is relatively # very small. EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background # .......................................... #86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance p.hat.back <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Background"])) p.hat.back #[1] 0.3333333 p.hat.comp <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Compliance"])) p.hat.comp #[1] 0.375 n.back <- with(EPA.89b.cadmium.df, sum(Well.type == "Background")) n.back #[1] 24 n.comp <- with(EPA.89b.cadmium.df, sum(Well.type == "Compliance")) n.comp #[1] 64 propTestPower(n.or.n1 = n.comp, p.or.p1 = 0.4, n2 = c(n.back, 44), p0.or.p2 = p.hat.back, alt="greater", sample.type="two") #[1] 0.08953013 0.12421135 #---------- # Clean up #--------- rm(p.hat.back, p.hat.comp, n.back, n.comp)
A real data set of size n = 55 with 18.8% nondetects (10 nondetect observations). The name of the Excel file that comes with ProUCL 5.2.0 and contains these data is TRS-Real-data-with-NDs.xls.
ProUCL.5.2.TRS.df data(ProUCL.5.2.TRS.df)
A data frame with 55 observations on the following 3 variables.
Value
numeric vector indicating the concentration.
Detect
numeric vector of 0s (nondetects) and 1s (detects) indicating censoring status.
Censored
logical vector indicating censoring status.
USEPA. (2022a). ProUCL Version 5.2.0 Technical Guide: Statistical Software for Environmental Applications for Data Sets with and without Nondetect Observations. Prepared by: Neptune and Company, Inc., 1435 Garrison Street, Suite 201, Lakewood, CO 80215. p. 143. https://www.epa.gov/land-research/proucl-software.
USEPA. (2022b). ProUCL Version 5.2.0 User Guide: Statistical Software for Environmental Applications for Data Sets with and without Nondetect Observations. Prepared by: Neptune and Company, Inc., 1435 Garrison Street, Suite 201, Lakewood, CO 80215. p. 6-115. https://www.epa.gov/land-research/proucl-software.
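A quick tabulation of the censoring indicator (my own sketch, not from the original documentation) summarizes the detects and nondetects described above:

data(ProUCL.5.2.TRS.df)
with(ProUCL.5.2.TRS.df, table(Censored))   # counts of detects (FALSE) and nondetects (TRUE)
with(ProUCL.5.2.TRS.df, mean(Censored))    # proportion of nondetects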
Critical Values for the Anderson-Darling Goodness-of-Fit Test for a Gamma Distribution, as presented in Tables A-1, A-3, and A-5 on pages 283, 285, and 287, respectively, of USEPA (2015).
data("ProUCL.Crit.Vals.for.AD.Test.for.Gamma.array")
data("ProUCL.Crit.Vals.for.AD.Test.for.Gamma.array")
An array of dimensions 32 by 11 by 3, with the first dimension indicating the sample size (between 5 and 1000), the second dimension indicating the value of the maximum likelihood estimate of the shape parameter (between 0.025 and 50), and the third dimension indicating the assumed significance level (0.01, 0.05, and 0.10).
See USEPA (2015, pp.281-282) and the help file for gofTest
for more information. The data in this array are used when
the function gofTest
is called with test="proucl.ad.gamma"
.
The letter k is used to indicate the value of the estimated shape parameter.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 283, 285, and 287.
USEPA. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
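A sketch of the kind of call in which these critical values are consulted (my own example; it assumes the gamma distribution is also specified through the distribution argument of gofTest):

set.seed(74)
y <- rgamma(30, shape = 2, scale = 5)
gofTest(y, distribution = "gamma", test = "proucl.ad.gamma")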
Critical Values for the Kolmogorov-Smirnov Goodness-of-Fit Test for a Gamma Distribution, as presented in Tables A-2, A-4, and A-6 on pages 284, 286, and 288, respectively, of USEPA (2015).
data("ProUCL.Crit.Vals.for.KS.Test.for.Gamma.array")
data("ProUCL.Crit.Vals.for.KS.Test.for.Gamma.array")
An array of dimensions 32 by 11 by 3, with the first dimension indicating the sample size (between 5 and 1000), the second dimension indicating the value of the maximum likelihood estimate of the shape parameter (between 0.025 and 50), and the third dimension indicating the assumed significance level (0.01, 0.05, and 0.10).
See USEPA (2015, pp.281-282) for more information. The data in this array are used when
the function gofTest
is called with test="proucl.ks.gamma"
.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 284, 286, and 288.
USEPA. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Estimate the (1, j, k)'th probability-weighted moment from a random sample, where either j = 0, k = 0, or both.
pwMoment(x, j = 0, k = 0, method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), na.rm = FALSE)
x |
numeric vector of observations. |
j , k
|
non-negative integers specifying the order of the moment. |
method |
character string specifying what method to use to compute the
probability-weighted moment. The possible values are |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
na.rm |
logical scalar indicating whether to remove missing values from |
The definition of a probability-weighted moment, introduced by Greenwood et al. (1979), is as follows. Let X denote a random variable with cdf F, and let x(p) denote the p'th quantile of the distribution. Then the (i, j, k)'th probability-weighted moment is given by:

M(i, j, k) = E[ X^i {F(X)}^j {1 - F(X)}^k ]

where i, j, and k are real numbers. Note that if i is a nonnegative integer, then M(i, 0, 0) is the conventional i'th moment about the origin.

Greenwood et al. (1979) state that in the special case where i, j, and k are nonnegative integers, M(i, j, k) can be written in terms of the beta function B(j + 1, k + 1) and the i'th moment about the origin of the (j + 1)'th order statistic for a sample of size j + k + 1. In particular,

M(1, 0, 0) = E[X]
M(1, 0, k) = E[X(1, k+1)] / (k + 1)
M(1, j, 0) = E[X(j+1, j+1)] / (j + 1)

where E[X(1, k+1)] denotes the expected value of the first order statistic (i.e., the minimum) in a sample of size k + 1, and E[X(j+1, j+1)] denotes the expected value of the (j + 1)'th order statistic (i.e., the maximum) in a sample of size j + 1.
Unbiased Estimators (method="unbiased")

Landwehr et al. (1979) show that, given a random sample of n values from some arbitrary distribution, an unbiased, distribution-free, and parameter-free estimator of M(1, 0, k) is a weighted average of the ordered sample values, with weights given by ratios of binomial coefficients that depend on n and k. Hosking et al. (1985) note that this estimator is closely related to U-statistics (Hoeffding, 1948; Lehmann, 1975, pp. 362-371), and that an unbiased, distribution-free, and parameter-free estimator of M(1, j, 0) has the same form, with weights depending on n and j.

Plotting-Position Estimators (method="plotting.position")

Hosking et al. (1985) propose alternative estimators of M(1, j, 0) and M(1, 0, k) based on plotting positions:

estimate of M(1, j, 0) = (1/n) * sum over i of [ p(i)^j * x(i) ]
estimate of M(1, 0, k) = (1/n) * sum over i of [ (1 - p(i))^k * x(i) ]

where x(i) denotes the i'th order statistic in the random sample of size n, and p(i) denotes the plotting position of the i'th order statistic, that is, a distribution-free estimate of the cdf of X evaluated at the i'th order statistic. Typically, plotting positions have the form:

p(i) = (i - a) / (n + b)

where a and b are constants (given here by the argument plot.pos.cons). For this form of plotting position, the plotting-position estimators are asymptotically equivalent to the U-statistic estimators.
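A small numerical sketch (mine, not from the original examples) computes the plotting-position estimate of M(1, 1, 0) by hand using the default constants a = 0.35 and b = 0, and compares it with pwMoment():

set.seed(46)
x <- rlnorm(15, meanlog = 1, sdlog = 0.5)
n <- length(x)
p <- ((1:n) - 0.35) / n           # plotting positions with a = 0.35, b = 0
mean(p * sort(x))                 # hand computation of the estimate of M(1, 1, 0)
pwMoment(x, j = 1, method = "plotting.position")   # should give the same value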
A numeric scalar: the value of the (1, j, k)'th probability-weighted moment as defined by Greenwood et al. (1979).
Greenwood et al. (1979) introduced the concept of probability-weighted moments
as a tool to derive estimates of distribution parameters for distributions that
can be (perhaps only be) expressed in inverse form. The term “inverse form”
simply means that instead of characterizing the distribution by the formula for
its cumulative distribution function (cdf), the distribution is characterized by
the formula for the p'th quantile x(p) (0 ≤ p ≤ 1).
For distributions that can only be expressed in inverse form, moment estimates of their parameters are not available, and maximum likelihood estimates are not easy to compute. Greenwood et al. (1979) show that in these cases, it is often possible to derive expressions for the distribution parameters in terms of probability-weighted moments. Thus, for these cases the distribution parameters can be estimated based on the sample probability-weighted moments, which are fairly easy to compute. Furthermore, for distributions whose parameters can be expressed as functions of conventional moments, the method of probability-weighted moments provides an alternative to method of moments and maximum likelihood estimators.
Landwehr et al. (1979) use the method of probability-weighted moments to estimate the parameters of the Type I Extreme Value (Gumbel) distribution.
Hosking et al. (1985) use the method of probability-weighted moments to estimate the parameters of the generalized extreme value distribution.
Hosking (1990) and Hosking and Wallis (1995) show the relationship between probability-weighted moments and L-moments.
Hosking and Wallis (1995) recommend using the unbiased estimators of probability-weighted moments for almost all applications.
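For instance (my own sketch, using the standard identities lambda_1 = M(1, 0, 0) and lambda_2 = 2 M(1, 1, 0) - M(1, 0, 0)), the first two sample L-moments can be recovered from the unbiased sample probability-weighted moments and compared with the EnvStats function lMoment():

set.seed(47)
x <- rgamma(25, shape = 2, scale = 3)
b0 <- pwMoment(x, j = 0)
b1 <- pwMoment(x, j = 1)
c(lambda1 = b0, lambda2 = 2 * b1 - b0)
c(lMoment(x, r = 1), lMoment(x, r = 2))   # should match the values above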
Steven P. Millard ([email protected])
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hoeffding, W. (1948). A Class of Statistics with Asymptotically Normal Distribution. Annals of Mathematical Statistics 19, 293–325.
Hosking, J.R.M. (1990). L-Moments: Analysis and Estimation of Distributions Using Linear Combinations of Order Statistics. Journal of the Royal Statistical Society, Series B 52(1), 105–124.
Hosking, J.R.M., and J.R. Wallis (1995). A Comparison of Unbiased and Plotting-Position Estimators of L Moments. Water Resources Research 31(8), 2019–2025.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, pp.362-371.
# Generate 20 observations from a generalized extreme value distribution # with parameters location=10, scale=2, and shape=.25, then compute the # 0'th, 1'st and 2'nd probability-weighted moments. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgevd(20, location = 10, scale = 2, shape = 0.25) pwMoment(dat) #[1] 10.59556 pwMoment(dat, 1) #[1] 5.798481 pwMoment(dat, 2) #[1] 4.060574 pwMoment(dat, k = 1) #[1] 4.797081 pwMoment(dat, k = 2) #[1] 3.059173 pwMoment(dat, 1, method = "plotting.position") # [1] 5.852913 pwMoment(dat, 1, method = "plotting.position", plot.pos = c(.325, 1)) #[1] 5.586817 #---------- # Clean Up #--------- rm(dat)
Produces a quantile-quantile (Q-Q) plot, also called a probability plot.
The qqPlot
function is a modified version of the R functions
qqnorm
and qqplot
.
The EnvStats function qqPlot
allows the user to specify a number of
different distributions in addition to the normal distribution, and to optionally
estimate the distribution parameters of the fitted distribution.
qqPlot(x, y = NULL, distribution = "norm", param.list = list(mean = 0, sd = 1), estimate.params = plot.type == "Tukey Mean-Difference Q-Q", est.arg.list = NULL, plot.type = "Q-Q", plot.pos.con = NULL, plot.it = TRUE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = FALSE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. When |
y |
optional numeric vector of observations (not necessarily the same length as |
distribution |
when |
param.list |
when |
estimate.params |
when |
est.arg.list |
when |
plot.type |
a character string denoting the kind of plot. Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value of |
plot.it |
a logical scalar indicating whether to create a plot on the current graphics device.
The default value is |
equal.axes |
a logical scalar indicating whether to use the same range on the |
add.line |
a logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the Q-Q plot. Possible values are
|
duplicate.points.method |
a character string denoting how to plot points with duplicate |
points.col |
a numeric scalar or character string determining the color of the points in the plot.
The default value is |
line.col |
a numeric scalar or character string determining the color of the line in the plot.
The default value is |
line.lwd |
a numeric scalar determining the width of the line in the plot. The default value is
|
line.lty |
a numeric scalar determining the line type of the line in the plot. The default value is
|
digits |
a scalar indicating how many significant digits to print for the distribution parameters.
The default value is |
main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
If y
is not supplied, the vector x
is assumed to be a sample from the probability
distribution specified by the argument distribution
(and param.list
if
estimate.params=FALSE
). When plot.type="Q-Q"
, the quantiles of x
are
plotted on the y-axis against the quantiles of the assumed distribution on the x-axis.
If y
is supplied and plot.type="Q-Q"
, the empirical quantiles of y
are
plotted against the empirical quantiles of x
.
When plot.type="Tukey Mean-Difference Q-Q"
, the difference of the quantiles is plotted on
the y-axis against the mean of the quantiles on the x-axis.
Special Distributions
When y
is not supplied and the argument distribution
specifies one of the
following distributions, the function qqPlot
behaves in the manner described below.
"lnorm"
Lognormal Distribution. The log-transformed quantiles are plotted against quantiles from a Normal (Gaussian) distribution.
"lnormAlt"
Lognormal Distribution (alternative parameterization). The untransformed quantiles are plotted against quantiles from a Lognormal distribution.
"lnorm3"
Three-Parameter Lognormal Distribution. The quantiles of
log(x-threshold)
are plotted against quantiles from a Normal (Gaussian) distribution.
The value of threshold
is either specified in the argument param.list
, or,
if estimate.params=TRUE
, then it is estimated.
"zmnorm"
Zero-Modified Normal Distribution. The quantiles of the
non-zero values (i.e., x[x!=0]
) are plotted against quantiles from a Normal
(Gaussian) distribution.
"zmlnorm"
Zero-Modified Lognormal Distribution. The quantiles of the
log-transformed positive values (i.e., log(x[x>0])
) are plotted against quantiles
from a Normal (Gaussian) distribution.
"zmlnormAlt"
Lognormal Distribution (alternative parameterization).
The quantiles of the untransformed positive values (i.e., x[x>0]
) are
plotted against quantiles from a Lognormal distribution.
Explanation of Q-Q Plots
A probability plot or quantile-quantile (Q-Q) plot
is a graphical display invented by Wilk and Gnanadesikan (1968) to compare a
data set to a particular probability distribution or to compare it to another
data set. The idea is that if two population distributions are exactly the same,
then they have the same quantiles (percentiles), so a plot of the quantiles for
the first distribution vs. the quantiles for the second distribution will fall
on the 0-1 line (i.e., the straight line with intercept 0 and slope 1).
If the two distributions have the same shape and spread but different locations, then the plot of the quantiles will fall on the line y = x + c (parallel to the 0-1 line), where c denotes the difference in locations. If the distributions have different locations and also differ by a multiplicative constant b, then the plot of the quantiles will fall on the line y = b*x + c (D'Agostino, 1986a, p. 25; Helsel and Hirsch, 1986, p. 42).
Various kinds of differences between distributions will yield various kinds of
deviations from a straight line.
Comparing Observations to a Hypothesized Distribution
Let x_1, x_2, ..., x_n denote the observations in a random sample of size n from some unknown distribution with cumulative distribution function F, and let x_(1), x_(2), ..., x_(n) denote the ordered observations. Depending on the particular formula used for the empirical cdf (see ecdfPlot), the i'th order statistic is an estimate of the i/(n+1)'th, (i - 0.5)/n'th, etc., quantile. For the moment, assume the i'th order statistic is an estimate of the i/(n+1)'th quantile, that is:

F[x_(i)] ≈ p_i = i/(n + 1)    (1)

so

x_(i) ≈ F^-1(p_i)    (2)

If we knew the form of the true cdf F, then the plot of x_(i) vs. F^-1(p_i) would form approximately a straight line based on Equation (2) above. A probability plot is a plot of x_(i) vs. F0^-1(p_i), where F0 denotes the cdf associated with the hypothesized distribution. The probability plot should fall roughly on the line y = x if F = F0. If F and F0 merely differ by a shift in location and scale, that is, if the cdf of the data is F0((x - m)/s) for some constants m and s > 0, then the plot should fall roughly on the line y = m + s*x.

The quantity p_i = i/(n + 1) in Equation (1) above is called the plotting position for the probability plot. This particular formula for the plotting position is appealing because it can be shown that for any continuous distribution

E{F[x_(i)]} = i/(n + 1)    (3)

(Nelson, 1982, pp. 299-300; Stedinger et al., 1993). That is, the i'th plotting position defined as in Equation (1) is the expected value of the true cdf evaluated at the i'th order statistic. Many authors and practitioners, however, prefer to use a plotting position that satisfies:

F^-1(p_i) = E[x_(i)]    (4)

or one that satisfies

F^-1(p_i) = Median[x_(i)] = F^-1(Median[u_(i)])    (5)

where Median[x_(i)] denotes the median of the distribution of the i'th order statistic, and u_(i) denotes the i'th order statistic in a random sample of n uniform (0,1) random variates.

The plotting positions in Equation (4) are often approximated since the expected value of the i'th order statistic is often difficult and time-consuming to compute. Note that these plotting positions will differ for different distributions.

The plotting positions in Equation (5) were recommended by Filliben (1975) because they require computing or approximating only the medians of uniform (0,1) order statistics, no matter what the form of the assumed cdf F. Also, the median may be preferred as a measure of central tendency because the distributions of most order statistics are skewed.

Most plotting positions can be written as:

p_i = (i - a) / (n + 1 - 2a)    (6)

where 0 ≤ a ≤ 1 (D'Agostino, 1986a, p.25; Stedinger et al., 1993). The quantity a is sometimes called the “plotting position constant”, and is determined by the argument plot.pos.con in the function qqPlot.
The table below, adapted from Stedinger et al. (1993), displays commonly used plotting positions based on equation (6) for several distributions.

Name | a | Distribution Often Used With | References
---|---|---|---
Weibull | 0 | Weibull, Uniform | Weibull (1939), Stedinger et al. (1993)
Median | 0.3175 | Several | Filliben (1975), Vogel (1986)
Blom | 0.375 | Normal and Others | Blom (1958), Looney and Gulledge (1985)
Cunnane | 0.4 | Several | Cunnane (1978), Chowdhury et al. (1991)
Gringorten | 0.44 | Gumbel | Gringorten (1963), Vogel (1986)
Hazen | 0.5 | Several | Hazen (1914), Chambers et al. (1983), Cleveland (1993)
For moderate and large sample sizes, there is very little difference in
visual appearance of the Q-Q plot for different choices of plotting positions.
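As a quick illustration (mine, not from the original help file), the plotting positions in Equation (6) can be computed directly, and the base R helper ppoints() uses the same form:

n <- 10
a <- 0.375                        # Blom's constant
((1:n) - a) / (n + 1 - 2 * a)     # plotting positions from Equation (6)
ppoints(n, a = 0.375)             # same values from base R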
Comparing Two Data Sets
Let x_1, x_2, ..., x_n denote the observations in a random sample of size n from some unknown distribution with cumulative distribution function F, and let x_(1), x_(2), ..., x_(n) denote the ordered observations. Similarly, let y_1, y_2, ..., y_m denote the observations in a random sample of size m from some unknown distribution with cumulative distribution function G, and let y_(1), y_(2), ..., y_(m) denote the ordered observations. Suppose we are interested in investigating whether the shape of the distribution with cdf F is the same as the shape of the distribution with cdf G (e.g., F and G may both be normal distributions but differ in mean and standard deviation).

When n = m, we can visually explore this question by plotting y_(i) vs. x_(i), for i = 1, 2, ..., n. The values in the x sample are spread out in a certain way depending on the true distribution: they may be more or less symmetric about some value (the population mean or median) or they may be skewed to the right or left; they may be concentrated close to the mean or median (platykurtic) or there may be several observations “far away” from the mean or median on either side (leptokurtic). Similarly, the values in the y sample are spread out in a certain way. If the values in the two samples are spread out in the same way, then the plot of y_(i) vs. x_(i) will be approximately a straight line. If the cdf F is exactly the same as the cdf G, then the plot of y_(i) vs. x_(i) will fall roughly on the straight line y = x. If F and G differ by a shift in location and scale, then the plot will fall roughly on a straight line whose slope and intercept reflect the scale and location shift.

When m < n, a slight adjustment has to be made to produce the plot. Let p_1, p_2, ..., p_n denote the plotting positions corresponding to the n empirical quantiles for the x's, and let q_1, q_2, ..., q_m denote the plotting positions corresponding to the m empirical quantiles for the y's. Then we plot y_(j) vs. x*_(j) for j = 1, 2, ..., m, where each value x*_(j) is determined by linearly interpolating between the ordered x values, based on the values of the plotting positions for the x's and the y's. A similar adjustment is made when n < m.

Note that the R function qqplot uses a different interpolation method: it interpolates based on 1:n and m by calling the approx function.
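A short sketch of a two-sample Q-Q plot (my own, using the EPA.94b.tccb.df data set that also appears in the Examples below): the empirical quantiles of the Cleanup area TcCB concentrations are plotted against those of the Reference area.

with(EPA.94b.tccb.df,
    qqPlot(x = TcCB[Area == "Reference"], y = TcCB[Area == "Cleanup"],
        add.line = TRUE, qq.line.type = "0-1",
        xlab = "Reference Area TcCB (ppb)",
        ylab = "Cleanup Area TcCB (ppb)"))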
qqPlot
returns a list with components x
and y
, giving the
coordinates of the points that have been or would have been plotted. There are four cases to
consider:
1. The argument y
is not supplied and plot.type="Q-Q"
.
x |
the quantiles from the theoretical distribution. |
y |
the observed quantiles (order statistics) based on the data in the argument |
2. The argument y
is not supplied and plot.type="Tukey Mean-Difference Q-Q"
.
x |
the averages of the observed and theoretical quantiles. |
y |
the differences between the observed quantiles (order statistics) and the theoretical quantiles. |
3. The argument y
is supplied and plot.type="Q-Q"
.
x |
the observed quantiles based on the data in the argument |
y |
the observed quantiles based on the data in the argument |
4. The argument y
is supplied and plot.type="Tukey Mean-Difference Q-Q"
.
x |
the averages of the quantiles based on the argument |
y |
the differences between the quantiles based on the argument |
A quantile-quantile (Q-Q) plot, also called a probability plot, is a plot of the observed
order statistics from a random sample (the empirical quantiles) against their (estimated)
mean or median values based on an assumed distribution, or against the empirical quantiles
of another set of data (Wilk and Gnanadesikan, 1968). Q-Q plots are used to assess whether
data come from a particular distribution, or whether two datasets have the same parent
distribution. If the distributions have the same shape (but not necessarily the same
location or scale parameters), then the plot will fall roughly on a straight line. If the
distributions are exactly the same, then the plot will fall roughly on the straight line y = x.
A Tukey mean-difference Q-Q plot, also called an m-d plot, is a modification of a
Q-Q plot. Rather than plotting observed quantiles vs. theoretical quantiles, or observed y-quantiles vs. observed x-quantiles, a Tukey mean-difference Q-Q plot plots the difference between the quantiles on the y-axis vs. the average of the quantiles on the x-axis (Cleveland, 1993, pp.22-23). If the two sets of quantiles come from the same parent distribution, then the points in this plot should fall roughly along the horizontal line y = 0. If one set of quantiles comes from the same distribution with a shift in median, then the points in this plot should fall along a horizontal line above or below the line y = 0.
A Tukey mean-difference Q-Q plot enhances our perception of how the points in the Q-Q plot deviate
from a straight line, because it is easier to judge deviations from a horizontal line than from a
line with a non-zero slope.
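For example (a sketch of mine, not from the original examples), the mean-difference version can be requested directly through the plot.type argument:

set.seed(21)
z <- rnorm(30, mean = 5, sd = 2)
qqPlot(z, plot.type = "Tukey Mean-Difference Q-Q", add.line = TRUE)
# Points scattered around a horizontal line at 0 are consistent with the
# assumed normal distribution.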
In a Q-Q plot, the extreme points have more variability than points toward the center. A U-shaped
Q-Q plot indicates that the underlying distribution for the observations on the y-axis is skewed to the right relative to the underlying distribution for the observations on the x-axis. An upside-down-U-shaped Q-Q plot indicates the y-axis distribution is skewed left relative to the x-axis distribution. An S-shaped Q-Q plot indicates the y-axis distribution has shorter tails than the x-axis distribution. Conversely, a plot that is bent down on the left and bent up on the right indicates that the y-axis distribution has longer tails than the x-axis distribution.
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
ppoints
, ecdfPlot
, Distribution.df
,
qqPlotGestalt
, qqPlotCensored
, qqnorm
.
# The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are stored # in the data frame EPA.94b.tccb.df. # # Create a Q-Q plot for the reference area data first assuming a # normal distribution, then a lognormal distribution, then a # gamma distribution. # Assume a normal distribution #----------------------------- dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"])) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], plot.type = "Tukey", add.line = TRUE)) # The Q-Q plot based on assuming a normal distribution shows a U-shape, # indicating the Reference area TcCB data are skewed to the right # compared to a normal distribution. # Assume a lognormal distribution #-------------------------------- dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "lnorm", digits = 2, points.col = "blue", add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "lnorm", digits = 2, plot.type = "Tukey", points.col = "blue", add.line = TRUE)) # Alternative parameterization dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "lnormAlt", estimate.params = TRUE, digits = 2, points.col = "blue", add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "lnormAlt", digits = 2, plot.type = "Tukey", points.col = "blue", add.line = TRUE)) # The lognormal distribution appears to be an adequate fit. # Now look at a Q-Q plot assuming a gamma distribution. #---------------------------------------------------------- dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "gamma", estimate.params = TRUE, digits = 2, points.col = "blue", add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "gamma", digits = 2, plot.type = "Tukey", points.col = "blue", add.line = TRUE)) # Alternative Parameterization dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "gammaAlt", estimate.params = TRUE, digits = 2, points.col = "blue", add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "gammaAlt", digits = 2, plot.type = "Tukey", points.col = "blue", add.line = TRUE)) #------------------------------------------------------------------------------------- # Generate 20 observations from a gamma distribution with parameters # shape=2 and scale=2, then create a normal (Gaussian) Q-Q plot for these data. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(357) dat <- rgamma(20, shape=2, scale=2) dev.new() qqPlot(dat, add.line = TRUE) # Now assume a gamma distribution and estimate the parameters #------------------------------------------------------------ dev.new() qqPlot(dat, dist = "gamma", estimate.params = TRUE, add.line = TRUE) # Clean up #--------- rm(dat) graphics.off()
Produces a quantile-quantile (Q-Q) plot, also called a probability plot, for Type I censored data.
qqPlotCensored(x, censored, censoring.side = "left", prob.method = "michael-schucany", plot.pos.con = NULL, distribution = "norm", param.list = list(mean = 0, sd = 1), estimate.params = plot.type == "Tukey Mean-Difference Q-Q", est.arg.list = NULL, plot.type = "Q-Q", plot.it = TRUE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = FALSE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, include.cen = FALSE, cen.pch = ifelse(censoring.side == "left", 6, 2), cen.cex = par("cex"), cen.col = 4, ..., main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations that is assumed to represent a sample from the hypothesized
distribution specified by |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities). Possible values include "michael-schucany" (the default), "kaplan-meier", "modified kaplan-meier", and "hirsch-stedinger"; see the help file for ppointsCensored for details. |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is |
distribution |
a character string denoting the distribution abbreviation. The default value is
|
param.list |
a list with values for the parameters of the distribution. The default value is
|
estimate.params |
a logical scalar indicating whether to compute quantiles based on estimating the distribution
parameters ( You can set |
est.arg.list |
a list whose components are optional arguments associated with the function used to estimate
the parameters of the assumed distribution (see the section Estimating Distribution Parameters
in the help file EnvStats Functions for Censored Data).
For example, the function |
plot.type |
a character string denoting the kind of plot. Possible values are |
plot.it |
a logical scalar indicating whether to create a plot on the current graphics device.
The default value is |
equal.axes |
a logical scalar indicating whether to use the same range on the |
add.line |
a logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the Q-Q plot. Possible values are
|
duplicate.points.method |
a character string denoting how to plot points with duplicate |
points.col |
a numeric scalar or character string determining the color of the points in the plot.
The default value is |
line.col |
a numeric scalar or character string determining the color of the line in the plot.
The default value is |
line.lwd |
a numeric scalar determining the width of the line in the plot. The default value is
|
line.lty |
a numeric scalar determining the line type of the line in the plot. The default value is
|
digits |
a scalar indicating how many significant digits to print for the distribution parameters.
The default value is |
include.cen |
logical scalar indicating whether to include censored values in the plot. The default value is
|
cen.pch |
numeric scalar or character string indicating the plotting character to use to plot censored values.
The default value is |
cen.cex |
numeric scalar that determines the size of the plotting character used to plot censored values.
The default value is the current value of the cex graphics parameter. See the entry for |
cen.col |
numeric scalar or character string that determines the color of the plotting character used to
plot censored values. The default value is |
main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The function qqPlotCensored
does exactly the same thing as qqPlot
(when the argument y
is not supplied to qqPlot
), except
qqPlotCensored
calls the function ppointsCensored
to compute the
plotting positions (estimated cumulative probabilities).
The vector x
is assumed to be a sample from the probability distribution specified
by the argument distribution
(and param.list
if estimate.params=FALSE
).
When plot.type="Q-Q"
, the quantiles of x
are plotted on the y-axis against the quantiles of the assumed distribution on the x-axis.
When plot.type="Tukey Mean-Difference Q-Q"
, the difference of the quantiles is plotted on
the y-axis against the mean of the quantiles on the x-axis.
When prob.method="kaplan-meier"
and censoring.side="left"
and the assumed
distribution has a maximum support of infinity (Inf
; e.g., the normal or lognormal
distribution), the point involving the largest
value of x
is not plotted because it corresponds to an estimated cumulative probability
of 1, which corresponds to an infinite plotting position.
When prob.method="modified kaplan-meier"
and censoring.side="left"
, the
estimated cumulative probability associated with the maximum value is modified from 1
to (n - 0.375)/(n + 0.25), where n denotes the sample size (i.e., the Blom
plotting position for the largest order statistic), so that the point associated with the maximum value can be displayed.
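For example, with the n = 25 manganese observations used in the examples below, the modified plotting position for the maximum would be (a quick check, assuming the Blom position stated above):
n <- 25
(n - 0.375) / (n + 0.25)  # modified plotting position for the largest value
#[1] 0.9752475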
qqPlotCensored
returns a list with the following components:
x |
numeric vector of |
y |
numeric vector of |
Order.Statistics |
numeric vector of the “ordered” observations.
When |
Cumulative.Probabilities |
numeric vector of the plotting positions associated with the order statistics. |
Censored |
logical vector indicating which of the ordered observations are censored. |
Censoring.Side |
character string indicating whether the data are left- or right-censored.
This is the same value as the argument |
Prob.Method |
character string indicating what method was used to compute the plotting positions.
This is the same value as the argument |
Optional Component (only present when prob.method="michael-schucany"
or prob.method="hirsch-stedinger"
):
Plot.Pos.Con |
numeric scalar containing the value of the plotting position constant that was used.
This is the same as the argument |
A quantile-quantile (Q-Q) plot, also called a probability plot, is a plot of the observed
order statistics from a random sample (the empirical quantiles) against their (estimated)
mean or median values based on an assumed distribution, or against the empirical quantiles
of another set of data (Wilk and Gnanadesikan, 1968). Q-Q plots are used to assess whether
data come from a particular distribution, or whether two datasets have the same parent
distribution. If the distributions have the same shape (but not necessarily the same
location or scale parameters), then the plot will fall roughly on a straight line. If the
distributions are exactly the same, then the plot will fall roughly on the straight line y = x.
A Tukey mean-difference Q-Q plot, also called an m-d plot, is a modification of a
Q-Q plot. Rather than plotting observed quantiles vs. theoretical quantiles or observed
x-quantiles vs. observed y-quantiles, a Tukey mean-difference Q-Q plot plots
the difference between the quantiles on the y-axis vs. the average of the quantiles on
the x-axis (Cleveland, 1993, pp.22-23). If the two sets of quantiles come from the same
parent distribution, then the points in this plot should fall roughly along the horizontal line
y = 0. If one set of quantiles comes from the same distribution with a shift in median, then
the points in this plot should fall along a horizontal line above or below the line
y = 0.
A Tukey mean-difference Q-Q plot enhances our perception of how the points in the Q-Q plot deviate
from a straight line, because it is easier to judge deviations from a horizontal line than from a
line with a non-zero slope.
In a Q-Q plot, the extreme points have more variability than points toward the center. A U-shaped
Q-Q plot indicates that the underlying distribution for the observations on the y-axis is
skewed to the right relative to the underlying distribution for the observations on the x-axis.
An upside-down-U-shaped Q-Q plot indicates the y-axis distribution is skewed left relative to
the x-axis distribution. An S-shaped Q-Q plot indicates the y-axis distribution has
shorter tails than the x-axis distribution. Conversely, a plot that is bent down on the
left and bent up on the right indicates that the y-axis distribution has longer tails than
the x-axis distribution.
Censored observations complicate the procedures used to graphically explore data. Techniques from
survival analysis and life testing have been developed to generalize the procedures for
constructing plotting positions, empirical cdf plots, and Q-Q plots to data sets with censored
observations (see ppointsCensored
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley and Sons, New York.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
ppointsCensored
, EnvStats Functions for Censored Data,
qqPlot
, ecdfPlotCensored
, qqPlotGestalt
.
# Generate 20 observations from a normal distribution with mean=20 and sd=5, # censor all observations less than 18, then generate a Q-Q plot assuming # a normal distribution for the complete data set and the censored data set. # Note that the Q-Q plot for the censored data set starts at the first ordered # uncensored observation, and that for values of x > 18 the two Q-Q plots are # exactly the same. This is because there is only one censoring level and # no uncensored observations fall below the censored observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(333) x <- rnorm(20, mean=20, sd=5) censored <- x < 18 sum(censored) #[1] 7 new.x <- x new.x[censored] <- 18 dev.new() qqPlot(x, ylim = range(pretty(x)), main = "Q-Q Plot for\nComplete Data Set") dev.new() qqPlotCensored(new.x, censored, ylim = range(pretty(x)), main="Q-Q Plot for\nCensored Data Set") # Clean up #--------- rm(x, censored, new.x) #------------------------------------------------------------------------------------ # Example 15-1 of USEPA (2009, page 15-10) gives an example of # computing plotting positions based on censored manganese # concentrations (ppb) in groundwater collected at 5 monitoring # wells. The data for this example are stored in # EPA.09.Ex.15.1.manganese.df. Here we will create a Q-Q # plot based on the Kaplan-Meier method. First we'll assume # a normal distribution, then a lognormal distribution, then a # gamma distribution. EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #4 4 Well.1 21.6 21.6 FALSE #5 5 Well.1 <2 2.0 TRUE #... #21 1 Well.5 17.9 17.9 FALSE #22 2 Well.5 22.7 22.7 FALSE #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE # Assume normal distribution #--------------------------- dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Normal Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", sep = "\n"))) # Include max value in the plot #------------------------------ dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, prob.method = "modified kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Normal Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", "(Max Included)", sep = "\n"))) # Assume lognormal distribution #------------------------------ dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, dist = "lnorm", prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Lognormal Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", sep = "\n"))) # Include max value in the plot #------------------------------ dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, dist = "lnorm", prob.method = "modified kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Lognormal Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", "(Max Included)", sep = "\n"))) # The lognormal distribution appears to be a better fit. # Now create a Q-Q plot assuming a gamma distribution. Here we'll # need to set estimate.params=TRUE. 
dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, dist = "gamma", estimate.params = TRUE, prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Gamma Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", sep = "\n"))) # Include max value in the plot #------------------------------ dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, dist = "gamma", estimate.params = TRUE, prob.method = "modified kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Gamma Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", "(Max Included)", sep = "\n"))) #========== # Clean up #--------- graphics.off()
Produce a series of quantile-quantile (Q-Q) plots (also called probability plots) or Tukey mean-difference Q-Q plots for a user-specified distribution.
qqPlotGestalt(distribution = "norm", param.list = list(mean = 0, sd = 1), estimate.params = FALSE, est.arg.list = NULL, sample.size = 10, num.pages = 2, num.plots.per.page = 4, nrow = ceiling(num.plots.per.page/2), plot.type = "Q-Q", plot.pos.con = switch(dist.abb, norm = , lnorm = , lnormAlt = , lnorm3 = 0.375, evd = 0.44, 0.4), equal.axes = (qq.line.type == "0-1" || estimate.params), margin.title = NULL, add.line = FALSE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, same.window = TRUE, ask = same.window & num.pages > 1, mfrow = c(nrow, num.plots.per.page/nrow), mar = c(4, 4, 1, 1) + 0.1, oma = c(0, 0, 7, 0), mgp = c(2, 0.5, 0), ..., main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
distribution |
a character string denoting the distribution abbreviation. The default value is
|
param.list |
a list with values for the parameters of the distribution. The default value is
|
estimate.params |
a logical scalar indicating whether to compute quantiles based on estimating the
distribution parameters ( |
est.arg.list |
a list whose components are optional arguments associated with the function used
to estimate the parameters of the assumed distribution (see the help file
Estimating Distribution Parameters).
For example, all functions used to estimate distribution parameters have an optional argument
called |
sample.size |
numeric scalar indicating the number of observations to generate for each Q-Q plot.
The default value is |
num.pages |
numeric scalar indicating the number of pages of plots to generate.
The default value is |
num.plots.per.page |
numeric scalar indicating the number of plots per page.
The default value is |
nrow |
numeric scalar indicating the number of rows of plots on each page.
The default value is the smallest integer greater than or equal to
|
plot.type |
a character string denoting the kind of plot. Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value of |
equal.axes |
logical scalar indicating whether to use the same range on the |
margin.title |
character string indicating the title printed in the top margin on each page of plots. The default value indicates the kind of Q-Q plot, the probability distribution, the sample size, and the estimation method used (if any). |
add.line |
logical scalar indicating whether to add a line to the plot.
If |
qq.line.type |
character string determining what kind of line to add to the Q-Q plot.
Possible values are |
duplicate.points.method |
character string denoting how to plot points with duplicate |
points.col |
numeric scalar or character string determining the color of the points in the plot.
The default value is |
line.col |
numeric scalar or character string determining the color of the line in the plot.
The default value is |
line.lwd |
numeric scalar determining the width of the line in the plot. The default value is
|
line.lty |
a numeric scalar determining the line type of the line in the plot. The default value is
|
digits |
a scalar indicating how many significant digits to print for the distribution
parameters. The default value is |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
mfrow , mar , oma , mgp , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The function qqPlotGestalt
allows the user to display several Q-Q plots or
Tukey mean-difference Q-Q plots for a specified probability distribution.
The distribution is specified with the arguments distribution
and
param.list
. By default, normal (Gaussian)
Q-Q plots are produced.
If estimate.params=FALSE
(the default), the theoretical quantiles on the
x-axis are computed using the known distribution parameters specified in
param.list
. If estimate.params=TRUE
, the distribution parameters
are estimated based on the sample, and these estimated parameters are then used
to compute the theoretical quantiles. For distributions that can be specified
by a location and scale parameter (e.g., Normal, Logistic, extreme value, etc.),
the value of estimate.params
will not affect the general shape of the
plot, only the values recorded on the x-axis. For distributions that cannot
be specified by a location and scale parameter (e.g., exponential, gamma, etc.), it
is recommended that
estimate.params
be set to TRUE
since in practice
the values of the distribution parameters are not known but must be estimated from
the sample.
The purpose of qqPlotGestalt
is to allow the user to build up a visual
memory of “typical” Q-Q plots. A Q-Q plot is a graphical tool that allows
you to assess how well a particular set of observations fit a particular
probability distribution. The value of this tool depends on the user having an
internal reference set of Q-Q plots with which to compare the current Q-Q plot.
See the help file for qqPlot
for more information.
The NULL
value is returned.
Steven P. Millard ([email protected])
See the REFERENCES section for qqPlot
.
# Look at eight typical normal (Gaussian) Q-Q plots for random samples # of size 10 from a N(0,1) distribution # Are you surprised by the variability in the plots? # # (Note: you must use set.seed if you want to reproduce the exact # same plots more than once.) set.seed(298) qqPlotGestalt(same.window = FALSE) # Add lines to these same Q-Q plots #---------------------------------- set.seed(298) qqPlotGestalt(same.window = FALSE, add.line = TRUE) # Add lines to different Q-Q plots #--------------------------------- qqPlotGestalt(same.window = FALSE, add.line = TRUE) ## Not run: # Look at 4 sets of plots all in the same graphics window #-------------------------------------------------------- qqPlotGestalt(add.line = TRUE, num.pages = 4) ## End(Not run) #========== # Look at Q-Q plots for a gamma distribution #------------------------------------------- qqPlotGestalt(dist = "gammaAlt", param.list = list(mean = 10, cv = 1), estimate.params = TRUE, num.pages = 3, same.window = FALSE, add.line = TRUE) # Look at Tukey Mean Difference Q-Q plots # for a gamma distribution #---------------------------------------- qqPlotGestalt(dist = "gammaAlt", param.list = list(mean = 10, cv = 1), estimate.params = TRUE, num.pages = 3, plot.type = "Tukey", same.window = FALSE, add.line = TRUE) #========== # Clean up #--------- graphics.off()
Two-sample rank test to detect a positive shift in a proportion of one population (here called the “treated” population) compared to another (here called the “reference” population). This test is usually called the quantile test (Johnson et al., 1987).
quantileTest(x, y, alternative = "greater", target.quantile = 0.5, target.r = NULL, exact.p = TRUE)
x |
numeric vector of observations from the “treatment” group.
Missing ( |
y |
numeric vector of observations from the “reference” group.
Missing ( |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
target.quantile |
numeric scalar between 0 and 1 indicating the desired quantile to use as the
lower cut off point for the test. Because of the discrete nature of empirical
quantiles, the upper bound for the possible empirical quantiles will often differ
from the value of |
target.r |
integer indicating the rank of the observation to use as the lower cut off point
for the test. The value of |
exact.p |
logical scalar indicating whether to compute the p-value based on the exact
distribution of the test statistic ( |
Let X denote a random variable representing measurements from a
“treatment” group with cumulative distribution function (cdf) Fx, and let
x1, x2, ..., xm denote m observations from this treatment group. Let
Y denote a random variable from a “reference” group with cdf Fy, and let
y1, y2, ..., yn denote n observations from this reference group.
Consider the null hypothesis
Fx(t) = Fy(t) for all t    (3)
versus the alternative hypothesis
Fx(t) = (1 - e) Fy(t) + e Fz(t)    (4)
where Z denotes some random variable with cdf Fz, 0 < e <= 1,
Fz(t) <= Fy(t) for all values of t,
and Fz(t) != Fy(t) for at least one value of t.
In English, the alternative hypothesis (4) says that a portion of the
distribution for the treatment group (the distribution of X) is shifted to the
right of the distribution for the reference group (the distribution of Y).
The alternative hypothesis (4) with e = 1 is the alternative hypothesis
associated with testing a location shift, for which the
Wilcoxon rank sum test can be used.
Johnson et al. (1987) investigated locally most powerful rank tests for the test of
the null hypothesis (3) against the alternative hypothesis (4). They considered the
case when X and Y
were normal random variables and the case when the
densities of X and Y
assumed only two positive values. For the latter
case, the locally most powerful rank test reduces to the following procedure, which
Johnson et al. (1987) call the quantile test.
Combine the n observations from the reference group and the m
observations from the treatment group and rank them from smallest to largest.
Tied observations receive the average rank of all observations tied at that value.
Choose a quantile q and determine the corresponding value of r, the number of
observations in the combined sample that lie above the q quantile.
Note that because of the discrete nature of ranks, a range of quantiles will
yield the same value for r as the quantile q does.
Alternatively, choose a value of r directly. The bounds on an associated quantile
are then given in Equation (7). Note: the component called parameters in
the list returned by quantileTest contains an element named quantile.ub.
The value of this element is the left-hand side of Equation (7), which equals
(m + n - r + 1)/(m + n + 1).
Set k equal to the number of observations from the treatment group
(the number of x observations) that are among the r largest observations
of the combined sample.
Under the null hypothesis (3), the probability that at least k out of
the r largest observations come from the treatment group follows from the
hypergeometric distribution: the number of treatment observations among the
r largest values of the combined sample is hypergeometric with parameters
m, n, and r, so the exact p-value is the upper tail of that distribution.
This probability may also be approximated using a normal approximation to the
hypergeometric distribution, in which the standard normal cumulative distribution
function is evaluated at a standardized version of k
(USEPA, 1994, pp.7.16-7.17).
(See quantileTestPValue.)
Reject the null hypothesis (3) in favor of the alternative hypothesis (4) at
significance level alpha if the p-value is less than or equal to alpha.
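The exact computation can be checked with a short sketch (an illustration only, not the package's internal code). Under the null hypothesis the number of treatment observations among the r largest values is hypergeometric, so the exact p-value is an upper hypergeometric tail; with the values from the example below (m = 77, n = 47, r = 9, k = 9) this reproduces the reported p-value.
# Probability that at least k of the r largest observations come from the
# treatment group under H0 (upper tail of a hypergeometric distribution).
m <- 77; n <- 47; r <- 9; k <- 9
phyper(k - 1, m, n, r, lower.tail = FALSE)
#[1] 0.01136926
# Compare with the package function:
quantileTestPValue(m = m, n = n, r = r, k = k)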
Johnson et al. (1987) note that their quantile test is asymptotically equivalent
to one proposed by Carrano and Moore (1982) in the context of a two-sided test.
Also, when the chosen quantile is the median (i.e., target.quantile=0.5), the quantile test reduces to Mood's median test for two
groups (see Zar, 2010, p.172; Conover, 1980, pp.171-178).
The optimal choice of q or r
in Step 2 above (i.e., the choice that
yields the largest power) depends on the true underlying distributions of
Y and Z and the mixing proportion e.
Johnson et al. (1987) performed a simulation study and showed that the quantile
test performs better than the Wilcoxon rank sum test and the normal scores test
under the alternative of a mixed normal distribution with a shift of at least
2 standard deviations in the Z distribution. USEPA (1994, pp.7.17-7.21)
shows that when the mixing proportion e is small and the shift is
large, the quantile test is more powerful than the Wilcoxon rank sum test, and
when e is large and the shift is small the Wilcoxon rank sum test
is more powerful than the quantile test.
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
The EPA guidance document Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media (USEPA, 1994, pp.4.7-4.9) recommends three different statistical tests for determining whether a remediated Superfund site has attained compliance: the Wilcoxon rank sum test, the quantile test, and the “hot measurement” comparison test. The Wilcoxon rank sum test and quantile test are nonparametric tests that compare chemical concentrations in the cleanup area with those in the reference area. The hot-measurement comparison test compares concentrations in the cleanup area with a pre-specified upper limit value Hm (the value of Hm must be negotiated between the EPA and the Superfund-site owner or operator). The Wilcoxon rank sum test is appropriate for detecting uniform failure of remedial action throughout the cleanup area. The quantile test is appropriate for detecting failure in only a few areas within the cleanup area. The hot-measurement comparison test is appropriate for detecting hot spots that need to be remediated regardless of the outcomes of the other two tests.
USEPA (1994, pp.4.7-4.9) recommends applying all three tests to all cleanup units within a cleanup area. This leads to the usual multiple comparisons problem: the probability of at least one of the tests indicating non-compliance, when in fact the cleanup area is in compliance, is greater than the pre-set Type I error level for any of the individual tests. USEPA (1994, p.3.3) recommends against using multiple comparison procedures to control the overall Type I error and suggests instead a re-sampling scheme where additional samples are taken in cases where non-compliance is indicated.
Steven P. Millard ([email protected])
Carrano, A., and D. Moore. (1982). The Rationale and Methodology for Quantifying Sister Chromatid Exchange in Humans. In Heddle, J.A., ed., Mutagenicity: New Horizons in Genetic Toxicology. Academic Press, New York, pp.268-304.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, Chapter 4.
Johnson, R.A., S. Verrill, and D.H. Moore. (1987). Two-Sample Rank Tests for Detecting Changes That Occur in a Small Proportion of the Treated Population. Biometrics 43, 641-655.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.435-439.
USEPA. (1994). Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media. EPA/230-R-94-004. Office of Policy, Planning, and Evaluation, U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
quantileTestPValue
, wilcox.test
,
htest.object
, Hypothesis Tests.
# Following Example 7.5 on pages 7.23-7.24 of USEPA (1994b), perform the # quantile test for the TcCB data (the data are stored in EPA.94b.tccb.df). # There are n=47 observations from the reference area and m=77 observations # from the cleanup unit. The target rank is set to 9, resulting in a value # of quantile.ub=0.928. Note that the p-value is 0.0114, not 0.0117. EPA.94b.tccb.df # TcCB.orig TcCB Censored Area #1 0.22 0.22 FALSE Reference #2 0.23 0.23 FALSE Reference #... #46 1.20 1.20 FALSE Reference #47 1.33 1.33 FALSE Reference #48 <0.09 0.09 TRUE Cleanup #49 0.09 0.09 FALSE Cleanup #... #123 51.97 51.97 FALSE Cleanup #124 168.64 168.64 FALSE Cleanup # Determine the values to use for r and k for # a desired significance level of 0.01 #-------------------------------------------- p.vals <- quantileTestPValue(m = 77, n = 47, r = c(rep(8, 3), rep(9, 3), rep(10, 3)), k = c(6, 7, 8, 7, 8, 9, 8, 9, 10)) round(p.vals, 3) #[1] 0.355 0.122 0.019 0.264 0.081 0.011 0.193 0.053 0.007 # Choose r=9, k=9 to get a significance level of 0.011 #----------------------------------------------------- with(EPA.94b.tccb.df, quantileTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], target.r = 9)) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: e = 0 # #Alternative Hypothesis: Tail of Fx Shifted to Right of # Tail of Fy. # 0 < e <= 1, where # Fx(t) = (1-e)*Fy(t) + e*Fz(t), # Fz(t) <= Fy(t) for all t, # and Fy != Fz # #Test Name: Quantile Test # #Data: x = TcCB[Area == "Cleanup"] # y = TcCB[Area == "Reference"] # #Sample Sizes: nx = 77 # ny = 47 # #Test Statistics: k (# x obs of r largest) = 9 # r = 9 # #Test Statistic Parameters: m = 77.000 # n = 47.000 # quantile.ub = 0.928 # #P-value: 0.01136926 #========== # Clean up #--------- rm(p.vals)
Compute the p-value associated with a specified combination of m, n, r, and k for the quantile test (useful for determining r and k for a given significance level).
quantileTestPValue(m, n, r, k, exact.p = TRUE)
m |
numeric vector of integers indicating the number of observations from the
“treatment” group.
Missing ( |
n |
numeric vector of integers indicating the number of observations from the
“reference” group.
Missing ( |
r |
numeric vector of integers indicating the ranks of the observations to use as the
lower cut off for the quantile test. All values of |
k |
numeric vector of integers indicating the number of observations from the
“treatment” group contained in the |
exact.p |
logical scalar indicating whether to compute the p-value based on the exact
distribution of the test statistic ( |
If the arguments m
, n
, r
, and k
are not all the same
length, they are replicated to be the same length as the length of the longest
argument.
For details on how the p-value is computed, see the help file for
quantileTest
.
The function quantileTestPValue
is useful for determining what values to
use for r
and k
, given the values of m
, n
, and a
specified significance level. The function
quantileTestPValue
can be used to reproduce Tables A.6-A.9 in
USEPA (1994, pp.A.22-A.25).
numeric vector of p-values.
See the help file for quantileTest
.
Steven P. Millard ([email protected])
See the help file for quantileTest
.
quantileTest
, wilcox.test
,
htest.object
, Hypothesis Tests.
# Reproduce the first column of Table A.9 in USEPA (1994, p.A.25): #----------------------------------------------------------------- p.vals <- quantileTestPValue(m = 5, n = seq(15, 45, by = 5), r = c(9, 3, 4, 4, 5, 5, 6), k = c(4, 2, 2, 2, 2, 2, 2)) round(p.vals, 3) #[1] 0.098 0.091 0.119 0.089 0.109 0.087 0.103 #========== # Clean up #--------- rm(p.vals)
Carbon monoxide (CO) emissions (ppm) from an oil refinery near San Francisco. The refinery submitted 31 daily measurements from its stack for the period April 16, 1993 through May 16, 1993 to the Bay Area Air Quality Management District (BAAQMD). The BAAQMD made nine of its own independent measurements for the period September 11, 1990 through March 30, 1993.
data(Refinery.CO.df)
A data frame with 40 observations on the following 3 variables.
CO.ppm
a numeric vector of CO emissions (ppm)
Source
a factor indicating the source of the measurement (BAAQMD or refinery)
Date
a Date object indicating the date the measurement was taken
Data and Story Library, http://lib.stat.cmu.edu/DASL/Datafiles/Refinery.html.
Zou, G.Y., C.Y. Huo, and J. Taleban. (2009). Simple Confidence Intervals for Lognormal Means and their Differences with Environmental Applications. Environmetrics, 20, 172–180.
Perform Rosner's generalized extreme Studentized deviate test for up to k
potential outliers in a dataset, assuming the data without any outliers come
from a normal (Gaussian) distribution.
rosnerTest(x, k = 3, alpha = 0.05, warn = TRUE)
x |
numeric vector of observations.
Missing ( |
k |
positive integer indicating the number of suspected outliers. The argument |
alpha |
numeric scalar between 0 and 1 indicating the Type I error associated with the
test of hypothesis. The default value is |
warn |
logical scalar indicating whether to issue a warning ( |
Let x_1, x_2, ..., x_n denote the n observations. We assume that n - k
of these observations come from the same normal (Gaussian) distribution, and
that the k most “extreme” observations may or may not represent observations
from a different distribution. For i = 0, 1, ..., k-1, let
x_1^(i), x_2^(i), ..., x_(n-i)^(i) denote the n - i
observations left after omitting the i most extreme observations, and let
xbar^(i) and s^(i) denote the
mean and standard deviation, respectively, of the n - i observations in the data
that remain after removing the i most extreme observations.
Thus, xbar^(0) and s^(0) denote the
mean and standard deviation for the full sample, and in general xbar^(i) and s^(i)
denote the mean and standard deviation of the reduced sample of size n - i.
For a specified value of i, the most extreme observation x^(i) is the one
that is the greatest distance from the mean for that data set, i.e., the one that
maximizes | x_j^(i) - xbar^(i) | over j = 1, ..., n-i.
Thus, an extreme observation may be the smallest or the largest one in that data set.
Rosner's test is based on the k statistics R_1, R_2, ..., R_k,
which represent the extreme Studentized deviates computed
from successively reduced samples of size n, n-1, ..., n-k+1:
R_(i+1) = | x^(i) - xbar^(i) | / s^(i), for i = 0, 1, ..., k-1
Critical values for R_(i+1) are denoted lambda_(i+1) and are computed from quantiles of
Student's t-distribution as in the standard generalized ESD procedure:
lambda_(i+1) = [ (n-i-1) * t_(p, n-i-2) ] / sqrt[ (n-i-2 + t_(p, n-i-2)^2) * (n-i) ]
where t_(p, nu) denotes the
p'th quantile of Student's t-distribution with nu
degrees of freedom, and in this case
p = 1 - alpha / [ 2 * (n-i) ]
where alpha denotes the Type I error level.
The algorithm for determining the number of outliers is as follows:
Compare R_k with lambda_k. If R_k > lambda_k then
conclude the k most extreme values are outliers.
If R_k <= lambda_k then compare R_(k-1) with lambda_(k-1).
If R_(k-1) > lambda_(k-1) then conclude the k-1 most extreme values
are outliers.
Continue in this fashion until a certain number of outliers have been identified or Rosner's test finds no outliers at all.
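The successive deviates and critical values can be sketched in a few lines of R (a rough illustration of the procedure described above, using the standard generalized ESD critical values; this is not the rosnerTest implementation, and the name gesd_sketch is made up):
# Compute R_1, ..., R_k from successively reduced samples together with the
# corresponding critical values lambda_1, ..., lambda_k.
gesd_sketch <- function(x, k = 3, alpha = 0.05) {
  n <- length(x)
  out <- data.frame(i = 0:(k - 1), R = NA_real_, lambda = NA_real_)
  y <- x
  for (i in 0:(k - 1)) {
    dev <- abs(y - mean(y))
    j <- which.max(dev)                    # most extreme remaining observation
    out$R[i + 1] <- dev[j] / sd(y)         # extreme Studentized deviate
    p  <- 1 - alpha / (2 * (n - i))
    tq <- qt(p, df = n - i - 2)
    out$lambda[i + 1] <- (n - i - 1) * tq / sqrt((n - i - 2 + tq^2) * (n - i))
    y <- y[-j]                             # drop it and repeat on the reduced sample
  }
  out
}
set.seed(250)
dat <- c(rnorm(30, mean = 3), 10, 12)      # two planted extreme values
gesd_sketch(dat, k = 4)
# The number of outliers is then determined by working backwards from the
# largest i with R > lambda, as described in the algorithm above.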
Based on a study using N=1,000 simulations, Rosner's (1983) Table 1 shows the estimated
true Type I error of declaring at least one outlier when none exists for various
sample sizes n ranging from 10 to 100, and the declared maximum number of outliers k
ranging from 1 to 10. Based on that table, Rosner (1983) declared that for an
assumed Type I error level of 0.05, as long as the sample size is at least 25, the estimated alpha
levels are quite close to 0.05, and that similar results were obtained
assuming a Type I error level of 0.01. However, the table below is an expanded version
of Rosner's (1983) Table 1 and shows results based on N=10,000 simulations.
You can see that for an assumed Type I error level of 0.05, the test maintains the Type I error
fairly well even for small sample sizes as long as k = 1, and for sample sizes
of at least 15 as long as k is at most 2.
Also, for an assumed Type I error level of 0.01, the test maintains the Type I error fairly
well for small sample sizes as long as k = 1.
Based on these results, when warn=TRUE
, a warning is issued for the following cases
indicating that the assumed Type I error may not be correct:
alpha
is greater than 0.01
, the sample size is less than 15, and
k
is greater than 1
.
alpha
is greater than 0.01
,
the sample size is at least 15 and less than 25, and
k
is greater than 2
.
alpha
is less than or equal to 0.01
, the sample size is less than 15, and
k
is greater than 1
.
k
is greater than 10
, or greater than the floor of half of the sample size
(i.e., greater than the greatest integer less than or equal to half of the sample size).
A warning is given for this case because simulations have not been done for this case.
Table 1a. Observed Type I Error Levels based on 10,000 Simulations, n = 3 to 5.
n | k | Observed alpha (Assumed alpha = 0.05) | 95% LCL | 95% UCL | Observed alpha (Assumed alpha = 0.01) | 95% LCL | 95% UCL |
3 | 1 | 0.047 | 0.043 | 0.051 | 0.009 | 0.007 | 0.01 |
4 | 1 | 0.049 | 0.045 | 0.053 | 0.010 | 0.008 | 0.012 |
2 | 0.107 | 0.101 | 0.113 | 0.021 | 0.018 | 0.024 | |
5 | 1 | 0.048 | 0.044 | 0.053 | 0.008 | 0.006 | 0.009 |
2 | 0.095 | 0.090 | 0.101 | 0.020 | 0.018 | 0.023 |
Table 1b. Observed Type I Error Levels based on 10,000 Simulations, n = 6 to 10.
n | k | Observed alpha (Assumed alpha = 0.05) | 95% LCL | 95% UCL | Observed alpha (Assumed alpha = 0.01) | 95% LCL | 95% UCL |
6 | 1 | 0.048 | 0.044 | 0.053 | 0.010 | 0.009 | 0.012 |
2 | 0.085 | 0.080 | 0.091 | 0.017 | 0.015 | 0.020 | |
3 | 0.141 | 0.134 | 0.148 | 0.028 | 0.025 | 0.031 | |
7 | 1 | 0.048 | 0.044 | 0.053 | 0.013 | 0.011 | 0.015 |
2 | 0.080 | 0.075 | 0.086 | 0.017 | 0.015 | 0.020 | |
3 | 0.112 | 0.106 | 0.118 | 0.022 | 0.019 | 0.025 | |
8 | 1 | 0.048 | 0.044 | 0.053 | 0.011 | 0.009 | 0.013 |
2 | 0.080 | 0.074 | 0.085 | 0.017 | 0.014 | 0.019 | |
3 | 0.102 | 0.096 | 0.108 | 0.020 | 0.017 | 0.023 | |
4 | 0.143 | 0.136 | 0.150 | 0.028 | 0.025 | 0.031 | |
9 | 1 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 |
2 | 0.069 | 0.064 | 0.074 | 0.014 | 0.012 | 0.016 | |
3 | 0.097 | 0.091 | 0.103 | 0.018 | 0.015 | 0.021 | |
4 | 0.120 | 0.114 | 0.126 | 0.024 | 0.021 | 0.027 | |
10 | 1 | 0.051 | 0.047 | 0.056 | 0.010 | 0.008 | 0.012 |
2 | 0.068 | 0.063 | 0.073 | 0.012 | 0.010 | 0.014 | |
3 | 0.085 | 0.080 | 0.091 | 0.015 | 0.013 | 0.017 | |
4 | 0.106 | 0.100 | 0.112 | 0.021 | 0.018 | 0.024 | |
5 | 0.135 | 0.128 | 0.142 | 0.025 | 0.022 | 0.028 |
Table 1c. Observed Type I Error Levels based on 10,000 Simulations, n = 11 to 15.
n | k | Observed alpha (Assumed alpha = 0.05) | 95% LCL | 95% UCL | Observed alpha (Assumed alpha = 0.01) | 95% LCL | 95% UCL |
11 | 1 | 0.052 | 0.048 | 0.056 | 0.012 | 0.010 | 0.014 |
2 | 0.070 | 0.065 | 0.075 | 0.014 | 0.012 | 0.017 | |
3 | 0.082 | 0.077 | 0.088 | 0.014 | 0.011 | 0.016 | |
4 | 0.101 | 0.095 | 0.107 | 0.019 | 0.016 | 0.021 | |
5 | 0.116 | 0.110 | 0.123 | 0.022 | 0.019 | 0.024 | |
12 | 1 | 0.052 | 0.047 | 0.056 | 0.011 | 0.009 | 0.013 |
2 | 0.067 | 0.062 | 0.072 | 0.011 | 0.009 | 0.013 | |
3 | 0.074 | 0.069 | 0.080 | 0.016 | 0.013 | 0.018 | |
4 | 0.088 | 0.082 | 0.093 | 0.016 | 0.014 | 0.019 | |
5 | 0.099 | 0.093 | 0.105 | 0.016 | 0.013 | 0.018 | |
6 | 0.117 | 0.111 | 0.123 | 0.021 | 0.018 | 0.023 | |
13 | 1 | 0.048 | 0.044 | 0.052 | 0.010 | 0.008 | 0.012 |
2 | 0.064 | 0.059 | 0.069 | 0.014 | 0.012 | 0.016 | |
3 | 0.070 | 0.065 | 0.075 | 0.013 | 0.011 | 0.015 | |
4 | 0.079 | 0.074 | 0.084 | 0.014 | 0.012 | 0.017 | |
5 | 0.088 | 0.083 | 0.094 | 0.015 | 0.013 | 0.018 | |
6 | 0.109 | 0.103 | 0.115 | 0.020 | 0.017 | 0.022 | |
14 | 1 | 0.046 | 0.042 | 0.051 | 0.009 | 0.007 | 0.011 |
2 | 0.062 | 0.057 | 0.066 | 0.012 | 0.010 | 0.014 | |
3 | 0.069 | 0.064 | 0.074 | 0.012 | 0.010 | 0.014 | |
4 | 0.077 | 0.072 | 0.082 | 0.015 | 0.013 | 0.018 | |
5 | 0.084 | 0.079 | 0.090 | 0.016 | 0.013 | 0.018 | |
6 | 0.091 | 0.085 | 0.097 | 0.017 | 0.014 | 0.019 | |
7 | 0.107 | 0.101 | 0.113 | 0.018 | 0.016 | 0.021 | |
15 | 1 | 0.054 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 |
2 | 0.057 | 0.053 | 0.062 | 0.010 | 0.008 | 0.012 | |
3 | 0.065 | 0.060 | 0.069 | 0.013 | 0.011 | 0.016 | |
4 | 0.073 | 0.068 | 0.078 | 0.014 | 0.011 | 0.016 | |
5 | 0.074 | 0.069 | 0.079 | 0.012 | 0.010 | 0.014 | |
6 | 0.086 | 0.081 | 0.092 | 0.015 | 0.013 | 0.017 | |
7 | 0.099 | 0.094 | 0.105 | 0.018 | 0.015 | 0.020 |
Table 1d. Observed Type I Error Levels based on 10,000 Simulations, n = 16 to 20.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
16 | 1 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 |
2 | 0.055 | 0.051 | 0.059 | 0.011 | 0.009 | 0.013 | |
3 | 0.068 | 0.063 | 0.073 | 0.011 | 0.009 | 0.013 | |
4 | 0.074 | 0.069 | 0.079 | 0.015 | 0.013 | 0.017 | |
5 | 0.077 | 0.072 | 0.082 | 0.015 | 0.013 | 0.018 | |
6 | 0.075 | 0.070 | 0.080 | 0.013 | 0.011 | 0.016 | |
7 | 0.087 | 0.082 | 0.093 | 0.017 | 0.014 | 0.020 | |
8 | 0.096 | 0.090 | 0.101 | 0.016 | 0.014 | 0.019 | |
17 | 1 | 0.047 | 0.043 | 0.051 | 0.008 | 0.007 | 0.010 |
2 | 0.059 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
3 | 0.062 | 0.057 | 0.067 | 0.012 | 0.010 | 0.014 | |
4 | 0.070 | 0.065 | 0.075 | 0.012 | 0.009 | 0.014 | |
5 | 0.069 | 0.064 | 0.074 | 0.012 | 0.010 | 0.015 | |
6 | 0.071 | 0.066 | 0.076 | 0.015 | 0.012 | 0.017 | |
7 | 0.081 | 0.076 | 0.087 | 0.014 | 0.012 | 0.016 | |
8 | 0.083 | 0.078 | 0.088 | 0.015 | 0.013 | 0.017 | |
18 | 1 | 0.051 | 0.047 | 0.055 | 0.010 | 0.008 | 0.012 |
2 | 0.056 | 0.052 | 0.061 | 0.012 | 0.010 | 0.014 | |
3 | 0.065 | 0.060 | 0.070 | 0.012 | 0.010 | 0.015 | |
4 | 0.065 | 0.060 | 0.070 | 0.013 | 0.011 | 0.015 | |
5 | 0.069 | 0.064 | 0.074 | 0.012 | 0.010 | 0.014 | |
6 | 0.068 | 0.063 | 0.073 | 0.014 | 0.011 | 0.016 | |
7 | 0.072 | 0.067 | 0.077 | 0.014 | 0.011 | 0.016 | |
8 | 0.076 | 0.071 | 0.081 | 0.012 | 0.010 | 0.014 | |
9 | 0.081 | 0.076 | 0.086 | 0.012 | 0.010 | 0.014 | |
19 | 1 | 0.051 | 0.046 | 0.055 | 0.008 | 0.006 | 0.010 |
2 | 0.059 | 0.055 | 0.064 | 0.012 | 0.010 | 0.014 | |
3 | 0.059 | 0.054 | 0.064 | 0.011 | 0.009 | 0.013 | |
4 | 0.061 | 0.057 | 0.066 | 0.012 | 0.010 | 0.014 | |
5 | 0.067 | 0.062 | 0.072 | 0.013 | 0.010 | 0.015 | |
6 | 0.066 | 0.061 | 0.071 | 0.011 | 0.009 | 0.013 | |
7 | 0.069 | 0.064 | 0.074 | 0.013 | 0.011 | 0.015 | |
8 | 0.074 | 0.069 | 0.079 | 0.012 | 0.010 | 0.014 | |
9 | 0.082 | 0.077 | 0.087 | 0.015 | 0.013 | 0.018 | |
20 | 1 | 0.053 | 0.048 | 0.057 | 0.011 | 0.009 | 0.013 |
2 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
3 | 0.060 | 0.056 | 0.065 | 0.009 | 0.007 | 0.011 | |
4 | 0.063 | 0.058 | 0.068 | 0.012 | 0.010 | 0.014 | |
5 | 0.063 | 0.059 | 0.068 | 0.014 | 0.011 | 0.016 | |
6 | 0.063 | 0.058 | 0.067 | 0.011 | 0.009 | 0.013 | |
7 | 0.065 | 0.061 | 0.070 | 0.011 | 0.009 | 0.013 | |
8 | 0.070 | 0.065 | 0.076 | 0.012 | 0.010 | 0.014 | |
9 | 0.076 | 0.070 | 0.081 | 0.013 | 0.011 | 0.015 | |
10 | 0.081 | 0.076 | 0.087 | 0.012 | 0.010 | 0.014 |
Table 1e. Observed Type I Error Levels based on 10,000 Simulations, n = 21 to 25.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
21 | 1 | 0.054 | 0.049 | 0.058 | 0.013 | 0.011 | 0.015 |
2 | 0.054 | 0.049 | 0.058 | 0.012 | 0.010 | 0.014 | |
3 | 0.058 | 0.054 | 0.063 | 0.012 | 0.010 | 0.014 | |
4 | 0.058 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
5 | 0.064 | 0.059 | 0.069 | 0.013 | 0.011 | 0.016 | |
6 | 0.066 | 0.061 | 0.071 | 0.012 | 0.010 | 0.015 | |
7 | 0.063 | 0.058 | 0.068 | 0.013 | 0.011 | 0.015 | |
8 | 0.066 | 0.061 | 0.071 | 0.010 | 0.008 | 0.012 | |
9 | 0.073 | 0.068 | 0.078 | 0.013 | 0.011 | 0.015 | |
10 | 0.071 | 0.066 | 0.076 | 0.012 | 0.010 | 0.014 | |
22 | 1 | 0.047 | 0.042 | 0.051 | 0.010 | 0.008 | 0.012 |
2 | 0.058 | 0.053 | 0.062 | 0.012 | 0.010 | 0.015 | |
3 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
4 | 0.059 | 0.055 | 0.064 | 0.012 | 0.010 | 0.014 | |
5 | 0.061 | 0.057 | 0.066 | 0.009 | 0.008 | 0.011 | |
6 | 0.063 | 0.058 | 0.068 | 0.013 | 0.010 | 0.015 | |
7 | 0.065 | 0.060 | 0.070 | 0.013 | 0.010 | 0.015 | |
8 | 0.065 | 0.060 | 0.070 | 0.014 | 0.012 | 0.016 | |
9 | 0.065 | 0.060 | 0.070 | 0.012 | 0.010 | 0.014 | |
10 | 0.067 | 0.062 | 0.072 | 0.012 | 0.009 | 0.014 | |
23 | 1 | 0.051 | 0.047 | 0.056 | 0.008 | 0.007 | 0.010 |
2 | 0.056 | 0.052 | 0.061 | 0.010 | 0.009 | 0.012 | |
3 | 0.056 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
4 | 0.062 | 0.057 | 0.066 | 0.011 | 0.009 | 0.013 | |
5 | 0.061 | 0.056 | 0.065 | 0.010 | 0.009 | 0.012 | |
6 | 0.060 | 0.055 | 0.064 | 0.012 | 0.010 | 0.014 | |
7 | 0.062 | 0.057 | 0.066 | 0.011 | 0.009 | 0.013 | |
8 | 0.063 | 0.058 | 0.068 | 0.012 | 0.010 | 0.014 | |
9 | 0.066 | 0.061 | 0.071 | 0.012 | 0.010 | 0.014 | |
10 | 0.068 | 0.063 | 0.073 | 0.014 | 0.012 | 0.017 | |
24 | 1 | 0.051 | 0.046 | 0.055 | 0.010 | 0.008 | 0.012 |
2 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
3 | 0.058 | 0.053 | 0.062 | 0.010 | 0.008 | 0.012 | |
4 | 0.060 | 0.056 | 0.065 | 0.013 | 0.011 | 0.015 | |
5 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
6 | 0.065 | 0.060 | 0.069 | 0.011 | 0.009 | 0.013 | |
7 | 0.062 | 0.057 | 0.066 | 0.012 | 0.010 | 0.014 | |
8 | 0.060 | 0.055 | 0.065 | 0.012 | 0.010 | 0.014 | |
9 | 0.066 | 0.061 | 0.071 | 0.012 | 0.010 | 0.014 | |
10 | 0.064 | 0.059 | 0.068 | 0.012 | 0.010 | 0.015 | |
25 | 1 | 0.054 | 0.050 | 0.059 | 0.012 | 0.009 | 0.014 |
2 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
3 | 0.057 | 0.052 | 0.062 | 0.011 | 0.009 | 0.013 | |
4 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
5 | 0.060 | 0.055 | 0.065 | 0.012 | 0.010 | 0.014 | |
6 | 0.060 | 0.055 | 0.064 | 0.011 | 0.009 | 0.013 | |
7 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
8 | 0.062 | 0.058 | 0.067 | 0.011 | 0.009 | 0.013 | |
9 | 0.058 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
10 | 0.061 | 0.057 | 0.066 | 0.010 | 0.008 | 0.012 |
Table 1f. Observed Type I Error Levels based on 10,000 Simulations, n = 26 to 30.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
26 | 1 | 0.051 | 0.047 | 0.055 | 0.012 | 0.010 | 0.014 |
2 | 0.057 | 0.053 | 0.062 | 0.013 | 0.011 | 0.015 | |
3 | 0.055 | 0.050 | 0.059 | 0.012 | 0.010 | 0.014 | |
4 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
5 | 0.058 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
6 | 0.061 | 0.056 | 0.066 | 0.012 | 0.010 | 0.014 | |
7 | 0.059 | 0.054 | 0.064 | 0.011 | 0.009 | 0.013 | |
8 | 0.060 | 0.056 | 0.065 | 0.010 | 0.008 | 0.012 | |
9 | 0.060 | 0.056 | 0.065 | 0.011 | 0.009 | 0.013 | |
10 | 0.061 | 0.056 | 0.065 | 0.011 | 0.009 | 0.013 | |
27 | 1 | 0.050 | 0.046 | 0.054 | 0.009 | 0.007 | 0.011 |
2 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
3 | 0.062 | 0.057 | 0.066 | 0.012 | 0.010 | 0.014 | |
4 | 0.063 | 0.058 | 0.068 | 0.011 | 0.009 | 0.013 | |
5 | 0.051 | 0.047 | 0.055 | 0.010 | 0.008 | 0.012 | |
6 | 0.058 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
7 | 0.060 | 0.056 | 0.065 | 0.010 | 0.008 | 0.012 | |
8 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
9 | 0.061 | 0.056 | 0.066 | 0.012 | 0.010 | 0.014 | |
10 | 0.055 | 0.051 | 0.060 | 0.008 | 0.006 | 0.010 | |
28 | 1 | 0.049 | 0.045 | 0.053 | 0.010 | 0.008 | 0.011 |
2 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
3 | 0.056 | 0.052 | 0.061 | 0.012 | 0.009 | 0.014 | |
4 | 0.057 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
5 | 0.057 | 0.053 | 0.062 | 0.010 | 0.008 | 0.012 | |
6 | 0.056 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
7 | 0.057 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
8 | 0.058 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
9 | 0.054 | 0.050 | 0.058 | 0.011 | 0.009 | 0.013 | |
10 | 0.062 | 0.057 | 0.067 | 0.011 | 0.009 | 0.013 | |
29 | 1 | 0.049 | 0.045 | 0.053 | 0.011 | 0.009 | 0.013 |
2 | 0.053 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
3 | 0.056 | 0.051 | 0.060 | 0.010 | 0.009 | 0.012 | |
4 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
5 | 0.056 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
6 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
7 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
8 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
9 | 0.056 | 0.051 | 0.061 | 0.011 | 0.009 | 0.013 | |
10 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
30 | 1 | 0.050 | 0.046 | 0.054 | 0.009 | 0.007 | 0.011 |
2 | 0.054 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
3 | 0.056 | 0.052 | 0.061 | 0.012 | 0.010 | 0.015 | |
4 | 0.054 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
5 | 0.058 | 0.053 | 0.063 | 0.012 | 0.010 | 0.014 | |
6 | 0.062 | 0.058 | 0.067 | 0.012 | 0.010 | 0.014 | |
7 | 0.056 | 0.052 | 0.061 | 0.012 | 0.010 | 0.014 | |
8 | 0.059 | 0.054 | 0.064 | 0.011 | 0.009 | 0.013 | |
9 | 0.056 | 0.052 | 0.061 | 0.010 | 0.009 | 0.012 | |
10 | 0.058 | 0.053 | 0.062 | 0.012 | 0.010 | 0.015 |
Table 1g. Observed Type I Error Levels based on 10,000 Simulations, n = 31 to 35.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
31 | 1 | 0.051 | 0.047 | 0.056 | 0.009 | 0.007 | 0.011 |
2 | 0.054 | 0.050 | 0.059 | 0.010 | 0.009 | 0.012 | |
3 | 0.053 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
4 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
5 | 0.053 | 0.049 | 0.057 | 0.011 | 0.009 | 0.013 | |
6 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
7 | 0.055 | 0.050 | 0.059 | 0.012 | 0.010 | 0.014 | |
8 | 0.056 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
9 | 0.057 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
10 | 0.058 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
32 | 1 | 0.054 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 |
2 | 0.054 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
3 | 0.052 | 0.047 | 0.056 | 0.009 | 0.007 | 0.011 | |
4 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
5 | 0.056 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
6 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
7 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
8 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
9 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
10 | 0.054 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
33 | 1 | 0.051 | 0.046 | 0.055 | 0.011 | 0.009 | 0.013 |
2 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
3 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
4 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
5 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
6 | 0.058 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
7 | 0.057 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
8 | 0.058 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
9 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
10 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
34 | 1 | 0.052 | 0.048 | 0.056 | 0.009 | 0.007 | 0.011 |
2 | 0.053 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
3 | 0.055 | 0.050 | 0.059 | 0.012 | 0.010 | 0.014 | |
4 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
5 | 0.053 | 0.048 | 0.057 | 0.009 | 0.007 | 0.011 | |
6 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
7 | 0.052 | 0.048 | 0.057 | 0.012 | 0.010 | 0.014 | |
8 | 0.055 | 0.050 | 0.059 | 0.009 | 0.008 | 0.011 | |
9 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
10 | 0.054 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
35 | 1 | 0.051 | 0.046 | 0.055 | 0.010 | 0.009 | 0.012 |
2 | 0.054 | 0.049 | 0.058 | 0.010 | 0.009 | 0.012 | |
3 | 0.055 | 0.050 | 0.059 | 0.010 | 0.009 | 0.012 | |
4 | 0.053 | 0.048 | 0.057 | 0.011 | 0.009 | 0.013 | |
5 | 0.056 | 0.051 | 0.061 | 0.011 | 0.009 | 0.013 | |
6 | 0.055 | 0.051 | 0.059 | 0.012 | 0.010 | 0.014 | |
7 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
8 | 0.054 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
9 | 0.061 | 0.056 | 0.066 | 0.012 | 0.010 | 0.014 | |
10 | 0.053 | 0.048 | 0.057 | 0.011 | 0.009 | 0.013 |
Table 1h. Observed Type I Error Levels based on 10,000 Simulations, n = 36 to 40.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
36 | 1 | 0.047 | 0.043 | 0.051 | 0.010 | 0.008 | 0.012 |
2 | 0.058 | 0.053 | 0.062 | 0.012 | 0.010 | 0.015 | |
3 | 0.052 | 0.047 | 0.056 | 0.009 | 0.007 | 0.011 | |
4 | 0.052 | 0.048 | 0.056 | 0.012 | 0.010 | 0.014 | |
5 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
6 | 0.055 | 0.051 | 0.059 | 0.012 | 0.010 | 0.014 | |
7 | 0.053 | 0.048 | 0.057 | 0.011 | 0.009 | 0.013 | |
8 | 0.056 | 0.051 | 0.060 | 0.012 | 0.010 | 0.014 | |
9 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
10 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
37 | 1 | 0.050 | 0.046 | 0.055 | 0.010 | 0.008 | 0.012 |
2 | 0.054 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
3 | 0.054 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
4 | 0.054 | 0.050 | 0.058 | 0.010 | 0.008 | 0.012 | |
5 | 0.054 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
6 | 0.054 | 0.050 | 0.058 | 0.011 | 0.009 | 0.013 | |
7 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
8 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
9 | 0.053 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
10 | 0.049 | 0.045 | 0.054 | 0.009 | 0.007 | 0.011 | |
38 | 1 | 0.049 | 0.045 | 0.053 | 0.009 | 0.007 | 0.011 |
2 | 0.052 | 0.047 | 0.056 | 0.008 | 0.007 | 0.010 | |
3 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
4 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
5 | 0.056 | 0.052 | 0.061 | 0.012 | 0.010 | 0.014 | |
6 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
7 | 0.049 | 0.045 | 0.053 | 0.009 | 0.007 | 0.011 | |
8 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
9 | 0.054 | 0.050 | 0.059 | 0.010 | 0.009 | 0.012 | |
10 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
39 | 1 | 0.047 | 0.043 | 0.051 | 0.010 | 0.008 | 0.012 |
2 | 0.055 | 0.051 | 0.059 | 0.010 | 0.008 | 0.012 | |
3 | 0.053 | 0.049 | 0.057 | 0.010 | 0.008 | 0.012 | |
4 | 0.053 | 0.049 | 0.058 | 0.010 | 0.009 | 0.012 | |
5 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
6 | 0.053 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
7 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
8 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
9 | 0.050 | 0.046 | 0.055 | 0.010 | 0.008 | 0.012 | |
10 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
40 | 1 | 0.049 | 0.045 | 0.054 | 0.010 | 0.008 | 0.012 |
2 | 0.052 | 0.048 | 0.057 | 0.010 | 0.009 | 0.012 | |
3 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
4 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
5 | 0.054 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
6 | 0.049 | 0.045 | 0.053 | 0.010 | 0.008 | 0.012 | |
7 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
8 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
9 | 0.047 | 0.043 | 0.052 | 0.010 | 0.008 | 0.011 | |
10 | 0.058 | 0.054 | 0.063 | 0.010 | 0.008 | 0.012 |
A list of class "gofOutlier"
containing the results of the hypothesis test.
See the help file for gofOutlier.object
for details.
Rosner's test is a commonly used test for “outliers” when you are willing to assume that the data without outliers follows a normal (Gaussian) distribution. It is designed to avoid masking, which occurs when an outlier goes undetected because it is close in value to another outlier.
Rosner's test is a kind of discordancy test (Barnett and Lewis, 1995). The test statistic of a discordancy test is usually a ratio: the numerator is the difference between the suspected outlier and some summary statistic of the data set (e.g., mean, next largest observation, etc.), while the denominator is always a measure of spread within the data (e.g., standard deviation, range, etc.). Both USEPA (2009) and USEPA (2013a,b) discuss two commonly used discordancy tests: Dixon's test and Rosner's test. Both of these tests assume that all of the data that are not outliers come from a normal (Gaussian) distribution.
There are many forms of Dixon's test (Barnett and Lewis, 1995). The one presented in USEPA (2009) and USEPA (2013a,b) assumes just one outlier (Dixon, 1953). This test is vulnerable to "masking", in which the presence of several outliers hides the fact that even one outlier is present. There are also other forms of Dixon's test that allow for more than one outlier based on a sequence of sub-tests, but these tests are also vulnerable to masking.
Rosner's test allows you to test for several possible outliers and avoids the problem of masking. Rosner's test requires you to set the number of suspected outliers, k, in advance. As in the case of Dixon's test, there are several forms of Rosner's test, so you need to be aware of which one you are using. The form of Rosner's test presented in USEPA (2009) is based on the extreme Studentized deviate (ESD) (Rosner, 1975), whereas the form of Rosner's test performed by the EnvStats function
rosnerTest
and presented in USEPA (2013a,b) is based on the generalized ESD (Rosner, 1983; Gilbert, 1987). USEPA (2013a, p. 190) cites both Rosner (1975) and Rosner (1983), but presents only the test given in Rosner (1983). Rosner's test based on the ESD has the appropriate Type I error level if there are no outliers in the dataset, but if there are actually, say, k' outliers, where k' < k, then the ESD version of Rosner's test tends to declare more than k' outliers with a probability that is greater than the stated Type I error level (referred to as "swamping"). Rosner's test based on the generalized ESD fixes this problem. USEPA (2013a, pp. 17, 191) incorrectly states that the generalized ESD version of Rosner's test is vulnerable to masking. Surprisingly, the well-known book on statistical outliers by Barnett and Lewis (1995) does not discuss Rosner's generalized ESD test.
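As a rough illustration of the practical difference the choice of k can make, the following sketch (simulated data, purely illustrative; it assumes the EnvStats package is loaded) runs the test with one and with two suspected outliers on a sample containing two similar high values:

# Simulated data: 20 "clean" observations plus two similar high values.
set.seed(47)
dat <- c(rnorm(20, mean = 3, sd = 1), 7.8, 8.1)

# With k = 1 only the single most extreme value is examined, so a second
# extreme value is never tested (the masking issue discussed above).
rosnerTest(dat, k = 1)

# With k = 2 both suspect values are examined in turn.
rosnerTest(dat, k = 2)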
As noted, using Rosner's test requires specifying the number of suspected outliers, k, in advance. USEPA (2013a, pp.190-191) states:
“A graphical display (Q-Q plot) can be used to identify suspected outliers
needed to perform the Rosner test”, and USEPA (2009, p. 12-11) notes:
“A potential drawback of Rosner's test is that the user must first identify
the maximum number of potential outliers (k) prior to running the test. Therefore,
this requirement makes the test ill-advised as an automatic outlier screening tool,
and somewhat reliant on the user to identify candidate outliers.”
When observations contain non-detect values (NDs), USEPA (2013a, p. 191) states:
“one may replace the NDs by their respective detection limits (DLs), DL/2, or may
just ignore them ....” This is bad advice, as this method of dealing with non-detects
will produce Type I error rates that are not correct.
OUTLIERS ARE NOT NECESSARILY INCORRECT VALUES
Whether an observation is an “outlier” depends on the underlying assumed
statistical model. McBean and Rovers (1992) state:
“It may be possible to ignore the outlier if a physical rationale is available but,
failing that, the value must be included .... Note that the use of statistics does not
interpret the facts, it simply makes the facts easier to see. Therefore, it is incumbent
on the analyst to identify whether or not the high value ... is truly representative of
the chemical being monitored or, instead, is an outlier for reasons such as a result of
sampling or laboratory error.”
USEPA (2006, p.51) states:
“If scientific reasoning does not explain the outlier, it should not be
discarded from the data set.”
Finally, an editorial by the Editor-in-Chief of the journal Science deals with this topic (McNutt, 2014).
You can use the functions qqPlot
and gofTest
to explore
other possible statistical models for the data, or you can use nonparametric statistics
if you do not want to assume a particular distribution.
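For example, a minimal sketch along these lines (illustrative data only; it assumes the EnvStats package is loaded, and the particular calls are offered as an illustration rather than a prescription) is:

# Positively skewed data that a normal model would tend to flag as
# containing "outliers".
set.seed(123)
dat <- rlnorm(30, meanlog = 1, sdlog = 0.8)

# Graphical check of a lognormal model for the full data set.
qqPlot(dat, dist = "lnorm", add.line = TRUE)

# Formal goodness-of-fit test of the lognormal model.
gofTest(dat, distribution = "lnorm", test = "sw")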
Steven P. Millard ([email protected])
Barnett, V., and T. Lewis. (1995). Outliers in Statistical Data. Third Edition. John Wiley & Sons, Chichester, UK, pp. 235–236.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY, pp.188–191.
McBean, E.A, and F.A. Rovers. (1992). Estimation of the Probability of Exceedance of Contaminant Concentrations. Ground Water Monitoring Review Winter, pp. 115–119.
McNutt, M. (2014). Raising the Bar. Science 345(6192), p. 9.
Rosner, B. (1975). On the Detection of Many Outliers. Technometrics 17, 221–227.
Rosner, B. (1983). Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics 25, 165–172.
USEPA. (2006). Data Quality Assessment: A Reviewer's Guide. EPA QA/G-9R. EPA/240/B-06/002, February 2006. Office of Environmental Information, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C., pp. 12-10 to 12-14.
USEPA. (2013a). ProUCL Version 5.0.00 Technical Guide. EPA/600/R-07/041, September 2013. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 190–195.
USEPA. (2013b). ProUCL Version 5.0.00 User Guide. EPA/600/R-07/041, September 2013. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 190–195.
gofTest, gofOutlier.object, print.gofOutlier, Normal, qqPlot.
# Combine 30 observations from a normal distribution with mean 3 and # standard deviation 2, with 3 observations from a normal distribution # with mean 10 and standard deviation 1, then run Rosner's Test on these # data, specifying k=4 potential outliers based on looking at the # normal Q-Q plot. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- c(rnorm(30, mean = 3, sd = 2), rnorm(3, mean = 10, sd = 1)) dev.new() qqPlot(dat) rosnerTest(dat, k = 4) #Results of Outlier Test #------------------------- # #Test Method: Rosner's Test for Outliers # #Hypothesized Distribution: Normal # #Data: dat # #Sample Size: 33 # #Test Statistics: R.1 = 2.848514 # R.2 = 3.086875 # R.3 = 3.033044 # R.4 = 2.380235 # #Test Statistic Parameter: k = 4 # #Alternative Hypothesis: Up to 4 observations are not # from the same Distribution. # #Type I Error: 5% # #Number of Outliers Detected: 3 # # i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier #1 0 3.549744 2.531011 10.7593656 33 2.848514 2.951949 TRUE #2 1 3.324444 2.209872 10.1460427 31 3.086875 2.938048 TRUE #3 2 3.104392 1.856109 8.7340527 32 3.033044 2.923571 TRUE #4 3 2.916737 1.560335 -0.7972275 25 2.380235 2.908473 FALSE #---------- # Clean up rm(dat) graphics.off() #-------------------------------------------------------------------- # Example 12-4 of USEPA (2009, page 12-12) gives an example of # using Rosner's test to test for outliers in napthalene measurements (ppb) # taken at 5 background wells over 5 quarters. The data for this example # are stored in EPA.09.Ex.12.4.naphthalene.df. EPA.09.Ex.12.4.naphthalene.df # Quarter Well Naphthalene.ppb #1 1 BW.1 3.34 #2 2 BW.1 5.39 #3 3 BW.1 5.74 # ... #23 3 BW.5 5.53 #24 4 BW.5 4.42 #25 5 BW.5 35.45 longToWide(EPA.09.Ex.12.4.naphthalene.df, "Naphthalene.ppb", "Quarter", "Well", paste.row.name = TRUE) # BW.1 BW.2 BW.3 BW.4 BW.5 #Quarter.1 3.34 5.59 1.91 6.12 8.64 #Quarter.2 5.39 5.96 1.74 6.05 5.34 #Quarter.3 5.74 1.47 23.23 5.18 5.53 #Quarter.4 6.88 2.57 1.82 4.43 4.42 #Quarter.5 5.85 5.39 2.02 1.00 35.45 # Look at Q-Q plots for both the raw and log-transformed data #------------------------------------------------------------ dev.new() with(EPA.09.Ex.12.4.naphthalene.df, qqPlot(Naphthalene.ppb, add.line = TRUE, main = "Figure 12-6. Naphthalene Probability Plot")) dev.new() with(EPA.09.Ex.12.4.naphthalene.df, qqPlot(Naphthalene.ppb, dist = "lnorm", add.line = TRUE, main = "Figure 12-7. Log Naphthalene Probability Plot")) # Test for 2 potential outliers on the original scale: #----------------------------------------------------- with(EPA.09.Ex.12.4.naphthalene.df, rosnerTest(Naphthalene.ppb, k = 2)) #Results of Outlier Test #------------------------- # #Test Method: Rosner's Test for Outliers # #Hypothesized Distribution: Normal # #Data: Naphthalene.ppb # #Sample Size: 25 # #Test Statistics: R.1 = 3.930957 # R.2 = 4.160223 # #Test Statistic Parameter: k = 2 # #Alternative Hypothesis: Up to 2 observations are not # from the same Distribution. # #Type I Error: 5% # #Number of Outliers Detected: 2 # # i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier #1 0 6.44240 7.379271 35.45 25 3.930957 2.821681 TRUE #2 1 5.23375 4.325790 23.23 13 4.160223 2.801551 TRUE #---------- # Clean up graphics.off()
serialCorrelationTest
is a generic function used to test for the
presence of lag-one serial correlation using either the rank
von Neumann ratio test, the normal approximation based on the Yule-Walker
estimate of lag-one correlation, or the normal approximation based on the
MLE of lag-one correlation. The function invokes particular
methods
which depend on the class
of the first
argument.
Currently, there is a default method and a method for objects of class "lm"
.
serialCorrelationTest(x, ...) ## Default S3 method: serialCorrelationTest(x, test = "rank.von.Neumann", alternative = "two.sided", conf.level = 0.95, ...) ## S3 method for class 'lm' serialCorrelationTest(x, test = "rank.von.Neumann", alternative = "two.sided", conf.level = 0.95, ...)
x |
numeric vector of observations, a numeric univariate time series of class "ts", or an object of class "lm" (see Details). |
test |
character string indicating which test to use. The possible values are "rank.von.Neumann" (the default), "AR1.yw", and "AR1.mle". See the Details section below. |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are "two.sided" (the default), "greater", and "less". |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence interval for the population lag-one autocorrelation. The default value is conf.level=0.95. |
... |
optional arguments for possible future methods. Currently not used. |
Let x_1, x_2, \ldots, x_n denote n observations from a stationary time series sampled at equispaced points in time with normal (Gaussian) errors. The function
serialCorrelationTest
tests the null hypothesis:

H_0: \rho = 0 \;\;\;\; (1)

where \rho denotes the true lag-1 autocorrelation (also called the lag-1 serial correlation coefficient). Actually, the null hypothesis is that the lag-k autocorrelation is 0 for all values of k greater than 0 (i.e., the time series is purely random).
In the case when the argument x
is a linear model, the function
serialCorrelationTest
tests the null hypothesis (1) for the
residuals.
The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

H_a: \rho > 0 \;\;\;\; (2)

the lower one-sided alternative (alternative="less"):

H_a: \rho < 0 \;\;\;\; (3)

and the two-sided alternative:

H_a: \rho \ne 0 \;\;\;\; (4)
Testing the Null Hypothesis of No Lag-1 Autocorrelation
There are several possible methods for testing the null hypothesis (1) versus any
of the three alternatives (2)-(4). The function serialCorrelationTest
allows
you to use one of three possible tests:
The rank von Neumann ratio test.
The test based on the normal approximation for the distribution of the Yule-Walker estimate of lag-one correlation.
The test based on the normal approximation for the distribution of the maximum likelihood estimate (MLE) of lag-one correlation.
Each of these tests is described below.
Test Based on Yule-Walker Estimate (test="AR1.yw"
)
The Yule-Walker estimate of the lag-1 autocorrelation is given by:

\hat{\rho} = \frac{\hat{\gamma}_1}{\hat{\gamma}_0} \;\;\;\; (5)

where

\hat{\gamma}_k = \frac{1}{n} \sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x}) \;\;\;\; (6)

is the estimate of the lag-k autocovariance. (This estimator does not allow for missing values.)

Under the null hypothesis (1), the estimator of lag-1 correlation in Equation (5) is approximately distributed as a normal (Gaussian) random variable with mean 0 and variance given by:

Var(\hat{\rho}) \approx \frac{1}{n} \;\;\;\; (7)

(Box and Jenkins, 1976, pp.34-35). Thus, the null hypothesis (1) can be tested with the statistic

z = \sqrt{n} \, \hat{\rho} \;\;\;\; (8)

which is distributed approximately as a standard normal random variable under the null hypothesis that the lag-1 autocorrelation is 0.
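To make the computation concrete, here is a minimal sketch (simulated data; variable names are purely illustrative) that carries out the Yule-Walker calculation and the z-test described above directly:

# Simulate a purely random series.
set.seed(1)
x <- rnorm(50)
n <- length(x)
x.bar <- mean(x)

# Yule-Walker estimates of the lag-0 and lag-1 autocovariances.
gamma0 <- sum((x - x.bar)^2) / n
gamma1 <- sum((x[-n] - x.bar) * (x[-1] - x.bar)) / n

# Yule-Walker estimate of the lag-1 autocorrelation and the z-statistic.
rho.hat <- gamma1 / gamma0
z <- sqrt(n) * rho.hat           # Var(rho.hat) is approximately 1/n under H0
p.value <- 2 * pnorm(-abs(z))    # two-sided p-value
c(rho.hat = rho.hat, z = z, p.value = p.value)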
Test Based on the MLE (test="AR1.mle"
)
The function serialCorrelationTest uses the R function arima to
compute the MLE of the lag-one autocorrelation and the estimated variance of this
estimator. As for the test based on the Yule-Walker estimate, the z-statistic is
computed as the estimated lag-one autocorrelation divided by the square root of the
estimated variance.
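A minimal sketch of the same idea using arima directly (simulated data; serialCorrelationTest handles the details, including missing values, internally):

# Simulate an AR(1) series with moderate positive autocorrelation.
set.seed(7)
y <- as.numeric(arima.sim(model = list(ar = 0.4), n = 60))

# MLE of the lag-1 autocorrelation and its estimated standard error.
fit <- arima(y, order = c(1, 0, 0))
rho.mle <- coef(fit)["ar1"]
se.mle <- sqrt(vcov(fit)["ar1", "ar1"])

# z-statistic and two-sided p-value.
z <- unname(rho.mle / se.mle)
c(rho = unname(rho.mle), z = z, p.value = 2 * pnorm(-abs(z)))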
Test Based on Rank von Neumann Ratio (test="rank.von.Neumann"
)
The null distribution of the serial correlation coefficient may be badly affected
by departures from normality in the underlying process (Cox, 1966; Bartels, 1977).
It is therefore a good idea to consider using a nonparametric test for randomness if
the normality of the underlying process is in doubt (Bartels, 1982).
Wald and Wolfowitz (1943) introduced the rank serial correlation coefficient, which for lag-1 autocorrelation is simply the Yule-Walker estimate (Equation (5) above) with the actual observations replaced with their ranks.
von Neumann et al. (1941) introduced a test for randomness in the context of testing for trend in the mean of a process. Their statistic is given by:

V = \frac{\sum_{i=2}^{n} (x_i - x_{i-1})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \;\;\;\; (9)

which is the ratio of the sum of squared successive differences to the usual sum of squared deviations from the mean. This statistic is bounded between 0 and 4, and for a purely random process is symmetric about 2. Small values of this statistic indicate possible positive autocorrelation, and large values of this statistic indicate possible negative autocorrelation. Durbin and Watson (1950, 1951, 1971) proposed using this statistic in the context of checking the independence of residuals from a linear regression model and provided tables for the distribution of this statistic. This statistic is therefore often called the "Durbin-Watson statistic" (Draper and Smith, 1998, p.181).
The rank version of the von Neumann ratio statistic is given by:

V_{rank} = \frac{\sum_{i=2}^{n} (R_i - R_{i-1})^2}{\sum_{i=1}^{n} (R_i - \bar{R})^2} \;\;\;\; (10)

where R_i denotes the rank of the i'th observation (Bartels, 1982). (This test statistic does not allow for missing values.) In the absence of ties, the denominator of this test statistic is equal to n(n^2 - 1)/12. The exact bounds of the test statistic depend on whether n is even or odd, with only a negligible adjustment needed for odd n (Bartels, 1982); asymptotically the range is from 0 to 4, just as for the test statistic in Equation (9) above.
Bartels (1982) shows that asymptotically, the rank von Neumann ratio statistic is a linear transformation of the rank serial correlation coefficient, so any asymptotic results apply to both statistics.
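The rank von Neumann ratio statistic itself is straightforward to compute; a minimal sketch (simulated data with no ties):

# Compute the RVN statistic by hand for a purely random series.
set.seed(2)
x <- rnorm(30)
r <- rank(x)
RVN <- sum(diff(r)^2) / sum((r - mean(r))^2)
RVN                                   # should be near 2 for a random series

# In the absence of ties the denominator equals n * (n^2 - 1) / 12.
n <- length(x)
all.equal(sum((r - mean(r))^2), n * (n^2 - 1) / 12)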
For any fixed sample size n, the exact distribution of the statistic in Equation (10) above can be computed by simply computing the value of the statistic for all possible permutations of the serial order of the ranks. Based on this exact distribution, Bartels (1982) presents a table of critical values for the numerator of the RVN statistic for sample sizes between 4 and 10.

Determining the exact distribution of the statistic becomes impractical as the sample size increases. For values of n between 10 and 100, Bartels (1982) approximated the distribution of the statistic by a beta distribution over the range 0 to 4, with shape parameters shape1 and shape2 that depend only on the sample size n (see Bartels, 1982, for the exact expressions).
Bartels (1982) checked this approximation by simulating the distribution of the RVN statistic for two sample sizes and comparing the empirical quantiles at five probability levels with the approximated quantiles based on the beta distribution. He found that the quantiles agreed to 2 decimal places for eight of the 10 values, and differed only slightly for the other two values.
Note: The definition of the beta distribution assumes the random variable ranges from 0 to 1. This definition can be generalized as follows. Suppose the random variable Y has a beta distribution over the range a \le y \le b, with shape parameters \nu and \omega. Then the random variable Z defined as:

Z = \frac{Y - a}{b - a}

has the "standard beta distribution" as described in the help file for Beta (Johnson et al., 1995, p.210).
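As a quick numerical illustration of this rescaling (the interval endpoints and shape parameters here are arbitrary):

# A beta variate on [0, 4] rescaled back to the standard interval [0, 1].
set.seed(10)
a <- 0; b <- 4
y <- a + (b - a) * rbeta(5, shape1 = 3, shape2 = 3)   # beta on [a, b]
z <- (y - a) / (b - a)                                # standard beta on [0, 1]
cbind(y, z)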
Bartels (1982) shows that asymptotically, the RVN statistic has a normal distribution with mean 2 and variance 4/n, but notes that a slightly better approximation is given by using a variance of 20/(5n + 7).
To test the null hypothesis (1) when test="rank.von.Neumann"
, the function serialCorrelationTest
does the following:
When the sample size is between 3 and 10, the exact distribution of the RVN statistic is used to compute the p-value.

When the sample size is between 11 and 100, the beta approximation to the distribution of the RVN statistic is used to compute the p-value.

When the sample size is larger than 100, the normal approximation to the distribution of the RVN statistic is used to compute the p-value. (This uses the variance 20/(5n + 7).)
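A minimal sketch (simulated data; it assumes the EnvStats package is loaded) showing that the reported test name records which of these approximations was used:

# The method component of the returned "htest" object names the test and
# the approximation used for the p-value.
set.seed(6)
serialCorrelationTest(rnorm(8))$method     # n <= 10: exact distribution
serialCorrelationTest(rnorm(50))$method    # 11 <= n <= 100: beta approximation
serialCorrelationTest(rnorm(150))$method   # n > 100: normal approximation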
When ties are present in the observations and midranks are used for the tied
observations, the distribution of the statistic based on the
assumption of no ties is not applicable. If the number of ties is small, however,
they may not grossly affect the assumed p-value.
When ties are present, the function serialCorrelationTest
issues a warning.
When the sample size is between 3 and 10, the p-value is computed based on rounding up the computed value of the RVN statistic to the nearest possible value that could be observed in the case of no ties.
Computing a Confidence Interval for the Lag-1 Autocorrelation
The function serialCorrelationTest computes an approximate (1 - \alpha)100\% confidence interval for the lag-1 autocorrelation as follows:

[\hat{\rho} - z_{1-\alpha/2} \hat{\sigma}_{\hat{\rho}}, \;\; \hat{\rho} + z_{1-\alpha/2} \hat{\sigma}_{\hat{\rho}}]

where \hat{\sigma}_{\hat{\rho}} denotes the estimated standard deviation of the estimated lag-1 autocorrelation and z_p denotes the p'th quantile of the standard normal distribution.
When test="AR1.yw" or test="rank.von.Neumann", the Yule-Walker estimate of lag-1 autocorrelation is used, and the variance of the estimated lag-1 autocorrelation is approximately:

Var(\hat{\rho}) \approx \frac{1 - \rho^2}{n}

(Box and Jenkins, 1976, p.34), so

\hat{\sigma}_{\hat{\rho}} = \sqrt{\frac{1 - \hat{\rho}^2}{n}}
When test="AR1.mle"
, the MLE of the lag-1 autocorrelation is used, and its
standard deviation is estimated with the square root of the estimated variance
returned by arima
.
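A minimal sketch of this confidence interval computed by hand for the Yule-Walker case (simulated data; variable names are illustrative):

# Simulate an AR(1) series.
set.seed(5)
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 80))
n <- length(y)
y.bar <- mean(y)

# Yule-Walker estimate of the lag-1 autocorrelation (the 1/n factors cancel).
rho.hat <- sum((y[-n] - y.bar) * (y[-1] - y.bar)) / sum((y - y.bar)^2)

# Approximate 95% confidence interval based on the normal approximation.
se <- sqrt((1 - rho.hat^2) / n)
rho.hat + c(-1, 1) * qnorm(0.975) * se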
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
Data collected over time on the same phenomenon are called a time series. A time series is usually modeled as a single realization of a stochastic process; that is, if we could go back in time and repeat the experiment, we would get different results that would vary according to some probabilistic law. The simplest kind of time series is a stationary time series, in which the mean value is constant over time, the variability of the observations is constant over time, etc. That is, the probability distribution associated with each future observation is the same.
A common concern in applying standard statistical tests to time series data is the assumption of independence. Most conventional statistical hypothesis tests assume the observations are independent, but data collected sequentially in time may not satisfy this assumption. For example, high observations may tend to follow high observations (positive serial correlation), or low observations may tend to follow high observations (negative serial correlation). One way to investigate the assumption of independence is to estimate the lag-one serial correlation and test whether it is significantly different from 0.
The null distribution of the serial correlation coefficient may be badly affected by departures from normality in the underlying process (Cox, 1966; Bartels, 1977). It is therefore a good idea to consider using a nonparametric test for randomness if the normality of the underlying process is in doubt (Bartels, 1982). Knoke (1977) showed that under normality, the test based on the rank serial correlation coefficient (and hence the test based on the rank von Neumann ratio statistic) has asymptotic relative efficiency of 0.91 with respect to using the test based on the ordinary serial correlation coefficient against the alternative of first-order autocorrelation.
Bartels (1982) performed an extensive simulation study of the power of the rank von Neumann ratio test relative to the standard von Neumann ratio test (based on the statistic in Equation (9) above) and the runs test (Lehmann, 1975, 313-315). He generated a first-order autoregressive process for sample sizes of 10, 25, and 50, using 6 different parent distributions: normal, Cauchy, contaminated normal, Johnson, Stable, and exponential. Values of lag-1 autocorrelation ranged from -0.8 to 0.8. Bartels (1982) found three important results:
The rank von Neumann ratio test is far more powerful than the runs test.
For the normal process, the power of the rank von Neumann ratio test was never less than 89% of the power of the standard von Neumann ratio test.
For non-normal processes, the rank von Neumann ratio test was often much more powerful than the standard von Neumann ratio test.
Steven P. Millard ([email protected])
Bartels, R. (1982). The Rank Version of von Neumann's Ratio Test for Randomness. Journal of the American Statistical Association 77(377), 40–46.
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and G.M. Jenkins. (1976). Time Series Analysis: Forecasting and Control. Prentice Hall, Englewood Cliffs, NJ, Chapter 2.
Cox, D.R. (1966). The Null Distribution of the First Serial Correlation Coefficient. Biometrika 53, 623–626.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.69-70;181-192.
Durbin, J., and G.S. Watson. (1950). Testing for Serial Correlation in Least Squares Regression I. Biometrika 37, 409–428.
Durbin, J., and G.S. Watson. (1951). Testing for Serial Correlation in Least Squares Regression II. Biometrika 38, 159–178.
Durbin, J., and G.S. Watson. (1971). Testing for Serial Correlation in Least Squares Regression III. Biometrika 58, 1–19.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.250–253.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapter 25.
Knoke, J.D. (1975). Testing for Randomness Against Autocorrelation Alternatives: The Parametric Case. Biometrika 62, 571–575.
Knoke, J.D. (1977). Testing for Randomness Against Autocorrelation Alternatives: Alternative Tests. Biometrika 64, 523–529.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, 457pp.
von Neumann, J., R.H. Kent, H.R. Bellinson, and B.I. Hart. (1941). The Mean Square Successive Difference. Annals of Mathematical Statistics 12(2), 153–162.
Wald, A., and J. Wolfowitz. (1943). An Exact Test for Randomness in the Non-Parametric Case Based on Serial Correlation. Annals of Mathematical Statistics 14, 378–388.
htest.object, acf, ar, arima, arima.sim, ts.plot, plot.ts, lag.plot, Hypothesis Tests.
# Generate a purely random normal process, then use serialCorrelationTest # to test for the presence of correlation. # (Note: the call to set.seed allows you to reproduce this example.) set.seed(345) x <- rnorm(100) # Look at the data #----------------- dev.new() ts.plot(x) dev.new() acf(x) # Test for serial correlation #---------------------------- serialCorrelationTest(x) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: Rank von Neumann Test for # Lag-1 Autocorrelation # (Beta Approximation) # #Estimated Parameter(s): rho = 0.02773737 # #Estimation Method: Yule-Walker # #Data: x # #Sample Size: 100 # #Test Statistic: RVN = 1.929733 # #P-value: 0.7253405 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.1681836 # UCL = 0.2236584 # Clean up #--------- rm(x) graphics.off() #========== # Now use the R function arima.sim to generate an AR(1) process with a # lag-1 autocorrelation of 0.8, then test for autocorrelation. set.seed(432) y <- arima.sim(model = list(ar = 0.8), n = 100) # Look at the data #----------------- dev.new() ts.plot(y) dev.new() acf(y) # Test for serial correlation #---------------------------- serialCorrelationTest(y) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: Rank von Neumann Test for # Lag-1 Autocorrelation # (Beta Approximation) # #Estimated Parameter(s): rho = 0.835214 # #Estimation Method: Yule-Walker # #Data: y # #Sample Size: 100 # #Test Statistic: RVN = 0.3743174 # #P-value: 0 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.7274307 # UCL = 0.9429973 #---------- # Clean up #--------- rm(y) graphics.off() #========== # The data frame Air.df contains information on ozone (ppb^1/3), # radiation (langleys), temperature (degrees F), and wind speed (mph) # for 153 consecutive days between May 1 and September 30, 1973. # First test for serial correlation in (the cube root of) ozone. # Note that we must use the test based on the MLE because the time series # contains missing values. Serial correlation appears to be present. # Next fit a linear model that includes the predictor variables temperature, # radiation, and wind speed, and test for the presence of serial correlation # in the residuals. There is no evidence of serial correlation. # Look at the data #----------------- Air.df # ozone radiation temperature wind #05/01/1973 3.448217 190 67 7.4 #05/02/1973 3.301927 118 72 8.0 #05/03/1973 2.289428 149 74 12.6 #05/04/1973 2.620741 313 62 11.5 #05/05/1973 NA NA 56 14.3 #... 
#09/27/1973 NA 145 77 13.2 #09/28/1973 2.410142 191 75 14.3 #09/29/1973 2.620741 131 76 8.0 #09/30/1973 2.714418 223 68 11.5 #---------- # Test for serial correlation #---------------------------- with(Air.df, serialCorrelationTest(ozone, test = "AR1.mle")) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: z-Test for # Lag-1 Autocorrelation # (Wald Test Based on MLE) # #Estimated Parameter(s): rho = 0.5641616 # #Estimation Method: Maximum Likelihood # #Data: ozone # #Sample Size: 153 # #Number NA/NaN/Inf's: 37 # #Test Statistic: z = 7.586952 # #P-value: 3.28626e-14 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.4184197 # UCL = 0.7099034 #---------- # Next fit a linear model that includes the predictor variables temperature, # radiation, and wind speed, and test for the presence of serial correlation # in the residuals. Note setting the argument na.action = na.exclude in the # call to lm to correctly deal with missing values. #---------------------------------------------------------------------------- lm.ozone <- lm(ozone ~ radiation + temperature + wind + I(temperature^2) + I(wind^2), data = Air.df, na.action = na.exclude) # Now test for serial correlation in the residuals. #-------------------------------------------------- serialCorrelationTest(lm.ozone, test = "AR1.mle") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: z-Test for # Lag-1 Autocorrelation # (Wald Test Based on MLE) # #Estimated Parameter(s): rho = 0.1298024 # #Estimation Method: Maximum Likelihood # #Data: Residuals # #Data Source: lm.ozone # #Sample Size: 153 # #Number NA/NaN/Inf's: 42 # #Test Statistic: z = 1.285963 # #P-value: 0.1984559 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.06803223 # UCL = 0.32763704 # Clean up #--------- rm(lm.ozone)
Estimate the median, test the null hypothesis that the median is equal to a user-specified value based on the sign test, and create a confidence interval for the median.
signTest(x, y = NULL, alternative = "two.sided", mu = 0, paired = FALSE, conf.level = 0.95)
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
y |
optional numeric vector of observations that are paired with the observations in x (see the paired argument). The default value is y=NULL. |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are "two.sided" (the default), "greater", and "less". |
mu |
numeric scalar indicating the hypothesized value of the median. The default value is mu=0. |
paired |
logical scalar indicating whether to perform a paired or one-sample sign test. The possible values are paired=FALSE (the default; a one-sample sign test) and paired=TRUE (a paired sign test). |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence interval for the population median. The default value is conf.level=0.95. |
One-Sample Case (paired=FALSE
)
Let x_1, x_2, \ldots, x_n denote a vector of n independent observations from one or more distributions that all have the same median \mu.
Consider the test of the null hypothesis:

H_0: \mu = \mu_0 \;\;\;\; (1)
The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater")

H_a: \mu > \mu_0 \;\;\;\; (2)

the lower one-sided alternative (alternative="less")

H_a: \mu < \mu_0 \;\;\;\; (3)

and the two-sided alternative (alternative="two.sided")

H_a: \mu \ne \mu_0 \;\;\;\; (4)
To perform the test of the null hypothesis (1) versus any of the three alternatives (2)-(4), the sign test uses the test statistic T, which is simply the number of observations that are greater than \mu_0 (Conover, 1980, p. 122; van Belle et al., 2004, p. 256; Hollander and Wolfe, 1999, p. 60; Lehmann, 1975, p. 120; Sheskin, 2011; Zar, 2010, p. 537). Under the null hypothesis, the distribution of T is a binomial random variable with parameters size=n and prob=0.5. Usually, however, cases for which the observations are equal to \mu_0 are discarded, so the distribution of T is taken to be binomial with parameters size=r and prob=0.5, where r denotes the number of observations not equal to \mu_0. The sign test only requires that the observations are independent and that they all come from one or more distributions (not necessarily the same ones) that all have the same population median.
For a two-sided alternative hypothesis (Equation (4)), the p-value is computed as:

p = Pr(X \le m) + Pr(X \ge r - m)

where X denotes a binomial random variable with parameters size=r and prob=0.5, and m is defined by:

m = \min(T, r - T)

For a one-sided lower alternative hypothesis (Equation (3)), the p-value is computed as:

p = Pr(X \le T)

and for a one-sided upper alternative hypothesis (Equation (2)), the p-value is computed as:

p = Pr(X \ge T)
It is obvious that the sign test is simply a special case of the
binomial test with p=0.5
.
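That connection can be seen directly; a minimal sketch (illustrative data; mu0 denotes the hypothesized median):

# Illustrative data and hypothesized median.
set.seed(3)
x <- rlnorm(15, meanlog = 2, sdlog = 1)
mu0 <- 5

# Discard observations equal to mu0, count those above it, and apply the
# binomial test with success probability 0.5.
x.kept <- x[x != mu0]
T.stat <- sum(x.kept > mu0)
r <- length(x.kept)
binom.test(T.stat, r, p = 0.5, alternative = "two.sided")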
Computing Confidence Intervals
Based on the relationship between hypothesis tests and confidence intervals,
we can construct a confidence interval for the population median based on the
sign test (e.g., Hollander and Wolfe, 1999, p. 72; Lehmann, 1975, p. 182).
It turns out that this is equivalent to using the formulas for a nonparametric confidence interval for the 0.5 quantile (see eqnpar).
Paired-Sample Case (paired=TRUE
)
When the argument paired=TRUE, the arguments x and y are assumed to have the same length, and the differences d_i = x_i - y_i (i = 1, 2, \ldots, n) are assumed to be independent observations from distributions with the same median \mu. The sign test can then be applied to the differences.
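A minimal sketch of the paired case (simulated before/after data, purely illustrative; it assumes the EnvStats package is loaded):

# Simulated paired measurements (e.g., before and after remediation).
set.seed(4)
before <- rlnorm(12, meanlog = 1, sdlog = 0.5)
after  <- before * rlnorm(12, meanlog = -0.2, sdlog = 0.3)

# The paired sign test is the one-sample sign test applied to the differences.
signTest(before, after, paired = TRUE)$p.value
signTest(before - after)$p.value          # same p-value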
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
A frequent question in environmental statistics is “Is the concentration of chemical X greater than Y units?”. For example, in groundwater assessment (compliance) monitoring at hazardous and solid waste sites, the concentration of a chemical in the groundwater at a downgradient well must be compared to a groundwater protection standard (GWPS). If the concentration is “above” the GWPS, then the site enters corrective action monitoring. As another example, soil screening at a Superfund site involves comparing the concentration of a chemical in the soil with a pre-determined soil screening level (SSL). If the concentration is “above” the SSL, then further investigation and possible remedial action is required. Determining what it means for the chemical concentration to be “above” a GWPS or an SSL is a policy decision: the average of the distribution of the chemical concentration must be above the GWPS or SSL, or the median must be above the GWPS or SSL, or the 95th percentile must be above the GWPS or SSL, or something else. Often, the first interpretation is used.
Hypothesis tests you can use to perform tests of location include: Student's t-test, Fisher's randomization test, the Wilcoxon signed rank test, Chen's modified t-test, the sign test, and a test based on a bootstrap confidence interval. For a discussion comparing the performance of these tests, see Millard and Neerchal (2001, pp.408-409).
Steven P. Millard ([email protected])
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, p.122
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods. Second Edition. John Wiley and Sons, New York, p.60.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, p.120.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.404–406.
Sheskin, D.J. (2011). Handbook of Parametric and Nonparametric Statistical Procedures. Fifth Edition. CRC Press, Boca Raton, FL.
van Belle, G., L.D. Fisher, P.J. Heagerty, and T. Lumley. (2004). Biostatistics: A Methodology for the Health Sciences. Second Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
wilcox.test
, Hypothesis Tests, eqnpar
,
htest.object
.
# Generate 10 observations from a lognormal distribution with parameters # meanlog=2 and sdlog=1. The median of this distribution is e^2 (about 7.4). # Test the null hypothesis that the true median is equal to 5 against the # alternative that the true median is not equal to 5. # (Note: the call to set.seed allows you to reproduce this example). set.seed(23) dat <- rlnorm(10, meanlog = 2, sdlog = 1) signTest(dat, mu = 5) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: median = 5 # #Alternative Hypothesis: True median is not equal to 5 # #Test Name: Sign test # #Estimated Parameter(s): median = 19.21717 # #Data: dat # #Test Statistic: # Obs > median = 9 # #P-value: 0.02148438 # #Confidence Interval for: median # #Confidence Interval Method: exact # #Confidence Interval Type: two-sided # #Confidence Level: 93.45703% # #Confidence Limit Rank(s): 3 9 # #Confidence Interval: LCL = 7.732538 # UCL = 35.722459 # Clean up #--------- rm(dat) #========== # The guidance document "Supplemental Guidance to RAGS: Calculating the # Concentration Term" (USEPA, 1992d) contains an example of 15 observations # of chromium concentrations (mg/kg) which are assumed to come from a # lognormal distribution. These data are stored in the vector # EPA.92d.chromium.vec. Here, we will use the sign test to test the null # hypothesis that the median chromium concentration is less than or equal to # 100 mg/kg vs. the alternative that it is greater than 100 mg/kg. The # estimated median is 110 mg/kg. There are 8 out of 15 observations greater # than 100 mg/kg, the p-value is equal to 0.5, and the lower 94% confidence # limit is 41 mg/kg. signTest(EPA.92d.chromium.vec, mu = 100, alternative = "greater") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: median = 100 # #Alternative Hypothesis: True median is greater than 100 # #Test Name: Sign test # #Estimated Parameter(s): median = 110 # #Data: EPA.92d.chromium.vec # #Test Statistic: # Obs > median = 8 # #P-value: 0.5 # #Confidence Interval for: median # #Confidence Interval Method: exact # #Confidence Interval Type: lower # #Confidence Level: 94.07654% # #Confidence Limit Rank(s): 5 # #Confidence Interval: LCL = 41 # UCL = Inf
Simulate a multivariate matrix of random numbers from specified theoretical probability distributions and/or empirical probability distributions based on a specified rank correlation matrix, using either Latin Hypercube sampling or simple random sampling.
simulateMvMatrix(n, distributions = c(Var.1 = "norm", Var.2 = "norm"), param.list = list(Var.1 = list(mean = 0, sd = 1), Var.2 = list(mean = 0, sd = 1)), cor.mat = diag(length(distributions)), sample.method = "SRS", seed = NULL, left.tail.cutoff = ifelse(is.finite(supp.min), 0, .Machine$double.eps), right.tail.cutoff = ifelse(is.finite(supp.max), 0, .Machine$double.eps), tol.1 = .Machine$double.eps, tol.symmetry = .Machine$double.eps, tol.recip.cond.num = .Machine$double.eps, max.iter = 10)
n |
a positive integer indicating the number of random vectors (i.e., the number of rows of the matrix) to generate. |
distributions |
a character vector of length Alternatively, the character string |
param.list |
a list containing Alternatively, if you specify an empirical distribution for the |
cor.mat |
a |
sample.method |
a character vector of length 1 or |
seed |
integer to supply to the R function |
left.tail.cutoff |
a numeric vector of length |
right.tail.cutoff |
a numeric vector of length |
tol.1 |
a positive numeric scalar indicating the allowable absolute deviation
from 1 for the diagonal elements of |
tol.symmetry |
a positive numeric scalar indicating the allowable absolute deviation from
0 for the difference between symmetric elements of |
tol.recip.cond.num |
a positive numeric scalar indicating the allowable minimum value of the
reciprocal of the condition number for |
max.iter |
a positive integer indicating the maximum number of iterations to use to
produce the |
Motivation

In risk assessment and Monte Carlo simulation, the outcome variable of interest, say Y, is usually some function of one or more other random variables:

Y = h(X_1, X_2, …, X_k)    (1)

For example, Y may be the incremental lifetime cancer risk due to ingestion of soil contaminated with benzene (Thompson et al., 1992; Hamed and Bedient, 1997). In this case the random vector X = (X_1, X_2, …, X_k) may represent observations from several kinds of distributions that characterize exposure and dose-response, such as benzene concentration in the soil, soil ingestion rate, average body weight, the cancer potency factor for benzene, etc. These distributions may or may not be assumed to be independent of one another (Smith et al., 1992; Bukowski et al., 1995). Often, input variables in a Monte Carlo simulation are in fact known to be correlated, such as body weight and dermal area.

Characterizing the joint distribution of a random vector X, where different elements of X come from different distributions, is usually mathematically complex or impossible unless the elements (random variables) of X are independent.
Iman and Conover (1982) present an algorithm for creating a set of
multivariate observations with a rank correlation matrix that is approximately
equal to a specified rank correlation matrix. This method allows for different
probability distributions for each element of the multivariate vector. The
details of this algorithm are as follows.
Algorithm

1. Specify n, the desired number of random vectors (i.e., the number of rows of the n × k output matrix). This is specified by the argument n for the function simulateMvMatrix.

2. Create C, the desired k × k rank correlation matrix. This is specified by the argument cor.mat.

3. Compute

C = P P'    (2)

where P is a k × k lower triangular matrix and P' denotes the transpose of P. The function simulateMvMatrix uses the Cholesky decomposition to compute P (see the R help file for chol).

4. Create R, an n × k matrix, whose columns represent k independent permutations of van der Waerden scores. That is, each column of R is a random permutation of the scores

Φ⁻¹(i / (n + 1)),  i = 1, 2, …, n    (3)

where Φ denotes the cumulative distribution function of the standard normal distribution.

5. Compute T, the k × k Pearson sample correlation matrix of R. Make sure T is positive definite; if it is not, then repeat step 4.

6. Compute

T = Q Q'    (4)

where Q is a k × k lower triangular matrix. The function simulateMvMatrix uses the Cholesky decomposition to compute Q (see the R help file for chol).

7. Compute the k × k lower triangular matrix S, where

S = P Q⁻¹    (5)

8. Compute the matrix R*, where

R* = R S'    (6)

9. Generate an n × k matrix of random numbers X, where each column of X comes from the distribution specified by the arguments distributions and param.list. Generate each column of random numbers independently of the other columns. If the j'th element of sample.method equals "SRS", use simple random sampling to generate the random numbers for the j'th column of X. If the j'th element of sample.method equals "LHS", use Latin Hypercube sampling to generate the random numbers for the j'th column of X. At this stage in the algorithm, the function simulateMvMatrix calls the function simulateVector to create each column of X.

10. Order the observations within each column of X so that the order of the ranks within each column of X matches the order of the ranks within each column of R*. This way, X and R* have exactly the same sample rank correlation matrix.
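The steps above can be sketched directly in R for the simple case of k = 2 columns and a target rank correlation of 0.8. This is only an illustration under assumed example distributions (a normal and a lognormal column); simulateMvMatrix handles the general case:

set.seed(1)
n <- 1000; k <- 2
C <- matrix(c(1, 0.8, 0.8, 1), 2, 2)           # Step 2: target correlation matrix
P <- t(chol(C))                                # Step 3: C = P P'
scores <- qnorm((1:n) / (n + 1))               # van der Waerden scores
R <- sapply(1:k, function(j) sample(scores))   # Step 4: independent permutations
T.mat <- cor(R)                                # Step 5: sample correlation of R
Q <- t(chol(T.mat))                            # Step 6: T = Q Q'
S <- P %*% solve(Q)                            # Step 7: S = P Q^(-1)
R.star <- R %*% t(S)                           # Step 8: R* = R S'
X <- cbind(rnorm(n, 10, 2), rlnorm(n))         # Step 9: independent columns
X.ord <- sapply(1:k, function(j)               # Step 10: reorder each column of X
  sort(X[, j])[rank(R.star[, j])])             #   to match the ranks within R*
round(cor(X.ord, method = "spearman"), 2)      # off-diagonal approximately 0.8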
Explanation

Iman and Conover (1982) present two algorithms for computing an n × k output matrix with a specified rank correlation. The algorithm presented above is the second, more complicated one. In order to explain the reasoning behind this algorithm, we need to explain the simple algorithm first.

Simple Algorithm

Let R_i denote the i'th row vector of the matrix R, the n × k matrix of scores. This row vector has a population correlation matrix of I_k, where I_k denotes the k × k identity matrix. Thus, the 1 × k vector R_i P' has a population correlation matrix equal to C. Therefore, if we define R* by

R* = R P'    (7)

each row of R* has the same multivariate distribution with population correlation matrix C. The rank correlation matrix of R* should therefore be close to C. Ordering the columns of X as described in Step 10 above will yield a matrix of observations with the specified distributions and the exact same rank correlation matrix as the rank correlation matrix of R*.

Iman and Conover (1982) use van der Waerden scores instead of raw ranks to create R because van der Waerden scores yield more "natural-looking" pairwise scatterplots.

If the Pearson sample correlation matrix of R, denoted T in Step 5 above, were exactly equal to the true population correlation matrix I_k, then the sample correlation matrix of R* would be exactly equal to C, and the rank correlation matrix of R* would be approximately equal to C. The Pearson sample correlation matrix of R, however, is an estimate of the true population correlation matrix I_k, and is therefore “bouncing around” I_k. Likewise, the Pearson sample correlation matrix of R* is an estimate of the true population correlation matrix C, and is therefore bouncing around C. Using this simple algorithm, the Pearson sample correlation matrix of R*, as R* is defined in Equation (7) above, may not be “close” enough to the desired rank correlation matrix C, and thus the rank correlation of X will not be close enough to C. Iman and Conover (1982) therefore present a more complicated algorithm.

More Complicated Algorithm

To get around the problem mentioned above, Iman and Conover (1982) find a k × k lower triangular matrix S such that the matrix R* as defined in Equation (6) above has a correlation matrix exactly equal to C. The formula for S is given in Steps 6 and 7 of the algorithm above.

Iman and Conover (1982, p.330) note that even if the desired rank correlation matrix C is in fact the identity matrix I_k, this method of generating the R* matrix will produce a matrix with an associated rank correlation that more closely resembles I_k than you would get by simply generating random numbers within each column of R.
A numeric matrix of dimension n × k of random numbers, where the j'th column of numbers comes from the distribution specified by the j'th elements of the arguments distributions and param.list, and where the rank correlation of this matrix is approximately equal to the argument cor.mat. The value of n is determined by the argument n, and the value of k is determined by the length of the argument distributions.
Monte Carlo simulation and risk assessment often involve looking at the distribution or characteristics of the distribution of some outcome variable that depends upon several input variables (see Equation (1) above). Usually these input variables can be considered random variables. An important part of both sensitivity analysis and uncertainty analysis involves looking at how the distribution of the outcome variable changes with changing assumptions on the input variables. One important assumption is the correlation between the input random variables.
Often, the input random variables are assumed to be independent when in fact they are known to be correlated (Smith et al., 1992; Bukowski et al., 1995). It is therefore important to assess the effect of the assumption of independence on the distribution of the outcome variable. One way to assess the effect of this assumption is to run the Monte Carlo simulation assuming independence and then also run it assuming certain forms of correlations among the input variables.
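As a hedged illustration of this point, the sketch below compares the upper tail of a hypothetical outcome Y = X1 * X2 when the two inputs are simulated as independent versus with a rank correlation of 0.7 (the lognormal inputs and the product model for Y are arbitrary choices for the example):

dists  <- c(X1 = "lnormAlt", X2 = "lnormAlt")
params <- list(X1 = list(mean = 10, cv = 1), X2 = list(mean = 10, cv = 1))

indep <- simulateMvMatrix(5000, distributions = dists, param.list = params,
  cor.mat = diag(2), seed = 42)
dep   <- simulateMvMatrix(5000, distributions = dists, param.list = params,
  cor.mat = matrix(c(1, 0.7, 0.7, 1), 2, 2), seed = 42)

quantile(indep[, 1] * indep[, 2], 0.95)   # 95th percentile of Y, independent inputs
quantile(dep[, 1] * dep[, 2], 0.95)       # typically larger with positively correlated inputs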
Iman and Davenport (1982) present a series of scatterplots showing “typical” scatterplots with various distributions on the x- and y-axes and various assumed rank correlations. These plots are meant to aid in developing reasonable estimates of rank correlation between input variables. These plots can easily be produced using the simulateMvMatrix and plot functions.
Steven P. Millard ([email protected])
Bukowski, J., L. Korn, and D. Wartenberg. (1995). Correlated Inputs in Quantitative Risk Assessment: The Effects of Distributional Shape. Risk Analysis 15(2), 215–219.
Hamed, M., and P.B. Bedient. (1997). On the Effect of Probability Distributions of Input Variables in Public Health Risk Assessment. Risk Analysis 17(1), 97–105.
Iman, R.L., and W.J. Conover. (1980). Small Sample Sensitivity Analysis Techniques for Computer Models, With an Application to Risk Assessment (with Comments). Communications in Statistics–Volume A, Theory and Methods, 9(17), 1749–1874.
Iman, R.L., and W.J. Conover. (1982). A Distribution-Free Approach to Inducing Rank Correlation Among Input Variables. Communications in Statistics–Volume B, Simulation and Computation, 11(3), 311–334.
Iman, R.L., and J.M. Davenport. (1982). Rank Correlation Plots For Use With Correlated Input Variables. Communications in Statistics–Volume B, Simulation and Computation, 11(3), 335–360.
Iman, R.L., and J.C. Helton. (1988). An Investigation of Uncertainty and Sensitivity Analysis Techniques for Computer Models. Risk Analysis 8(1), 71–90.
Iman, R.L. and J.C. Helton. (1991). The Repeatability of Uncertainty and Sensitivity Analyses for Complex Probabilistic Risk Assessments. Risk Analysis 11(4), 591–606.
McKay, M.D., R.J. Beckman., and W.J. Conover. (1979). A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code. Technometrics 21(2), 239–245.
Millard, S.P. (2013). EnvStats: an R Package for Environmental Statistics. Springer, New York. https://link.springer.com/book/10.1007/978-1-4614-8456-1.
Smith, A.E., P.B. Ryan, and J.S. Evans. (1992). The Effect of Neglecting Correlations When Propagating Uncertainty and Estimating the Population Distribution of Risk. Risk Analysis 12(4), 467–474.
Thompson, K.M., D.E. Burmaster, and E.A.C. Crouch. (1992). Monte Carlo Techniques for Quantitative Uncertainty Analysis in Public Health Risk Assessments. Risk Analysis 12(1), 53–63.
Vose, D. (2008). Risk Analysis: A Quantitative Guide. Third Edition. John Wiley & Sons, West Sussex, UK, 752 pp.
Probability Distributions and Random Numbers, Empirical,
simulateVector
, cor
, set.seed
.
# Generate 5 observations from a standard bivariate normal distribution # with a rank correlation matrix (approximately) equal to the 2 x 2 # identity matrix, using simple random sampling for each # marginal distribution. simulateMvMatrix(5, seed = 47) # Var.1 Var.2 #[1,] 0.01513086 0.03960243 #[2,] -1.08573747 0.09147291 #[3,] -0.98548216 0.49382018 #[4,] -0.25204590 -0.92245624 #[5,] -1.46575030 -1.82822917 #========== # Look at the observed rank correlation matrix for 100 observations # from a standard bivariate normal distribution with a rank correlation matrix # (approximately) equal to the 2 x 2 identity matrix. Compare this observed # rank correlation matrix with the observed rank correlation matrix based on # generating two independent sets of standard normal random numbers. # Note that the cross-correlation is closer to 0 for the matrix created with # simulateMvMatrix. cor(simulateMvMatrix(100, seed = 47), method = "spearman") # Var.1 Var.2 #Var.1 1.000000000 -0.005976598 #Var.2 -0.005976598 1.000000000 cor(matrix(simulateVector(200, seed = 47), 100 , 2), method = "spearman") # [,1] [,2] #[1,] 1.00000000 -0.05374137 #[2,] -0.05374137 1.00000000 #========== # Generate 1000 observations from a bivariate distribution, where the first # distribution is a normal distribution with parameters mean=10 and sd=2, # the second distribution is a lognormal distribution with parameters # mean=10 and cv=1, and the desired rank correlation between the two # distributions is 0.8. Look at the observed rank correlation matrix, and # plot the results. mat <- simulateMvMatrix(1000, distributions = c(N.10.2 = "norm", LN.10.1 = "lnormAlt"), param.list = list(N.10.2 = list(mean=10, sd=2), LN.10.1 = list(mean=10, cv=1)), cor.mat = matrix(c(1, .8, .8, 1), 2, 2), seed = 47) round(cor(mat, method = "spearman"), 2) # N.10.2 LN.10.1 #N.10.2 1.00 0.78 #LN.10.1 0.78 1.00 dev.new() plot(mat, xlab = "Observations from N(10, 2)", ylab = "Observations from LN(mean=10, cv=1)", main = "Lognormal vs. Normal Deviates with Rank Correlation 0.8") #---------- # Repeat the last example, but use Latin Hypercube sampling for both # distributions. Note the wider range on the y-axis. mat.LHS <- simulateMvMatrix(1000, distributions = c(N.10.2 = "norm", LN.10.1 = "lnormAlt"), param.list = list(N.10.2 = list(mean=10, sd=2), LN.10.1 = list(mean=10, cv=1)), cor.mat = matrix(c(1, .8, .8, 1), 2, 2), sample.method = "LHS", seed = 298) round(cor(mat.LHS, method = "spearman"), 2) # N.10.2 LN.10.1 #N.10.2 1.00 0.79 #LN.10.1 0.79 1.00 dev.new() plot(mat.LHS, xlab = "Observations from N(10, 2)", ylab = "Observations from LN(mean=10, cv=1)", main = paste("Lognormal vs. Normal Deviates with Rank Correlation 0.8", "(Latin Hypercube Sampling)", sep = "\n")) #========== # Generate 1000 observations from a multivariate distribution, where the # first distribution is a normal distribution with parameters # mean=10 and sd=2, the second distribution is a lognormal distribution # with parameters mean=10 and cv=1, the third distribution is a beta # distribution with parameters shape1=2 and shape2=3, and the fourth # distribution is an empirical distribution of 100 observations that # we'll generate from a Pareto distribution with parameters # location=10 and shape=2. 
Set the desired rank correlation matrix to: cor.mat <- matrix(c(1, .8, 0, .5, .8, 1, 0, .7, 0, 0, 1, .2, .5, .7, .2, 1), 4, 4) cor.mat # [,1] [,2] [,3] [,4] #[1,] 1.0 0.8 0.0 0.5 #[2,] 0.8 1.0 0.0 0.7 #[3,] 0.0 0.0 1.0 0.2 #[4,] 0.5 0.7 0.2 1.0 # Use Latin Hypercube sampling for each variable, look at the observed # rank correlation matrix, and plot the results. pareto.rns <- simulateVector(100, "pareto", list(location = 10, shape = 2), sample.method = "LHS", seed = 56) mat <- simulateMvMatrix(1000, distributions = c(Normal = "norm", Lognormal = "lnormAlt", Beta = "beta", Empirical = "emp"), param.list = list(Normal = list(mean=10, sd=2), Lognormal = list(mean=10, cv=1), Beta = list(shape1 = 2, shape2 = 3), Empirical = list(obs = pareto.rns)), cor.mat = cor.mat, seed = 47, sample.method = "LHS") round(cor(mat, method = "spearman"), 2) # Normal Lognormal Beta Empirical #Normal 1.00 0.78 -0.01 0.47 #Lognormal 0.78 1.00 -0.01 0.67 #Beta -0.01 -0.01 1.00 0.19 #Empirical 0.47 0.67 0.19 1.00 dev.new() pairs(mat) #========== # Clean up #--------- rm(mat, mat.LHS, pareto.rns) graphics.off()
Simulate a vector of random numbers from a specified theoretical probability distribution or empirical probability distribution, using either Latin Hypercube sampling or simple random sampling.
simulateVector(n, distribution = "norm", param.list = list(mean = 0, sd = 1), sample.method = "SRS", seed = NULL, sorted = FALSE, left.tail.cutoff = ifelse(is.finite(supp.min), 0, .Machine$double.eps), right.tail.cutoff = ifelse(is.finite(supp.max), 0, .Machine$double.eps))
n |
a positive integer indicating the number of random numbers to generate. |
distribution |
a character string denoting the distribution abbreviation. The default value is
Alternatively, the character string |
param.list |
a list with values for the parameters of the distribution.
The default value is Alternatively, if you specify an empirical distribution by setting |
sample.method |
a character string indicating whether to use simple random sampling |
seed |
integer to supply to the R function |
sorted |
logical scalar indicating whether to return the random numbers in sorted
(ascending) order. The default value is |
left.tail.cutoff |
a scalar between 0 and 1 indicating what proportion of the left-tail of
the probability distribution to omit for Latin Hypercube sampling.
For densities with a finite support minimum (e.g., Lognormal or
Empirical) the default value is |
right.tail.cutoff |
a scalar between 0 and 1 indicating what proportion of the right-tail of
the probability distribution to omit for Latin Hypercube sampling.
For densities with a finite support maximum (e.g., Beta or
Empirical) the default value is |
Simple Random Sampling (sample.method="SRS"
)
When sample.method="SRS"
, the function simulateVector
simply
calls the function rabb, where abb denotes the
abbreviation of the specified distribution (e.g., rlnorm
,
remp
, etc.).
Latin Hypercube Sampling (sample.method="LHS"
)
When sample.method="LHS"
, the function simulateVector
generates
n
random numbers using Latin Hypercube sampling. The distribution is
divided into n
intervals of equal probability and simple random
sampling is performed once within each interval; i.e., Latin Hypercube sampling
is simply stratified sampling without replacement, where the strata are defined
by the 0'th, 100(1/n)'th, 100(2/n)'th, ..., and 100'th percentiles of the
distribution.
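For a single variable, the stratification idea can be sketched in a few lines of base R. This sketch assumes a standard normal distribution and only illustrates the mechanism; simulateVector handles arbitrary distributions, including the tail-cutoff arguments:

set.seed(1)
n <- 10
u <- (sample(1:n) - 1 + runif(n)) / n   # one uniform draw from each of n equal-probability strata
lhs <- qnorm(u)                         # map the stratified uniforms through the quantile function
sort(lhs)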
Latin Hypercube sampling, sometimes abbreviated LHS,
is a method of sampling from a probability distribution that ensures all
portions of the probability distribution are represented in the sample.
It was introduced in the published literature by McKay et al. (1979) to overcome
the following problem in Monte Carlo simulation based on simple random sampling
(SRS). Suppose we want to generate n random numbers from a specified distribution. If we use simple random sampling, there is a low probability of getting very many observations in an area of low probability of the distribution. For example, if we generate n observations from the distribution, the probability that none of these observations falls above the 98th percentile of the distribution (i.e., into the upper 2% of the distribution) is 0.98^n. So, for example, there is about a 13% chance that out of 100 random numbers, none will fall at or above the 98th percentile. If we are interested in reproducing the shape of the distribution, we will need a very large number of observations to ensure that we can adequately characterize the tails of the distribution (Vose, 2008, pp. 59–62).
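A quick check of the probability quoted above:

0.98^100
#[1] 0.1326196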
See Millard (2013) for a visual explanation of Latin Hypercube sampling.
a numeric vector of random numbers from the specified distribution.
Latin Hypercube sampling, sometimes abbreviated LHS, is a method of sampling from a probability distribution that ensures all portions of the probability distribution are represented in the sample. It was introduced in the published literature by McKay et al. (1979). Latin Hypercube sampling is often used in probabilistic risk assessment, specifically for sensitivity and uncertainty analysis (e.g., Iman and Conover, 1980; Iman and Helton, 1988; Iman and Helton, 1991; Vose, 1996).
Steven P. Millard ([email protected])
Iman, R.L., and W.J. Conover. (1980). Small Sample Sensitivity Analysis Techniques for Computer Models, With an Application to Risk Assessment (with Comments). Communications in Statistics–Volume A, Theory and Methods, 9(17), 1749–1874.
Iman, R.L., and J.C. Helton. (1988). An Investigation of Uncertainty and Sensitivity Analysis Techniques for Computer Models. Risk Analysis 8(1), 71–90.
Iman, R.L. and J.C. Helton. (1991). The Repeatability of Uncertainty and Sensitivity Analyses for Complex Probabilistic Risk Assessments. Risk Analysis 11(4), 591–606.
McKay, M.D., R.J. Beckman., and W.J. Conover. (1979). A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code. Technometrics 21(2), 239–245.
Millard, S.P. (2013). EnvStats: an R Package for Environmental Statistics. Springer, New York. https://link.springer.com/book/10.1007/978-1-4614-8456-1.
Vose, D. (2008). Risk Analysis: A Quantitative Guide. Third Edition. John Wiley & Sons, West Sussex, UK, 752 pp.
Probability Distributions and Random Numbers, Empirical,
simulateMvMatrix
, set.seed
.
# Generate 10 observations from a lognormal distribution with # parameters mean=10 and cv=1 using simple random sampling: simulateVector(10, distribution = "lnormAlt", param.list = list(mean = 10, cv = 1), seed = 47, sort = TRUE) # [1] 2.086931 2.863589 3.112866 5.592502 5.732602 7.160707 # [7] 7.741327 8.251306 12.782493 37.214748 #---------- # Repeat the above example by calling rlnormAlt directly: set.seed(47) sort(rlnormAlt(10, mean = 10, cv = 1)) # [1] 2.086931 2.863589 3.112866 5.592502 5.732602 7.160707 # [7] 7.741327 8.251306 12.782493 37.214748 #---------- # Now generate 10 observations from the same lognormal distribution # but use Latin Hypercube sampling. Note that the largest value # is larger than for simple random sampling: simulateVector(10, distribution = "lnormAlt", param.list = list(mean = 10, cv = 1), seed = 47, sample.method = "LHS", sort = TRUE) # [1] 2.406149 2.848428 4.311175 5.510171 6.467852 8.174608 # [7] 9.506874 12.298185 17.022151 53.552699 #========== # Generate 50 observations from a Pareto distribution with parameters # location=10 and shape=2, then use this resulting vector of # observations as the basis for generating 3 observations from an # empirical distribution using Latin Hypercube sampling: set.seed(321) pareto.rns <- rpareto(50, location = 10, shape = 2) simulateVector(3, distribution = "emp", param.list = list(obs = pareto.rns), sample.method = "LHS") #[1] 11.50685 13.50962 17.47335 #========== # Clean up #--------- rm(pareto.rns)
Ammonia nitrogen (NH3-N) concentration (mg/L) in the Skagit River measured monthly from January 1978 through December 2010 at the Marblemount, Washington monitoring station.
Skagit.NH3_N.df
A data frame with 396 observations on the following 6 variables.
Date
Date of collection.
NH3_N.Orig.mg.per.L
a character vector of the ammonia nitrogen concentrations where values for non-detects are preceded by the less-than sign (<).
NH3_N.mg.per.L
a numeric vector of ammonia nitrogen concentrations; non-detects have been coded to their detection limit.
DQ1
factor of data qualifier values.
U
= The analyte was not detected at or above the reported result.
J
= The analyte was positively identified. The associated numerical result is an estimate.
UJ
= The analyte was not detected at or above the reported estimated result.
DQ2
factor of data qualifier values.
An asterisk (*
) indicates a possible quality problem for the result.
Censored
a logical vector indicating which observations are censored.
Station 04A100 - Skagit R @ Marblemount. Located at the bridge on the Cascade River Road where Highway 20 (North Cascades Highway) turns 90 degrees in Marblemount.
Washington State Department of Ecology.
https://ecology.wa.gov/Research-Data/Monitoring-assessment/River-stream-monitoring/Water-quality-monitoring/Using-river-stream-water-quality-data
Compute the sample coefficient of skewness.
skewness(x, na.rm = FALSE, method = "fisher", l.moment.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0))
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from |
method |
character string specifying what method to use to compute the sample coefficient
of skewness. The possible values are
|
l.moment.method |
character string specifying what method to use to compute the
|
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for
the plotting positions when |
Let x_1, x_2, …, x_n denote a random sample of n observations from some distribution with mean μ and standard deviation σ.
Product Moment Coefficient of Skewness (method="moment" or method="fisher")

The coefficient of skewness of a distribution is the third standardized moment about the mean:

skewness = μ_3 / σ^3

where

μ_r = E[(X − μ)^r]

denotes the r'th moment about the mean (the r'th central moment). That is, the coefficient of skewness is the third central moment divided by the cube of the standard deviation. The coefficient of skewness is 0 for a symmetric distribution. Distributions with positive skew have heavy right-hand tails, and distributions with negative skew have heavy left-hand tails.

When method="moment", the coefficient of skewness is estimated using the method of moments estimator for the third central moment and the method of moments estimator for the variance:

[ (1/n) Σ (x_i − x̄)^3 ] / [ (1/n) Σ (x_i − x̄)^2 ]^(3/2)

where the sums are taken over i = 1, …, n and x̄ denotes the sample mean. This form of estimation should be used when resampling (bootstrap or jackknife).

When method="fisher", the coefficient of skewness is estimated using the unbiased estimator for the third central moment (Serfling, 1980, p.73; Chen, 1995, p.769) and the unbiased estimator for the variance:

[ n / ((n − 1)(n − 2)) Σ (x_i − x̄)^3 ] / s^3

where

s^2 = [ 1 / (n − 1) ] Σ (x_i − x̄)^2

(Note that Serfling, 1980, p.73 contains a typographical error in the numerator for the unbiased estimator of the third central moment.)
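The method="fisher" estimate can be computed by hand as a check on the formulas above; a small sketch with arbitrary illustrative data (method="fisher" is the default, so the call to skewness() should agree):

x <- c(2.1, 3.5, 3.9, 4.4, 5.0, 6.2, 8.8, 15.3)
n <- length(x)
m3.unbiased <- n * sum((x - mean(x))^3) / ((n - 1) * (n - 2))  # unbiased third central moment
m3.unbiased / sd(x)^3    # Fisher coefficient of skewness
skewness(x)              # should agree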
L-Moment Coefficient of Skewness (method="l.moments")

Hosking (1990) defines the L-moment analog of the coefficient of skewness as:

τ_3 = λ_3 / λ_2

that is, the third L-moment divided by the second L-moment. He shows that this quantity lies in the interval (-1, 1).

When l.moment.method="unbiased", the L-skewness is estimated by:

t_3 = l_3 / l_2

that is, the unbiased estimator of the third L-moment divided by the unbiased estimator of the second L-moment.

When l.moment.method="plotting.position", the L-skewness is estimated by the plotting-position estimator of the third L-moment divided by the plotting-position estimator of the second L-moment.

See the help file for lMoment for more information on estimating L-moments.
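Assuming lMoment(x, r) returns the r'th sample L-moment (see the lMoment help file), the sample L-skewness is simply the ratio of the third to the second sample L-moment; a hedged sketch:

x <- c(2.1, 3.5, 3.9, 4.4, 5.0, 6.2, 8.8, 15.3)
lMoment(x, r = 3) / lMoment(x, r = 2)   # sample L-skewness
skewness(x, method = "l.moments")       # should agree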
A numeric scalar – the sample coefficient of skewness.
Traditionally, the coefficient of skewness has been estimated using product
moment estimators. Sometimes an estimate of skewness is used in a
goodness-of-fit test for normality (e.g., set test="skew"
in the call to gofTest
).
Hosking (1990) introduced the idea of L-moments and L-skewness.

Vogel and Fennessey (1993) argue that L-moment ratios should replace product moment ratios because of their superior performance (they are nearly unbiased and better for discriminating between distributions). They compare product moment diagrams with L-moment diagrams.

Hosking and Wallis (1995) recommend using unbiased estimators of L-moments (vs. plotting-position estimators) for almost all applications.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Chen, L. (1995). Testing the Mean of Skewed Distributions. Journal of the American Statistical Association 90(430), 767–772.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley and Sons, New York, p.73.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Vogel, R.M., and N.M. Fennessey. (1993). L Moment Diagrams Should Replace Product Moment Diagrams. Water Resources Research 29(6), 1745–1752.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
var
, sd
, cv
,
kurtosis
, summaryFull
,
Summary Statistics.
# Generate 20 observations from a lognormal distribution with parameters # mean=10 and cv=1, and estimate the coefficient of skewness. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 1) skewness(dat) #[1] 0.9876632 skewness(dat, method = "moment") #[1] 0.9119889 skewness(dat, meth = "l.moment") #[1] 0.2656674 #---------- # Clean up rm(dat)
For a strip plot or scatterplot produced using the package ggplot2 (e.g., with geom_point), for each value on the x-axis, add text indicating the mean and standard deviation of the y-values for that particular x-value.
stat_mean_sd_text(mapping = NULL, data = NULL, geom = ifelse(text.box, "label", "text"), position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, y.pos = NULL, y.expand.factor = 0.2, digits = 1, digit.type = "round", nsmall = ifelse(digit.type == "round", digits, 0), text.box = FALSE, alpha = 1, angle = 0, color = "black", family = "", fontface = "plain", hjust = 0.5, label.padding = ggplot2::unit(0.25, "lines"), label.r = ggplot2::unit(0.15, "lines"), label.size = 0.25, lineheight = 1.2, size = 4, vjust = 0.5, ...)
mapping , data , position , na.rm , show.legend , inherit.aes
|
See the help file for |
geom |
Character string indicating which |
y.pos |
Numeric scalar indicating the |
y.expand.factor |
For the case when |
digits |
Integer indicating the number of digits to use for displaying the
mean and standard deviation. When |
digit.type |
Character string indicating whether the |
nsmall |
Integer passed to the function |
text.box |
Logical scalar indicating whether to surround the text with a text box (i.e.,
whether to use |
alpha , angle , color , family , fontface , hjust , vjust , lineheight , size
|
See the help file for |
label.padding , label.r , label.size
|
See the help file for |
... |
Other arguments passed on to |
See the help file for geom_text
for details about how
geom_text
and geom_label
work.
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html for information on how to create a new stat.
The function stat_mean_sd_text
is called by the function geom_stripchart
.
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
geom_stripchart
, stat_median_iqr_text
,
stat_n_text
, stat_test_text
,
geom_text
, geom_label
,
mean
, sd
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #==================== # Example 1: # Using the built-in data frame mtcars, # plot miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) + theme(legend.position = "none") p + geom_point() + labs(x = "Number of Cylinders", y = "Miles per Gallon") # Now add text indicating the mean and standard deviation # for each level of cylinder. #-------------------------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 2: # Repeat Example 1, but: # 1) facet by transmission type, # 2) make the size of the text smaller. #-------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text(size = 3) + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 3: # Repeat Example 1, but specify the y-position for the text. #----------------------------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text(y.pos = 36) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 4: # Repeat Example 1, but show the # mean and standard deviation in a text box. #------------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text(text.box = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 5: # Repeat Example 1, but use the color brown for the text. #-------------------------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text(color = "brown") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 6: # Repeat Example 1, but: # 1) use the same colors for the text that are used for each group, # 2) use the bold monospaced font. #------------------------------------------------------------------ mat <- ggplot_build(p)$data[[1]] group <- mat[, "group"] colors <- mat[match(1:max(group), group), "colour"] dev.new() p + geom_point() + stat_mean_sd_text(color = colors, size = 5, family = "mono", fontface = "bold") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Clean up #--------- graphics.off() rm(p, mat, group, colors)
For a strip plot or scatterplot produced using the package ggplot2 (e.g., with geom_point), for each value on the x-axis, add text indicating the median and interquartile range (IQR) of the y-values for that particular x-value.
stat_median_iqr_text(mapping = NULL, data = NULL, geom = ifelse(text.box, "label", "text"), position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, y.pos = NULL, y.expand.factor = 0.2, digits = 1, digit.type = "round", nsmall = ifelse(digit.type == "round", digits, 0), text.box = FALSE, alpha = 1, angle = 0, color = "black", family = "", fontface = "plain", hjust = 0.5, label.padding = ggplot2::unit(0.25, "lines"), label.r = ggplot2::unit(0.15, "lines"), label.size = 0.25, lineheight = 1.2, size = 4, vjust = 0.5, ...)
mapping , data , position , na.rm , show.legend , inherit.aes
|
See the help file for |
geom |
Character string indicating which |
y.pos |
Numeric scalar indicating the |
y.expand.factor |
For the case when |
digits |
Integer indicating the number of digits to use for displaying the
median and interquartile range. When |
digit.type |
Character string indicating whether the |
nsmall |
Integer passed to the function |
text.box |
Logical scalar indicating whether to surround the text with a text box (i.e.,
whether to use |
alpha , angle , color , family , fontface , hjust , vjust , lineheight , size
|
See the help file for |
label.padding , label.r , label.size
|
See the help file for |
... |
Other arguments passed on to |
See the help file for geom_text
for details about how
geom_text
and geom_label
work.
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html for information on how to create a new stat.
The function stat_median_iqr_text
is called by the function geom_stripchart
.
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
geom_stripchart
, stat_mean_sd_text
,
stat_n_text
, stat_test_text
,
geom_text
, geom_label
,
median
, iqr
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #==================== # Example 1: # Using the built-in data frame mtcars, # plot miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) + theme(legend.position = "none") p + geom_point() + labs(x = "Number of Cylinders", y = "Miles per Gallon") # Now add text indicating the median and interquartile range # for each level of cylinder. #----------------------------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 2: # Repeat Example 1, but: # 1) facet by transmission type, # 2) make the size of the text smaller. #-------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text(size = 2.75) + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 3: # Repeat Example 1, but specify the y-position for the text. #----------------------------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text(y.pos = 36) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 4: # Repeat Example 1, but show the # median and interquartile range in a text box. #---------------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text(text.box = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 5: # Repeat Example 1, but use the color brown for the text. #-------------------------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text(color = "brown") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 6: # Repeat Example 1, but: # 1) use the same colors for the text that are used for each group, # 2) use the bold monospaced font. #------------------------------------------------------------------ mat <- ggplot_build(p)$data[[1]] group <- mat[, "group"] colors <- mat[match(1:max(group), group), "colour"] dev.new() p + geom_point() + stat_median_iqr_text(color = colors, family = "mono", fontface = "bold") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Clean up #--------- graphics.off() rm(p, mat, group, colors)
For a strip plot or scatterplot produced using the package ggplot2 (e.g., with geom_point), for each value on the x-axis, add text indicating the number of y-values for that particular x-value.
stat_n_text(mapping = NULL, data = NULL, geom = ifelse(text.box, "label", "text"), position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, y.pos = NULL, y.expand.factor = 0.1, text.box = FALSE, alpha = 1, angle = 0, color = "black", family = "", fontface = "plain", hjust = 0.5, label.padding = ggplot2::unit(0.25, "lines"), label.r = ggplot2::unit(0.15, "lines"), label.size = 0.25, lineheight = 1.2, size = 4, vjust = 0.5, ...)
mapping , data , position , na.rm , show.legend , inherit.aes
|
See the help file for |
geom |
Character string indicating which |
y.pos |
Numeric scalar indicating the |
y.expand.factor |
For the case when |
text.box |
Logical scalar indicating whether to surround the text with a text box (i.e.,
whether to use |
alpha , angle , color , family , fontface , hjust , vjust , lineheight , size
|
See the help file for |
label.padding , label.r , label.size
|
See the help file for |
... |
Other arguments passed on to |
See the help file for geom_text
for details about how
geom_text
and geom_label
work.
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html for information on how to create a new stat.
The function stat_n_text
is called by the function geom_stripchart
.
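Because stat_n_text is invoked automatically by geom_stripchart, a single call to geom_stripchart already displays the per-group sample sizes. A minimal sketch (assuming ggplot2 and EnvStats are attached and the default arguments of geom_stripchart, as described in its own help file):

library(ggplot2)
library(EnvStats)

# geom_stripchart() calls stat_n_text() internally, so the sample sizes
# appear without adding stat_n_text() to the plot explicitly.
ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) +
  geom_stripchart() +
  labs(x = "Number of Cylinders", y = "Miles per Gallon")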
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
geom_stripchart
, stat_mean_sd_text
,
stat_median_iqr_text
, stat_test_text
,
geom_text
, geom_label
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #==================== # Example 1: # Using the built-in data frame mtcars, # plot miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) + theme(legend.position = "none") p + geom_point() + labs(x = "Number of Cylinders", y = "Miles per Gallon") # Now add the sample size for each level of cylinder. #---------------------------------------------------- dev.new() p + geom_point() + stat_n_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 2: # Repeat Example 1, but: # 1) facet by transmission type, # 2) make the size of the text smaller. #-------------------------------------- dev.new() p + geom_point() + stat_n_text(size = 3) + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 3: # Repeat Example 1, but specify the y-position for the text. #----------------------------------------------------------- dev.new() p + geom_point() + stat_n_text(y.pos = 5) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 4: # Repeat Example 1, but show the sample size in a text box. #---------------------------------------------------------- dev.new() p + geom_point() + stat_n_text(text.box = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 5: # Repeat Example 1, but use the color brown for the text. #-------------------------------------------------------- dev.new() p + geom_point() + stat_n_text(color = "brown") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 6: # Repeat Example 1, but: # 1) use the same colors for the text that are used for each group, # 2) use the bold monospaced font. #------------------------------------------------------------------ mat <- ggplot_build(p)$data[[1]] group <- mat[, "group"] colors <- mat[match(1:max(group), group), "colour"] dev.new() p + geom_point() + stat_n_text(color = colors, size = 5, family = "mono", fontface = "bold") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Clean up #--------- graphics.off() rm(p, mat, group, colors)
For a strip plot or scatterplot produced using the package ggplot2
(e.g., with geom_point
),
add text indicating the results of a hypothesis test comparing locations
between groups, where the groups are defined based on the unique x-values.
stat_test_text(mapping = NULL, data = NULL, geom = ifelse(text.box, "label", "text"), position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, y.pos = NULL, y.expand.factor = 0.35, test = "parametric", paired = FALSE, test.arg.list = list(), two.lines = TRUE, p.value.digits = 3, p.value.digit.type = "round", location.digits = 1, location.digit.type = "round", nsmall = ifelse(location.digit.type == "round", location.digits, 0), text.box = FALSE, alpha = 1, angle = 0, color = "black", family = "", fontface = "plain", hjust = 0.5, label.padding = ggplot2::unit(0.25, "lines"), label.r = ggplot2::unit(0.15, "lines"), label.size = 0.25, lineheight = 1.2, size = 4, vjust = 0.5, ...)
mapping , data , position , na.rm , show.legend , inherit.aes
|
See the help file for |
geom |
Character string indicating which |
y.pos |
Numeric scalar indicating the |
y.expand.factor |
For the case when |
test |
A character string indicating whether to use a standard parametric test
( |
paired |
For the case of two groups, a logical scalar indicating whether the data
should be considered to be paired. The default value is NOTE: if the argument |
test.arg.list |
An optional list of arguments to pass to the function used to test for
group differences in location. The default value is an empty list:
NOTE: If |
two.lines |
For the case of one or two groups, a logical scalar indicating whether the
associated confidence interval should be displayed on a second line
instead of on the same line as the p-value. The default is |
p.value.digits |
An integer indicating the number of digits to use for displaying the
p-value. When |
p.value.digit.type |
A character string indicating whether the |
location.digits |
For the case of one or two groups, an integer indicating the number of digits
to use for displaying the associated confidence interval.
When |
location.digit.type |
For the case of one or two groups, a character string indicating
whether the |
nsmall |
For the case of one or two groups, an integer passed to the function
|
text.box |
Logical scalar indicating whether to surround the text with a text box (i.e.,
whether to use |
alpha , angle , color , family , fontface , hjust , vjust , lineheight , size
|
See the help file for |
label.padding , label.r , label.size
|
See the help file for |
... |
Other arguments passed on to |
The table below shows which hypothesis test is performed based on the number of groups
and on the values of the arguments test and paired.

Number of Groups | test | paired | Name | Function Called
---|---|---|---|---
1 | "parametric" | | One-Sample t-test | t.test
 | "nonparametric" | | Wilcoxon Signed Rank Test | wilcox.test
2 | "parametric" | FALSE | Two-Sample t-test | t.test
 | "parametric" | TRUE | Paired t-test | t.test
 | "nonparametric" | FALSE | Wilcoxon Rank Sum Test | wilcox.test
 | "nonparametric" | TRUE | Wilcoxon Signed Rank Test on Paired Differences | wilcox.test
3 or more | "parametric" | | Analysis of Variance | aov and summary.aov
 | "nonparametric" | | Kruskal-Wallis Test | kruskal.test
See the help file for geom_text
for details about how
geom_text
and geom_label
work.
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html for information on how to create a new stat.
The function stat_test_text
is called by the function geom_stripchart
.
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
geom_stripchart
, stat_mean_sd_text
,
stat_median_iqr_text
, stat_n_text
,
geom_text
, geom_label
,
t.test
, wilcox.test
,
aov
, summary.aov
,
kruskal.test
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #========== # Example 1: # Using the built-in data frame mtcars, # plot miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) + theme(legend.position = "none") p + geom_point(show.legend = FALSE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") # Now add text indicating the sample size and # mean and standard deviation for each level of cylinder, and # test for the difference in means between groups. #------------------------------------------------------------ dev.new() p + geom_point() + stat_n_text() + stat_mean_sd_text() + stat_test_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 2: # Repeat Example 1, but show text indicating the median and IQR, # and use the nonparametric test. #--------------------------------------------------------------- dev.new() p + geom_point() + stat_n_text() + stat_median_iqr_text() + stat_test_text(test = "nonparametric") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 3: # Repeat Example 1, but use only the groups with # 4 and 8 cylinders. #----------------------------------------------- p <- ggplot(subset(mtcars, cyl %in% c(4, 8)), aes(x = factor(cyl), y = mpg, color = cyl)) + theme(legend.position = "none") dev.new() p + geom_point() + stat_n_text() + stat_mean_sd_text() + stat_test_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 4: # Repeat Example 3, but # 1) facet by transmission type, # 2) make the text smaller, # 3) put the text for the test results in a text box # and make them blue. #--------------------------------------------------- dev.new() p + geom_point() + stat_n_text(size = 3) + stat_mean_sd_text(size = 3) + stat_test_text(size = 3, text.box = TRUE, color = "blue") + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Clean up #--------- graphics.off() rm(p)
stripChart
is a modification of the R function stripchart
.
It is a generic function used to produce one dimensional scatter
plots (or dot plots) of the given data, along with text indicating sample size and
estimates of location (mean or median) and scale (standard deviation
or interquartile range), as well as confidence intervals for the population
location parameter.
One dimensional scatterplots are a good alternative to boxplots
when sample sizes are small or moderate. The function invokes particular
methods
which depend on the class
of the first argument.
stripChart(x, ...) ## S3 method for class 'formula' stripChart(x, data = NULL, dlab = NULL, subset, na.action = NULL, ...) ## Default S3 method: stripChart(x, method = ifelse(paired && paired.lines, "overplot", "stack"), seed = 47, jitter = 0.1 * cex, offset = 1/2, vertical = TRUE, group.names, group.names.cex = cex, drop.unused.levels = TRUE, add = FALSE, at = NULL, xlim = NULL, ylim = NULL, ylab = NULL, xlab = NULL, dlab = "", glab = "", log = "", pch = 1, col = par("fg"), cex = par("cex"), points.cex = cex, axes = TRUE, frame.plot = axes, show.ci = TRUE, location.pch = 16, location.cex = cex, conf.level = 0.95, min.n.for.ci = 2, ci.offset = 3/ifelse(n > 2, (n-1)^(1/3), 1), ci.bar.lwd = cex, ci.bar.ends = TRUE, ci.bar.ends.size = 0.5 * cex, ci.bar.gap = FALSE, n.text = "bottom", n.text.line = ifelse(n.text == "bottom", 2, 0), n.text.cex = cex, location.scale.text = "top", location.scale.digits = 1, nsmall = location.scale.digits, location.scale.text.line = ifelse(location.scale.text == "top", 0, 3.5), location.scale.text.cex = cex * 0.8 * ifelse(n > 6, max(0.4, 1 - (n-6) * 0.06), 1), p.value = FALSE, p.value.digits = 3, p.value.line = 2, p.value.cex = cex, group.difference.ci = p.value, group.difference.conf.level = 0.95, group.difference.digits = location.scale.digits, ci.and.test = "parametric", ci.arg.list = NULL, test.arg.list = NULL, alternative = "two.sided", plot.diff = FALSE, diff.col = col[1], diff.method = "stack", diff.pch = pch[1], paired = FALSE, paired.lines = paired, paired.lty = 1:6, paired.lwd = 1, paired.pch = 1:14, paired.col = NULL, diff.name = NULL, diff.name.cex = group.names.cex, sep.line = TRUE, sep.lty = 2, sep.lwd = cex, sep.col = "gray", diff.lim = NULL, diff.at = NULL, diff.axis.label = NULL, plot.diff.mar = c(5, 4, 4, 4) + 0.1, ...)
x |
the data from which the plots are to be produced. In the default method the data can be
specified as a list or data frame where each component is numeric, a numeric matrix,
or a numeric vector. In the formula method, a symbolic specification of the form
|
data |
for the formula method, a data.frame (or list) from which the variables in |
subset |
for the formula method, an optional vector specifying a subset of observations to be used for plotting. |
na.action |
for the formula method, a function which indicates what should happen when the data
contain |
... |
additional parameters passed to the default method, or by it to |
method |
the method to be used to separate coincident points. When |
seed |
when |
jitter |
when |
offset |
when stacking is used, points are stacked this many line-heights (symbol widths) apart. |
vertical |
when |
group.names |
group labels which will be printed alongside (or underneath) each plot. |
group.names.cex |
numeric scalar indicating the amount by which the group labels should be scaled
relative to the default (see the help file for |
drop.unused.levels |
when |
add |
logical, if true add the chart to the current plot. |
at |
numeric vector giving the locations where the charts should be drawn,
particularly when |
xlim , ylim
|
plot limits: see |
ylab , xlab
|
labels: see |
dlab , glab
|
alternate way to specify axis labels. The |
log |
on which axes to use a log scale: see |
pch , col , cex
|
Graphical parameters: see |
points.cex |
Sets the |
axes , frame.plot
|
Axis control: see |
show.ci |
logical scalar indicating whether to plot the confidence interval. The default is
|
location.pch |
integer indicating which plotting character to use to indicate the estimate of location
(mean or median) for each group (see the help file for |
location.cex |
numeric scalar giving the amount by which the plotting characters indicating the
estimate of location for each group should be scaled relative to the default
(see the help file for |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the
confidence interval for the group location (population mean or median).
The default value is |
min.n.for.ci |
integer indicating the minimum sample size required in order to plot a confidence interval
for the group location. The default value is |
ci.offset |
numeric scalar or vector of length equal to the number of groups ( |
ci.bar.lwd |
numeric scalar indicating the line width for the confidence interval bars.
The default is the current value of the graphics parameter |
ci.bar.ends |
logical scalar indicating whether to add flat ends to the confidence interval bars.
The default value is |
ci.bar.ends.size |
numeric scalar in units of |
ci.bar.gap |
logical scalar indicating whether to add a gap between the estimate of group location and the
confidence interval bar. The default value is |
n.text |
character string indicating whether and where to indicate the sample size for each group.
Possible values are |
n.text.line |
integer indicating on which plot margin line to show the sample sizes for each group. The
default value is |
n.text.cex |
numeric scalar giving the amount by which the text indicating the sample size for
each group should be scaled relative to the default (see the help file for |
location.scale.text |
character string indicating whether and where to indicate the estimates of location
(mean or median) and scale (standard deviation or interquartile range) for each group.
Possible values are |
location.scale.digits |
integer indicating the number of digits to round the estimates of location and scale. The
default value is |
nsmall |
integer passed to the function |
location.scale.text.line |
integer indicating on which plot margin line to show the estimates of location and scale
for each group. The default value is |
location.scale.text.cex |
numeric scalar giving the amount by which the text indicating the estimates of
location and scale for each group should be scaled relative to the default
(see the help file for |
p.value |
logical scalar indicating whether to show the p-value associated with testing whether all groups
have the same population location. The default value is |
p.value.digits |
integer indicating the number of digits to round to when displaying the p-value associated with
the test of equal group locations. The default value is |
p.value.line |
integer indicating on which plot margin line to show the p-value associated with the test of
equal group locations. The default value is |
p.value.cex |
numeric scalar giving the amount by which the text indicating the p-value associated
with the test of equal group locations should be scaled relative to the default
(see the help file for |
group.difference.ci |
for the case when there are just 2 groups, a logical scalar indicating whether to display
the confidence interval for the difference between group locations. The default is
the value of the |
group.difference.conf.level |
for the case when there are just 2 groups, a numeric scalar between 0 and 1
indicating the confidence level associated with the confidence interval for the
difference between group locations. The default is |
group.difference.digits |
for the case when there are just 2 groups, an integer indicating the number of digits to
round to when displaying the confidence interval for the difference between group locations.
The default value is |
ci.and.test |
character string indicating whether confidence intervals and tests should be based on parametric
or nonparametric ( |
ci.arg.list |
an optional list of arguments to pass to the function used to compute confidence intervals.
The default value is |
test.arg.list |
an optional list of arguments to pass to the function used to test for group differences in location.
The default value is |
alternative |
character string describing the alternative hypothesis for the test of group differences in the
case when there are two groups. Possible values are |
plot.diff |
applicable only to the case when there are two groups: When When |
diff.col |
applicable only to the case when there are two groups and |
diff.method |
applicable only to the case when there are two groups, |
diff.pch |
applicable only to the case when there are two groups, |
paired |
applicable only to the case when there are two groups: |
paired.lines |
applicable only to the case when there are two groups and |
paired.lty |
applicable only to the case when there are two groups, |
paired.lwd |
applicable only to the case when there are two groups, |
paired.pch |
applicable only to the case when there are two groups, |
paired.col |
applicable only to the case when there are two groups, |
diff.name |
applicable only to the case when there are two groups and |
diff.name.cex |
applicable only to the case when there are two groups and |
sep.line |
applicable only to the case when there are two groups and |
sep.lty |
applicable only to the case when there are two groups, |
sep.lwd |
applicable only to the case when there are two groups, |
sep.col |
applicable only to the case when there are two groups, |
diff.lim |
applicable only to the case when there are two groups and |
diff.at |
applicable only to the case when there are two groups and |
diff.axis.label |
applicable only to the case when there are two groups and |
plot.diff.mar |
applicable only to the case when there are two groups, |
stripChart
invisibly returns a list with the following components:
group.centers |
numeric vector of values on the group axis (the |
group.stats |
a matrix with the number of rows equal to the number of groups and six columns indicating the sample size of the group (N), the estimate of the group location parameter (Mean or Median), the estimate of the group scale (SD or IQR), the lower confidence limit for the group location parameter (LCL), the upper confidence limit for the group location parameter (UCL), and the confidence level associated with the confidence interval (Conf.Level) |
In addition, if the argument p.value=TRUE
and/or 1) there are two groups and 2) plot.diff=TRUE
,
the list also includes these components:
group.difference.p.value |
numeric scalar indicating the p-value associated with the test of equal group locations. |
group.difference.conf.int |
numeric vector of two elements indicating the confidence interval for the difference between the group locations. Only present when there are two groups. |
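Since stripChart returns its value invisibly, assign the result to inspect these components. A short sketch, assuming EnvStats is attached and using the EPA.94b.tccb.df data set from the Examples below:

ret <- stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df,
  p.value = TRUE, ylab = "log10 [ TcCB (ppb) ]")

ret$group.centers             # positions of the groups on the group axis
ret$group.stats               # N, location, scale, LCL, UCL, Conf.Level per group
ret$group.difference.p.value  # included because p.value = TRUE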
Steven P. Millard ([email protected])
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods. Second Edition. John Wiley and Sons, New York.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
stripchart
, t.test
, wilcox.test
,
aov
, kruskal.test
, t.test
.
#------------------------ # Two Independent Samples #------------------------ # The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. # # First create one-dimensional scatterplots to compare the # TcCB concentrations between the areas and use a nonparametric # test to test for a difference between areas. dev.new() stripChart(TcCB ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, ci.and.test = "nonparametric", ylab = "TcCB (ppb)") #---------- # Now log-transform the TcCB data and use a parametric test # to compare the areas. dev.new() stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, ylab = "log10 [ TcCB (ppb) ]") #---------- # Repeat the above procedure, but also plot the confidence interval # for the difference between the means. dev.new() stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, plot.diff = TRUE, diff.col = "black", ylab = "log10 [ TcCB (ppb) ]") #---------- # Repeat the above procedure, but allow the variances to differ. dev.new() stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, plot.diff = TRUE, diff.col = "black", ylab = "log10 [ TcCB (ppb) ]", test.arg.list = list(var.equal = FALSE)) #---------- # Repeat the above procedure, but jitter the points instead of # stacking them. dev.new() stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, plot.diff = TRUE, diff.col = "black", ylab = "log10 [ TcCB (ppb) ]", test.arg.list = list(var.equal = FALSE), method = "jitter", ci.offset = 4) #---------- # Clean up #--------- graphics.off() #==================== #-------------------- # Paired Observations #-------------------- # The data frame ACE.13.TCE.df contians paired observations of # trichloroethylene (TCE; mg/L) at 10 groundwater monitoring wells # before and after remediation. # # Create one-dimensional scatterplots to compare TCE concentrations # before and after remediation and use a paired t-test to # test for a difference between periods. ACE.13.TCE.df # TCE.mg.per.L Well Period #1 20.900 1 Before #2 9.170 2 Before #3 5.960 3 Before #... ...... .. ...... #18 0.520 8 After #19 3.060 9 After #20 1.900 10 After dev.new() stripChart(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, col = c("brown", "green"), p.value = TRUE, paired = TRUE, ylab = "TCE (mg/L)") #---------- # Repeat the above procedure, but also plot the confidence interval # for the mean of the paired differences. dev.new() stripChart(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, col = c("brown", "green"), p.value = TRUE, paired = TRUE, ylab = "TCE (mg/L)", plot.diff = TRUE, diff.col = "blue") #========== # Repeat the last two examples, but use a one-sided alternative since # remediation should decrease TCE concentration. dev.new() stripChart(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, col = c("brown", "green"), p.value = TRUE, paired = TRUE, ylab = "TCE (mg/L)", alternative = "less", group.difference.digits = 2) #---------- # Repeat the above procedure, but also plot the confidence interval # for the mean of the paired differences. 
# # NOTE: Although stripChart can *report* one-sided confidence intervals # for the difference between two groups (see above example), # when *plotting* the confidence interval for the difference, # only two-sided CIs are allowed. # Here, we will set the confidence level of the confidence # interval for the mean of the paired differences to 90%, # so that the upper bound of the CI corresponds to the upper # bound of a 95% one-sided CI. dev.new() stripChart(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, col = c("brown", "green"), p.value = TRUE, paired = TRUE, ylab = "TCE (mg/L)", group.difference.digits = 2, plot.diff = TRUE, diff.col = "blue", group.difference.conf.level = 0.9) #---------- # Clean up #--------- graphics.off() #========== # The data frame Helsel.Hirsch.02.Mayfly.df contains paired counts # of mayfly nymphs above and below industrial outfalls in 12 streams. # # Create one-dimensional scatterplots to compare the # counts between locations and use a nonparametric test # to compare counts above and below the outfalls. Helsel.Hirsch.02.Mayfly.df # Mayfly.Count Stream Location #1 12 1 Above #2 15 2 Above #3 11 3 Above #... ... .. ..... #22 60 10 Below #23 53 11 Below #24 124 12 Below dev.new() stripChart(Mayfly.Count ~ Location, data = Helsel.Hirsch.02.Mayfly.df, col = c("green", "brown"), p.value = TRUE, paired = TRUE, ci.and.test = "nonparametric", ylab = "Number of Mayfly Nymphs") #---------- # Repeat the above procedure, but also plot the confidence interval # for the pseudomedian of the paired differences. dev.new() stripChart(Mayfly.Count ~ Location, data = Helsel.Hirsch.02.Mayfly.df, col = c("green", "brown"), p.value = TRUE, paired = TRUE, ci.and.test = "nonparametric", ylab = "Number of Mayfly Nymphs", plot.diff = TRUE, diff.col = "blue") #---------- # Clean up #--------- graphics.off()
summaryFull
is a generic function used to produce a full complement of summary statistics.
The function invokes particular methods
which depend on the class
of
the first argument. The summary statistics include: sample size, number of missing values,
mean, median, trimmed mean, geometric mean, skew, kurtosis, min, max, range, 1st quartile, 3rd quartile,
standard deviation, geometric standard deviation, interquartile range, median absolute deviation, and
coefficient of variation.
summaryFull(object, ...) ## S3 method for class 'formula' summaryFull(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: summaryFull(object, group = NULL, combine.groups = FALSE, drop.unused.levels = TRUE, rm.group.na = TRUE, stats = NULL, trim = 0.1, sd.method = "sqrt.unbiased", geo.sd.method = "sqrt.unbiased", skew.list = list(), kurtosis.list = list(), cv.list = list(), digits = max(3, getOption("digits") - 3), digit.type = "signif", stats.in.rows = TRUE, drop0trailing = TRUE, data.name = deparse(substitute(object)), ...) ## S3 method for class 'data.frame' summaryFull(object, ...) ## S3 method for class 'matrix' summaryFull(object, ...) ## S3 method for class 'list' summaryFull(object, ...)
object |
an object for which summary statistics are desired. In the default method,
the argument |
data |
when |
subset |
when |
na.action |
when |
group |
when |
combine.groups |
logical scalar indicating whether to show summary statistics for all groups combined.
The default value is |
drop.unused.levels |
when |
rm.group.na |
logical scalar indicating whether to remove missing values from the |
stats |
character vector indicating which statistics to compute. Possible elements of the character
vector include: |
trim |
fraction (between 0 and 0.5 inclusive) of values to be trimmed from each end of the ordered data
to compute the trimmed mean. The default value is |
sd.method |
character string specifying what method to use to compute the sample standard deviation.
The possible values are |
geo.sd.method |
character string specifying what method to use to compute the sample standard deviation of the
log-transformed observations prior to exponentiating this quantity. The possible values are
|
skew.list |
list of arguments to supply to the |
kurtosis.list |
list of arguments to supply to the |
cv.list |
list of arguments to supply to the |
digits |
integer indicating the number of digits to use for the summary statistics.
When |
digit.type |
character string indicating whether the |
stats.in.rows |
logical scalar indicating whether to show the summary statistics in the rows or columns of the
output. The default is |
drop0trailing |
logical scalar indicating whether to drop trailing 0's when printing the summary statistics.
The value of this argument is added as an attribute to the returned list and is used by the
|
data.name |
character string indicating the name of the data used for the summary statistics. |
... |
additional arguments affecting the summary statistics produced. |
The function summaryFull
returns summary statistics that are useful to describe various
characteristics of one or more variables. It is an extended version of the built-in R function
summary
specifically for non-factor numeric data. The table below shows what
statistics are computed and what functions are called by summaryFull
to compute these statistics.
The object returned by summaryFull
is useful for printing or reporting purposes. You may also
use the functions that summaryFull
calls (see table below) to compute summary statistics to
be used by other functions.
See the help files for the functions listed in the table below for more information on these summary statistics.
Summary Statistic | Function Used |
Mean | mean |
Median | median |
Trimmed Mean | mean with trim argument |
Geometric Mean | geoMean |
Skew | skewness |
Kurtosis | kurtosis |
Min | min |
Max | max |
Range | range and diff |
1st Quartile | quantile |
3rd Quartile | quantile |
Standard Deviation | sd |
Geometric Standard Deviation | geoSD |
Interquartile Range | iqr |
Median Absolute Deviation | mad |
Coefficient of Variation | cv |
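For illustration, several of the entries reported by summaryFull can be computed directly with the functions named in the table. A brief sketch (assuming EnvStats is attached; rlnormAlt, geoMean, skewness, iqr, and cv are EnvStats functions, the rest are base R):

set.seed(250)
dat <- rlnormAlt(20, mean = 10, cv = 1)

mean(dat, trim = 0.1)   # 10% Trimmed Mean
geoMean(dat)            # Geometric Mean
skewness(dat)           # Skew
iqr(dat)                # Interquartile Range
cv(dat)                 # Coefficient of Variation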
an object of class "summaryStats" (see summaryStats.object).
Objects of class "summaryStats"
are numeric matrices that contain the
summary statistics produced by a call to summaryStats
or summaryFull
.
These objects have a special printing method that by default removes
trailing zeros for sample size entries and prints blanks for statistics that are
normally displayed as NA
(see print.summaryStats
).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Leidel, N.A., K.A. Busch, and J.R. Lynch. (1977). Occupational Exposure Sampling Strategy Manual. U.S. Department of Health, Education, and Welfare, Public Health Service, Center for Disease Control, National Institute for Occupational Safety and Health, Cincinnati, Ohio 45226, January, 1977, pp.102-103.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis, Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# Generate 20 observations from a lognormal distribution with # parameters mean=10 and cv=1, and compute the summary statistics. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rlnormAlt(20, mean=10, cv=1) summary(dat) # Min. 1st Qu. Median Mean 3rd Qu. Max. #2.608 4.995 6.235 7.490 9.295 15.440 summaryFull(dat) # dat #N 20 #Mean 7.49 #Median 6.235 #10% Trimmed Mean 7.125 #Geometric Mean 6.674 #Skew 0.9877 #Kurtosis -0.03539 #Min 2.608 #Max 15.44 #Range 12.83 #1st Quartile 4.995 #3rd Quartile 9.295 #Standard Deviation 3.803 #Geometric Standard Deviation 1.634 #Interquartile Range 4.3 #Median Absolute Deviation 2.607 #Coefficient of Variation 0.5078 #---------- # Compare summary statistics for normal and lognormal data: log.dat <- log(dat) summaryFull(list(dat = dat, log.dat = log.dat)) # dat log.dat #N 20 20 #Mean 7.49 1.898 #Median 6.235 1.83 #10% Trimmed Mean 7.125 1.902 #Geometric Mean 6.674 1.835 #Skew 0.9877 0.1319 #Kurtosis -0.03539 -0.4288 #Min 2.608 0.9587 #Max 15.44 2.737 #Range 12.83 1.778 #1st Quartile 4.995 1.607 #3rd Quartile 9.295 2.227 #Standard Deviation 3.803 0.4913 #Geometric Standard Deviation 1.634 1.315 #Interquartile Range 4.3 0.62 #Median Absolute Deviation 2.607 0.4915 #Coefficient of Variation 0.5078 0.2588 # Clean up rm(dat, log.dat) #-------------------------------------------------------------------- # Compute summary statistics for 10 observations from a normal # distribution with parameters mean=0 and sd=1. Note that the # geometric mean and geometric standard deviation are not computed # since some of the observations are non-positive. set.seed(287) dat <- rnorm(10) summaryFull(dat) # dat #N 10 #Mean 0.07406 #Median 0.1095 #10% Trimmed Mean 0.1051 #Skew -0.1646 #Kurtosis -0.7135 #Min -1.549 #Max 1.449 #Range 2.998 #1st Quartile -0.5834 #3rd Quartile 0.6966 #Standard Deviation 0.9412 #Interquartile Range 1.28 #Median Absolute Deviation 1.05 # Clean up rm(dat) #-------------------------------------------------------------------- # Compute summary statistics for the TcCB data given in USEPA (1994b) # (the data are stored in EPA.94b.tccb.df). Arbitrarily set the one # censored observation to the censoring level. Group by the variable # Area. summaryFull(TcCB ~ Area, data = EPA.94b.tccb.df) # Cleanup Reference #N 77 47 #Mean 3.915 0.5985 #Median 0.43 0.54 #10% Trimmed Mean 0.6846 0.5728 #Geometric Mean 0.5784 0.5382 #Skew 7.717 0.9019 #Kurtosis 62.67 0.132 #Min 0.09 0.22 #Max 168.6 1.33 #Range 168.5 1.11 #1st Quartile 0.23 0.39 #3rd Quartile 1.1 0.75 #Standard Deviation 20.02 0.2836 #Geometric Standard Deviation 3.898 1.597 #Interquartile Range 0.87 0.36 #Median Absolute Deviation 0.3558 0.2669 #Coefficient of Variation 5.112 0.4739
summaryStats
is a generic function used to produce summary statistics, confidence intervals,
and results of hypothesis tests. The function invokes particular methods
which
depend on the class
of the first argument.
The summary statistics include: sample size, number of missing values, mean, standard deviation, median, min, and max. Optional additional summary statistics include 1st quartile, 3rd quartile, and standard error.
summaryStats(object, ...) ## S3 method for class 'formula' summaryStats(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: summaryStats(object, group = NULL, drop.unused.levels = TRUE, se = FALSE, quartiles = FALSE, digits = max(3, getOption("digits") - 3), digit.type = "round", drop0trailing = TRUE, show.na = TRUE, show.0.na = FALSE, p.value = FALSE, p.value.digits = 2, p.value.digit.type = "signif", test = "parametric", paired = FALSE, test.arg.list = NULL, combine.groups = p.value, rm.group.na = TRUE, group.p.value.type = NULL, alternative = "two.sided", ci = NULL, ci.between = NULL, conf.level = 0.95, stats.in.rows = FALSE, data.name = deparse(substitute(object)), ...) ## S3 method for class 'factor' summaryStats(object, group = NULL, drop.unused.levels = TRUE, digits = max(3, getOption("digits") - 3), digit.type = "round", drop0trailing = TRUE, show.na = TRUE, show.0.na = FALSE, p.value = FALSE, p.value.digits = 2, p.value.digit.type = "signif", test = "chisq", test.arg.list = NULL, combine.levels = TRUE, combine.groups = FALSE, rm.group.na = TRUE, ci = p.value & test != "chisq", conf.level = 0.95, stats.in.rows = FALSE, ...) ## S3 method for class 'character' summaryStats(object, ...) ## S3 method for class 'logical' summaryStats(object, ...) ## S3 method for class 'data.frame' summaryStats(object, ...) ## S3 method for class 'matrix' summaryStats(object, ...) ## S3 method for class 'list' summaryStats(object, ...)
object |
an object for which summary statistics are desired. In the default method,
the argument |
data |
when |
subset |
when |
na.action |
when |
group |
when |
drop.unused.levels |
when |
se |
for numeric data, logical scalar indicating whether to include
the standard error of the mean in the summary statistics.
The default value is |
quartiles |
for numeric data, logical scalar indicating whether to include
the estimated 25th and 75th percentiles in the summary statistics.
The default value is |
digits |
integer indicating the number of digits to use for the summary statistics.
When |
digit.type |
character string indicating whether the |
drop0trailing |
logical scalar indicating whether to drop trailing 0's when printing the summary statistics.
The value of this argument is added as an attribute to the returned list and is used by the
|
show.na |
logical scalar indicating whether to return the number of missing values.
The default value is |
show.0.na |
logical scalar indicating whether to display the number of missing values in the case when
there are no missing values. The default value is |
p.value |
logical scalar indicating whether to return the p-value associated with a test of hypothesis.
The default value is |
p.value.digits |
integer indicating the number of digits to use for the p-value. When |
p.value.digit.type |
character string indicating whether the |
test |
Numeric data: character string indicating whether to compute p-values and confidence
intervals based on parametric ( Factors: character string indicating which test to perform when |
paired |
applicable only to the case when there are two groups: |
test.arg.list |
a list with additional arguments to pass to the test used to compute p-values and confidence
intervals. For numeric data, when |
combine.groups |
logical scalar indicating whether to show summary statistics for all groups combined.
Numeric data: the default value is |
rm.group.na |
logical scalar indicating whether to remove missing values from the |
group.p.value.type |
for numeric data, character string indicating which p-value(s) to compute when
there is more than one group. When |
alternative |
for numeric data, character string indicating which alternative to assume
for p-values and confidence intervals. Possible values are |
ci |
Numeric data: logical scalar indicating whether to compute a confidence interval
for the mean or each group mean. The default value is Factors: logical scalar indicating whether to compute a confidence interval. A confidence
interval is computed only if the number of levels in |
ci.between |
for numeric data, logical scalar indicating whether to compute a confidence interval
for the difference between group means when there are two groups.
The default value is |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence intervals.
The default value is |
stats.in.rows |
logical scalar indicating whether to show the summary statistics in the rows or columns of the
output. The default is |
data.name |
character string indicating the name of the data used for the summary statistics. |
combine.levels |
for factors, a logical scalar indicating whether to compute summary statistics based on combining all levels of a factor. |
... |
additional arguments affecting the summary statistics produced. |
an object of class "summaryStats"
(see summaryStats.object
).
Objects of class "summaryStats"
are numeric matrices that contain the
summary statistics produced by a call to summaryStats
or summaryFull
.
These objects have a special printing method that by default removes
trailing zeros for sample size entries and prints blanks for statistics that are
normally displayed as NA
(see print.summaryStats
).
Summary statistics for numeric data include sample size, mean, standard deviation, median,
min, and max. Options include the standard error of the mean (when se=TRUE
),
the estimated quartiles (when quartiles=TRUE
), p-values (when p.value=TRUE
),
and/or confidence intervals (when ci=TRUE
and/or ci.between=TRUE
).
Summary statistics for factors include the sample size for each level of the factor and the
percent of the total for that level. Options include a p-value (when p.value=TRUE
).
Note that unlike the R function summary
and the EnvStats function
summaryFull
, by default the digits
argument for the EnvStats function
summaryStats
refers to how many decimal places to round to, not how many
significant digits to use (see the explanation of the argument digit.type
above).
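As a minimal illustration of this point (not part of the original help file; the input vector is arbitrary), compare the default digit.type = "round" with digit.type = "signif":

x <- c(0.01234, 0.2345, 3.456, 45.67)

# Round the summary statistics to 2 decimal places (the default behavior)
summaryStats(x, digits = 2)

# Use 2 significant digits instead
summaryStats(x, digits = 2, digit.type = "signif")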
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
summary
, summaryFull
, t.test
, anova.lm
,
wilcox.test
, kruskal.test
,
chisq.test
, fisher.test
, binom.test
.
# The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. #---------- # First, create summary statistics by area based on the log-transformed data. summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df) # N Mean SD Median Min Max #Cleanup 77 -0.2377 0.5908 -0.3665 -1.0458 2.2270 #Reference 47 -0.2691 0.2032 -0.2676 -0.6576 0.1239 #---------- # Now create summary statistics by area based on the log-transformed data # and use the t-test to compare the areas. summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, p.value = TRUE) summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, p.value = TRUE, stats.in.rows = TRUE) # Cleanup Reference Combined #N 77 47 124 #Mean -0.2377 -0.2691 -0.2496 #SD 0.5908 0.2032 0.481 #Median -0.3665 -0.2676 -0.3143 #Min -1.0458 -0.6576 -1.0458 #Max 2.227 0.1239 2.227 #Diff -0.0313 #p.value.between 0.73 #95%.LCL.between -0.2082 #95%.UCL.between 0.1456 #==================================================================== # Page 9-3 of USEPA (2009) lists trichloroethene # concentrations (TCE; mg/L) collected from groundwater at two wells. # Here, the seven non-detects have been set to their detection limit. #---------- # First, compute summary statistics for all TCE observations. summaryStats(TCE.mg.per.L ~ 1, data = EPA.09.Table.9.1.TCE.df, digits = 3, data.name = "TCE") # N Mean SD Median Min Max NA's N.Total #TCE 27 0.09 0.064 0.1 0.004 0.25 3 30 summaryStats(TCE.mg.per.L ~ 1, data = EPA.09.Table.9.1.TCE.df, se = TRUE, quartiles = TRUE, digits = 3, data.name = "TCE") # N Mean SD SE Median Min Max 1st Qu. 3rd Qu. NA's N.Total #TCE 27 0.09 0.064 0.012 0.1 0.004 0.25 0.031 0.12 3 30 #---------- # Now compute summary statistics by well. summaryStats(TCE.mg.per.L ~ Well, data = EPA.09.Table.9.1.TCE.df, digits = 3) # N Mean SD Median Min Max NA's N.Total #Well.1 14 0.063 0.079 0.031 0.004 0.25 1 15 #Well.2 13 0.118 0.020 0.110 0.099 0.17 2 15 summaryStats(TCE.mg.per.L ~ Well, data = EPA.09.Table.9.1.TCE.df, digits = 3, stats.in.rows = TRUE) # Well.1 Well.2 #N 14 13 #Mean 0.063 0.118 #SD 0.079 0.02 #Median 0.031 0.11 #Min 0.004 0.099 #Max 0.25 0.17 #NA's 1 2 #N.Total 15 15 # If you want to keep trailing 0's, use the drop0trailing argument: summaryStats(TCE.mg.per.L ~ Well, data = EPA.09.Table.9.1.TCE.df, digits = 3, stats.in.rows = TRUE, drop0trailing = FALSE) # Well.1 Well.2 #N 14.000 13.000 #Mean 0.063 0.118 #SD 0.079 0.020 #Median 0.031 0.110 #Min 0.004 0.099 #Max 0.250 0.170 #NA's 1.000 2.000 #N.Total 15.000 15.000 #==================================================================== # Page 13-3 of USEPA (2009) lists iron concentrations (ppm) in # groundwater collected from 6 wells. #---------- # First, compute summary statistics for each well. summaryStats(Iron.ppm ~ Well, data = EPA.09.Ex.13.1.iron.df, combine.groups = FALSE, digits = 2, stats.in.rows = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 #N 4 4 4 4 4 4 #Mean 47.01 55.73 90.86 70.43 145.24 156.32 #SD 12.4 20.34 59.35 25.95 92.16 51.2 #Median 50.05 57.05 76.73 76.95 137.66 171.93 #Min 29.96 32.14 39.25 34.12 60.95 83.1 #Max 57.97 76.71 170.72 93.69 244.69 198.34 #---------- # Note the large differences in standard deviations between wells. # Compute summary statistics for log(Iron), by Well. 
summaryStats(log(Iron.ppm) ~ Well, data = EPA.09.Ex.13.1.iron.df, combine.groups = FALSE, digits = 2, stats.in.rows = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 #N 4 4 4 4 4 4 #Mean 3.82 3.97 4.35 4.19 4.8 5 #SD 0.3 0.4 0.66 0.45 0.7 0.4 #Median 3.91 4.02 4.29 4.34 4.8 5.14 #Min 3.4 3.47 3.67 3.53 4.11 4.42 #Max 4.06 4.34 5.14 4.54 5.5 5.29 #---------- # Include confidence intervals for the mean log(Fe) concentration # at each well, and also the p-value from the one-way # analysis of variance to test for a difference in well means. summaryStats(log(Iron.ppm) ~ Well, data = EPA.09.Ex.13.1.iron.df, digits = 1, ci = TRUE, p.value = TRUE, stats.in.rows = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 Combined #N 4 4 4 4 4 4 24 #Mean 3.8 4 4.3 4.2 4.8 5 4.4 #SD 0.3 0.4 0.7 0.5 0.7 0.4 0.6 #Median 3.9 4 4.3 4.3 4.8 5.1 4.3 #Min 3.4 3.5 3.7 3.5 4.1 4.4 3.4 #Max 4.1 4.3 5.1 4.5 5.5 5.3 5.5 #95%.LCL 3.3 3.3 3.3 3.5 3.7 4.4 4.1 #95%.UCL 4.3 4.6 5.4 4.9 5.9 5.6 4.6 #p.value.between 0.025 #==================================================================== # Using the built-in dataset HairEyeColor, summarize the frequencies # of hair color and test whether there is a difference in proportions. # NOTE: The data that was originally factor data has already been # collapsed into frequency counts by catetory in the object # HairEyeColor. In the examples in this section, we recreate # the factor objects in order to show how summaryStats works # for factor objects. Hair <- apply(HairEyeColor, 1, sum) Hair #Black Brown Red Blond # 108 286 71 127 Hair.color <- names(Hair) Hair.fac <- factor(rep(Hair.color, times = Hair), levels = Hair.color) #---------- # Compute summary statistics and perform the chi-square test # for equal proportions of hair color summaryStats(Hair.fac, digits = 1, p.value = TRUE) # N Pct ChiSq_p #Black 108 18.2 #Brown 286 48.3 #Red 71 12.0 #Blond 127 21.5 #Combined 592 100.0 2.5e-39 #---------- # Now test the hypothesis that 10% of the population from which # this sample was drawn has Red hair, and compute a 95% confidence # interval for the percent of subjects with red hair. Red.Hair.fac <- factor(Hair.fac == "Red", levels = c(TRUE, FALSE), labels = c("Red", "Not Red")) summaryStats(Red.Hair.fac, digits = 1, p.value = TRUE, ci = TRUE, test = "binom", test.arg.list = list(p = 0.1)) # N Pct Exact_p 95%.LCL 95%.UCL #Red 71 12 9.5 14.9 #Not Red 521 88 #Combined 592 100 0.11 #---------- # Now test whether the percent of people with Green eyes is the # same for people with and without Red hair. 
HairEye <- apply(HairEyeColor, 1:2, sum) Hair.color <- rownames(HairEye) Eye.color <- colnames(HairEye) n11 <- HairEye[Hair.color == "Red", Eye.color == "Green"] n12 <- sum(HairEye[Hair.color == "Red", Eye.color != "Green"]) n21 <- sum(HairEye[Hair.color != "Red", Eye.color == "Green"]) n22 <- sum(HairEye[Hair.color != "Red", Eye.color != "Green"]) Hair.fac <- factor(rep(c("Red", "Not Red"), c(n11+n12, n21+n22)), levels = c("Red", "Not Red")) Eye.fac <- factor(c(rep("Green", n11), rep("Not Green", n12), rep("Green", n21), rep("Not Green", n22)), levels = c("Green", "Not Green")) #---------- # Here are the results using the chi-square test and computing # confidence limits for the difference between the two percentages summaryStats(Eye.fac, group = Hair.fac, digits = 1, p.value = TRUE, ci = TRUE, test = "prop", stats.in.rows = TRUE, test.arg.list = list(correct = FALSE)) # Green Not Green Combined #Red(N) 14 57 71 #Red(Pct) 19.7 80.3 100 #Not Red(N) 50 471 521 #Not Red(Pct) 9.6 90.4 100 #ChiSq_p 0.01 #95%.LCL.between 0.5 #95%.UCL.between 19.7 #---------- # Here are the results using Fisher's exact test and computing # confidence limits for the odds ratio summaryStats(Eye.fac, group = Hair.fac, digits = 1, p.value = TRUE, ci = TRUE, test = "fisher", stats.in.rows = TRUE) # Green Not Green Combined #Red(N) 14 57 71 #Red(Pct) 19.7 80.3 100 #Not Red(N) 50 471 521 #Not Red(Pct) 9.6 90.4 100 #Fisher_p 0.015 #95%.LCL.OR 1.1 #95%.UCL.OR 4.6 rm(Hair, Hair.color, Hair.fac, Red.Hair.fac, HairEye, Eye.color, n11, n12, n21, n22, Eye.fac) #==================================================================== # The data set EPA.89b.cadmium.df contains information on # cadmium concentrations in groundwater collected from a # background and compliance well. Compare detection frequencies # between the well types and test for a difference using # Fisher's exact test. summaryStats(factor(Censored) ~ Well.type, data = EPA.89b.cadmium.df, digits = 1, p.value = TRUE, test = "fisher") summaryStats(factor(Censored) ~ Well.type, data = EPA.89b.cadmium.df, digits = 1, p.value = TRUE, test = "fisher", stats.in.rows = TRUE) # FALSE TRUE Combined #Background(N) 8 16 24 #Background(Pct) 33.3 66.7 100 #Compliance(N) 24 40 64 #Compliance(Pct) 37.5 62.5 100 #Fisher_p 0.81 #95%.LCL.OR 0.3 #95%.UCL.OR 2.5 #==================================================================== #-------------------- # Paired Observations #-------------------- # The data frame ACE.13.TCE.df contians paired observations of # trichloroethylene (TCE; mg/L) at 10 groundwater monitoring wells # before and after remediation. # # Compare TCE concentrations before and after remediation and # use a paired t-test to test for a difference between periods. summaryStats(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, p.value = TRUE, paired = TRUE) summaryStats(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, p.value = TRUE, paired = TRUE, stats.in.rows = TRUE) # Before After Combined #N 10 10 20 #Mean 21.624 3.6329 12.6284 #SD 13.5113 3.5544 13.3281 #Median 20.3 2.48 8.475 #Min 5.96 0.272 0.272 #Max 41.5 10.7 41.5 #Diff -17.9911 #paired.p.value.between 0.0027 #95%.LCL.between -27.9097 #95%.UCL.between -8.0725 #========== # Repeat the last example, but use a one-sided alternative since # remediation should decrease TCE concentration. 
summaryStats(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, p.value = TRUE, paired = TRUE, alternative = "less") summaryStats(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, p.value = TRUE, paired = TRUE, alternative = "less", stats.in.rows = TRUE) # Before After Combined #N 10 10 20 #Mean 21.624 3.6329 12.6284 #SD 13.5113 3.5544 13.3281 #Median 20.3 2.48 8.475 #Min 5.96 0.272 0.272 #Max 41.5 10.7 41.5 #Diff -17.9911 #paired.p.value.between.less 0.0013 #95%.LCL.between -Inf #95%.UCL.between -9.9537
Objects of S3 class "summaryStats"
are returned by the functions
summaryStats
and summaryFull
.
Objects of S3 class "summaryStats"
are matrices that contain
information about the summary statistics.
Required Attributes
The following attributes must be included in a legitimate matrix of
class "summaryStats"
.
stats.in.rows |
logical scalar indicating whether the statistics
are stored by row |
drop0trailing |
logical scalar indicating whether to drop trailing 0's when printing the summary statistics. |
Generic functions that have methods for objects of class
"summaryStats"
include: print
.
Steven P. Millard ([email protected])
# Create an object of class "summaryStats", then print it out. #------------------------------------------------------------- summaryStats.obj <- summaryStats(TCE.mg.per.L ~ Well, data = EPA.09.Table.9.1.TCE.df, digits = 3) is.matrix(summaryStats.obj) #[1] TRUE class(summaryStats.obj) #[1] "summaryStats" attributes(summaryStats.obj) #$dim #[1] 2 8 # #$dimnames #$dimnames[[1]] #[1] "Well.1" "Well.2" # #$dimnames[[2]] #[1] "N" "Mean" "SD" "Median" "Min" "Max" #[7] "NA's" "N.Total" # # #$class #[1] "summaryStats" # #$stats.in.rows #[1] FALSE # #$drop0trailing #[1] TRUE summaryStats.obj # N Mean SD Median Min Max NA's N.Total #Well.1 14 0.063 0.079 0.031 0.004 0.25 1 15 #Well.2 13 0.118 0.020 0.110 0.099 0.17 2 15 #---------- # Clean up #--------- rm(summaryStats.obj)
Construct a β-content or β-expectation tolerance
interval for a gamma distribution.
tolIntGamma(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mle", normal.approx.transform = "kulkarni.powar") tolIntGammaAlt(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mle", normal.approx.transform = "kulkarni.powar")
tolIntGamma(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mle", normal.approx.transform = "kulkarni.powar") tolIntGammaAlt(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mle", normal.approx.transform = "kulkarni.powar")
x |
numeric vector of non-negative observations. Missing ( |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
for the case of a two-sided tolerance interval, a character string specifying the
method for constructing the two-sided normal distribution tolerance interval using
the transformed data. This argument is ignored if |
est.method |
character string specifying the method of estimation for the shape and scale
distribution parameters. The possible values are
|
normal.approx.transform |
character string indicating which power transformation to use.
Possible values are |
The function tolIntGamma
returns a tolerance interval as well as
estimates of the shape and scale parameters.
The function tolIntGammaAlt
returns a tolerance interval as well as
estimates of the mean and coefficient of variation.
The tolerance interval is computed by 1) using a power transformation on the original
data to induce approximate normality, 2) using tolIntNorm
to compute
the tolerance interval, and then 3) back-transforming the interval to create a tolerance
interval on the original scale. (Krishnamoorthy et al., 2008).
The value normal.approx.transform="cube.root"
uses
the cube root transformation suggested by Wilson and Hilferty (1931) and used by
Krishnamoorthy et al. (2008) and Singh et al. (2010b), and the value
normal.approx.transform="fourth.root"
uses the fourth root transformation suggested
by Hawkins and Wixley (1986) and used by Singh et al. (2010b).
The default value normal.approx.transform="kulkarni.powar"
uses the "Optimum Power Normal Approximation Method" of Kulkarni and Powar (2010).
The "optimum" power p is determined by:

p = -0.0705 - 0.178 * shape + 0.475 * sqrt(shape)   if shape <= 1.5

p = 0.246   if shape > 1.5

where shape denotes the estimate of the shape parameter. Although Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to determine the power p, for the functions tolIntGamma and tolIntGammaAlt the power is based on whatever estimate of shape is used (e.g., est.method="mle", est.method="bcmle", etc.).
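The following sketch (illustrative only, not part of the original help file) spells out the three-step construction described above, using the cube-root transformation of Wilson and Hilferty (1931); the resulting upper limit should essentially match a direct call to tolIntGamma with normal.approx.transform="cube.root".

set.seed(47)
x <- rgamma(20, shape = 3, scale = 2)

# Step 1: power transformation to induce approximate normality
y <- x^(1/3)

# Step 2: normal-theory tolerance interval on the transformed scale
ti.y <- tolIntNorm(y, coverage = 0.95, ti.type = "upper", conf.level = 0.95)

# Step 3: back-transform the limit to the original scale
ti.y$interval$limits["UTL"]^3

# Direct call for comparison
tolIntGamma(x, coverage = 0.95, ti.type = "upper", conf.level = 0.95,
  normal.approx.transform = "cube.root")$interval$limits["UTL"]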
A list of class "estimate"
containing the estimated parameters,
the tolerance interval, and other information. See estimate.object
for details.
In addition to the usual components contained in an object of class
"estimate"
, the returned value also includes an additional
component within the "interval"
component:
normal.transform.power |
the value of the power used to transform the original data to approximate normality. |
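For example (a minimal sketch, assuming the component layout described above), the power actually used can be extracted from the "interval" component of the returned object:

dat <- rgamma(20, shape = 3, scale = 2)
ti <- tolIntGamma(dat)
ti$interval$normal.transform.power   # power used for the normal approximation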
It is possible for the lower tolerance limit based on the transformed data to be less than 0. In this case, the lower tolerance limit on the original scale is set to 0 and a warning is issued stating that the normal approximation is not accurate in this case.
The gamma distribution takes values on the positive real line. Special cases of the gamma are the exponential distribution and the chi-square distributions. Applications of the gamma include life testing, statistical ecology, queuing theory, inventory control, and precipitation processes. A gamma distribution starts to resemble a normal distribution as the shape parameter a tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Evans, M., N. Hastings, and B. Peacock. (1993). Statistical Distributions. Second Edition. John Wiley and Sons, New York, Chapter 18.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Krishnamoorthy K., T. Mathew, and S. Mukherjee. (2008). Normal-Based Methods for a Gamma Distribution: Prediction and Tolerance Intervals and Stress-Strength Reliability. Technometrics, 50(1), 69–78.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
GammaDist
, estimate.object
,
egamma
, tolIntNorm
,
predIntGamma
.
# Generate 20 observations from a gamma distribution with parameters # shape=3 and scale=2, then create a tolerance interval. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rgamma(20, shape = 3, scale = 2) tolIntGamma(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.203862 # scale = 2.174928 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Tolerance Interval Type: two-sided # #Confidence Level: 95% # #Number of Future Observations: 1 # #Tolerance Interval: LTL = 0.2340438 # UTL = 21.2996464 #-------------------------------------------------------------------- # Using the same data as in the previous example, create an upper # one-sided tolerance interval and use the bias-corrected estimate of # shape. tolIntGamma(dat, ti.type = "upper", est.method = "bcmle") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 1.906616 # scale = 2.514005 # #Estimation Method: bcmle # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on bcmle of 'shape' # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.00000 # UTL = 17.72107 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal # distribution. Here we will use the same chrysene data but assume a # gamma distribution. attach(EPA.09.Ex.17.3.chrysene.df) Chrysene <- Chrysene.ppb[Well.type == "Background"] #---------- # First perform a goodness-of-fit test for a gamma distribution gofTest(Chrysene, dist = "gamma") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 2.806929 # scale = 5.286026 # #Estimation Method: mle # #Data: Chrysene # #Sample Size: 8 # #Test Statistic: W = 0.9156306 # #Test Statistic Parameter: n = 8 # #P-value: 0.3954223 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. #---------- # Now compute the upper tolerance limit tolIntGamma(Chrysene, ti.type = "upper", coverage = 0.95, conf.level = 0.95) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.806929 # scale = 5.286026 # #Estimation Method: mle # #Data: Chrysene # #Sample Size: 8 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.00000 # UTL = 69.32425 #---------- # Compare this upper tolerance limit of 69 ppb to the upper tolerance limit # assuming a lognormal distribution. 
tolIntLnorm(Chrysene, ti.type = "upper", coverage = 0.95, conf.level = 0.95)$interval$limits["UTL"] # UTL #90.9247 #---------- # Clean up rm(Chrysene) detach("EPA.09.Ex.17.3.chrysene.df") #-------------------------------------------------------------------- # Reproduce some of the example on page 73 of # Krishnamoorthy et al. (2008), which uses alkalinity concentrations # reported in Gibbons (1994) and Gibbons et al. (2009) to construct # two-sided and one-sided upper tolerance limits for various values # of coverage using a 95% confidence level. tolIntGamma(Gibbons.et.al.09.Alkilinity.vec, ti.type = "upper", coverage = 0.9, normal.approx.transform = "cube.root") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 9.375013 # scale = 6.202461 # #Estimation Method: mle # #Data: Gibbons.et.al.09.Alkilinity.vec # #Sample Size: 27 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Wilson & Hilferty (1931) cube-root # transformation to Normality # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.00000 # UTL = 97.70502 tolIntGamma(Gibbons.et.al.09.Alkilinity.vec, coverage = 0.99, normal.approx.transform = "cube.root") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 9.375013 # scale = 6.202461 # #Estimation Method: mle # #Data: Gibbons.et.al.09.Alkilinity.vec # #Sample Size: 27 # #Tolerance Interval Coverage: 99% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Wilson & Hilferty (1931) cube-root # transformation to Normality # #Tolerance Interval Type: two-sided # #Confidence Level: 95% # #Tolerance Interval: LTL = 13.14318 # UTL = 148.43876
Estimate the mean and standard deviation on the log-scale for a
lognormal distribution, or estimate the mean
and coefficient of variation for a
lognormal distribution (alternative parameterization),
and construct a β-content or β-expectation tolerance
interval.
tolIntLnorm(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact") tolIntLnormAlt(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mvue")
tolIntLnorm(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact") tolIntLnormAlt(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mvue")
x |
For For If |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the
tolerance interval. The default value is |
method |
for the case of a two-sided tolerance interval, a character string specifying the
method for constructing the tolerance interval. This argument is ignored if
|
est.method |
for |
The function tolIntLnorm
returns a tolerance interval as well as
estimates of the meanlog and sdlog parameters.
The function tolIntLnormAlt
returns a tolerance interval as well as
estimates of the mean and coefficient of variation.
A tolerance interval for a lognormal distribution is constructed by taking the
natural logarithm of the observations and constructing a tolerance interval
based on the normal (Gaussian) distribution by calling tolIntNorm
.
These tolerance limits are then exponentiated to produce a tolerance interval on
the original scale of the data.
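A minimal sketch of this construction (illustrative only, not part of the original help file): the limits from tolIntNorm applied to the log-transformed data, once exponentiated, should match those returned by tolIntLnorm.

set.seed(23)
x <- rlnorm(20, meanlog = 0, sdlog = 1)

# Tolerance interval on the log scale, then exponentiate the limits
exp(tolIntNorm(log(x), coverage = 0.9)$interval$limits)

# The same limits from the direct call
tolIntLnorm(x, coverage = 0.9)$interval$limits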
If x
is a numeric vector, a list of class
"estimate"
containing the estimated parameters, a component called
interval
containing the tolerance interval information, and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, a list whose
class is the same as x
. The list contains the same
components as x
. If x
already has a component called
interval
, this component is replaced with the tolerance interval
information.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
tolIntNorm
, Lognormal
, LognormalAlt
,
estimate.object
, elnorm
, elnormAlt
,
eqlnorm
, predIntLnorm
,
Tolerance Intervals, Estimating Distribution Parameters,
Estimating Distribution Quantiles.
# Generate 20 observations from a lognormal distribution with parameters # meanlog=0 and sdlog=1. Use tolIntLnorm to estimate # the mean and standard deviation of the log of the true distribution, and # construct a two-sided 90% beta-content tolerance interval with associated # confidence level 95%. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnorm(20) tolIntLnorm(dat, coverage = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.06941976 # sdlog = 0.59011300 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: two-sided # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.237457 # UTL = 3.665369 # The exact two-sided interval that contains 90% of this distribution # is given by: [0.193, 5.18]. qlnorm(p = c(0.05, 0.95)) #[1] 0.1930408 5.1802516 # Clean up rm(dat) #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 with(EPA.09.Ex.17.3.chrysene.df, tolIntLnorm(Chrysene.ppb[Well.type == "Background"], ti.type = "upper", coverage = 0.95, conf.level = 0.95)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.5085773 # sdlog = 0.6279479 # #Estimation Method: mvue # #Data: Chrysene.ppb[Well.type == "Background"] # #Sample Size: 8 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.0000 # UTL = 90.9247 #---------- # Repeat the above example, but estimate the mean and # coefficient of variation on the original scale #----------------------------------------------- with(EPA.09.Ex.17.3.chrysene.df, tolIntLnormAlt(Chrysene.ppb[Well.type == "Background"], ti.type = "upper", coverage = 0.95, conf.level = 0.95)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): mean = 14.5547353 # cv = 0.6390825 # #Estimation Method: mvue # #Data: Chrysene.ppb[Well.type == "Background"] # #Sample Size: 8 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.0000 # UTL = 90.9247
Construct a β-content or β-expectation tolerance
interval for a lognormal distribution based on Type I or Type II
censored data.
tolIntLnormCensored(x, censored, censoring.side = "left", coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "mle", ti.method = "exact.for.complete", seed = NULL, nmc = 1000)
tolIntLnormCensored(x, censored, censoring.side = "left", coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "mle", ti.method = "exact.for.complete", seed = NULL, nmc = 1000)
x |
numeric vector of positive observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
character string indicating the method to use for parameter estimation on the log-scale. |
ti.method |
character string specifying the method for constructing the tolerance
interval. Possible values are: |
seed |
for the case when |
nmc |
for the case when |
A tolerance interval for a lognormal distribution is constructed by taking the natural logarithm of the observations and constructing a tolerance interval based on the normal (Gaussian) distribution by calling tolIntNormCensored. These tolerance limits are then exponentiated to produce a tolerance interval on the original scale of the data.
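The following minimal sketch (added here for illustration; it is not part of the original help file) checks this log/exponentiate relationship on simulated left-censored lognormal data. It assumes the EnvStats functions tolIntLnormCensored and tolIntNormCensored behave as described above; the censoring level of 2 is arbitrary.

library(EnvStats)

set.seed(123)
x <- rlnorm(25, meanlog = 1, sdlog = 0.8)   # positive observations
censored <- x < 2                           # Type I left-censoring at 2
x[censored] <- 2

# Upper tolerance limit computed directly on the original (lognormal) scale
ti.lnorm <- tolIntLnormCensored(x, censored, coverage = 0.9,
  ti.type = "upper")$interval$limits

# Same limit obtained by working on the log scale and exponentiating
ti.norm <- tolIntNormCensored(log(x), censored, coverage = 0.9,
  ti.type = "upper")$interval$limits

ti.lnorm
exp(ti.norm)   # should agree with ti.lnorm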
A list of class "estimateCensored" containing the estimated parameters, the tolerance interval, and other information. See estimateCensored.object for details.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
tolIntNormCensored, gpqTolIntNormSinglyCensored, eqnormCensored, enormCensored, estimateCensored.object.
# Generate 20 observations from a lognormal distribution with parameters # mean=10 and cv=1, censor the observations less than 5, # then create a one-sided upper tolerance interval with 90% # coverage and 95% confidence based on these Type I left, singly # censored data. # (Note: the call to set.seed allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 1) sort(dat) # [1] 2.608298 3.185459 4.196216 4.383764 4.569752 5.136130 # [7] 5.209538 5.916284 6.199076 6.214755 6.255779 6.778361 #[13] 7.074972 7.100494 8.930845 10.388766 11.402769 14.247062 #[19] 14.559506 15.437340 censored <- dat < 5 dat[censored] <- 5 tolIntLnormCensored(dat, censored, coverage = 0.9, ti.type="upper") #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 5 # #Estimated Parameter(s): meanlog = 1.8993686 # sdlog = 0.4804343 # #Estimation Method: MLE # #Data: dat # #Censoring Variable: censored # #Sample Size: 20 # #Percent Censored: 25% # #Assumed Sample Size: 20 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact for # Complete Data # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.00000 # UTL = 16.85556 ## Not run: # Note: The true 90'th percentile is 20.55231 #--------------------------------------------- qlnormAlt(0.9, mean = 10, cv = 1) #[1] 20.55231 # Compare the result using the method "gpq" tolIntLnormCensored(dat, censored, coverage = 0.9, ti.type="upper", ti.method = "gpq", seed = 432)$interval$limits # LTL UTL # 0.00000 17.85474 # Clean Up #--------- rm(dat, censored) #-------------------------------------------------------------- # Example 15-1 of USEPA (2009, p. 15-10) shows how to estimate # the mean and standard deviation using log-transformed multiply # left-censored manganese concentration data. Here we'll construct a # 95 EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored # 1 1 Well.1 <5 5.0 TRUE # 2 2 Well.1 12.1 12.1 FALSE # 3 3 Well.1 16.9 16.9 FALSE # ... # 23 3 Well.5 3.3 3.3 FALSE # 24 4 Well.5 8.4 8.4 FALSE # 25 5 Well.5 <2 2.0 TRUE with(EPA.09.Ex.15.1.manganese.df, tolIntLnormCensored(Manganese.ppb, Censored, coverage = 0.9, ti.type = "upper")) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: censored # #Sample Size: 25 # #Percent Censored: 24 # #Assumed Sample Size: 25 # #Tolerance Interval Coverage: 90 # #Coverage Type: content # #Tolerance Interval Method: Exact for # Complete Data # #Tolerance Interval Type: upper # #Confidence Level: 95 # #Tolerance Interval: LTL = 0.0000 # UTL = 110.9305 ## End(Not run)
Construct a β-content or β-expectation tolerance interval for a normal distribution.
tolIntNorm(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact")
x |
numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a normal (Gaussian) distribution
(i.e., |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
for the case of a two-sided tolerance interval, a character string specifying the method for
constructing the tolerance interval. This argument is ignored if |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
A tolerance interval for some population is an interval on the real line constructed so as to contain 100β% of the population (i.e., 100β% of all future observations), where 0 < β < 1. The quantity 100β% is called the coverage.

There are two kinds of tolerance intervals (Guttman, 1970):

A β-content tolerance interval with confidence level 100(1−α)% is constructed so that it contains at least 100β% of the population (i.e., the coverage is at least 100β%) with probability 100(1−α)%, where 0 < α < 1. The quantity 100(1−α)% is called the confidence level or confidence coefficient associated with the tolerance interval.

A β-expectation tolerance interval is constructed so that the average coverage of the interval is 100β%.

Note: A β-expectation tolerance interval with coverage 100β% is equivalent to a prediction interval for one future observation with associated confidence level 100β%. Note that there is no explicit confidence level associated with a β-expectation tolerance interval. If a β-expectation tolerance interval is treated as a β-content tolerance interval, the confidence level associated with this tolerance interval is usually around 50% (e.g., Guttman, 1970, Table 4.2, p.76).
For a normal distribution, the form of a two-sided tolerance interval is:

[x̄ − Ks, x̄ + Ks]

where x̄ denotes the sample mean, s denotes the sample standard deviation, and K denotes a constant that depends on the sample size n, the coverage, and, for a β-content tolerance interval (but not a β-expectation tolerance interval), the confidence level.

Similarly, the form of a one-sided lower tolerance interval is:

[x̄ − Ks, ∞)

and the form of a one-sided upper tolerance interval is:

(−∞, x̄ + Ks]

but the value of K differs for one-sided versus two-sided tolerance intervals. The derivation of the constant K is explained in the help file for tolIntNormK.
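As a quick illustration (added here; not part of the original help file), the two-sided interval returned by tolIntNorm can be reproduced by hand from the sample mean, the sample standard deviation, and the factor K returned by tolIntNormK:

library(EnvStats)

set.seed(42)
x <- rnorm(20, mean = 10, sd = 2)

xbar <- mean(x)
s    <- sd(x)
K    <- tolIntNormK(n = length(x), coverage = 0.95, conf.level = 0.95)

# Hand-computed limits: xbar +/- K * s
c(LTL = xbar - K * s, UTL = xbar + K * s)

# Should match the limits reported by tolIntNorm()
tolIntNorm(x, coverage = 0.95, conf.level = 0.95)$interval$limits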
If x is a numeric vector, tolIntNorm returns a list of class "estimate" containing the estimated parameters, a component called interval containing the tolerance interval information, and other information. See estimate.object for details.

If x is the result of calling an estimation function, tolIntNorm returns a list whose class is the same as x. The list contains the same components as x. If x already has a component called interval, this component is replaced with the tolerance interval information.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
tolIntNormK, tolIntLnorm, Normal, estimate.object, enorm, eqnorm, predIntNorm, Tolerance Intervals, Estimating Distribution Parameters, Estimating Distribution Quantiles.
# Generate 20 observations from a normal distribution with parameters # mean=10 and sd=2, then create a tolerance interval. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rnorm(20, mean = 10, sd = 2) tolIntNorm(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 9.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: two-sided # #Confidence Level: 95% # #Tolerance Interval: LTL = 6.603328 # UTL = 13.118993 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. # Here we will first take the log of the data and # then construct the tolerance interval; note however that it is # easier to call the function tolIntLnorm instead using the # original data. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 tol.int.list <- with(EPA.09.Ex.17.3.chrysene.df, tolIntNorm(log(Chrysene.ppb[Well.type == "Background"]), ti.type = "upper", coverage = 0.95, conf.level = 0.95)) tol.int.list #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.5085773 # sd = 0.6279479 # #Estimation Method: mvue # #Data: log(Chrysene.ppb[Well.type == "Background"]) # #Sample Size: 8 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = -Inf # UTL = 4.510032 # Compute the upper tolerance interaval on the original scale # by exponentiating the upper tolerance limit: exp(tol.int.list$interval$limits["UTL"]) # UTL #90.9247 #---------- # Clean up rm(tol.int.list)
Construct a β-content or β-expectation tolerance interval for a normal distribution based on Type I or Type II censored data.
tolIntNormCensored(x, censored, censoring.side = "left", coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "mle", ti.method = "exact.for.complete", seed = NULL, nmc = 1000)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
character string indicating the method to use for parameter estimation. |
ti.method |
character string specifying the method for constructing the tolerance
interval. Possible values are: |
seed |
for the case when |
nmc |
for the case when |
See the help file for tolIntNorm for an explanation of tolerance intervals. When ti.method="gpq", the tolerance interval is constructed using the method of Generalized Pivotal Quantities as explained in Krishnamoorthy and Mathew (2009, p. 327). When ti.method="exact.for.complete" or ti.method="wald.wolfowitz.for.complete", the tolerance interval is constructed by first computing the maximum likelihood estimates of the mean and standard deviation by calling enormCensored, then passing these values to the function tolIntNorm to produce the tolerance interval as if the estimates were based on complete rather than censored data. These last two methods are purely ad-hoc and their properties need to be studied.
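The sketch below (added for illustration; it is not part of the original help file) mimics the "exact.for.complete" construction just described: estimate the mean and standard deviation from the censored data with enormCensored, then apply the complete-data factor from tolIntNormK as if the data were uncensored. The simulated data match the first example further down.

library(EnvStats)

set.seed(250)
x <- rnorm(20, mean = 10, sd = 3)
censored <- x < 9            # Type I left-censoring at 9
x[censored] <- 9

# Censored-data MLEs of the mean and standard deviation
est <- enormCensored(x, censored, method = "mle")$parameters
K   <- tolIntNormK(n = length(x), coverage = 0.9, ti.type = "upper",
  conf.level = 0.95)

# Hand-rolled upper tolerance limit based on the censored-data MLEs
unname(est["mean"] + K * est["sd"])

# Should agree with the UTL reported by tolIntNormCensored() under the
# default ti.method = "exact.for.complete"
tolIntNormCensored(x, censored, coverage = 0.9,
  ti.type = "upper")$interval$limits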
A list of class "estimateCensored" containing the estimated parameters, the tolerance interval, and other information. See estimateCensored.object for details.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
gpqTolIntNormSinglyCensored, eqnormCensored, enormCensored, estimateCensored.object.
# Generate 20 observations from a normal distribution with parameters # mean=10 and sd=3, censor the observations less than 9, # then create a one-sided upper tolerance interval with 90% # coverage and 95% confidence based on these Type I left, singly # censored data. # (Note: the call to set.seed allows you to reproduce this example. set.seed(250) dat <- sort(rnorm(20, mean = 10, sd = 3)) dat # [1] 6.406313 7.126621 8.119660 8.277216 8.426941 8.847961 # [7] 8.899098 9.357509 9.525756 9.534858 9.558567 9.847663 #[13] 10.001989 10.014964 10.841384 11.386264 11.721850 12.524300 #[19] 12.602469 12.813429 censored <- dat < 9 dat[censored] <- 9 tolIntNormCensored(dat, censored, coverage = 0.9, ti.type="upper") #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 9 # #Estimated Parameter(s): mean = 9.700962 # sd = 1.845067 # #Estimation Method: MLE # #Data: dat # #Censoring Variable: censored # #Sample Size: 20 # #Percent Censored: 35% # #Assumed Sample Size: 20 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact for # Complete Data # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = -Inf # UTL = 13.25454 ## Not run: # Note: The true 90'th percentile is 13.84465 #--------------------------------------------- qnorm(0.9, mean = 10, sd = 3) # [1] 13.84465 # Compare the result using the method "gpq" tolIntNormCensored(dat, censored, coverage = 0.9, ti.type="upper", ti.method = "gpq", seed = 432)$interval$limits # LTL UTL # -Inf 13.56826 # Clean Up #--------- rm(dat, censored) #========== # Example 15-1 of USEPA (2009, p. 15-10) shows how to estimate # the mean and standard deviation using log-transformed multiply # left-censored manganese concentration data. Here we'll construct a # 95 EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored # 1 1 Well.1 <5 5.0 TRUE # 2 2 Well.1 12.1 12.1 FALSE # 3 3 Well.1 16.9 16.9 FALSE # ... # 23 3 Well.5 3.3 3.3 FALSE # 24 4 Well.5 8.4 8.4 FALSE # 25 5 Well.5 <2 2.0 TRUE with(EPA.09.Ex.15.1.manganese.df, tolIntNormCensored(log(Manganese.ppb), Censored, coverage = 0.9, ti.type = "upper")) # Results of Distribution Parameter Estimation # Based on Type I Censored Data # -------------------------------------------- # # Assumed Distribution: Normal # # Censoring Side: left # # Censoring Level(s): 0.6931472 1.6094379 # # Estimated Parameter(s): mean = 2.215905 # sd = 1.356291 # # Estimation Method: MLE # # Data: log(Manganese.ppb) # # Censoring Variable: censored # # Sample Size: 25 # # Percent Censored: 24 # # Assumed Sample Size: 25 # # Tolerance Interval Coverage: 90 # # Coverage Type: content # # Tolerance Interval Method: Exact for # Complete Data # # Tolerance Interval Type: upper # # Confidence Level: 95 # # Tolerance Interval: LTL = -Inf # UTL = 4.708904 ## End(Not run)
Compute the half-width of a tolerance interval for a normal distribution.
tolIntNormHalfWidth(n, sigma.hat = 1, coverage = 0.95, cov.type = "content", conf.level = 0.95, method = "wald.wolfowitz")
n |
numeric vector of positive integers greater than 1 indicating the sample size upon
which the prediction interval is based.
Missing ( |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s).
The default value is |
coverage |
numeric vector of values between 0 and 1 indicating the desired coverage of the
tolerance interval. The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval. The
possible values are |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
method |
character string specifying the method for constructing the tolerance interval.
The possible values are |
If the arguments n, sigma.hat, coverage, and conf.level are not all the same length, they are replicated to be the same length as the length of the longest argument.
The help files for tolIntNorm and tolIntNormK give formulas for a two-sided tolerance interval based on the sample size, the observed sample mean and sample standard deviation, and specified confidence level and coverage. Specifically, the two-sided tolerance interval is given by:

[x̄ − Ks, x̄ + Ks]

where x̄ denotes the sample mean:

x̄ = (1/n) Σᵢ xᵢ

s denotes the sample standard deviation:

s² = [1/(n−1)] Σᵢ (xᵢ − x̄)²

and K denotes a constant that depends on the sample size n, the confidence level, and the coverage (see the help file for tolIntNormK). Thus, the half-width of the tolerance interval is given by:

HW = Ks
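A minimal check of this relationship (added here; not part of the original help file), assuming the default "wald.wolfowitz" method used by tolIntNormHalfWidth:

library(EnvStats)

n         <- 10
sigma.hat <- 2

# Half-width as reported by tolIntNormHalfWidth()
tolIntNormHalfWidth(n = n, sigma.hat = sigma.hat)

# The same quantity computed directly as K * sigma.hat, using the same
# approximation for K
tolIntNormK(n = n, method = "wald.wolfowitz") * sigma.hat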
numeric vector of half-widths.
See the help file for tolIntNorm.
In the course of designing a sampling program, an environmental scientist may wish to determine the relationship between sample size, confidence level, and half-width if one of the objectives of the sampling program is to produce tolerance intervals. The functions tolIntNormHalfWidth, tolIntNormN, and plotTolIntNormDesign can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
See the help file for tolIntNorm.
tolIntNorm, tolIntNormK, tolIntNormN, plotTolIntNormDesign, Normal.
# Look at how the half-width of a tolerance interval increases with # increasing coverage: seq(0.5, 0.9, by=0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(tolIntNormHalfWidth(n = 10, coverage = seq(0.5, 0.9, by = 0.1)), 2) #[1] 1.17 1.45 1.79 2.21 2.84 #---------- # Look at how the half-width of a tolerance interval decreases with # increasing sample size: 2:5 #[1] 2 3 4 5 round(tolIntNormHalfWidth(n = 2:5), 2) #[1] 37.67 9.92 6.37 5.08 #---------- # Look at how the half-width of a tolerance interval increases with # increasing estimated standard deviation for a fixed sample size: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 round(tolIntNormHalfWidth(n = 10, sigma.hat = seq(0.5, 2, by = 0.5)), 2) #[1] 1.69 3.38 5.07 6.76 #---------- # Look at how the half-width of a tolerance interval increases with # increasing confidence level for a fixed sample size: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(tolIntNormHalfWidth(n = 5, conf = seq(0.5, 0.9, by = 0.1)), 2) #[1] 2.34 2.58 2.89 3.33 4.15 #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. # Here we will first take the log of the data and then estimate the # standard deviation based on the two background wells. We will use this # estimate of standard deviation to compute the half-widths of # future tolerance intervals on the log-scale for various sample sizes. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 summary.stats <- summaryStats(log(Chrysene.ppb) ~ Well.type, data = EPA.09.Ex.17.3.chrysene.df) summary.stats # N Mean SD Median Min Max #Background 8 2.5086 0.6279 2.4359 1.7405 3.6687 #Compliance 12 3.4173 0.4361 3.4111 2.7081 4.2195 sigma.hat <- summary.stats["Background", "SD"] sigma.hat #[1] 0.6279 tolIntNormHalfWidth(n = c(4, 8, 16), sigma.hat = sigma.hat) #[1] 3.999681 2.343160 1.822759 #========== # Clean up #--------- rm(summary.stats, sigma.hat)
Compute the Value of K for a Tolerance Interval for a Normal Distribution
Compute the value of K (the multiplier of the estimated standard deviation) used to construct a tolerance interval based on data from a normal distribution.
tolIntNormK(n, df = n - 1, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", rel.tol = 1e-07, abs.tol = rel.tol)
n |
a positive integer greater than 2 indicating the sample size upon which the tolerance interval is based. |
df |
the degrees of freedom associated with the tolerance interval. The default is
|
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
for the case of a two-sided tolerance interval, a character string specifying the method for
constructing the tolerance interval. This argument is ignored if |
rel.tol |
in the case when |
abs.tol |
in the case when |
A tolerance interval for some population is an interval on the real line constructed so as to contain 100β% of the population (i.e., 100β% of all future observations), where 0 < β < 1. The quantity 100β% is called the coverage.

There are two kinds of tolerance intervals (Guttman, 1970):

A β-content tolerance interval with confidence level 100(1−α)% is constructed so that it contains at least 100β% of the population (i.e., the coverage is at least 100β%) with probability 100(1−α)%, where 0 < α < 1. The quantity 100(1−α)% is called the confidence level or confidence coefficient associated with the tolerance interval.

A β-expectation tolerance interval is constructed so that the average coverage of the interval is 100β%.

Note: A β-expectation tolerance interval with coverage 100β% is equivalent to a prediction interval for one future observation with associated confidence level 100β%. Note that there is no explicit confidence level associated with a β-expectation tolerance interval. If a β-expectation tolerance interval is treated as a β-content tolerance interval, the confidence level associated with this tolerance interval is usually around 50% (e.g., Guttman, 1970, Table 4.2, p.76).
For a normal distribution, the form of a two-sided tolerance interval is:

[x̄ − Ks, x̄ + Ks]

where x̄ denotes the sample mean, s denotes the sample standard deviation, and K denotes a constant that depends on the sample size n, the coverage, and, for a β-content tolerance interval (but not a β-expectation tolerance interval), the confidence level.

Similarly, the form of a one-sided lower tolerance interval is:

[x̄ − Ks, ∞)

and the form of a one-sided upper tolerance interval is:

(−∞, x̄ + Ks]

but the value of K differs for one-sided versus two-sided tolerance intervals.
The Derivation of K for a β-Content Tolerance Interval

One-Sided Case

When ti.type="upper" or ti.type="lower", the constant K for a 100β% β-content tolerance interval with associated confidence level 100(1−α)% is given by:

K = t(n−1, 1−α, z_β √n) / √n

where t(ν, p, δ) denotes the p'th quantile of a non-central t-distribution with ν degrees of freedom and noncentrality parameter δ (see the help file for TDist), and z_p denotes the p'th quantile of a standard normal distribution.
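The one-sided formula is easy to check directly with R's non-central t quantile function. The sketch below (added here; not part of the original help file) uses the same inputs as the tolIntNormK() example further down:

library(EnvStats)

n        <- 20
coverage <- 0.99   # beta
conf     <- 0.90   # 1 - alpha

# K = t(n-1, 1-alpha, z_beta * sqrt(n)) / sqrt(n)
K.by.hand <- qt(conf, df = n - 1, ncp = qnorm(coverage) * sqrt(n)) / sqrt(n)
K.by.hand
# tolIntNormK() reports 3.051543 for these inputs in the example below; the
# hand computation should reproduce it (up to qt()'s non-central t accuracy)

tolIntNormK(n = n, ti.type = "upper", coverage = coverage, conf.level = conf)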
Two-Sided Case

When ti.type="two-sided" and method="exact", the exact formula for the constant K for a 100β% β-content tolerance interval with associated confidence level 100(1−α)% requires numerical integration and has been derived by several different authors, including Odeh (1978), Eberhardt et al. (1989), Jilek (1988), Fujino (1989), and Janiga and Miklos (2001). Specifically, for given values of the sample size n, degrees of freedom ν, confidence level (1−α), and coverage β, the constant K is the solution to the equation:

√[n/(2π)] ∫_{−∞}^{∞} Q(ν, νR²/K²) exp(−nx²/2) dx = 1 − α

where Q(ν, c) denotes the upper-tail area from c to ∞ of the chi-squared distribution with ν degrees of freedom, and R (a function of x) is the solution to the equation:

Φ(x + R) − Φ(x − R) = β

where Φ() denotes the standard normal cumulative distribution function.
When ti.type="two-sided"
and method="wald.wolfowitz"
, the approximate formula
due to Wald and Wolfowitz (1946) for the constant for a
-content tolerance interval with associated confidence level
is given by:
where is the solution to the equation:
denotes the standard normal cumulative distribuiton function, and
is
given by:
where denotes the
'th quantile of the chi-squared
distribution with
degrees of freedom.
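The Wald-Wolfowitz approximation is also easy to compute by hand. The sketch below (added here; not part of the original help file) solves for r numerically with uniroot and compares the result to tolIntNormK():

library(EnvStats)

n     <- 20
nu    <- n - 1
beta  <- 0.95   # coverage
alpha <- 0.05   # 1 - confidence level

# r solves Phi(1/sqrt(n) + r) - Phi(1/sqrt(n) - r) = beta
r <- uniroot(function(r) pnorm(1/sqrt(n) + r) - pnorm(1/sqrt(n) - r) - beta,
  interval = c(0, 10))$root

# u = sqrt(nu / chi-squared quantile at alpha with nu degrees of freedom)
u <- sqrt(nu / qchisq(alpha, df = nu))

r * u
# Should be close to 2.751789, the value shown for
# tolIntNormK(n = 20, method = "wald") in the example below

tolIntNormK(n = n, method = "wald.wolfowitz")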
The Derivation of K for a β-Expectation Tolerance Interval

As stated above, a β-expectation tolerance interval with coverage 100β% is equivalent to a prediction interval for one future observation with associated confidence level 100β%. This is because the probability that any single future observation will fall into this interval is 100β%, so the distribution of the number of N future observations that will fall into this interval is binomial with parameters size = N and prob = β (see the help file for Binomial). Hence the expected proportion of future observations that fall into this interval is 100β% and is independent of the value of N. See the help file for predIntNormK for information on how to derive K for these intervals.
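A minimal sketch of this equivalence (added here; not part of the original help file), assuming tolIntNormK accepts cov.type="expectation" and that predIntNormK with k = 1 gives the exact one-future-observation factor:

library(EnvStats)

n    <- 20
beta <- 0.90

# Factor for a two-sided beta-expectation tolerance interval
tolIntNormK(n = n, cov.type = "expectation", coverage = beta)

# Factor for a two-sided prediction interval for one future observation;
# the two values should agree
predIntNormK(n = n, k = 1, conf.level = beta)

# Both should also match the closed-form expression
qt(1 - (1 - beta)/2, df = n - 1) * sqrt(1 + 1/n)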
The value of K, a numeric scalar used to construct tolerance intervals for a normal (Gaussian) distribution.
Tabled values of K are given in Gibbons et al. (2009), Gilbert (1987), Guttman (1970), Krishnamoorthy and Mathew (2009), Owen (1962), Odeh and Owen (1980), and USEPA (2009).
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Eberhardt, K.R., R.W. Mee, and C.P. Reeve. (1989). Computing Factors for Exact Two-Sided Tolerance Limits for a Normal Distribution. Communications in Statistics, Part B-Simulation and Computation 18, 397-413.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Fujino, T. (1989). Exact Two-Sided Tolerance Limits for a Normal Distribution. Japanese Journal of Applied Statistics 18, 29-36.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Jilek, M. (1988). Statisticke Tolerancni Meze. SNTL, Praha.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Janiga, I., and R. Miklos. (2001). Statistical Tolerance Intervals for a Normal Distribution. Measurement Science Review 11, 29-32.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E. (1978). Tables of Two-Sided Tolerance Factors for a Normal Distribution. Communications in Statistics, Part B-Simulation and Computation 7, 183-201.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
tolIntNorm, predIntNorm, Normal, estimate.object, enorm, eqnorm, Tolerance Intervals, Prediction Intervals, Estimating Distribution Parameters, Estimating Distribution Quantiles.
# Compute the value of K for a two-sided 95% beta-content # tolerance interval with associated confidence level 95% # given a sample size of n=20. #---------- # Exact method tolIntNormK(n = 20) #[1] 2.760346 #---------- # Approximate method due to Wald and Wolfowitz (1946) tolIntNormK(n = 20, method = "wald") # [1] 2.751789 #-------------------------------------------------------------------- # Compute the value of K for a one-sided upper tolerance limit # with 99% coverage and associated confidence level 90% # given a samle size of n=20. tolIntNormK(n = 20, ti.type = "upper", coverage = 0.99, conf.level = 0.9) #[1] 3.051543 #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal # distribution. The sample size is n = 8 observations from # the two compliance wells. Here we will compute the # multiplier for the log-transformed data. tolIntNormK(n = 8, ti.type = "upper") #[1] 3.187294
Compute the sample size necessary to achieve a specified half-width of a tolerance interval for a normal distribution, given the estimated standard deviation, coverage, and confidence level.
tolIntNormN(half.width, sigma.hat = 1, coverage = 0.95, cov.type = "content", conf.level = 0.95, method = "wald.wolfowitz", round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
half.width |
numeric vector of (positive) half-widths.
Missing ( |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s).
The default value is |
coverage |
numeric vector of values between 0 and 1 indicating the desired coverage of the
tolerance interval. The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval. The
possible values are |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
method |
character string specifying the method for constructing the tolerance interval.
The possible values are |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next smallest integer. The default value is |
n.max |
positive integer greater than 1 specifying the maximum possible sample size.
The default value is |
tol |
numeric scalar indicating the tolerance to use in the |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
If the arguments half.width, sigma.hat, coverage, and conf.level are not all the same length, they are replicated to be the same length as the length of the longest argument.
The help files for tolIntNorm and tolIntNormK give formulas for a two-sided tolerance interval based on the sample size, the observed sample mean and sample standard deviation, and specified confidence level and coverage. Specifically, the two-sided tolerance interval is given by:

[x̄ − Ks, x̄ + Ks]

where x̄ denotes the sample mean:

x̄ = (1/n) Σᵢ xᵢ

s denotes the sample standard deviation:

s² = [1/(n−1)] Σᵢ (xᵢ − x̄)²

and K denotes a constant that depends on the sample size n, the confidence level, and the coverage (see the help file for tolIntNormK). Thus, the half-width of the tolerance interval is given by:

HW = Ks

The function tolIntNormN uses the uniroot search algorithm to determine the sample size for specified values of the half-width, sample standard deviation, coverage, and confidence level. Note that unlike a confidence interval, the half-width of a tolerance interval does not approach 0 as the sample size increases.
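The sketch below (added here; not part of the original help file) illustrates that last remark: as n grows, the half-width K * sigma.hat does not shrink to 0 but levels off at roughly the normal quantile z[(1 + coverage)/2] times sigma.hat.

library(EnvStats)

sigma.hat <- 1

# Half-widths for increasing sample sizes (95% coverage, 95% confidence)
round(tolIntNormHalfWidth(n = c(10, 100, 1000, 10000),
  sigma.hat = sigma.hat), 3)

# Approximate limiting value for 95% coverage
qnorm(0.975) * sigma.hat
#[1] 1.959964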
numeric vector of sample sizes.
See the help file for tolIntNorm.
In the course of designing a sampling program, an environmental scientist may wish to determine the relationship between sample size, confidence level, and half-width if one of the objectives of the sampling program is to produce tolerance intervals. The functions tolIntNormHalfWidth, tolIntNormN, and plotTolIntNormDesign can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
See the help file for tolIntNorm
.
tolIntNorm
, tolIntNormK
,
tolIntNormHalfWidth
, plotTolIntNormDesign
,
Normal
.
# Look at how the required sample size for a tolerance interval increases # with increasing coverage: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 tolIntNormN(half.width = 3, coverage = seq(0.5, 0.9, by = 0.1)) #[1] 4 4 5 6 9 #---------- # Look at how the required sample size for a tolerance interval decreases # with increasing half-width: 3:6 #[1] 3 4 5 6 tolIntNormN(half.width = 3:6) #[1] 15 8 6 5 tolIntNormN(3:6, round = FALSE) #[1] 14.199735 7.022572 5.092374 4.214371 #---------- # Look at how the required sample size for a tolerance interval increases # with increasing estimated standard deviation for a fixed half-width: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 tolIntNormN(half.width = 4, sigma.hat = seq(0.5, 2, by = 0.5)) #[1] 4 8 24 3437 #---------- # Look at how the required sample size for a tolerance interval increases # with increasing confidence level for a fixed half-width: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 tolIntNormN(half.width = 3, conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 3 4 5 7 11 #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. # Here we will first take the log of the data and then estimate the # standard deviation based on the two background wells. We will use this # estimate of standard deviation to compute required sample sizes for # various half-widths on the log-scale. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 summary.stats <- summaryStats(log(Chrysene.ppb) ~ Well.type, data = EPA.09.Ex.17.3.chrysene.df) summary.stats # N Mean SD Median Min Max #Background 8 2.5086 0.6279 2.4359 1.7405 3.6687 #Compliance 12 3.4173 0.4361 3.4111 2.7081 4.2195 sigma.hat <- summary.stats["Background", "SD"] sigma.hat #[1] 0.6279 tolIntNormN(half.width = c(4, 2, 1), sigma.hat = sigma.hat) #[1] 4 12 NA #Warning message: #In tolIntNormN(half.width = c(4, 2, 1), sigma.hat = sigma.hat) : # Value of 'half.width' is too smallfor element3. # Try increasing the value of 'n.max'. # NOTE: We cannot achieve a half-width of 1 for the given value of # sigma.hat for a tolerance interval with 95% coverage and # 95% confidence. The default value of n.max is 5000, but in fact, # even with a million observations the half width is greater than 1. tolIntNormHalfWidth(n = 1e6, sigma.hat = sigma.hat) #[1] 1.232095 #========== # Clean up #--------- rm(summary.stats, sigma.hat)
Construct a β-content or β-expectation tolerance interval
nonparametrically without making any assumptions about the form of the
distribution except that it is continuous.
tolIntNpar(x, coverage, conf.level, cov.type = "content", ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), lb = -Inf, ub = Inf, ti.type = "two-sided")
x |
numeric vector of observations. Missing ( |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ltl.rank |
positive integer indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
positive integer related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
lb , ub
|
scalars indicating lower and upper bounds on the distribution. By default, |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
A tolerance interval for some population is an interval on the real line constructed so as to
contain 100β% of the population (i.e., 100β% of all
future observations), where 0 < β < 1. The quantity 100β% is called
the coverage.

There are two kinds of tolerance intervals (Guttman, 1970):

A β-content tolerance interval with confidence level 100(1-α)% is
constructed so that it contains at least 100β% of the population (i.e., the
coverage is at least 100β%) with probability 100(1-α)%, where
0 < α < 1. The quantity 100(1-α)% is called the confidence level or
confidence coefficient associated with the tolerance interval.

A β-expectation tolerance interval is constructed so that the average coverage of
the interval is 100β%.

Note: A β-expectation tolerance interval with coverage 100β% is
equivalent to a prediction interval for one future observation with associated confidence level
100β%. Note that there is no explicit confidence level associated with a
β-expectation tolerance interval. If a β-expectation tolerance interval is
treated as a β-content tolerance interval, the confidence level associated with this
tolerance interval is usually around 50% (e.g., Guttman, 1970, Table 4.2, p.76).
The Form of a Nonparametric Tolerance Interval
Let x_1, x_2, ..., x_n denote a random sample of n independent observations
from some continuous distribution and let x_(i) denote the i'th order
statistic in x_1, x_2, ..., x_n. A two-sided nonparametric tolerance interval is
constructed as:

[x_(u), x_(v)]    (1)

where u and v are positive integers between 1 and n, and u < v. That is, u
denotes the rank of the lower tolerance limit, and v
denotes the rank of the upper tolerance limit. To make it easier to write
some equations later on, we can also write the tolerance interval (1) in a slightly
different way as:

[x_(u), x_(n+1-w)]    (2)

where

w = n + 1 - v

so that w is a positive integer between 1 and n - u, and u < n + 1 - w.

In terms of the arguments to the function tolIntNpar, the argument
ltl.rank corresponds to u, and the argument
n.plus.one.minus.utl.rank corresponds to w.

If we allow u = 0 and w = 0 and define lower and upper bounds as:

x_(0) = lb,    x_(n+1) = ub

then equation (2) above can also represent a one-sided lower or one-sided upper tolerance interval as well. That is, a one-sided lower nonparametric tolerance interval is constructed as:

[x_(u), ub]

and a one-sided upper nonparametric tolerance interval is constructed as:

[lb, x_(n+1-w)]

Usually, u = 1 or u = 2 and w = 1 or w = 2.
Let C be a random variable denoting the coverage of the above nonparametric
tolerance intervals. Wilks (1941) showed that the distribution of C follows a
beta distribution with parameters shape1 = v - u and shape2 = w + u
when the unknown distribution is continuous.
Computations for a β-Content Tolerance Interval

For a β-content tolerance interval, if the coverage C = β is specified,
then the associated confidence level 100(1-α)% is computed as:

1 - α = 1 - F(β; v-u, w+u)

where F(y; δ, γ) denotes the cumulative distribution function of a
beta random variable with parameters shape1=δ and shape2=γ
evaluated at y.

Similarly, if the confidence level associated with the tolerance interval is specified as
100(1-α)%, then the coverage C = β is computed as:

β = B(α; v-u, w+u)

where B(p; δ, γ) denotes the p'th quantile of a
beta distribution with parameters shape1=δ and shape2=γ.
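As a hand check of these two formulas (purely illustrative; the functions tolIntNparConfLevel and tolIntNparCoverage documented elsewhere in this manual do this for you), consider an upper tolerance limit set at the sample maximum, so u = 0, w = 1, and hence v = n. With n = 24 the beta-distribution results can be evaluated directly with pbeta() and qbeta():

# Upper limit at the maximum: shape1 = v - u = n, shape2 = w + u = 1
n <- 24
# Confidence level achieved for a specified coverage of 95%:
1 - pbeta(0.95, shape1 = n, shape2 = 1)
# about 0.708, matching tolIntNparConfLevel(n = 24, coverage = 0.95, ti.type = "upper")
# Coverage achieved for a specified confidence level of 95%:
qbeta(0.05, shape1 = n, shape2 = 1)
# about 0.8827, matching tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper")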
Computations for a β-Expectation Tolerance Interval

For a β-expectation tolerance interval, the expected coverage is simply
the mean of a beta random variable with parameters
shape1 = v - u and shape2 = w + u, which is given by:

E(C) = (v - u) / (n + 1)

As stated above, a β-expectation tolerance interval with coverage
100β% is equivalent to a prediction interval for one future observation
with associated confidence level 100β%. This is because the probability
that any single future observation will fall into this interval is 100β%,
so the distribution of the number of N future observations that will fall into
this interval is binomial with parameters size=N and prob=β.
Hence the expected proportion of future observations
that fall into this interval is 100β% and is independent of the value of N.
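Continuing the same illustrative hand check, the expected coverage for an upper limit at the maximum of n = 24 observations (u = 0, w = 1, v = n) is:

n <- 24
(n - 0) / (n + 1)
#[1] 0.96
# This matches the 96% coverage reported for the beta-expectation interval in the
# copper example of the Examples section below.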
See the help file for
predIntNpar
for more information on constructing
a nonparametric prediction interval.
A list of class "estimate"
containing the estimated parameters,
the tolerance interval, and other information. See estimate.object
for details.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.
Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT, Chapter 2.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Wilks, S.S. (1941). Determination of Sample Sizes for Setting Tolerance Limits. Annals of Mathematical Statistics 12, 91–96.
eqnpar
, estimate.object
,
tolIntNparN
, Tolerance Intervals,
Estimating Distribution Parameters, Estimating Distribution Quantiles.
# Generate 20 observations from a lognormal mixture distribution # with parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1. # The exact two-sided interval that contains 90% of this distribution is given by: # [0.682312, 13.32052]. Use tolIntNpar to construct a two-sided 90% # \eqn{\beta}-content tolerance interval. Note that the associated confidence level # is only 61%. A larger sample size is required to obtain a larger confidence # level (see the help file for tolIntNparN). # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(23) dat <- rlnormMixAlt(20, 1, 0.5, 5, 1, 0.1) tolIntNpar(dat, coverage = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: two-sided # #Confidence Level: 60.8253% # #Tolerance Limit Rank(s): 1 20 # #Tolerance Interval: LTL = 0.5035035 # UTL = 9.9504662 #---------- # Clean up rm(dat) #---------- # Reproduce Example 17-4 on page 17-21 of USEPA (2009). This example uses # copper concentrations (ppb) from 3 background wells to set an upper # limit for 2 compliance wells. The maximum value from the 3 wells is set # to the 95% confidence upper tolerance limit, and we need to determine the # coverage of this tolerance interval. The data are stored in EPA.92c.copper2.df. # Note that even though these data are Type I left singly censored, it is still # possible to compute an upper tolerance interval using any of the uncensored # observations as the upper limit. EPA.92c.copper2.df # Copper.orig Copper Censored Month Well Well.type #1 <5 5.0 TRUE 1 1 Background #2 <5 5.0 TRUE 2 1 Background #3 7.5 7.5 FALSE 3 1 Background #... #9 9.2 9.2 FALSE 1 2 Background #10 <5 5.0 TRUE 2 2 Background #11 <5 5.0 TRUE 3 2 Background #... #17 <5 5.0 TRUE 1 3 Background #18 5.4 5.4 FALSE 2 3 Background #19 6.7 6.7 FALSE 3 3 Background #... #29 6.2 6.2 FALSE 5 4 Compliance #30 <5 5.0 TRUE 6 4 Compliance #31 7.8 7.8 FALSE 7 4 Compliance #... #38 <5 5.0 TRUE 6 5 Compliance #39 5.6 5.6 FALSE 7 5 Compliance #40 <5 5.0 TRUE 8 5 Compliance with(EPA.92c.copper2.df, tolIntNpar(Copper[Well.type=="Background"], conf.level = 0.95, lb = 0, ti.type = "upper")) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Copper[Well.type == "Background"] # #Sample Size: 24 # #Tolerance Interval Coverage: 88.26538% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Limit Rank(s): 24 # #Tolerance Interval: LTL = 0.0 # UTL = 9.2 #---------- # Repeat the last example, except compute an upper # \eqn{\beta}-expectation tolerance interval: with(EPA.92c.copper2.df, tolIntNpar(Copper[Well.type=="Background"], cov.type = "expectation", lb = 0, ti.type = "upper")) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Copper[Well.type == "Background"] # #Sample Size: 24 # #Tolerance Interval Coverage: 96% # #Coverage Type: expectation # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Tolerance Limit Rank(s): 24 # #Tolerance Interval: LTL = 0.0 # UTL = 9.2
Compute the confidence level associated with a nonparametric β-content tolerance
interval for a continuous distribution given the sample size, coverage, and ranks of the
order statistics used for the interval.
tolIntNparConfLevel(n, coverage = 0.95, ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), ti.type = "two.sided")
n |
vector of positive integers specifying the sample sizes.
Missing ( |
coverage |
numeric vector of values between 0 and 1 indicating the desired coverage of the
|
ltl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
If the arguments n
, coverage
, ltl.rank
, and
n.plus.one.minus.utl.rank
are not all the same length, they are replicated to be the
same length as the length of the longest argument.
The help file for tolIntNpar
explains how nonparametric β-content
tolerance intervals are constructed and how the confidence level
associated with the tolerance interval is computed based on specified values
for the sample size, the coverage, and the ranks of the order statistics used for
the bounds of the tolerance interval.
vector of values between 0 and 1 indicating the confidence level associated with the specified nonparametric tolerance interval.
See the help file for tolIntNpar
.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, coverage, and confidence level if one of the objectives of
the sampling program is to produce tolerance intervals. The functions
tolIntNparN
, tolIntNparCoverage
, tolIntNparConfLevel
, and
plotTolIntNparDesign
can be used to investigate these relationships for
constructing nonparametric tolerance intervals.
Steven P. Millard ([email protected])
See the help file for tolIntNpar
.
tolIntNpar
, tolIntNparN
, tolIntNparCoverage
,
plotTolIntNparDesign
.
# Look at how the confidence level of a nonparametric tolerance interval increases with # increasing sample size: seq(10, 60, by=10) #[1] 10 20 30 40 50 60 round(tolIntNparConfLevel(n = seq(10, 60, by = 10)), 2) #[1] 0.09 0.26 0.45 0.60 0.72 0.81 #---------- # Look at how the confidence level of a nonparametric tolerance interval decreases with # increasing coverage: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(tolIntNparConfLevel(n = 10, coverage = seq(0.5, 0.9, by = 0.1)), 2) #[1] 0.99 0.95 0.85 0.62 0.26 #---------- # Look at how the confidence level of a nonparametric tolerance interval decreases with the # rank of the lower tolerance limit: round(tolIntNparConfLevel(n = 60, ltl.rank = 1:5), 2) #[1] 0.81 0.58 0.35 0.18 0.08 #========== # Example 17-4 on page 17-21 of USEPA (2009) uses copper concentrations (ppb) from 3 # background wells to set an upper limit for 2 compliance wells. There are 6 observations # per well, and the maximum value from the 3 wells is set to the 95% confidence upper # tolerance limit, and we need to determine the coverage of this tolerance interval. tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper") #[1] 0.8826538 # Here we will modify the example and determine the confidence level of the tolerance # interval when we set the coverage to 95%. tolIntNparConfLevel(n = 24, coverage = 0.95, ti.type = "upper") # [1] 0.708011
Compute the coverage associated with a nonparametric tolerance interval for a continuous
distribution given the sample size, confidence level, coverage type
(β-content versus β-expectation), and ranks of the order statistics
used for the interval.
tolIntNparCoverage(n, conf.level = 0.95, cov.type = "content", ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), ti.type = "two.sided")
n |
vector of positive integers specifying the sample sizes.
Missing ( |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the tolerance interval. |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ltl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
If the arguments n
, conf.level
, ltl.rank
, and
n.plus.one.minus.utl.rank
are not all the same length, they are replicated to be the
same length as the length of the longest argument.
The help file for tolIntNpar
explains how nonparametric β-content
tolerance intervals are constructed and how the coverage
associated with the tolerance interval is computed based on specified values
for the sample size, the confidence level, and the ranks of the order statistics used for
the bounds of the tolerance interval.
vector of values between 0 and 1 indicating the coverage associated with the specified nonparametric tolerance interval.
See the help file for tolIntNpar
.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, coverage, and confidence level if one of the objectives of
the sampling program is to produce tolerance intervals. The functions
tolIntNparN
, tolIntNparConfLevel
, tolIntNparCoverage
, and
plotTolIntNparDesign
can be used to investigate these relationships for
constructing nonparametric tolerance intervals.
Steven P. Millard ([email protected])
See the help file for tolIntNpar
.
tolIntNpar
, tolIntNparN
, tolIntNparConfLevel
,
plotTolIntNparDesign
.
# Look at how the coverage of a nonparametric tolerance interval increases with # increasing sample size: seq(10, 60, by=10) #[1] 10 20 30 40 50 60 round(tolIntNparCoverage(n = seq(10, 60, by = 10)), 2) #[1] 0.61 0.78 0.85 0.89 0.91 0.92 #--------- # Look at how the coverage of a nonparametric tolerance interval decreases with # increasing confidence level: seq(0.5, 0.9, by=0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(tolIntNparCoverage(n = 10, conf.level = seq(0.5, 0.9, by = 0.1)), 2) #[1] 0.84 0.81 0.77 0.73 0.66 #---------- # Look at how the coverage of a nonparametric tolerance interval decreases with # the rank of the lower tolerance limit: round(tolIntNparCoverage(n = 60, ltl.rank = 1:5), 2) #[1] 0.92 0.90 0.88 0.85 0.83 #========== # Example 17-4 on page 17-21 of USEPA (2009) uses copper concentrations (ppb) from 3 # background wells to set an upper limit for 2 compliance wells. The maximum value from # the 3 wells is set to the 95% confidence upper tolerance limit, and we need to # determine the coverage of this tolerance interval. tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper") #[1] 0.8826538
Compute the sample size necessary for a nonparametric tolerance interval (for a continuous
distribution) with a specified coverage and, in the case of a β-content tolerance
interval, a specified confidence level, given the ranks of the order statistics used for the
interval.
tolIntNparN(coverage = 0.95, conf.level = 0.95, cov.type = "content", ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), ti.type = "two.sided")
coverage |
numeric vector of values between 0 and 1 indicating the desired coverage of the tolerance interval. |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the tolerance interval. |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ltl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
If the arguments coverage
, conf.level
, ltl.rank
, and
n.plus.one.minus.utl.rank
are not all the same length, they are replicated to be the
same length as the length of the longest argument.
The help file for tolIntNpar
explains how nonparametric tolerance intervals
are constructed.
Computing Required Sample Size for a β-Content Tolerance Interval (cov.type="content")

For a β-content tolerance interval, if the coverage C = β is specified, then the
associated confidence level 100(1-α)% is computed as:

1 - α = 1 - F(β; v-u, w+u)

where F(y; δ, γ) denotes the cumulative distribution function of a
beta random variable with parameters shape1=δ and shape2=γ
evaluated at y. The value of 1-α is determined by
the argument conf.level. The value of β is determined by the argument
coverage. The value of u is determined by the argument ltl.rank. The value
of w is determined by the argument n.plus.one.minus.utl.rank. Once these values
have been determined, the above equation can be solved implicitly for n, since

v = n + 1 - w
Computing Required Sample Size for a β-Expectation Tolerance Interval (cov.type="expectation")

For a β-expectation tolerance interval, the expected coverage is simply the mean of a
beta random variable with parameters shape1 = v - u and shape2 = w + u,
which is given by:

E(C) = (v - u) / (n + 1)

or, using Equation (2) above, we can re-write the formula for the expected coverage as:

E(C) = (n + 1 - w - u) / (n + 1)

Thus, for user-specified values of u (ltl.rank), w (n.plus.one.minus.utl.rank),
and expected coverage β, the required sample size is computed as:

n = ceiling{ [(u + w) / (1 - β)] - 1 }

where ceiling(x) denotes the smallest integer greater than or equal to x.
(See the R help file for ceiling.)
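Both cases can be reproduced by hand for an upper limit set at the sample maximum (u = 0, w = 1); this is only an illustrative sketch of the computation, with the β-content search written as a simple grid scan rather than the implicit solution used by tolIntNparN:

beta <- 0.95; conf <- 0.95
u <- 0; w <- 1
# beta-content case: smallest n whose achieved confidence level is at least conf
n.cand <- 2:200
conf.achieved <- 1 - pbeta(beta, shape1 = n.cand + 1 - w - u, shape2 = w + u)
min(n.cand[conf.achieved >= conf])
#[1] 59   # same as tolIntNparN(coverage = 0.95, conf.level = 0.95, ti.type = "upper")
# beta-expectation case: closed-form ceiling formula from above
ceiling((u + w) / (1 - beta) - 1)
#[1] 19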
A vector of positive integers indicating the required sample size(s) for the specified nonparametric tolerance interval(s).
See the help file for tolIntNpar
.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, coverage, and confidence level if one of the objectives of
the sampling program is to produce tolerance intervals. The functions
tolIntNparN
, tolIntNparCoverage
, tolIntNparConfLevel
, and
plotTolIntNparDesign
can be used to investigate these relationships for
constructing nonparametric tolerance intervals.
Steven P. Millard ([email protected])
See the help file for tolIntNpar
.
tolIntNpar
, tolIntNparConfLevel
, tolIntNparCoverage
,
plotTolIntNparDesign
.
# Look at how the required sample size for a nonparametric tolerance interval increases # with increasing confidence level: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 tolIntNparN(conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 34 40 49 59 77 #---------- # Look at how the required sample size for a nonparametric tolerance interval increases # with increasing coverage: tolIntNparN(coverage = seq(0.5, 0.9, by = 0.1)) #[1] 8 10 14 22 46 #---------- # Look at how the required sample size for a nonparametric tolerance interval increases # with the rank of the lower tolerance limit: tolIntNparN(ltl.rank = 1:5) #[1] 93 124 153 181 208 #========== # Example 17-4 on page 17-21 of USEPA (2009) uses copper concentrations (ppb) from 3 # background wells to set an upper limit for 2 compliance wells. The maximum value from # the 3 wells is set to the 95% confidence upper tolerance limit, and we need to # determine the coverage of this tolerance interval. tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper") #[1] 0.8826538 # Here we will modify the example and determine the sample size required to produce # a tolerance interval with 95% confidence level AND 95% coverage. tolIntNparN(coverage = 0.95, conf.level = 0.95, ti.type = "upper") #[1] 59
Construct a β-content or β-expectation tolerance
interval for a Poisson distribution.
tolIntPois(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95)
x |
numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Poisson distribution
(i.e., |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
A tolerance interval for some population is an interval on the real line constructed so as to
contain 100β% of the population (i.e., 100β% of all
future observations), where 0 < β < 1. The quantity 100β% is called
the coverage.

There are two kinds of tolerance intervals (Guttman, 1970):

A β-content tolerance interval with confidence level 100(1-α)% is
constructed so that it contains at least 100β% of the population (i.e., the
coverage is at least 100β%) with probability 100(1-α)%, where
0 < α < 1. The quantity 100(1-α)% is called the confidence level or
confidence coefficient associated with the tolerance interval.

A β-expectation tolerance interval is constructed so that the average coverage of
the interval is 100β%.

Note: A β-expectation tolerance interval with coverage 100β% is
equivalent to a prediction interval for one future observation with associated confidence level
100β%. Note that there is no explicit confidence level associated with a
β-expectation tolerance interval. If a β-expectation tolerance interval is
treated as a β-content tolerance interval, the confidence level associated with this
tolerance interval is usually around 50% (e.g., Guttman, 1970, Table 4.2, p.76).
Because of the discrete nature of the Poisson distribution,
even true tolerance intervals (i.e., tolerance intervals based on the true value of
λ) will usually not contain exactly 100β% of the population.
For example, for the Poisson distribution with parameter lambda=2, the
interval [0, 4] contains 94.7% of this distribution and the interval [0, 5]
contains 98.3% of this distribution. Thus, no interval can contain exactly 95%
of this distribution.
β-Content Tolerance Intervals for a Poisson Distribution

Zacks (1970) showed that for monotone likelihood ratio (MLR) families of discrete
distributions, a uniformly most accurate upper 100β%-content
tolerance interval with associated confidence level 100(1-α)% is
constructed by finding the upper 100(1-α)% confidence limit for the
parameter associated with the distribution, and then computing the β'th
quantile of the distribution assuming the true value of the parameter is equal to
the upper confidence limit. This idea can be extended to one-sided lower and
two-sided tolerance limits.

It can be shown that all distributions that are one parameter exponential families have the MLR property, and the Poisson distribution is a one-parameter exponential family, so the method of Zacks (1970) can be applied to a Poisson distribution.

Let X denote a Poisson random variable with parameter lambda=λ. Let
x_p(λ) denote the p'th quantile of this distribution. That is,

Pr[X <= x_p(λ)] >= p

Note that due to the discrete nature of the Poisson distribution, there will be
several values of p associated with one value of x. For example, for
λ = 2, the value 1 is the p'th quantile for any value of p
between 0.140 and 0.406.

Let x = (x_1, x_2, ..., x_n) denote a vector of n observations from a
Poisson distribution with parameter lambda=λ.
When ti.type="upper", the first step is to compute the one-sided upper
100(1-α)% confidence limit for λ based on the observations x
(see the help file for epois). Denote this upper
confidence limit by UCL. The one-sided upper 100β% tolerance limit
is then given by:

[0, x_β(UCL)]

Similarly, when ti.type="lower", the first step is to compute the one-sided
lower 100(1-α)% confidence limit for λ based on the n
observations x. Denote this lower confidence limit by LCL.
The one-sided lower 100β% tolerance limit is then given by:

[x_(1-β)(LCL), ∞)

Finally, when ti.type="two-sided", the first step is to compute the two-sided
100(1-α)% confidence limits for λ based on the n
observations x. Denote these confidence limits by LCL and UCL.
The two-sided 100β% tolerance limit is then given by:

[x_((1-β)/2)(LCL), x_((1+β)/2)(UCL)]

Note that the function tolIntPois uses the exact confidence limits for
λ when computing β-content tolerance limits (see epois).
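The two-sided example in the Examples section below can be checked by hand using these formulas. The sketch below is illustrative only; it assumes that the exact confidence limits for λ are the standard chi-square based (Garwood) limits, an assumption made here rather than stated in this help file.

# Hand check of the two-sided tolerance interval from the example below
# (n = 20 observations, sum(x) = 36, two-sided 90% confidence limits for
# lambda, 95% coverage).
set.seed(250)
dat <- rpois(20, 2)
n <- length(dat)
S <- sum(dat)
# Chi-square based (Garwood) exact two-sided 90% confidence limits for lambda:
LCL <- qchisq(0.05, 2 * S) / (2 * n)
UCL <- qchisq(0.95, 2 * (S + 1)) / (2 * n)
# Two-sided 95% beta-content tolerance limits:
c(LTL = qpois((1 - 0.95) / 2, LCL), UTL = qpois((1 + 0.95) / 2, UCL))
#LTL UTL
#  0   6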
β-Expectation Tolerance Intervals for a Poisson Distribution

As stated above, a β-expectation tolerance interval with coverage
100β% is equivalent to a prediction interval for one future observation
with associated confidence level 100β%. This is because the probability
that any single future observation will fall into this interval is 100β%,
so the distribution of the number of N future observations that will fall into
this interval is binomial with parameters size=N and
prob=β. Hence the expected proportion of
future observations that fall into this interval is 100β% and is
independent of the value of N. See the help file for
predIntPois
for information on how these intervals are constructed.
If x
is a numeric vector, tolIntPois
returns a list of class
"estimate"
containing the estimated parameters, a component called
interval
containing the tolerance interval information, and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, tolIntPois
returns a list whose class is the same as x
. The list contains the same
components as x
. If x
already has a component called
interval
, this component is replaced with the tolerance interval
information.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Gibbons (1987b) used the Poisson distribution to model the number of detected
compounds per scan of the 32 volatile organic priority pollutants (VOC), and
also to model the distribution of chemical concentration (in ppb). He explained
the derivation of a one-sided upper β-content tolerance limit for a
Poisson distribution based on the work of Zacks (1970), using the Pearson-Hartley
approximation to the confidence limits for the mean parameter λ
(see the help file for epois). Note that there are several
typographical errors in the derivation and examples on page 575 of Gibbons (1987b)
because there is confusion between where the value of β (the coverage)
should be and where the value of 1-α (the confidence level) should be.
Gibbons et al. (2009, pp.103-104) gives correct formulas.
Steven P. Millard ([email protected])
Gibbons, R.D. (1987b). Statistical Models for the Analysis of Volatile Organic Compounds in Waste Disposal Sites. Ground Water 25, 572–580.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 4.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Zacks, S. (1970). Uniformly Most Accurate Upper Tolerance Limits for Monotone Likelihood Ratio Families of Discrete Distributions. Journal of the American Statistical Association 65, 307–316.
Poisson
, epois
, eqpois
,
estimate.object
, Tolerance Intervals,
Estimating Distribution Parameters, Estimating Distribution Quantiles.
# Generate 20 observations from a Poisson distribution with parameter # lambda=2. The interval [0, 4] contains 94.7% of this distribution and # the interval [0,5] contains 98.3% of this distribution. Thus, because # of the discrete nature of the Poisson distribution, no interval contains # exactly 95% of this distribution. Use tolIntPois to estimate the mean # parameter of the true distribution, and construct a one-sided upper 95% # beta-content tolerance interval with associated confidence level 90%. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rpois(20, 2) tolIntPois(dat, conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 1.8 # #Estimation Method: mle/mme/mvue # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Zacks # #Tolerance Interval Type: two-sided # #Confidence Level: 90% # #Tolerance Interval: LTL = 0 # UTL = 6 #------ # Clean up rm(dat)
Monthly estimated total phosphorus mass (mg) within a water column at two different stations for the 5-year time period October 1984 to September 1989 from a study on phosphorus concentration conducted in the Chesapeake Bay.
Total.P.df
A data frame with 60 observations on the following 4 variables.
CB3.1
a numeric vector of phosphorus concentrations at station CB3.1
CB3.3e
a numeric vector of phosphorus concentrations at station CB3.3e
Month
a factor indicating the month the observation was taken
Year
a numeric vector indicating the year an observation was taken
Neerchal, N. K., and S. L. Brunenmeister. (1993). Estimation of Trend in Chesapeake Bay Water Quality Data. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 6: Multivariate Environmental Statistics. North-Holland, Amsterdam, Chapter 19, 407-422.
Density, distribution function, quantile function, and random generation
for the triangular distribution with parameters min
, max
,
and mode
.
dtri(x, min = 0, max = 1, mode = 1/2) ptri(q, min = 0, max = 1, mode = 1/2) qtri(p, min = 0, max = 1, mode = 1/2) rtri(n, min = 0, max = 1, mode = 1/2)
x |
vector of quantiles. Missing values ( |
q |
vector of quantiles. Missing values ( |
p |
vector of probabilities between 0 and 1. Missing values ( |
n |
sample size. If |
min |
vector of minimum values of the distribution of the random variable.
The default value is |
max |
vector of maximum values of the random variable.
The default value is |
mode |
vector of modes of the random variable.
The default value is |
Let X be a triangular random variable with parameters min=a,
max=b, and mode=c.

Probability Density and Cumulative Distribution Function

The density function of X is given by:

f(x; a, b, c) = 2(x - a) / [(b - a)(c - a)]    for a <= x <= c

f(x; a, b, c) = 2(b - x) / [(b - a)(b - c)]    for c <= x <= b

where a < c < b.

The cumulative distribution function of X is given by:

F(x; a, b, c) = (x - a)^2 / [(b - a)(c - a)]        for a <= x <= c

F(x; a, b, c) = 1 - (b - x)^2 / [(b - a)(b - c)]    for c <= x <= b

where a < c < b.

Quantiles

The p'th quantile of X is given by:

x_p = a + sqrt[(b - a)(c - a) p]           for 0 <= p <= (c - a)/(b - a)

x_p = b - sqrt[(b - a)(b - c)(1 - p)]      for (c - a)/(b - a) <= p <= 1

where 0 <= p <= 1.

Random Numbers

Random numbers are generated using the inverse transformation method:

x = F^(-1)(u)

where u is a random deviate from a uniform [0, 1] distribution.

Mean and Variance

The mean and variance of X are given by:

E(X) = (a + b + c) / 3

Var(X) = (a^2 + b^2 + c^2 - ab - ac - bc) / 18
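As a quick numerical check of the quantile formula (illustrative only), the 25'th percentile in the qtri() example further down can be computed by hand:

# 25'th percentile of a triangular distribution with min=1, max=4, mode=3.
# p = 0.25 is less than (mode - min)/(max - min) = 2/3, so use the lower branch:
a <- 1; b <- 4; m <- 3   # m denotes the mode
p <- 0.25
a + sqrt((b - a) * (m - a) * p)
#[1] 2.224745   # same value returned by qtri(0.25, 1, 4, 3) in the Examples section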
dtri
gives the density, ptri
gives the distribution function,
qtri
gives the quantile function, and rtri
generates random
deviates.
The triangular distribution is so named because of the shape of its probability
density function. The average of two independent identically distributed
uniform random variables with parameters min=a and max=b
has a triangular distribution with parameters min=a, max=b,
and mode=(a + b)/2.
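A small simulation illustrating this property (illustrative only; the seed below is arbitrary):

# The average of two iid Uniform(0, 1) random variables has a
# Triangular(min = 0, max = 1, mode = 1/2) distribution.
set.seed(47)
u.bar <- (runif(10000) + runif(10000)) / 2
# Compare the empirical CDF with ptri() at a few points:
cbind(empirical   = ecdf(u.bar)(c(0.25, 0.5, 0.75)),
      theoretical = ptri(c(0.25, 0.5, 0.75), min = 0, max = 1, mode = 1/2))
# The two columns should agree to within sampling error
# (theoretical values are 0.125, 0.5, and 0.875).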
The triangular distribution is sometimes used as an input distribution in probability risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Uniform, Probability Distributions and Random Numbers.
# Density of a triangular distribution with parameters # min=10, max=15, and mode=12, evaluated at 12, 13 and 14: dtri(12:14, 10, 15, 12) #[1] 0.4000000 0.2666667 0.1333333 #---------- # The cdf of a triangular distribution with parameters # min=2, max=7, and mode=5, evaluated at 3, 4, and 5: ptri(3:5, 2, 7, 5) #[1] 0.06666667 0.26666667 0.60000000 #---------- # The 25'th percentile of a triangular distribution with parameters # min=1, max=4, and mode=3: qtri(0.25, 1, 4, 3) #[1] 2.224745 #---------- # A random sample of 4 numbers from a triangular distribution with # parameters min=3 , max=20, and mode=12. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(10) rtri(4, 3, 20, 12) #[1] 11.811593 9.850955 11.081885 13.539496
Compute the Type I Error level necessary to achieve a specified power for a one- or two-sample t-test, given the sample size(s) and scaled difference.
tTestAlpha(n.or.n1, n2 = n.or.n1, delta.over.sigma = 0, power = 0.95, sample.type = ifelse(!missing(n2) && !is.null(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE, tol = 1e-07, maxiter = 1000)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
|
delta.over.sigma |
numeric vector specifying the ratio of the true difference ( |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
tol |
numeric scalar indicating the tolerance argument to pass to the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
Formulas for the power of the t-test for specified values of
the sample size, scaled difference, and Type I error level are given in
the help file for tTestPower
. The function tTestAlpha
uses the uniroot
search algorithm to determine the
required Type I error level for specified values of the sample size, power,
and scaled difference.
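For the two-sample case, the result can be cross-checked against the base R function power.t.test, which solves for whichever single argument is passed as NULL. This is offered only as an independent sanity check, not as the internal implementation of tTestAlpha:

# Type I error level giving 90% power with n = 20 per group and a
# scaled difference (delta/sigma) of 1:
power.t.test(n = 20, delta = 1, sd = 1, sig.level = NULL, power = 0.9,
  type = "two.sample", alternative = "two.sided")$sig.level
# Approximately 0.07, in agreement with the tTestAlpha() example below.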
numeric vector of Type I error levels.
See tTestPower
.
Steven P. Millard ([email protected])
See tTestPower
.
tTestPower
, tTestScaledMdd
,
tTestN
,
plotTTestDesign
, Normal,
t.test
, Hypothesis Tests.
# Look at how the required Type I error level for the one-sample t-test # decreases with increasing sample size. Set the power to 80% and # the scaled difference to 0.5. seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 alpha <- tTestAlpha(n.or.n1 = seq(5, 30, by = 5), power = 0.8, delta.over.sigma = 0.5) round(alpha, 2) #[1] 0.65 0.45 0.29 0.18 0.11 0.07 #---------- # Repeat the last example, but use the approximation. # Note how the approximation underestimates the power # for the smaller sample sizes. #---------------------------------------------------- alpha <- tTestAlpha(n.or.n1 = seq(5, 30, by = 5), power = 0.8, delta.over.sigma = 0.5, approx = TRUE) round(alpha, 2) #[1] 0.63 0.46 0.30 0.18 0.11 0.07 #---------- # Look at how the required Type I error level for the two-sample # t-test decreases with increasing scaled difference. Use # a power of 90% and a sample size of 10 in each group. seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 alpha <- tTestAlpha(10, sample.type = "two.sample", power = 0.9, delta.over.sigma = seq(0.5, 2, by = 0.5)) round(alpha, 2) #[1] 0.82 0.35 0.06 0.01 #---------- # Look at how the required Type I error level for the two-sample # t-test increases with increasing values of required power. Use # a sample size of 20 for each group and a scaled difference of # 1. alpha <- tTestAlpha(20, sample.type = "two.sample", delta.over.sigma = 1, power = c(0.8, 0.9, 0.95)) round(alpha, 2) #[1] 0.03 0.07 0.14 #---------- # Clean up #--------- rm(alpha)
Compute the sample size necessary to achieve a specified power for a one- or two-sample t-test, given the ratio of means, coefficient of variation, and significance level, assuming lognormal data.
tTestLnormAltN(ratio.of.means, cv = 1, alpha = 0.05, power = 0.95, sample.type = ifelse(!is.null(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE, n2 = NULL, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
ratio.of.means |
numeric vector specifying the ratio of the first mean to the second mean.
When |
cv |
numeric vector of positive value(s) specifying the coefficient of
variation. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
n2 |
numeric vector of sample sizes for group 2. The default value is
|
round.up |
logical scalar indicating whether to round up the values of the computed
sample size(s) to the next largest integer. The default value is
|
n.max |
positive integer greater than 1 indicating the maximum sample size when |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
If the arguments ratio.of.means
, cv
, alpha
, power
, and
n2
are not all the same length, they are replicated to be the same length as
the length of the longest argument.
Formulas for the power of the t-test for lognormal data for specified values of
the sample size, ratio of means, and Type I error level are given in
the help file for tTestLnormAltPower
. The function
tTestLnormAltN
uses the uniroot
search algorithm to determine
the required sample size(s) for specified values of the power,
ratio of means, and Type I error level.
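As a quick sanity check (a sketch with arbitrary design values, not one of the package's documented examples), the returned sample size should be the smallest n for which tTestLnormAltPower() meets the requested power:

n <- tTestLnormAltN(ratio.of.means = 1.5, cv = 1, power = 0.9)
n                                                             # 47, per the Examples below
tTestLnormAltPower(n.or.n1 = n,     ratio.of.means = 1.5, cv = 1)  # should be >= 0.9
tTestLnormAltPower(n.or.n1 = n - 1, ratio.of.means = 1.5, cv = 1)  # should be  < 0.9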
When sample.type="one.sample"
, or sample.type="two.sample"
and n2
is not supplied (so equal sample sizes for each group are
assumed), tTestLnormAltN
returns a numeric vector of sample sizes. When
sample.type="two.sample"
and n2
is supplied,
tTestLnormAltN
returns a list with two components called n1
and
n2
, specifying the sample sizes for each group.
See tTestLnormAltPower
.
Steven P. Millard ([email protected])
See tTestLnormAltPower
.
tTestLnormAltPower
, tTestLnormAltRatioOfMeans
,
plotTTestLnormAltDesign
, LognormalAlt,
t.test
, Hypothesis Tests.
# Look at how the required sample size for the one-sample test increases with # increasing required power: seq(0.5, 0.9, by = 0.1) # [1] 0.5 0.6 0.7 0.8 0.9 tTestLnormAltN(ratio.of.means = 1.5, power = seq(0.5, 0.9, by = 0.1)) # [1] 19 23 28 36 47 #---------- # Repeat the last example, but compute the sample size based on the approximate # power instead of the exact power: tTestLnormAltN(ratio.of.means = 1.5, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) # [1] 19 23 29 36 47 #========== # Look at how the required sample size for the two-sample t-test decreases with # increasing ratio of means: seq(1.5, 2, by = 0.1) #[1] 1.5 1.6 1.7 1.8 1.9 2.0 tTestLnormAltN(ratio.of.means = seq(1.5, 2, by = 0.1), sample.type = "two") #[1] 111 83 65 54 45 39 #---------- # Look at how the required sample size for the two-sample t-test decreases with # increasing values of Type I error: tTestLnormAltN(ratio.of.means = 1.5, alpha = c(0.001, 0.01, 0.05, 0.1), sample.type = "two") #[1] 209 152 111 92 #---------- # For the two-sample t-test, compare the total sample size required to detect a # ratio of means of 2 for equal sample sizes versus the case when the sample size # for the second group is constrained to be 30. Assume a coefficient of variation # of 1, a 5% significance level, and 95% power. Note that for the case of equal # sample sizes, a total of 78 samples (39+39) are required, whereas when n2 is # constrained to be 30, a total of 84 samples (54 + 30) are required. tTestLnormAltN(ratio.of.means = 2, sample.type = "two") #[1] 39 tTestLnormAltN(ratio.of.means = 2, n2 = 30) #$n1: #[1] 54 # #$n2: #[1] 30 #========== # The guidance document Soil Screening Guidance: Technical Background Document # (USEPA, 1996c, Part 4) discusses sampling design and sample size calculations # for studies to determine whether the soil at a potentially contaminated site # needs to be investigated for possible remedial action. Let 'theta' denote the # average concentration of the chemical of concern. The guidance document # establishes the following goals for the decision rule (USEPA, 1996c, p.87): # # Pr[Decide Don't Investigate | theta > 2 * SSL] = 0.05 # # Pr[Decide to Investigate | theta <= (SSL/2)] = 0.2 # # where SSL denotes the pre-established soil screening level. # # These goals translate into a Type I error of 0.2 for the null hypothesis # # H0: [theta / (SSL/2)] <= 1 # # and a power of 95% for the specific alternative hypothesis # # Ha: [theta / (SSL/2)] = 4 # # Assuming a lognormal distribution and the above values for Type I error and # power, determine the required samples sizes associated with various values of # the coefficient of variation for the one-sample test. Based on these calculations, # you need to take at least 6 soil samples to satisfy the requirements for the # Type I and Type II errors when the coefficient of variation is 2. cv <- c(0.5, 1, 2) N <- tTestLnormAltN(ratio.of.means = 4, cv = cv, alpha = 0.2, alternative = "greater") names(N) <- paste("CV=", cv, sep = "") N #CV=0.5 CV=1 CV=2 # 2 3 6 #---------- # Repeat the last example, but use the approximate power calculation instead of the # exact. Using the approximate power calculation, you need 7 soil samples when the # coefficient of variation is 2 (because the approximation underestimates the # true power). 
N <- tTestLnormAltN(ratio.of.means = 4, cv = cv, alpha = 0.2, alternative = "greater", approx = TRUE) names(N) <- paste("CV=", cv, sep = "") N #CV=0.5 CV=1 CV=2 # 3 5 7 #---------- # Repeat the last example, but use a Type I error of 0.05. N <- tTestLnormAltN(ratio.of.means = 4, cv = cv, alternative = "greater", approx = TRUE) names(N) <- paste("CV=", cv, sep = "") N #CV=0.5 CV=1 CV=2 # 4 6 12 #========== # Reproduce the second column of Table 2 in van Belle and Martin (1993, p.167). tTestLnormAltN(ratio.of.means = 1.10, cv = seq(0.1, 0.8, by = 0.1), power = 0.8, sample.type = "two.sample", approx = TRUE) #[1] 19 69 150 258 387 533 691 856 #========== # Clean up #--------- rm(cv, N)
Compute the power of a one- or two-sample t-test, given the sample size, ratio of means, coefficient of variation, and significance level, assuming lognormal data.
tTestLnormAltPower(n.or.n1, n2 = n.or.n1, ratio.of.means = 1, cv = 1, alpha = 0.05, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
|
ratio.of.means |
numeric vector specifying the ratio of the first mean to the second mean.
When |
cv |
numeric vector of positive value(s) specifying the coefficient of
variation. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
If the arguments n.or.n1
, n2
, ratio.of.means
, cv
, and
alpha
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
One-Sample Case (sample.type="one.sample"
)
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ observations from a lognormal distribution with mean $\theta$ and coefficient of variation $\tau$, and consider the null hypothesis:

$$H_0: \theta = \theta_0 \;\;\;\;\;\; (1)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \theta > \theta_0 \;\;\;\;\;\; (2)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \theta < \theta_0 \;\;\;\;\;\; (3)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \theta \ne \theta_0 \;\;\;\;\;\; (4)$$
To test the null hypothesis (1) versus any of the three alternatives (2)-(4), one might be tempted to use Student's t-test based on the log-transformed observations. Unlike the two-sample case with equal coefficients of variation (see below), in the one-sample case Student's t-test applied to the log-transformed observations will not test the correct hypothesis, as now explained.
Let

$$y_i = \log(x_i), \;\; i = 1, 2, \ldots, n \;\;\;\;\;\; (5)$$

Then $y_1, y_2, \ldots, y_n$ denote $n$ observations from a normal distribution with mean $\mu$ and standard deviation $\sigma$, where

$$\mu = \log\left(\frac{\theta}{\sqrt{\tau^2 + 1}}\right) \;\;\;\;\;\; (6)$$

$$\sigma = \sqrt{\log(\tau^2 + 1)} \;\;\;\;\;\; (7)$$

$$\theta = \exp\left(\mu + \frac{\sigma^2}{2}\right) \;\;\;\;\;\; (8)$$

$$\tau = \sqrt{\exp(\sigma^2) - 1} \;\;\;\;\;\; (9)$$

(see the help file for LognormalAlt). Hence, by Equations (6) and (8) above, the Student's t-test on the log-transformed data would involve a test of hypothesis on both the parameters $\theta$ and $\tau$, not just on $\theta$.
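The relationships in Equations (6)-(9) are easy to verify numerically; a minimal base-R sketch (the values of theta and tau are arbitrary):

theta <- 10; tau <- 2                   # lognormal mean and coefficient of variation
mu    <- log(theta / sqrt(tau^2 + 1))   # Equation (6)
sigma <- sqrt(log(tau^2 + 1))           # Equation (7)
exp(mu + sigma^2 / 2)                   # Equation (8): recovers theta = 10
sqrt(exp(sigma^2) - 1)                  # Equation (9): recovers tau = 2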
To test the null hypothesis (1) above versus any of the alternatives (2)-(4), you
can use the function elnormAlt
to compute a confidence interval for $\theta$,
and use the relationship between confidence intervals and hypothesis
tests. To test the null hypothesis (1) above versus the upper one-sided alternative
(2), you can also use
Chen's modified t-test for skewed distributions.
Although you can't use Student's t-test based on the log-transformed observations to test a hypothesis about $\theta$, you can use the t-distribution to estimate the power of a test about $\theta$ that is based on confidence intervals or Chen's modified t-test, if you are willing to assume the population coefficient of variation $\tau$ stays constant for all possible values of $\theta$ you are interested in, and you are willing to postulate possible values for $\theta$.
First, let's re-write the hypotheses (1)-(4) as follows. The null hypothesis (1) is equivalent to:

$$H_0: \frac{\theta}{\theta_0} = 1 \;\;\;\;\;\; (10)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \frac{\theta}{\theta_0} > 1 \;\;\;\;\;\; (11)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \frac{\theta}{\theta_0} < 1 \;\;\;\;\;\; (12)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \frac{\theta}{\theta_0} \ne 1 \;\;\;\;\;\; (13)$$
For a constant coefficient of variation $\tau$, the standard deviation $\sigma$ of the log-transformed observations is also constant (see Equation (7) above). Hence, by Equation (8), the ratio of the true mean to the hypothesized mean can be written as:

$$R = \frac{\theta}{\theta_0} = \frac{\exp(\mu + \sigma^2/2)}{\exp(\mu_0 + \sigma^2/2)} = e^{\mu - \mu_0} \;\;\;\;\;\; (14)$$

which only involves the difference

$$\mu - \mu_0 = \log(R) \;\;\;\;\;\; (15)$$

Thus, for given values of $R$ and $\tau$, the power of the test of the null hypothesis (10) against any of the alternatives (11)-(13) can be computed based on the power of a one-sample t-test with

$$\frac{\delta}{\sigma} = \frac{\mu - \mu_0}{\sigma} = \frac{\log(R)}{\sqrt{\log(\tau^2 + 1)}} \;\;\;\;\;\; (16)$$

(see the help file for tTestPower). Note that for the function tTestLnormAltPower, $R$ corresponds to the argument ratio.of.means, and $\tau$ corresponds to the argument cv.
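A minimal sketch of this equivalence (design values chosen arbitrarily, not one of the package's documented examples): the lognormal-case power should match tTestPower() evaluated at the scaled difference given by Equation (16).

R <- 2; cv <- 1; n <- 10
tTestLnormAltPower(n.or.n1 = n, ratio.of.means = R, cv = cv, alternative = "greater")
tTestPower(n.or.n1 = n, alternative = "greater",
  delta.over.sigma = log(R) / sqrt(log(cv^2 + 1)))
# The two calls should return essentially the same power.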
Two-Sample Case (sample.type="two.sample"
)
Let $x_{11}, x_{12}, \ldots, x_{1n_1}$ denote a vector of $n_1$ observations from a lognormal distribution with mean $\theta_1$ and coefficient of variation $\tau$, and let $x_{21}, x_{22}, \ldots, x_{2n_2}$ denote a vector of $n_2$ observations from a lognormal distribution with mean $\theta_2$ and coefficient of variation $\tau$, and consider the null hypothesis:

$$H_0: \theta_1 = \theta_2 \;\;\;\;\;\; (17)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \theta_1 > \theta_2 \;\;\;\;\;\; (18)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \theta_1 < \theta_2 \;\;\;\;\;\; (19)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \theta_1 \ne \theta_2 \;\;\;\;\;\; (20)$$
Because we are assuming the coefficient of variation is the same for
both populations, the test of the null hypothesis (17) versus any of the three
alternatives (18)-(20) can be based on the Student t-statistic using the
log-transformed observations.
To show this, first, let's re-write the hypotheses (17)-(20) as follows. The null hypothesis (17) is equivalent to:

$$H_0: \frac{\theta_1}{\theta_2} = 1 \;\;\;\;\;\; (21)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \frac{\theta_1}{\theta_2} > 1 \;\;\;\;\;\; (22)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \frac{\theta_1}{\theta_2} < 1 \;\;\;\;\;\; (23)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \frac{\theta_1}{\theta_2} \ne 1 \;\;\;\;\;\; (24)$$
If the coefficient of variation $\tau$ is the same for both populations, then the standard deviation $\sigma$ of the log-transformed observations is also the same for both populations (see Equation (7) above). Hence, by Equation (8), the ratio of the means can be written as:

$$R = \frac{\theta_1}{\theta_2} = \frac{\exp(\mu_1 + \sigma^2/2)}{\exp(\mu_2 + \sigma^2/2)} = e^{\mu_1 - \mu_2} \;\;\;\;\;\; (25)$$

which only involves the difference

$$\mu_1 - \mu_2 = \log(R) \;\;\;\;\;\; (26)$$

Thus, for given values of $R$ and $\tau$, the power of the test of the null hypothesis (21) against any of the alternatives (22)-(24) can be computed based on the power of a two-sample t-test with

$$\frac{\delta}{\sigma} = \frac{\mu_1 - \mu_2}{\sigma} = \frac{\log(R)}{\sqrt{\log(\tau^2 + 1)}} \;\;\;\;\;\; (27)$$

(see the help file for tTestPower). Note that for the function tTestLnormAltPower, $R$ corresponds to the argument ratio.of.means, and $\tau$ corresponds to the argument cv.
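The analogous two-sample check (again a sketch with arbitrary values), using the scaled difference from Equation (27):

R <- 1.5; cv <- 1
tTestLnormAltPower(n.or.n1 = 15, n2 = 15, ratio.of.means = R, cv = cv,
  sample.type = "two.sample")
tTestPower(n.or.n1 = 15, n2 = 15, sample.type = "two.sample",
  delta.over.sigma = log(R) / sqrt(log(cv^2 + 1)))
# Both calls should give the same power.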
a numeric vector of powers.
The normal distribution and
lognormal distribution are probably the two most
frequently used distributions to model environmental data. Often, you need to
determine whether a population mean is significantly different from a specified
standard (e.g., an MCL or ACL, USEPA, 1989b, Section 6), or whether two different
means are significantly different from each other (e.g., USEPA 2009, Chapter 16).
When you have lognormally-distributed data, you have to be careful about making
statements regarding inference for the mean. For the two-sample case with
assumed equal coefficients of variation, you can perform the
Student's t-test on the log-transformed observations.
For the one-sample case, you can perform a hypothesis test by constructing a
confidence interval for the mean using elnormAlt
, or use
Chen's t-test modified for skewed data.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether a mean differs from a specified level or two means differ from each other.
The functions tTestLnormAltPower
, tTestLnormAltN
,
tTestLnormAltRatioOfMeans
, and plotTTestLnormAltDesign
can be used to investigate these relationships for the case of
lognormally-distributed observations.
Steven P. Millard ([email protected])
van Belle, G., and D.C. Martin. (1993). Sample Size as a Function of Coefficient of Variation and Ratio of Means. The American Statistician 47(3), 165–167.
Also see the list of references in the help file for tTestPower
.
tTestLnormAltN
, tTestLnormAltRatioOfMeans
,
plotTTestLnormAltDesign
, LognormalAlt,
t.test
, Hypothesis Tests.
# Look at how the power of the one-sample test increases with increasing # sample size: seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 power <- tTestLnormAltPower(n.or.n1 = seq(5, 30, by = 5), ratio.of.means = 1.5, cv = 1) round(power, 2) #[1] 0.14 0.28 0.42 0.54 0.65 0.73 #---------- # Repeat the last example, but use the approximation to the power instead of the # exact power. Note how the approximation underestimates the true power for # the smaller sample sizes: power <- tTestLnormAltPower(n.or.n1 = seq(5, 30, by = 5), ratio.of.means = 1.5, cv = 1, approx = TRUE) round(power, 2) #[1] 0.09 0.25 0.40 0.53 0.64 0.73 #========== # Look at how the power of the two-sample t-test increases with increasing # ratio of means: power <- tTestLnormAltPower(n.or.n1 = 20, sample.type = "two", ratio.of.means = c(1.1, 1.5, 2), cv = 1) round(power, 2) #[1] 0.06 0.32 0.73 #---------- # Look at how the power of the two-sample t-test increases with increasing # values of Type I error: power <- tTestLnormAltPower(30, sample.type = "two", ratio.of.means = 1.5, cv = 1, alpha = c(0.001, 0.01, 0.05, 0.1)) round(power, 2) #[1] 0.07 0.23 0.46 0.59 #========== # The guidance document Soil Screening Guidance: Technical Background Document # (USEPA, 1996c, Part 4) discusses sampling design and sample size calculations # for studies to determine whether the soil at a potentially contaminated site # needs to be investigated for possible remedial action. Let 'theta' denote the # average concentration of the chemical of concern. The guidance document # establishes the following goals for the decision rule (USEPA, 1996c, p.87): # # Pr[Decide Don't Investigate | theta > 2 * SSL] = 0.05 # # Pr[Decide to Investigate | theta <= (SSL/2)] = 0.2 # # where SSL denotes the pre-established soil screening level. # # These goals translate into a Type I error of 0.2 for the null hypothesis # # H0: [theta / (SSL/2)] <= 1 # # and a power of 95% for the specific alternative hypothesis # # Ha: [theta / (SSL/2)] = 4 # # Assuming a lognormal distribution with a coefficient of variation of 2, # determine the power associated with various sample sizes for this one-sample test. # Based on these calculations, you need to take at least 6 soil samples to # satisfy the requirements for the Type I and Type II errors. power <- tTestLnormAltPower(n.or.n1 = 2:8, ratio.of.means = 4, cv = 2, alpha = 0.2, alternative = "greater") names(power) <- paste("N=", 2:8, sep = "") round(power, 2) # N=2 N=3 N=4 N=5 N=6 N=7 N=8 #0.65 0.80 0.88 0.93 0.96 0.97 0.98 #---------- # Repeat the last example, but use the approximate power calculation instead of # the exact one. Using the approximate power calculation, you need at least # 7 soil samples instead of 6 (because the approximation underestimates the power). power <- tTestLnormAltPower(n.or.n1 = 2:8, ratio.of.means = 4, cv = 2, alpha = 0.2, alternative = "greater", approx = TRUE) names(power) <- paste("N=", 2:8, sep = "") round(power, 2) # N=2 N=3 N=4 N=5 N=6 N=7 N=8 #0.55 0.75 0.84 0.90 0.93 0.95 0.97 #========== # Clean up #--------- rm(power)
Compute the minimal or maximal detectable ratio of means associated with a one- or two-sample t-test, given the sample size, coefficient of variation, significance level, and power, assuming lognormal data.
tTestLnormAltRatioOfMeans(n.or.n1, n2 = n.or.n1, cv = 1, alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, tol = 1e-07, maxiter = 1000)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
|
cv |
numeric vector of positive value(s) specifying the coefficient of
variation. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
two.sided.direction |
character string indicating the direction (greater than 1 or less than 1) for the
detectable ratio of means when |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
If the arguments n.or.n1
, n2
, cv
, alpha
, and
power
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
Formulas for the power of the t-test for lognormal data for specified values of
the sample size, ratio of means, and Type I error level are given in
the help file for tTestLnormAltPower
. The function
tTestLnormAltRatioOfMeans
uses the uniroot
search algorithm
to determine the required ratio of means for specified values of the power,
sample size, and Type I error level.
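A round-trip sketch (arbitrary design values, not one of the package's documented examples): the ratio of means returned for a given design should reproduce the requested power when passed back to tTestLnormAltPower().

R <- tTestLnormAltRatioOfMeans(n.or.n1 = 20, cv = 1, power = 0.9)
R                                                             # about 1.89, per the Examples below
tTestLnormAltPower(n.or.n1 = 20, ratio.of.means = R, cv = 1)  # approximately 0.9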
a numeric vector of computed minimal or maximal detectable ratios of means. When alternative="less"
, or alternative="two.sided"
and
two.sided.direction="less"
, the computed ratios are less than 1
(but greater than 0). Otherwise, the ratios are greater than 1.
See tTestLnormAltPower
.
Steven P. Millard ([email protected])
See tTestLnormAltPower
.
tTestLnormAltPower
, tTestLnormAltN
,
plotTTestLnormAltDesign
, LognormalAlt,
t.test
, Hypothesis Tests.
# Look at how the minimal detectable ratio of means for the one-sample t-test # increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = 20, power = seq(0.5, 0.9, by = 0.1)) round(ratio.of.means, 2) #[1] 1.47 1.54 1.63 1.73 1.89 #---------- # Repeat the last example, but compute the minimal detectable ratio of means # based on the approximate power instead of the exact: ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = 20, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) round(ratio.of.means, 2) #[1] 1.48 1.55 1.63 1.73 1.89 #========== # Look at how the minimal detectable ratio of means for the two-sample t-test # decreases with increasing sample size: seq(10, 50, by = 10) #[1] 10 20 30 40 50 ratio.of.means <- tTestLnormAltRatioOfMeans(seq(10, 50, by = 10), sample.type="two") round(ratio.of.means, 2) #[1] 4.14 2.65 2.20 1.97 1.83 #---------- # Look at how the minimal detectable ratio of means for the two-sample t-test # decreases with increasing values of Type I error: ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = 20, alpha = c(0.001, 0.01, 0.05, 0.1), sample.type = "two") round(ratio.of.means, 2) #[1] 4.06 3.20 2.65 2.42 #========== # The guidance document Soil Screening Guidance: Technical Background Document # (USEPA, 1996c, Part 4) discusses sampling design and sample size calculations # for studies to determine whether the soil at a potentially contaminated site # needs to be investigated for possible remedial action. Let 'theta' denote the # average concentration of the chemical of concern. The guidance document # establishes the following goals for the decision rule (USEPA, 1996c, p.87): # # Pr[Decide Don't Investigate | theta > 2 * SSL] = 0.05 # # Pr[Decide to Investigate | theta <= (SSL/2)] = 0.2 # # where SSL denotes the pre-established soil screening level. # # These goals translate into a Type I error of 0.2 for the null hypothesis # # H0: [theta / (SSL/2)] <= 1 # # and a power of 95% for the specific alternative hypothesis # # Ha: [theta / (SSL/2)] = 4 # # Assuming a lognormal distribution, the above values for Type I and power, and a # coefficient of variation of 2, determine the minimal detectable increase above # the soil screening level associated with various sample sizes for the one-sample # test. Based on these calculations, you need to take at least 6 soil samples to # satisfy the requirements for the Type I and Type II errors when the coefficient # of variation is 2. N <- 2:8 ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = N, cv = 2, alpha = 0.2, alternative = "greater") names(ratio.of.means) <- paste("N=", N, sep = "") round(ratio.of.means, 1) # N=2 N=3 N=4 N=5 N=6 N=7 N=8 #19.9 7.7 5.4 4.4 3.8 3.4 3.1 #---------- # Repeat the last example, but use the approximate power calculation instead of # the exact. Using the approximate power calculation, you need 7 soil samples # when the coefficient of variation is 2. Note how poorly the approximation # works in this case for small sample sizes! ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = N, cv = 2, alpha = 0.2, alternative = "greater", approx = TRUE) names(ratio.of.means) <- paste("N=", N, sep = "") round(ratio.of.means, 1) # N=2 N=3 N=4 N=5 N=6 N=7 N=8 #990.8 18.5 8.3 5.7 4.6 3.9 3.5 #========== # Clean up #--------- rm(ratio.of.means, N)
Compute the sample size necessary to achieve a specified power for a one- or two-sample t-test, given the scaled difference and significance level.
tTestN(delta.over.sigma, alpha = 0.05, power = 0.95, sample.type = ifelse(!is.null(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE, n2 = NULL, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
delta.over.sigma |
numeric vector specifying the ratio of the true difference |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are:
|
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
n2 |
numeric vector of sample sizes for group 2. The default value is
|
round.up |
logical scalar indicating whether to round up the values of the computed
sample size(s) to the next largest integer. The default value is
|
n.max |
positive integer greater than 1 indicating the maximum sample size when |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
Formulas for the power of the t-test for specified values of
the sample size, scaled difference, and Type I error level are given in
the help file for tTestPower
. The function tTestN
uses the uniroot
search algorithm to determine the
required sample size(s) for specified values of the power,
scaled difference, and Type I error level.
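To see the search criterion in action (a sketch, not the internal code; the design values are arbitrary), the returned sample size should be the smallest n whose power, from tTestPower(), reaches the target:

n <- tTestN(delta.over.sigma = 0.5, power = 0.9)
n                                                    # 44, per the Examples below
tTestPower(n.or.n1 = n,     delta.over.sigma = 0.5)  # should be >= 0.90
tTestPower(n.or.n1 = n - 1, delta.over.sigma = 0.5)  # should be  < 0.90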
When sample.type="one.sample"
, tTestN
returns a numeric vector of sample sizes.
When sample.type="two.sample"
and n2
is not supplied,
equal sample sizes for each group are assumed and tTestN
returns a numeric vector of
sample sizes indicating the required sample size for each group.
When sample.type="two.sample"
and n2
is supplied,
tTestN
returns a list with two components called n1
and
n2
, specifying the sample sizes for each group.
See tTestPower
.
Steven P. Millard ([email protected])
See tTestPower
.
tTestPower
, tTestScaledMdd
, tTestAlpha
,
plotTTestDesign
, Normal,
t.test
, Hypothesis Tests.
# Look at how the required sample size for the one-sample t-test # increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 tTestN(delta.over.sigma = 0.5, power = seq(0.5, 0.9, by = 0.1)) #[1] 18 22 27 34 44 #---------- # Repeat the last example, but compute the sample size based on the # approximation to the power instead of the exact method: tTestN(delta.over.sigma = 0.5, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) #[1] 18 22 27 34 45 #========== # Look at how the required sample size for the two-sample t-test # decreases with increasing scaled difference: seq(0.5, 2,by = 0.5) #[1] 0.5 1.0 1.5 2.0 tTestN(delta.over.sigma = seq(0.5, 2, by = 0.5), sample.type = "two") #[1] 105 27 13 8 #---------- # Look at how the required sample size for the two-sample t-test decreases # with increasing values of Type I error: tTestN(delta.over.sigma = 0.5, alpha = c(0.001, 0.01, 0.05, 0.1), sample.type="two") #[1] 198 145 105 88 #---------- # For the two-sample t-test, compare the total sample size required to # detect a scaled difference of 1 for equal sample sizes versus the case # when the sample size for the second group is constrained to be 20. # Assume a 5% significance level and 95% power. Note that for the case # of equal sample sizes, a total of 54 samples (27+27) are required, # whereas when n2 is constrained to be 20, a total of 62 samples # (42 + 20) are required. tTestN(1, sample.type="two") #[1] 27 tTestN(1, n2 = 20) #$n1 #[1] 42 # #$n2 #[1] 20 #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), determine the # required sample size to detect a mean aldicarb level greater than the MCL # of 7 ppb at the third compliance well with a power of 95%, assuming the # true mean is 10 or 14. Use the estimated standard deviation from the # first four months of data to estimate the true population standard # deviation, use a Type I error level of alpha=0.01, and assume an # upper one-sided alternative (third compliance well mean larger than 7). # (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) # Note that the required sample size changes from 11 to 5 as the true mean # increases from 10 to 14. EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #5 1 Well.2 23.7 #6 2 Well.2 21.9 #7 3 Well.2 26.9 #8 4 Well.2 26.1 #9 1 Well.3 5.6 #10 2 Well.3 3.3 #11 3 Well.3 2.3 #12 4 Well.3 6.9 sigma <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well == "Well.3"])) sigma #[1] 2.101388 tTestN(delta.over.sigma = (c(10, 14) - 7)/sigma, alpha = 0.01, sample.type="one", alternative="greater") #[1] 11 5 # Clean up #--------- rm(sigma)
Compute the power of a one- or two-sample t-test, given the sample size, scaled difference, and significance level.
tTestPower(n.or.n1, n2 = n.or.n1, delta.over.sigma = 0, alpha = 0.05, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
|
delta.over.sigma |
numeric vector specifying the ratio of the true difference The default value is |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are:
|
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
If the arguments n.or.n1
, n2
, delta.over.sigma
, and
alpha
are not all the same length, they are replicated to be the same
length as the length of the longest argument.
One-Sample Case (sample.type="one.sample"
)
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ observations from a normal distribution with mean $\mu$ and standard deviation $\sigma$, and consider the null hypothesis:

$$H_0: \mu = \mu_0 \;\;\;\;\;\; (1)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \mu > \mu_0 \;\;\;\;\;\; (2)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \mu < \mu_0 \;\;\;\;\;\; (3)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \mu \ne \mu_0 \;\;\;\;\;\; (4)$$
The test of the null hypothesis (1) versus any of the three alternatives (2)-(4) is based on the Student t-statistic:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \;\;\;\;\;\; (5)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\;\;\; (6)$$

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\;\;\; (7)$$

Under the null hypothesis (1), the t-statistic in (5) follows a Student's t-distribution with $n - 1$ degrees of freedom (Zar, 2010, Chapter 7; Johnson et al., 1995, pp.362-363).
The formula for the power of the test depends on which alternative is being tested.
The two subsections below describe exact and approximate formulas for the power of
the one-sample t-test. Note that none of the equations for the power of the t-test
requires knowledge of the values $\delta$ (Equation (12) below) or $\sigma$ (the population standard deviation), only the ratio $\delta/\sigma$. The argument delta.over.sigma is this ratio, and it is referred to as the “scaled difference”.
Exact Power Calculations (approx=FALSE
)
This subsection describes the exact formulas for the power of the one-sample
t-test.
Upper one-sided alternative (alternative="greater"
)
The standard Student's t-test rejects the null hypothesis (1) in favor of the upper alternative hypothesis (2) at level-$\alpha$ if

$$t \ge t_{\nu}(1 - \alpha) \;\;\;\;\;\; (8)$$

where

$$\nu = n - 1 \;\;\;\;\;\; (9)$$

and $t_{\nu}(p)$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom (Zar, 2010; Berthouex and Brown, 2002). The power of this test, denoted by $1 - \beta$, where $\beta$ denotes the probability of a Type II error, is given by:

$$1 - \beta = Pr[t_{\nu, \Delta} \ge t_{\nu}(1 - \alpha)] = 1 - G[t_{\nu}(1 - \alpha); \nu, \Delta] \;\;\;\;\;\; (10)$$

where

$$\Delta = \sqrt{n} \left(\frac{\delta}{\sigma}\right) \;\;\;\;\;\; (11)$$

$$\delta = \mu - \mu_0 \;\;\;\;\;\; (12)$$

and $t_{\nu, \Delta}$ denotes a non-central Student's t-random variable with $\nu$ degrees of freedom and non-centrality parameter $\Delta$, and $G(x; \nu, \Delta)$ denotes the cumulative distribution function of this random variable evaluated at $x$ (Johnson et al., 1995, pp.508-510).
Lower one-sided alternative (alternative="less"
)
The standard Student's t-test rejects the null hypothesis (1) in favor of the lower alternative hypothesis (3) at level-$\alpha$ if

$$t \le t_{\nu}(\alpha) \;\;\;\;\;\; (13)$$

and the power of this test is given by:

$$1 - \beta = Pr[t_{\nu, \Delta} \le t_{\nu}(\alpha)] = G[t_{\nu}(\alpha); \nu, \Delta] \;\;\;\;\;\; (14)$$
Two-sided alternative (alternative="two.sided"
)
The standard Student's t-test rejects the null hypothesis (1) in favor of the two-sided alternative hypothesis (4) at level-$\alpha$ if

$$|t| \ge t_{\nu}(1 - \alpha/2) \;\;\;\;\;\; (15)$$

and the power of this test is given by:

$$1 - \beta = G[t_{\nu}(\alpha/2); \nu, \Delta] + 1 - G[t_{\nu}(1 - \alpha/2); \nu, \Delta] \;\;\;\;\;\; (16)$$
The power of the t-test given in Equation (16) can also be expressed in terms of the cumulative distribution function of the non-central F-distribution as follows. Let $F_{\nu_1, \nu_2, \Delta}$ denote a non-central F random variable with $\nu_1$ and $\nu_2$ degrees of freedom and non-centrality parameter $\Delta$, and let $H(x; \nu_1, \nu_2, \Delta)$ denote the cumulative distribution function of this random variable evaluated at $x$. Also, let $F_{\nu_1, \nu_2}(p)$ denote the $p$'th quantile of the central F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom. It can be shown that

$$(t_{\nu, \Delta})^2 \cong F_{1, \nu, \Delta^2} \;\;\;\;\;\; (17)$$

where $\cong$ denotes “equal in distribution”. Thus, it follows that

$$[t_{\nu}(1 - \alpha/2)]^2 = F_{1, \nu}(1 - \alpha) \;\;\;\;\;\; (18)$$

so the formula for the power of the t-test given in Equation (16) can also be written as:

$$1 - \beta = Pr[(t_{\nu, \Delta})^2 \ge F_{1, \nu}(1 - \alpha)] = 1 - H[F_{1, \nu}(1 - \alpha); 1, \nu, \Delta^2] \;\;\;\;\;\; (19)$$
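These exact formulas can be evaluated directly with base R's non-central t and F distribution functions; a minimal sketch for one arbitrary design (n = 5, delta/sigma = 0.5, alpha = 0.05, two-sided):

n <- 5; dos <- 0.5; alpha <- 0.05
nu <- n - 1; Delta <- sqrt(n) * dos
# Equation (16): exact two-sided power via the non-central t-distribution
1 - pt(qt(1 - alpha/2, nu), nu, ncp = Delta) + pt(qt(alpha/2, nu), nu, ncp = Delta)
# Equation (19): the same power via the non-central F-distribution
1 - pf(qf(1 - alpha, 1, nu), 1, nu, ncp = Delta^2)
# Both should agree with tTestPower(n.or.n1 = 5, delta.over.sigma = 0.5),
# about 0.14 in the Examples below.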
Approximate Power Calculations (approx=TRUE
)
Zar (2010, pp.115-118) presents an approximation to the power for the t-test given in Equations (10), (14), and (16) above. His approximation to the power can be derived by using the approximation

$$t_{\nu, \Delta} \approx t_{\nu} + \Delta \;\;\;\;\;\; (20)$$

where $\approx$ denotes “approximately equal to”. Zar's approximation can be summarized in terms of the cumulative distribution function of the non-central t-distribution as follows:

$$G(x; \nu, \Delta) \approx G(x - \Delta; \nu, 0) = G(x - \Delta; \nu) \;\;\;\;\;\; (21)$$

where $G(x; \nu)$ denotes the cumulative distribution function of the central Student's t-distribution with $\nu$ degrees of freedom evaluated at $x$.
The following three subsections explicitly derive the approximation to the power of
the t-test for each of the three alternative hypotheses.
Upper one-sided alternative (alternative="greater"
)
The power for the upper one-sided alternative (2) given in Equation (10) can be approximated as:

$$1 - \beta \approx Pr[t_{\nu} \ge t_{\nu}(1 - \alpha) - \Delta] = 1 - G[t_{\nu}(1 - \alpha) - \Delta; \nu] \;\;\;\;\;\; (22)$$

where $t_{\nu}$ denotes a central Student's t-random variable with $\nu$ degrees of freedom.
Lower one-sided alternative (alternative="less"
)
The power for the lower one-sided alternative (3) given in Equation (14) can be approximated as:

$$1 - \beta \approx Pr[t_{\nu} \le t_{\nu}(\alpha) - \Delta] = G[t_{\nu}(\alpha) - \Delta; \nu] \;\;\;\;\;\; (23)$$
Two-sided alternative (alternative="two.sided"
)
The power for the two-sided alternative (4) given in Equation (16) can be approximated as:

$$1 - \beta \approx G[t_{\nu}(\alpha/2) - \Delta; \nu] + 1 - G[t_{\nu}(1 - \alpha/2) - \Delta; \nu] \;\;\;\;\;\; (24)$$
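A base-R sketch of the two-sided approximation in Equation (24), for the same arbitrary design used above (n = 5, delta/sigma = 0.5, alpha = 0.05):

n <- 5; dos <- 0.5; alpha <- 0.05
nu <- n - 1; Delta <- sqrt(n) * dos
# Central t CDF shifted by Delta in place of the non-central t CDF
1 - pt(qt(1 - alpha/2, nu) - Delta, nu) + pt(qt(alpha/2, nu) - Delta, nu)
# Compare with tTestPower(n.or.n1 = 5, delta.over.sigma = 0.5, approx = TRUE),
# about 0.10 in the Examples below.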
Two-Sample Case (sample.type="two.sample"
)
Let $x_{11}, x_{12}, \ldots, x_{1n_1}$ denote a vector of $n_1$ observations from a normal distribution with mean $\mu_1$ and standard deviation $\sigma$, and let $x_{21}, x_{22}, \ldots, x_{2n_2}$ denote a vector of $n_2$ observations from a normal distribution with mean $\mu_2$ and standard deviation $\sigma$, and consider the null hypothesis:

$$H_0: \mu_1 = \mu_2 \;\;\;\;\;\; (25)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \mu_1 > \mu_2 \;\;\;\;\;\; (26)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \mu_1 < \mu_2 \;\;\;\;\;\; (27)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \mu_1 \ne \mu_2 \;\;\;\;\;\; (28)$$
The test of the null hypothesis (25) versus any of the three alternatives (26)-(28) is based on the Student t-statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \;\;\;\;\;\; (29)$$

where

$$\bar{x}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_{1i} \;\;\;\;\;\; (30)$$

$$\bar{x}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} x_{2i} \;\;\;\;\;\; (31)$$

$$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \;\;\;\;\;\; (32)$$

Under the null hypothesis (25), the t-statistic in (29) follows a Student's t-distribution with $n_1 + n_2 - 2$ degrees of freedom (Zar, 2010, Chapter 8; Johnson et al., 1995, pp.508-510; Helsel and Hirsch, 1992, pp.124-128).
The formulas for the power of the two-sample t-test are precisely the same as those for the one-sample case, with the following modifications:

$$\Delta = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \left(\frac{\delta}{\sigma}\right) \;\;\;\;\;\; (33)$$

$$\delta = \mu_1 - \mu_2 \;\;\;\;\;\; (34)$$

$$\nu = n_1 + n_2 - 2 \;\;\;\;\;\; (35)$$

Note that none of the equations for the power of the t-test requires knowledge of the values $\delta$ or $\sigma$ (the population standard deviation for both populations), only the ratio $\delta/\sigma$. The argument delta.over.sigma is this ratio, and it is referred to as the “scaled difference”.
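A base-R sketch for the two-sample case, using the modifications in Equations (33)-(35) (arbitrary design: n1 = n2 = 10, delta/sigma = 1, alpha = 0.05, two-sided):

n1 <- 10; n2 <- 10; dos <- 1; alpha <- 0.05
nu <- n1 + n2 - 2
Delta <- sqrt(n1 * n2 / (n1 + n2)) * dos
1 - pt(qt(1 - alpha/2, nu), nu, ncp = Delta) + pt(qt(alpha/2, nu), nu, ncp = Delta)
# Should agree with tTestPower(10, sample.type = "two.sample", delta.over.sigma = 1),
# about 0.56 in the Examples below.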
a numeric vector of powers.
The normal distribution and lognormal distribution are probably the two most frequently used distributions to model environmental data. Often, you need to determine whether a population mean is significantly different from a specified standard (e.g., an MCL or ACL, USEPA, 1989b, Section 6), or whether two different means are significantly different from each other (e.g., USEPA 2009, Chapter 16). In this case, assuming normally distributed data, you can perform the Student's t-test.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether a mean differs from a specified level or two means differ from each other.
The functions tTestPower
, tTestN
,
tTestScaledMdd
, and plotTTestDesign
can be used to
investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 28, 31
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
tTestN
, tTestScaledMdd
, tTestAlpha
,
plotTTestDesign
, Normal,
t.test
, Hypothesis Tests.
# Look at how the power of the one-sample t-test increases with # increasing sample size: seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 power <- tTestPower(n.or.n1 = seq(5, 30, by = 5), delta.over.sigma = 0.5) round(power, 2) #[1] 0.14 0.29 0.44 0.56 0.67 0.75 #---------- # Repeat the last example, but use the approximation. # Note how the approximation underestimates the power # for the smaller sample sizes. #---------------------------------------------------- power <- tTestPower(n.or.n1 = seq(5, 30, by = 5), delta.over.sigma = 0.5, approx = TRUE) round(power, 2) #[1] 0.10 0.26 0.42 0.56 0.67 0.75 #---------- # Look at how the power of the two-sample t-test increases with increasing # scaled difference: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 power <- tTestPower(10, sample.type = "two.sample", delta.over.sigma = seq(0.5, 2, by = 0.5)) round(power, 2) #[1] 0.19 0.56 0.89 0.99 #---------- # Look at how the power of the two-sample t-test increases with increasing values # of Type I error: power <- tTestPower(20, sample.type = "two.sample", delta.over.sigma = 0.5, alpha = c(0.001, 0.01, 0.05, 0.1)) round(power, 2) #[1] 0.03 0.14 0.34 0.46 #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), determine how # adding another four months of observations to increase the sample size from # 4 to 8 for any one particular compliance well will affect the power of a # one-sample t-test that compares the mean for the well with the MCL of # 7 ppb. Use alpha = 0.01, assume an upper one-sided alternative # (i.e., compliance well mean larger than 7 ppb), and assume a scaled # difference of 2. (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) # Note that the power changes from 49% to 98% by increasing the sample size # from 4 to 8. tTestPower(n.or.n1 = c(4, 8), delta.over.sigma = 2, alpha = 0.01, sample.type = "one.sample", alternative = "greater") #[1] 0.4865800 0.9835401 #========== # Clean up #--------- rm(power)
Compute the scaled minimal detectable difference necessary to achieve a specified power for a one- or two-sample t-test, given the sample size(s) and Type I error level.
tTestScaledMdd(n.or.n1, n2 = n.or.n1, alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2) && !is.null(n2), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, tol = 1e-07, maxiter = 1000)
n.or.n1 |
numeric vector of sample sizes. When sample.type="one.sample", n.or.n1 denotes the number of observations in the single sample; when sample.type="two.sample", it denotes the number of observations in group 1. |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
n.or.n1. This argument is ignored when sample.type="one.sample". |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is alpha=0.05. |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is power=0.95. |
sample.type |
character string indicating whether to compute the scaled minimal detectable difference based on a one-sample or
two-sample hypothesis test. The possible values are "one.sample" and "two.sample". The default value is "two.sample" when the argument n2 is supplied, and "one.sample" otherwise. |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are:
"two.sided" (the default), "greater", and "less". |
two.sided.direction |
character string indicating the direction (positive or negative) for the
scaled minimal detectable difference when alternative="two.sided". The possible values are "greater" (the default) and "less". This argument is ignored unless alternative="two.sided". |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is approx=FALSE. |
tol |
numeric scalar indicating the tolerance argument to pass to the
uniroot function. The default value is tol=1e-7. |
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the uniroot function. The default value is maxiter=1000. |
Formulas for the power of the t-test for specified values of
the sample size, scaled difference, and Type I error level are given in
the help file for tTestPower
. The function tTestScaledMdd
uses the uniroot
search algorithm to determine the
required scaled minimal detectable difference for specified values of the
sample size, power, and Type I error level.
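This search is easy to illustrate directly. The following is a minimal sketch (assuming EnvStats is attached, and using arbitrary values n = 20, alpha = 0.05, and power = 0.80) that inverts the power function with uniroot and compares the result with the value returned by tTestScaledMdd:

# Solve tTestPower(n, delta.over.sigma = d) - 0.80 = 0 for d
f <- function(d) {
  tTestPower(n.or.n1 = 20, delta.over.sigma = d, alpha = 0.05,
    sample.type = "one.sample", alternative = "two.sided") - 0.80
}
uniroot(f, interval = c(0.001, 5), tol = 1e-7)$root

# Compare with the value computed directly:
tTestScaledMdd(n.or.n1 = 20, alpha = 0.05, power = 0.80)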
numeric vector of scaled minimal detectable differences.
See tTestPower
.
Steven P. Millard ([email protected])
See tTestPower
.
tTestPower
, tTestAlpha
,
tTestN
,
plotTTestDesign
, Normal,
t.test
, Hypothesis Tests.
# Look at how the scaled minimal detectable difference for the # one-sample t-test increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 scaled.mdd <- tTestScaledMdd(n.or.n1 = 20, power = seq(0.5,0.9,by=0.1)) round(scaled.mdd, 2) #[1] 0.46 0.52 0.59 0.66 0.76 #---------- # Repeat the last example, but compute the scaled minimal detectable # differences based on the approximation to the power instead of the # exact formula: scaled.mdd <- tTestScaledMdd(n.or.n1 = 20, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) round(scaled.mdd, 2) #[1] 0.47 0.53 0.59 0.66 0.76 #========== # Look at how the scaled minimal detectable difference for the two-sample # t-test decreases with increasing sample size: seq(10,50,by=10) #[1] 10 20 30 40 50 scaled.mdd <- tTestScaledMdd(seq(10, 50, by = 10), sample.type = "two") round(scaled.mdd, 2) #[1] 1.71 1.17 0.95 0.82 0.73 #---------- # Look at how the scaled minimal detectable difference for the two-sample # t-test decreases with increasing values of Type I error: scaled.mdd <- tTestScaledMdd(20, alpha = c(0.001, 0.01, 0.05, 0.1), sample.type="two") round(scaled.mdd, 2) #[1] 1.68 1.40 1.17 1.06 #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), # determine the minimal mean level of aldicarb at the third compliance # well necessary to detect a mean level of aldicarb greater than the # MCL of 7 ppb, assuming 90%, 95%, and 99% power. Use a 99% significance # level and assume an upper one-sided alternative (third compliance well # mean larger than 7). Use the estimated standard deviation from the # first four months of data to estimate the true population standard # deviation in order to determine the minimal detectable difference based # on the computed scaled minimal detectable difference, then use this # minimal detectable difference to determine the mean level of aldicarb # necessary to detect a difference. (The data are stored in # EPA.09.Ex.21.1.aldicarb.df.) # # Note that the scaled minimal detectable difference changes from 3.4 to # 3.9 to 4.7 as the power changes from 90% to 95% to 99%. Thus, the # minimal detectable difference changes from 7.2 to 8.1 to 9.8, and the # minimal mean level of aldicarb changes from 14.2 to 15.1 to 16.8. EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #5 1 Well.2 23.7 #6 2 Well.2 21.9 #7 3 Well.2 26.9 #8 4 Well.2 26.1 #9 1 Well.3 5.6 #10 2 Well.3 3.3 #11 3 Well.3 2.3 #12 4 Well.3 6.9 sigma <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well == "Well.3"])) sigma #[1] 2.101388 scaled.mdd <- tTestScaledMdd(n.or.n1 = 4, alpha = 0.01, power = c(0.90, 0.95, 0.99), sample.type="one", alternative="greater") scaled.mdd #[1] 3.431501 3.853682 4.668749 mdd <- scaled.mdd * sigma mdd #[1] 7.210917 8.098083 9.810856 minimal.mean <- mdd + 7 minimal.mean #[1] 14.21092 15.09808 16.81086 #========== # Clean up #--------- rm(scaled.mdd, sigma, mdd, minimal.mean)
Two-sample linear rank test to detect a difference (usually a shift) between two
distributions. The Wilcoxon Rank Sum test is a special case of
a linear rank test. The function twoSampleLinearRankTest
is part of
EnvStats mainly because this help file gives the necessary background to
explain two-sample linear rank tests for censored data (see twoSampleLinearRankTestCensored
).
twoSampleLinearRankTest(x, y, location.shift.null = 0, scale.shift.null = 1, alternative = "two.sided", test = "wilcoxon", shift.type = "location")
x |
numeric vector of values for the first sample.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
y |
numeric vector of values for the second sample.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
location.shift.null |
numeric scalar indicating the hypothesized value of the location shift between the two distributions under the null hypothesis. The default value is location.shift.null=0. This argument is ignored when shift.type="scale". |
scale.shift.null |
numeric scalar indicating the hypothesized value of the scale shift between the two distributions under the null hypothesis. The default value is scale.shift.null=1. This argument is ignored when shift.type="location". |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are "two.sided" (the default), "greater", and "less". |
test |
character string indicating which linear rank test to use. The possible values are:
"wilcoxon" (the default), "normal.scores", "moods.median", and "savage.scores". |
shift.type |
character string indicating which kind of shift is being tested. The possible values
are "location" (the default) and "scale". |
The function twoSampleLinearRankTest
allows you to compare two samples using
a locally most powerful rank test (LMPRT) to determine whether the two samples come
from the same distribution. The sections below explain the concepts of location and
scale shifts, linear rank tests, and LMPRT's.
Definitions of Location and Scale Shifts

Let X denote a random variable representing measurements from group 1 with
cumulative distribution function (cdf):

  F_X(t) = Pr(X ≤ t)        (1)

and let x_1, x_2, ..., x_m denote m independent observations from this
distribution. Let Y denote a random variable from group 2 with cdf:

  F_Y(t) = Pr(Y ≤ t)        (2)

and let y_1, y_2, ..., y_n denote n independent observations from this
distribution. Set N = m + n.
General Hypotheses to Test Differences Between Two Populations

A very general hypothesis to test whether two distributions are the same is
given by:

  H0: F_X(t) = F_Y(t) for all t        (3)

versus the two-sided alternative hypothesis:

  Ha: F_X(t) ≠ F_Y(t)        (4)

with strict inequality for at least one value of t.
The two possible one-sided hypotheses would be:

  H0: F_X(t) ≥ F_Y(t) for all t        (5)

versus the alternative hypothesis:

  Ha: F_X(t) < F_Y(t) for at least one value of t        (6)

and

  H0: F_X(t) ≤ F_Y(t) for all t        (7)

versus the alternative hypothesis:

  Ha: F_X(t) > F_Y(t) for at least one value of t        (8)
A similar set of hypotheses to test whether the two distributions are the same is given by (Conover, 1980, p. 216):

  H0: Pr(X < Y) = Pr(X > Y)        (9)

versus the two-sided alternative hypothesis:

  Ha: Pr(X < Y) ≠ Pr(X > Y)        (10)

or

  H0: Pr(X < Y) ≥ Pr(X > Y)        (11)

versus the alternative hypothesis:

  Ha: Pr(X < Y) < Pr(X > Y)        (12)

or

  H0: Pr(X < Y) ≤ Pr(X > Y)        (13)

versus the alternative hypothesis:

  Ha: Pr(X < Y) > Pr(X > Y)        (14)

Note that this second set of hypotheses (9)–(14) is not equivalent to the
set of hypotheses (3)–(8). For example, if X takes on the values 1 and 4
with probability 1/2 for each, and Y only takes on values in the interval
(1, 4) with strict inequality at the endpoints (e.g., Y takes on the values
2 and 3 with probability 1/2 for each), then the null hypothesis (9) is
true but the null hypothesis (3) is not true. However, the null hypothesis (3)
implies the null hypothesis (9), (5) implies (11), and (7) implies (13).
Location Shift

A special case of the alternative hypotheses (4), (6), and (8) above is the
location shift alternative:

  Ha: F_Y(t) = F_X(t + Δ)        (15)

where Δ denotes the shift between the two groups. (Note: some references
refer to (15) above as a shift in the median, but in fact this kind of shift
represents a shift in every single quantile, not just the median.)
If Δ is positive, this means that observations in group 1 tend to be
larger than observations in group 2, and if Δ is negative, observations
in group 1 tend to be smaller than observations in group 2.
The alternative hypothesis (15) is called a location shift: the only difference between the two distributions is a difference in location (e.g., the standard deviation is assumed to be the same for both distributions). A location shift is not applicable to distributions that are bounded below or above by some constant, such as a lognormal distribution. For lognormal distributions, the location shift could refer to a shift in location of the distribution of the log-transformed observations.
For a location shift, the null hypothesis (3) can be generalized as:

  H0: F_Y(t) = F_X(t + Δ0) for all t        (16)

where Δ0 denotes the null shift between the two groups. Almost always,
however, the null shift is taken to be 0 and we will assume this for the rest of this
help file.
Alternatively, the null and alternative hypotheses can be written as

  H0: Δ = 0        (17)

versus the alternative hypothesis

  Ha: Δ > 0        (18)

The other one-sided alternative hypothesis (Ha: Δ < 0) and two-sided
alternative hypothesis (Ha: Δ ≠ 0) could be considered as well.
The general hypotheses (3)-(14) are not location shift hypotheses
(e.g., the standard deviation does not have to be the same for both distributions),
but they do allow for distributions that are bounded below or above by a constant
(e.g., lognormal distributions).
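As a small numerical sketch of the point that a location shift moves every quantile (the shift of 0.75 below is an arbitrary illustrative value):

set.seed(23)
x <- rnorm(1000, mean = 3, sd = 1)   # group 2
y <- x + 0.75                        # group 1, shifted up by Delta = 0.75
# Every sample quantile of the shifted data is larger by exactly 0.75:
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)
quantile(y, probs = p) - quantile(x, probs = p)
#  10%  25%  50%  75%  90%
# 0.75 0.75 0.75 0.75 0.75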
Scale Shift

A special kind of scale shift replaces the alternative hypothesis (15) with the
alternative hypothesis:

  Ha: F_Y(t) = F_X(t / τ)        (19)

where τ denotes the shift in scale between the two groups. Alternatively,
the null and alternative hypotheses for this scale shift can be written as

  H0: τ = 1        (20)

versus the alternative hypothesis

  Ha: τ > 1        (21)

The other one-sided alternative hypothesis (Ha: τ < 1) and two-sided alternative
hypothesis (Ha: τ ≠ 1) could be considered as well.
This kind of scale shift often involves a shift in both location and scale. For
example, suppose the underlying distribution for both groups is
exponential, with parameter rate=λ. Then
the mean and standard deviation of the reference group are both 1/λ, while
the mean and standard deviation of the treatment group are both τ/λ. In
this case, the alternative hypothesis (21) implies the more general alternative
hypothesis (8).
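A quick sketch of this exponential example (the rate of 1 and scale shift of 2 below are arbitrary illustrative values): multiplying an exponential random variable by a constant multiplies both its mean and its standard deviation by that constant.

set.seed(47)
lambda <- 1                         # rate parameter for the reference group
tau    <- 2                         # scale shift
x <- rexp(100000, rate = lambda)    # reference group
y <- tau * x                        # treatment group: exponential with rate lambda/tau
c(mean(x), sd(x))   # both close to 1/lambda = 1
c(mean(y), sd(y))   # both close to tau/lambda = 2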
Linear Rank Tests

The usual nonparametric test to test the null hypothesis of the same distribution
for both groups versus the location-shift alternative (18) is the
Wilcoxon Rank Sum test
(Gilbert, 1987, pp.247-250; Helsel and Hirsch, 1992, pp.118-123;
Hollander and Wolfe, 1999). Note that the Mann-Whitney U test is equivalent to the
Wilcoxon Rank Sum test (Hollander and Wolfe, 1999; Conover, 1980, p.215,
Zar, 2010). Hereafter, this test will be abbreviated as the MWW test. The MWW test
is performed by combining the m X observations with the n Y observations and
ranking them from smallest to largest, and then computing the statistic

  W = R_1 + R_2 + ... + R_m        (22)

where R_1, R_2, ..., R_m denote the ranks of the X observations when
the X and Y observations are combined and ranked. The null
hypothesis (5), (11), or (17) is rejected in favor of the alternative hypothesis
(6), (12) or (18) if the value of W is too large. For small sample sizes,
the exact distribution of W under the null hypothesis is fairly easy to
compute and may be found in tables (e.g., Hollander and Wolfe, 1999;
Conover, 1980, pp.448-452). For larger sample sizes, a normal approximation is
usually used (Hollander and Wolfe, 1999; Conover, 1980, p.217). For the
R function wilcox.test, an exact p-value is computed if the
samples contain less than 50 finite values and there are no ties.
It is important to note that the MWW test is actually testing the more general hypotheses (9)-(14) (Conover, 1980, p.216; Divine et al., 2013), even though it is often presented as only applying to location shifts.
The MWW W-statistic in Equation (22) is an example of a linear rank statistic (Hettmansperger, 1984, p.147; Prentice, 1985), which is any statistic that can be written in the form:

  T = a(R_1) + a(R_2) + ... + a(R_m)        (23)

where a() denotes a score function. Statistics of this form are also called
general scores statistics (Hettmansperger, 1984, p.147). The MWW test
uses the identity score function:

  a(R_i) = R_i        (24)

Any test based on a linear rank statistic is called a linear rank test.
Under the null hypothesis (3), (9), (17), or (20), the distribution of the linear
rank statistic T does not depend on the form of the underlying distribution of
the X and Y observations. Hence, tests based on T are
nonparametric (also called distribution-free). If the null hypothesis is not true,
however, the distribution of T will depend not only on the distributions of the
X and Y observations, but also upon the form of the score function a().
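For instance, the rank-sum statistic of Equation (22) is easy to compute by hand, and for untied data it differs from the statistic reported by wilcox.test only by the constant m(m+1)/2 (wilcox.test reports the Mann-Whitney form). A minimal sketch:

set.seed(12)
x <- rnorm(15, mean = 3)     # group 1
y <- rnorm(10, mean = 3.5)   # group 2
m <- length(x)
r <- rank(c(x, y))           # ranks in the combined sample
W <- sum(r[1:m])             # linear rank statistic with identity scores, Equation (22)
W
# wilcox.test reports the Mann-Whitney U statistic, which equals W - m(m+1)/2:
wilcox.test(x, y)$statistic + m * (m + 1) / 2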
Locally Most Powerful Linear Rank Tests

The decision of what scores to use may be based on considering the power of the test.
A locally most powerful rank test (LMPRT) of the null hypothesis (17) versus the
alternative (18) maximizes the slope of the power (as a function of Δ) in
the neighborhood where Δ = 0. A LMPRT of the null hypothesis (20) versus
the alternative (21) maximizes the slope of the power (as a function of τ)
in the neighborhood where τ = 1. That is, LMPRT's are the best linear rank
tests you can use for detecting small shifts in location or scale.
Table 1 below shows the score functions associated with the LMPRT's for various assumed underlying distributions (Hettmansperger, 1984, Chapter 3; Millard and Deverel, 1988, p.2090). A test based on the identity score function of Equation (24) is equivalent to a test based on the score shown in Table 1 associated with the logistic distribution, thus the MWW test is the LMPRT for detecting a location shift when the underlying observations follow the logistic distribution. When the underlying distribution is normal or lognormal, the LMPRT for a location shift uses the “Normal scores” shown in Table 1. When the underlying distribution is exponential, the LMPRT for detecting a scale shift is based on the “Savage scores” shown in Table 1.
Table 1. Scores of LMPRT's for Various Distributions

Distribution | Score | Shift Type | Test Name
Logistic | 2 R_i / (N+1) - 1 | Location | Wilcoxon Rank Sum
Normal or Lognormal (log-scale) | Φ⁻¹[ R_i / (N+1) ] * | Location | Van der Waerden or Normal scores
Double Exponential | sign[ R_i - (N+1)/2 ] | Location | Mood's Median
Exponential or Extreme Value | Σ_{j=1}^{R_i} 1/(N-j+1)  -  1 | Scale | Savage scores
* Denotes an approximation to the true score. The symbol Φ denotes the
cumulative distribution function of the standard normal distribution, and
sign denotes the sign function.
A large sample normal approximation to the distribution of the linear rank statistic
for arbitrary score functions is given by Hettmansperger (1984, p.148).
Under the null hypothesis (17) or (20), the mean and variance of T
are given by:

  E(T) = m ā,   where ā = (1/N) Σ_{i=1}^{N} a(i)

  Var(T) = [ mn / (N(N-1)) ] Σ_{i=1}^{N} [ a(i) - ā ]²

Hettmansperger (1984, Chapter 3) shows that under the null hypothesis of no difference between the two groups, the statistic

  z = [ T - E(T) ] / √Var(T)

is approximately distributed as a standard normal random variable for “large” sample sizes. This statistic will tend to be large if the observations in group 1 tend to be larger than the observations in group 2.
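To make this approximation concrete, the following sketch computes T with identity scores (the MWW test), its permutation mean and variance, and the resulting z-statistic; with no ties, the two-sided p-value should agree with wilcox.test called with exact=FALSE and correct=FALSE:

set.seed(5)
x <- rnorm(15, mean = 3)     # group 1
y <- rnorm(10, mean = 3.5)   # group 2
m <- length(x); n <- length(y); N <- m + n
a <- seq_len(N)                          # identity scores a(i) = i (MWW test)
r <- rank(c(x, y))
T.stat <- sum(a[r[1:m]])                 # equals the rank sum W here
mean.T <- m * mean(a)
var.T  <- m * n / (N * (N - 1)) * sum((a - mean(a))^2)
z      <- (T.stat - mean.T) / sqrt(var.T)
2 * pnorm(-abs(z))                       # two-sided p-value, normal approximation
# With no ties this agrees with:
wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value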
a list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
The Wilcoxon Rank Sum test, also known as the Mann-Whitney U test, is the standard nonparametric test used to test for differences between two groups (e.g., Zar, 2010; USEPA, 2009, pp.16-14 to 16-20). Other possible nonparametric tests include linear rank tests based on scores other than the ranks, including the “normal scores” test and the “Savage scores” tests. The normal scores test is actually slightly more powerful than the Wilcoxon Rank Sum test for detecting small shifts in location if the underlying distribution is normal or lognormal. In general, however, there will be little difference between these two tests.
The results of calling the function twoSampleLinearRankTest
with the
argument test="wilcoxon"
will match those of calling the built-in
R function wilcox.test
with the arguments exact=FALSE
and
correct=FALSE
. In general, it is better to use the built-in function
wilcox.test
for performing the Wilcoxon Rank Sum test, since this
function can compute exact (rather than approximate) p-values.
Steven P. Millard ([email protected])
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, Chapter 4.
Divine, G., H.J. Norton, R. Hunt, and J. Dinemann. (2013). A Review of Analysis and Sample Size Calculation Considerations for Wilcoxon Tests. Anesthesia & Analgesia 117, 699–710.
Hettmansperger, T.P. (1984). Statistical Inference Based on Ranks. John Wiley and Sons, New York, 323pp.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods, Second Edition. John Wiley and Sons, New York.
Millard, S.P., and S.J. Deverel. (1988). Nonparametric Statistical Methods for Comparing Two Sites Based on Data With Multiple Nondetect Limits. Water Resources Research, 24(12), 2087–2098.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.432–435.
Prentice, R.L. (1985). Linear Rank Tests. In Kotz, S., and N.L. Johnson, eds. Encyclopedia of Statistical Science. John Wiley and Sons, New York. Volume 5, pp.51–58.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
wilcox.test
, twoSampleLinearRankTestCensored
,
htest.object
.
# Generate 15 observations from a normal distribution with parameters # mean=3 and sd=1. Call these the observations from the reference group. # Generate 10 observations from a normal distribution with parameters # mean=3.5 and sd=1. Call these the observations from the treatment group. # Compare the results of calling wilcox.test to those of calling # twoSampleLinearRankTest with test="normal.scores". # (The call to set.seed allows you to reproduce this example.) set.seed(346) x <- rnorm(15, mean = 3) y <- rnorm(10, mean = 3.5) wilcox.test(x, y) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: location shift = 0 # #Alternative Hypothesis: True location shift is not equal to 0 # #Test Name: Wilcoxon rank sum test # #Data: x and y # #Test Statistic: W = 32 # #P-value: 0.0162759 twoSampleLinearRankTest(x, y, test = "normal.scores") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Normal Scores Test # Based on Normal Approximation # #Data: x = x # y = y # #Sample Sizes: nx = 15 # ny = 10 # #Test Statistic: z = -2.431099 # #P-value: 0.01505308 #---------- # Clean up #--------- rm(x, y) #========== # Following Example 6.6 on pages 6.22-6.26 of USEPA (1994b), perform the # Wilcoxon Rank Sum test for the TcCB data (stored in EPA.94b.tccb.df). # There are m=47 observations from the reference area and n=77 observations # from the cleanup unit. Then compare the results using the other available # linear rank tests. Note that Mood's median test yields a p-value less # than 0.10, while the other tests yield non-significant p-values. # In this case, Mood's median test is picking up the residual contamination # in the cleanup unit. (See the example in the help file for quantileTest.) names(EPA.94b.tccb.df) #[1] "TcCB.orig" "TcCB" "Censored" "Area" summary(EPA.94b.tccb.df$Area) # Cleanup Reference # 77 47 with(EPA.94b.tccb.df, twoSampleLinearRankTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"])) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Wilcoxon Rank Sum Test # Based on Normal Approximation # #Data: x = TcCB[Area == "Cleanup"] # y = TcCB[Area == "Reference"] # #Sample Sizes: nx = 77 # ny = 47 # #Test Statistic: z = -1.171872 # #P-value: 0.2412485 with(EPA.94b.tccb.df, twoSampleLinearRankTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], test="normal.scores"))$p.value #[1] 0.3399484 with(EPA.94b.tccb.df, twoSampleLinearRankTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], test="moods.median"))$p.value #[1] 0.09707393 with(EPA.94b.tccb.df, twoSampleLinearRankTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], test="savage.scores"))$p.value #[1] 0.2884351
Two-sample linear rank test to detect a difference (usually a shift) between two distributions based on censored data.
twoSampleLinearRankTestCensored(x, x.censored, y, y.censored, censoring.side = "left", location.shift.null = 0, scale.shift.null = 1, alternative = "two.sided", test = "logrank", variance = "hypergeometric", surv.est = "prentice", shift.type = "location")
x |
numeric vector of values for the first sample.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
x.censored |
numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of x.censored is "logical", TRUE values correspond to elements of x that are censored and FALSE values correspond to elements of x that are not censored. |
y |
numeric vector of values for the second sample.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
y.censored |
numeric or logical vector indicating which values of y are censored. This must be the same length as y. If the mode of y.censored is "logical", TRUE values correspond to elements of y that are censored and FALSE values correspond to elements of y that are not censored. |
censoring.side |
character string indicating on which side the censoring occurs for the data in
x and y. The possible values are "left" (the default) and "right". |
location.shift.null |
numeric scalar indicating the hypothesized value of the location shift between the two distributions under the null hypothesis. The default value is location.shift.null=0. This argument is ignored when shift.type="scale". |
scale.shift.null |
numeric scalar indicating the hypothesized value of the scale shift between the two distributions under the null hypothesis. The default value is scale.shift.null=1. This argument is ignored when shift.type="location". |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are "two.sided" (the default), "greater", and "less". |
test |
character string indicating which linear rank test to use. The possible values are:
"logrank" (the default), "tarone-ware", "gehan", "peto-peto", "normal.scores.1", "normal.scores.2", and "generalized.sign". |
variance |
character string indicating which kind of variance to compute for the test. The
possible values are: "hypergeometric" (the default), "permutation", and "asymptotic". |
surv.est |
character string indicating what method to use to estimate the survival function.
The possible values are "prentice" (the default), "kaplan-meier", "peto-peto", and "altshuler". |
shift.type |
character string indicating which kind of shift is being tested. The possible values
are "location" (the default) and "scale". |
The function twoSampleLinearRankTestCensored
allows you to compare two
samples containing censored observations using a linear rank test to determine
whether the two samples came from the same distribution. The help file for
twoSampleLinearRankTest
explains linear rank tests for complete data
(i.e., no censored observations are present), and here we assume you are
familiar with that material. The sections below explain how linear
rank tests can be extended to the case of censored data.
Notation
Several authors have proposed extensions of the MWW test to the case of censored
data, mainly in the context of survival analysis (e.g., Breslow, 1970; Cox, 1972;
Gehan, 1965; Mantel, 1966; Peto and Peto, 1972; Prentice, 1978). Prentice (1978)
showed how all of these proposed tests are extensions of a linear rank test to the
case of censored observations.
Survival analysis usually deals with right-censored data, whereas environmental data is rarely right-censored but often left-censored (some observations are reported as less than some detection limit). Fortunately, all of the methods developed for right-censored data can be applied to left-censored data as well. (See the sub-section Left-Censored Data below.)
In order to explain Prentice's (1978) generalization of linear rank tests to censored
data, we will use the following notation that closely follows Prentice (1978),
Prentice and Marek (1979), and Latta (1981).
Let X denote a random variable representing measurements from group 1 with
cumulative distribution function (cdf):

  F_X(t) = Pr(X ≤ t)        (1)

and let x_1, x_2, ..., x_m denote m independent observations from this
distribution. Let Y denote a random variable from group 2 with cdf:

  F_Y(t) = Pr(Y ≤ t)        (2)

and let y_1, y_2, ..., y_n denote n independent observations from this
distribution. Set N = m + n, the total number of observations.
Assume the data are right-censored so that some observations are only recorded as
greater than some censoring level, with possibly several different censoring levels.
Let t_1 < t_2 < ... < t_k denote the k ordered, unique, uncensored
observations for the combined samples (in the context of survival data,
t usually stands for “time of death”). For j = 1, 2, ..., k, let
d_{1j} denote the number of observations from sample 1 (the x observations) that are
equal to t_j, and let d_{2j} denote the number of observations from sample 2 (the y
observations) equal to this value. Set

  d_j = d_{1j} + d_{2j}

the total number of observations equal to t_j. If there are no tied
uncensored observations, then d_j = 1 for j = 1, 2, ..., k,
otherwise d_j is greater than 1 for at least one value of j.
For j = 1, 2, ..., k, let e_{1j} denote the number of censored
observations from sample 1 (the x observations) with censoring levels that
fall into the interval [t_j, t_{j+1}), where t_{k+1} = ∞ by
definition, and let e_{2j} denote the number of censored observations from
sample 2 (the y observations) with censoring levels that fall into this
interval. Set

  e_j = e_{1j} + e_{2j}

the total number of censoring levels that fall into this interval.
Finally, set n_{1j} equal to the number of observations from sample 1
(uncensored and censored) known to be greater than or equal to t_j, i.e.,
that lie in the interval [t_j, ∞),
set n_{2j} equal to the number of observations from sample 2
(uncensored and censored) that lie in this interval, and set

  n_j = n_{1j} + n_{2j}

In survival analysis jargon, n_{1j} denotes the number of people from
sample 1 who are “at risk” at time t_j, that is, these people are
known to still be alive at this time. Similarly, n_{2j} denotes the number
of people from sample 2 who are at risk at time t_j, and n_j denotes
the total number of people at risk at time t_j.
Score Statistics for Multiply Censored Data

Prentice's (1978) generalization of the two-sample score (linear rank) statistic is
given by:

  ν = Σ_{j=1}^{k} ( c_j d_{1j} + C_j e_{1j} )        (6)

where c_j and C_j denote the scores associated with the uncensored and
censored observations, respectively. As for complete data, the form of the scores
depends upon the assumed underlying distribution. Table 1 below shows scores for
various assumed distributions as presented in Prentice (1978) and Latta (1981)
(also see Table 5 of Millard and Deverel, 1988, p.2091).
The last column shows what these tests reduce to in the case of complete data
(no censored observations).
Table 1. Scores Associated with Various Censored Data Rank Tests

Distribution | Uncensored Score | Censored Score | Test Name | Uncensored Analogue
Logistic | | | Peto-Peto | Wilcoxon Rank Sum
" | | | Gehan or Breslow | "
" | | | Tarone-Ware | "
Normal, Lognormal | | | Normal Scores 1 | Normal Scores
" | | | Normal Scores 2 | "
Double Exponential | | | Generalized Sign | Mood's Median
Exponential, Extreme Value | | | Logrank | Savage Scores

(See Prentice (1978), Latta (1981), and Table 5 of Millard and Deverel (1988, p.2091) for the explicit score expressions.)
In Table 1 above, Φ denotes the cumulative distribution function of the
standard normal distribution, φ denotes the probability density function
of the standard normal distribution, and sign denotes the sign function.
Also, the quantities F(t_j) and S(t_j) denote the estimates of the cumulative
distribution function (cdf) and survival function, respectively, at time t_j
for the combined sample. The estimated cdf is related to the estimated survival
function by:

  F(t_j) = 1 - S(t_j)

The quantity S*(t_j) denotes the Altshuler (1970) estimate of the
survival function at time t_j for the combined sample (see below).
The argument surv.est determines what method to use to estimate the survival
function. When surv.est="prentice" (the default), the survival function is
estimated using the estimator of Prentice (1978). When surv.est="kaplan-meier",
the survival function is estimated as:

  S(t_j) = Π_{i: t_i ≤ t_j} [ (n_i - d_i) / n_i ]

(Kaplan and Meier, 1958), and when surv.est="peto-peto", the survival
function is estimated using the estimator of Peto and Peto (1972). All three of these estimators
of the survival function should produce very similar results. When
surv.est="altshuler", the survival function is estimated as:

  S*(t_j) = exp[ - Σ_{i: t_i ≤ t_j} (d_i / n_i) ]

(Altshuler, 1970). The scores for the logrank test use this estimator of survival.
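Given the counts d_j and n_j for the combined sample, the Kaplan-Meier and Altshuler estimates are simple cumulative products and sums. A minimal sketch using made-up counts (the death and at-risk numbers below are arbitrary, chosen only so that n is non-increasing and d ≤ n):

d <- c(2, 1, 1, 1)    # deaths at each distinct uncensored value (combined sample)
n <- c(9, 6, 4, 2)    # number at risk at each of those values
S.km  <- cumprod((n - d) / n)   # Kaplan-Meier estimate of survival
S.alt <- exp(-cumsum(d / n))    # Altshuler estimate, used for the logrank scores
rbind(Kaplan.Meier = round(S.km, 3), Altshuler = round(S.alt, 3))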
Lee and Wang (2003, p. 116) present a slightly different version of the Peto-Peto
test. They use the Peto-Peto estimate of the survival function for , but
use the Kaplan-Meier estimate of the survival function for
.
The scores for the “Normal Scores 1” test shown in Table 1 above are based on
the approximation (30) of Prentice (1978). The scores for the
“Normal Scores 2” test are based on equation (7) of Prentice and Marek (1979).
For the “Normal Scores 2” test, the following rules are used to construct the
scores for the censored observations: , and
if
.
The Distribution of the Score Statistic
Under the null hypothesis that the two distributions are the same, the expected
value of the score statistic ν in Equation (6) is 0. The variance of
ν can be computed in at least three different ways. If the censoring
mechanism is the same for both groups, the permutation variance is
appropriate (variance="permutation").
Often, however, it is not clear whether this assumption is valid, and both Prentice (1978) and Prentice and Marek (1979) caution against using the permutation variance (Prentice and Marek, 1979, state it can lead to inflated estimates of variance).
If the censoring mechanisms for the two groups are not necessarily the same, a more
general estimator of the variance is based on a conditional permutation approach. In
this case, the statistic in Equation (6) is re-written as a weighted sum of the
differences between the observed and expected numbers of uncensored sample 1
observations at each uncensored value:

  ν = Σ_{j=1}^{k} w_j [ d_{1j} - (n_{1j} d_j / n_j) ]        (13)

where the weights w_j are determined by the scores given above in Table 1,
and the conditional permutation or hypergeometric estimate
(variance="hypergeometric") of the variance is given by:

  Var(ν) = Σ_{j=1}^{k} w_j² [ n_{1j} n_{2j} d_j (n_j - d_j) ] / [ n_j² (n_j - 1) ]

(Prentice and Marek, 1979; Latta, 1981; Millard and Deverel, 1988). Note that Equation (13) can be thought of as the sum of weighted values of observed minus expected observations.
Prentice (1978) derived an asymptotic estimator of the variance of the score
statistic ν given in Equation (6) above based on the log likelihood of the
rank vector (variance="asymptotic"). This estimator is the same as the
hypergeometric variance estimator for the logrank and Gehan tests (assuming no
tied uncensored observations), but for the Peto-Peto test it takes a different form
(Prentice, 1978; Latta, 1981; Millard and Deverel, 1988). Note that equation (14)
of Millard and Deverel (1988) contains a typographical error.
The Treatment of Ties

If the hypergeometric estimator of variance is being used, no modifications need to
be made for ties; Equations (13)-(15) already account for ties. For the case of the
permutation or asymptotic variance estimators, Equations (6), (12), and (16) all
assume no ties in the uncensored observations. If ties exist in the uncensored
observations, Prentice (1978) suggests computing the scores shown in Table 1
above as if there were no ties, and then assigning average scores to the
tied observations. (This modification also applies to the corresponding quantities
in Equation (16) above.) For this algorithm, the ν
statistic in Equation (6) is not in general the same as the one in Equation (13).
Computing a Test Statistic

Under the null hypothesis that the two distributions are the same, the statistic

  z = ν / √Var(ν)        (20)

is approximately distributed as a standard normal random variable for “large”
sample sizes. This statistic will tend to be large if the observations in
group 1 (the x observations) tend to be larger than the observations in
group 2 (the y observations).
Left-Censored Data
Most of the time, if censored observations occur in environmental data, they are
left-censored (e.g., observations are reported as less than one or more detection
limits). For the two-sample test of differences between groups, the methods that
apply to right-censored data are easily adapted to left-censored data: simply
multiply the observations by -1, compute the z-statistic shown in Equation
(20), then reverse the sign of this statistic before computing the p-value.
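One way to see this recipe in action is with the survdiff function from the survival package, which expects right-censored data: reversing the ordering of the concentrations turns a value reported as "less than a detection limit" into a right-censored value. The sketch below uses made-up numbers; it illustrates the idea only, and will not necessarily match twoSampleLinearRankTestCensored exactly because of differences in tie handling and variance choices (twoSampleLinearRankTestCensored handles the sign of z and the flipping internally).

library(survival)
# Made-up left-censored concentrations; TRUE means the value is a nondetect ("< value")
conc  <- c(1, 2, 2, 5, 8, 3, 4, 6, 9, 12)
cens  <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
group <- rep(c("site1", "site2"), each = 5)

# Reverse the ordering so left-censored values become right-censored ones
# (any order-reversing transformation works for a rank-based test):
flip <- max(conc) - conc
survdiff(Surv(flip, !cens) ~ group, rho = 0)   # logrank chi-square on the flipped data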
a list of class "htestCensored"
containing the results of the hypothesis test.
See the help file for htestCensored.object
for details.
All of the tests computed by twoSampleLinearRankTestCensored
(logrank, Tarone-Ware, Gehan, Peto-Peto, normal scores, and generalized sign)
are based on a
statistic that is essentially the sum over all uncensored time points of the
weighted difference between the observed and expected number of observations at each
time point (see Equation (15) above). The tests differ in how they weight the
differences between the observed and expected number of observations.
Prentice and Marek (1979) point out that the Gehan test uses weights that depend on the censoring rates within each group and can lead to non-significant outcomes in the case of heavy censoring when in fact a very large difference between the two groups exists.
Latta (1981) performed a Monte Carlo simulation to study the power of the Gehan,
logrank, and Peto-Peto tests using all three different estimators of variance
(permutation, hypergeometric, and asymptotic). He used lognormal, Weibull, and
exponential distributions to generate the observations, and studied two different
cases of censoring: uniform censoring for both samples vs. no censoring in the first
sample and uniform censoring in the second sample. Latta (1981) used sample sizes
of 10 and 50 (both the equal and unequal cases were studied). Latta (1981) found
that all three tests maintained the nominal Type I error level (α-level)
in the case of equal sample sizes and equal censoring. Also, the Peto-Peto test
based on the asymptotic variance appeared to maintain the nominal α-level
in all situations, but the other tests were slightly biased in the case of unequal
sample sizes and/or unequal censoring. In particular, tests based on the
hypergeometric variance are slightly biased for unequal sample sizes. Latta (1981)
concludes that if there is no censoring or light censoring, any of the tests may be
used (but the hypergeometric variance should not be used if the sample sizes are
very different). In the case of heavy censoring where sample sizes are far apart
and/or the censoring is very different between samples, the Peto-Peto test based on
the asymptotic variance should be used.
Millard and Deverel (1988) also performed a Monte Carlo simulation similar to
Latta's (1981) study. They only used the lognormal distribution to generate
observations, but also looked at the normal scores test and two ad-hoc modifications
of the MWW test. They found the “Normal Scores 2” test shown in Table 1
above to be the best behaved test in terms of maintaining the nominal
α-level, but the other tests behaved almost as well. As Latta (1981)
found, when sample sizes and censoring are very different between the two groups,
the nominal α-level of most of the tests is slightly biased. In the
cases where the nominal α-level was maintained, the Peto-Peto test based
on the asymptotic variance appeared to be as powerful or more powerful than the
normal scores tests.
Neither of the Monte Carlo studies performed by Latta (1981) and Millard and Deverel (1988) looked at the behavior of the two-sample linear rank tests in the presence of several tied uncensored observations (because both studies generated observations from continuous distributions). Note that the results shown in Table 9 of Millard and Deverel (1988, p.2097) are not all correct because they did not allow for tied uncensored values. The last example in the EXAMPLES section below shows the correct values that should appear in that table.
Heller and Venkatraman (1996) performed a Monte Carlo simulation study to compare the behaviors of the Peto-Peto test (using the Prentice, 1978, estimator of survival; they call this the Prentice-Wilcoxon test) and logrank test under varying censoring conditions with sample sizes of 20 and 50 per group based on using the following methods to compute p-values: the asymptotic standard normal approximation, a permutation test approach (this is NOT the same as the permutation variance), and a bootstrap approach. Observed times were generated from Weibull and lognormal survival time distributions with independent uniform censoring. They found that for the Peto-Peto test, "the asymptotic test procedure was the most accurate; resampling procedures did not improve upon its accuracy." For the logrank test, with sample sizes of 20 per group, the usual test based on the asymptotic standard normal approximation tended to have a very slightly higher Type I error rate than assumed (however, for an assumed Type I error rate of 0.05, the largest Type I error rate observed was less than 0.065), whereas the permutation and bootstrap tests performed better; with sample sizes of 50 per group there was no difference in test performance.
Fleming and Harrington (1981) introduced a family of tests (sometimes called G-rho
tests) that contain the logrank and Peto-Peto tests as special cases. A single
parameter ρ (rho) controls the weights given to the uncensored and
censored observations. Positive values of ρ produce tests more sensitive
to early differences in the survival function, that is, differences in the cdf at
small values. Negative values of ρ produce tests more sensitive to late
differences in the survival function, that is, differences in the cdf at large
values.
The function survdiff
in the R package
survival implements the G-rho family of tests suggested by Fleming and
Harrington (1981). Calling survdiff
with rho=0
(the default) yields
the logrank test. Calling survdiff
with rho=1
yields the Peto-Peto
test based on the Kaplan-Meier estimate of survival. The function survdiff
always uses the hypergeometric estimate of variance and the Kaplan-Meier estimate of
survival, but it uses the “left-continuous” version of the Kaplan-Meier
estimate. The left-continuous K-M estimate of survival is defined as
follows: at each death (unique uncensored observation), the estimated survival is
equal to the estimated survival based on the ordinary K-M estimate at the prior
death time (or 1 for the first death).
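A brief sketch of those two calls on a small, made-up right-censored dataset (the times, status values, and group labels below are purely illustrative):

library(survival)
time   <- c(5, 8, 12, 16, 23, 27, 30, 33, 43, 45)
status <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1)   # 1 = event observed, 0 = right-censored
group  <- rep(c("A", "B"), each = 5)

survdiff(Surv(time, status) ~ group, rho = 0)   # logrank test
survdiff(Surv(time, status) ~ group, rho = 1)   # Peto-Peto style test (Kaplan-Meier weights)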
Steven P. Millard ([email protected])
Altshuler, B. (1970). Theory for the Measurement of Competing Risks in Animal Experiments. Mathematical Biosciences 6, 1–11.
Breslow, N.E. (1970). A Generalized Kruskal-Wallis Test for Comparing K Samples Subject to Unequal Patterns of Censorship. Biometrika 57, 579–594.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, Chapter 4.
Cox, D.R. (1972). Regression Models and Life Tables (with Discussion). Journal of the Royal Statistical Society of London, Series B 34, 187–220.
Divine, G., H.J. Norton, R. Hunt, and J. Dinemann. (2013). A Review of Analysis and Sample Size Calculation Considerations for Wilcoxon Tests. Anesthesia & Analgesia 117, 699–710.
Fleming, T.R., and D.P. Harrington. (1981). A Class of Hypothesis Tests for One and Two Sample Censored Survival Data. Communications in Statistics – Theory and Methods A10(8), 763–794.
Fleming, T.R., and D.P. Harrington. (1991). Counting Processes & Survival Analysis. John Wiley and Sons, New York, Chapter 7.
Gehan, E.A. (1965). A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-Censored Samples. Biometrika 52, 203–223.
Harrington, D.P., and T.R. Fleming. (1982). A Class of Rank Test Procedures for Censored Survival Data. Biometrika 69(3), 553–566.
Heller, G., and E. S. Venkatraman. (1996). Resampling Procedures to Compare Two Survival Distributions in the Presence of Right-Censored Data. Biometrics 52, 1204–1213.
Hettmansperger, T.P. (1984). Statistical Inference Based on Ranks. John Wiley and Sons, New York, 323pp.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods, Second Edition. John Wiley and Sons, New York.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457–481.
Latta, R.B. (1981). A Monte Carlo Study of Some Two-Sample Rank Tests with Censored Data. Journal of the American Statistical Association 76(375), 713–719.
Mantel, N. (1966). Evaluation of Survival Data and Two New Rank Order Statistics Arising in its Consideration. Cancer Chemotherapy Reports 50, 163–170.
Millard, S.P., and S.J. Deverel. (1988). Nonparametric Statistical Methods for Comparing Two Sites Based on Data With Multiple Nondetect Limits. Water Resources Research, 24(12), 2087–2098.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.432–435.
Peto, R., and J. Peto. (1972). Asymptotically Efficient Rank Invariant Test Procedures (with Discussion). Journal of the Royal Statistical Society of London, Series A 135, 185–206.
Prentice, R.L. (1978). Linear Rank Tests with Right Censored Data. Biometrika 65, 167–179.
Prentice, R.L. (1985). Linear Rank Tests. In Kotz, S., and N.L. Johnson, eds. Encyclopedia of Statistical Science. John Wiley and Sons, New York. Volume 5, pp.51–58.
Prentice, R.L., and P. Marek. (1979). A Qualitative Discrepancy Between Censored Data Rank Tests. Biometrics 35, 861–867.
Tarone, R.E., and J. Ware. (1977). On Distribution-Free Tests for Equality of Survival Distributions. Biometrika 64(1), 156–160.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
twoSampleLinearRankTest
, survdiff
,
wilcox.test
, htestCensored.object
.
# The last part of the EXAMPLES section in the help file for # cdfCompareCensored compares the empirical distribution of copper and zinc # between two sites: Alluvial Fan and Basin-Trough (Millard and Deverel, 1988). # The data for this example are stored in Millard.Deverel.88.df. Perform a # test to determine if there is a significant difference between these two # sites (perform a separate test for the copper and the zinc). Millard.Deverel.88.df # Cu.orig Cu Cu.censored Zn.orig Zn Zn.censored Zone Location #1 < 1 1 TRUE <10 10 TRUE Alluvial.Fan 1 #2 < 1 1 TRUE 9 9 FALSE Alluvial.Fan 2 #3 3 3 FALSE NA NA FALSE Alluvial.Fan 3 #. #. #. #116 5 5 FALSE 50 50 FALSE Basin.Trough 48 #117 14 14 FALSE 90 90 FALSE Basin.Trough 49 #118 4 4 FALSE 20 20 FALSE Basin.Trough 50 #------------------------------ # First look at the copper data #------------------------------ Cu.AF <- with(Millard.Deverel.88.df, Cu[Zone == "Alluvial.Fan"]) Cu.AF.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Alluvial.Fan"]) Cu.BT <- with(Millard.Deverel.88.df, Cu[Zone == "Basin.Trough"]) Cu.BT.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Basin.Trough"]) # Note the large number of tied observations in the copper data #-------------------------------------------------------------- table(Cu.AF[!Cu.AF.cen]) # 1 2 3 4 5 7 8 9 10 11 12 16 20 # 5 21 6 3 3 3 1 1 1 1 1 1 1 table(Cu.BT[!Cu.BT.cen]) # 1 2 3 4 5 6 8 9 12 14 15 17 23 # 7 4 8 5 1 2 1 2 1 1 1 1 1 # Logrank test with hypergeometric variance: #------------------------------------------- twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Logrank Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 1 5 10 20 # y = 1 2 5 10 15 # #Data: x = Cu.AF # y = Cu.BT # #Censoring Variable: x = Cu.AF.cen # y = Cu.BT.cen # #Number NA/NaN/Inf's Removed: x = 3 # y = 1 # #Sample Sizes: nx = 65 # ny = 49 # #Percent Censored: x = 26.2% # y = 28.6% # #Test Statistics: nu = -1.8791355 # var.nu = 13.6533490 # z = -0.5085557 # #P-value: 0.6110637 # Compare the p-values produced by the Normal Scores 2 test # using the hypergeomtric vs. permutation variance estimates. # Note how much larger the estimated variance is based on # the permuation variance estimate: #----------------------------------------------------------- twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen, test = "normal.scores.2")$p.value #[1] 0.2008913 twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen, test = "normal.scores.2", variance = "permutation")$p.value #[1] [1] 0.657001 #-------------------------- # Now look at the zinc data #-------------------------- Zn.AF <- with(Millard.Deverel.88.df, Zn[Zone == "Alluvial.Fan"]) Zn.AF.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Alluvial.Fan"]) Zn.BT <- with(Millard.Deverel.88.df, Zn[Zone == "Basin.Trough"]) Zn.BT.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Basin.Trough"]) # Note the moderate number of tied observations in the zinc data, # and the "outlier" of 620 in the Alluvial Fan data. 
#--------------------------------------------------------------- table(Zn.AF[!Zn.AF.cen]) # 5 7 8 9 10 11 12 17 18 19 20 23 29 30 33 40 50 620 # 1 1 1 1 20 2 1 1 1 1 14 1 1 1 1 1 1 1 table(Zn.BT[!Zn.BT.cen]) # 3 4 5 6 8 10 11 12 13 14 15 17 20 25 30 40 50 60 70 90 # 2 2 2 1 1 5 1 2 1 1 1 2 11 1 4 3 2 2 1 1 # Logrank test with hypergeometric variance: #------------------------------------------- twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Logrank Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 3 10 # y = 3 10 # #Data: x = Zn.AF # y = Zn.BT # #Censoring Variable: x = Zn.AF.cen # y = Zn.BT.cen # #Number NA/NaN/Inf's Removed: x = 1 # y = 0 # #Sample Sizes: nx = 67 # ny = 50 # #Percent Censored: x = 23.9% # y = 8.0% # #Test Statistics: nu = -6.992999 # var.nu = 17.203227 # z = -1.686004 # #P-value: 0.09179512 #---------- # Compare the p-values produced by the Logrank, Gehan, Peto-Peto, # and Tarone-Ware tests using the hypergeometric variance. #----------------------------------------------------------- twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "logrank")$p.value #[1] 0.09179512 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "gehan")$p.value #[1] 0.0185445 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "peto-peto")$p.value #[1] 0.009704529 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "tarone-ware")$p.value #[1] 0.03457803 #---------- # Clean up #--------- rm(Cu.AF, Cu.AF.cen, Cu.BT, Cu.BT.cen, Zn.AF, Zn.AF.cen, Zn.BT, Zn.BT.cen) #========== # Example 16.5 on pages 16-22 to 16.23 of USEPA (2009) shows how to perform # the Tarone-Ware two sample linear rank test based on censored data using # observations on tetrachloroethylene (PCE) (ppb) collected at one background # and one compliance well. The data for this example are stored in # EPA.09.Ex.16.5.PCE.df. 
EPA.09.Ex.16.5.PCE.df # Well.type PCE.Orig.ppb PCE.ppb Censored #1 Background <4 4.0 TRUE #2 Background 1.5 1.5 FALSE #3 Background <2 2.0 TRUE #4 Background 8.7 8.7 FALSE #5 Background 5.1 5.1 FALSE #6 Background <5 5.0 TRUE #7 Compliance 6.4 6.4 FALSE #8 Compliance 10.9 10.9 FALSE #9 Compliance 7 7.0 FALSE #10 Compliance 14.3 14.3 FALSE #11 Compliance 1.9 1.9 FALSE #12 Compliance 10 10.0 FALSE #13 Compliance 6.8 6.8 FALSE #14 Compliance <5 5.0 TRUE with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater")) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) > Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Tarone-Ware Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 5 # y = 2 4 5 # #Data: x = PCE.ppb[Well.type == "Compliance"] # y = PCE.ppb[Well.type == "Background"] # #Censoring Variable: x = Censored[Well.type == "Compliance"] # y = Censored[Well.type == "Background"] # #Sample Sizes: nx = 8 # ny = 6 # #Percent Censored: x = 12.5% # y = 50.0% # #Test Statistics: nu = 8.458912 # var.nu = 20.912407 # z = 1.849748 # #P-value: 0.03217495 # Compare the p-value for the Tarone-Ware test with p-values from # the logrank, Gehan, and Peto-Peto tests #----------------------------------------------------------------- with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater"))$p.value #[1] 0.03217495 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "logrank", alternative = "greater"))$p.value #[1] 0.02752793 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "gehan", alternative = "greater"))$p.value #[1] 0.03656224 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "peto-peto", alternative = "greater"))$p.value #[1] 0.03127296 #========== # The results shown in Table 9 of Millard and Deverel (1988, p.2097) are correct # only for the hypergeometric variance and the modified MWW tests; the other # results were computed as if there were no ties. Re-compute the correct # z-statistics and p-values for the copper and zinc data. 
test <- c(rep(c("gehan", "logrank", "peto-peto"), 2), "peto-peto", "normal.scores.1", "normal.scores.2", "normal.scores.2") variance <- c(rep("permutation", 3), rep("hypergeometric", 3), "asymptotic", rep("permutation", 2), "hypergeometric") stats.mat <- matrix(as.numeric(NA), ncol = 4, nrow = 10) for(i in 1:10) { dum.list <- with(Millard.Deverel.88.df, twoSampleLinearRankTestCensored( x = Cu[Zone == "Basin.Trough"], x.censored = Cu.censored[Zone == "Basin.Trough"], y = Cu[Zone == "Alluvial.Fan"], y.censored = Cu.censored[Zone == "Alluvial.Fan"], test = test[i], variance = variance[i])) stats.mat[i, 1:2] <- c(dum.list$statistic["z"], dum.list$p.value) dum.list <- with(Millard.Deverel.88.df, twoSampleLinearRankTestCensored( x = Zn[Zone == "Basin.Trough"], x.censored = Zn.censored[Zone == "Basin.Trough"], y = Zn[Zone == "Alluvial.Fan"], y.censored = Zn.censored[Zone == "Alluvial.Fan"], test = test[i], variance = variance[i])) stats.mat[i, 3:4] <- c(dum.list$statistic["z"], dum.list$p.value) } dimnames(stats.mat) <- list(paste(test, variance, sep = "."), c("Cu.Z", "Cu.p.value", "Zn.Z", "Zn.p.value")) round(stats.mat, 2) # Cu.Z Cu.p.value Zn.Z Zn.p.value #gehan.permutation 0.87 0.38 2.49 0.01 #logrank.permutation 0.79 0.43 1.75 0.08 #peto-peto.permutation 0.92 0.36 2.42 0.02 #gehan.hypergeometric 0.71 0.48 2.35 0.02 #logrank.hypergeometric 0.51 0.61 1.69 0.09 #peto-peto.hypergeometric 1.03 0.30 2.59 0.01 #peto-peto.asymptotic 0.90 0.37 2.37 0.02 #normal.scores.1.permutation 0.94 0.34 2.37 0.02 #normal.scores.2.permutation 0.98 0.33 2.39 0.02 #normal.scores.2.hypergeometric 1.28 0.20 2.48 0.01 #---------- # Clean up #--------- rm(test, variance, stats.mat, i, dum.list)
# The last part of the EXAMPLES section in the help file for # cdfCompareCensored compares the empirical distribution of copper and zinc # between two sites: Alluvial Fan and Basin-Trough (Millard and Deverel, 1988). # The data for this example are stored in Millard.Deverel.88.df. Perform a # test to determine if there is a significant difference between these two # sites (perform a separate test for the copper and the zinc). Millard.Deverel.88.df # Cu.orig Cu Cu.censored Zn.orig Zn Zn.censored Zone Location #1 < 1 1 TRUE <10 10 TRUE Alluvial.Fan 1 #2 < 1 1 TRUE 9 9 FALSE Alluvial.Fan 2 #3 3 3 FALSE NA NA FALSE Alluvial.Fan 3 #. #. #. #116 5 5 FALSE 50 50 FALSE Basin.Trough 48 #117 14 14 FALSE 90 90 FALSE Basin.Trough 49 #118 4 4 FALSE 20 20 FALSE Basin.Trough 50 #------------------------------ # First look at the copper data #------------------------------ Cu.AF <- with(Millard.Deverel.88.df, Cu[Zone == "Alluvial.Fan"]) Cu.AF.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Alluvial.Fan"]) Cu.BT <- with(Millard.Deverel.88.df, Cu[Zone == "Basin.Trough"]) Cu.BT.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Basin.Trough"]) # Note the large number of tied observations in the copper data #-------------------------------------------------------------- table(Cu.AF[!Cu.AF.cen]) # 1 2 3 4 5 7 8 9 10 11 12 16 20 # 5 21 6 3 3 3 1 1 1 1 1 1 1 table(Cu.BT[!Cu.BT.cen]) # 1 2 3 4 5 6 8 9 12 14 15 17 23 # 7 4 8 5 1 2 1 2 1 1 1 1 1 # Logrank test with hypergeometric variance: #------------------------------------------- twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Logrank Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 1 5 10 20 # y = 1 2 5 10 15 # #Data: x = Cu.AF # y = Cu.BT # #Censoring Variable: x = Cu.AF.cen # y = Cu.BT.cen # #Number NA/NaN/Inf's Removed: x = 3 # y = 1 # #Sample Sizes: nx = 65 # ny = 49 # #Percent Censored: x = 26.2% # y = 28.6% # #Test Statistics: nu = -1.8791355 # var.nu = 13.6533490 # z = -0.5085557 # #P-value: 0.6110637 # Compare the p-values produced by the Normal Scores 2 test # using the hypergeomtric vs. permutation variance estimates. # Note how much larger the estimated variance is based on # the permuation variance estimate: #----------------------------------------------------------- twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen, test = "normal.scores.2")$p.value #[1] 0.2008913 twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen, test = "normal.scores.2", variance = "permutation")$p.value #[1] [1] 0.657001 #-------------------------- # Now look at the zinc data #-------------------------- Zn.AF <- with(Millard.Deverel.88.df, Zn[Zone == "Alluvial.Fan"]) Zn.AF.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Alluvial.Fan"]) Zn.BT <- with(Millard.Deverel.88.df, Zn[Zone == "Basin.Trough"]) Zn.BT.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Basin.Trough"]) # Note the moderate number of tied observations in the zinc data, # and the "outlier" of 620 in the Alluvial Fan data. 
#--------------------------------------------------------------- table(Zn.AF[!Zn.AF.cen]) # 5 7 8 9 10 11 12 17 18 19 20 23 29 30 33 40 50 620 # 1 1 1 1 20 2 1 1 1 1 14 1 1 1 1 1 1 1 table(Zn.BT[!Zn.BT.cen]) # 3 4 5 6 8 10 11 12 13 14 15 17 20 25 30 40 50 60 70 90 # 2 2 2 1 1 5 1 2 1 1 1 2 11 1 4 3 2 2 1 1 # Logrank test with hypergeometric variance: #------------------------------------------- twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Logrank Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 3 10 # y = 3 10 # #Data: x = Zn.AF # y = Zn.BT # #Censoring Variable: x = Zn.AF.cen # y = Zn.BT.cen # #Number NA/NaN/Inf's Removed: x = 1 # y = 0 # #Sample Sizes: nx = 67 # ny = 50 # #Percent Censored: x = 23.9% # y = 8.0% # #Test Statistics: nu = -6.992999 # var.nu = 17.203227 # z = -1.686004 # #P-value: 0.09179512 #---------- # Compare the p-values produced by the Logrank, Gehan, Peto-Peto, # and Tarone-Ware tests using the hypergeometric variance. #----------------------------------------------------------- twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "logrank")$p.value #[1] 0.09179512 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "gehan")$p.value #[1] 0.0185445 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "peto-peto")$p.value #[1] 0.009704529 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "tarone-ware")$p.value #[1] 0.03457803 #---------- # Clean up #--------- rm(Cu.AF, Cu.AF.cen, Cu.BT, Cu.BT.cen, Zn.AF, Zn.AF.cen, Zn.BT, Zn.BT.cen) #========== # Example 16.5 on pages 16-22 to 16.23 of USEPA (2009) shows how to perform # the Tarone-Ware two sample linear rank test based on censored data using # observations on tetrachloroethylene (PCE) (ppb) collected at one background # and one compliance well. The data for this example are stored in # EPA.09.Ex.16.5.PCE.df. 
EPA.09.Ex.16.5.PCE.df # Well.type PCE.Orig.ppb PCE.ppb Censored #1 Background <4 4.0 TRUE #2 Background 1.5 1.5 FALSE #3 Background <2 2.0 TRUE #4 Background 8.7 8.7 FALSE #5 Background 5.1 5.1 FALSE #6 Background <5 5.0 TRUE #7 Compliance 6.4 6.4 FALSE #8 Compliance 10.9 10.9 FALSE #9 Compliance 7 7.0 FALSE #10 Compliance 14.3 14.3 FALSE #11 Compliance 1.9 1.9 FALSE #12 Compliance 10 10.0 FALSE #13 Compliance 6.8 6.8 FALSE #14 Compliance <5 5.0 TRUE with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater")) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) > Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Tarone-Ware Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 5 # y = 2 4 5 # #Data: x = PCE.ppb[Well.type == "Compliance"] # y = PCE.ppb[Well.type == "Background"] # #Censoring Variable: x = Censored[Well.type == "Compliance"] # y = Censored[Well.type == "Background"] # #Sample Sizes: nx = 8 # ny = 6 # #Percent Censored: x = 12.5% # y = 50.0% # #Test Statistics: nu = 8.458912 # var.nu = 20.912407 # z = 1.849748 # #P-value: 0.03217495 # Compare the p-value for the Tarone-Ware test with p-values from # the logrank, Gehan, and Peto-Peto tests #----------------------------------------------------------------- with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater"))$p.value #[1] 0.03217495 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "logrank", alternative = "greater"))$p.value #[1] 0.02752793 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "gehan", alternative = "greater"))$p.value #[1] 0.03656224 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "peto-peto", alternative = "greater"))$p.value #[1] 0.03127296 #========== # The results shown in Table 9 of Millard and Deverel (1988, p.2097) are correct # only for the hypergeometric variance and the modified MWW tests; the other # results were computed as if there were no ties. Re-compute the correct # z-statistics and p-values for the copper and zinc data. 
test <- c(rep(c("gehan", "logrank", "peto-peto"), 2), "peto-peto", "normal.scores.1", "normal.scores.2", "normal.scores.2") variance <- c(rep("permutation", 3), rep("hypergeometric", 3), "asymptotic", rep("permutation", 2), "hypergeometric") stats.mat <- matrix(as.numeric(NA), ncol = 4, nrow = 10) for(i in 1:10) { dum.list <- with(Millard.Deverel.88.df, twoSampleLinearRankTestCensored( x = Cu[Zone == "Basin.Trough"], x.censored = Cu.censored[Zone == "Basin.Trough"], y = Cu[Zone == "Alluvial.Fan"], y.censored = Cu.censored[Zone == "Alluvial.Fan"], test = test[i], variance = variance[i])) stats.mat[i, 1:2] <- c(dum.list$statistic["z"], dum.list$p.value) dum.list <- with(Millard.Deverel.88.df, twoSampleLinearRankTestCensored( x = Zn[Zone == "Basin.Trough"], x.censored = Zn.censored[Zone == "Basin.Trough"], y = Zn[Zone == "Alluvial.Fan"], y.censored = Zn.censored[Zone == "Alluvial.Fan"], test = test[i], variance = variance[i])) stats.mat[i, 3:4] <- c(dum.list$statistic["z"], dum.list$p.value) } dimnames(stats.mat) <- list(paste(test, variance, sep = "."), c("Cu.Z", "Cu.p.value", "Zn.Z", "Zn.p.value")) round(stats.mat, 2) # Cu.Z Cu.p.value Zn.Z Zn.p.value #gehan.permutation 0.87 0.38 2.49 0.01 #logrank.permutation 0.79 0.43 1.75 0.08 #peto-peto.permutation 0.92 0.36 2.42 0.02 #gehan.hypergeometric 0.71 0.48 2.35 0.02 #logrank.hypergeometric 0.51 0.61 1.69 0.09 #peto-peto.hypergeometric 1.03 0.30 2.59 0.01 #peto-peto.asymptotic 0.90 0.37 2.37 0.02 #normal.scores.1.permutation 0.94 0.34 2.37 0.02 #normal.scores.2.permutation 0.98 0.33 2.39 0.02 #normal.scores.2.hypergeometric 1.28 0.20 2.48 0.01 #---------- # Clean up #--------- rm(test, variance, stats.mat, i, dum.list)
Perform a two-sample or paired-sample randomization (permutation) test for location based on either means or medians.
twoSamplePermutationTestLocation(x, y, fcn = "mean", alternative = "two.sided", mu1.minus.mu2 = 0, paired = FALSE, exact = FALSE, n.permutations = 5000, seed = NULL, tol = sqrt(.Machine$double.eps))
twoSamplePermutationTestLocation(x, y, fcn = "mean", alternative = "two.sided", mu1.minus.mu2 = 0, paired = FALSE, exact = FALSE, n.permutations = 5000, seed = NULL, tol = sqrt(.Machine$double.eps))
x |
numeric vector of observations from population 1.
Missing ( |
y |
numeric vector of observations from population 2.
Missing ( In the case when |
fcn |
character string indicating which location parameter to compare between the two
groups. The possible values are |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
mu1.minus.mu2 |
numeric scalar indicating the hypothesized value of the difference between the
means or medians. The default value is |
paired |
logical scalar indicating whether to perform a paired or two-sample permutation
test. The possible values are |
exact |
logical scalar indicating whether to perform the exact permutation test (i.e.,
enumerate all possible permutations) or simply sample from the permutation
distribution. The default value is |
n.permutations |
integer indicating how many times to sample from the permutation distribution when
|
seed |
positive integer to pass to the R function |
tol |
numeric scalar indicating the tolerance to use for computing the p-value for the
two-sample permutation test. The default value is |
Randomization Tests
In 1935, R.A. Fisher introduced the idea of a randomization test
(Manly, 2007, p. 107; Efron and Tibshirani, 1993, Chapter 15), which is based on
trying to answer the question: “Did the observed pattern happen by chance,
or does the pattern indicate the null hypothesis is not true?” A randomization
test works by simply enumerating all of the possible outcomes under the null
hypothesis, then seeing where the observed outcome fits in. A randomization test
is also called a permutation test, because it involves permuting the
observations during the enumeration procedure (Manly, 2007, p. 3).
In the past, randomization tests have not been used as extensively as they are now
because of the “large” computing resources needed to enumerate all of the
possible outcomes, especially for large sample sizes. The advent of more powerful
personal computers and software has allowed randomization tests to become much
easier to perform. Depending on the sample size, however, it may still be too
time consuming to enumerate all possible outcomes. In this case, the randomization
test can still be performed by sampling from the randomization distribution, and
comparing the observed outcome to this sampled permutation distribution.
Two-Sample Randomization Test for Location (paired=FALSE
)
Let $\underline{x} = (x_1, x_2, \ldots, x_{n_1})$ be a vector of $n_1$ independent and identically distributed (i.i.d.) observations from some distribution with location parameter (e.g., mean or median) $\mu_1$, and let $\underline{y} = (y_1, y_2, \ldots, y_{n_2})$ be a vector of $n_2$ i.i.d. observations from the same distribution with possibly different location parameter $\mu_2$.

Consider the test of the null hypothesis that the difference in the location parameters is equal to some specified value:

$$H_0: \delta = \delta_0 \quad (1)$$

where

$$\delta = \mu_1 - \mu_2 \quad (2)$$

and $\delta_0$ denotes the hypothesized difference in the measures of location (usually $\delta_0 = 0$).

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater")

$$H_a: \delta > \delta_0 \quad (3)$$

the lower one-sided alternative (alternative="less")

$$H_a: \delta < \delta_0 \quad (4)$$

and the two-sided alternative

$$H_a: \delta \ne \delta_0 \quad (5)$$
To perform the test of the null hypothesis (1) versus any of the three alternatives (3)-(5), you can use the two-sample permutation test. The two sample permutation test is based on trying to answer the question, “Did the observed difference in means or medians happen by chance, or does the observed difference indicate that the null hypothesis is not true?” Under the null hypothesis, the underlying distributions for each group are the same, therefore it should make no difference which group an observation gets assigned to. The two-sample permutation test works by simply enumerating all possible permutations of group assignments, and for each permutation computing the difference between the measures of location for each group (Manly, 2007, p. 113; Efron and Tibshirani, 1993, p. 202). The measure of location for a group could be the mean, median, or any other measure you want to use. For example, if the observations from Group 1 are 3 and 5, and the observations from Group 2 are 4, 6, and 7, then there are 10 different ways of splitting these five observations into one group of size 2 and another group of size 3. The table below lists all of the possible group assignments, along with the differences in the group means.
Group 1 | Group 2 | Mean 1 - Mean 2 |
3, 4 | 5, 6, 7 | -2.5 |
3, 5 | 4, 6, 7 | -1.67 |
3, 6 | 4, 5, 7 | -0.83 |
3, 7 | 4, 5, 6 | 0 |
4, 5 | 3, 6, 7 | -0.83 |
4, 6 | 3, 5, 7 | 0 |
4, 7 | 3, 5, 6 | 0.83 |
5, 6 | 3, 4, 7 | 0.83 |
5, 7 | 3, 4, 6 | 1.67 |
6, 7 | 3, 4, 5 | 2.5 |
In this example, the observed group assignments and difference in means are shown in the second row of the table.
For a one-sided upper alternative (Equation (3)), the p-value is computed as the proportion of times that the differences of the means (or medians) in the permutation distribution are greater than or equal to the observed difference in means (or medians). For a one-sided lower alternative hypothesis (Equation (4)), the p-value is computed as the proportion of times that the differences in the means (or medians) in the permutation distribution are less than or equal to the observed difference in the means (or medians). For a two-sided alternative hypothesis (Equation (5)), the p-value is computed as the proportion of times the absolute values of the differences in the means (or medians) in the permutation distribution are greater than or equal to the absolute value of the observed difference in the means (or medians).
For this simple example, the one-sided upper, one-sided lower, and two-sided p-values are 0.9, 0.2 and 0.4, respectively.
Note: Because of the nature of machine arithmetic and how the permutation
distribution is computed, a one-sided upper p-value is computed as the proportion
of times that the differences of the means (or medians) in the permutation
distribution are greater than or equal to
[the observed difference in means (or medians) - a small tolerance value], where the
tolerance value is determined by the argument tol
. Similarly, a one-sided
lower p-value is computed as the proportion of times that the differences in the
means (or medians) in the permutation distribution are less than or equal to
[the observed difference in the means (or medians) + a small tolerance value].
Finally, a two-sided p-value is computed as the proportion of times the absolute
values of the differences in the means (or medians) in the permutation distribution
are greater than or equal to
[the absolute value of the observed difference in the means (or medians) - a small tolerance value].
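To make the enumeration above concrete, the following minimal base-R sketch (not part of EnvStats; the object names are illustrative) reproduces the ten group assignments and the three p-values for the hypothetical observations 3, 5 (Group 1) and 4, 6, 7 (Group 2), applying the same small tolerance described in the note above.

tol <- sqrt(.Machine$double.eps)
obs <- c(3, 5, 4, 6, 7)                 # Group 1 = first two values, Group 2 = rest
idx <- combn(5, 2)                      # all 10 possible assignments of two values to Group 1
perm.diffs <- apply(idx, 2, function(i) mean(obs[i]) - mean(obs[-i]))
round(perm.diffs, 2)                    # the ten mean differences (same values as the table above, in a different order)
obs.diff <- mean(obs[1:2]) - mean(obs[3:5])     # observed difference: -1.67
mean(perm.diffs >= obs.diff - tol)              # one-sided upper p-value: 0.9
mean(perm.diffs <= obs.diff + tol)              # one-sided lower p-value: 0.2
mean(abs(perm.diffs) >= abs(obs.diff) - tol)    # two-sided p-value: 0.4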
In this simple example, we assumed the hypothesized difference in the means under the null hypothesis was $\delta_0 = 0$. If we had hypothesized a different value for $\delta_0$, then we would have had to subtract this value from each of
the observations in Group 1 before permuting the group assignments to compute the
permutation distribution of the differences of the means. As in the case of the
one-sample permutation test, if the sample sizes
for the groups become too large to compute all possible permutations of the group
assignments, the permutation test can still be performed by sampling from the
permutation distribution and comparing the observed difference in locations to the
sampled permutation distribution of the difference in locations.
Unlike the two-sample Student's t-test, we do not have to worry about the normality assumption when we use a permutation test. The permutation test still assumes, however, that under the null hypothesis, the distributions of the observations from each group are exactly the same, and under the alternative hypothesis there is simply a shift in location (that is, the whole distribution of group 1 is shifted by some constant relative to the distribution of group 2). Mathematically, this can be written as follows:
$$F_1(t) = F_2(t - \delta), \quad -\infty < t < \infty$$

where $F_1$ and $F_2$ denote the cumulative distribution functions for group 1 and group 2, respectively. If $\delta > 0$, this implies that the observations in group 1 tend to be larger than the observations in group 2, and if $\delta < 0$, this implies that the observations in group 1 tend to be
smaller than the observations in group 2. Thus, the shape and spread (variance)
of the two distributions should be the same whether the null hypothesis is true or
not. Therefore, the Type I error rate for a permutation test can be affected by
differences in variances between the two groups.
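A small simulation sketch illustrates this last point (the sample sizes, standard deviations, number of simulations, and number of permutations below are arbitrary choices, and the resulting estimate is noisy): when the smaller group has the larger variance, the nominal 5% two-sample permutation test based on means can reject far more often than 5% even though both means are equal.

set.seed(1)
n.sim <- 200
reject <- replicate(n.sim, {
  x <- rnorm(10, mean = 0, sd = 5)   # small group, large variance
  y <- rnorm(40, mean = 0, sd = 1)   # large group, small variance
  twoSamplePermutationTestLocation(x, y, n.permutations = 500)$p.value < 0.05
})
mean(reject)   # empirical Type I error rate; tends to be well above 0.05 here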
Confidence Intervals for the Difference in Means or Medians
Based on the relationship between hypothesis tests and confidence intervals, it is possible to construct a two-sided or one-sided $(1-\alpha)100\%$ confidence interval for the difference in means or medians based on the two-sample permutation test by finding the values of $\delta_0$ that correspond to obtaining a p-value of $\alpha$ (Manly, 2007, pp. 18–20, 114). A confidence interval based on the bootstrap, however, will yield a similar type of confidence interval
(Efron and Tibshirani, 1993, p. 214); see the help file for
boot
in the R package boot.
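As a rough sketch of that inversion (this is not an EnvStats function; the name permCI, the grid of candidate values, and the fixed seed are illustrative assumptions), one can scan a grid of hypothesized differences and keep those whose two-sided p-value exceeds $\alpha$:

permCI <- function(x, y, deltas, conf.level = 0.95, seed = 47) {
  # p-value of the two-sided permutation test at each candidate difference
  pvals <- sapply(deltas, function(d)
    twoSamplePermutationTestLocation(x, y, mu1.minus.mu2 = d,
      seed = seed)$p.value)
  # approximate confidence limits: smallest and largest "accepted" delta0
  range(deltas[pvals > 1 - conf.level])
}

The resolution of the resulting interval is limited by the spacing of the grid and by the Monte Carlo error of the sampled permutation distribution.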
Paired-Sample Randomization Test for Location (paired=TRUE
)
When the argument paired=TRUE
, the arguments x
and y
are
assumed to have the same length, and the differences $d_i = x_i - y_i$, $i = 1, 2, \ldots, n$, are assumed to be independent observations from some symmetric distribution with mean $\delta$. The
one-sample permutation test can then be applied
to the differences.
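For example (a sketch with arbitrary simulated values, not one of the package's own examples), a paired test on simulated before/after measurements looks like this:

set.seed(42)
before <- rlnorm(12, meanlog = 1, sdlog = 0.5)
after  <- before * rlnorm(12, meanlog = -0.1, sdlog = 0.2)   # modest average decrease
paired.test <- twoSamplePermutationTestLocation(after, before,
  paired = TRUE, alternative = "less", seed = 47)
paired.test$p.value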
A list of class "permutationTest"
containing the results of the hypothesis
test. See the help file for permutationTest.object
for details.
A frequent question in environmental statistics is “Is the concentration of chemical X in Area A greater than the concentration of chemical X in Area B?”. For example, in groundwater detection monitoring at hazardous and solid waste sites, the concentration of a chemical in the groundwater at a downgradient well must be compared to “background”. If the concentration is “above” the background then the site enters assessment monitoring. As another example, soil cleanup at a Superfund site may involve comparing the concentration of a chemical in the soil at a “cleaned up” site with the concentration at a “background” site. If the concentration at the “cleaned up” site is “greater” than the background concentration, then further investigation and remedial action may be required. Determining what it means for the chemical concentration to be “greater” than background is a policy decision: you may want to compare averages, medians, 95'th percentiles, etc.
Hypothesis tests you can use to compare “location” between two groups include: Student's t-test, Fisher's randomization test (described in this help file), the Wilcoxon rank sum test, other two-sample linear rank tests, the quantile test, and a test based on a bootstrap confidence interval.
Steven P. Millard ([email protected])
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, Chapter 15.
Manly, B.F.J. (2007). Randomization, Bootstrap and Monte Carlo Methods in Biology. Third Edition. Chapman & Hall, New York, Chapter 6.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.426–431.
permutationTest.object
, plot.permutationTest
,
oneSamplePermutationTest
, twoSamplePermutationTestProportion
,
Hypothesis Tests, boot
.
# Generate 10 observations from a lognormal distribution with parameters # mean=5 and cv=2, and and 20 observations from a lognormal distribution with # parameters mean=10 and cv=2. Test the null hypothesis that the means of the # two distributions are the same against the alternative that the mean for # group 1 is less than the mean for group 2. # (Note: the call to set.seed allows you to reproduce the same data # (dat1 and dat2), and setting the argument seed=732 in the call to # twoSamplePermutationTestLocation() lets you reproduce this example by # getting the same sample from the permutation distribution). set.seed(256) dat1 <- rlnormAlt(10, mean = 5, cv = 2) dat2 <- rlnormAlt(20, mean = 10, cv = 2) test.list <- twoSamplePermutationTestLocation(dat1, dat2, alternative = "less", seed = 732) # Print the results of the test #------------------------------ test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mu.x-mu.y = 0 # #Alternative Hypothesis: True mu.x-mu.y is less than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Means # (Based on Sampling # Permutation Distribution # 5000 Times) # #Estimated Parameter(s): mean of x = 2.253439 # mean of y = 11.825430 # #Data: x = dat1 # y = dat2 # #Sample Sizes: nx = 10 # ny = 20 # #Test Statistic: mean.x - mean.y = -9.571991 # #P-value: 0.001 # Plot the results of the test #----------------------------- dev.new() plot(test.list) #========== # The guidance document "Statistical Methods for Evaluating the Attainment of # Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid # Media" (USEPA, 1994b, pp. 6.22-6.25) contains observations of # 1,2,3,4-Tetrachlorobenzene (TcCB) in ppb at a Reference Area and a Cleanup Area. # These data are stored in the data frame EPA.94b.tccb.df. Use the # two-sample permutation test to test for a difference in means between the # two areas vs. the alternative that the mean in the Cleanup Area is greater. # Do the same thing for the medians. # # The permutation test based on comparing means shows a significant differnce, # while the one based on comparing medians does not. # First test for a difference in the means. #------------------------------------------ mean.list <- with(EPA.94b.tccb.df, twoSamplePermutationTestLocation( TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], alternative = "greater", seed = 47)) mean.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mu.x-mu.y = 0 # #Alternative Hypothesis: True mu.x-mu.y is greater than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Means # (Based on Sampling # Permutation Distribution # 5000 Times) # #Estimated Parameter(s): mean of x = 3.9151948 # mean of y = 0.5985106 # #Data: x = TcCB[Area == "Cleanup"] # y = TcCB[Area == "Reference"] # #Sample Sizes: nx = 77 # ny = 47 # #Test Statistic: mean.x - mean.y = 3.316684 # #P-value: 0.0206 dev.new() plot(mean.list) #---------- # Now test for a difference in the medians. 
#------------------------------------------ median.list <- with(EPA.94b.tccb.df, twoSamplePermutationTestLocation( TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], fcn = "median", alternative = "greater", seed = 47)) median.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mu.x-mu.y = 0 # #Alternative Hypothesis: True mu.x-mu.y is greater than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Medians # (Based on Sampling # Permutation Distribution # 5000 Times) # #Estimated Parameter(s): median of x = 0.43 # median of y = 0.54 # #Data: x = TcCB[Area == "Cleanup"] # y = TcCB[Area == "Reference"] # #Sample Sizes: nx = 77 # ny = 47 # #Test Statistic: median.x - median.y = -0.11 # #P-value: 0.936 dev.new() plot(median.list) #========== # Clean up #--------- rm(test.list, mean.list, median.list) graphics.off()
Perform a two-sample randomization (permutation) test to compare two proportions. This is also called Fisher's exact test.
Note: You can perform Fisher's exact test in R using the function
fisher.test
.
twoSamplePermutationTestProportion(x, y, x.and.y = "Binomial Outcomes", alternative = "two.sided", tol = sqrt(.Machine$double.eps))
twoSamplePermutationTestProportion(x, y, x.and.y = "Binomial Outcomes", alternative = "two.sided", tol = sqrt(.Machine$double.eps))
x , y
|
When When |
x.and.y |
character string indicating the kind of data stored in the vectors |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
tol |
numeric scalar indicating the tolerance to use for computing the p-value for the
two-sample permutation test. The default value is |
Randomization Tests
In 1935, R.A. Fisher introduced the idea of a randomization test
(Manly, 2007, p. 107; Efron and Tibshirani, 1993, Chapter 15), which is based on
trying to answer the question: “Did the observed pattern happen by chance,
or does the pattern indicate the null hypothesis is not true?” A randomization
test works by simply enumerating all of the possible outcomes under the null
hypothesis, then seeing where the observed outcome fits in. A randomization test
is also called a permutation test, because it involves permuting the
observations during the enumeration procedure (Manly, 2007, p. 3).
In the past, randomization tests have not been used as extensively as they are now
because of the “large” computing resources needed to enumerate all of the
possible outcomes, especially for large sample sizes. The advent of more powerful
personal computers and software has allowed randomization tests to become much
easier to perform. Depending on the sample size, however, it may still be too
time consuming to enumerate all possible outcomes. In this case, the randomization
test can still be performed by sampling from the randomization distribution, and
comparing the observed outcome to this sampled permutation distribution.
Two-Sample Randomization Test for Proportions
Let $\underline{x} = (x_1, x_2, \ldots, x_{n_1})$ be a vector of $n_1$ independent and identically distributed (i.i.d.) observations from a binomial distribution with parameter size=1 and probability of success prob=$p_1$, and let $\underline{y} = (y_1, y_2, \ldots, y_{n_2})$ be a vector of $n_2$ i.i.d. observations from a binomial distribution with parameter size=1 and probability of success prob=$p_2$.

Consider the test of the null hypothesis:

$$H_0: p_1 = p_2 \quad (1)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater")

$$H_a: p_1 > p_2 \quad (2)$$

the lower one-sided alternative (alternative="less")

$$H_a: p_1 < p_2 \quad (3)$$

and the two-sided alternative

$$H_a: p_1 \ne p_2 \quad (4)$$

To perform the test of the null hypothesis (1) versus any of the three alternatives (2)-(4), you can use the two-sample permutation test, which is also called Fisher's exact test. When the observations are from a B(1, $p$) distribution, the sample mean is an estimate of $p$. Fisher's exact test is simply a permutation test for the difference between two means from two different groups (see twoSamplePermutationTestLocation), where the underlying populations are binomial with size parameter size=1, but possibly different values of the prob parameter $p$.
Fisher's exact test is usually described in terms of testing hypotheses concerning
a 2 x 2 contingency table (van Belle et al., 2004, p. 157;
Hollander and Wolfe, 1999, p. 473; Sheskin, 2011; Zar, 2010, p. 561).
The probabilities associated with the permutation distribution can be computed by
using the hypergeometric distribution.
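To illustrate that last point (a sketch using base R's hypergeometric functions, not EnvStats internals; the counts are taken from the ETU example shown further below), the one-sided upper p-value can be computed directly as a hypergeometric tail probability:

x1 <- 16; n1 <- 69    # tumors and number of rats in the exposed (250 ppm) group
x2 <- 2;  n2 <- 72    # tumors and number of rats in the control group
# P(group 1 has at least x1 "successes" given that the margins are fixed)
phyper(x1 - 1, m = x1 + x2, n = (n1 + n2) - (x1 + x2), k = n1,
  lower.tail = FALSE)
# This should agree with the p-value reported by
# twoSamplePermutationTestProportion() in the example below (about 0.0002).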
A list of class "permutationTest"
containing the results of the hypothesis
test. See the help file for permutationTest.object
for details.
Sometimes in environmental data analysis we are interested in determining whether two probabilities or rates or proportions differ from each other. For example, we may ask the question: “Does exposure to pesticide X increase the risk of developing cancer Y?”, where cancer Y may be liver cancer, stomach cancer, or some other kind of cancer. One way environmental scientists attempt to answer this kind of question is by conducting experiments on rodents in which one group (the “treatment” or “exposed” group) is exposed to the pesticide and the other group (the control group) is not. The incidence of cancer Y in the exposed group is compared with the incidence of cancer Y in the control group. (See Rodricks (2007) for a discussion of extrapolating results from experiments involving rodents to consequences in humans and the associated difficulties).
Hypothesis tests you can use to compare proportions or probability of
“success” between two groups include Fisher's exact test and the test
based on the normal approximation (see the R help file for
prop.test
).
Steven P. Millard ([email protected])
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, Chapter 15.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods. Second Edition. John Wiley and Sons, New York, p.473.
Manly, B.F.J. (2007). Randomization, Bootstrap and Monte Carlo Methods in Biology. Third Edition. Chapman & Hall, New York, Chapter 6.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.441–446.
Graham, S.L., K.J. Davis, W.H. Hansen, and C.H. Graham. (1975). Effects of Prolonged Ethylene Thiourea Ingestion on the Thyroid of the Rat. Food and Cosmetics Toxicology, 13(5), 493–499.
Rodricks, J.V. (1992). Calculated Risks: The Toxicity and Human Health Risks of Chemicals in Our Environment. Cambridge University Press, New York.
Rodricks, J.V. (2007). Calculated Risks: The Toxicity and Human Health Risks of Chemicals in Our Environment. Second Edition. Cambridge University Press, New York.
Sheskin, D.J. (2011). Handbook of Parametric and Nonparametric Statistical Procedures. Fifth Edition. CRC Press, Boca Raton, FL.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York, p. 157.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, p. 561.
permutationTest.object
, plot.permutationTest
, twoSamplePermutationTestLocation
,
oneSamplePermutationTest
,
Hypothesis Tests, boot
.
# Generate 10 observations from a binomial distribution with parameters # size=1 and prob=0.3, and 20 observations from a binomial distribution # with parameters size=1 and prob=0.5. Test the null hypothesis that the # probability of "success" for the two distributions is the same against the # alternative that the probability of "success" for group 1 is less than # the probability of "success" for group 2. # (Note: the call to set.seed allows you to reproduce this example). set.seed(23) dat1 <- rbinom(10, size = 1, prob = 0.3) dat2 <- rbinom(20, size = 1, prob = 0.5) test.list <- twoSamplePermutationTestProportion( dat1, dat2, alternative = "less") #---------- # Print the results of the test #------------------------------ test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: p.x - p.y = 0 # #Alternative Hypothesis: True p.x - p.y is less than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Proportions # (Fisher's Exact Test) # #Estimated Parameter(s): p.hat.x = 0.60 # p.hat.y = 0.65 # #Data: x = dat1 # y = dat2 # #Sample Sizes: nx = 10 # ny = 20 # #Test Statistic: p.hat.x - p.hat.y = -0.05 # #P-value: 0.548026 #---------- # Plot the results of the test #------------------------------ dev.new() plot(test.list) #---------- # Compare to the results of fisher.test #-------------------------------------- x11 <- sum(dat1) x21 <- length(dat1) - sum(dat1) x12 <- sum(dat2) x22 <- length(dat2) - sum(dat2) mat <- matrix(c(x11, x12, x21, x22), ncol = 2) fisher.test(mat, alternative = "less") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: odds ratio = 1 # #Alternative Hypothesis: True odds ratio is less than 1 # #Test Name: Fisher's Exact Test for Count Data # #Estimated Parameter(s): odds ratio = 0.8135355 # #Data: mat # #P-value: 0.548026 # #95% Confidence Interval: LCL = 0.000000 # UCL = 4.076077 #========== # Rodricks (1992, p. 133) presents data from an experiment by # Graham et al. (1975) in which different groups of rats were exposed to # various concentration levels of ethylene thiourea (ETU), a decomposition # product of a certain class of fungicides that can be found in treated foods. # In the group exposed to a dietary level of 250 ppm of ETU, 16 out of 69 rats # (23%) developed thyroid tumors, whereas in the control group # (no exposure to ETU) only 2 out of 72 (3%) rats developed thyroid tumors. # If we use Fisher's exact test to test the null hypothesis that the proportion # of rats exposed to 250 ppm of ETU who will develop thyroid tumors over their # lifetime is no greater than the proportion of rats not exposed to ETU who will # develop tumors, we get a one-sided upper p-value of 0.0002. Therefore, we # conclude that the true underlying rate of tumor incidence in the exposed group # is greater than in the control group. # # The data for this example are stored in Graham.et.al.75.etu.df. 
# Look at the data #----------------- Graham.et.al.75.etu.df # dose tumors n proportion #1 0 2 72 0.02777778 #2 5 2 75 0.02666667 #3 25 1 73 0.01369863 #4 125 2 73 0.02739726 #5 250 16 69 0.23188406 #6 500 62 70 0.88571429 # Perform the test for a difference in tumor rates #------------------------------------------------- Num.Tumors <- with(Graham.et.al.75.etu.df, tumors[c(5, 1)]) Sample.Sizes <- with(Graham.et.al.75.etu.df, n[c(5, 1)]) test.list <- twoSamplePermutationTestProportion( x = Num.Tumors, y = Sample.Sizes, x.and.y="Number Successes and Trials", alternative = "greater") #---------- # Print the results of the test #------------------------------ test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: p.x - p.y = 0 # #Alternative Hypothesis: True p.x - p.y is greater than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Proportions # (Fisher's Exact Test) # #Estimated Parameter(s): p.hat.x = 0.23188406 # p.hat.y = 0.02777778 # #Data: x = Num.Tumors # n = Sample.Sizes # #Sample Sizes: nx = 69 # ny = 72 # #Test Statistic: p.hat.x - p.hat.y = 0.2041063 # #P-value: 0.0002186462 #---------- # Plot the results of the test #------------------------------ dev.new() plot(test.list) #========== # Clean up #--------- rm(test.list, x11, x12, x21, x22, mat, Num.Tumors, Sample.Sizes) #graphics.off()
Test the null hypothesis that the variances of two or more normal distributions are the same using Levene's or Bartlett's test.
varGroupTest(object, ...) ## S3 method for class 'formula' varGroupTest(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: varGroupTest(object, group, test = "Levene", correct = TRUE, data.name = NULL, group.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...) ## S3 method for class 'data.frame' varGroupTest(object, ...) ## S3 method for class 'matrix' varGroupTest(object, ...) ## S3 method for class 'list' varGroupTest(object, ...)
varGroupTest(object, ...) ## S3 method for class 'formula' varGroupTest(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: varGroupTest(object, group, test = "Levene", correct = TRUE, data.name = NULL, group.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...) ## S3 method for class 'data.frame' varGroupTest(object, ...) ## S3 method for class 'matrix' varGroupTest(object, ...) ## S3 method for class 'list' varGroupTest(object, ...)
object |
an object containing data for 2 or more groups whose variances are to be compared.
In the default method, the argument |
data |
when |
subset |
when |
na.action |
when |
group |
when |
test |
character string indicating which test to use. The possible values are |
correct |
logical scalar indicating whether to use the correction factor for Bartlett's test.
The default value is |
data.name |
character string indicating the name of the data used for the group variance test.
The default value is |
group.name |
character string indicating the name of the data used to create the groups.
The default value is |
parent.of.data |
character string indicating the source of the data used for the group variance test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the group variance test. |
The function varGroupTest
performs Levene's or Bartlett's test for
homogeneity of variance among two or more groups. The R function var.test
compares two variances.
Bartlett's test is very sensitive to the assumption of normality and will tend to give
significant results even when the null hypothesis is true if the underlying distributions
have long tails (e.g., are leptokurtic). Levene's test is almost as powerful as Bartlett's
test when the underlying distributions are normal, and unlike Bartlett's test it tends to
maintain the assumed $\alpha$-level when the underlying distributions are not normal
(Snedecor and Cochran, 1989, p.252; Milliken and Johnson, 1992, p.22; Conover et al., 1981).
Thus, Levene's test is generally recommended over Bartlett's test.
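The idea behind Levene's test can be sketched as a one-way ANOVA on the absolute deviations of each observation from its group mean (this is the classical, mean-based form of the statistic and is meant only as an illustration, not as a reproduction of varGroupTest's internals; it uses the arsenic data from the example further below, in which Well is a factor):

# absolute deviations from the well-specific means
abs.dev <- with(EPA.09.Ex.11.1.arsenic.df,
  abs(Arsenic.ppb - ave(Arsenic.ppb, Well)))
# one-way ANOVA of the absolute deviations across wells
anova(lm(abs.dev ~ Well, data = EPA.09.Ex.11.1.arsenic.df))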
a list of class "htest"
containing the results of the group variance test.
Objects of class "htest"
have special printing and plotting methods.
See the help file for htest.object
for details.
Chapter 11 of USEPA (2009) discusses using Levene's test to test the assumption of equal variances between monitoring wells or to test that the variance is stable over time when performing intrawell tests.
Steven P. Millard ([email protected])
Conover, W.J., M.E. Johnson, and M.M. Johnson. (1981). A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data. Technometrics 23(4), 351-361.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817-865.
Milliken, G.A., and D.E. Johnson. (1992). Analysis of Messy Data, Volume I: Designed Experiments. Chapman & Hall, New York.
Snedecor, G.W., and W.G. Cochran. (1989). Statistical Methods, Eighth Edition. Iowa State University Press, Ames Iowa.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# Example 11-2 of USEPA (2009, page 11-7) gives an example of # testing the assumption of equal variances across wells for arsenic # concentrations (ppb) in groundwater collected at 6 monitoring # wells over 4 months. The data for this example are stored in # EPA.09.Ex.11.1.arsenic.df. head(EPA.09.Ex.11.1.arsenic.df) # Arsenic.ppb Month Well #1 22.9 1 1 #2 3.1 2 1 #3 35.7 3 1 #4 4.2 4 1 #5 2.0 1 2 #6 1.2 2 2 longToWide(EPA.09.Ex.11.1.arsenic.df, "Arsenic.ppb", "Month", "Well", paste.row.name = TRUE, paste.col.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 #Month.1 22.9 2.0 2.0 7.8 24.9 0.3 #Month.2 3.1 1.2 109.4 9.3 1.3 4.8 #Month.3 35.7 7.8 4.5 25.9 0.8 2.8 #Month.4 4.2 52.0 2.5 2.0 27.0 1.2 varGroupTest(Arsenic.ppb ~ Well, data = EPA.09.Ex.11.1.arsenic.df) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Ratio of each pair of variances = 1 # #Alternative Hypothesis: At least one variance differs # #Test Name: Levene's Test for # Homogenity of Variance # #Estimated Parameter(s): Well.1 = 246.8158 # Well.2 = 592.6767 # Well.3 = 2831.4067 # Well.4 = 105.2967 # Well.5 = 207.4467 # Well.6 = 3.9025 # #Data: Arsenic.ppb # #Grouping Variable: Well # #Data Source: EPA.09.Ex.11.1.arsenic.df # #Sample Sizes: Well.1 = 4 # Well.2 = 4 # Well.3 = 4 # Well.4 = 4 # Well.5 = 4 # Well.6 = 4 # #Test Statistic: F = 4.564176 # #Test Statistic Parameters: num df = 5 # denom df = 18 # #P-value: 0.007294084
Estimate the variance, test the null hypothesis using the chi-squared test that the variance is equal to a user-specified value, and create a confidence interval for the variance.
varTest(x, alternative = "two.sided", conf.level = 0.95, sigma.squared = 1, data.name = NULL)
varTest(x, alternative = "two.sided", conf.level = 0.95, sigma.squared = 1, data.name = NULL)
x |
numeric vector of observations. Missing ( |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are
|
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence
interval for the population variance. The default value is |
sigma.squared |
a numeric scalar indicating the hypothesized value of the variance. The default value is
|
data.name |
character string indicating the name of the data used for the test of variance. |
The function varTest
performs the one-sample chi-squared test of the hypothesis
that the population variance is equal to the user specified value given by the argument
sigma.squared
, and it also returns a confidence interval for the population variance.
The R function var.test
performs the F-test for comparing two variances.
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
Just as you can perform tests of hypothesis on measures of location (mean, median, percentile, etc.), you can do the same thing for measures of spread or variability. Usually, we are interested in estimating variability only because we want to quantify the uncertainty of our estimated location or percentile. Sometimes, however, we are interested in estimating variability and quantifying the uncertainty in our estimate of variability (for example, for performing a sensitivity analysis for power or sample size calculations), or testing whether the population variability is equal to a certain value. There are at least two possible methods of performing a one-sample hypothesis test on variability:
Perform a hypothesis test for the population variance based on the chi-squared statistic, assuming the underlying population is normal.
Perform a hypothesis test for any kind of measure of spread assuming any kind of underlying distribution based on a bootstrap confidence interval (using, for example, the package boot).
You can use varTest
for the first method.
Note: For a one-sample test of location, Student's t-test is fairly robust to departures from normality (i.e., the Type I error rate is maintained), as long as the sample size is reasonably "large." The chi-squared test on the population variance, however, is extremely sensitive to departures from normality. For example, if the underlying population is skewed, the actual Type I error rate will be larger than assumed.
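The arithmetic behind the chi-squared test and its confidence interval can be sketched as follows (a by-hand illustration using the same simulated data as the example below; it is not the varTest implementation):

set.seed(23)
dat <- rnorm(20, mean = 2, sd = 1)
n  <- length(dat)
s2 <- var(dat)
sigma0.sq <- 0.5
chi.sq <- (n - 1) * s2 / sigma0.sq     # test statistic, ~ chi-squared with n-1 df under H0
p.two.sided <- 2 * min(pchisq(chi.sq, df = n - 1),
                       pchisq(chi.sq, df = n - 1, lower.tail = FALSE))
ci <- (n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)   # 95% CI for the variance
chi.sq; p.two.sided; ci
# These values should match the statistic, p-value, and confidence limits
# reported by varTest() in the example below.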
Steven P. Millard ([email protected])
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# Generate 20 observations from a normal distribution with parameters # mean=2 and sd=1. Test the null hypothesis that the true variance is # equal to 0.5 against the alternative that the true variance is not # equal to 0.5. # (Note: the call to set.seed allows you to reproduce this example). set.seed(23) dat <- rnorm(20, mean = 2, sd = 1) varTest(dat, sigma.squared = 0.5) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: variance = 0.5 # #Alternative Hypothesis: True variance is not equal to 0.5 # #Test Name: Chi-Squared Test on Variance # #Estimated Parameter(s): variance = 0.753708 # #Data: dat # #Test Statistic: Chi-Squared = 28.64090 # #Test Statistic Parameter: df = 19 # #P-value: 0.1436947 # #95% Confidence Interval: LCL = 0.4359037 # UCL = 1.6078623 # Note that in this case we would not reject the # null hypothesis at the 5% or even the 10% level. # Clean up rm(dat)
Density, distribution function, quantile function, and random generation for the zero-modified lognormal distribution with parameters meanlog, sdlog, and p.zero.
The zero-modified lognormal (delta) distribution is the mixture of a lognormal distribution with a positive probability mass at 0.
dzmlnorm(x, meanlog = 0, sdlog = 1, p.zero = 0.5) pzmlnorm(q, meanlog = 0, sdlog = 1, p.zero = 0.5) qzmlnorm(p, meanlog = 0, sdlog = 1, p.zero = 0.5) rzmlnorm(n, meanlog = 0, sdlog = 1, p.zero = 0.5)
x | vector of quantiles.
q | vector of quantiles.
p | vector of probabilities between 0 and 1.
n | sample size. If length(n) is larger than 1, then length(n) random values are returned.
meanlog | vector of means of the normal (Gaussian) part of the distribution on the log scale. The default is meanlog=0.
sdlog | vector of (positive) standard deviations of the normal (Gaussian) part of the distribution on the log scale. The default is sdlog=1.
p.zero | vector of probabilities between 0 and 1 indicating the probability the random variable equals 0. For rzmlnorm this must be a single, non-missing number. The default is p.zero=0.5.
The zero-modified lognormal (delta) distribution is the mixture of a
lognormal distribution with a positive probability mass at 0. This distribution
was introduced without a name by Aitchison (1955), and the name Δ-distribution was coined by Aitchison and Brown (1957, p.95).
It is a special case of a “zero-modified” distribution
(see Johnson et al., 1992, p. 312).
Let f(x; μ, σ) denote the density of a lognormal random variable X with parameters meanlog=μ and sdlog=σ. The density function of a zero-modified lognormal (delta) random variable Y with parameters meanlog=μ, sdlog=σ, and p.zero=p, denoted h(y; μ, σ, p), is given by:

h(y; μ, σ, p) = p                      for y = 0

h(y; μ, σ, p) = (1 - p) f(y; μ, σ)     for y > 0
Note that μ is not the mean of the zero-modified lognormal distribution on the log scale; it is the mean of the lognormal part of the distribution on the log scale. Similarly, σ is not the standard deviation of the zero-modified lognormal distribution on the log scale; it is the standard deviation of the lognormal part of the distribution on the log scale.
Let γ and δ denote the mean and standard deviation of the overall zero-modified lognormal distribution on the log scale. Aitchison (1955) shows that:

γ = (1 - p) μ

δ² = (1 - p) σ² + p (1 - p) μ²
Note that when p.zero=p=0, the zero-modified lognormal distribution simplifies to the lognormal distribution.
dzmlnorm
gives the density, pzmlnorm
gives the distribution function,
qzmlnorm
gives the quantile function, and rzmlnorm
generates random
deviates.
The zero-modified lognormal (delta) distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit” (the nondetects are assumed equal to 0). See, for example, Gilliom and Helsel (1986), Owen and DeRouen (1980), and Gibbons et al. (2009, Chapter 12). USEPA (2009, Chapter 15) recommends this strategy only in specific situations, and Helsel (2012, Chapter 1) strongly discourages this approach to dealing with non-detects.
A variation of the zero-modified lognormal (delta) distribution is the zero-modified normal distribution, in which a normal distribution is mixed with a positive probability mass at 0.
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
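For example, a rough sketch of that comparison on simulated data (the data, detection limit, and argument choices below are illustrative assumptions; check the qqPlotCensored and qqPlot help files for the full argument lists):

# Simulate lognormal data and censor values below a detection limit of 1.
set.seed(47)
conc <- rlnorm(30, meanlog = 0, sdlog = 1)
censored <- conc < 1
conc[censored] <- 1   # nondetects reported at the detection limit

# Censored probability plot (all observations, censoring accounted for)
qqPlotCensored(conc, censored, distribution = "lnorm", add.line = TRUE)

# Detects-only probability plot (nondetects simply dropped)
qqPlot(conc[!censored], distribution = "lnorm", add.line = TRUE)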
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901-908.
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special reference to its uses in economics). Cambridge University Press, London. pp.94-99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, pp.47-51.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707-719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zero-Modified Lognormal (Alternative Parameterization),
Lognormal, LognormalAlt,
Zero-Modified Normal,
ezmlnorm
, Probability Distributions and Random Numbers.
# Density of the zero-modified lognormal (delta) distribution with # parameters meanlog=0, sdlog=1, and p.zero=0.5, evaluated at # 0, 0.5, 1, 1.5, and 2: dzmlnorm(seq(0, 2, by = 0.5)) #[1] 0.50000000 0.31374804 0.19947114 0.12248683 #[5] 0.07843701 #---------- # The cdf of the zero-modified lognormal (delta) distribution with # parameters meanlog=1, sdlog=2, and p.zero=0.1, evaluated at 4: pzmlnorm(4, 1, 2, .1) #[1] 0.6189203 #---------- # The median of the zero-modified lognormal (delta) distribution with # parameters meanlog=2, sdlog=3, and p.zero=0.1: qzmlnorm(0.5, 2, 3, 0.1) #[1] 4.859177 #---------- # Random sample of 3 observations from the zero-modified lognormal # (delta) distribution with parameters meanlog=1, sdlog=2, and p.zero=0.4. # (Note: The call to set.seed simply allows you to reproduce this example.) set.seed(20) rzmlnorm(3, 1, 2, 0.4) #[1] 0.000000 0.000000 3.146641
Density, distribution function, quantile function, and random generation for the zero-modified lognormal distribution with parameters mean, cv, and p.zero.
The zero-modified lognormal (delta) distribution is the mixture of a lognormal distribution with a positive probability mass at 0.
dzmlnormAlt(x, mean = exp(1/2), cv = sqrt(exp(1) - 1), p.zero = 0.5) pzmlnormAlt(q, mean = exp(1/2), cv = sqrt(exp(1) - 1), p.zero = 0.5) qzmlnormAlt(p, mean = exp(1/2), cv = sqrt(exp(1) - 1), p.zero = 0.5) rzmlnormAlt(n, mean = exp(1/2), cv = sqrt(exp(1) - 1), p.zero = 0.5)
x | vector of quantiles.
q | vector of quantiles.
p | vector of probabilities between 0 and 1.
n | sample size. If length(n) is larger than 1, then length(n) random values are returned.
mean | vector of means of the lognormal part of the distribution. The default is mean=exp(1/2).
cv | vector of (positive) coefficients of variation of the lognormal part of the distribution. The default is cv=sqrt(exp(1)-1).
p.zero | vector of probabilities between 0 and 1 indicating the probability the random variable equals 0. For rzmlnormAlt this must be a single, non-missing number. The default is p.zero=0.5.
The zero-modified lognormal (delta) distribution is the mixture of a
lognormal distribution with a positive probability mass at 0. This distribution
was introduced without a name by Aitchison (1955), and the name Δ-distribution was coined by Aitchison and Brown (1957, p.95).
It is a special case of a “zero-modified” distribution
(see Johnson et al., 1992, p. 312).
Let f(x; θ, τ) denote the density of a lognormal random variable X with parameters mean=θ and cv=τ. The density function of a zero-modified lognormal (delta) random variable Y with parameters mean=θ, cv=τ, and p.zero=p, denoted h(y; θ, τ, p), is given by:

h(y; θ, τ, p) = p                      for y = 0

h(y; θ, τ, p) = (1 - p) f(y; θ, τ)     for y > 0
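As with the first parameterization, the positive part of this density can be checked against dlnormAlt (an illustrative sketch only):

# Sketch: for x > 0 the density is (1 - p.zero) times the lognormal
# density in the mean/cv parameterization.
p <- 0.5
x <- c(5, 10, 20)
(1 - p) * dlnormAlt(x, mean = 10, cv = 1)
dzmlnormAlt(x, mean = 10, cv = 1, p.zero = 0.5)  # should match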
Note that θ is not the mean of the zero-modified lognormal distribution; it is the mean of the lognormal part of the distribution. Similarly, τ is not the coefficient of variation of the zero-modified lognormal distribution; it is the coefficient of variation of the lognormal part of the distribution.
Let γ, δ, and φ denote the mean, standard deviation, and coefficient of variation of the overall zero-modified lognormal distribution. Let σ denote the standard deviation of the lognormal part of the distribution, so that σ = θ τ. Aitchison (1955) shows that:

γ = (1 - p) θ

δ² = (1 - p) σ² + p (1 - p) θ²

so that

φ = δ / γ = sqrt(τ² + p) / sqrt(1 - p)
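A quick simulation-based check of these moment relations (a sketch with arbitrary parameter values):

# Compare the theoretical overall mean and CV with a large simulated sample.
set.seed(42)
theta <- 10; tau <- 1; p <- 0.3
y <- rzmlnormAlt(100000, mean = theta, cv = tau, p.zero = p)

mean.theo <- (1 - p) * theta
cv.theo   <- sqrt(tau^2 + p) / sqrt(1 - p)

c(mean.sim = mean(y), mean.theo = mean.theo)
c(cv.sim = sd(y) / mean(y), cv.theo = cv.theo)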
Note that when p.zero=p=0, the zero-modified lognormal distribution simplifies to the lognormal distribution.
dzmlnormAlt
gives the density, pzmlnormAlt
gives the distribution function,
qzmlnormAlt
gives the quantile function, and rzmlnormAlt
generates random
deviates.
The zero-modified lognormal (delta) distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit” (the nondetects are assumed equal to 0). See, for example, Gilliom and Helsel (1986), Owen and DeRouen (1980), and Gibbons et al. (2009, Chapter 12). USEPA (2009, Chapter 15) recommends this strategy only in specific situations, and Helsel (2012, Chapter 1) strongly discourages this approach to dealing with non-detects.
A variation of the zero-modified lognormal (delta) distribution is the zero-modified normal distribution, in which a normal distribution is mixed with a positive probability mass at 0.
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901-908.
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special reference to its uses in economics). Cambridge University Press, London. pp.94-99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, pp.47-51.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707-719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zero-Modified Lognormal, LognormalAlt,
ezmlnormAlt
, Probability Distributions and Random Numbers.
# Density of the zero-modified lognormal (delta) distribution with # parameters mean=10, cv=1, and p.zero=0.5, evaluated at # 9, 10, and 11: dzmlnormAlt(9:11, mean = 10, cv = 1, p.zero = 0.5) #[1] 0.02552685 0.02197043 0.01891924 #---------- # The cdf of the zero-modified lognormal (delta) distribution with # parameters mean=10, cv=2, and p.zero=0.1, evaluated at 8: pzmlnormAlt(8, 10, 2, .1) #[1] 0.709009 #---------- # The median of the zero-modified lognormal (delta) distribution with # parameters mean=10, cv=2, and p.zero=0.1: qzmlnormAlt(0.5, 10, 2, 0.1) #[1] 3.74576 #---------- # Random sample of 3 observations from the zero-modified lognormal # (delta) distribution with parameters mean=10, cv=2, and p.zero=0.4. # (Note: The call to set.seed simply allows you to reproduce this example.) set.seed(20) rzmlnormAlt(3, 10, 2, 0.4) #[1] 0.000000 0.000000 4.907131
Density, distribution function, quantile function, and random generation for the zero-modified normal distribution with parameters mean, sd, and p.zero.
The zero-modified normal distribution is the mixture of a normal distribution with a positive probability mass at 0.
dzmnorm(x, mean = 0, sd = 1, p.zero = 0.5) pzmnorm(q, mean = 0, sd = 1, p.zero = 0.5) qzmnorm(p, mean = 0, sd = 1, p.zero = 0.5) rzmnorm(n, mean = 0, sd = 1, p.zero = 0.5)
x | vector of quantiles.
q | vector of quantiles.
p | vector of probabilities between 0 and 1.
n | sample size. If length(n) is larger than 1, then length(n) random values are returned.
mean | vector of means of the normal (Gaussian) part of the distribution. The default is mean=0.
sd | vector of (positive) standard deviations of the normal (Gaussian) part of the distribution. The default is sd=1.
p.zero | vector of probabilities between 0 and 1 indicating the probability the random variable equals 0. For rzmnorm this must be a single, non-missing number. The default is p.zero=0.5.
The zero-modified normal distribution is the mixture of a normal distribution with a positive probability mass at 0.
Let f(x; μ, σ) denote the density of a normal (Gaussian) random variable X with parameters mean=μ and sd=σ. The density function of a zero-modified normal random variable Y with parameters mean=μ, sd=σ, and p.zero=p, denoted h(y; μ, σ, p), is given by:

h(y; μ, σ, p) = p                      for y = 0

h(y; μ, σ, p) = (1 - p) f(y; μ, σ)     for y ≠ 0
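An illustrative check of this definition (and of the corresponding distribution function) in terms of the ordinary normal distribution:

# Sketch: density is p.zero at 0 and (1 - p.zero) * dnorm elsewhere;
# the cdf adds the point mass at 0 once the quantile is >= 0.
p <- 0.5; mu <- 2; s <- 1
x <- c(-1, 0, 1, 2)

ifelse(x == 0, p, (1 - p) * dnorm(x, mu, s))
dzmnorm(x, mean = mu, sd = s, p.zero = p)    # should match

(1 - p) * pnorm(1, mu, s) + p
pzmnorm(1, mean = mu, sd = s, p.zero = p)    # should match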
Note that μ is not the mean of the zero-modified normal distribution; it is the mean of the normal part of the distribution. Similarly, σ is not the standard deviation of the zero-modified normal distribution; it is the standard deviation of the normal part of the distribution.
Let γ and δ denote the mean and standard deviation of the overall zero-modified normal distribution. Aitchison (1955) shows that:

γ = E(Y) = (1 - p) μ

δ² = Var(Y) = (1 - p) σ² + p (1 - p) μ²
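These relations are easy to verify by simulation (a sketch with arbitrary parameter values):

# Compare the theoretical overall mean and variance with a simulated sample.
set.seed(123)
mu <- 3; s <- 1; p <- 0.4
y <- rzmnorm(100000, mean = mu, sd = s, p.zero = p)

mean.theo <- (1 - p) * mu
var.theo  <- (1 - p) * s^2 + p * (1 - p) * mu^2

c(mean.sim = mean(y), mean.theo = mean.theo)
c(var.sim = var(y), var.theo = var.theo)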
Note that when p.zero=p=0, the zero-modified normal distribution simplifies to the normal distribution.
dzmnorm
gives the density, pzmnorm
gives the distribution function,
qzmnorm
gives the quantile function, and rzmnorm
generates random
deviates.
The zero-modified normal distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit”. See, for example USEPA (1992c, pp.27-34) and Gibbons et al. (2009, Chapter 12). Note, however, that USEPA (1992c) has been superseded by USEPA (2009) which recommends this strategy only in specific situations (see Chapter 15 of the document). This strategy is strongly discouraged by Helsel (2012, Chapter 1).
In cases where you want to model chemical concentrations for which some observations are reported as “Below Detection Limit” and you want to treat the non-detects as equal to 0, it will usually be more appropriate to model the data with a zero-modified lognormal (delta) distribution since chemical concentrations are bounded below at 0 (e.g., Gilliom and Helsel, 1986; Owen and DeRouen, 1980).
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901-908.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707-719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zero-Modified Lognormal, Normal,
ezmnorm
, Probability Distributions and Random Numbers.
# Density of the zero-modified normal distribution with parameters # mean=2, sd=1, and p.zero=0.5, evaluated at 0, 0.5, 1, 1.5, and 2: dzmnorm(seq(0, 2, by = 0.5), mean = 2) #[1] 0.5000000 0.0647588 0.1209854 0.1760327 0.1994711 #---------- # The cdf of the zero-modified normal distribution with parameters # mean=3, sd=2, and p.zero=0.1, evaluated at 4: pzmnorm(4, 3, 2, .1) #[1] 0.7223162 #---------- # The median of the zero-modified normal distribution with parameters # mean=3, sd=1, and p.zero=0.1: qzmnorm(0.5, 3, 1, 0.1) #[1] 2.86029 #---------- # Random sample of 3 observations from the zero-modified normal distribution # with parameters mean=3, sd=1, and p.zero=0.4. # (Note: The call to set.seed simply allows you to reproduce this example.) set.seed(20) rzmnorm(3, 3, 1, 0.4) #[1] 0.000000 0.000000 3.073168
Estimate the shape parameter of a generalized extreme value distribution and test the null hypothesis that the true value is equal to 0.
zTestGevdShape(x, pwme.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), alternative = "two.sided")
x | numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
pwme.method | character string specifying the method of estimating the probability-weighted moments. Possible values are "unbiased" (method based on the U-statistic; the default) and "plotting.position". See the help file for egevd for more information.
plot.pos.cons | numeric vector of length 2 specifying the constants used in the formula for the plotting positions when pwme.method="plotting.position". The default value is plot.pos.cons=c(a=0.35, b=0).
alternative | character string indicating the kind of alternative hypothesis. The possible values are "two.sided" (the default), "greater", and "less".
Let x = (x1, x2, ..., xn) be a vector of n observations from a generalized extreme value distribution with parameters location=η, scale=θ, and shape=κ. Furthermore, let κ̂ denote the probability-weighted moments estimator (PWME) of the shape parameter κ (see the help file for egevd). Then the statistic

z = κ̂ / sqrt(0.5633 / n)        (1)

is asymptotically distributed as a N(0,1) random variable under the null hypothesis H0: κ = 0 (Hosking et al., 1985). The function zTestGevdShape performs the usual one-sample z-test using the statistic computed in Equation (1).
The PWME of κ may be computed using either U-statistic type probability-weighted moments estimators or plotting-position type estimators (see egevd). Although Hosking et al. (1985) base their statistic on plotting-position type estimators, Hosking and Wallis (1995) recommend using the U-statistic type estimators for almost all applications.
This test is only asymptotically correct. Hosking et al. (1985), however, found that the α-level is adequately maintained for samples as small as n = 25.
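The statistic in Equation (1) can also be computed by hand from the PWME returned by egevd. The sketch below assumes the asymptotic variance constant 0.5633/n shown above and that egevd called with method="pwme" stores the shape estimate in its parameters component; it is meant only to illustrate the construction of the test, not to replace zTestGevdShape.

# Sketch: compute the kappa-test statistic by hand and compare with
# zTestGevdShape (same data as the example below).
set.seed(250)
dat <- rgevd(25, location = 2, scale = 1, shape = 1)

kappa.hat <- egevd(dat, method = "pwme")$parameters["shape"]
n <- length(dat)
z <- kappa.hat / sqrt(0.5633 / n)
z                                # by-hand test statistic
2 * pnorm(-abs(z))               # two-sided p-value

zTestGevdShape(dat)$statistic    # should agree (up to estimator options)
rm(dat, kappa.hat, n, z)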
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
Two-parameter extreme value distributions (EVD) have been applied extensively since the 1930s to several fields of study, including the distributions of hydrological and meteorological variables, human lifetimes, and strength of materials. The three-parameter generalized extreme value distribution (GEVD) was introduced by Jenkinson (1955) to model annual maximum and minimum values of meteorological events. Since then, it has been used extensively in the hydrological and meteorological fields.
The three families of EVDs are all special kinds of GEVDs. When the shape parameter κ = 0, the GEVD reduces to the Type I extreme value (Gumbel) distribution. When κ < 0, the GEVD is the same as the Type II extreme value distribution, and when κ > 0 it is the same as the Type III extreme value distribution.
Hosking et al. (1985) introduced the test used by the function zTestGevdShape to test the null hypothesis H0: κ = 0. They found this test has power comparable to that of the modified likelihood-ratio test, which was found by Hosking (1984) to be the best overall test among the thirteen tests he considered.
Fill and Stedinger (1995) denote this test the “kappa test” and compare it
with the L-Cs test suggested by Chowdhury et al. (1991) and the probability
plot correlation coefficient goodness-of-fit test for the Gumbel distribution given
by Vogel (1986) (see the sub-section for test="ppcc"
under the Details section
of the help file for gofTest
).
Steven P. Millard ([email protected])
Chowdhury, J.U., J.R. Stedinger, and L. H. Lu. (1991). Goodness-of-Fit Tests for Regional Generalized Extreme Value Flood Distributions. Water Resources Research 27(7), 1765–1776.
Fill, H.D., and J.R. Stedinger. (1995). L Moment and Probability Plot Correlation Coefficient Goodness-of-Fit Tests for the Gumbel Distribution and Impact of Autocorrelation. Water Resources Research 31(1), 225–229.
Hosking, J.R.M. (1984). Testing Whether the Shape Parameter is Zero in the Generalized Extreme-Value Distribution. Biometrika 71(2), 367–374.
Hosking, J.R.M., and J.R. Wallis (1995). A Comparison of Unbiased and Plotting-Position Estimators of L Moments. Water Resources Research 31(8), 2019–2025.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Jenkinson, A.F. (1955). The Frequency Distribution of the Annual Maximum (or Minimum) of Meteorological Events. Quarterly Journal of the Royal Meteorological Society 81, 158–171.
Vogel, R.M. (1986). The Probability Plot Correlation Coefficient Test for the Normal, Lognormal, and Gumbel Distributional Hypotheses. Water Resources Research 22(4), 587–590. (Correction, Water Resources Research 23(10), 2013, 1987.)
GEVD, egevd
, EVD, eevd
,
Goodness-of-Fit Tests, htest.object
.
# Generate 25 observations from a generalized extreme value distribution with # parameters location=2, scale=1, and shape=1, and test the null hypothesis # that the shape parameter is equal to 0. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgevd(25, location = 2, scale = 1, shape = 1) zTestGevdShape(dat) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: shape = 0 # #Alternative Hypothesis: True shape is not equal to 0 # #Test Name: Z-test of shape=0 for GEVD # #Estimated Parameter(s): shape = 0.6623014 # #Estimation Method: Unbiased pwme # #Data: dat # #Sample Size: 25 # #Test Statistic: z = 4.412206 # #P-value: 1.023225e-05 #---------- # Clean up #--------- rm(dat)