Title: Package for Environmental Statistics, Including US EPA Guidance
Description: Graphical and statistical analyses of environmental data, with focus on analyzing chemical concentrations and physical parameters, usually in the context of mandated environmental monitoring. Major environmental statistical methods found in the literature and regulatory guidance documents, with extensive help that explains what these methods do, how to use them, and where to find them in the literature. Numerous built-in data sets from regulatory guidance documents and environmental statistics literature. Includes scripts reproducing analyses presented in the book "EnvStats: An R Package for Environmental Statistics" (Millard, 2013, Springer, ISBN 978-1-4614-8455-4, <doi:10.1007/978-1-4614-8456-1>).
Authors: Steven P. Millard [aut], Alexander Kowarik [ctb, cre]
Maintainer: Alexander Kowarik <[email protected]>
License: GPL (>= 3)
Version: 3.0.0
Built: 2024-10-28 05:36:47 UTC
Source: https://github.com/alexkowa/envstats
Trichloroethylene (TCE) concentrations (mg/L) at 10 groundwater monitoring wells before and after remediation.
data(ACE.13.TCE.df)
A data frame with 20 observations on the following 3 variables.

TCE.mg.per.L: TCE concentrations (mg/L)
Well: a factor indicating the well number
Period: a factor indicating the period (before vs. after remediation)
USACE. (2013). Environmental Quality - Environmental Statistics. Engineer Manual EM 200-1-16, 31 May 2013. Department of the Army, U.S. Army Corps of Engineers, Washington, D.C. 20314-1000, p. M-10. https://www.publications.usace.army.mil/Portals/76/Publications/EngineerManuals/EM_200-1-16.pdf.
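A minimal usage sketch (not part of the original help file): it assumes the rows are ordered consistently by well within each period so that a paired comparison lines up, and a nonparametric paired test is used purely for illustration.

data(ACE.13.TCE.df)
# Split the TCE concentrations by remediation period and compare with a
# paired Wilcoxon signed rank test (one value per well in each period).
tce.by.period <- split(ACE.13.TCE.df$TCE.mg.per.L, ACE.13.TCE.df$Period)
wilcox.test(tce.by.period[[1]], tce.by.period[[2]], paired = TRUE)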
Compute a lack-of-fit and pure error anova table for either a linear model with one predictor variable or else a linear model for which all predictor variables in the model are functions of a single variable (for example, x, x^2, etc.). There must be replicate observations for at least one value of the predictor variable(s).
anovaPE(object)
object: an object of class "lm" fitted with a single predictor variable, or with predictor variables that are all functions of a single variable (e.g., x, x^2). There must be replicate observations for at least one value of the predictor variable(s), and the total number of observations must be such that the degrees of freedom associated with the residual sums of squares is greater than the number of observations minus the number of unique observations.
Produces an anova table with the sums of squares partitioned into “Lack of Fit” and “Pure Error” components. See Draper and Smith (1998, pp. 47-53) for details. This function is called by the function calibrate.
An object of class "anova" inheriting from class "data.frame".
Steven P. Millard ([email protected])
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
# The data frame EPA.97.cadmium.111.df contains calibration data for # cadmium at mass 111 (ng/L) that appeared in Gibbons et al. (1997b) # and were provided to them by the U.S. EPA. # # First, display a plot of these data along with the fitted calibration line # and 99% non-simultaneous prediction limits. See # Millard and Neerchal (2001, pp.566-569) for more details on this # example. EPA.97.cadmium.111.df # Cadmium Spike #1 0.88 0 #2 1.57 0 #3 0.70 0 #... #33 99.20 100 #34 93.71 100 #35 100.43 100 Cadmium <- EPA.97.cadmium.111.df$Cadmium Spike <- EPA.97.cadmium.111.df$Spike calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) newdata <- data.frame(Spike = seq(min(Spike), max(Spike), length.out = 100)) pred.list <- predict(calibrate.list, newdata = newdata, se.fit = TRUE) pointwise.list <- pointwise(pred.list, coverage = 0.99, individual = TRUE) plot(Spike, Cadmium, ylim = c(min(pointwise.list$lower), max(pointwise.list$upper)), xlab = "True Concentration (ng/L)", ylab = "Observed Concentration (ng/L)") abline(calibrate.list, lwd = 2) lines(newdata$Spike, pointwise.list$lower, lty = 8, lwd = 2) lines(newdata$Spike, pointwise.list$upper, lty = 8, lwd = 2) title(paste("Calibration Line and 99% Prediction Limits", "for US EPA Cadmium 111 Data", sep="\n")) rm(Cadmium, Spike, newdata, calibrate.list, pred.list, pointwise.list) #---------- # Now fit the linear model and produce the anova table to check for # lack of fit. There is no evidence for lack of fit (p = 0.41). fit <- lm(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) anova(fit) #Analysis of Variance Table # #Response: Cadmium # Df Sum Sq Mean Sq F value Pr(>F) #Spike 1 43220 43220 9356.9 < 2.2e-16 *** #Residuals 33 152 5 #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #Analysis of Variance Table # #Response: Cadmium # #Terms added sequentially (first to last) # Df Sum of Sq Mean Sq F Value Pr(F) # Spike 1 43220.27 43220.27 9356.879 0 #Residuals 33 152.43 4.62 anovaPE(fit) # Df Sum Sq Mean Sq F value Pr(>F) #Spike 1 43220 43220 9341.559 <2e-16 *** #Lack of Fit 3 14 5 0.982 0.4144 #Pure Error 30 139 5 #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 rm(fit)
Compute the sample sizes necessary to achieve a specified power for a one-way fixed-effects analysis of variance test, given the population means, population standard deviation, and significance level.
aovN(mu.vec, sigma = 1, alpha = 0.05, power = 0.95, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
mu.vec: required numeric vector of population means. The length of mu.vec determines the number of groups.
sigma: optional numeric scalar specifying the population standard deviation for each group. The default value is sigma=1.
alpha: optional numeric scalar between 0 and 1 indicating the Type I error level associated with the hypothesis test. The default value is alpha=0.05.
power: optional numeric scalar between 0 and 1 indicating the power associated with the hypothesis test. The default value is power=0.95.
round.up: optional logical scalar indicating whether to round the computed sample size up to the next integer. The default value is round.up=TRUE.
n.max: positive integer greater than 1 indicating the maximum sample size per group. The default value is n.max=5000.
tol: optional numeric scalar indicating the tolerance to use in the search algorithm. The default value is tol=1e-07.
maxiter: optional positive integer indicating the maximum number of iterations to use in the search algorithm. The default value is maxiter=1000.
The F-statistic to test the equality of k population means, assuming each population has a normal distribution with the same standard deviation sigma, is presented in most basic statistics texts, including Zar (2010, Chapter 10), Berthouex and Brown (2002, Chapter 24), and Helsel and Hirsch (1992, pp. 164-169). The formula for the power of this test is given in Scheffe (1959, pp. 38-39, 62-65). The power of the one-way fixed-effects ANOVA depends on the sample sizes for each of the k groups, the values of the population means for each of the k groups, the population standard deviation sigma, and the significance level alpha. See the help file for aovPower.

The function aovN assumes equal sample sizes for each of the k groups and uses a search algorithm to determine the sample size n required to attain a specified power, given the values of the population means, the population standard deviation, and the significance level.

A numeric scalar indicating the required sample size for each group. (The number of groups is equal to the length of the argument mu.vec.)
The normal and lognormal distributions are probably the two most frequently used distributions to model environmental data. Sometimes it is necessary to compare several means to determine whether any are significantly different from each other (e.g., USEPA, 2009, p.6-38). In this case, assuming normally distributed data, you perform a one-way parametric analysis of variance.
In the course of designing a sampling program, an environmental
scientist may wish to determine the relationship between sample
size, Type I error level, power, and differences in means if
one of the objectives of the sampling program is to determine
whether a particular mean differs from a group of means. The
functions aovPower
, aovN
, and
plotAovDesign
can be used to investigate these
relationships for the case of normally-distributed observations.
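As a quick consistency check (this snippet is not part of the original help file), the sample size returned by aovN should achieve at least the requested power when passed back to aovPower:

n <- aovN(mu.vec = c(10, 12, 15), sigma = 5, power = 0.9)
n
#[1] 27
aovPower(n.vec = rep(n, 3), mu.vec = c(10, 12, 15), sigma = 5)
# Should be at least 0.9, since the sample size is rounded up by default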
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 27, 29, 30.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Scheffe, H. (1959). The Analysis of Variance. John Wiley and Sons, New York, 477pp.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 10.
aovPower, plotAovDesign, Normal, aov.
# Look at how the required sample size for a one-way ANOVA # increases with increasing power: aovN(mu.vec = c(10, 12, 15), sigma = 5, power = 0.8) #[1] 21 aovN(mu.vec = c(10, 12, 15), sigma = 5, power = 0.9) #[1] 27 aovN(mu.vec = c(10, 12, 15), sigma = 5, power = 0.95) #[1] 33 #---------------------------------------------------------------- # Look at how the required sample size for a one-way ANOVA, # given a fixed power, decreases with increasing variability # in the population means: aovN(mu.vec = c(10, 10, 11), sigma=5) #[1] 581 aovN(mu.vec = c(10, 10, 15), sigma = 5) #[1] 25 aovN(mu.vec = c(10, 13, 15), sigma = 5) #[1] 33 aovN(mu.vec = c(10, 15, 20), sigma = 5) #[1] 10 #---------------------------------------------------------------- # Look at how the required sample size for a one-way ANOVA, # given a fixed power, decreases with increasing values of # Type I error: aovN(mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.001) #[1] 89 aovN(mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.01) #[1] 67 aovN(mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.05) #[1] 50 aovN(mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.1) #[1] 42
Compute the power of a one-way fixed-effects analysis of variance, given the sample sizes, population means, population standard deviation, and significance level.
aovPower(n.vec, mu.vec = rep(0, length(n.vec)), sigma = 1, alpha = 0.05)
n.vec: numeric vector of sample sizes, one element for each group. The length of n.vec determines the number of groups.
mu.vec: numeric vector of population means. The length of mu.vec must equal the length of n.vec. The default value is a vector of zeros (mu.vec = rep(0, length(n.vec))).
sigma: numeric scalar specifying the population standard deviation for each group. The default value is sigma=1.
alpha: numeric scalar between 0 and 1 indicating the Type I error level associated with the hypothesis test. The default value is alpha=0.05.
Consider k normally distributed populations with common standard deviation sigma. Let mu_i denote the mean of the i'th group (i = 1, 2, ..., k), and let x_{i1}, x_{i2}, ..., x_{in_i} denote the n_i observations from the i'th group. The statistical method of analysis of variance (ANOVA) tests the null hypothesis:

H_0: \mu_1 = \mu_2 = \cdots = \mu_k    (1)

against the alternative hypothesis that at least one of the means is different from the rest by using the F-statistic given by:

F = \frac{SS_B / (k - 1)}{SS_W / (N - k)}    (2)

where

SS_B = \sum_{i=1}^{k} n_i (\bar{x}_{i.} - \bar{x}_{..})^2    (3)

SS_W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i.})^2    (4)

\bar{x}_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}, \quad \bar{x}_{..} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}, \quad N = \sum_{i=1}^{k} n_i    (5)

Under the null hypothesis (1), the F-statistic in (2) follows an F-distribution with k-1 and N-k degrees of freedom. Analysis of variance rejects the null hypothesis (1) at significance level alpha when

F > F_{k-1, N-k}(1 - \alpha)    (6)

where F_{\nu_1, \nu_2}(p) denotes the p'th quantile of the F-distribution with nu_1 and nu_2 degrees of freedom (Zar, 2010, Chapter 10; Berthouex and Brown, 2002, Chapter 24; Helsel and Hirsch, 1992, pp. 164-169).

The power of this test, denoted by 1-beta, where beta denotes the probability of a Type II error, is given by:

1 - \beta = \Pr[F_{k-1, N-k, \Delta} > F_{k-1, N-k}(1 - \alpha)]    (7)

where

\Delta = \frac{1}{\sigma^2} \sum_{i=1}^{k} n_i (\mu_i - \bar{\mu})^2, \quad \bar{\mu} = \frac{1}{N} \sum_{i=1}^{k} n_i \mu_i    (8)

and F_{k-1, N-k, \Delta} denotes a non-central F random variable with k-1 and N-k degrees of freedom and non-centrality parameter Delta. Equation (7) can be re-written as:

1 - \beta = 1 - H[F_{k-1, N-k}(1 - \alpha); k-1, N-k, \Delta]    (9)

where H[x; \nu_1, \nu_2, \Delta] denotes the cumulative distribution function of this non-central F random variable evaluated at x (Scheffe, 1959, pp. 38-39, 62-65).

The power of the one-way fixed-effects ANOVA depends on the sample sizes for each of the k groups, the values of the population means for each of the k groups, the population standard deviation sigma, and the significance level alpha.
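The power computation in Equations (7)-(9) can be reproduced with the F-distribution functions in base R. The following sketch is not part of the original help file; it simply evaluates Equation (9) for one set of inputs and should agree, up to rounding, with the corresponding call to aovPower shown in the examples below (which returns 0.7015083).

# Direct evaluation of Equation (9), with Delta as defined in Equation (8)
n.vec  <- rep(5, 3)
mu.vec <- c(10, 15, 20)
sigma  <- 5
alpha  <- 0.05
k <- length(n.vec)
N <- sum(n.vec)
mu.bar <- sum(n.vec * mu.vec) / N
Delta  <- sum(n.vec * (mu.vec - mu.bar)^2) / sigma^2
F.crit <- qf(1 - alpha, df1 = k - 1, df2 = N - k)
1 - pf(F.crit, df1 = k - 1, df2 = N - k, ncp = Delta)
# Compare with aovPower(n.vec, mu.vec, sigma = sigma)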
a numeric scalar indicating the power of the one-way fixed-effects ANOVA for the given sample sizes, population means, population standard deviation, and significance level.
The normal and lognormal distributions are probably the two most frequently used distributions to model environmental data. Sometimes it is necessary to compare several means to determine whether any are significantly different from each other (e.g., USEPA, 2009, p.6-38). In this case, assuming normally distributed data, you perform a one-way parametric analysis of variance.
In the course of designing a sampling program, an environmental
scientist may wish to determine the relationship between sample
size, Type I error level, power, and differences in means if
one of the objectives of the sampling program is to determine
whether a particular mean differs from a group of means. The
functions aovPower
, aovN
, and
plotAovDesign
can be used to investigate these
relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 27, 29, 30.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Scheffe, H. (1959). The Analysis of Variance. John Wiley and Sons, New York, 477pp.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 10.
aovN, plotAovDesign, Normal, aov.
# Look at how the power of a one-way ANOVA increases # with increasing sample size: aovPower(n.vec = rep(5, 3), mu.vec = c(10, 15, 20), sigma = 5) #[1] 0.7015083 aovPower(n.vec = rep(10, 3), mu.vec = c(10, 15, 20), sigma = 5) #[1] 0.9732551 #---------------------------------------------------------------- # Look at how the power of a one-way ANOVA increases # with increasing variability in the population means: aovPower(n.vec = rep(5,3), mu.vec = c(10, 10, 11), sigma=5) #[1] 0.05795739 aovPower(n.vec = rep(5, 3), mu.vec = c(10, 10, 15), sigma = 5) #[1] 0.2831863 aovPower(n.vec = rep(5, 3), mu.vec = c(10, 13, 15), sigma = 5) #[1] 0.2236093 aovPower(n.vec = rep(5, 3), mu.vec = c(10, 15, 20), sigma = 5) #[1] 0.7015083 #---------------------------------------------------------------- # Look at how the power of a one-way ANOVA increases # with increasing values of Type I error: aovPower(n.vec = rep(10,3), mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.001) #[1] 0.02655785 aovPower(n.vec = rep(10,3), mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.01) #[1] 0.1223527 aovPower(n.vec = rep(10,3), mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.05) #[1] 0.3085313 aovPower(n.vec = rep(10,3), mu.vec = c(10, 12, 14), sigma = 5, alpha = 0.1) #[1] 0.4373292 #========== # The example on pages 5-11 to 5-14 of USEPA (1989b) shows # log-transformed concentrations of lead (mg/L) at two # background wells and four compliance wells, where observations # were taken once per month over four months (the data are # stored in EPA.89b.loglead.df.) Assume the true mean levels # at each well are 3.9, 3.9, 4.5, 4.5, 4.5, and 5, respectively. # Compute the power of a one-way ANOVA to test for mean # differences between wells. Use alpha=0.05, and assume the # true standard deviation is equal to the one estimated from # the data in this example. # First look at the data names(EPA.89b.loglead.df) #[1] "LogLead" "Month" "Well" "Well.type" dev.new() stripChart(LogLead ~ Well, data = EPA.89b.loglead.df, show.ci = FALSE, xlab = "Well Number", ylab="Log [ Lead (ug/L) ]", main="Lead Concentrations at Six Wells") # Note: The assumption of a constant variance across # all wells is suspect. # Now perform the ANOVA and get the estimated sd aov.list <- aov(LogLead ~ Well, data=EPA.89b.loglead.df) summary(aov.list) # Df Sum Sq Mean Sq F value Pr(>F) #Well 5 5.7447 1.14895 3.3469 0.02599 * #Residuals 18 6.1791 0.34328 #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '' 1 # Now call the function aovPower aovPower(n.vec = rep(4, 6), mu.vec = c(3.9,3.9,4.5,4.5,4.5,5), sigma=sqrt(0.34)) #[1] 0.5523148 # Clean up rm(aov.list)
Representation of a Number
For any number represented in base 10, compute the representation in any user-specified base.
base(n, base = 10, num.digits = max(0, floor(log(n, base))) + 1)
n: a non-negative integer (base 10).
base: a positive integer greater than 1 indicating what base to represent n in. The default value is base=10.
num.digits: a positive integer indicating how many digits to use to represent n in base base. The default value is the number of digits required (num.digits = max(0, floor(log(n, base))) + 1).
If b is a positive integer greater than 1, and n is a positive integer, then n can be expressed uniquely in the form

n = a_k b^k + a_{k-1} b^{k-1} + \cdots + a_1 b + a_0    (1)

where k is a non-negative integer, the coefficients a_0, a_1, ..., a_k are non-negative integers less than b, and a_k \ne 0 (Rosen, 1988, p.105). The function base computes the coefficients a_0, a_1, ..., a_k.
A numeric vector of length num.digits showing the representation of n in base base.
The function base is included in EnvStats because it is called by the function oneSamplePermutationTest.
Steven P. Millard ([email protected])
Rosen, K.H. (1988). Discrete Mathematics and Its Applications. Random House, New York, pp.105-107.
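As a quick check of Equation (1) (this snippet is not part of the original help file): the digits returned by base are ordered from most significant to least significant, so they can be recombined using powers of the base.

digits <- base(7, 2, num.digits = 5)
digits
#[1] 0 0 1 1 1
sum(digits * 2^((length(digits) - 1):0))
#[1] 7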
# Compute the value of 7 in base 2.

base(7, 2)
#[1] 1 1 1

base(7, 2, num.digits = 5)
#[1] 0 0 1 1 1
Lead (Pb) concentrations (mg/kg) in 29 discrete environmental soil samples from a site suspected to be contaminated with lead.
Beal.2010.Pb.df
data(Beal.2010.Pb.df)
A data frame with 29 observations on the following 3 variables.

Pb.char: character vector of lead concentrations, with nondetects indicated by a less-than sign (e.g., <1)
Pb: numeric vector of lead concentrations
Censored: logical vector indicating censoring status
Beal, D. (2010). A Macro for Calculating Summary Statistics on Left Censored Environmental Data Using the Kaplan-Meier Method. Paper SDA-09, presented at Southeast SAS Users Group 2010, September 26-28, Savannah, GA. https://analytics.ncsu.edu/sesug/2010/SDA09.Beal.pdf.
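A minimal usage sketch (not part of the original help file), summarizing the censored lead concentrations with the Kaplan-Meier based estimator discussed in the source paper; the choice of the EnvStats function enparCensored here is an assumption made for illustration.

data(Beal.2010.Pb.df)
# Nonparametric (Kaplan-Meier) estimates of the mean and standard deviation,
# treating the values flagged in Censored as left-censored at the reported limit.
with(Beal.2010.Pb.df, enparCensored(x = Pb, censored = Censored))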
Benthic data from a monitoring program in the Chesapeake Bay, Maryland, covering July 1984 - December 1991.
Benthic.df
A data frame with 585 observations on the following 7 variables.

Site.ID: site ID
Stratum: stratum number (101-131)
Latitude: latitude (degrees North)
Longitude: longitude (negative values; degrees West)
Index: benthic index (between 1 and 5)
Salinity: salinity (ppt)
Silt: silt content (% clay in soil)
Data from the Long Term Benthic Monitoring Program of the Chesapeake Bay. The data consist of measurements of benthic characteristics and a computed index of benthic health for several locations in the bay. Sampling methods and designs of the program are discussed in Ranasinghe et al. (1992).
The data represent observations collected at 585 separate point locations (sites). The sites are divided into 31 different strata, numbered 101 through 131, each stratum consisting of geographically close sites with similar degradation conditions. The benthic index values range from 1 to 5 on a continuous scale, where high values correspond to healthier benthos. Salinity was measured in parts per thousand (ppt), and silt content is expressed as a percentage of clay in the soil with high numbers corresponding to muddy areas.
The United States Environmental Protection Agency (USEPA) established an initiative for the Chesapeake Bay in partnership with the states bordering the bay in 1984. The goal of the initiative is the restoration (abundance, health, and diversity) of living resources to the bay by reducing nutrient loadings, reducing toxic chemical impacts, and enhancing habitats. USEPA's Chesapeake Bay Program Office is responsible for implementing this initiative and has established an extensive monitoring program that includes traditional water chemistry sampling, as well as collecting data on living resources to measure progress towards meeting the restoration goals.
Sampling benthic invertebrate assemblages has been an integral part of the Chesapeake Bay monitoring program due to their ecological importance and their value as biological indicators. The condition of benthic assemblages is a measure of the ecological health of the bay, including the effects of multiple types of environmental stresses. Nevertheless, regional-scale assessments of ecological status and trends using benthic assemblages are limited by the fact that benthic assemblages are strongly influenced by naturally variable habitat elements, such as salinity, sediment type, and depth. Also, different state agencies and USEPA programs use different sampling methodologies, limiting the ability to integrate data into a unified assessment. To circumvent these limitations, USEPA has standardized benthic data from several different monitoring programs into a single database, and from that database developed a Restoration Goals Benthic Index that identifies whether benthic restoration goals are being met.
Ranasinghe, J.A., L.C. Scott, and R. Newport. (1992). Long-term Benthic Monitoring and Assessment Program for the Maryland Portion of the Bay, Jul 1984-Dec 1991. Report prepared for the Maryland Department of the Environment and the Maryland Department of Natural Resources by Versar, Inc., Columbia, MD.
attach(Benthic.df) # Show station locations #----------------------- dev.new() plot(Longitude, Latitude, xlab = "-Longitude (Degrees West)", ylab = "Latitude", main = "Sampling Station Locations") # Scatterplot matrix of benthic index, salinity, and silt #-------------------------------------------------------- dev.new() pairs(~ Index + Salinity + Silt, data = Benthic.df) # Contour and perspective plots based on loess fit # showing only predicted values within the convex hull # of station locations #----------------------------------------------------- library(sp) loess.fit <- loess(Index ~ Longitude * Latitude, data=Benthic.df, normalize=FALSE, span=0.25) lat <- Benthic.df$Latitude lon <- Benthic.df$Longitude Latitude <- seq(min(lat), max(lat), length=50) Longitude <- seq(min(lon), max(lon), length=50) predict.list <- list(Longitude=Longitude, Latitude=Latitude) predict.grid <- expand.grid(predict.list) predict.fit <- predict(loess.fit, predict.grid) index.chull <- chull(lon, lat) inside <- point.in.polygon(point.x = predict.grid$Longitude, point.y = predict.grid$Latitude, pol.x = lon[index.chull], pol.y = lat[index.chull]) predict.fit[inside == 0] <- NA dev.new() contour(Longitude, Latitude, predict.fit, levels=seq(1, 5, by=0.5), labcex=0.75, xlab="-Longitude (degrees West)", ylab="Latitude (degrees North)") title(main=paste("Contour Plot of Benthic Index", "Based on Loess Smooth", sep="\n")) dev.new() persp(Longitude, Latitude, predict.fit, xlim = c(-77.3, -75.9), ylim = c(38.1, 39.5), zlim = c(0, 6), theta = -45, phi = 30, d = 0.5, xlab="-Longitude (degrees West)", ylab="Latitude (degrees North)", zlab="Benthic Index", ticktype = "detailed") title(main=paste("Surface Plot of Benthic Index", "Based on Loess Smooth", sep="\n")) detach("Benthic.df") rm(loess.fit, lat, lon, Latitude, Longitude, predict.list, predict.grid, predict.fit, index.chull, inside)
Analyte concentrations (µg/g) in 11 discrete environmental soil samples.
BJC.2000.df
data(BJC.2000.df)
A data frame with 11 observations on the following 4 variables.

Analyte.char: character vector of analyte concentrations, with nondetects indicated by the letter U after the value (e.g., 0.10U)
Analyte: numeric vector of analyte concentrations
Censored: logical vector indicating censoring status
Detect: numeric vector of 0s (nondetects) and 1s (detects) indicating censoring status
BJC. (2000). Improved Methods for Calculating Concentrations Used in Exposure Assessments. BJC/OR-416, Prepared by the Lockheed Martin Energy Research Corporation. Prepared for the U.S. Department of Energy Office of Environmental Management. Bechtel Jacobs Company, LLC. January, 2000. https://rais.ornl.gov/documents/bjc_or416.pdf.
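A minimal usage sketch (not part of the original help file); the lognormal model and the choice of the EnvStats function elnormAltCensored are assumptions made purely for illustration.

data(BJC.2000.df)
# Maximum likelihood estimates of the mean and CV of a lognormal distribution
# fit to the left-censored analyte concentrations.
with(BJC.2000.df, elnormAltCensored(x = Analyte, censored = Censored, method = "mle"))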
boxcox is a generic function used to compute the value(s) of an objective for one or more Box-Cox power transformations, or to compute an optimal power transformation based on a specified objective. The function invokes particular methods which depend on the class of the first argument. Currently, there is a default method and a method for objects of class "lm".
boxcox(x, ...)

## Default S3 method:
boxcox(x, lambda = {if (optimize) c(-2, 2) else seq(-2, 2, by = 0.5)},
    optimize = FALSE, objective.name = "PPCC",
    eps = .Machine$double.eps, include.x = TRUE, ...)

## S3 method for class 'lm'
boxcox(x, lambda = {if (optimize) c(-2, 2) else seq(-2, 2, by = 0.5)},
    optimize = FALSE, objective.name = "PPCC",
    eps = .Machine$double.eps, include.x = TRUE, ...)
x: an object of class "lm" (see the "lm" method above), or else a numeric vector of positive numbers (see the default method above).
lambda: numeric vector of finite values indicating what powers to use for the Box-Cox transformation. When optimize=FALSE, the default value is lambda=seq(-2, 2, by=0.5). When optimize=TRUE, lambda must be a vector of length 2 giving the range over which the optimization is performed, and the default value is lambda=c(-2, 2).
optimize: logical scalar indicating whether to simply evaluate the objective function at the given values of lambda (optimize=FALSE; the default), or to compute the optimal power transformation within the bounds specified by lambda (optimize=TRUE).
objective.name: character string indicating what objective to use. The possible values are "PPCC" (probability plot correlation coefficient; the default), "Shapiro-Wilk" (Shapiro-Wilk goodness-of-fit statistic), and "Log-Likelihood" (log-likelihood function).
eps: finite, positive numeric scalar. When the absolute value of lambda is less than eps, lambda is treated as 0 for the Box-Cox transformation. The default value is eps=.Machine$double.eps.
include.x: logical scalar indicating whether to include the finite, non-missing values of the argument x with the returned object. The default value is include.x=TRUE.
...: optional arguments for possible future methods. Currently not used.
Two common assumptions for several standard parametric hypothesis tests are:
The observations all come from a normal distribution.
The observations all come from distributions with the same variance.
For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups.
When the original data do not satisfy the above assumptions, data transformations
are often used to attempt to satisfy these assumptions. The rest of this section
is divided into two parts: one that discusses Box-Cox transformations in the
context of the original observations, and one that discusses Box-Cox
transformations in the context of linear models.
Box-Cox Transformations Based on the Original Observations

Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable X from some distribution with only positive values, the Box-Cox family of power transformations is defined as:

Y = \frac{X^\lambda - 1}{\lambda}, \quad \lambda \ne 0
Y = \log(X), \quad \lambda = 0    (1)

where Y is assumed to come from a normal distribution. This transformation is continuous in lambda. Note that this transformation also preserves ordering. See the help file for boxcoxTransform for more information on data transformations.

Let x_1, x_2, ..., x_n denote a random sample of n observations from some distribution and assume that there exists some value of lambda such that the transformed observations

y_i = \frac{x_i^\lambda - 1}{\lambda}, \quad \lambda \ne 0
y_i = \log(x_i), \quad \lambda = 0    (2)

(i = 1, 2, ..., n) form a random sample from a normal distribution.

Box and Cox (1964) proposed choosing the appropriate value of lambda based on maximizing the likelihood function. Alternatively, an appropriate value of lambda can be chosen based on another objective, such as maximizing the probability plot correlation coefficient or the Shapiro-Wilk goodness-of-fit statistic.
In the case when optimize=TRUE, the function boxcox calls the R function nlminb to minimize the negative value of the objective (i.e., maximize the objective) over the range of possible values of lambda specified in the argument lambda. The starting value for the optimization is always lambda=1 (i.e., no transformation).

The rest of this sub-section explains how the objective is computed for the various options for objective.name.
Objective Based on Probability Plot Correlation Coefficient (objective.name="PPCC")

When objective.name="PPCC", the objective is computed as the value of the normal probability plot correlation coefficient based on the transformed data (see the description of the Probability Plot Correlation Coefficient (PPCC) goodness-of-fit test in the help file for gofTest). That is, the objective is the correlation coefficient for the normal quantile-quantile plot for the transformed data. Large values of the PPCC tend to indicate a good fit to a normal distribution.
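As a rough illustration of this objective (not part of the original help file), the PPCC for a single candidate value of lambda can be approximated by correlating the ordered transformed data with normal quantiles evaluated at Blom-type plotting positions; the exact plotting-position constant used internally by boxcox and gofTest may differ, so the values below are only indicative.

set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 2)
ppcc <- function(x, lambda) {
  y <- boxcoxTransform(x, lambda = lambda)
  # Correlation between ordered transformed data and normal quantiles
  cor(sort(y), qnorm(ppoints(length(y), a = 0.375)))
}
sapply(c(-1, 0, 0.5, 1), function(l) ppcc(x, l))
# The PPCC should peak near lambda = 0 for these lognormal data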
Objective Based on Shapiro-Wilk Goodness-of-Fit Statistic (objective.name="Shapiro-Wilk")

When objective.name="Shapiro-Wilk", the objective is computed as the value of the Shapiro-Wilk goodness-of-fit statistic based on the transformed data (see the description of the Shapiro-Wilk test in the help file for gofTest). Large values of the Shapiro-Wilk statistic tend to indicate a good fit to a normal distribution.
Objective Based on Log-Likelihood Function (objective.name="Log-Likelihood")

When objective.name="Log-Likelihood", the objective is computed as the value of the log-likelihood function. Assuming the transformed observations in Equation (2) above come from a normal distribution with mean mu and standard deviation sigma, we can use the change of variable formula to write the log-likelihood function as:

\log L(\lambda, \mu, \sigma) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 + (\lambda - 1) \sum_{i=1}^{n} \log(x_i)    (3)

where y_i is defined in Equation (2) above (Box and Cox, 1964). For a fixed value of lambda, the log-likelihood function is maximized by replacing mu and sigma with their maximum likelihood estimators:

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i    (4)

\hat{\sigma} = \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu})^2 \right]^{1/2}    (5)

Thus, when optimize=TRUE, Equation (3) is maximized by iteratively solving for lambda using the values for mu and sigma given in Equations (4) and (5). When optimize=FALSE, the value of the objective is computed by using Equation (3), using the values of lambda specified in the argument lambda, and using the values for mu and sigma given in Equations (4) and (5).
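The following snippet (not part of the original help file) evaluates Equations (3)-(5) directly for a fixed value of lambda; the result should be comparable to the value reported by boxcox(..., objective.name = "Log-Likelihood") for the same data and power.

boxcox.loglik <- function(x, lambda, eps = .Machine$double.eps) {
  # Transformed data, Equation (2)
  y <- if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
  n <- length(y)
  mu.hat    <- mean(y)                     # Equation (4)
  sigma.hat <- sqrt(mean((y - mu.hat)^2))  # Equation (5), MLE with divisor n
  # Profile log-likelihood, Equation (3), with mu and sigma replaced by their MLEs
  -n / 2 * log(2 * pi * sigma.hat^2) - n / 2 + (lambda - 1) * sum(log(x))
}
set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 2)
boxcox.loglik(x, lambda = 0)
# Compare with boxcox(x, objective.name = "Log-Likelihood")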
Box-Cox Transformation for Linear Models

In the case of a standard linear regression model with n observations and p predictors:

Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i, \quad i = 1, 2, \ldots, n    (6)

the standard assumptions are:

The error terms epsilon_i come from a normal distribution with mean 0.

The variance is the same for all of the error terms and does not depend on the predictor variables.

Assuming Y is a random variable from some distribution that may depend on the predictor variables and Y takes on only positive values, the Box-Cox family of power transformations is defined as:

Y^* = \frac{Y^\lambda - 1}{\lambda}, \quad \lambda \ne 0
Y^* = \log(Y), \quad \lambda = 0    (7)

where Y^* becomes the new response variable and the errors are now assumed to come from a normal distribution with a mean of 0 and a constant variance.

In this case, the objective is computed as described above, but it is based on the residuals from the fitted linear model in which the response variable is now Y^* instead of Y.
When x is an object of class "lm", boxcox returns a list of class "boxcoxLm" containing the results. See the help file for boxcoxLm.object for details.

When x is simply a numeric vector of positive numbers, boxcox returns a list of class "boxcox" containing the results. See the help file for boxcox.object for details.
Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used everyday, such as the pH scale for measuring acidity. Johnson and Wichern (2007, p.192) note that "Transformations are nothing more than a reexpression of the data in different units."
In the case of a linear model, there are at least two approaches to improving a model fit: transform the Y and/or X variable(s), and/or use more predictor variables. Often in environmental data analysis, we assume the observations come from a lognormal distribution and automatically take logarithms of the data. For a simple linear regression (i.e., one predictor variable), if regression diagnostic plots indicate that a straight-line fit is not adequate, but the variance of the errors appears to be fairly constant, you may only need to transform the predictor variable X, or perhaps use a quadratic or cubic model in X. On the other hand, if the diagnostic plots indicate that the constant variance and/or normality assumptions are suspect, you probably need to consider transforming the response variable Y. Data transformations for linear regression models are discussed in Draper and Smith (1998, Chapter 13) and Helsel and Hirsch (1992, pp. 228-229).
One problem with data transformations is that translating results on the
transformed scale back to the original scale is not always straightforward.
Estimating quantities such as means, variances, and confidence limits in the
transformed scale and then transforming them back to the original scale
usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149;
van Belle et al., 2004, p.400). For example, exponentiating the confidence
limits for a mean based on log-transformed data does not yield a
confidence interval for the mean on the original scale. Instead, this yields
a confidence interval for the median (see the help file for elnormAlt
).
It should be noted, however, that quantiles (percentiles) and rank-based
procedures are invariant to monotonic transformations
(Helsel and Hirsch, 1992, p.12).
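A small simulation (not part of the original help file) illustrating this point: for lognormal data, back-transforming the mean of the log-transformed values recovers the median, not the mean, of the original distribution.

set.seed(123)
mu <- 1; sigma <- 1
x <- rlnorm(1e5, meanlog = mu, sdlog = sigma)
exp(mean(log(x)))   # close to the median, exp(1) ~ 2.72
mean(x)             # close to the mean, exp(1.5) ~ 4.48
median(x)           # also close to exp(1)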
Finally, there is no guarantee that a Box-Cox transformation based on the “optimal” value of lambda will provide an adequate transformation to allow the assumption of approximate normality and constant variance. Any set of transformed data should be inspected relative to the assumptions you want to make about it (Johnson and Wichern, 2007, p.194).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.
Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.
Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 13.
boxcox.object, plot.boxcox, print.boxcox, boxcoxLm.object, plot.boxcoxLm, print.boxcoxLm, boxcoxTransform, Data Transformations, Goodness-of-Fit Tests.
# Generate 30 observations from a lognormal distribution with # mean=10 and cv=2. Look at some values of various objectives # for various transformations. Note that for both the PPCC and # the Log-Likelihood objective, the optimal value of lambda is # about 0, indicating that a log transformation is appropriate. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) dev.new() hist(x, col = "cyan") # Using the PPCC objective: #-------------------------- boxcox(x) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # # lambda PPCC # -2.0 0.5423739 # -1.5 0.6402782 # -1.0 0.7818160 # -0.5 0.9272219 # 0.0 0.9921702 # 0.5 0.9581178 # 1.0 0.8749611 # 1.5 0.7827009 # 2.0 0.7004547 boxcox(x, optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.04530789 # #Value of Objective: PPCC = 0.9925919 # Using the Log-Likelihodd objective #----------------------------------- boxcox(x, objective.name = "Log-Likelihood") #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: Log-Likelihood # #Data: x # #Sample Size: 30 # # lambda Log-Likelihood # -2.0 -154.94255 # -1.5 -128.59988 # -1.0 -106.23882 # -0.5 -90.84800 # 0.0 -85.10204 # 0.5 -88.69825 # 1.0 -99.42630 # 1.5 -115.23701 # 2.0 -134.54125 boxcox(x, objective.name = "Log-Likelihood", optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: Log-Likelihood # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.0405156 # #Value of Objective: Log-Likelihood = -85.07123 #---------- # Plot the results based on the PPCC objective #--------------------------------------------- boxcox.list <- boxcox(x) dev.new() plot(boxcox.list) #Look at QQ-Plots for the candidate values of lambda #--------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) #========== # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll plot ozone vs. # temperature and look at the Q-Q plot of the residuals. Then # we'll look at possible Box-Cox transformations. The "optimal" one # based on the PPCC looks close to a log-transformation # (i.e., lambda=0). The power that produces the largest PPCC is # about 0.2, so a cube root (lambda=1/3) transformation might work too. head(Environmental.df) # ozone radiation temperature wind #05/01/1973 41 190 67 7.4 #05/02/1973 36 118 72 8.0 #05/03/1973 12 149 74 12.6 #05/04/1973 18 313 62 11.5 #05/05/1973 NA NA 56 14.3 #05/06/1973 28 NA 66 14.9 tail(Environmental.df) # ozone radiation temperature wind #09/25/1973 14 20 63 16.6 #09/26/1973 30 193 70 6.9 #09/27/1973 NA 145 77 13.2 #09/28/1973 14 191 75 14.3 #09/29/1973 18 131 76 8.0 #09/30/1973 20 223 68 11.5 # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) # Plot Ozone vs. 
Temperature, with fitted line #--------------------------------------------- dev.new() with(Environmental.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)", main = "Ozone vs. Temperature")) abline(ozone.fit) # Look at the Q-Q Plot for the residuals #--------------------------------------- dev.new() qqPlot(ozone.fit$residuals, add.line = TRUE) # Look at Box-Cox transformations of Ozone #----------------------------------------- boxcox.list <- boxcox(ozone.fit) boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Linear Model: ozone.fit # #Sample Size: 116 # # lambda PPCC # -2.0 0.4286781 # -1.5 0.4673544 # -1.0 0.5896132 # -0.5 0.8301458 # 0.0 0.9871519 # 0.5 0.9819825 # 1.0 0.9408694 # 1.5 0.8840770 # 2.0 0.8213675 # Plot PPCC vs. lambda based on Q-Q plots of residuals #----------------------------------------------------- dev.new() plot(boxcox.list) # Look at Q-Q plots of residuals for the various transformation #-------------------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Compute the "optimal" transformation #------------------------------------- boxcox(ozone.fit, optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Linear Model: ozone.fit # #Sample Size: 116 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.2004305 # #Value of Objective: PPCC = 0.9940222 #========== # Clean up #--------- rm(x, boxcox.list, ozone.fit) graphics.off()
Objects of S3 class "boxcox"
are returned by the EnvStats
function boxcox
, which computes objective values for
user-specified powers, or computes the optimal power for the specified
objective.
Objects of class "boxcox"
are lists that contain
information about the powers that were used, the objective that was used,
the values of the objective for the given powers, and whether an
optimization was specified.
Required Components
The following components must be included in a legitimate list of class "boxcox".

lambda: numeric vector containing the powers used in the Box-Cox transformations. When optimize=TRUE, this is a scalar containing the optimal power; otherwise it contains the user-specified powers.
objective: numeric vector containing the value(s) of the objective for the given value(s) of lambda.
objective.name: character string indicating the objective that was used. The possible values are "PPCC" (probability plot correlation coefficient), "Shapiro-Wilk" (Shapiro-Wilk goodness-of-fit statistic), and "Log-Likelihood" (log-likelihood function).
optimize: logical scalar indicating whether the objective was simply evaluated at the given values of lambda (optimize=FALSE), or whether the optimal power transformation was computed (optimize=TRUE).
optimize.bounds: numeric vector of length 2 with a names attribute (lower and upper) indicating the bounds within which the optimization took place (relevant when optimize=TRUE).
eps: finite, positive numeric scalar indicating the value of eps that was used; values of lambda whose absolute value is less than eps are treated as 0.
sample.size: numeric scalar indicating the number of finite, non-missing observations.
data.name: the name of the data object used for the Box-Cox computations.
bad.obs: the number of missing (NA), undefined (NaN), and/or infinite (Inf, -Inf) values that were removed from the data object prior to performing the Box-Cox computations.
Optional Component
The following component may optionally be included in a legitimate list of class "boxcox". It must be included if you want to call the function plot.boxcox and specify Q-Q plots or Tukey Mean-Difference Q-Q plots.

data: numeric vector containing the data actually used for the Box-Cox computations (i.e., the original data without any missing or infinite values).
Generic functions that have methods for objects of class "boxcox" include: plot and print. Since objects of class "boxcox" are lists, you may extract their components with the $ and [[ operators.
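For example (not part of the original help file), individual components can be extracted directly:

boxcox.list <- boxcox(rlnormAlt(30, mean = 10, cv = 2))
boxcox.list$lambda          # the powers that were evaluated
boxcox.list[["objective"]]  # the value of the objective at each power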
Steven P. Millard ([email protected])
boxcox, plot.boxcox, print.boxcox, boxcoxLm.object.
# Create an object of class "boxcox", then print it out. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) dev.new() hist(x, col = "cyan") boxcox.list <- boxcox(x) data.class(boxcox.list) #[1] "boxcox" names(boxcox.list) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "sample.size" "data.name" #[10] "bad.obs" boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # # lambda PPCC # -2.0 0.5423739 # -1.5 0.6402782 # -1.0 0.7818160 # -0.5 0.9272219 # 0.0 0.9921702 # 0.5 0.9581178 # 1.0 0.8749611 # 1.5 0.7827009 # 2.0 0.7004547 boxcox(x, optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.04530789 # #Value of Objective: PPCC = 0.9925919 #---------- # Clean up #--------- rm(x, boxcox.list)
# Create an object of class "boxcox", then print it out. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) dev.new() hist(x, col = "cyan") boxcox.list <- boxcox(x) data.class(boxcox.list) #[1] "boxcox" names(boxcox.list) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "sample.size" "data.name" #[10] "bad.obs" boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # # lambda PPCC # -2.0 0.5423739 # -1.5 0.6402782 # -1.0 0.7818160 # -0.5 0.9272219 # 0.0 0.9921702 # 0.5 0.9581178 # 1.0 0.8749611 # 1.5 0.7827009 # 2.0 0.7004547 boxcox(x, optimize = TRUE) #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.04530789 # #Value of Objective: PPCC = 0.9925919 #---------- # Clean up #--------- rm(x, boxcox.list)
Compute the value(s) of an objective function for one or more Box-Cox power transformations, or compute an optimal power transformation based on a specified objective, using Type I censored data.
boxcoxCensored(x, censored, censoring.side = "left", lambda = {if (optimize) c(-2, 2) else seq(-2, 2, by = 0.5)}, optimize = FALSE, objective.name = "PPCC", eps = .Machine$double.eps, include.x.and.censored = TRUE, prob.method = "michael-schucany", plot.pos.con = 0.375)
x |
a numeric vector of positive numbers.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
lambda |
numeric vector of finite values indicating what powers to use for the
Box-Cox transformation. When |
optimize |
logical scalar indicating whether to simply evaluate the objective function at the
given values of |
objective.name |
character string indicating what objective to use. The possible values are
|
eps |
finite, positive numeric scalar. When the absolute value of |
include.x.and.censored |
logical scalar indicating whether to include the finite, non-missing values of
the argument |
prob.method |
for multiply censored data,
character string indicating what method to use to compute the plotting positions
(empirical probabilities) when
The default value is The This argument is ignored if |
plot.pos.con |
for multiply censored data,
numeric scalar between 0 and 1 containing the value of the plotting position
constant when This argument is ignored if |
Two common assumptions for several standard parametric hypothesis tests are:
The observations all come from a normal distribution.
The observations all come from distributions with the same variance.
For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups.
When the original data do not satisfy the above assumptions, data transformations
are often used to attempt to satisfy these assumptions.
Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable $X$ from some distribution with only positive values, the Box-Cox family of power transformations is defined as:

$$Y = \begin{cases} \dfrac{X^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(X) & \lambda = 0 \end{cases}$$

where $Y$ is assumed to come from a normal distribution. This transformation is continuous in $\lambda$. Note that this transformation also preserves ordering.
See the help file for
boxcoxTransform
for more information on data
transformations.
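As a point of reference, the transformation defined above is easy to sketch directly in base R for a single power (a minimal illustration only; the EnvStats function boxcoxTransform performs the same computation and also uses its eps argument to treat powers that are numerically close to zero as zero):

# Minimal sketch of the Box-Cox family for positive x and a single lambda
bc.transform <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
x <- c(0.5, 1, 2, 4, 8)
bc.transform(x, lambda = 0)     # log transformation
bc.transform(x, lambda = 0.5)   # shifted and rescaled square-root transformation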
Box and Cox (1964) proposed choosing the appropriate value of $\lambda$ based on maximizing the likelihood function. Alternatively, an appropriate value of $\lambda$ can be chosen based on another objective, such as maximizing the probability plot correlation coefficient or the Shapiro-Wilk goodness-of-fit statistic.
Shumway et al. (1989) investigated extending the method of Box and Cox (1964) to the case of Type I censored data, motivated by the desire to produce estimated means and confidence intervals for air monitoring data that included censored values.
In the case when optimize=TRUE, the function boxcoxCensored calls the R function nlminb to minimize the negative value of the objective (i.e., maximize the objective) over the range of possible values of $\lambda$ specified in the argument lambda. The starting value for the optimization is always $\lambda = 1$ (i.e., no transformation).
The next section explains assumptions and notation, and the section after that
explains how the objective is computed for the various options for
objective.name
.
Assumptions and Notation
Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a random sample of $N$ observations from some continuous distribution. Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c = N - n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \qquad k \ge 1$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.
Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^{k} c_j = c$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.
Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.
Finally, let $\Omega$ (omega) denote the set of $n$ subscripts in the “ordered” sample that correspond to uncensored observations, and let $\Omega_j$ denote the set of $c_j$ subscripts in the “ordered” sample that correspond to the censored observations censored at censoring level $T_j$ for $j = 1, 2, \ldots, k$.
We assume that there exists some value of $\lambda$ such that the transformed observations

$$y_i = \begin{cases} \dfrac{x_i^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(x_i) & \lambda = 0 \end{cases} \qquad (4)$$

($i = 1, 2, \ldots, N$) form a random sample of Type I censored data from a normal distribution.
Note that for the censored observations, Equation (4) becomes:

$$y_{(i)} = \begin{cases} \dfrac{T_j^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(T_j) & \lambda = 0 \end{cases} \qquad (5)$$

where $i \in \Omega_j$.
Computing the Objective
Objective Based on Probability Plot Correlation Coefficient (objective.name="PPCC"
)
When objective.name="PPCC"
, the objective is computed as the value of the
normal probability plot correlation coefficient based on the transformed data
(see the description of the Probability Plot Correlation Coefficient (PPCC)
goodness-of-fit test in the help file for gofTestCensored
). That is,
the objective is the correlation coefficient for the normal
quantile-quantile plot for the transformed data.
Large values of the PPCC tend to indicate a good fit to a normal distribution.
Objective Based on Shapiro-Wilk Goodness-of-Fit Statistic (objective.name="Shapiro-Wilk"
)
When objective.name="Shapiro-Wilk"
, the objective is computed as the value of
the Shapiro-Wilk goodness-of-fit statistic based on the transformed data
(see the description of the Shapiro-Wilk test in the help file for
gofTestCensored
). Large values of the Shapiro-Wilk statistic tend to
indicate a good fit to a normal distribution.
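To make these two objectives concrete for the simpler case of complete (uncensored) data, the PPCC is just the correlation between the ordered transformed values and normal quantiles evaluated at plotting positions, and the Shapiro-Wilk statistic is returned by shapiro.test. A rough sketch (the censored-data versions used by boxcoxCensored are the ones computed by gofTestCensored):

# Sketch: both objectives for complete data at the candidate power lambda = 0 (log)
set.seed(123)
x <- rlnormAlt(40, mean = 10, cv = 2)
y <- log(x)                                            # transformed values
cor(sort(y), qnorm(ppoints(length(y), a = 0.375)))     # normal PPCC
shapiro.test(y)$statistic                              # Shapiro-Wilk W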
Objective Based on Log-Likelihood Function (objective.name="Log-Likelihood"
)
When objective.name="Log-Likelihood"
, the objective is computed as the value
of the log-likelihood function. Assuming the transformed observations in Equation (4) above come from a normal distribution with mean $\mu$ and standard deviation $\sigma$, we can use the change of variable formula to write the log-likelihood function as follows.
For Type I left censored data, the likelihood function is given by:

$$L(\lambda, \mu, \sigma \mid \underline{x}) = \frac{N!}{c_1!\, c_2! \cdots c_k!\, n!} \prod_{j=1}^{k} \left[F(y_{T_j})\right]^{c_j} \prod_{i \in \Omega} f\!\left[y_{(i)}\right] x_{(i)}^{\lambda - 1} \qquad (6)$$

where $y_{T_j}$ denotes the transformed censoring level $T_j$ (Equation (5) above), and $f$ and $F$ denote the probability density function (pdf) and cumulative distribution function (cdf) of the population. That is,

$$f(t) = \frac{1}{\sigma}\,\phi\!\left(\frac{t - \mu}{\sigma}\right), \qquad F(t) = \Phi\!\left(\frac{t - \mu}{\sigma}\right)$$

where $\phi$ and $\Phi$ denote the pdf and cdf of the standard normal distribution, respectively (Shumway et al., 1989). For left singly censored data, Equation (6) simplifies to:

$$L(\lambda, \mu, \sigma \mid \underline{x}) = \frac{N!}{c!\, n!} \left[F(y_T)\right]^{c} \prod_{i = c+1}^{N} f\!\left[y_{(i)}\right] x_{(i)}^{\lambda - 1}$$

Similarly, for Type I right censored data, the likelihood function is given by:

$$L(\lambda, \mu, \sigma \mid \underline{x}) = \frac{N!}{c_1!\, c_2! \cdots c_k!\, n!} \prod_{j=1}^{k} \left[1 - F(y_{T_j})\right]^{c_j} \prod_{i \in \Omega} f\!\left[y_{(i)}\right] x_{(i)}^{\lambda - 1} \qquad (10)$$

and for right singly censored data this simplifies to:

$$L(\lambda, \mu, \sigma \mid \underline{x}) = \frac{N!}{c!\, n!} \left[1 - F(y_T)\right]^{c} \prod_{i = 1}^{n} f\!\left[y_{(i)}\right] x_{(i)}^{\lambda - 1}$$
For a fixed value of $\lambda$, the log-likelihood function is maximized by replacing $\mu$ and $\sigma$ with their maximum likelihood estimators (see the section Maximum Likelihood Estimation in the help file for enormCensored).
Thus, when optimize=TRUE, Equation (6) or (10) is maximized by iteratively solving for $\lambda$ using the MLEs for $\mu$ and $\sigma$. When optimize=FALSE, the value of the objective is computed by using Equation (6) or (10), using the values of $\lambda$ specified in the argument lambda, and using the MLEs of $\mu$ and $\sigma$.
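A rough sketch of how the log-likelihood objective could be evaluated at a single power for left-censored data, using the MLEs of the mean and standard deviation returned by enormCensored. This ignores the additive combinatorial constant, and the exact bookkeeping inside boxcoxCensored may differ; it is an illustration only:

# Sketch: censored Box-Cox log-likelihood at lambda = 0 (log transformation),
# up to an additive constant, for Type I left-censored data.
set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 2)
censored <- x < 3
x[censored] <- 3                       # left-censor at 3
lambda <- 0
y <- log(x)                            # transformed data and censoring level

fit <- enormCensored(y, censored, method = "mle")
mu    <- fit$parameters["mean"]
sigma <- fit$parameters["sd"]

sum(dnorm(y[!censored], mu, sigma, log = TRUE)) +      # uncensored contribution
  sum(pnorm(y[censored], mu, sigma, log.p = TRUE)) +   # left-censored contribution
  (lambda - 1) * sum(log(x[!censored]))                # Jacobian of the transformation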
boxcoxCensored
returns a list of class "boxcoxCensored"
containing the results.
See the help file for boxcoxCensored.object
for details.
Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used every day, such as the pH scale for measuring acidity. Johnson and Wichern (2007, p.192) note that "Transformations are nothing more than a reexpression of the data in different units."
Shumway et al. (1989) investigated extending the method of Box and Cox (1964) to the case of Type I censored data, motivated by the desire to produce estimated means and confidence intervals for air monitoring data that included censored values.
Stoline (1991) compared the goodness-of-fit of Box-Cox transformed data (based on
using the “optimal” power transformation from a finite set of values between
-1.5 and 1.5) with log-transformed data for 17 groundwater chemistry variables.
Using the Probability Plot Correlation Coefficient statistic for censored data as a
measure of goodness-of-fit (see gofTest
), Stoline (1991) found that
only 6 of the variables were adequately modeled by a Box-Cox transformation
(p > 0.10 for these 6 variables). Of these variables, five were adequately modeled by a log transformation. Ten of the variables were “marginally” fit by an optimal Box-Cox transformation, and of these 10 only 6 were marginally fit by a
log transformation. Based on these results, Stoline (1991) recommends checking
the assumption of lognormality before automatically assuming environmental data fit
a lognormal distribution.
One problem with data transformations is that translating results on the
transformed scale back to the original scale is not always straightforward.
Estimating quantities such as means, variances, and confidence limits in the
transformed scale and then transforming them back to the original scale
usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149;
van Belle et al., 2004, p.400). For example, exponentiating the confidence
limits for a mean based on log-transformed data does not yield a
confidence interval for the mean on the original scale. Instead, this yields
a confidence interval for the median (see the help file for
elnormAltCensored
).
It should be noted, however, that quantiles (percentiles) and rank-based
procedures are invariant to monotonic transformations
(Helsel and Hirsch, 1992, p.12).
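The point about back-transformed confidence limits can be illustrated with a small sketch: exponentiating a confidence interval for the mean of log-transformed lognormal data brackets the true median, not the true mean (the parameter values below are arbitrary):

# Sketch: the exponentiated CI for the log-scale mean targets the median, not the mean
set.seed(42)
meanlog <- 1; sdlog <- 1
x <- rlnorm(200, meanlog = meanlog, sdlog = sdlog)
exp(t.test(log(x))$conf.int)       # back-transformed CI on the original scale
exp(meanlog)                       # true median: typically inside the interval
exp(meanlog + sdlog^2 / 2)         # true mean: typically above it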
Finally, there is no guarantee that a Box-Cox transformation based on the “optimal” value of $\lambda$ will provide an adequate transformation to allow the assumption of approximate normality and constant variance. Any set of transformed data should be inspected relative to the assumptions you want to make about it (Johnson and Wichern, 2007, p.194).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, pp.50–59.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.
Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.
Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 13.
boxcoxCensored.object
, plot.boxcoxCensored
,
print.boxcoxCensored
,
boxcox
, Data Transformations, Goodness-of-Fit Tests.
# Generate 15 observations from a lognormal distribution with # mean=10 and cv=2 and censor the observations less than 2. # Then generate 15 more observations from this distribution and # censor the observations less than 4. # Then Look at some values of various objectives for various transformations. # Note that for both the PPCC objective the optimal value is about -0.3, # whereas for the Log-Likelihood objective it is about 0.3. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x.1 <- rlnormAlt(15, mean = 10, cv = 2) censored.1 <- x.1 < 2 x.1[censored.1] <- 2 x.2 <- rlnormAlt(15, mean = 10, cv = 2) censored.2 <- x.2 < 4 x.2[censored.2] <- 4 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) #-------------------------- # Using the PPCC objective: #-------------------------- boxcoxCensored(x, censored) #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: PPCC # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # # lambda PPCC # -2.0 0.8954683 # -1.5 0.9338467 # -1.0 0.9643680 # -0.5 0.9812969 # 0.0 0.9776834 # 0.5 0.9471025 # 1.0 0.8901990 # 1.5 0.8187488 # 2.0 0.7480494 boxcoxCensored(x, censored, optimize = TRUE) #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: PPCC # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = -0.3194799 # #Value of Objective: PPCC = 0.9827546 #----------------------------------- # Using the Log-Likelihodd objective #----------------------------------- boxcoxCensored(x, censored, objective.name = "Log-Likelihood") #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: Log-Likelihood # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # # lambda Log-Likelihood # -2.0 -95.38785 # -1.5 -84.76697 # -1.0 -75.36204 # -0.5 -68.12058 # 0.0 -63.98902 # 0.5 -63.56701 # 1.0 -66.92599 # 1.5 -73.61638 # 2.0 -82.87970 boxcoxCensored(x, censored, objective.name = "Log-Likelihood", optimize = TRUE) #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: Log-Likelihood # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = 0.3049744 # #Value of Objective: Log-Likelihood = -63.2733 #---------- # Plot the results based on the PPCC objective #--------------------------------------------- boxcox.list <- boxcoxCensored(x, censored) dev.new() plot(boxcox.list) #Look at QQ-Plots for the candidate values of lambda #--------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list) graphics.off()
Objects of S3 class "boxcoxCensored"
are returned by the EnvStats
function boxcoxCensored
, which computes objective values for
user-specified powers, or computes the optimal power for the specified
objective, based on Type I censored data.
Objects of class "boxcoxCensored"
are lists that contain
information about the powers that were used, the objective that was used,
the values of the objective for the given powers, and whether an
optimization was specified.
Required Components
The following components must be included in a legitimate list of
class "boxcoxCensored"
.
lambda |
Numeric vector containing the powers used in the Box-Cox transformations.
If the value of the |
objective |
Numeric vector containing the value(s) of the objective for the given value(s)
of |
objective.name |
Character string indicating the objective that was used. The possible values are
|
optimize |
Logical scalar indicating whether the objective was simply evaluated at the
given values of |
optimize.bounds |
Numeric vector of length 2 with a names attribute indicating the bounds within
which the optimization took place. When |
eps |
Finite, positive numeric scalar indicating what value of |
sample.size |
Numeric scalar indicating the number of finite, non-missing observations. |
censoring.side |
Character string indicating the censoring side. Possible values are
|
censoring.levels |
Numeric vector containing the censoring levels. |
percent.censored |
Numeric scalar indicating the percent of observations that are censored. |
data.name |
The name of the data object used for the Box-Cox computations. |
censoring.name |
The name of the data object indicating which observations are censored. |
bad.obs |
The number of missing ( |
Optional Components
The following components may optionally be included in a legitimate
list of class "boxcoxCensored"
. They must be included if you want to
call the function plot.boxcoxCensored
and specify Q-Q plots or
Tukey Mean-Difference Q-Q plots.
data |
Numeric vector containing the data actually used for the Box-Cox computations (i.e., the original data without any missing or infinite values). |
censored |
Logical vector indicating which of the values in the component |
Generic functions that have methods for objects of class
"boxcoxCensored"
include: plot
, print
.
Since objects of class "boxcoxCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
boxcoxCensored
, plot.boxcoxCensored
,
print.boxcoxCensored
.
# Create an object of class "boxcoxCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x.1 <- rlnormAlt(15, mean = 10, cv = 2) censored.1 <- x.1 < 2 x.1[censored.1] <- 2 x.2 <- rlnormAlt(15, mean = 10, cv = 2) censored.2 <- x.2 < 4 x.2[censored.2] <- 4 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) boxcox.list <- boxcoxCensored(x, censored) data.class(boxcox.list) #[1] "boxcoxCensored" names(boxcox.list) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "censored" "sample.size" #[10] "censoring.side" "censoring.levels" "percent.censored" #[13] "data.name" "censoring.name" "bad.obs" boxcox.list #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: PPCC # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # # lambda PPCC # -2.0 0.8954683 # -1.5 0.9338467 # -1.0 0.9643680 # -0.5 0.9812969 # 0.0 0.9776834 # 0.5 0.9471025 # 1.0 0.8901990 # 1.5 0.8187488 # 2.0 0.7480494 boxcox.list2 <- boxcox(x, optimize = TRUE) names(boxcox.list2) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "sample.size" "data.name" #[10] "bad.obs" boxcox.list2 #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = -0.5826431 # #Value of Objective: PPCC = 0.9755402 #========== # Clean up #--------- rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list, boxcox.list2)
# Create an object of class "boxcoxCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x.1 <- rlnormAlt(15, mean = 10, cv = 2) censored.1 <- x.1 < 2 x.1[censored.1] <- 2 x.2 <- rlnormAlt(15, mean = 10, cv = 2) censored.2 <- x.2 < 4 x.2[censored.2] <- 4 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) boxcox.list <- boxcoxCensored(x, censored) data.class(boxcox.list) #[1] "boxcoxCensored" names(boxcox.list) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "censored" "sample.size" #[10] "censoring.side" "censoring.levels" "percent.censored" #[13] "data.name" "censoring.name" "bad.obs" boxcox.list #Results of Box-Cox Transformation #Based on Type I Censored Data #--------------------------------- # #Objective Name: PPCC # #Data: x # #Censoring Variable: censored # #Censoring Side: left # #Censoring Level(s): 2 4 # #Sample Size: 30 # #Percent Censored: 26.7% # # lambda PPCC # -2.0 0.8954683 # -1.5 0.9338467 # -1.0 0.9643680 # -0.5 0.9812969 # 0.0 0.9776834 # 0.5 0.9471025 # 1.0 0.8901990 # 1.5 0.8187488 # 2.0 0.7480494 boxcox.list2 <- boxcox(x, optimize = TRUE) names(boxcox.list2) # [1] "lambda" "objective" "objective.name" # [4] "optimize" "optimize.bounds" "eps" # [7] "data" "sample.size" "data.name" #[10] "bad.obs" boxcox.list2 #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Data: x # #Sample Size: 30 # #Bounds for Optimization: lower = -2 # upper = 2 # #Optimal Value: lambda = -0.5826431 # #Value of Objective: PPCC = 0.9755402 #========== # Clean up #--------- rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list, boxcox.list2)
Objects of S3 class "boxcoxLm"
are returned by the EnvStats
function boxcox
when the argument x
is an object
of class "lm"
. In this case, boxcox
computes
values of an objective function for user-specified powers, or computes the
optimal power for the specified objective, based on residuals from the linear model.
Objects of class "boxcoxLm"
are lists that contain
information about the "lm"
object that was supplied,
the powers that were used, the objective that was used,
the values of the objective for the given powers, and whether an
optimization was specified.
The following components must be included in a legitimate list of
class "boxcoxLm"
.
lambda |
Numeric vector containing the powers used in the Box-Cox transformations.
If the value of the |
objective |
Numeric vector containing the value(s) of the objective for the given value(s)
of |
objective.name |
character string indicating the objective that was used. The possible values are
|
optimize |
logical scalar indicating whether the objective was simply evaluated at the
given values of |
optimize.bounds |
Numeric vector of length 2 with a names attribute indicating the bounds within
which the optimization took place. When |
eps |
finite, positive numeric scalar indicating what value of |
lm.obj |
the value of the argument |
sample.size |
Numeric scalar indicating the number of finite, non-missing observations. |
data.name |
The name of the data object used for the Box-Cox computations. |
Generic functions that have methods for objects of class
"boxcoxLm"
include: plot
, print
.
Since objects of class "boxcoxLm"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
boxcox
, plot.boxcoxLm
, print.boxcoxLm
,
boxcox.object
.
# Create an object of class "boxcoxLm", then print it out. # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll plot ozone vs. # temperature and look at the Q-Q plot of the residuals. Then # we'll look at possible Box-Cox transformations. The "optimal" one # based on the PPCC looks close to a log-transformation # (i.e., lambda=0). The power that produces the largest PPCC is # about 0.2, so a cube root (lambda=1/3) transformation might work too. # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) # Plot Ozone vs. Temperature, with fitted line #--------------------------------------------- dev.new() with(Environmental.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)", main = "Ozone vs. Temperature")) abline(ozone.fit) # Look at the Q-Q Plot for the residuals #--------------------------------------- dev.new() qqPlot(ozone.fit$residuals, add.line = TRUE) # Look at Box-Cox transformations of Ozone #----------------------------------------- boxcox.list <- boxcox(ozone.fit) boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Linear Model: ozone.fit # #Sample Size: 116 # # lambda PPCC # -2.0 0.4286781 # -1.5 0.4673544 # -1.0 0.5896132 # -0.5 0.8301458 # 0.0 0.9871519 # 0.5 0.9819825 # 1.0 0.9408694 # 1.5 0.8840770 # 2.0 0.8213675 #---------- # Clean up #--------- rm(ozone.fit, boxcox.list)
# Create an object of class "boxcoxLm", then print it out. # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll plot ozone vs. # temperature and look at the Q-Q plot of the residuals. Then # we'll look at possible Box-Cox transformations. The "optimal" one # based on the PPCC looks close to a log-transformation # (i.e., lambda=0). The power that produces the largest PPCC is # about 0.2, so a cube root (lambda=1/3) transformation might work too. # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) # Plot Ozone vs. Temperature, with fitted line #--------------------------------------------- dev.new() with(Environmental.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Ozone (ppb)", main = "Ozone vs. Temperature")) abline(ozone.fit) # Look at the Q-Q Plot for the residuals #--------------------------------------- dev.new() qqPlot(ozone.fit$residuals, add.line = TRUE) # Look at Box-Cox transformations of Ozone #----------------------------------------- boxcox.list <- boxcox(ozone.fit) boxcox.list #Results of Box-Cox Transformation #--------------------------------- # #Objective Name: PPCC # #Linear Model: ozone.fit # #Sample Size: 116 # # lambda PPCC # -2.0 0.4286781 # -1.5 0.4673544 # -1.0 0.5896132 # -0.5 0.8301458 # 0.0 0.9871519 # 0.5 0.9819825 # 1.0 0.9408694 # 1.5 0.8840770 # 2.0 0.8213675 #---------- # Clean up #--------- rm(ozone.fit, boxcox.list)
Apply a Box-Cox power transformation to a set of data to attempt to induce normality and homogeneity of variance.
boxcoxTransform(x, lambda, eps = .Machine$double.eps)
x |
a numeric vector of positive numbers. |
lambda |
finite numeric scalar indicating what power to use for the Box-Cox transformation. |
eps |
finite, positive numeric scalar. When the absolute value of |
Two common assumptions for several standard parametric hypothesis tests are:
The observations all come from a normal distribution.
The observations all come from distributions with the same variance.
For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups. For standard linear regression models, these assumptions can be stated as: the error terms all come from a normal distribution with mean 0 and a constant variance.
Often, especially with environmental data, the above assumptions do not hold because the original data are skewed and/or they follow a distribution that is not really shaped like a normal distribution. It is sometimes possible, however, to transform the original data so that the transformed observations in fact come from a normal distribution or close to a normal distribution. The transformation may also induce homogeneity of variance and, for the case of a linear regression model, a linear relationship between the response and predictor variable(s).
Sometimes, theoretical considerations indicate an appropriate transformation. For example, count data often follow a Poisson distribution, and it can be shown that taking the square root of observations from a Poisson distribution tends to make these data look more bell-shaped (Johnson et al., 1992, p.163; Johnson and Wichern, 2007, p.192; Zar, 2010, p.291). A common example in the environmental field is that chemical concentration data often appear to come from a lognormal distribution or some other positively-skewed distribution (e.g., gamma). In this case, taking the logarithm of the observations often appears to yield normally distributed data.
Ideally, a data transformation is chosen based on knowledge of the process generating the data, as well as graphical tools such as quantile-quantile plots and histograms.
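Both points can be sketched quickly with the EnvStats function qqPlot (the Poisson mean below is arbitrary):

# Sketch: square-root transformation of Poisson counts looks more bell-shaped
set.seed(17)
counts <- rpois(200, lambda = 4)
dev.new()
qqPlot(counts, add.line = TRUE)          # skewed, discrete-looking Q-Q plot
dev.new()
qqPlot(sqrt(counts), add.line = TRUE)    # noticeably closer to a straight line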
Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable $X$ from some distribution with only positive values, the Box-Cox family of power transformations is defined as:

$$Y = \begin{cases} \dfrac{X^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(X) & \lambda = 0 \end{cases} \qquad (1)$$

where $Y$ is assumed to come from a normal distribution. This transformation is continuous in $\lambda$. Note that this transformation also preserves ordering; that is, if $X_1 < X_2$ then $Y_1 < Y_2$.
Box and Cox (1964) proposed choosing the appropriate value of $\lambda$ based on maximizing a likelihood function. See the help file for boxcox for details.
Note that for non-zero values of $\lambda$, instead of using the formula of Box and Cox in Equation (1), you may simply use the power transformation:

$$Y = X^\lambda$$

since these two equations differ only by a scale difference and origin shift, and the essential character of the transformed distribution remains unchanged.
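A quick sketch confirming the equivalence: for a fixed nonzero power, the values returned by boxcoxTransform are an exact linear rescaling of the simple power transform, so correlations and the shapes of plots based on either version are the same:

# Sketch: Box-Cox transform vs. simple power transform for lambda = 0.5
set.seed(99)
x <- rlnormAlt(20, mean = 10, cv = 2)
lambda <- 0.5
y1 <- boxcoxTransform(x, lambda = lambda)   # (x^lambda - 1) / lambda
y2 <- x^lambda                              # simple power transform
all.equal(y1, (y2 - 1) / lambda)            # identical up to shift and scale
cor(y1, y2)                                 # exactly 1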
The value $\lambda = 1$ corresponds to no transformation. Values of $\lambda$ less than 1 shrink large values of $X$, and are therefore useful for transforming positively-skewed (right-skewed) data. Values of $\lambda$ larger than 1 inflate large values of $X$, and are therefore useful for transforming negatively-skewed (left-skewed) data (Helsel and Hirsch, 1992, pp.13-14; Johnson and Wichern, 2007, p.193). Commonly used values of $\lambda$ include 0 (log transformation), 0.5 (square-root transformation), -1 (reciprocal), and -0.5 (reciprocal root).
It is often recommended that when dealing with several similar data sets, it is best to find a common transformation that works reasonably well for all the data sets, rather than using slightly different transformations for each data set (Helsel and Hirsch, 1992, p.14; Shumway et al., 1989).
numeric vector of transformed observations.
Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used every day, such as the pH scale for measuring acidity.
In the case of a linear model, there are at least two approaches to improving a model fit: transform the $Y$ and/or $X$ variable(s), and/or use more predictor variables. Often in environmental data analysis, we assume the observations come from a lognormal distribution and automatically take logarithms of the data. For a simple linear regression (i.e., one predictor variable), if regression diagnostic plots indicate that a straight line fit is not adequate, but that the variance of the errors appears to be fairly constant, you may only need to transform the predictor variable $X$ or perhaps use a quadratic or cubic model in $X$. On the other hand, if the diagnostic plots indicate that the constant variance and/or normality assumptions are suspect, you probably need to consider transforming the response variable $Y$. Data transformations for linear regression models are discussed in Draper and Smith (1998, Chapter 13) and Helsel and Hirsch (1992, pp. 228-229).
One problem with data transformations is that translating results on the
transformed scale back to the original scale is not always straightforward.
Estimating quantities such as means, variances, and confidence limits in the
transformed scale and then transforming them back to the original scale
usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149;
van Belle et al., 2004, p.400). For example, exponentiating the confidence
limits for a mean based on log-transformed data does not yield a
confidence interval for the mean on the original scale. Instead, this yields
a confidence interval for the median (see the help file for elnormAlt
).
It should be noted, however, that quantiles (percentiles) and rank-based
procedures are invariant to monotonic transformations
(Helsel and Hirsch, 1992, p.12).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47-53.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.
Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.
Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 13.
boxcox
, Data Transformations, Goodness-of-Fit Tests.
# Generate 30 observations from a lognormal distribution with # mean=10 and cv=2, then look at some normal quantile-quantile # plots for various transformations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) dev.new() qqPlot(x, add.line = TRUE) dev.new() qqPlot(boxcoxTransform(x, lambda = 0.5), add.line = TRUE) dev.new() qqPlot(boxcoxTransform(x, lambda = 0), add.line = TRUE) # Clean up #--------- rm(x)
Fit a calibration line or curve based on linear regression.
calibrate(formula, data, test.higher.orders = TRUE, max.order = 4, p.crit = 0.05, F.test = "partial", weights, subset, na.action, method = "qr", model = FALSE, x = FALSE, y = FALSE, contrasts = NULL, warn = TRUE, ...)
formula |
a |
data |
an optional data frame, list or environment (or object coercible by |
test.higher.orders |
logical scalar indicating whether to start with a model that contains a single predictor
variable and test the fit of higher order polynomials to consider for the calibration
curve ( |
max.order |
integer indicating the maximum order of the polynomial to consider for the
calibration curve. The default value is |
p.crit |
numeric scalar between 0 and 1 indicating the p-value to use for the stepwise regression
when determining which polynomial model to use. The default value is |
F.test |
character string indicating whether to perform the stepwise regression using the
standard partial F-test ( |
weights |
optional vector of observation weights; if supplied, the algorithm fits to minimize the sum of the
weights multiplied into the squared residuals. The length of weights must be the same as
the number of observations. The weights must be nonnegative and it is strongly recommended
that they be strictly positive, since zero weights are ambiguous, compared to use of the
|
subset |
optional expression saying which subset of the rows of the data should be used in the fit. This can be a logical vector (which is replicated to have length equal to the number of observations), or a numeric vector indicating which observation numbers are to be included, or a character vector of the row names to be included. All observations are included by default. |
na.action |
optional function which indicates what should happen when the data contain |
method |
optional method to be used; for fitting, currently only |
model , x , y , qr
|
optional logicals. If |
contrasts |
an optional list. See the argument |
warn |
logical scalar indicating whether to issue a warning ( |
... |
additional arguments to be passed to the low level regression fitting functions
(see |
A simple and frequently used calibration model is a straight line:

$$S = \beta_0 + \beta_1 C + \epsilon$$

where the response variable S denotes the signal of the machine and the predictor variable C denotes the true concentration in the physical sample. The error term $\epsilon$ is assumed to follow a normal distribution with mean 0. Note that the average value of the signal for a blank (C = 0) is the intercept $\beta_0$. Other possible calibration models include higher order polynomial models such as a quadratic or cubic model.
In a typical setup, a small number of samples (e.g., n = 6) with known concentrations are measured and the signal is recorded. A sample with no chemical in it, called a blank, is also measured. (You have to be careful to define exactly what you mean by a “blank.” A blank could mean a container from the lab that has nothing in it but is prepared in a similar fashion to containers with actual samples in them. Or it could mean a field blank: the container was taken out to the field and subjected to the same process that all other containers were subjected to, except a physical sample of soil or water was not placed in the container.) Usually, replicate measures at the same known concentrations are taken. (The term “replicate” must be well defined to distinguish between for example the same physical samples that are measured more than once vs. two different physical samples of the same known concentration.)
The function calibrate
initially fits a linear calibration model. If the argument
max.order
is greater than 1, calibrate
then performs forward stepwise linear
regression to determine the “best” polynomial model.
In the case where replicates are not available, calibrate
uses standard stepwise
ANOVA to compare models (Draper and Smith, 1998, p.335). In this case, if the p-value
for the partial F-test to compare models is greater than or equal to p.crit
, then
the model with fewer terms is used as the final model.
In the case where replicates are available, if F.test="lof"
, then for each model
calibrate
computes the p-value of the ANOVA for lack-of-fit vs. pure error
(Draper and Smith, 1998, Chapters 2; see anovaPE
). If the p-value is
greater than or equal to p.crit
, then this is the final model; otherwise the next
higher-order term is added to the polynomial and the model is re-fit. If, during the
stepwise procedure, the degrees of freedom associated with the residual sums of squares
of a model to be tested is less than or equal to the number of observations minus the
number of unique observations, calibrate
uses the partial F-test instead of the
lack-of-fit F-test.
The stepwise algorithm terminates when either the p-value is greater than or equal to
p.crit
, or the currently selected model in the algorithm is of order
max.order
. The algorithm will terminate earlier than this if the next model to be
fit includes singularities so that not all coefficients can be estimated.
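Because the EPA cadmium data used in the Examples section below contain replicate spikes, the stepwise options described above can be exercised as in the following sketch (the argument values are illustrative only):

# Sketch: let calibrate() consider up to a cubic model and compare models
# with the lack-of-fit F-test rather than the default partial F-test.
cal.lof <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df,
  max.order = 3, F.test = "lof", p.crit = 0.05)
summary(cal.lof)    # the final fitted calibration model (inherits from "lm")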
An object of class
"calibrate"
that inherits from
class
"lm"
and includes a component called
x
that stores the model matrix (the values of the predictor variables for the final
calibration model).
Almost always the process of determining the concentration of a chemical in a soil,
water, or air sample involves using some kind of machine that produces a signal, and
this signal is related to the concentration of the chemical in the physical sample.
The process of relating the machine signal to the concentration of the chemical is
called calibration. Once calibration has been performed, estimated
concentrations in physical samples with unknown concentrations are computed using
inverse regression (see inversePredictCalibrate
). The uncertainty
in the process used to estimate the concentration may be quantified with decision,
detection, and quantitation limits.
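As a sketch of that workflow, a fitted calibration object can be passed to inversePredictCalibrate to turn new machine signals back into estimated concentrations. The argument name obs.y is assumed here; see the inversePredictCalibrate help file for the exact interface:

# Sketch: inverse prediction from observed signals back to concentrations
cal.fit <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df)
new.signal <- c(10, 40, 80)                      # hypothetical new machine signals
inversePredictCalibrate(cal.fit, obs.y = new.signal)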
Steven P. Millard ([email protected])
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 3 and p.335.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley & Sons, Hoboken. Chapter 6, p. 111.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey. Chapter 3, p. 22.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.562-575.
calibrate.object
, anovaPE
,
inversePredictCalibrate
,
detectionLimitCalibrate
, lm
.
# The data frame EPA.97.cadmium.111.df contains calibration data for # cadmium at mass 111 (ng/L) that appeared in Gibbons et al. (1997b) # and were provided to them by the U.S. EPA. # Display a plot of these data along with the fitted calibration line # and 99% non-simultaneous prediction limits. See # Millard and Neerchal (2001, pp.566-569) for more details on this # example. Cadmium <- EPA.97.cadmium.111.df$Cadmium Spike <- EPA.97.cadmium.111.df$Spike calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) newdata <- data.frame(Spike = seq(min(Spike), max(Spike), len = 100)) pred.list <- predict(calibrate.list, newdata = newdata, se.fit = TRUE) pointwise.list <- pointwise(pred.list, coverage = 0.99, individual = TRUE) dev.new() plot(Spike, Cadmium, ylim = c(min(pointwise.list$lower), max(pointwise.list$upper)), xlab = "True Concentration (ng/L)", ylab = "Observed Concentration (ng/L)") abline(calibrate.list, lwd = 2) lines(newdata$Spike, pointwise.list$lower, lty = 8, lwd = 2) lines(newdata$Spike, pointwise.list$upper, lty = 8, lwd = 2) title(paste("Calibration Line and 99% Prediction Limits", "for US EPA Cadmium 111 Data", sep = "\n")) #---------- # Clean up #--------- rm(Cadmium, Spike, newdata, calibrate.list, pred.list, pointwise.list) graphics.off()
Objects of S3 class "calibrate"
are returned by the EnvStats
function calibrate
, which fits a calibration line or curve based
on linear regression.
Objects of class "calibrate"
are lists that inherit from
class
"lm"
and include a component called
x
that stores the model matrix (the values of the predictor variables
for the final calibration model).
See the help file for lm
.
Required Components
Besides the usual components in the list returned by the function lm
,
the following components must be included in a legitimate list of
class "calibrate"
.
x |
the model matrix from the linear model fit. |
Generic functions that have methods for objects of class
"calibrate"
include:
NONE AT PRESENT.
Since objects of class "calibrate"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
calibrate
, inversePredictCalibrate
,
detectionLimitCalibrate
.
# Create an object of class "calibrate", then print it out. # The data frame EPA.97.cadmium.111.df contains calibration data for # cadmium at mass 111 (ng/L) that appeared in Gibbons et al. (1997b) # and were provided to them by the U.S. EPA. calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) names(calibrate.list) calibrate.list #---------- # Clean up #--------- rm(calibrate.list)
# Create an object of class "calibrate", then print it out. # The data frame EPA.97.cadmium.111.df contains calibration data for # cadmium at mass 111 (ng/L) that appeared in Gibbons et al. (1997b) # and were provided to them by the U.S. EPA. calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) names(calibrate.list) calibrate.list #---------- # Clean up #--------- rm(calibrate.list)
Detailed abstract of the manuscript:
Castillo, E., and A. Hadi. (1994). Parameter and Quantile Estimation for the Generalized Extreme-Value Distribution. Environmetrics 5, 417–432.
Abstract
Castillo and Hadi (1994) introduce a new way to estimate the parameters and quantiles of the generalized extreme value distribution (GEVD) with parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$. The estimator is based on a two-stage procedure using order statistics, denoted here by “TSOE”, which stands for two-stage order-statistics estimator. Castillo and Hadi (1994) compare the TSOE to the maximum likelihood estimator (MLE; Jenkinson, 1969; Prescott and Walden, 1983) and the probability-weighted moments estimator (PWME; Hosking et al., 1985).
Castillo and Hadi (1994) note that for some samples the likelihood may not have a local maximum, and that for others the likelihood can be made infinite, so the MLE does not exist. They also note, as do Hosking et al. (1985), that when $\kappa \le -1$, the moments and probability-weighted moments of the GEVD do not exist, hence neither does the PWME. (Hosking et al., however, claim that in practice the shape parameter usually lies between -1/2 and 1/2.) On the other hand, the TSOE exists for all values of $\kappa$.
Based on computer simulations, Castillo and Hadi (1994) found that the performance (bias and root mean squared error) of the TSOE is comparable to the PWME for values of $\kappa$ in the range $-1/2 \le \kappa \le 1/2$. They also found that the TSOE is superior to the PWME for large values of $\kappa$. Their results, however, are based on using the PWME computed using the approximation given in equation (14) of Hosking et al. (1985, p.253). The true PWME is computed using equation (12) of Hosking et al. (1985, p.253). Hosking et al. (1985) introduced the approximation as a matter of computational convenience, and noted that it is valid in the range $-1/2 \le \kappa \le 1/2$. If Castillo and Hadi (1994) had used the true PWME for values of $\kappa$ larger than 1/2, they probably would have gotten very different results for the PWME. (Note: the function egevd with method="pwme" uses the exact equation (12) of Hosking et al. (1985), not the approximation (14).)
Castillo and Hadi (1994) suggest using the bootstrap or jackknife to obtain
variance estimates and confidence intervals for the distribution parameters
based on the TSOE.
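A sketch of that suggestion using the EnvStats functions rgevd and egevd (the availability of method="tsoe" in egevd is assumed here; see its help file, and note the simulation parameters are arbitrary):

# Sketch: bootstrap standard errors for the TSOE of the GEVD parameters
set.seed(21)
x <- rgevd(60, location = 10, scale = 2, shape = 0.2)
boot.est <- replicate(200, {
  xb <- sample(x, replace = TRUE)
  egevd(xb, method = "tsoe")$parameters
})
apply(boot.est, 1, sd)    # bootstrap standard errors for location, scale, shape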
More Details
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from a generalized extreme value distribution with parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$ with cumulative distribution function $F$. Also, let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote the ordered values of $\underline{x}$.
First Stage
Castillo and Hadi (1994) propose as initial estimates of the distribution parameters the solutions to the following set of simultaneous equations based on just three observations from the total sample of size $n$:

$$F[x_{(1)}; \eta, \theta, \kappa] = \hat{p}_1, \qquad F[x_{(i)}; \eta, \theta, \kappa] = \hat{p}_i, \qquad F[x_{(n)}; \eta, \theta, \kappa] = \hat{p}_n \qquad (1)$$

where $2 \le i \le n - 1$, and $\hat{p}_i$ denotes the $i$'th plotting position for a sample of size $n$; that is, a nonparametric estimate of the value of $F$ at $x_{(i)}$. Typically, plotting positions have the form:

$$\hat{p}_i = \frac{i - a}{n + b}$$

where $a$ and $b$ are suitably chosen constants. In their simulation studies, Castillo and Hadi (1994) used a=0.35, b=0.
Since $i$ is arbitrary in the above set of equations (1), denote the solutions to these equations by:

$$\hat{\eta}_i, \qquad \hat{\theta}_i, \qquad \hat{\kappa}_i \qquad (2)$$

There are thus $n - 2$ sets of estimates.
Castillo and Hadi (1994) show that the estimate of the shape parameter, $\hat{\kappa}_i$, is the solution to a single equation in $\hat{\kappa}_i$ (their Equation (3)), and they show how to easily solve equation (3) using the method of bisection. Once the estimate of the shape parameter is obtained, the estimates of the location and scale parameters, $\hat{\eta}_i$ and $\hat{\theta}_i$, follow in closed form from the equations in (1).
Second Stage
Apply a robust function to the $n - 2$ sets of estimates obtained in the first stage. Castillo and Hadi (1994) suggest using either the median or the least median of squares (using a column of 1's as the predictor variable; see the help file for lmsreg in the package MASS). Using the median, for example, the final distribution parameter estimates are given by:

$$\hat{\eta} = \mathrm{median}(\hat{\eta}_2, \ldots, \hat{\eta}_{n-1}), \qquad \hat{\theta} = \mathrm{median}(\hat{\theta}_2, \ldots, \hat{\theta}_{n-1}), \qquad \hat{\kappa} = \mathrm{median}(\hat{\kappa}_2, \ldots, \hat{\kappa}_{n-1})$$
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Hosking, J.R.M. (1985). Algorithm AS 215: Maximum-Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 34(3), 301–310.
Jenkinson, A.F. (1969). Statistics of Extremes. Technical Note 98, World Meteorological Office, Geneva.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Prescott, P., and A.T. Walden. (1983). Maximum Likelihood Estimation of the Three-Parameter Generalized Extreme-Value Distribution from Censored Samples. Journal of Statistical Computing and Simulation 16, 241–250.
Generalized Extreme Value Distribution, egevd.
For one sample, plots the empirical cumulative distribution function (ecdf) along with a theoretical cumulative distribution function (cdf). For two samples, plots the two ecdf's. These plots are used to graphically assess goodness of fit.
cdfCompare(x, y = NULL, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = NULL, distribution = "norm", param.list = NULL, estimate.params = is.null(param.list), est.arg.list = NULL, x.col = "blue", y.or.fitted.col = "black", x.lwd = 3 * par("cex"), y.or.fitted.lwd = 3 * par("cex"), x.lty = 1, y.or.fitted.lty = 2, digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
y |
a numeric vector (not necessarily of the same length as |
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities). Possible values are
|
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
When |
distribution |
when |
param.list |
when |
estimate.params |
when |
est.arg.list |
when |
x.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
y.or.fitted.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
x.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
y.or.fitted.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
x.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
y.or.fitted.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
digits |
when |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
When both x and y are supplied, the function cdfCompare creates the empirical cdf plot of x and y on the same plot by calling the function ecdfPlot.

When y is not supplied, the function cdfCompare creates the empirical cdf plot of x (by calling ecdfPlot) and the theoretical cdf plot (by calling cdfPlot and using the argument distribution) on the same plot.
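A minimal sketch of the two modes just described, using only the cdfCompare arguments documented on this page:

# One-sample mode: empirical cdf of x overlaid with a fitted lognormal cdf
set.seed(123)
x <- rlnorm(25, meanlog = 1, sdlog = 0.5)
cdfCompare(x, distribution = "lnorm")

# Two-sample mode: empirical cdfs of x and y on the same plot
y <- rlnorm(40, meanlog = 1.2, sdlog = 0.5)
cdfCompare(x, y)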
When y
is supplied, cdfCompare
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
y.ecdf.list |
a list with components |
When y
is not supplied, cdfCompare
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
fitted.cdf.list |
a list with components |
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
Chambers et al. (1983, pp.11-16) plot the observed order statistics on the y-axis vs. the ecdf on the x-axis and call this a quantile plot.
Empirical cumulative distribution function (ecdf) plots are often plotted with
theoretical cdf plots (see cdfPlot
and cdfCompare
) to
graphically assess whether a sample of observations comes from a particular
distribution. The Kolmogorov-Smirnov goodness-of-fit test
(see gofTest
) is the statistical companion of this kind of
comparison; it is based on the maximum vertical distance between the empirical
cdf plot and the theoretical cdf plot. More often, however,
quantile-quantile (Q-Q) plots are used instead of ecdf plots to graphically assess
departures from an assumed distribution (see qqPlot
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
# Generate 20 observations from a normal (Gaussian) distribution # with mean=10 and sd=2 and compare the empirical cdf with a # theoretical normal cdf that is based on estimating the parameters. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rnorm(20, mean = 10, sd = 2) dev.new() cdfCompare(x) #---------- # Generate 30 observations from an exponential distribution with parameter # rate=0.1 (see the R help file for Exponential) and compare the empirical # cdf with the empirical cdf of the normal observations generated in the # previous example: set.seed(432) y <- rexp(30, rate = 0.1) dev.new() cdfCompare(x, y) #========== # Generate 20 observations from a Poisson distribution with parameter lambda=10 # (see the R help file for Poisson) and compare the empirical cdf with a # theoretical Poisson cdf based on estimating the distribution parameters. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rpois(20, lambda = 10) dev.new() cdfCompare(x, dist = "pois") #========== # Clean up #--------- rm(x, y) graphics.off()
For one sample, plots the empirical cumulative distribution function (ecdf) along with a theoretical cumulative distribution function (cdf). For two samples, plots the two ecdf's. These plots are used to graphically assess goodness of fit.
cdfCompareCensored(x, censored, censoring.side = "left", y = NULL, y.censored = NULL, y.censoring.side = censoring.side, discrete = FALSE, prob.method = "michael-schucany", plot.pos.con = NULL, distribution = "norm", param.list = NULL, estimate.params = is.null(param.list), est.arg.list = NULL, x.col = "blue", y.or.fitted.col = "black", x.lwd = 3 * par("cex"), y.or.fitted.lwd = 3 * par("cex"), x.lty = 1, y.or.fitted.lty = 2, include.x.cen = FALSE, x.cen.pch = ifelse(censoring.side == "left", 6, 2), x.cen.cex = par("cex"), x.cen.col = "red", include.y.cen = FALSE, y.cen.pch = ifelse(y.censoring.side == "left", 6, 2), y.cen.cex = par("cex"), y.cen.col = "black", digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
cdfCompareCensored(x, censored, censoring.side = "left", y = NULL, y.censored = NULL, y.censoring.side = censoring.side, discrete = FALSE, prob.method = "michael-schucany", plot.pos.con = NULL, distribution = "norm", param.list = NULL, estimate.params = is.null(param.list), est.arg.list = NULL, x.col = "blue", y.or.fitted.col = "black", x.lwd = 3 * par("cex"), y.or.fitted.lwd = 3 * par("cex"), x.lty = 1, y.or.fitted.lty = 2, include.x.cen = FALSE, x.cen.pch = ifelse(censoring.side == "left", 6, 2), x.cen.cex = par("cex"), x.cen.col = "red", include.y.cen = FALSE, y.cen.pch = ifelse(y.censoring.side == "left", 6, 2), y.cen.cex = par("cex"), y.cen.col = "black", digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
y |
a numeric vector (not necessarily of the same length as |
y.censored |
numeric or logical vector indicating which values of This argument is ignored when |
y.censoring.side |
character string indicating on which side the censoring occurs for the values of
|
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions (empirical probabilities).
Possible values are
The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
When |
distribution |
when |
param.list |
when |
estimate.params |
when |
est.arg.list |
when |
x.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
y.or.fitted.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
x.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
y.or.fitted.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
x.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
y.or.fitted.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
include.x.cen |
logical scalar indicating whether to include censored values in |
x.cen.pch |
numeric scalar or character string indicating the plotting character to use to plot
censored values in |
x.cen.cex |
numeric scalar that determines the size of the plotting character used to plot
censored values in |
x.cen.col |
numeric scalar or character string that determines the color of the plotting
character used to plot censored values in |
include.y.cen |
logical scalar indicating whether to include censored values in |
y.cen.pch |
numeric scalar or character string indicating the plotting character to use to plot
censored values in |
y.cen.cex |
numeric scalar that determines the size of the plotting character used to plot
censored values in |
y.cen.col |
numeric scalar or character string that determines the color of the plotting
character used to plot censored values in |
digits |
when |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
When both x and y are supplied, the function cdfCompareCensored creates the empirical cdf plot of x and y on the same plot by calling the function ecdfPlotCensored.

When y is not supplied, the function cdfCompareCensored creates the empirical cdf plot of x (by calling ecdfPlotCensored) and the theoretical cdf plot (by calling cdfPlot and using the argument distribution) on the same plot.
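A minimal sketch of the one-sample mode, using only the cdfCompareCensored arguments documented on this page:

# Left-censor values below a detection limit of 5, then compare the
# censored empirical cdf with a fitted lognormal cdf
set.seed(42)
x <- rlnorm(30, meanlog = 2, sdlog = 1)
censored <- x < 5
x[censored] <- 5
cdfCompareCensored(x, censored, distribution = "lnorm")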
When y
is supplied, cdfCompareCensored
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
y.ecdf.list |
a list with components |
When y
is not supplied, cdfCompareCensored
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
fitted.cdf.list |
a list with components |
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
Chambers et al. (1983, pp.11-16) plot the observed order statistics on the y-axis vs. the ecdf on the x-axis and call this a quantile plot.
Censored observations complicate the procedures used to graphically explore data.
Techniques from survival analysis and life testing have been developed to generalize
the procedures for constructing plotting positions, empirical cdf plots, and
q-q plots to data sets with censored observations
(see ppointsCensored
).
Empirical cumulative distribution function (ecdf) plots are often plotted with
theoretical cdf plots to graphically assess whether a sample of observations
comes from a particular distribution. More often, however, quantile-quantile
(Q-Q) plots are used instead of ecdf plots to graphically assess departures from
an assumed distribution (see qqPlotCensored
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
cdfPlot
, ecdfPlotCensored
, qqPlotCensored
.
# Generate 20 observations from a normal distribution with mean=20 and sd=5, # censor all observations less than 18, then compare the empirical cdf with a # theoretical normal cdf that is based on estimating the parameters. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(333) x <- sort(rnorm(20, mean=20, sd=5)) x # [1] 9.743551 12.370197 14.375499 15.628482 15.883507 17.080124 # [7] 17.197588 18.097714 18.654182 19.585942 20.219308 20.268505 #[13] 20.552964 21.388695 21.763587 21.823639 23.168039 26.165269 #[19] 26.843362 29.673405 censored <- x < 18 x[censored] <- 18 sum(censored) #[1] 7 dev.new() cdfCompareCensored(x, censored) # Clean up #--------- rm(x, censored) #========== # Example 15-1 of USEPA (2009, page 15-10) gives an example of # computing plotting positions based on censored manganese # concentrations (ppb) in groundwater collected at 5 monitoring # wells. The data for this example are stored in # EPA.09.Ex.15.1.manganese.df. Here we will compare the empirical # cdf based on Kaplan-Meier plotting positions or Michael-Schucany # plotting positions with various assumed distributions # (based on estimating the parameters of these distributions): # 1) normal distribution # 2) lognormal distribution # 3) gamma distribution # First look at the data: #------------------------ EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #4 4 Well.1 21.6 21.6 FALSE #5 5 Well.1 <2 2.0 TRUE #... #21 1 Well.5 17.9 17.9 FALSE #22 2 Well.5 22.7 22.7 FALSE #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Assume a normal distribution #----------------------------- # Michael-Schucany plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored)) # Kaplan-Meier plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, prob.method = "kaplan-meier")) # Assume a lognormal distribution #-------------------------------- # Michael-Schucany plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, dist = "lnorm")) # Kaplan-Meier plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, dist = "lnorm", prob.method = "kaplan-meier")) # Assume a gamma distribution #---------------------------- # Michael-Schucany plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, dist = "gamma")) # Kaplan-Meier plotting positions: dev.new() with(EPA.09.Ex.15.1.manganese.df, cdfCompareCensored(Manganese.ppb, Censored, dist = "gamma", prob.method = "kaplan-meier")) # Clean up #--------- graphics.off() #========== # Compare the distributions of copper and zinc between the Alluvial Fan Zone # and the Basin-Trough Zone using the data of Millard and Deverel (1988). # The data are stored in Millard.Deverel.88.df. Millard.Deverel.88.df # Cu.orig Cu Cu.censored Zn.orig Zn Zn.censored Zone Location #1 < 1 1 TRUE <10 10 TRUE Alluvial.Fan 1 #2 < 1 1 TRUE 9 9 FALSE Alluvial.Fan 2 #3 3 3 FALSE NA NA FALSE Alluvial.Fan 3 #. #. 
#. #116 5 5 FALSE 50 50 FALSE Basin.Trough 48 #117 14 14 FALSE 90 90 FALSE Basin.Trough 49 #118 4 4 FALSE 20 20 FALSE Basin.Trough 50 Cu.AF <- with(Millard.Deverel.88.df, Cu[Zone == "Alluvial.Fan"]) Cu.AF.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Alluvial.Fan"]) Cu.BT <- with(Millard.Deverel.88.df, Cu[Zone == "Basin.Trough"]) Cu.BT.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Basin.Trough"]) Zn.AF <- with(Millard.Deverel.88.df, Zn[Zone == "Alluvial.Fan"]) Zn.AF.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Alluvial.Fan"]) Zn.BT <- with(Millard.Deverel.88.df, Zn[Zone == "Basin.Trough"]) Zn.BT.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Basin.Trough"]) # First compare the copper concentrations #---------------------------------------- dev.new() cdfCompareCensored(x = Cu.AF, censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen) # Now compare the zinc concentrations #------------------------------------ dev.new() cdfCompareCensored(x = Zn.AF, censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen) # Compare the Zinc concentrations again, but delete # the one "outlier". #-------------------------------------------------- summaryStats(Zn.AF) # N Mean SD Median Min Max NA's N.Total #Zn.AF 67 23.5075 74.4192 10 3 620 1 68 summaryStats(Zn.BT) # N Mean SD Median Min Max #Zn.BT 50 21.94 18.7044 18.5 3 90 which(Zn.AF == 620) #[1] 38 summaryStats(Zn.AF[-38]) # N Mean SD Median Min Max NA's N.Total #Zn.AF[-38] 66 14.4697 8.1604 10 3 50 1 67 dev.new() cdfCompareCensored(x = Zn.AF[-38], censored = Zn.AF.cen[-38], y = Zn.BT, y.censored = Zn.BT.cen) #---------- # Clean up #--------- rm(Cu.AF, Cu.AF.cen, Cu.BT, Cu.BT.cen, Zn.AF, Zn.AF.cen, Zn.BT, Zn.BT.cen) graphics.off()
Produce a cumulative distribution function (cdf) plot for a user-specified distribution.
cdfPlot(distribution = "norm", param.list = list(mean = 0, sd = 1), left.tail.cutoff = ifelse(is.finite(supp.min), 0, 0.001), right.tail.cutoff = ifelse(is.finite(supp.max), 0, 0.001), plot.it = TRUE, add = FALSE, n.points = 1000, cdf.col = "black", cdf.lwd = 3 * par("cex"), cdf.lty = 1, curve.fill = FALSE, curve.fill.col = "cyan", digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
cdfPlot(distribution = "norm", param.list = list(mean = 0, sd = 1), left.tail.cutoff = ifelse(is.finite(supp.min), 0, 0.001), right.tail.cutoff = ifelse(is.finite(supp.max), 0, 0.001), plot.it = TRUE, add = FALSE, n.points = 1000, cdf.col = "black", cdf.lwd = 3 * par("cex"), cdf.lty = 1, curve.fill = FALSE, curve.fill.col = "cyan", digits = .Options$digits, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
distribution |
a character string denoting the distribution abbreviation. The default value is
|
param.list |
a list with values for the parameters of the distribution. The default value is
|
left.tail.cutoff |
a numeric scalar indicating what proportion of the left-tail of the probability
distribution to omit from the plot. For densities with a finite support minimum
(e.g., Lognormal) the default value is |
right.tail.cutoff |
a scalar indicating what proportion of the right-tail of the probability
distribution to omit from the plot. For densities with a finite support maximum
(e.g., Binomial) the default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the cumulative distribution function curve
to the existing plot ( |
n.points |
a numeric scalar specifying at how many evenly-spaced points the cumulative
distribution function will be evaluated. The default value is |
cdf.col |
a numeric scalar or character string determining
the color of the cdf line in the plot.
The default value is |
cdf.lwd |
a numeric scalar determining the width of the cdf
line in the plot.
The default value is |
cdf.lty |
a numeric scalar determining the line type of
the cdf line in the plot.
The default value is |
curve.fill |
a logical value indicating whether to fill in
the area below the cumulative distribution function curve with the color specified by
|
curve.fill.col |
when |
digits |
a scalar indicating how many significant digits to print for the distribution
parameters. The default value is |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The cumulative distribution function (cdf) of a random variable X, usually denoted F, is defined as:

F(x) = Pr(X <= x)        (1)

That is, F(x) is the probability that X is less than or equal to x. This is the probability that the random variable X takes on a value in the interval (-Inf, x], and is simply the (Lebesgue) integral of the pdf evaluated between -Inf and x. That is,

F(x) = Pr(X <= x) = Integral from -Inf to x of f(t) dt        (2)

where f(t) denotes the probability density function of X evaluated at t. For discrete distributions, Equation (2) translates to summing up the probabilities of all values in this interval:

F(x) = Pr(X <= x) = Sum over {t <= x} of Pr(X = t)        (3)
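As a quick numerical illustration of Equation (2) (a sketch, not part of the original help file), for a standard normal random variable the cdf at a point equals the integral of the density up to that point:

x0 <- 1.3
pnorm(x0)                                          # F(x0) for the standard normal
integrate(dnorm, lower = -Inf, upper = x0)$value   # integral of the pdf up to x0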
A cumulative distribution function (cdf) plot plots the values of the cdf against quantiles of the specified distribution. Theoretical cdf plots are sometimes plotted along with empirical cdf plots to visually assess whether data have a particular distribution.
cdfPlot
invisibly returns a list giving coordinates of the points
that have been or would have been plotted:
Quantiles |
The quantiles used for the plot. |
Cumulative.Probabilities |
The values of the cdf associated with the quantiles. |
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Distribution.df
, ecdfPlot
, cdfCompare
,
pdfPlot
.
# Plot the cdf of the standard normal distribution #------------------------------------------------- dev.new() cdfPlot() #========== # Plot the cdf of the standard normal distribution # and a N(2, 2) distribution on the sample plot. #------------------------------------------------- dev.new() cdfPlot(param.list = list(mean=2, sd=2), main = "") cdfPlot(add = TRUE, cdf.col = "red") legend("topleft", legend = c("N(2,2)", "N(0,1)"), col = c("black", "red"), lwd = 3 * par("cex")) title("CDF Plots for Two Normal Distributions") #========== # Clean up #--------- graphics.off()
For a skewed distribution, estimate the mean, standard deviation, and skew; test the null hypothesis that the mean is equal to a user-specified value vs. a one-sided alternative; and create a one-sided confidence interval for the mean.
chenTTest(x, y = NULL, alternative = "greater", mu = 0, paired = !is.null(y), conf.level = 0.95, ci.method = "z")
chenTTest(x, y = NULL, alternative = "greater", mu = 0, paired = !is.null(y), conf.level = 0.95, ci.method = "z")
x |
numeric vector of observations. Missing ( |
y |
optional numeric vector of observations that are paired with the observations in
|
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
mu |
numeric scalar indicating the hypothesized value of the mean. The default value is
|
paired |
character string indicating whether to perform a paired or one-sample t-test. The
possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the
confidence interval for the population mean. The default value is |
ci.method |
character string indicating which critical value to use to construct the confidence
interval for the mean. The possible values are |
One-Sample Case (paired=FALSE)

Let x = (x_1, x_2, ..., x_n) be a vector of n independent and identically distributed (i.i.d.) observations from some distribution with mean mu and standard deviation sigma.
Background: The Conventional Student's t-Test
Assume that the n observations come from a normal (Gaussian) distribution, and
consider the test of the null hypothesis:

H0: mu = mu0        (1)

The three possible alternative hypotheses are the upper one-sided alternative
(alternative="greater"):

Ha: mu > mu0        (2)

the lower one-sided alternative (alternative="less"):

Ha: mu < mu0        (3)

and the two-sided alternative:

Ha: mu != mu0        (4)

The test of the null hypothesis (1) versus any of the three alternatives (2)-(4) is usually based on the Student t-statistic:

t = (x.bar - mu0) / (s / sqrt(n))        (5)

where

x.bar = (1/n) * Sum(x_i)        (6)

s^2 = [1/(n-1)] * Sum[(x_i - x.bar)^2]        (7)

(see the R help file for t.test). Under the null hypothesis (1),
the t-statistic in (5) follows a Student's t-distribution with n-1
degrees of freedom (Zar, 2010, p.99; Johnson et al., 1995, pp.362-363).
The t-statistic is fairly robust to departures from normality in terms of
maintaining Type I error and power, provided that the sample size is sufficiently
large.
Chen's Modified t-Test for Skewed Distributions
In the case when the underlying distribution of the observations is
positively skewed and the sample size is small, the sampling distribution of the
t-statistic under the null hypothesis (1) does not follow a Student's t-distribution,
but is instead negatively skewed. For the test against the upper alternative in (2)
above, this leads to a Type I error smaller than the one assumed and a loss of power
(Chen, 1995b, p.767).
Similarly, in the case when the underlying distribution of the observations
is negatively skewed and the sample size is small, the sampling distribution of the
t-statistic is positively skewed. For the test against the lower alternative in (3)
above, this also leads to a Type I error smaller than the one assumed and a loss of
power.
In order to overcome these problems, Chen (1995b) proposed the following modified t-statistic that takes into account the skew of the underlying distribution:

t2 = t + a*(1 + 2*t^2) + 4*a^2*(t + 2*t^3)        (8)

where

a = skew.hat / (6 * sqrt(n))        (9)

and skew.hat denotes the estimated skew of the underlying distribution, which is based on unbiased estimators of central moments (see the help file for skewness).
For a positively-skewed distribution, Chen's modified t-test rejects the null hypothesis (1) in favor of the upper one-sided alternative (2) if the t-statistic in (8) is too large. For a negatively-skewed distribution, Chen's modified t-test rejects the null hypothesis (1) in favor of the lower one-sided alternative (3) if the t-statistic in (8) is too small.
Chen's modified t-test is not applicable to testing the two-sided alternative
(4). It should also not be used to test the upper one-sided alternative (2)
based on negatively-skewed data, nor should it be used to test the lower one-sided
alternative (3) based on positively-skewed data.
Determination of Critical Values and p-Values
Chen (1995b) performed a simulation study in which the modified t-statistic in (8)
was compared to a critical value based on the normal distribution (z-value),
a critical value based on Student's t-distribution (t-value), and the average of the
critical z-value and t-value. Based on the simulation study, Chen (1995b) suggests
using either the z-value or the average of the z-value and t-value when the sample
size n or the Type I error rate alpha is small, and using either the t-value or the
average of the z-value and t-value when n or alpha is larger.
The function chenTTest
returns three different p-values: one based on the
normal distribution, one based on Student's t-distribution, and one based on the
average of these two p-values. This last p-value should roughly correspond to a
p-value based on the distribution of the average of a normal and Student's t
random variable.
Computing Confidence Intervals
The function chenTTest computes a one-sided confidence interval for the true
mean mu based on finding all possible values of mu for which the null
hypothesis (1) would not be rejected, with the confidence level determined by the
argument conf.level. The argument ci.method determines which p-value
is used in the algorithm to determine the bounds on mu. When
ci.method="z", the p-value is based on the normal distribution; when
ci.method="t", the p-value is based on Student's t-distribution; and when
ci.method="Avg. of z and t", the p-value is based on the average of the
p-values from the normal and Student's t-distributions.
Paired-Sample Case (paired=TRUE)

When the argument paired=TRUE, the arguments x and y are assumed
to have the same length, and the n differences d_i = x_i - y_i (i = 1, 2, ..., n)
are assumed to be i.i.d. observations from some distribution with mean mu
and standard deviation sigma. Chen's modified t-test can then be applied
to the differences.
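A small sketch of the paired case (not part of the original help file), using simulated positively-skewed differences:

# Test whether the mean of the differences x - y is greater than 0
set.seed(15)
y <- rlnorm(25, meanlog = 2, sdlog = 1)
x <- y + rlnorm(25, meanlog = 0, sdlog = 1)   # differences are positively skewed
chenTTest(x, y, paired = TRUE, mu = 0, alternative = "greater")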
a list of class "htest"
containing the results of the hypothesis test. See
the help file for htest.object
for details.
The presentation of Chen's (1995b) method in USEPA (2002d) and Singh et al. (2010b, p. 52) is incorrect for two reasons: it is based on an intermediate formula instead of the actual statistic that Chen proposes, and it uses the intermediate formula to compute an upper confidence limit for the mean when the sample data are positively skewed. As explained above, for the case of positively skewed data, Chen's method is appropriate to test the upper one-sided alternative hypothesis that the population mean is greater than some specified value, and a one-sided upper alternative corresponds to creating a one-sided lower confidence limit, not an upper confidence limit (see, for example, Millard and Neerchal, 2001, p. 371).
A frequent question in environmental statistics is “Is the concentration of chemical X greater than Y units?” For example, in groundwater assessment (compliance) monitoring at hazardous and solid waste sites, the concentration of a chemical in the groundwater at a downgradient may be compared to a groundwater protection standard (GWPS). If the concentration is “above” the GWPS, then the site enters corrective action monitoring. As another example, soil screening at a Superfund site involves comparing the concentration of a chemical in the soil with a pre-determined soil screening level (SSL). If the concentration is “above” the SSL, then further investigation and possible remedial action is required. Determining what it means for the chemical concentration to be “above” a GWPS or an SSL is a policy decision: the average of the distribution of the chemical concentration must be above the GWPS or SSL, or the median must be above the GWPS or SSL, or the 95'th percentile must be above the GWPS or SSL, or something else. Often, the first interpretation is used.
The regulatory guidance document Soil Screening Guidance: Technical Background Document (USEPA, 1996c, Part 4) recommends using Chen's t-test as one possible method to compare chemical concentrations in soil samples to a soil screening level (SSL). The document notes that the distribution of chemical concentrations will almost always be positively-skewed, but not necessarily fit a lognormal distribution well (USEPA, 1996c, pp.107, 117-119). It also notes that using a confidence interval based on Land's (1971) method is extremely sensitive to the assumption of a lognormal distribution, while Chen's test is robust with respect to maintaining Type I and Type II errors for a variety of positively-skewed distributions (USEPA, 1996c, pp.99, 117-119, 123-125).
Hypothesis tests you can use to perform tests of location include: Student's t-test, Fisher's randomization test, the Wilcoxon signed rank test, Chen's modified t-test, the sign test, and a test based on a bootstrap confidence interval. For a discussion comparing the performance of these tests, see Millard and Neerchal (2001, pp.408–409).
Steven P. Millard ([email protected])
Chen, L. (1995b). Testing the Mean of Skewed Distributions. Journal of the American Statistical Association 90(430), 767–772.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 28, 31.
Land, C.E. (1971). Confidence Intervals for Linear Functions of the Normal Mean and Variance. The Annals of Mathematical Statistics 42(4), 1187–1205.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.402–404.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (1996c). Soil Screening Guidance: Technical Background Document. EPA/540/R-95/128, PB96963502. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., May, 1996.
USEPA. (2002d). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# The guidance document "Calculating Upper Confidence Limits for # Exposure Point Concentrations at Hazardous Waste Sites" # (USEPA, 2002d, Exhibit 9, p. 16) contains an example of 60 observations # from an exposure unit. Here we will use Chen's modified t-test to test # the null hypothesis that the average concentration is less than 30 mg/L # versus the alternative that it is greater than 30 mg/L. # In EnvStats these data are stored in the vector EPA.02d.Ex.9.mg.per.L.vec. sort(EPA.02d.Ex.9.mg.per.L.vec) # [1] 16 17 17 17 18 18 20 20 20 21 21 21 21 21 21 22 #[17] 22 22 23 23 23 23 24 24 24 25 25 25 25 25 25 26 #[33] 26 26 26 27 27 28 28 28 28 29 29 30 30 31 32 32 #[49] 32 33 33 35 35 97 98 105 107 111 117 119 dev.new() hist(EPA.02d.Ex.9.mg.per.L.vec, col = "cyan", xlab = "Concentration (mg/L)") # The Shapiro-Wilk goodness-of-fit test rejects the null hypothesis of a # normal, lognormal, and gamma distribution: gofTest(EPA.02d.Ex.9.mg.per.L.vec)$p.value #[1] 2.496781e-12 gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "lnorm")$p.value #[1] 3.349035e-09 gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "gamma")$p.value #[1] 1.564341e-10 # Use Chen's modified t-test to test the null hypothesis that # the average concentration is less than 30 mg/L versus the # alternative that it is greater than 30 mg/L. chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mean = 30 # #Alternative Hypothesis: True mean is greater than 30 # #Test Name: One-sample t-Test # Modified for # Positively-Skewed Distributions # (Chen, 1995) # #Estimated Parameter(s): mean = 34.566667 # sd = 27.330598 # skew = 2.365778 # #Data: EPA.02d.Ex.9.mg.per.L.vec # #Sample Size: 60 # #Test Statistic: t = 1.574075 # #Test Statistic Parameter: df = 59 # #P-values: z = 0.05773508 # t = 0.06040889 # Avg. of z and t = 0.05907199 # #Confidence Interval for: mean # #Confidence Interval Method: Based on z # #Confidence Interval Type: Lower # #Confidence Level: 95% # #Confidence Interval: LCL = 29.82 # UCL = Inf # The estimated mean, standard deviation, and skew are 35, 27, and 2.4, # respectively. The p-value is 0.06, and the lower 95% confidence interval # is [29.8, Inf). Depending on what you use for your Type I error rate, you # may or may not want to reject the null hypothesis.
# The guidance document "Calculating Upper Confidence Limits for # Exposure Point Concentrations at Hazardous Waste Sites" # (USEPA, 2002d, Exhibit 9, p. 16) contains an example of 60 observations # from an exposure unit. Here we will use Chen's modified t-test to test # the null hypothesis that the average concentration is less than 30 mg/L # versus the alternative that it is greater than 30 mg/L. # In EnvStats these data are stored in the vector EPA.02d.Ex.9.mg.per.L.vec. sort(EPA.02d.Ex.9.mg.per.L.vec) # [1] 16 17 17 17 18 18 20 20 20 21 21 21 21 21 21 22 #[17] 22 22 23 23 23 23 24 24 24 25 25 25 25 25 25 26 #[33] 26 26 26 27 27 28 28 28 28 29 29 30 30 31 32 32 #[49] 32 33 33 35 35 97 98 105 107 111 117 119 dev.new() hist(EPA.02d.Ex.9.mg.per.L.vec, col = "cyan", xlab = "Concentration (mg/L)") # The Shapiro-Wilk goodness-of-fit test rejects the null hypothesis of a # normal, lognormal, and gamma distribution: gofTest(EPA.02d.Ex.9.mg.per.L.vec)$p.value #[1] 2.496781e-12 gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "lnorm")$p.value #[1] 3.349035e-09 gofTest(EPA.02d.Ex.9.mg.per.L.vec, dist = "gamma")$p.value #[1] 1.564341e-10 # Use Chen's modified t-test to test the null hypothesis that # the average concentration is less than 30 mg/L versus the # alternative that it is greater than 30 mg/L. chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mean = 30 # #Alternative Hypothesis: True mean is greater than 30 # #Test Name: One-sample t-Test # Modified for # Positively-Skewed Distributions # (Chen, 1995) # #Estimated Parameter(s): mean = 34.566667 # sd = 27.330598 # skew = 2.365778 # #Data: EPA.02d.Ex.9.mg.per.L.vec # #Sample Size: 60 # #Test Statistic: t = 1.574075 # #Test Statistic Parameter: df = 59 # #P-values: z = 0.05773508 # t = 0.06040889 # Avg. of z and t = 0.05907199 # #Confidence Interval for: mean # #Confidence Interval Method: Based on z # #Confidence Interval Type: Lower # #Confidence Level: 95% # #Confidence Interval: LCL = 29.82 # UCL = Inf # The estimated mean, standard deviation, and skew are 35, 27, and 2.4, # respectively. The p-value is 0.06, and the lower 95% confidence interval # is [29.8, Inf). Depending on what you use for your Type I error rate, you # may or may not want to reject the null hypothesis.
Density, distribution function, quantile function, and random generation for the chi distribution.
dchi(x, df) pchi(q, df) qchi(p, df) rchi(n, df)
dchi(x, df) pchi(q, df) qchi(p, df) rchi(n, df)
x |
vector of (positive) quantiles. |
q |
vector of (positive) quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
df |
vector of (positive) degrees of freedom (> 0). Non-integer values are allowed. |
Elements of x
, q
, p
, or df
that are missing will
cause the corresponding elements of the result to be missing.
The chi distribution with nu degrees of freedom is the distribution of the
positive square root of a random variable having a
chi-squared distribution with nu degrees of freedom.

The chi density function is given by:

f(x; nu) = 2 * x * g(x^2; nu),  x > 0

where g(y; nu) denotes the density function of a chi-square random variable
with nu degrees of freedom, evaluated at y.
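The relationship above can be checked numerically; the following sketch (not part of the original help file) assumes the dchi and pchi functions documented on this page:

# The chi cdf equals the chi-squared cdf evaluated at x^2, and
# dchi(x, df) equals 2 * x * dchisq(x^2, df)
x <- 1.7; df <- 4
all.equal(pchi(x, df), pchisq(x^2, df))
all.equal(dchi(x, df), 2 * x * dchisq(x^2, df))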
density (dchi
), probability (pchi
), quantile (qchi
), or
random sample (rchi
) for the chi distribution with df
degrees of freedom.
The chi distribution takes on positive real values. It is important because,
for a sample of n observations from a normal distribution,
the sample standard deviation multiplied by the square root of the degrees of
freedom (n-1) and divided by the true standard deviation follows a chi
distribution with n-1 degrees of freedom. The chi distribution is also
used in computing exact prediction intervals for the next k observations
from a normal distribution (see predIntNorm).
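A small simulation sketch of the first statement (not part of the original help file), assuming the qchi function documented above:

# For samples of size n from a normal distribution, s*sqrt(n-1)/sigma
# behaves like a chi random variable with n-1 degrees of freedom
set.seed(1)
n <- 10; sigma <- 2
stat <- replicate(5000, sd(rnorm(n, sd = sigma)) * sqrt(n - 1) / sigma)
quantile(stat, 0.9)     # compare the empirical 90th percentile ...
qchi(0.9, df = n - 1)   # ... with the corresponding chi quantile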
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Chisquare, Normal, predIntNorm
,
Probability Distributions and Random Numbers.
# Density of a chi distribution with 4 degrees of freedom, evaluated at 3: dchi(3, 4) #[1] 0.1499715 #---------- # The 95'th percentile of a chi distribution with 10 degrees of freedom: qchi(.95, 10) #[1] 4.278672 #---------- # The cumulative distribution function of a chi distribution with # 5 degrees of freedom evaluated at 3: pchi(3, 5) #[1] 0.8909358 #---------- # A random sample of 2 numbers from a chi distribution with 7 degrees of freedom. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) rchi(2, 7) #[1] 3.271632 2.035179
Compute the half-width of a confidence interval for a binomial proportion or the difference between two proportions, given the sample size(s), estimated proportion(s), and confidence level.
ciBinomHalfWidth(n.or.n1, p.hat.or.p1.hat = 0.5, n2 = n.or.n1, p2.hat = 0.4, conf.level = 0.95, sample.type = "one.sample", ci.method = "score", correct = TRUE, warn = TRUE)
n.or.n1 |
numeric vector of sample sizes. |
p.hat.or.p1.hat |
numeric vector of estimated proportions. |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of |
p2.hat |
numeric vector of estimated proportions for group 2.
This argument is ignored when |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level associated with
the confidence interval(s). The default value is |
sample.type |
character string indicating whether this is a one-sample or two-sample confidence interval.
When |
ci.method |
character string indicating which method to use to construct the confidence interval.
Possible values are |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning when |
If the arguments n.or.n1
, p.hat.or.p1.hat
, n2
, p2.hat
, and
conf.level
are not all the same length, they are replicated to be the same length as
the length of the longest argument.
The values of p.hat.or.p1.hat
and p2.hat
are automatically adjusted
to the closest legitimate values, given the user-supplied values of n.or.n1
and
n2
. For example, if n.or.n1=5
, legitimate values for
p.hat.or.p1.hat
are 0, 0.2, 0.4, 0.6, 0.8 and 1. In this case, if the
user supplies p.hat.or.p1.hat=0.45
, then p.hat.or.p1.hat
is reset to p.hat.or.p1.hat=0.4
, and if the user supplies p.hat.or.p1.hat=0.55
,
then p.hat.or.p1.hat
is reset to p.hat.or.p1.hat=0.6
. In cases where
the two closest legitimate values are equal distance from the user-suppled value of
p.hat.or.p1.hat
or p2.hat
, the value closest to 0.5 is chosen since
that will tend to yield the wider confidence interval.
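A quick sketch of this rounding rule (not part of the original help file); the p.hat component is described in the Value section below:

# With n = 5 the attainable proportions are multiples of 0.2, so an
# estimated proportion of 0.45 is moved to 0.4 before the half-width
# is computed
ciBinomHalfWidth(n.or.n1 = 5, p.hat.or.p1.hat = 0.45)$p.hat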
One-Sample Case (sample.type="one.sample"
).
ci.method="score"
The confidence interval for p based on the
score method was developed by Wilson (1927) and is discussed by Newcombe (1998a),
Agresti and Coull (1998), and Agresti and Caffo (2000). When
ci=TRUE
and
ci.method="score"
, the function ebinom
calls the R function
prop.test
to compute the confidence interval. This method
has been shown to provide the best performance (in terms of actual coverage matching
assumed coverage) of all the methods provided here, although unlike the exact
method, the actual coverage can fall below the assumed coverage.
ci.method="exact"
The confidence interval for p based on the
exact (Clopper-Pearson) method is discussed by Newcombe (1998a), Agresti and Coull (1998),
and Zar (2010, pp.543-547). This is the method used in the R function
binom.test
. This method ensures the actual coverage is greater than
or equal to the assumed coverage.
ci.method="Wald"
The confidence interval for p based on the
Wald method (with or without a correction for continuity) is the usual
“normal approximation” method and is discussed by Newcombe (1998a),
Agresti and Coull (1998), Agresti and Caffo (2000), and Zar (2010, pp.543-547).
This method is never recommended but is included for historical purposes.
ci.method="adjusted Wald"
The confidence interval for p based on the
adjusted Wald method is discussed by Agresti and Coull (1998), Agresti and Caffo (2000), and
Zar (2010, pp.543-547). This is a simple modification of the Wald method and
performs surprisingly well.
Two-Sample Case (sample.type="two.sample"
).
ci.method="score"
This method is presented in Newcombe (1998b) and
is based on the score method developed by Wilson (1927) for the one-sample case.
This is the method used by the R function prop.test
. In a comparison of
11 methods, Newcombe (1998b) showed this method performs remarkably well.
ci.method="Wald"
The confidence interval for the difference between two proportions based on the Wald method (with or without a correction for continuity) is the usual “normal approximation” method and is discussed by Newcombe (1998b), Agresti and Caffo (2000), and Zar (2010, pp.549-552). This method is not recommended but is included for historical purposes.
ci.method="adjusted Wald"
This method is discussed by Agresti and Caffo (2000), and Zar (2010, pp.549-552). This is a simple modification of the Wald method and performs surprisingly well.
a list with information about the half-widths, sample sizes, and estimated proportions.
One-Sample Case (sample.type="one.sample"
).
When sample.type="one.sample"
, the function ciBinomHalfWidth
returns a list with these components:
half.width |
the half-width(s) of the confidence interval(s) |
n |
the sample size(s) associated with the confidence interval(s) |
p.hat |
the estimated proportion(s) |
method |
the method used to construct the confidence interval(s) |
Two-Sample Case (sample.type="two.sample"
).
When sample.type="two.sample"
, the function ciBinomHalfWidth
returns a list with these components:
half.width |
the half-width(s) of the confidence interval(s) |
n1 |
the sample size(s) for group 1 associated with the confidence interval(s) |
p1.hat |
the estimated proportion(s) for group 1 |
n2 |
the sample size(s) for group 2 associated with the confidence interval(s) |
p2.hat |
the estimated proportion(s) for group 2 |
method |
the method used to construct the confidence interval(s) |
The binomial distribution is used to model processes with binary
(Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any
one trial is independent of any other trial, and that the probability of “success”, p,
is the same on each trial. A binomial discrete random variable X is the number of
“successes” in n independent trials. A special case of the binomial distribution
occurs when n=1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model
the proportion of times a chemical concentration exceeds a set standard in a given period of time
(e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a
background well (e.g., USEPA, 1989b, Chapter 8, p.3-7). (However, USEPA 2009, p.8-27
recommends using the Wilcoxon rank sum test (wilcox.test
) instead of
comparing proportions.)
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives of
the sampling program is to produce confidence intervals. The functions ciBinomHalfWidth
,
ciBinomN
, and plotCiBinomDesign
can be used to investigate these
relationships for the case of binomial proportions.
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Newcombe, R.G. (1998b). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
ciBinomN
, plotCiBinomDesign
,
ebinom
, binom.test
, prop.test
.
# Look at how the half-width of a one-sample confidence interval # decreases with sample size: ciBinomHalfWidth(n.or.n1 = c(10, 50, 100, 500)) #$half.width #[1] 0.26340691 0.13355486 0.09616847 0.04365873 # #$n #[1] 10 50 100 500 # #$p.hat #[1] 0.5 0.5 0.5 0.5 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Look at how the half-width of a one-sample confidence interval # tends to decrease as the estimated value of p decreases below # 0.5 or increases above 0.5: seq(0.2, 0.8, by = 0.1) #[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 ciBinomHalfWidth(n.or.n1 = 30, p.hat = seq(0.2, 0.8, by = 0.1)) #$half.width #[1] 0.1536299 0.1707256 0.1801322 0.1684587 0.1801322 0.1707256 #[7] 0.1536299 # #$n #[1] 30 30 30 30 30 30 30 # #$p.hat #[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Look at how the half-width of a one-sample confidence interval # increases with increasing confidence level: ciBinomHalfWidth(n.or.n1 = 20, conf.level = c(0.8, 0.9, 0.95, 0.99)) #$half.width #[1] 0.1377380 0.1725962 0.2007020 0.2495523 # #$n #[1] 20 20 20 20 # #$p.hat #[1] 0.5 0.5 0.5 0.5 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Compare the half-widths for a one-sample # confidence interval based on the different methods: ciBinomHalfWidth(n.or.n1 = 30, ci.method = "score")$half.width #[1] 0.1684587 ciBinomHalfWidth(n.or.n1 = 30, ci.method = "exact")$half.width #[1] 0.1870297 ciBinomHalfWidth(n.or.n1 = 30, ci.method = "adjusted Wald")$half.width #[1] 0.1684587 ciBinomHalfWidth(n.or.n1 = 30, ci.method = "Wald")$half.width #[1] 0.1955861 #---------------------------------------------------------------- # Look at how the half-width of a two-sample # confidence interval decreases with increasing # sample sizes: ciBinomHalfWidth(n.or.n1 = c(10, 50, 100, 500), sample.type = "two") #$half.width #[1] 0.53385652 0.21402654 0.14719748 0.06335658 # #$n1 #[1] 10 50 100 500 # #$p1.hat #[1] 0.5 0.5 0.5 0.5 # #$n2 #[1] 10 50 100 500 # #$p2.hat #[1] 0.4 0.4 0.4 0.4 # #$method #[1] "Score normal approximation, with continuity correction"
Compute the sample size necessary to achieve a specified half-width of a confidence interval for a binomial proportion or the difference between two proportions, given the estimated proportion(s), and confidence level.
ciBinomN(half.width, p.hat.or.p1.hat = 0.5, p2.hat = 0.4, conf.level = 0.95, sample.type = "one.sample", ratio = 1, ci.method = "score", correct = TRUE, warn = TRUE, n.or.n1.min = 2, n.or.n1.max = 10000, tol.half.width = 5e-04, tol.p.hat = 5e-04, tol = 1e-7, maxiter = 1000)
half.width |
numeric vector of (positive) half-widths.
Missing ( |
p.hat.or.p1.hat |
numeric vector of estimated proportions. |
p2.hat |
numeric vector of estimated proportions for group 2.
This argument is ignored when |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level associated with
the confidence interval(s). The default value is |
sample.type |
character string indicating whether this is a one-sample or two-sample confidence interval. |
ratio |
numeric vector indicating the ratio of sample size in group 2 to
sample size in group 1 ( |
ci.method |
character string indicating which method to use to construct the confidence interval. Possible values are:
The exact method is only available for the one-sample case, i.e., when |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning when |
n.or.n1.min |
integer indicating the minimum allowed value for |
n.or.n1.max |
integer indicating the maximum allowed value for |
tol.half.width |
numeric scalar indicating the tolerance to use for the half width for
the search algorithm. The sample sizes are computed so that the actual
half width is less than or equal to |
tol.p.hat |
numeric scalar indicating the tolerance to use for the estimated
proportion(s) for the search algorithm.
For the one-sample case, the sample sizes are computed so that
the absolute value of the difference between the user supplied
value of |
tol |
positive scalar indicating the tolerance to use for the search algorithm
(passed to |
maxiter |
integer indicating the maximum number of iterations to use for
the search algorithm (passed to |
If the arguments half.width
, p.hat.or.p1.hat
, p2.hat
,
conf.level
and ratio
are not all the same length, they are
replicated to be the same length as the length of the longest argument.
For the one-sample case, the arguments p.hat.or.p1.hat
, tol.p.hat
,
half.width
, and tol.half.width
must satisfy: (p.hat.or.p1.hat + tol.p.hat + half.width + tol.half.width) <= 1
,
and (p.hat.or.p1.hat - tol.p.hat - half.width - tol.half.width) >= 0
.
For the two-sample case, the arguments p.hat.or.p1.hat
, p2.hat
,
tol.p.hat
, half.width
, and tol.half.width
must satisfy: ((p.hat.or.p1.hat + tol.p.hat) - (p2.hat - tol.p.hat) + half.width + tol.half.width) <= 1
, and ((p.hat.or.p1.hat - tol.p.hat) - (p2.hat + tol.p.hat) - half.width - tol.half.width) >= -1
.
The function ciBinomN
uses the search algorithm in the
function uniroot
to call the function
ciBinomHalfWidth
to find the value of n (sample.type="one.sample") or the values of
n1 and n2 (sample.type="two.sample")
that satisfy the requirements for the half-width,
estimated proportions, and confidence level. See the Details section of the help file for
ciBinomHalfWidth
for more information.
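As a minimal round-trip sketch (not part of the original examples), the sample size returned by ciBinomN should, when passed back to ciBinomHalfWidth, give a half-width within tol.half.width of the requested value:

n <- ciBinomN(half.width = 0.1)$n
n
#[1] 92
ciBinomHalfWidth(n.or.n1 = n)$half.width
# about 0.1001, i.e., within tol.half.width = 5e-04 of the requested 0.1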
a list with information about the sample sizes, estimated proportions, and half-widths.
One-Sample Case (sample.type="one.sample"
).
When sample.type="one.sample"
, the function ciBinomN
returns a list with these components:
n |
the sample size(s) associated with the confidence interval(s) |
p.hat |
the estimated proportion(s) |
half.width |
the half-width(s) of the confidence interval(s) |
method |
the method used to construct the confidence interval(s) |
Two-Sample Case (sample.type="two.sample"
).
When sample.type="two.sample"
, the function ciBinomN
returns a list with these components:
n1 |
the sample size(s) for group 1 associated with the confidence interval(s) |
n2 |
the sample size(s) for group 2 associated with the confidence interval(s) |
p1.hat |
the estimated proportion(s) for group 1 |
p2.hat |
the estimated proportion(s) for group 2 |
half.width |
the half-width(s) of the confidence interval(s) |
method |
the method used to construct the confidence interval(s) |
The binomial distribution is used to model processes with binary
(Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any
one trial is independent of any other trial, and that the probability of “success”, p,
is the same on each trial. A binomial discrete random variable X is the number of
“successes” in n independent trials. A special case of the binomial distribution
occurs when n = 1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model
the proportion of times a chemical concentration exceeds a set standard in a given period of time
(e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a
background well (e.g., USEPA, 1989b, Chapter 8, p.3-7). (However, USEPA 2009, p.8-27
recommends using the Wilcoxon rank sum test (wilcox.test
) instead of
comparing proportions.)
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives of
the sampling program is to produce confidence intervals. The functions ciBinomHalfWidth
,
ciBinomN
, and plotCiBinomDesign
can be used to investigate these
relationships for the case of binomial proportions.
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Newcombe, R.G. (1998b). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
ciBinomHalfWidth
, uniroot
,
plotCiBinomDesign
, ebinom
, binom.test
, prop.test
.
# Look at how the required sample size of a one-sample # confidence interval increases with decreasing # required half-width: ciBinomN(half.width = c(0.1, 0.05, 0.03)) #$n #[1] 92 374 1030 # #$p.hat #[1] 0.5 0.5 0.5 # #$half.width #[1] 0.10010168 0.05041541 0.03047833 # #$method #[1] "Score normal approximation, with continuity correction" #---------- # Note that the required sample size decreases if we are less # stringent about how much the confidence interval width can # deviate from the supplied value of the 'half.width' argument: ciBinomN(half.width = c(0.1, 0.05, 0.03), tol.half.width = 0.005) #$n #[1] 84 314 782 # #$p.hat #[1] 0.5 0.5 0.5 # #$half.width #[1] 0.10456066 0.05496837 0.03495833 # #$method #[1] "Score normal approximation, with continuity correction" #-------------------------------------------------------------------- # Look at how the required sample size for a one-sample # confidence interval tends to decrease as the estimated # value of p decreases below 0.5 or increases above 0.5: seq(0.2, 0.8, by = 0.1) #[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 ciBinomN(half.width = 0.1, p.hat = seq(0.2, 0.8, by = 0.1)) #$n #[1] 70 90 100 92 100 90 70 # #$p.hat #[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8 # #$half.width #[1] 0.09931015 0.09839843 0.09910818 0.10010168 0.09910818 0.09839843 #[7] 0.09931015 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Look at how the required sample size for a one-sample # confidence interval increases with increasing confidence level: ciBinomN(half.width = 0.05, conf.level = c(0.8, 0.9, 0.95, 0.99)) #$n #[1] 160 264 374 644 # #$p.hat #[1] 0.5 0.5 0.5 0.5 # #$half.width #[1] 0.05039976 0.05035948 0.05041541 0.05049152 # #$method #[1] "Score normal approximation, with continuity correction" #---------------------------------------------------------------- # Compare required sample size for a one-sample # confidence interval based on the different methods: ciBinomN(half.width = 0.05, ci.method = "score") #$n #[1] 374 # #$p.hat #[1] 0.5 # #$half.width #[1] 0.05041541 # #$method #[1] "Score normal approximation, with continuity correction" ciBinomN(half.width = 0.05, ci.method = "exact") #$n #[1] 394 # #$p.hat #[1] 0.5 # #$half.width #[1] 0.05047916 # #$method #[1] "Exact" ciBinomN(half.width = 0.05, ci.method = "adjusted Wald") #$n #[1] 374 # #$p.hat #[1] 0.5 # #$half.width #[1] 0.05041541 # #$method #[1] "Adjusted Wald normal approximation" ciBinomN(half.width = 0.05, ci.method = "Wald") #$n #[1] 398 # #$p.hat #[1] 0.5 # #$half.width #[1] 0.05037834 # #$method #[1] "Wald normal approximation, with continuity correction" #---------------------------------------------------------------- ## Not run: # Look at how the required sample size of a two-sample # confidence interval increases with decreasing # required half-width: ciBinomN(half.width = c(0.1, 0.05, 0.03), sample.type = "two") #$n1 #[1] 210 778 2089 # #$n2 #[1] 210 778 2089 # #$p1.hat #[1] 0.5000000 0.5000000 0.4997607 # #$p2.hat #[1] 0.4000000 0.3997429 0.4001915 # #$half.width #[1] 0.09943716 0.05047044 0.03049753 # #$method #[1] "Score normal approximation, with continuity correction" ## End(Not run)
Compute the half-width of a confidence interval for the mean of a normal distribution or the difference between two means, given the sample size(s), estimated standard deviation, and confidence level.
ciNormHalfWidth(n.or.n1, n2 = n.or.n1, sigma.hat = 1, conf.level = 0.95, sample.type = ifelse(missing(n2), "one.sample", "two.sample"))
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s). |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level
associated with the confidence interval(s). The default value is |
sample.type |
character string indicating whether this is a one-sample |
If the arguments n.or.n1
, n2
, sigma.hat
, and
conf.level
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
One-Sample Case (sample.type="one.sample"
)
Let x_1, x_2, …, x_n denote a vector of n observations from a normal distribution with mean μ and standard deviation σ. A two-sided (1-α)100% confidence interval for μ is given by:

$$\left[ \hat{\mu} - t(n-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n}}, \;\; \hat{\mu} + t(n-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n}} \right]$$

where

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

and t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom
(Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992). Thus, the
half-width of this confidence interval is given by:

$$HW = t(n-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n}}$$
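As a minimal sketch (assumed inputs, not part of the original examples), the one-sample half-width can be computed directly from the formula above and compared with the output of ciNormHalfWidth:

n <- 20; sigma.hat <- 1; conf.level <- 0.95
qt(1 - (1 - conf.level)/2, df = n - 1) * sigma.hat / sqrt(n)
# about 0.468, which rounds to the 0.47 shown for n = 20 in the
# first example below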
Two-Sample Case (sample.type="two.sample"
)
Let x_11, x_12, …, x_1n1 denote a vector of n1 observations from a normal distribution with mean μ1 and standard deviation σ, and let x_21, x_22, …, x_2n2 denote a vector of n2 observations from a normal distribution with mean μ2 and standard deviation σ. A two-sided (1-α)100% confidence interval for μ1 - μ2 is given by:

$$\left[ (\hat{\mu}_1 - \hat{\mu}_2) - t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \;\; (\hat{\mu}_1 - \hat{\mu}_2) + t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right]$$

where

$$\hat{\mu}_1 = \bar{x}_1, \qquad \hat{\mu}_2 = \bar{x}_2, \qquad \hat{\sigma}^2 = s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$$

(Zar, 2010, p.142; Helsel and Hirsch, 1992, p.135; Berthouex and Brown, 2002, pp.157–158). Thus, the half-width of this confidence interval is given by:

$$HW = t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$
Note that for the two-sample case, the function ciNormHalfWidth
assumes the
two populations have the same standard deviation.
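Similarly, a minimal sketch (assumed inputs) of the two-sample half-width computed directly from the pooled-variance formula above, for comparison with ciNormHalfWidth(n.or.n1 = 10, n2 = 10, sigma.hat = 2):

n1 <- 10; n2 <- 10; sigma.hat <- 2; conf.level <- 0.95
qt(1 - (1 - conf.level)/2, df = n1 + n2 - 2) * sigma.hat * sqrt(1/n1 + 1/n2)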
a numeric vector of half-widths.
The normal distribution and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with confidence intervals.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives
of the sampling program is to produce confidence intervals. The functions
ciNormHalfWidth
, ciNormN
, and plotCiNormDesign
can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-3.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapters 7 and 8.
ciNormN
, plotCiNormDesign
, Normal
,
enorm
, t.test
Estimating Distribution Parameters.
# Look at how the half-width of a one-sample confidence interval # decreases with increasing sample size: seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 hw <- ciNormHalfWidth(n.or.n1 = seq(5, 30, by = 5)) round(hw, 2) #[1] 1.24 0.72 0.55 0.47 0.41 0.37 #---------------------------------------------------------------- # Look at how the half-width of a one-sample confidence interval # increases with increasing estimated standard deviation: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 hw <- ciNormHalfWidth(n.or.n1 = 20, sigma.hat = seq(0.5, 2, by = 0.5)) round(hw, 2) #[1] 0.23 0.47 0.70 0.94 #---------------------------------------------------------------- # Look at how the half-width of a one-sample confidence interval # increases with increasing confidence level: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 hw <- ciNormHalfWidth(n.or.n1 = 20, conf.level = seq(0.5, 0.9, by = 0.1)) round(hw, 2) #[1] 0.15 0.19 0.24 0.30 0.39 #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), # determine how adding another four months of observations to # increase the sample size from 4 to 8 will affect the half-width # of a two-sided 95% confidence interval for the Aldicarb level at # the first compliance well. # # Use the estimated standard deviation from the first four months # of data. (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) # Note that the half-width changes from 34% of the observed mean to # 18% of the observed mean by increasing the sample size from # 4 to 8. EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #... mu.hat <- with(EPA.09.Ex.21.1.aldicarb.df, mean(Aldicarb.ppb[Well=="Well.1"])) mu.hat #[1] 23.1 sigma.hat <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well=="Well.1"])) sigma.hat #[1] 4.93491 hw.4 <- ciNormHalfWidth(n.or.n1 = 4, sigma.hat = sigma.hat) hw.4 #[1] 7.852543 hw.8 <- ciNormHalfWidth(n.or.n1 = 8, sigma.hat = sigma.hat) hw.8 #[1] 4.125688 100 * hw.4/mu.hat #[1] 33.99369 100 * hw.8/mu.hat #[1] 17.86012 #========== # Clean up #--------- rm(hw, mu.hat, sigma.hat, hw.4, hw.8)
Compute the sample size necessary to achieve a specified half-width of a confidence interval for the mean of a normal distribution or the difference between two means, given the estimated standard deviation and confidence level.
ciNormN(half.width, sigma.hat = 1, conf.level = 0.95, sample.type = ifelse(is.null(n2), "one.sample", "two.sample"), n2 = NULL, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
half.width |
numeric vector of (positive) half-widths.
Missing ( |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s). |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level
associated with the confidence interval(s). The default value is |
sample.type |
character string indicating whether this is a one-sample |
n2 |
numeric vector of sample sizes for group 2. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed sample size(s)
to the next smallest integer. The default value is |
n.max |
positive integer greater than 1 specifying the maximum sample size for the single
group when |
tol |
numeric scalar indicating the tolerance to use in the |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
If the arguments half.width
, n2
, sigma.hat
, and
conf.level
are not all the same length, they are replicated to be the same length as
the length of the longest argument.
The function ciNormN
uses the formulas given in the help file for
ciNormHalfWidth
for the half-width of the confidence interval
to iteratively solve for the sample size. For the two-sample case, the default
is to assume equal sample sizes for each group unless the argument n2
is supplied.
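As a minimal round-trip sketch (not part of the original examples), the sample size returned by ciNormN should yield a half-width at or just below the requested value:

n <- ciNormN(half.width = 0.5, sigma.hat = 1)
n
#[1] 18
ciNormHalfWidth(n.or.n1 = n, sigma.hat = 1)
# about 0.497, i.e., no larger than the requested half-width of 0.5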
When sample.type="one.sample"
, or sample.type="two.sample"
and n2
is not supplied (so equal sample sizes for each group is assumed),
the function ciNormN
returns a numeric vector of sample sizes.
When sample.type="two.sample"
and n2
is supplied,
the function ciNormN
returns a list with two components called n1
and n2
,
specifying the sample sizes for each group.
The normal distribution and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with confidence intervals.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives
of the sampling program is to produce confidence intervals. The functions
ciNormHalfWidth
, ciNormN
, and plotCiNormDesign
can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-3.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapters 7 and 8.
ciNormHalfWidth
, plotCiNormDesign
, Normal
,
enorm
, t.test
,
Estimating Distribution Parameters.
# Look at how the required sample size for a one-sample # confidence interval decreases with increasing half-width: seq(0.25, 1, by = 0.25) #[1] 0.25 0.50 0.75 1.00 ciNormN(half.width = seq(0.25, 1, by = 0.25)) #[1] 64 18 10 7 ciNormN(seq(0.25, 1, by=0.25), round = FALSE) #[1] 63.897899 17.832337 9.325967 6.352717 #---------------------------------------------------------------- # Look at how the required sample size for a one-sample # confidence interval increases with increasing estimated # standard deviation for a fixed half-width: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 ciNormN(half.width = 0.5, sigma.hat = seq(0.5, 2, by = 0.5)) #[1] 7 18 38 64 #---------------------------------------------------------------- # Look at how the required sample size for a one-sample # confidence interval increases with increasing confidence # level for a fixed half-width: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 ciNormN(half.width = 0.25, conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 9 13 19 28 46 #---------------------------------------------------------------- # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), # determine the required sample size in order to achieve a # half-width that is 10% of the observed mean (based on the first # four months of observations) for the Aldicarb level at the first # compliance well. Assume a 95% confidence level and use the # estimated standard deviation from the first four months of data. # (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) # # The required sample size is 20, so almost two years of data are # required assuming observations are taken once per month. EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #... mu.hat <- with(EPA.09.Ex.21.1.aldicarb.df, mean(Aldicarb.ppb[Well=="Well.1"])) mu.hat #[1] 23.1 sigma.hat <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well=="Well.1"])) sigma.hat #[1] 4.93491 ciNormN(half.width = 0.1 * mu.hat, sigma.hat = sigma.hat) #[1] 20 #---------- # Clean up rm(mu.hat, sigma.hat)
Compute the confidence level associated with a nonparametric confidence interval for a quantile, given the sample size and order statistics associated with the lower and upper bounds.
ciNparConfLevel(n, p = 0.5, lcl.rank = ifelse(ci.type == "upper", 0, 1), n.plus.one.minus.ucl.rank = ifelse(ci.type == "lower", 0, 1), ci.type = "two.sided")
n |
numeric vector of sample sizes.
Missing ( |
p |
numeric vector of probabilities specifying which quantiles to consider for
the sample size calculation. All values of |
lcl.rank , n.plus.one.minus.ucl.rank
|
numeric vectors of non-negative integers indicating the ranks of the
order statistics that are used for the lower and upper bounds of the
confidence interval for the specified quantile(s). When |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
If the arguments n
, p
, lcl.rank
, and
n.plus.one.minus.ucl.rank
are not all the same length, they are
replicated to be the
same length as the length of the longest argument.
The help file for eqnpar
explains how nonparametric confidence
intervals for quantiles are constructed and how the confidence level
associated with the confidence interval is computed based on specified values
for the sample size and the ranks of the order statistics used for
the bounds of the confidence interval.
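As a minimal sketch of the underlying relationship for the lower one-sided case (the general formulas are given in the help file for eqnpar), the confidence level is the probability that at least lcl.rank of the n observations fall at or below the p'th quantile:

n <- 12; p <- 0.95; lcl.rank <- 10
1 - pbinom(lcl.rank - 1, n, p)
# about 0.98043, matching the 98.04317% confidence level in the
# nitrate example below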
A numeric vector of confidence levels.
See the help file for eqnpar
.
Steven P. Millard ([email protected])
See the help file for eqnpar
.
eqnpar
, ciNparN
,
plotCiNparDesign
.
# Look at how the confidence level of a nonparametric confidence interval # increases with increasing sample size for a fixed quantile: seq(5, 25, by = 5) #[1] 5 10 15 20 25 round(ciNparConfLevel(n = seq(5, 25, by = 5), p = 0.9), 2) #[1] 0.41 0.65 0.79 0.88 0.93 #--------- # Look at how the confidence level of a nonparametric confidence interval # decreases as the quantile moves away from 0.5: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(ciNparConfLevel(n = 10, p = seq(0.5, 0.9, by = 0.1)), 2) #[1] 1.00 0.99 0.97 0.89 0.65 #========== # Reproduce Example 21-6 on pages 21-21 to 21-22 of USEPA (2009). # Use 12 measurements of nitrate (mg/L) at a well used for drinking water # to determine with 95% confidence whether or not the infant-based, acute # risk standard of 10 mg/L has been violated. Assume that the risk # standard represents an upper 95'th percentile limit on nitrate # concentrations. So what we need to do is construct a one-sided # lower nonparametric confidence interval for the 95'th percentile # that has associated confidence level of no more than 95%, and we will # compare the lower confidence limit with the MCL of 10 mg/L. # # The data for this example are stored in EPA.09.Ex.21.6.nitrate.df. # Look at the data: #------------------ EPA.09.Ex.21.6.nitrate.df # Sampling.Date Date Nitrate.mg.per.l.orig Nitrate.mg.per.l Censored #1 7/28/1999 1999-07-28 <5.0 5.0 TRUE #2 9/3/1999 1999-09-03 12.3 12.3 FALSE #3 11/24/1999 1999-11-24 <5.0 5.0 TRUE #4 5/3/2000 2000-05-03 <5.0 5.0 TRUE #5 7/14/2000 2000-07-14 8.1 8.1 FALSE #6 10/31/2000 2000-10-31 <5.0 5.0 TRUE #7 12/14/2000 2000-12-14 11 11.0 FALSE #8 3/27/2001 2001-03-27 35.1 35.1 FALSE #9 6/13/2001 2001-06-13 <5.0 5.0 TRUE #10 9/16/2001 2001-09-16 <5.0 5.0 TRUE #11 11/26/2001 2001-11-26 9.3 9.3 FALSE #12 3/2/2002 2002-03-02 10.3 10.3 FALSE # Determine what order statistic to use for the lower confidence limit # in order to achieve no more than 95% confidence. #--------------------------------------------------------------------- conf.levels <- ciNparConfLevel(n = 12, p = 0.95, lcl.rank = 1:12, ci.type = "lower") names(conf.levels) <- 1:12 round(conf.levels, 2) # 1 2 3 4 5 6 7 8 9 10 11 12 #1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.88 0.54 # Using the 11'th largest observation for the lower confidence limit # yields a confidence level of 88%. Using the 10'th largest # observation yields a confidence level of 98%. The example in # USEPA (2009) uses the 10'th largest observation. # # The 10'th largest observation is 11 mg/L which exceeds the # MCL of 10 mg/L, so there is evidence of contamination. #-------------------------------------------------------------------- with(EPA.09.Ex.21.6.nitrate.df, eqnpar(Nitrate.mg.per.l, p = 0.95, ci = TRUE, ci.type = "lower", lcl.rank = 10)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 95'th %ile = 22.56 # #Quantile Estimation Method: Nonparametric # #Data: Nitrate.mg.per.l # #Sample Size: 12 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: lower # #Confidence Level: 98.04317% # #Confidence Limit Rank(s): 10 # #Confidence Interval: LCL = 11 # UCL = Inf #========== # Clean up #--------- rm(conf.levels)
Compute the sample size necessary to achieve a specified confidence level for a nonparametric confidence interval for a quantile.
ciNparN(p = 0.5, lcl.rank = ifelse(ci.type == "upper", 0, 1), n.plus.one.minus.ucl.rank = ifelse(ci.type == "lower", 0, 1), ci.type = "two.sided", conf.level = 0.95)
p |
numeric vector of probabilities specifying the quantiles.
All values of |
lcl.rank , n.plus.one.minus.ucl.rank
|
numeric vectors of non-negative integers indicating the ranks of the
order statistics that are used for the lower and upper bounds of the
confidence interval for the specified quantile(s). When |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
numeric vector of numbers between 0 and 1 indicating the confidence level
associated with the confidence interval(s). The default value is
|
If the arguments p
, lcl.rank
,
n.plus.one.minus.ucl.rank
and conf.level
are not all the
same length, they are replicated to be the
same length as the length of the longest argument.
The help file for eqnpar
explains how nonparametric confidence
intervals for quantiles are constructed and how the confidence level
associated with the confidence interval is computed based on specified values
for the sample size and the ranks of the order statistics used for
the bounds of the confidence interval.
The function ciNparN
determines the required sample size via
a nonlinear optimization.
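As a minimal sketch (assuming the default two-sided interval based on the sample minimum and maximum), the result should agree with the smallest sample size for which ciNparConfLevel meets the desired confidence level:

n <- 2:50
min(n[ciNparConfLevel(n = n, p = 0.9) >= 0.95])
#[1] 29   # same as ciNparN(p = 0.9, conf.level = 0.95)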
numeric vector of sample sizes.
See the help file for eqnpar
.
Steven P. Millard ([email protected])
See the help file for eqnpar
.
eqnpar
, ciNparConfLevel
,
plotCiNparDesign
.
# Look at how the required sample size for a confidence interval # increases with increasing confidence level for a fixed quantile: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 ciNparN(p = 0.9, conf.level=seq(0.5, 0.9, by = 0.1)) #[1] 7 9 12 16 22 #---------- # Look at how the required sample size for a confidence interval increases # as the quantile moves away from 0.5: ciNparN(p = seq(0.5, 0.9, by = 0.1)) #[1] 6 7 9 14 29
Create a table of confidence intervals for the mean of a normal distribution or the difference between two means, following Bacchetti (2010), by varying the estimated standard deviation and the estimated mean or difference between the two estimated means, given the sample size(s).
ciTableMean(n1 = 10, n2 = n1, diff.or.mean = 2:0, SD = 1:3, sample.type = "two.sample", ci.type = "two.sided", conf.level = 0.95, digits = 1)
n1 |
positive integer greater than 1 specifying the sample size when |
n2 |
positive integer greater than 1 specifying the sample size for group 2 when
|
diff.or.mean |
numeric vector indicating either the assumed difference between the two sample means
when |
SD |
numeric vector of positive values specifying the assumed estimated standard
deviation. The default value is |
sample.type |
character string specifying whether to create confidence intervals for the difference
between two means ( |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
positive integer indicating how many decimal places to display in the table. The
default value is |
Following Bacchetti (2010) (see NOTE below), the function ciTableMean
allows you to perform sensitivity analyses while planning future studies by
producing a table of confidence intervals for the mean or the difference
between two means by varying the estimated standard deviation and the
estimated mean or difference between the two estimated means given the
sample size(s).
One Sample Case (sample.type="one.sample"
)
Let x_1, x_2, …, x_n1 be a vector of n1 observations from a normal (Gaussian) distribution with parameters mean=μ and sd=σ.

The usual confidence interval for μ is constructed as follows. If ci.type="two-sided", the (1-α)100% confidence interval for μ is given by:

$$\left[ \hat{\mu} - t(n_1-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n_1}}, \;\; \hat{\mu} + t(n_1-1, 1-\alpha/2) \frac{\hat{\sigma}}{\sqrt{n_1}} \right] \quad\quad (1)$$

where

$$\hat{\mu} = \bar{x} = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i \quad\quad (2)$$

$$\hat{\sigma}^2 = s^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (x_i - \bar{x})^2 \quad\quad (3)$$

and t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom (Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992).

If ci.type="lower", the (1-α)100% confidence interval for μ is given by:

$$\left[ \hat{\mu} - t(n_1-1, 1-\alpha) \frac{\hat{\sigma}}{\sqrt{n_1}}, \;\; \infty \right] \quad\quad (4)$$

and if ci.type="upper", the confidence interval is given by:

$$\left[ -\infty, \;\; \hat{\mu} + t(n_1-1, 1-\alpha) \frac{\hat{\sigma}}{\sqrt{n_1}} \right] \quad\quad (5)$$

For the one-sample case, the argument n1 corresponds to n1 in Equation (1), the argument diff.or.mean corresponds to the estimated mean in Equation (2), and the argument SD corresponds to the estimated standard deviation in Equation (3).
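As a minimal sketch (inputs taken from the second example below), a single cell of the one-sample table can be reproduced by hand from Equation (1). For n1 = 15, an estimated mean of 5, and SD = 1:

5 + c(-1, 1) * qt(0.975, df = 15 - 1) * 1 / sqrt(15)
# about 4.45 and 5.55, displayed as [ 4.4, 5.6] in the table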
Two Sample Case (sample.type="two.sample"
)
Let x_11, x_12, …, x_1n1 be a vector of n1 observations from a normal (Gaussian) distribution with parameters mean=μ1 and sd=σ, and let x_21, x_22, …, x_2n2 be a vector of n2 observations from a normal (Gaussian) distribution with parameters mean=μ2 and sd=σ.

The usual confidence interval for the difference between the two population means μ1 - μ2 is constructed as follows. If ci.type="two-sided", the (1-α)100% confidence interval for μ1 - μ2 is given by:

$$\left[ (\hat{\mu}_1 - \hat{\mu}_2) - t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \;\; (\hat{\mu}_1 - \hat{\mu}_2) + t(n_1+n_2-2, 1-\alpha/2) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right] \quad\quad (6)$$

where

$$\hat{\mu}_1 = \bar{x}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_{1i} \quad\quad (7)$$

$$\hat{\mu}_2 = \bar{x}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} x_{2i} \quad\quad (8)$$

$$\hat{\sigma}^2 = s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \quad\quad (9)$$

and t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom (Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992).

If ci.type="lower", the (1-α)100% confidence interval for μ1 - μ2 is given by:

$$\left[ (\hat{\mu}_1 - \hat{\mu}_2) - t(n_1+n_2-2, 1-\alpha) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \;\; \infty \right] \quad\quad (10)$$

and if ci.type="upper", the confidence interval is given by:

$$\left[ -\infty, \;\; (\hat{\mu}_1 - \hat{\mu}_2) + t(n_1+n_2-2, 1-\alpha) \, \hat{\sigma} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right] \quad\quad (11)$$

For the two-sample case, the arguments n1 and n2 correspond to n1 and n2 in Equation (6), the argument diff.or.mean corresponds to the difference between the two estimated means defined in Equations (7) and (8), and the argument SD corresponds to the pooled estimate of the standard deviation in Equation (9).
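Likewise, a minimal sketch (inputs matching the default call shown in the first example below) reproduces one cell of the two-sample table from Equation (6). For n1 = n2 = 10, an estimated difference of 2, and SD = 2:

2 + c(-1, 1) * qt(0.975, df = 10 + 10 - 2) * 2 * sqrt(1/10 + 1/10)
# about 0.12 and 3.88, displayed as [ 0.1, 3.9] in the table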
a data frame with the rows varying the standard deviation and the columns varying the estimated mean or difference between the means. Elements of the data frame are character strings indicating the confidence intervals.
Bacchetti (2010) presents strong arguments against the current convention in scientific research for computing sample size that is based on formulas that use a fixed Type I error (usually 5%) and a fixed minimal power (often 80%) without regard to costs. He notes that a key input to these formulas is a measure of variability (usually a standard deviation) that is difficult to measure accurately "unless there is so much preliminary data that the study isn't really needed." Also, study designers often avoid defining what a scientifically meaningful difference is by presenting sample size results in terms of the effect size (i.e., the difference of interest divided by the elusive standard deviation). Bacchetti (2010) encourages study designers to use simple tables in a sensitivity analysis to see what results of a study may look like for low, moderate, and high rates of variability and large, intermediate, and no underlying differences in the populations or processes being studied.
Steven P. Millard ([email protected])
Bacchetti, P. (2010). Current sample size conventions: Flaws, Harms, and Alternatives. BMC Medicine 8, 17–23.
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
enorm
, t.test
, ciTableProp
,
ciNormHalfWidth
, ciNormN
,
plotCiNormDesign
.
# Show how potential confidence intervals for the difference between two means # will look assuming standard deviations of 1, 2, or 3, differences between # the two means of 2, 1, or 0, and a sample size of 10 in each group. ciTableMean() # Diff=2 Diff=1 Diff=0 #SD=1 [ 1.1, 2.9] [ 0.1, 1.9] [-0.9, 0.9] #SD=2 [ 0.1, 3.9] [-0.9, 2.9] [-1.9, 1.9] #SD=3 [-0.8, 4.8] [-1.8, 3.8] [-2.8, 2.8] #========== # Show how a potential confidence interval for a mean will look assuming # standard deviations of 1, 2, or 5, a sample mean of 5, 3, or 1, and # a sample size of 15. ciTableMean(n1 = 15, diff.or.mean = c(5, 3, 1), SD = c(1, 2, 5), sample.type = "one") # Mean=5 Mean=3 Mean=1 #SD=1 [ 4.4, 5.6] [ 2.4, 3.6] [ 0.4, 1.6] #SD=2 [ 3.9, 6.1] [ 1.9, 4.1] [-0.1, 2.1] #SD=5 [ 2.2, 7.8] [ 0.2, 5.8] [-1.8, 3.8] #========== # The data frame EPA.09.Ex.16.1.sulfate.df contains sulfate concentrations # (ppm) at one background and one downgradient well. The estimated # mean and standard deviation for the background well are 536 and 27 ppm, # respectively, based on a sample size of n = 8 quarterly samples taken over # 2 years. A two-sided 95% confidence interval for this mean is [514, 559], # which has a half-width of 23 ppm. # # The estimated mean and standard deviation for the downgradient well are # 608 and 18 ppm, respectively, based on a sample size of n = 6 quarterly # samples. A two-sided 95% confidence interval for the difference between # this mean and the background mean is [44, 100] ppm. # # Suppose we want to design a future sampling program and are interested in # the size of the confidence interval for the difference between the two means. # We will use ciTableMean to generate a table of possible confidence intervals # by varying the assumed standard deviation and assumed differences between # the means. # Look at the data #----------------- EPA.09.Ex.16.1.sulfate.df # Month Year Well.type Sulfate.ppm #1 Jan 1995 Background 560 #2 Apr 1995 Background 530 #3 Jul 1995 Background 570 #4 Oct 1995 Background 490 #5 Jan 1996 Background 510 #6 Apr 1996 Background 550 #7 Jul 1996 Background 550 #8 Oct 1996 Background 530 #9 Jan 1995 Downgradient NA #10 Apr 1995 Downgradient NA #11 Jul 1995 Downgradient 600 #12 Oct 1995 Downgradient 590 #13 Jan 1996 Downgradient 590 #14 Apr 1996 Downgradient 630 #15 Jul 1996 Downgradient 610 #16 Oct 1996 Downgradient 630 # Compute the estimated mean and standard deviation for the # background well. #----------------------------------------------------------- Sulfate.back <- with(EPA.09.Ex.16.1.sulfate.df, Sulfate.ppm[Well.type == "Background"]) enorm(Sulfate.back, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 536.2500 # sd = 26.6927 # #Estimation Method: mvue # #Data: Sulfate.back # #Sample Size: 8 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 513.9343 # UCL = 558.5657 # Compute the estimated mean and standard deviation for the # downgradient well. 
#---------------------------------------------------------- Sulfate.down <- with(EPA.09.Ex.16.1.sulfate.df, Sulfate.ppm[Well.type == "Downgradient"]) enorm(Sulfate.down, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 608.33333 # sd = 18.34848 # #Estimation Method: mvue # #Data: Sulfate.down # #Sample Size: 6 # #Number NA/NaN/Inf's: 2 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 589.0778 # UCL = 627.5889 # Compute the estimated difference between the means and the confidence # interval for the difference: #---------------------------------------------------------------------- t.test(Sulfate.down, Sulfate.back, var.equal = TRUE) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: difference in means = 0 # #Alternative Hypothesis: True difference in means is not equal to 0 # #Test Name: Two Sample t-test # #Estimated Parameter(s): mean of x = 608.3333 # mean of y = 536.2500 # #Data: Sulfate.down and Sulfate.back # #Test Statistic: t = 5.660985 # #Test Statistic Parameter: df = 12 # #P-value: 0.0001054306 # #95% Confidence Interval: LCL = 44.33974 # UCL = 99.82693 # Use ciTableMean to look how the confidence interval for the difference # between the background and downgradient means in a future study using eight # quarterly samples at each well varies with assumed value of the pooled standard # deviation and the observed difference between the sample means. #-------------------------------------------------------------------------------- # Our current estimate of the pooled standard deviation is 24 ppm: summary(lm(Sulfate.ppm ~ Well.type, data = EPA.09.Ex.16.1.sulfate.df))$sigma #[1] 23.57759 # We can see that if this is overly optimistic and in our next study the # pooled standard deviation is around 50 ppm, then if the observed difference # between the means is 50 ppm, the lower end of the confidence interval for # the difference between the two means will include 0, so we may want to # increase our sample size. ciTableMean(n1 = 8, n2 = 8, diff = c(100, 50, 0), SD = c(15, 25, 50), digits = 0) # Diff=100 Diff=50 Diff=0 #SD=15 [ 84, 116] [ 34, 66] [-16, 16] #SD=25 [ 73, 127] [ 23, 77] [-27, 27] #SD=50 [ 46, 154] [ -4, 104] [-54, 54] #========== # Clean up #--------- rm(Sulfate.back, Sulfate.down)
Create a table of confidence intervals for the probability of "success" for a binomial distribution, or for the difference between two proportions, following Bacchetti (2010), by varying the estimated proportion or the difference between the two estimated proportions given the sample size(s).
ciTableProp(n1 = 10, p1.hat = c(0.1, 0.2, 0.3), n2 = n1, p2.hat.minus.p1.hat = c(0.2, 0.1, 0), sample.type = "two.sample", ci.type = "two.sided", conf.level = 0.95, digits = 2, ci.method = "score", correct = TRUE, tol = 10^-(digits + 1))
n1 |
positive integer greater than 1 specifying the sample size when |
p1.hat |
numeric vector of values between 0 and 1 indicating the estimated proportion
( |
n2 |
positive integer greater than 1 specifying the sample size for group 2 when
|
p2.hat.minus.p1.hat |
numeric vector indicating the assumed difference between the two sample proportions
when |
sample.type |
character string specifying whether to create confidence intervals for the difference
between two proportions ( |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
positive integer indicating how many decimal places to display in the table. The
default value is |
ci.method |
character string indicating the method to use to construct the confidence interval.
The default value is |
correct |
logical scalar indicating whether to use the correction for continuity when |
tol |
numeric scalar indicating how close the values of the adjusted elements of |
One-Sample Case (sample.type="one.sample"
)
For the one-sample case, the function ciTableProp
calls the R function
prop.test
when ci.method="score"
, and calls the R function
binom.test
, when ci.method="exact"
. To ensure that the
user-supplied values of p1.hat
are valid for the given user-supplied values
of n1
, values for the argument x
to the function
prop.test
or binom.test
are computed using the formula
x <- unique(round((p1.hat * n1), 0))
and the argument p1.hat
is then adjusted using the formula
p1.hat <- x/n1
Two-Sample Case (sample.type="two.sample"
)
For the two-sample case, the function ciTableProp
calls the R function
prop.test
. To ensure that the user-supplied values of p1.hat
are valid for the given user-supplied values of n1
, the values for the
first component of the argument x
to the function
prop.test
are computed using the formula
x1 <- unique(round((p1.hat * n1), 0))
and the argument p1.hat
is then adjusted using the formula
p1.hat <- x1/n1
Next, the estimated proportions from group 2 are computed by adding together all
possible combinations from the elements of p1.hat
and
p2.hat.minus.p1.hat
. These estimated proportions from group 2 are then
adjusted using the formulas:
x2.rep <- round((p2.hat.rep * n2), 0)
p2.hat.rep <- x2.rep/n2
If any of these adjusted proportions from group 2 are less than 0 or greater than 1,
the function terminates with a message indicating that impossible
values have been supplied.
In cases where the sample sizes are small there may be instances where the
user-supplied values of p1.hat
and/or p2.hat.minus.p1.hat
are not
attainable. The argument tol
is used to determine whether to return
the table in conventional form or whether it is necessary to modify the table
to include twice as many columns (see EXAMPLES section below).
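As a small illustration of the adjustment step described above (a minimal sketch, not part of ciTableProp; the values are the ones used in the small-sample example in the EXAMPLES section):

n1 <- 5
p1.hat <- c(0.3, 0.6, 0.12)
x1 <- unique(round(p1.hat * n1, 0))  # attainable counts of "successes": 2, 3, 1
x1 / n1                              # adjusted proportions: 0.4, 0.6, 0.2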
a data frame with elements that are character strings indicating the confidence intervals.
When sample.type="two.sample"
, a data frame with the rows varying
the estimated proportion for group 1 (i.e., the values of p1.hat
) and
the columns varying the estimated difference between the proportions from
group 2 and group 1 (i.e., the values of p2.hat.minus.p1.hat
). In cases
where the sample sizes are small, it may not be possible to obtain certain
differences for given values of p1.hat
, in which case the returned
data frame contains twice as many columns indicating the actual difference
in one column and the compute confidence interval next to it (see EXAMPLES
section below).
When sample.type="one.sample"
, a 1-row data frame with the columns
varying the estimated proportion (i.e., the values of p1.hat
).
Bacchetti (2010) presents strong arguments against the current convention in scientific research for computing sample size that is based on formulas that use a fixed Type I error (usually 5%) and a fixed minimal power (often 80%) without regard to costs. He notes that a key input to these formulas is a measure of variability (usually a standard deviation) that is difficult to measure accurately "unless there is so much preliminary data that the study isn't really needed." Also, study designers often avoid defining what a scientifically meaningful difference is by presenting sample size results in terms of the effect size (i.e., the difference of interest divided by the elusive standard deviation). Bacchetti (2010) encourages study designers to use simple tables in a sensitivity analysis to see what results of a study may look like for low, moderate, and high rates of variability and large, intermediate, and no underlying differences in the populations or processes being studied.
Steven P. Millard ([email protected])
Bacchetti, P. (2010). Current sample size conventions: Flaws, Harms, and Alternatives. BMC Medicine 8, 17–23.
Also see the references in the help files for prop.test and binom.test.
prop.test, binom.test, ciTableMean, ciBinomHalfWidth, ciBinomN, plotCiBinomDesign.
# Reproduce Table 1 in Bacchetti (2010). This involves planning a study with # n1 = n2 = 935 subjects per group, where Group 1 is the control group and # Group 2 is the treatment group. The outcome in the study is proportion of # subjects with serious outcomes or death. A negative value for the difference # in proportions between groups (Group 2 proportion - Group 1 proportion) # indicates the treatment group has a better outcome. In this table, the # proportion of subjects in Group 1 with serious outcomes or death is set # to 3%, 6.5%, and 12%, and the difference in proportions between the two # groups is set to -2.8 percentage points, -1.4 percentage points, and 0. ciTableProp(n1 = 935, p1.hat = c(0.03, 0.065, 0.12), n2 = 935, p2.hat.minus.p1.hat = c(-0.028, -0.014, 0), digits = 3) # Diff=-0.028 Diff=-0.014 Diff=0 #P1.hat=0.030 [-0.040, -0.015] [-0.029, 0.001] [-0.015, 0.015] #P1.hat=0.065 [-0.049, -0.007] [-0.036, 0.008] [-0.022, 0.022] #P1.hat=0.120 [-0.057, 0.001] [-0.044, 0.016] [-0.029, 0.029] #========== # Show how the returned data frame has to be modified for cases of small # sample sizes where not all user-supplied differenes are possible. ciTableProp(n1 = 5, n2 = 5, p1.hat = c(0.3, 0.6, 0.12), p2.hat = c(0.2, 0.1, 0)) # Diff CI Diff CI Diff CI #P1.hat=0.4 0.2 [-0.61, 1.00] 0.0 [-0.61, 0.61] 0 [-0.61, 0.61] #P1.hat=0.6 0.2 [-0.55, 0.95] 0.2 [-0.55, 0.95] 0 [-0.61, 0.61] #P1.hat=0.2 0.2 [-0.55, 0.95] 0.2 [-0.55, 0.95] 0 [-0.50, 0.50] #========== # Suppose we are planning a study to compare the proportion of nondetects at # a background and downgradient well, and we can use ciTableProp to look how # the confidence interval for the difference between the two proportions using # say 36 quarterly samples at each well varies with the observed estimated # proportions. Here we'll let the argument "p1.hat" denote the proportion of # nondetects observed at the downgradient well and set this equal to # 20%, 40% and 60%. The argument "p2.hat.minus.p1.hat" represents the proportion # of nondetects at the background well minus the proportion of nondetects at the # downgradient well. ciTableProp(n1 = 36, p1.hat = c(0.2, 0.4, 0.6), n2 = 36, p2.hat.minus.p1.hat = c(0.3, 0.15, 0)) # Diff=0.31 Diff=0.14 Diff=0 #P1.hat=0.19 [ 0.07, 0.54] [-0.09, 0.37] [-0.18, 0.18] #P1.hat=0.39 [ 0.06, 0.55] [-0.12, 0.39] [-0.23, 0.23] #P1.hat=0.61 [ 0.09, 0.52] [-0.10, 0.38] [-0.23, 0.23] # We see that even if the observed difference in the proportion of nondetects # is about 15 percentage points, all of the confidence intervals for the # difference between the proportions of nondetects at the two wells contain 0, # so if a difference of 15 percentage points is important to substantiate, we # may need to increase our sample sizes.
Compute the sample coefficient of variation.
cv(x, method = "moments", sd.method = "sqrt.unbiased", l.moment.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), na.rm = FALSE)
x |
numeric vector of observations. |
method |
character string specifying what method to use to compute the sample coefficient
of variation. The possible values are |
sd.method |
character string specifying what method to use to compute the sample standard
deviation when |
l.moment.method |
character string specifying what method to use to compute the
|
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for
the plotting positions when |
na.rm |
logical scalar indicating whether to remove missing values from |
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a random sample of $n$ observations from
some distribution with mean $\mu$ and standard deviation $\sigma$.
Product Moment Coefficient of Variation (method="moments"
)
The coefficient of variation (sometimes denoted CV) of a distribution is
defined as the ratio of the standard deviation to the mean. That is:

$$CV = \frac{\sigma}{\mu} \;\;\;\;\;\; (1)$$
The coefficient of variation measures how spread out the distribution is relative to the size of the mean. It is usually used to characterize positive, right-skewed distributions such as the lognormal distribution.
When sd.method="sqrt.unbiased"
, the coefficient of variation is estimated
using the sample mean and the square root of the unbaised estimator of variance:
where
Note that the estimator of standard deviation in equation (4) is not unbiased.
When sd.method="moments"
, the coefficient of variation is estimated using
the sample mean and the square root of the method of moments estimator of variance:
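As a minimal sketch, the two product-moment estimates can be computed by hand and compared with cv() (the data vector x below is hypothetical):

x <- c(2, 4, 7, 11)                             # hypothetical data
x.bar <- mean(x)
s <- sqrt(sum((x - x.bar)^2) / (length(x) - 1)) # square root of unbiased variance, equation (4)
s.m <- sqrt(sum((x - x.bar)^2) / length(x))     # method of moments estimator, equation (6)
c(s / x.bar, cv(x))                             # agree: default is sd.method = "sqrt.unbiased"
c(s.m / x.bar, cv(x, sd.method = "moments"))    # agree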
L-Moment Coefficient of Variation (method="l.moments"
)
Hosking (1990) defines an $L$-moment analog of the coefficient of variation
(denoted the $L$-CV) as:

$$\tau = \frac{\lambda_2}{\lambda_1} \;\;\;\;\;\; (7)$$

that is, the second $L$-moment divided by the first $L$-moment.
He shows that for a positive-valued random variable, the $L$-CV lies in the
interval (0, 1).

When l.moment.method="unbiased", the $L$-CV is estimated by:

$$t = \frac{l_2}{l_1} \;\;\;\;\;\; (8)$$

that is, the unbiased estimator of the second $L$-moment divided by
the unbiased estimator of the first $L$-moment.

When l.moment.method="plotting.position", the $L$-CV is estimated by:

$$\tilde{t} = \frac{\tilde{l}_2}{\tilde{l}_1} \;\;\;\;\;\; (9)$$

that is, the plotting-position estimator of the second $L$-moment divided by
the plotting-position estimator of the first $L$-moment.
See the help file for lMoment
for more information on
estimating $L$-moments.
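As a rough numerical check (a sketch, assuming the EnvStats function lMoment and a hypothetical data vector x), the $L$-CV estimate returned by cv() with method="l.moments" equals the ratio of the first two sample $L$-moments:

x <- c(2, 4, 7, 11)                            # hypothetical data
lMoment(x, order = 2) / lMoment(x, order = 1)  # second sample L-moment over the first
cv(x, method = "l.moments")                    # same value (default l.moment.method = "unbiased")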
A numeric scalar – the sample coefficient of variation.
Traditionally, the coefficient of variation has been estimated using
product moment estimators. Hosking (1990) introduced the idea of
$L$-moments and the
$L$-CV. Vogel and Fennessey (1993) argue that
$L$-moment ratios should replace product moment ratios because of their
superior performance (they are nearly unbiased and better for discriminating
between distributions).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Vogel, R.M., and N.M. Fennessey. (1993). L Moment Diagrams Should Replace
Product Moment Diagrams. Water Resources Research 29(6), 1745–1752.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
Summary Statistics, summaryFull, var, sd, skewness, kurtosis.
# Generate 20 observations from a lognormal distribution with # parameters mean=10 and cv=1, and estimate the coefficient of variation. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 1) cv(dat) #[1] 0.5077981 cv(dat, sd.method = "moments") #[1] 0.4949403 cv(dat, method = "l.moments") #[1] 0.2804148 #---------- # Clean up rm(dat)
Determine the detection limit based on using a calibration line (or curve) and inverse regression.
detectionLimitCalibrate(object, coverage = 0.99, simultaneous = TRUE)
object |
an object of class |
coverage |
optional numeric scalar between 0 and 1 indicating the confidence level associated with
the prediction intervals used in determining the detection limit.
The default value is |
simultaneous |
optional logical scalar indicating whether to base the prediction intervals on
simultaneous or non-simultaneous prediction limits. The default value is |
The idea of a decision limit and detection limit is directly related to calibration and
can be framed in terms of a hypothesis test, as shown in the table below.
The null hypothesis is that the chemical is not present in the physical sample, i.e.,
$H_0: C = 0$, where $C$ denotes the concentration.
Your Decision | $H_0$ True ($C = 0$) | $H_0$ False ($C > 0$) |
---|---|---|
Reject $H_0$ (Declare Chemical Present) | Type I Error (Probability = $\alpha$) | |
Do Not Reject $H_0$ (Declare Chemical Absent) | | Type II Error (Probability = $\beta$) |
Ideally, you would like to minimize both the Type I and Type II error rates.
Just as we use critical values to compare against the test statistic for a hypothesis test,
we need to use a critical signal level $S_D$ called the decision limit to decide
whether the chemical is present or absent. If the signal is less than or equal to $S_D$
we will declare the chemical is absent, and if the signal is greater than $S_D$ we will
declare the chemical is present.
First, suppose no chemical is present (i.e., the null hypothesis is true).
If we want to guard against the mistake of declaring that the chemical is present when in fact it is
absent (Type I error), then we should choose $S_D$ so that the probability of this happening is
some small value $\alpha$. Thus, the value of $S_D$ depends on what we want to use for $\alpha$
(the Type I error rate), and the true (but unknown) value of $\sigma$
(the standard deviation of the errors assuming a constant standard deviation)
(Massart et al., 1988, p. 111).
When the true concentration is 0, the decision limit is the $(1-\alpha)100$th percentile of the
distribution of the signal S. Note that the decision limit $S_D$ is on the scale of, and in units
of, the signal S.
Now suppose that in fact the chemical is present in some concentration C
(i.e., the null hypothesis is false). If we want to guard against the mistake of
declaring that the chemical is absent when in fact it is present (Type II error),
then we need to determine a minimal concentration called the detection limit (DL)
that we know will yield a signal less than the decision limit $S_D$ only a small fraction of the
time ($\beta$).
In practice we do not know the true value of the standard deviation of the errors ($\sigma$),
so we cannot compute the true decision limit. Also, we do not know the true values of the
intercept and slope of the calibration line, so we cannot compute the true detection limit.
Instead, we usually set $\beta = \alpha$ and estimate the decision and detection limits
by computing prediction limits for the calibration line and using inverse regression.
The estimated detection limit corresponds to the upper confidence bound on concentration given that the signal is equal to the estimated decision limit. Currie (1997) discusses other ways to define the detection limit, and Glaser et al. (1981) define a quantity called the method detection limit.
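A rough sketch of this idea, using the calibrate/predict/pointwise workflow shown in the EXAMPLES section (an approximation for illustration only, not the exact algorithm used by detectionLimitCalibrate):

calfit <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df)
conc <- seq(0, 20, by = 0.01)
pred <- predict(calfit, newdata = data.frame(Spike = conc), se.fit = TRUE)
pl <- pointwise(pred, coverage = 0.99, individual = TRUE)
S.D <- pl$upper[conc == 0]        # approximate decision limit (signal scale)
DL <- min(conc[pl$lower >= S.D])  # smallest concentration whose lower prediction
                                  # bound exceeds the decision limit
c(S.D, DL)                        # compare with the values in the EXAMPLES section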
A numeric vector of length 2 indicating the signal detection limit and the concentration
detection limit. This vector has two attributes called coverage
and simultaneous
indicating the values of these arguments that were used in the
call to detectionLimitCalibrate
.
Perhaps no other topic in environmental statistics has generated as much confusion or controversy as the topic of detection limits. After decades of disparate terminology, ISO and IUPAC provided harmonized guidance on the topic in 1995 (Currie, 1997). Intuitively, the idea of a detection limit is simple to grasp: the detection limit is “the smallest amount or concentration of a particular substance that can be reliably detected in a given type of sample or medium by a specific measurement process” (Currie, 1997, p. 152). Unfortunately, because of the exceedingly complex nature of measuring chemical concentrations, this simple idea is difficult to apply in practice.
Detection and quantification capabilities are fundamental performance characteristics of the Chemical Measurement Process (CMP) (Currie, 1996, 1997). In this help file we discuss some currently accepted definitions of the terms decision, detection, and quantification limits. For more details, the reader should consult the references listed in this help file.
The quantification limit is defined as the concentration C at which the coefficient of variation (also called relative standard deviation or RSD) for the distribution of the signal S is some small value, usually taken to be 10% (Currie, 1968, 1997). In practice the quantification limit is difficult to estimate because we have to estimate both the mean and the standard deviation of the signal S for any particular concentration, and usually the standard deviation varies with concentration. Variations of the quantification limit include the quantitation limit (Keith, 1991, p. 109), minimum level (USEPA, 1993), and alternative minimum level (Gibbons et al., 1997a).
Steven P. Millard ([email protected])
Clark, M.J.R., and P.H. Whitfield. (1994). Conflicting Perspectives About Detection Limits and About the Censoring of Environmental Data. Water Resources Bulletin 30(6), 1063–1079.
Clayton, C.A., J.W. Hines, and P.D. Elkins. (1987). Detection Limits with Specified Assurance Probabilities. Analytical Chemistry 59, 2506–2514.
Code of Federal Regulations. (1996). Definition and Procedure for the Determination of the Method Detection Limit–Revision 1.11. Title 40, Part 136, Appendix B, 7-1-96 Edition, pp.265–267.
Currie, L.A. (1968). Limits for Qualitative Detection and Quantitative Determination: Application to Radiochemistry. Analytical Chemistry 40, 586–593.
Currie, L.A. (1988). Detection in Analytical Chemistry: Importance, Theory, and Practice. American Chemical Society, Washington, D.C.
Currie, L.A. (1995). Nomenclature in Evaluation of Analytical Methods Including Detection and Quantification Capabilities. Pure & Applied Chemistry 67(10), 1699-1723.
Currie, L.A. (1996). Foundations and Future of Detection and Quantification Limits. Proceedings of the Section on Statistics and the Environment, American Statistical Association, Alexandria, VA.
Currie, L.A. (1997). Detection: International Update, and Some Emerging Di-Lemmas Involving Calibration, the Blank, and Multiple Detection Decisions. Chemometrics and Intelligent Laboratory Systems 37, 151-181.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B. (1997). Challenges in Regulatory Environmetrics. Chemometrics and Intelligent Laboratory Systems 37, 43–53.
Gibbons, R.D. (1995). Some Statistical and Conceptual Issues in the Detection of Low-Level Environmental Pollutants (with Discussion). Environmetrics 2, 125-167.
Gibbons, R.D., D.E. Coleman, and R.F. Maddalone. (1997a). An Alternative Minimum Level Definition for Analytical Quantification. Environmental Science & Technology 31(7), 2071–2077. Comments and Discussion in Volume 31(12), 3727–3731, and Volume 32(15), 2346–2353.
Gibbons, R.D., D.E. Coleman, and R.F. Maddalone. (1997b). Response to Comment on “An Alternative Minimum Level Definition for Analytical Quantification”. Environmental Science and Technology 31(12), 3729–3731.
Gibbons, R.D., D.E. Coleman, and R.F. Maddalone. (1998). Response to Comment on “An Alternative Minimum Level Definition for Analytical Quantification”. Environmental Science and Technology 32(15), 2349–2353.
Gibbons, R.D., N.E. Grams, F.H. Jarke, and K.P. Stoub. (1992). Practical Quantitation Limits. Chemometrics Intelligent Laboratory Systems 12, 225–235.
Gibbons, R.D., F.H. Jarke, and K.P. Stoub. (1991). Detection Limits: For Linear Calibration Curves with Increasing Variance and Multiple Future Detection Decisions. In Tatsch, D.E., editor. Waste Testing and Quality Assurance: Volume 3. American Society for Testing and Materials, Philadelphia, PA.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley & Sons, Hoboken. Chapter 6, p. 111.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey. Chapter 3, p. 22.
Glaser, J.A., D.L. Foerst, G.D. McKee, S.A. Quave, and W.L. Budde. (1981). Trace Analyses for Wastewaters. Environmental Science and Technology 15, 1426–1435.
Hubaux, A., and G. Vos. (1970). Decision and Detection Limits for Linear Calibration Curves. Analytical Chemistry 42, 849–855.
Kahn, H.D., C.E. White, K. Stralka, and R. Kuznetsovski. (1997). Alternative Estimates of Detection. Proceedings of the Twentieth Annual EPA Conference on Analysis of Pollutants in the Environment, May 7-8, Norfolk, VA. U.S. Environmental Protection Agency, Washington, D.C.
Kahn, H.D., W.A. Telliard, and C.E. White. (1998). Comment on “An Alternative Minimum Level Definition for Analytical Quantification” (with Response). Environmental Science & Technology 32(5), 2346–2353.
Kaiser, H. (1965). Zum Problem der Nachweisgrenze. Fresenius' Z. Anal. Chem. 209, 1.
Keith, L.H. (1991). Environmental Sampling and Analysis: A Practical Guide. Lewis Publishers, Boca Raton, FL, Chapter 10.
Kimbrough, D.E. (1997). Comment on “An Alternative Minimum Level Definition for Analytical Quantification” (with Response). Environmental Science & Technology 31(12), 3727–3731.
Lambert, D., B. Peterson, and I. Terpenning. (1991). Nondetects, Detection Limits, and the Probability of Detection. Journal of the American Statistical Association 86(414), 266–277.
Massart, D.L., B.G.M. Vandeginste, S.N. Deming, Y. Michotte, and L. Kaufman. (1988). Chemometrics: A Textbook. Elsevier, New York, Chapter 7.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Porter, P.S., R.C. Ward, and H.F. Bell. (1988). The Detection Limit. Environmental Science & Technology 22(8), 856–861.
Rocke, D.M., and S. Lorenzato. (1995). A Two-Component Model for Measurement Error in Analytical Chemistry. Technometrics 37(2), 176–184.
Singh, A. (1993). Multivariate Decision and Detection Limits. Analytica Chimica Acta 277, 205-214.
Spiegelman, C.H. (1997). A Discussion of Issues Raised by Lloyd Currie and a Cross Disciplinary View of Detection Limits and Estimating Parameters That Are Often At or Near Zero. Chemometrics and Intelligent Laboratory Systems 37, 183–188.
USEPA. (1987c). List (Phase 1) of Hazardous Constituents for Ground-Water Monitoring; Final Rule. Federal Register 52(131), 25942–25953 (July 9, 1987).
Zorn, M.E., R.D. Gibbons, and W.C. Sonzogni. (1997). Weighted Least-Squares Approach to Calculating Limits of Detection and Quantification by Modeling Variability as a Function of Concentration. Analytical Chemistry 69, 3069–3075.
calibrate, inversePredictCalibrate, pointwise.
# The data frame EPA.97.cadmium.111.df contains calibration # data for cadmium at mass 111 (ng/L) that appeared in # Gibbons et al. (1997b) and were provided to them by the U.S. EPA. # # The Example section in the help file for calibrate shows how to # plot these data along with the fitted calibration line and 99% # non-simultaneous prediction limits. # # For the current example, we will compute the decision limit (7.68) # and detection limit (12.36 ng/L) based on using alpha = beta = 0.01 # and a linear calibration line with constant variance. See # Millard and Neerchal (2001, pp.566-575) for more details on this # example. calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) detectionLimitCalibrate(calibrate.list, simultaneous = FALSE) # Decision Limit (Signal) Detection Limit (Concentration) # 7.677842 12.364670 #attr(,"coverage") #[1] 0.99 #attr(,"simultaneous") #[1] FALSE #---------- # Clean up #--------- rm(calibrate.list)
Perform a series of goodness-of-fit tests from a (possibly user-specified) set of candidate probability distributions to determine which probability distribution provides the best fit for a data set.
distChoose(y, ...) ## S3 method for class 'formula' distChoose(y, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: distChoose(y, alpha = 0.05, method = "sw", choices = c("norm", "gamma", "lnorm"), est.arg.list = NULL, warn = TRUE, keep.data = TRUE, data.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...)
y |
an object containing data for the goodness-of-fit tests. In the default
method, the argument |
data |
specifies an optional data frame, list or environment (or object coercible
by |
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
alpha |
numeric scalar between 0 and 1 specifying the Type I error associated with each
goodness-of-fit test. When |
method |
character string defining which method to use. Possible values are:
"sw" (Shapiro-Wilk; the default), "sf" (Shapiro-Francia), "ppcc" (Probability Plot Correlation Coefficient), and "proucl".
See the DETAILS section below. |
choices |
a character vector denoting the distribution abbreviations of the candidate
distributions. See the help file for This argument is ignored when |
est.arg.list |
a list containing one or more lists of arguments to be passed to the
function(s) estimating the distribution parameters. The name(s) of
the components of the list must be equal to or a subset of the values of the
argument When testing for some form of normality (i.e., Normal, Lognormal, Three-Parameter Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta)), the estimated parameters are provided in the output merely for information, and the choice of the method of estimation has no effect on the goodness-of-fit test statistics or p-values. This argument is ignored when |
warn |
logical scalar indicating whether to print a warning message when
observations with |
keep.data |
logical scalar indicating whether to return the original data. The
default value is |
data.name |
optional character string indicating the name of the data used for argument |
parent.of.data |
character string indicating the source of the data used for the goodness-of-fit test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the goodness-of-fit test. |
The function distChoose
returns a list with information on the goodness-of-fit
tests for various distributions and which distribution appears to best fit the
data based on the p-values from the goodness-of-fit tests. This function was written in
order to compare ProUCL's way of choosing the best-fitting distribution (USEPA, 2015) with
other ways of choosing the best-fitting distribution.
Method Based on Shapiro-Wilk, Shapiro-Francia, or Probability Plot Correlation Test
(method="sw"
, method="sf"
, or method="ppcc"
)
For each value of the argument choices
, the function distChoose
runs the goodness-of-fit test using the data in y
assuming that particular
distribution. For example, if choices=c("norm", "gamma", "lnorm")
,
indicating the Normal, Gamma, and Lognormal distributions, and
method="sw"
, then the usual Shapiro-Wilk test is performed for the Normal
and Lognormal distributions, and the extension of the Shapiro-Wilk test is performed
for the Gamma distribution (see the section
Testing Goodness-of-Fit for Any Continuous Distribution in the help
file for gofTest
for an explanation of the latter). The distribution associated
with the largest p-value is the chosen distribution. In the case when all p-values are
less than the value of the argument alpha
, the distribution “Nonparametric” is chosen.
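A condensed sketch of this "largest p-value wins" rule in terms of gofTest (the helper name pickBySW is hypothetical; distChoose does this bookkeeping, along with parameter estimation, internally):

pickBySW <- function(y, choices = c("norm", "gamma", "lnorm"), alpha = 0.05) {
  # Shapiro-Wilk p-value (or its extension) for each candidate distribution
  p <- sapply(choices, function(d) gofTest(y, distribution = d, test = "sw")$p.value)
  if (all(p < alpha)) "Nonparametric" else choices[which.max(p)]
}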
Method Based on ProUCL Algorithm (method="proucl"
)
When method="proucl"
, the function distChoose
uses the
algorithm that ProUCL (USEPA, 2015) uses to determine the best fitting
distribution. The candidate distributions are the
Normal, Gamma, and Lognormal distributions. The algorithm
used by ProUCL is as follows:
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the
Normal distribution, i.e., call the function gofTest
with
distribution = "norm", test="sw"
and distribution = "norm", test="lillie"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Normal distribution. Otherwise, proceed to the next step.
Perform the “ProUCL Anderson-Darling” and “ProUCL Kolmogorov-Smirnov” goodness-of-fit
tests for the Gamma distribution,
i.e., call the function gofTest
with distribution="gamma", test="proucl.ad.gamma"
and distribution="gamma", test="proucl.ks.gamma"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Gamma distribution. Otherwise, proceed to the next step.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the
Lognormal distribution, i.e., call the function gofTest
with
distribution="lnorm", test="sw"
and distribution="lnorm", test="lillie"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Lognormal distribution. Otherwise, proceed to the next step.
If none of the goodness-of-fit tests above yields a p-value greater than or equal to the user-supplied value
of alpha
, then choose the “Nonparametric” distribution.
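A minimal sketch of this cascade in terms of gofTest (the helper name prouclChoose is hypothetical, and the handling of the ProUCL gamma tests, whose p-values may only be reported as bounds such as ">= 0.10", is simplified here; distChoose(method = "proucl") performs these steps for you):

prouclChoose <- function(y, alpha = 0.05) {
  # Step 1: Shapiro-Wilk and Lilliefors tests for normality
  p.norm <- c(gofTest(y, distribution = "norm", test = "sw")$p.value,
              gofTest(y, distribution = "norm", test = "lillie")$p.value)
  if (any(p.norm >= alpha)) return("Normal")
  # Step 2: ProUCL Anderson-Darling and Kolmogorov-Smirnov tests for the Gamma
  p.gam <- c(gofTest(y, distribution = "gamma", test = "proucl.ad.gamma")$p.value,
             gofTest(y, distribution = "gamma", test = "proucl.ks.gamma")$p.value)
  if (any(p.gam >= alpha)) return("Gamma")
  # Step 3: Shapiro-Wilk and Lilliefors tests for lognormality
  p.lnorm <- c(gofTest(y, distribution = "lnorm", test = "sw")$p.value,
               gofTest(y, distribution = "lnorm", test = "lillie")$p.value)
  if (any(p.lnorm >= alpha)) return("Lognormal")
  "Nonparametric"
}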
a list of class "distChoose"
containing the results of the goodness-of-fit tests.
Objects of class "distChoose"
have a special printing method.
See the help files for distChoose.object
for details.
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlot
).
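For example, a Q-Q plot of the data against a fitted candidate distribution can be drawn with qqPlot (a sketch; dat is a hypothetical data vector):

qqPlot(dat, distribution = "gamma", estimate.params = TRUE, add.line = TRUE)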
Steven P. Millard ([email protected])
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of $g_1$.
Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality.
Empirical Results for the Distributions of $b_2$ and $\sqrt{b_1}$.
Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of $\sqrt{b_1}$.
Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.
Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari 4, 83-91.
Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). doi:10.18637/jss.v008.i18.
Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, pp.63-95.
Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.
Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.
Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.
Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
gofTest, distChoose.object, print.distChoose.
# Generate 20 observations from a gamma distribution with # parameters shape = 2 and scale = 3 and: # # 1) Call distChoose using the Shapiro-Wilk method. # # 2) Call distChoose using the Shapiro-Wilk method and specify # the bias-corrected method of estimating shape for the Gamma # distribution. # # 3) Compare the results in 2) above with the results using the # ProUCL method. # # Notes: The call to set.seed lets you reproduce this example. # # The ProUCL method chooses the Normal distribution, whereas the # Shapiro-Wilk method chooses the Gamma distribution. set.seed(47) dat <- rgamma(20, shape = 2, scale = 3) # 1) Call distChoose using the Shapiro-Wilk method. #-------------------------------------------------- distChoose(dat) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: MLE # #Data: dat # #Sample Size: 20 # #Test Results: # # Normal # Test Statistic: W = 0.9097488 # P-value: 0.06303695 # # Gamma # Test Statistic: W = 0.9834958 # P-value: 0.970903 # # Lognormal # Test Statistic: W = 0.9185006 # P-value: 0.09271768 #-------------------- # 2) Call distChoose using the Shapiro-Wilk method and specify # the bias-corrected method of estimating shape for the Gamma # distribution. #--------------------------------------------------------------- distChoose(dat, method = "sw", est.arg.list = list(gamma = list(method = "bcmle"))) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 1.656376 # scale = 4.676680 # #Estimation Method: Bias-Corrected MLE # #Data: dat # #Sample Size: 20 # #Test Results: # # Normal # Test Statistic: W = 0.9097488 # P-value: 0.06303695 # # Gamma # Test Statistic: W = 0.9834346 # P-value: 0.9704046 # # Lognormal # Test Statistic: W = 0.9185006 # P-value: 0.09271768 #-------------------- # 3) Compare the results in 2) above with the results using the # ProUCL method. #--------------------------------------------------------------- distChoose(dat, method = "proucl") #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: ProUCL # #Type I Error per Test: 0.05 # #Decision: Normal # #Estimated Parameter(s): mean = 7.746340 # sd = 5.432175 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Test Results: # # Normal # Shapiro-Wilk GOF # Test Statistic: W = 0.9097488 # P-value: 0.06303695 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.1547851 # P-value: 0.238092 # # Gamma # ProUCL Anderson-Darling Gamma GOF # Test Statistic: A = 0.1853826 # P-value: >= 0.10 # ProUCL Kolmogorov-Smirnov Gamma GOF # Test Statistic: D = 0.0988692 # P-value: >= 0.10 # # Lognormal # Shapiro-Wilk GOF # Test Statistic: W = 0.9185006 # P-value: 0.09271768 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.149317 # P-value: 0.2869177 #-------------------- # Clean up #--------- rm(dat) #==================================================================== # Example 10-2 of USEPA (2009, page 10-14) gives an example of # using the Shapiro-Wilk test to test the assumption of normality # for nickel concentrations (ppb) in groundwater collected over # 4 years. 
The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #4 8 Well.1 56.0 #5 10 Well.1 8.7 #6 1 Well.2 19.0 #7 3 Well.2 81.5 #8 6 Well.2 331.0 #9 8 Well.2 14.0 #10 10 Well.2 64.4 #11 1 Well.3 39.0 #12 3 Well.3 151.0 #13 6 Well.3 27.0 #14 8 Well.3 21.4 #15 10 Well.3 578.0 #16 1 Well.4 3.1 #17 3 Well.4 942.0 #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Use distChoose with the probability plot correlation method, # and for the lognormal distribution specify the # mean and CV parameterization: #------------------------------------------------------------ distChoose(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df, choices = c("norm", "gamma", "lnormAlt"), method = "ppcc") #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: PPCC # #Type I Error per Test: 0.05 # #Decision: Lognormal # #Estimated Parameter(s): mean = 213.415628 # cv = 2.809377 # #Estimation Method: mvue # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Results: # # Normal # Test Statistic: r = 0.8199825 # P-value: 5.753418e-05 # # Gamma # Test Statistic: r = 0.9749044 # P-value: 0.317334 # # Lognormal # Test Statistic: r = 0.9912528 # P-value: 0.9187852 #-------------------- # Repeat the above example using the ProUCL method. #-------------------------------------------------- distChoose(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df, method = "proucl") #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: ProUCL # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 0.5198727 # scale = 326.0894272 # #Estimation Method: MLE # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Results: # # Normal # Shapiro-Wilk GOF # Test Statistic: W = 0.6788888 # P-value: 2.17927e-05 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.3267052 # P-value: 5.032807e-06 # # Gamma # ProUCL Anderson-Darling Gamma GOF # Test Statistic: A = 0.5076725 # P-value: >= 0.10 # ProUCL Kolmogorov-Smirnov Gamma GOF # Test Statistic: D = 0.1842904 # P-value: >= 0.10 # # Lognormal # Shapiro-Wilk GOF # Test Statistic: W = 0.978946 # P-value: 0.9197735 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.08405167 # P-value: 0.9699648 #==================================================================== ## Not run: # 1) Simulate 1000 trials where for each trial you: # a) Generate 20 observations from a Gamma distribution with # parameters mean = 10 and CV = 1. # b) Use distChoose with the Shapiro-Wilk method. # c) Use distChoose with the ProUCL method. # # 2) Compare the proportion of times the # Normal vs. Gamma vs. Lognormal vs. Nonparametric distribution # is chosen for b) and c) above. 
#------------------------------------------------------------------ set.seed(58) N <- 1000 Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric")) Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac) for(i in 1:N) { dat <- rgammaAlt(20, mean = 10, cv = 1) Choose.df[i, "SW"] <- distChoose(dat, method = "sw")$decision Choose.df[i, "ProUCL"] <- distChoose(dat, method = "proucl")$decision } summaryStats(Choose.df, digits = 0) # ProUCL(N) ProUCL(Pct) SW(N) SW(Pct) #Normal 443 44 41 4 #Gamma 546 55 733 73 #Lognormal 9 1 215 22 #Nonparametric 2 0 11 1 #Combined 1000 100 1000 100 #-------------------- # Repeat above example for the Lognormal Distribution with mean=10 and CV = 1. #----------------------------------------------------------------------------- set.seed(297) N <- 1000 Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric")) Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac) for(i in 1:N) { dat <- rlnormAlt(20, mean = 10, cv = 1) Choose.df[i, "SW"] <- distChoose(dat, method = "sw")$decision Choose.df[i, "ProUCL"] <- distChoose(dat, method = "proucl")$decision } summaryStats(Choose.df, digits = 0) # ProUCL(N) ProUCL(Pct) SW(N) SW(Pct) #Normal 313 31 15 2 #Gamma 556 56 254 25 #Lognormal 121 12 706 71 #Nonparametric 10 1 25 2 #Combined 1000 100 1000 100 #-------------------- # Clean up #--------- rm(N, Choose.fac, Choose.df, i, dat) ## End(Not run)
Objects of S3 class "distChoose"
are returned by the EnvStats function
distChoose
.
Objects of S3 class "distChoose"
are lists that contain
information about the candidate distributions, the estimated distribution
parameters for each candidate distribution, and the test statistics and
p-values associated with each candidate distribution.
Required Components
The following components must be included in a legitimate list of
class "distChoose"
.
choices |
a character vector containing the full names
of the candidate distributions. (see |
method |
a character string denoting which method was used. |
decision |
a character vector containing the full name of the chosen distribution. |
alpha |
a numeric scalar between 0 and 1 specifying the Type I error associated with each goodness-of-fit test. |
distribution.parameters |
a numeric vector containing the estimated parameters associated with the chosen distribution. |
estimation.method |
a character string indicating the method
used to compute the estimated parameters associated with the chosen
distribution. The value of this component will depend on the
available estimation methods (see |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit tests. |
test.results |
a list with the same number of components as the number
of elements in the component |
data.name |
character string indicating the name of the data object used for the goodness-of-fit tests. |
Optional Components
The following component is included in the result of
calling distChoose
when the argument keep.data=TRUE
:
data |
numeric vector containing the data actually used for the goodness-of-fit tests (i.e., the original data without any missing or infinite values). |
The following component is included in the result of
calling distChoose
when missing (NA
),
undefined (NaN
) and/or infinite (Inf
, -Inf
)
values are present:
bad.obs |
numeric scalar indicating the number of missing ( |
Generic functions that have methods for objects of class
"distChoose"
include: print
.
Since objects of class "distChoose"
are lists, you may extract
their components with the $
and [[
operators.
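The following is a brief sketch (not part of the original help text) of pulling pieces out of a "distChoose" object with these operators. It assumes, based on the printed output shown in the example below, that each element of the test.results component contains a p.value element.

library(EnvStats)

set.seed(47)
dat <- rgamma(20, shape = 2, scale = 3)
obj <- distChoose(dat)

obj$decision                       # full name of the chosen distribution
obj[["distribution.parameters"]]   # estimated parameters for that choice

# p-values for all candidate distributions
sapply(obj$test.results, function(z) z$p.value)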
Steven P. Millard ([email protected])
distChoose
, print.distChoose
,
Goodness-of-Fit Tests,
Distribution.df
.
# Create an object of class "distChoose", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(47) dat <- rgamma(20, shape = 2, scale = 3) distChoose.obj <- distChoose(dat) mode(distChoose.obj) #[1] "list" class(distChoose.obj) #[1] "distChoose" names(distChoose.obj) #[1] "choices" "method" #[3] "decision" "alpha" #[5] "distribution.parameters" "estimation.method" #[7] "sample.size" "test.results" #[9] "data" "data.name" distChoose.obj #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: MLE # #Data: dat # #Sample Size: 20 # #Test Results: # # Normal # Test Statistic: W = 0.9097488 # P-value: 0.06303695 # # Gamma # Test Statistic: W = 0.9834958 # P-value: 0.970903 # # Lognormal # Test Statistic: W = 0.9185006 # P-value: 0.09271768 #========== # Extract the choices #-------------------- distChoose.obj$choices #[1] "Normal" "Gamma" "Lognormal" #========== # Clean up #--------- rm(dat, distChoose.obj)
Perform a series of goodness-of-fit tests for censored data from a (possibly user-specified) set of candidate probability distributions to determine which probability distribution provides the best fit for a data set.
distChooseCensored(x, censored, censoring.side = "left", alpha = 0.05, method = "sf", choices = c("norm", "gamma", "lnorm"), est.arg.list = NULL, prob.method = "hirsch-stedinger", plot.pos.con = 0.375, warn = TRUE, keep.data = TRUE, data.name = NULL, censoring.name = NULL)
x |
a numeric vector containing data for the goodness-of-fit tests.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
alpha |
numeric scalar between 0 and 1 specifying the Type I error associated with each
goodness-of-fit test. When |
method |
character string defining which method to use. Possible values are:
The Shapiro-Wilk method is only available for singly censored data. See the DETAILS section for more information. |
choices |
a character vector denoting the distribution abbreviations of the candidate
distributions. See the help file for This argument is ignored when |
est.arg.list |
a list containing one or more lists of arguments to be passed to the
function(s) estimating the distribution parameters. The name(s) of
the components of the list must be equal to or a subset of the values of the
argument In the course of testing for some form of normality (i.e., Normal, Lognormal),
the estimated parameters are saved in the This argument is ignored when |
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities) when
The default value is The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant to use when |
warn |
logical scalar indicating whether to print a warning message when
observations with |
keep.data |
logical scalar indicating whether to return the original data. The
default value is |
data.name |
optional character string indicating the name of the data used for argument |
censoring.name |
optional character string indicating the name for the data used for argument |
The function distChooseCensored
returns a list with information on the goodness-of-fit
tests for various distributions and which distribution appears to best fit the
data based on the p-values from the goodness-of-fit tests. This function was written in
order to compare ProUCL's way of choosing the best-fitting distribution (USEPA, 2015) with
other ways of choosing the best-fitting distribution.
Method Based on Shapiro-Wilk, Shapiro-Francia, or Probability Plot Correlation Test
(method="sw"
, method="sf"
, or method="ppcc"
)
For each value of the argument choices
, the function distChooseCensored
runs the goodness-of-fit test using the data in x
assuming that particular
distribution. For example, if choices=c("norm", "gamma", "lnorm")
,
indicating the Normal, Gamma, and Lognormal distributions, and
method="sf"
, then the usual Shapiro-Francia test is performed for the Normal
and Lognormal distributions, and the extension of the Shapiro-Francia test is performed
for the Gamma distribution (see the section
Testing Goodness-of-Fit for Any Continuous Distribution in the help
file for gofTestCensored
for an explanation of the latter). The distribution associated
with the largest p-value is the chosen distribution. In the case when all p-values are
less than the value of the argument alpha
, the distribution “Nonparametric” is chosen.
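Below is a minimal sketch of the "largest p-value" rule just described, assuming a left-censored sample. It is not the EnvStats implementation, and the helper name pickDist is made up for illustration; it simply calls gofTestCensored for each candidate and keeps the distribution with the largest p-value, falling back to "Nonparametric" when every p-value is below alpha.

library(EnvStats)

pickDist <- function(x, censored, choices = c("norm", "gamma", "lnorm"),
                     test = "sf", alpha = 0.05) {
  # One goodness-of-fit p-value per candidate distribution
  p.vals <- sapply(choices, function(dist)
    gofTestCensored(x, censored, distribution = dist, test = test)$p.value)
  if (all(p.vals < alpha)) "Nonparametric" else choices[which.max(p.vals)]
}

# Example use with simulated left-censored lognormal data:
set.seed(123)
x <- rlnormAlt(30, mean = 10, cv = 1)
censored <- x < 5
x[censored] <- 5
pickDist(x, censored)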
Method Based on ProUCL Algorithm (method="proucl"
)
When method="proucl"
, the function distChooseCensored
uses the
algorithm that ProUCL (USEPA, 2015) uses to determine the best fitting
distribution. The candidate distributions are the
Normal, Gamma, and Lognormal distributions. The algorithm
used by ProUCL is as follows:
Remove all censored observations and use only the uncensored observations.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the Normal distribution,
i.e., call the function gofTest
with distribution="norm", test="sw"
and distribution = "norm", test="lillie"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Normal distribution. Otherwise, proceed to the next step.
Perform the “ProUCL Anderson-Darling” and
“ProUCL Kolmogorov-Smirnov” goodness-of-fit tests for the Gamma distribution,
i.e., call the function gofTest
with distribution="gamma", test="proucl.ad.gamma"
and distribution="gamma", test="proucl.ks.gamma"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Gamma distribution. Otherwise, proceed to the next step.
Perform the Shapiro-Wilk and Lilliefors goodness-of-fit tests for the
Lognormal distribution, i.e., call the function gofTest
with distribution = "lnorm", test="sw"
and distribution = "lnorm", test="lillie"
.
If either or both of the associated p-values are greater than or equal to the user-supplied value
of alpha
, then choose the Lognormal distribution. Otherwise, proceed to the next step.
If none of the goodness-of-fit tests above yields a p-value greater than or equal to the user-supplied value
of alpha
, then choose the “Nonparametric” distribution.
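The cascade above can be sketched roughly as follows for left-censored data. This is not the EnvStats implementation: the helper names (prouclChoose, passes, p.val.num) are invented for illustration, and treating a range-style p-value such as ">= 0.10" from the ProUCL gamma tests by taking its first number is an assumption about how those results may be reported.

library(EnvStats)

p.val.num <- function(p) {
  # Guard against p-values reported as a range (e.g., ">= 0.10") rather than a number.
  if (is.character(p)) p <- as.numeric(regmatches(p, regexpr("[0-9.]+", p)))
  p
}

passes <- function(gof.obj, alpha) p.val.num(gof.obj$p.value) >= alpha

prouclChoose <- function(x, censored, alpha = 0.05) {
  y <- x[!censored]  # use only the uncensored observations

  # Shapiro-Wilk and Lilliefors tests for the Normal distribution
  if (passes(gofTest(y, test = "sw",     distribution = "norm"), alpha) ||
      passes(gofTest(y, test = "lillie", distribution = "norm"), alpha))
    return("Normal")

  # ProUCL Anderson-Darling and Kolmogorov-Smirnov tests for the Gamma distribution
  if (passes(gofTest(y, test = "proucl.ad.gamma", distribution = "gamma"), alpha) ||
      passes(gofTest(y, test = "proucl.ks.gamma", distribution = "gamma"), alpha))
    return("Gamma")

  # Shapiro-Wilk and Lilliefors tests for the Lognormal distribution
  if (passes(gofTest(y, test = "sw",     distribution = "lnorm"), alpha) ||
      passes(gofTest(y, test = "lillie", distribution = "lnorm"), alpha))
    return("Lognormal")

  "Nonparametric"
}

# Example use:
set.seed(58)
x <- rgammaAlt(30, mean = 10, cv = 1)
censored <- x < 5
x[censored] <- 5
prouclChoose(x, censored)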
a list of class "distChooseCensored"
containing the results of the goodness-of-fit tests.
Objects of class "distChooseCensored"
have a special printing method.
See the help file for distChooseCensored.object
for details.
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlotCensored
).
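As a quick illustration of pairing the two approaches (using simulated data rather than any of the built-in data sets), one might compare the distChooseCensored decision with censored Q-Q plots for two of the candidate distributions:

library(EnvStats)

set.seed(42)
x <- rlnormAlt(40, mean = 10, cv = 1)
censored <- x < 4
x[censored] <- 4

distChooseCensored(x, censored)$decision

# Censored Q-Q plots for the normal and lognormal candidates
qqPlotCensored(x, censored, distribution = "norm",  add.line = TRUE)
qqPlotCensored(x, censored, distribution = "lnorm", add.line = TRUE)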
Steven P. Millard ([email protected])
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of g1.
Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality.
Empirical Results for the Distributions of b2 and √b1.
Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of √b1.
Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.
Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell' Istituto Italiano degle Attuari 4, 83-91.
Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). doi:10.18637/jss.v008.i18.
Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, pp.63-95.
Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.
Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.
Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.
Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
gofTestCensored
, distChooseCensored.object
,
print.distChooseCensored
, distChoose
.
# Generate 30 observations from a gamma distribution with # parameters mean=10 and cv=1 and then censor observations less than 5. # Then: # # 1) Call distChooseCensored using the Shapiro-Wilk method and specify # choices of the # normal, # gamma (alternative parameterzation), and # lognormal (alternative parameterization) # distributions. # # 2) Compare the results in 1) above with the results using the # ProUCL method. # # Notes: The call to set.seed lets you reproduce this example. # # The ProUCL method chooses the Normal distribution, whereas the # Shapiro-Wilk method chooses the Gamma distribution. set.seed(598) dat <- sort(rgammaAlt(30, mean = 10, cv = 1)) dat # [1] 0.5313509 1.4741833 1.9936208 2.7980636 3.4509840 # [6] 3.7987348 4.5542952 5.5207531 5.5253596 5.7177872 #[11] 5.7513827 9.1086375 9.8444090 10.6247123 10.9304922 #[16] 11.7925398 13.3432689 13.9562777 14.6029065 15.0563342 #[21] 15.8730642 16.0039936 16.6910715 17.0288922 17.8507891 #[26] 19.1105522 20.2657141 26.3815970 30.2912797 42.8726101 dat.censored <- dat censored <- dat.censored < 5 dat.censored[censored] <- 5 # 1) Call distChooseCensored using the Shapiro-Wilk method. #---------------------------------------------------------- distChooseCensored(dat.censored, censored, method = "sw", choices = c("norm", "gammaAlt", "lnormAlt")) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): mean = 12.4911448 # cv = 0.7617343 # #Estimation Method: MLE # #Data: dat.censored # #Sample Size: 30 # #Censoring Side: left # #Censoring Variable: censored # #Censoring Level(s): 5 # #Percent Censored: 23.33333% # #Test Results: # # Normal # Test Statistic: W = 0.9372741 # P-value: 0.1704876 # # Gamma # Test Statistic: W = 0.9613711 # P-value: 0.522329 # # Lognormal # Test Statistic: W = 0.9292406 # P-value: 0.114511 #-------------------- # 2) Compare the results in 1) above with the results using the # ProUCL method. #--------------------------------------------------------------- distChooseCensored(dat.censored, censored, method = "proucl") #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: ProUCL # #Type I Error per Test: 0.05 # #Decision: Normal # #Estimated Parameter(s): mean = 15.397584 # sd = 8.688302 # #Estimation Method: mvue # #Data: dat.censored # #Sample Size: 30 # #Censoring Side: left # #Censoring Variable: censored # #Censoring Level(s): 5 # #Percent Censored: 23.33333% # #ProUCL Sample Size: 23 # #Test Results: # # Normal # Shapiro-Wilk GOF # Test Statistic: W = 0.861652 # P-value: 0.004457924 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.1714435 # P-value: 0.07794315 # # Gamma # ProUCL Anderson-Darling Gamma GOF # Test Statistic: A = 0.3805556 # P-value: >= 0.10 # ProUCL Kolmogorov-Smirnov Gamma GOF # Test Statistic: D = 0.1035271 # P-value: >= 0.10 # # Lognormal # Shapiro-Wilk GOF # Test Statistic: W = 0.9532604 # P-value: 0.3414187 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.115588 # P-value: 0.5899259 #-------------------- # Clean up #--------- rm(dat, censored, dat.censored) #==================================================================== # Check the assumption that the silver data stored in Helsel.Cohn.88.silver.df # follows a lognormal distribution. 
# Note that the small p-value and the shape of the Q-Q plot # (an inverted S-shape) suggests that the log transformation is not quite strong # enough to "bring in" the tails (i.e., the log-transformed silver data has tails # that are slightly too long relative to a normal distribution). # Helsel and Cohn (1988, p.2002) note that the gross outlier of 560 mg/L tends to # make the shape of the data resemble a gamma distribution, but # the distChooseCensored function decision is neither Gamma nor Lognormal, # but instead Nonparametric. # First create a lognormal Q-Q plot #---------------------------------- dev.new() with(Helsel.Cohn.88.silver.df, qqPlotCensored(Ag, Censored, distribution = "lnorm", points.col = "blue", add.line = TRUE)) #---------- # Now call the distChooseCensored function using the default settings. #--------------------------------------------------------------------- with(Helsel.Cohn.88.silver.df, distChooseCensored(Ag, Censored)) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Francia # #Type I Error per Test: 0.05 # #Decision: Nonparametric # #Data: Ag # #Sample Size: 56 # #Censoring Side: left # #Censoring Variable: Censored # #Censoring Level(s): 0.1 0.2 0.3 0.5 1.0 2.0 2.5 5.0 6.0 10.0 20.0 25.0 # #Percent Censored: 60.71429% # #Test Results: # # Normal # Test Statistic: W = 0.3065529 # P-value: 8.346126e-08 # # Gamma # Test Statistic: W = 0.6254148 # P-value: 1.884155e-05 # # Lognormal # Test Statistic: W = 0.8957198 # P-value: 0.03490314 #---------- # Clean up #--------- graphics.off() #==================================================================== # Chapter 15 of USEPA (2009) gives several examples of looking # at normal Q-Q plots and estimating the mean and standard deviation # for manganese concentrations (ppb) in groundwater at five background # wells (USEPA, 2009, p. 15-10). The Q-Q plot shown in Figure 15-4 # on page 15-13 clearly indicates that the Lognormal distribution # is a good fit for these data. # In EnvStats these data are stored in the data frame EPA.09.Ex.15.1.manganese.df. # Here we will call the distChooseCensored function to determine # whether the data appear to come from a normal, gamma, or lognormal # distribution. # # Note that using the Probability Plot Correlation Coefficient method # (equivalent to using the Shapiro-Francia method) yields a decision of # Lognormal, but using the ProUCL method yields a decision of Gamma. #---------------------------------------------------------------------- # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... 
#23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Use distChooseCensored with the probability plot correlation method, # and for the gamma and lognormal distribution specify the # mean and CV parameterization: #------------------------------------------------------------ with(EPA.09.Ex.15.1.manganese.df, distChooseCensored(Manganese.ppb, Censored, choices = c("norm", "gamma", "lnormAlt"), method = "ppcc")) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: PPCC # #Type I Error per Test: 0.05 # #Decision: Lognormal # #Estimated Parameter(s): mean = 23.003987 # cv = 2.300772 # #Estimation Method: MLE # #Data: Manganese.ppb # #Sample Size: 25 # #Censoring Side: left # #Censoring Variable: Censored # #Censoring Level(s): 2 5 # #Percent Censored: 24% # #Test Results: # # Normal # Test Statistic: r = 0.9147686 # P-value: 0.004662658 # # Gamma # Test Statistic: r = 0.9844875 # P-value: 0.6836625 # # Lognormal # Test Statistic: r = 0.9931982 # P-value: 0.9767731 #-------------------- # Repeat the above example using the ProUCL method. #-------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, distChooseCensored(Manganese.ppb, Censored, method = "proucl")) #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: ProUCL # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): shape = 1.284882 # scale = 19.813413 # #Estimation Method: MLE # #Data: Manganese.ppb # #Sample Size: 25 # #Censoring Side: left # #Censoring Variable: Censored # #Censoring Level(s): 2 5 # #Percent Censored: 24% # #ProUCL Sample Size: 19 # #Test Results: # # Normal # Shapiro-Wilk GOF # Test Statistic: W = 0.7423947 # P-value: 0.0001862975 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.2768771 # P-value: 0.0004771155 # # Gamma # ProUCL Anderson-Darling Gamma GOF # Test Statistic: A = 0.6857121 # P-value: 0.05 <= p < 0.10 # ProUCL Kolmogorov-Smirnov Gamma GOF # Test Statistic: D = 0.1830034 # P-value: >= 0.10 # # Lognormal # Shapiro-Wilk GOF # Test Statistic: W = 0.969805 # P-value: 0.7725528 # Lilliefors (Kolmogorov-Smirnov) GOF # Test Statistic: D = 0.138547 # P-value: 0.4385195 #==================================================================== ## Not run: # 1) Simulate 1000 trials where for each trial you: # a) Generate 30 observations from a Gamma distribution with # parameters mean = 10 and CV = 1. # b) Censor observations less than 5 (the 39th percentile). # c) Use distChooseCensored with the Shapiro-Francia method. # d) Use distChooseCensored with the ProUCL method. # # 2) Compare the proportion of times the # Normal vs. Gamma vs. Lognormal vs. Nonparametric distribution # is chosen for c) and d) above. 
#------------------------------------------------------------------ set.seed(58) N <- 1000 Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric")) Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac) for(i in 1:N) { dat <- rgammaAlt(30, mean = 10, cv = 1) censored <- dat < 5 dat[censored] <- 5 Choose.df[i, "SW"] <- distChooseCensored(dat, censored, method = "sw")$decision Choose.df[i, "ProUCL"] <- distChooseCensored(dat, censored, method = "proucl")$decision } summaryStats(Choose.df, digits = 0) # ProUCL(N) ProUCL(Pct) SW(N) SW(Pct) #Normal 520 52 398 40 #Gamma 336 34 375 38 #Lognormal 105 10 221 22 #Nonparametric 39 4 6 1 #Combined 1000 100 1000 100 #-------------------- # Repeat above example for the Lognormal Distribution with mean=10 and CV = 1. # In this case, 5 is the 34th percentile. #----------------------------------------------------------------------------- set.seed(297) N <- 1000 Choose.fac <- factor(rep("", N), levels = c("Normal", "Gamma", "Lognormal", "Nonparametric")) Choose.df <- data.frame(SW = Choose.fac, ProUCL = Choose.fac) for(i in 1:N) { dat <- rlnormAlt(30, mean = 10, cv = 1) censored <- dat < 5 dat[censored] <- 5 Choose.df[i, "SW"] <- distChooseCensored(dat, censored, method = "sf")$decision Choose.df[i, "ProUCL"] <- distChooseCensored(dat, censored, method = "proucl")$decision } summaryStats(Choose.df, digits = 0) # ProUCL(N) ProUCL(Pct) SW(N) SW(Pct) #Normal 277 28 92 9 #Gamma 393 39 231 23 #Lognormal 190 19 624 62 #Nonparametric 140 14 53 5 #Combined 1000 100 1000 100 #-------------------- # Clean up #--------- rm(N, Choose.fac, Choose.df, i, dat, censored) ## End(Not run)
Objects of S3 class "distChooseCensored"
are returned by the EnvStats function
distChooseCensored
.
Objects of S3 class "distChooseCensored"
are lists that contain
information about the candidate distributions, the estimated distribution
parameters for each candidate distribution, and the test statistics and
p-values associated with each candidate distribution.
Required Components
The following components must be included in a legitimate list of
class "distChooseCensored"
.
choices |
a character vector containing the full names
of the candidate distributions. (see |
method |
a character string denoting which method was used. |
decision |
a character vector containing the full name of the chosen distribution. |
alpha |
a numeric scalar between 0 and 1 specifying the Type I error associated with each goodness-of-fit test. |
distribution.parameters |
a numeric vector containing the estimated parameters associated with the chosen distribution. |
estimation.method |
a character string indicating the method
used to compute the estimated parameters associated with the chosen
distribution. The value of this component will depend on the
available estimation methods (see |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit tests. |
censoring.side |
character string indicating whether the data are left- or right-censored. |
censoring.levels |
numeric scalar or vector indicating the censoring level(s). |
percent.censored |
numeric scalar indicating the percent of non-missing observations that are censored. |
test.results |
a list with the same number of components as the number
of elements in the component |
data.name |
character string indicating the name of the data object used for the goodness-of-fit tests. |
censoring.name |
character string indicating the name of the data object used to identify which values are censored. |
Optional Components
The following components are included in the result of
calling distChooseCensored
when the argument keep.data=TRUE
:
data |
numeric vector containing the data actually used for the goodness-of-fit tests (i.e., the original data without any missing or infinite values). |
censored |
logical vector containing the censoring status for the data actually used for the goodness-of-fit tests (i.e., the original data without any missing or infinite values). |
The following component is included in the result of
calling distChooseCensored
when missing (NA
),
undefined (NaN
) and/or infinite (Inf
, -Inf
)
values are present:
bad.obs |
numeric scalar indicating the number of missing ( |
Generic functions that have methods for objects of class
"distChooseCensored"
include: print
.
Since objects of class "distChooseCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
distChooseCensored
, print.distChooseCensored
,
Censored Data,
Goodness-of-Fit Tests,
Distribution.df
.
# Create an object of class "distChooseCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(598) dat <- rgammaAlt(30, mean = 10, cv = 1) censored <- dat < 5 dat[censored] <- 5 distChooseCensored.obj <- distChooseCensored(dat, censored, method = "sw", choices = c("norm", "gammaAlt", "lnormAlt")) mode(distChooseCensored.obj) #[1] "list" class(distChooseCensored.obj) #[1] "distChooseCensored" names(distChooseCensored.obj) # [1] "choices" "method" # [3] "decision" "alpha" # [5] "distribution.parameters" "estimation.method" # [7] "sample.size" "censoring.side" # [9] "censoring.levels" "percent.censored" #[11] "test.results" "data" #[13] "censored" "data.name" #[15] "censoring.name" distChooseCensored.obj #Results of Choosing Distribution #-------------------------------- # #Candidate Distributions: Normal # Gamma # Lognormal # #Choice Method: Shapiro-Wilk # #Type I Error per Test: 0.05 # #Decision: Gamma # #Estimated Parameter(s): mean = 12.4911448 # cv = 0.7617343 # #Estimation Method: MLE # #Data: dat.censored # #Sample Size: 30 # #Censoring Side: left # #Censoring Variable: censored # #Censoring Level(s): 5 # #Percent Censored: 23.33333% # #Test Results: # # Normal # Test Statistic: W = 0.9372741 # P-value: 0.1704876 # # Gamma # Test Statistic: W = 0.9613711 # P-value: 0.522329 # # Lognormal # Test Statistic: W = 0.9292406 # P-value: 0.114511 #========== # Extract the choices #-------------------- distChooseCensored.obj$choices #[1] "Normal" "Gamma" "Lognormal" #========== # Clean up #--------- rm(dat, censored, distChooseCensored.obj)
Data frame summarizing information about available probability distributions in R and the EnvStats package, and which distributions have associated functions for estimating distribution parameters.
Distribution.df
Distribution.df
A data frame with 35 rows corresponding to 35 different available probability distributions, and 25 columns containing information associated with these probability distributions.
Name
a character vector containing the name of the probability distribution (see the column labeled Name in the table below).
Type
a character vector indicating the type of
distribution (see the column labeled Type in the table below).
Possible values are "Finite Discrete"
, "Discrete"
,
"Continuous"
, and "Mixed"
.
Support.Min
a character vector indicating the minimum value
the random variable can assume (see the column labeled Range in
the table below). The reason this is a character vector instead of a
numeric vector is because some distributions have a lower bound that
depends on the value of a distribution parameter. For example,
the minimum value for a Uniform distribution is given by the
value of the parameter min
.
Support.Max
a character vector indicating the maximum value
the random variable can assume (see the column labeled Range in
the table below). The reason this is a character vector instead of a
numeric vector is because some distributions have an upper bound that
depends on the value of a distribution parameter. For example,
the maximum value for a Uniform distribution is given by the value
of the parameter max
.
Estimation.Method(s)
a character vector indicating the
names of the methods available to estimate the distribution parameter(s)
(see the column labeled Estimation Method(s) in the table below).
Possible values include "mle"
(maximum likelihood), "mme"
(method of moments), "mmue"
(method of moments based on the
unbiased estimate of variance), "mvue"
(minimum variance unbiased),
"qmle"
(quasi-mle), etc., or some combination of these. In
cases where an estimator is more than one kind, a slash (/
) is
used to denote all methods covered by the single estimator. For example,
for the Binomial distribution, the sample proportion is the maximum
likelihood, method of moments, and minimum variance unbiased estimator,
so this method is denoted as "mle/mme/mvue"
. See the help files
for the specific function listed under
Estimating Distribution Parameters for an
explanation of each of these estimation methods.
Quantile.Estimation.Method(s)
a character vector indicating
the names of the methods available to estimate the distribution
quantiles. For many distributions, these are the same as
Estimation.Method(s)
. See the help files for the specific
function listed under
Estimating Distribution Quantiles for an
explanation of each of these estimation methods.
Prediction.Interval.Method(s)
a character vector indicating the names of the methods available to create prediction intervals. See the help files for the specific function listed under Prediction Intervals for an explanation of each of these estimation methods.
Singly.Censored.Estimation.Method(s)
a character vector indicating the names of the methods available to estimate the distribution parameter(s) for Type I singly-censored data. See the help files for the specific function listed under Estimating Distribution Parameters in the help file for Censored Data for an explanation of each of these estimation methods.
Multiply.Censored.Estimation.Method(s)
a character vector indicating the names of the methods available to estimate the distribution parameter(s) for Type I multiply-censored data. See the help files for the specific function listed under Estimating Distribution Parameters in the help file for Censored Data for an explanation of each of these estimation methods.
Number.parameters
a numeric vector indicating the number of parameters associated with the distribution (see the column labeled Parameters in the table below).
Parameter.1
the columns labeled Parameter.1, Parameter.2, ..., Parameter.5 are character vectors containing the names of the distribution parameters (see the column labeled Parameters in the table below). If a distribution has n parameters and n < 5, then the columns labeled Parameter.(n+1), ..., Parameter.5 are empty. For example, the Normal distribution has only two parameters associated with it (mean and sd), so the fields in Parameter.3, Parameter.4, and Parameter.5 are empty.
Parameter.2
see Parameter.1
Parameter.3
see Parameter.1
Parameter.4
see Parameter.1
Parameter.5
see Parameter.1
Parameter.1.Min
the columns labeled Parameter.1.Min, Parameter.2.Min, ..., Parameter.5.Min are character vectors containing the minimum values that can be assumed by the distribution parameters (see the column labeled Parameter Range(s) in the table below). The reason these are character vectors instead of numeric vectors is because some parameters have a lower bound of 0 but must be strictly bigger than 0 (e.g., the parameter sd for the Normal distribution), in which case the lower bound is .Machine$double.eps, which may vary from machine to machine. Also, some parameters have a lower bound that depends on the value of another parameter. For example, the parameter max for a Uniform distribution is bounded below by the value of the parameter min.
If a distribution has n parameters and n < 5, then the columns labeled Parameter.(n+1).Min, ..., Parameter.5.Min have the missing value code (NA). For example, the Normal distribution has only two parameters associated with it (mean and sd), so the fields in Parameter.3.Min, Parameter.4.Min, and Parameter.5.Min have NAs in them.
Parameter.2.Min
see Parameter.1.Min
Parameter.3.Min
see Parameter.1.Min
Parameter.4.Min
see Parameter.1.Min
Parameter.5.Min
see Parameter.1.Min
Parameter.1.Max
the columns labeled Parameter.1.Max, Parameter.2.Max, ..., Parameter.5.Max are character vectors containing the maximum values that can be assumed by the distribution parameters (see the column labeled Parameter Range(s) in the table below). The reason these are character vectors instead of numeric vectors is because some parameters have an upper bound that depends on the value of another parameter. For example, the parameter min for a Uniform distribution is bounded above by the value of the parameter max.
If a distribution has n parameters and n < 5, then the columns labeled Parameter.(n+1).Max, ..., Parameter.5.Max have the missing value code (NA). For example, the Normal distribution has only two parameters associated with it (mean and sd), so the fields in Parameter.3.Max, Parameter.4.Max, and Parameter.5.Max have NAs in them.
Parameter.2.Max
see Parameter.1.Max
Parameter.3.Max
see Parameter.1.Max
Parameter.4.Max
see Parameter.1.Max
Parameter.5.Max
see Parameter.1.Max
The table below summarizes the probability distributions available in R and EnvStats. For each distribution, there are four associated functions for computing density values, percentiles, quantiles, and random numbers. The form of the names of these functions are dabb, pabb, qabb, and rabb, where abb is the abbreviated name of the distribution (see table below). These functions are described in the help file with the name of the distribution (see the first column of the table below). For example, the help file for Beta describes the behavior of dbeta, pbeta, qbeta, and rbeta.
For most distributions, there is also an associated function for estimating the distribution parameters, and the form of the names of these functions is eabb, where abb is the abbreviated name of the distribution (see table below). All of these functions are listed in the help file Estimating Distribution Parameters. For example, the function ebeta estimates the shape parameters of a Beta distribution based on a random sample of observations from this distribution.
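As a quick, informal illustration of this naming convention (not part of the original help file), the following R snippet uses the d, p, q, r, and e functions for the Beta distribution:

library(EnvStats)

set.seed(123)
x <- rbeta(20, shape1 = 2, shape2 = 4)  # r + abb: random numbers
dbeta(0.3, shape1 = 2, shape2 = 4)      # d + abb: density at 0.3
pbeta(0.3, shape1 = 2, shape2 = 4)      # p + abb: cumulative probability at 0.3
qbeta(0.5, shape1 = 2, shape2 = 4)      # q + abb: 50th percentile (median)
ebeta(x)                                # e + abb: estimate shape1 and shape2 from the sample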
For some distributions, there are functions to estimate distribution parameters based on Type I censored data. The form of the names of these functions is eabbSinglyCensored for singly censored data and eabbMultiplyCensored for multiply censored data. All of these functions are listed under the heading Estimating Distribution Parameters in the help file Censored Data.
Table 1a. Available Distributions: Name, Abbreviation, Type, and Range

Name | Abbreviation | Type | Range
Beta | beta | Continuous | [0, 1]
Binomial | binom | Finite Discrete | 0, 1, ..., size (integer)
Cauchy | cauchy | Continuous | (-Inf, Inf)
Chi | chi | Continuous | [0, Inf)
Chi-square | chisq | Continuous | [0, Inf)
Exponential | exp | Continuous | [0, Inf)
Extreme Value | evd | Continuous | (-Inf, Inf)
F | f | Continuous | [0, Inf)
Gamma | gamma | Continuous | [0, Inf)
Gamma (Alternative) | gammaAlt | Continuous | [0, Inf)
Generalized Extreme Value | gevd | Continuous | (-Inf, Inf) for shape = 0; (-Inf, location + scale/shape] for shape > 0; [location + scale/shape, Inf) for shape < 0
Geometric | geom | Discrete | 0, 1, 2, ... (integer)
Hypergeometric | hyper | Finite Discrete | max(0, k - n), ..., min(k, m) (integer)
Logistic | logis | Continuous | (-Inf, Inf)
Lognormal | lnorm | Continuous | [0, Inf)
Lognormal (Alternative) | lnormAlt | Continuous | [0, Inf)
Lognormal Mixture | lnormMix | Continuous | [0, Inf)
Lognormal Mixture (Alternative) | lnormMixAlt | Continuous | [0, Inf)
Three-Parameter Lognormal | lnorm3 | Continuous | [threshold, Inf)
Truncated Lognormal | lnormTrunc | Continuous | [min, max]
Truncated Lognormal (Alternative) | lnormTruncAlt | Continuous | [min, max]
Negative Binomial | nbinom | Discrete | 0, 1, 2, ... (integer)
Normal | norm | Continuous | (-Inf, Inf)
Normal Mixture | normMix | Continuous | (-Inf, Inf)
Truncated Normal | normTrunc | Continuous | [min, max]
Pareto | pareto | Continuous | [location, Inf)
Poisson | pois | Discrete | 0, 1, 2, ... (integer)
Student's t | t | Continuous | (-Inf, Inf)
Triangular | tri | Continuous | [min, max]
Uniform | unif | Continuous | [min, max]
Weibull | weibull | Continuous | [0, Inf)
Wilcoxon Rank Sum | wilcox | Finite Discrete | 0, 1, ..., m*n (integer)
Zero-Modified Lognormal (Delta) | zmlnorm | Mixed | [0, Inf)
Zero-Modified Lognormal (Delta) (Alternative) | zmlnormAlt | Mixed | [0, Inf)
Zero-Modified Normal | zmnorm | Mixed | (-Inf, Inf)
Table 1b. Available Distributions: Name, Parameters, Parameter Default Values, Parameter Ranges, Estimation Method(s)

Name | Parameter(s) | Default Value(s) | Parameter Range(s) | Estimation Method(s)
Beta | shape1 | | shape1 > 0 | mle, mme, mmue
 | shape2 | | shape2 > 0 |
 | ncp | 0 | ncp >= 0 |
Binomial | size | | size = 1, 2, 3, ... | mle/mme/mvue
 | prob | | 0 <= prob <= 1 |
Cauchy | location | 0 | -Inf < location < Inf |
 | scale | 1 | scale > 0 |
Chi | df | | df > 0 |
Chi-square | df | | df > 0 |
 | ncp | 0 | ncp >= 0 |
Exponential | rate | 1 | rate > 0 | mle/mme
Extreme Value | location | 0 | -Inf < location < Inf | mle, mme, mmue, pwme
 | scale | 1 | scale > 0 |
F | df1 | | df1 > 0 |
 | df2 | | df2 > 0 |
 | ncp | 0 | ncp >= 0 |
Gamma | shape | | shape > 0 | mle, bcmle, mme, mmue
 | scale | 1 | scale > 0 |
Gamma (Alternative) | mean | | mean > 0 | mle, bcmle, mme, mmue
 | cv | 1 | cv > 0 |
Generalized Extreme Value | location | 0 | -Inf < location < Inf | mle, pwme, tsoe
 | scale | 1 | scale > 0 |
 | shape | 0 | -Inf < shape < Inf |
Geometric | prob | | 0 < prob <= 1 | mle/mme, mvue
Hypergeometric | m | | m = 0, 1, 2, ... | mle, mvue
 | n | | n = 0, 1, 2, ... |
 | k | | k = 1, 2, ..., m + n |
Logistic | location | 0 | -Inf < location < Inf | mle, mme, mmue
 | scale | 1 | scale > 0 |
Lognormal | meanlog | 0 | -Inf < meanlog < Inf | mle/mme, mvue
 | sdlog | 1 | sdlog > 0 |
Lognormal (Alternative) | mean | exp(1/2) | mean > 0 | mle, mme, mmue, mvue, qmle
 | cv | sqrt(exp(1)-1) | cv > 0 |
Lognormal Mixture | meanlog1 | 0 | -Inf < meanlog1 < Inf |
 | sdlog1 | 1 | sdlog1 > 0 |
 | meanlog2 | 0 | -Inf < meanlog2 < Inf |
 | sdlog2 | 1 | sdlog2 > 0 |
 | p.mix | 0.5 | 0 <= p.mix <= 1 |
Lognormal Mixture (Alternative) | mean1 | exp(1/2) | mean1 > 0 |
 | cv1 | sqrt(exp(1)-1) | cv1 > 0 |
 | mean2 | exp(1/2) | mean2 > 0 |
 | cv2 | sqrt(exp(1)-1) | cv2 > 0 |
 | p.mix | 0.5 | 0 <= p.mix <= 1 |
Three-Parameter Lognormal | meanlog | 0 | -Inf < meanlog < Inf | lmle, mme, mmue, mmme, royston.skew, zero.skew
 | sdlog | 1 | sdlog > 0 |
 | threshold | 0 | -Inf < threshold < Inf |
Truncated Lognormal | meanlog | 0 | -Inf < meanlog < Inf |
 | sdlog | 1 | sdlog > 0 |
 | min | 0 | 0 <= min < max |
 | max | Inf | min < max <= Inf |
Truncated Lognormal (Alternative) | mean | exp(1/2) | mean > 0 |
 | cv | sqrt(exp(1)-1) | cv > 0 |
 | min | 0 | 0 <= min < max |
 | max | Inf | min < max <= Inf |
Negative Binomial | size | | size = 1, 2, 3, ... | mle/mme, mvue
 | prob | | 0 < prob <= 1 |
 | mu | | mu > 0 |
Normal | mean | 0 | -Inf < mean < Inf | mle/mme, mvue
 | sd | 1 | sd > 0 |
Normal Mixture | mean1 | 0 | -Inf < mean1 < Inf |
 | sd1 | 1 | sd1 > 0 |
 | mean2 | 0 | -Inf < mean2 < Inf |
 | sd2 | 1 | sd2 > 0 |
 | p.mix | 0.5 | 0 <= p.mix <= 1 |
Truncated Normal | mean | 0 | -Inf < mean < Inf |
 | sd | 1 | sd > 0 |
 | min | -Inf | -Inf <= min < max |
 | max | Inf | min < max <= Inf |
Pareto | location | | location > 0 | lse, mle
 | shape | 1 | shape > 0 |
Poisson | lambda | | lambda > 0 | mle/mme/mvue
Student's t | df | | df > 0 |
 | ncp | 0 | -Inf < ncp < Inf |
Triangular | min | 0 | -Inf < min < max |
 | max | 1 | min < max < Inf |
 | mode | 0.5 | min < mode < max |
Uniform | min | 0 | -Inf < min < max | mle, mme, mmue
 | max | 1 | min < max < Inf |
Weibull | shape | | shape > 0 | mle, mme, mmue
 | scale | 1 | scale > 0 |
Wilcoxon Rank Sum | m | | m = 1, 2, 3, ... |
 | n | | n = 1, 2, 3, ... |
Zero-Modified Lognormal (Delta) | meanlog | 0 | -Inf < meanlog < Inf | mvue
 | sdlog | 1 | sdlog > 0 |
 | p.zero | 0.5 | 0 <= p.zero <= 1 |
Zero-Modified Lognormal (Delta) (Alternative) | mean | exp(1/2) | mean > 0 | mvue
 | cv | sqrt(exp(1)-1) | cv > 0 |
 | p.zero | 0.5 | 0 <= p.zero <= 1 |
Zero-Modified Normal | mean | 0 | -Inf < mean < Inf | mvue
 | sd | 1 | sd > 0 |
 | p.zero | 0.5 | 0 <= p.zero <= 1 |
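For quick reference from within R, the information summarized in these tables can also be looked up directly in the Distribution.df data frame. A minimal sketch, assuming the row names of Distribution.df are the abbreviations listed above:

library(EnvStats)

# Peek at a few of the 25 columns for three distributions
Distribution.df[c("norm", "gamma", "lnorm"),
    c("Name", "Type", "Support.Min", "Support.Max", "Estimation.Method(s)")]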
The EnvStats package.
Millard, S.P. (2013). EnvStats: An R Package for Environmental Statistics. Springer, New York. https://link.springer.com/book/10.1007/978-1-4614-8456-1.
Estimate the shape parameters of a beta distribution.
ebeta(x, method = "mle")
x |
numeric vector of observations. All observations must be greater than 0 and less than 1. |
method |
character string specifying the method of estimation. The possible values are "mle" (maximum likelihood; the default), "mme" (method of moments), and "mmue" (method of moments based on the unbiased estimator of variance). See the DETAILS section for more information. |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let x = (x_1, x_2, ..., x_n) be a vector of n observations from a beta distribution with parameters shape1=α and shape2=β.
Maximum Likelihood Estimation (method="mle")
The maximum likelihood estimates of α and β are the values that satisfy the simultaneous equations:

ψ(α) - ψ(α + β) = (1/n) Σ log(x_i)

ψ(β) - ψ(α + β) = (1/n) Σ log(1 - x_i)

where ψ denotes the digamma function (Forbes et al., 2011).
Method of Moments Estimators (method="mme")
The method of moments estimators (mme's) of the shape parameters α and β are given by (Forbes et al., 2011):

α̂ = x̄ [x̄(1 - x̄)/s_m^2 - 1]

β̂ = (1 - x̄) [x̄(1 - x̄)/s_m^2 - 1]

where x̄ = (1/n) Σ x_i denotes the sample mean and s_m^2 = (1/n) Σ (x_i - x̄)^2 denotes the method of moments estimator of variance.
Method of Moments Estimators Based on the Unbiased Estimator of Variance (method="mmue")
These estimators are the same as the method of moments estimators except that the method of moments estimator of variance is replaced with the unbiased estimator of variance:

s^2 = [1/(n - 1)] Σ (x_i - x̄)^2
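As an informal check of these formulas (not part of the original examples), the following sketch computes the mme and mmue estimates by hand and compares them with ebeta():

library(EnvStats)

set.seed(250)
x <- rbeta(20, shape1 = 2, shape2 = 4)

m <- mean(x)
v.mme  <- mean((x - m)^2)   # method of moments variance (divide by n)
v.mmue <- var(x)            # unbiased variance (divide by n - 1)

mme.shape <- function(m, v) {
  k <- m * (1 - m) / v - 1
  c(shape1 = m * k, shape2 = (1 - m) * k)
}

mme.shape(m, v.mme)    # compare with ebeta(x, method = "mme")$parameters
mme.shape(m, v.mmue)   # compare with ebeta(x, method = "mmue")$parameters
ebeta(x, method = "mme")$parameters
ebeta(x, method = "mmue")$parameters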
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The beta distribution takes real values between 0 and 1. Special cases of the beta are the Uniform[0,1] distribution when shape1=1 and shape2=1, and the arcsin distribution when shape1=0.5 and shape2=0.5. The arcsin distribution appears in the theory of random walks. The beta distribution is used in Bayesian analyses as a conjugate to the binomial distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Beta.
# Generate 20 observations from a beta distribution with parameters # shape1=2 and shape2=4, then estimate the parameters via # maximum likelihood. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rbeta(20, shape1 = 2, shape2 = 4) ebeta(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Beta # #Estimated Parameter(s): shape1 = 5.392221 # shape2 = 11.823233 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 #========== # Repeat the above, but use the method of moments estimators: ebeta(dat, method = "mme") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Beta # #Estimated Parameter(s): shape1 = 5.216311 # shape2 = 11.461341 # #Estimation Method: mme # #Data: dat # #Sample Size: 20 #========== # Clean up #--------- rm(dat)
Estimate p (the probability of “success”) for a binomial distribution, and optionally construct a confidence interval for p.
ebinom(x, size = NULL, method = "mle/mme/mvue", ci = FALSE, ci.type = "two-sided", ci.method = "score", correct = TRUE, var.denom = "n", conf.level = 0.95, warn = TRUE)
x |
numeric or logical vector of observations. When |
size |
positive integer indicating the number of trials; |
method |
character string specifying the method of estimation. The only possible value is "mle/mme/mvue" (maximum likelihood, method of moments, and minimum variance unbiased). See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the mean. The default value
is |
ci.type |
character string indicating what kind of confidence interval to compute. The possible values are
|
ci.method |
character string indicating which method to use to construct the confidence interval. Possible values
are |
correct |
logical scalar indicating whether to use the continuity correction when |
var.denom |
character string indicating what value to use in the denominator of the variance estimator when
|
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default
value is |
warn |
a logical scalar indicating whether to issue a warning in the case when |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
If x = (x_1, x_2, ..., x_n) is a vector of n observations from a binomial distribution with parameters size=1 and prob=p, then the sum of all the values in x is an observation from a binomial distribution with parameters size=n and prob=p.
If X is an observation from a binomial distribution with parameters size=n and prob=p, the maximum likelihood estimator (mle), method of moments estimator (mme), and minimum variance unbiased estimator (mvue) of p is simply X/n.
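A minimal sketch of this relationship (illustrative only, not from the original examples):

set.seed(251)
y <- rbinom(20, size = 1, prob = 0.2)   # 20 Bernoulli (0/1) trials
x <- sum(y)                             # one Binomial(20, p) observation
x / 20                                  # mle/mme/mvue of p; ebinom(y) returns the same estimate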
Confidence Intervals.
ci.method="score"
The confidence interval for p based on the
score method was developed by Wilson (1927) and is discussed by Newcombe (1998a),
Agresti and Coull (1998), and Agresti and Caffo (2000). When
ci=TRUE
and
ci.method="score"
, the function ebinom
calls the R function
prop.test
to compute the confidence interval. This method
has been shown to provide the best performance (in terms of actual coverage matching assumed
coverage) of all the methods provided here, although unlike the exact method, the actual
coverage can fall below the assumed coverage.
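For illustration only (the values x = 7 and n = 20 below are arbitrary), the Wilson score interval can be computed directly from its formula and compared with prop.test(), which ebinom() calls when ci.method="score":

x <- 7            # number of "successes"
n <- 20           # number of trials
conf.level <- 0.95
z <- qnorm(1 - (1 - conf.level) / 2)
p.hat <- x / n
center <- (p.hat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(p.hat * (1 - p.hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
c(LCL = center - half, UCL = center + half)

# Same interval (without continuity correction) from prop.test():
prop.test(x, n, conf.level = conf.level, correct = FALSE)$conf.int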
ci.method="exact"
The confidence interval for p based on the
exact (Clopper-Pearson) method is discussed by Newcombe (1998a), Agresti and Coull (1998),
and Zar (2010, pp.543-547). This is the method used in the R function
binom.test
. This method ensures the actual coverage is greater than or
equal to the assumed coverage.
ci.method="Wald"
The confidence interval for p based on the Wald method
(with or without a correction for continuity) is the usual “normal approximation”
method and is discussed by Newcombe (1998a), Agresti and Coull (1998), Agresti and Caffo (2000),
and Zar (2010, pp.543-547). This method is never recommended but is included
for historical purposes.
ci.method="adjusted Wald"
The confidence interval for p based on the
adjusted Wald method is discussed by Agresti and Coull (1998), Agresti and Caffo (2000), and
Zar (2010, pp.543-547). This is a simple modification of the Wald method and
performs surprisingly well.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The binomial distribution is used to model processes with binary (Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent of any other trial, and that the probability of “success”, p, is the same on each trial. A binomial discrete random variable X is the number of “successes” in n independent trials. A special case of the binomial distribution occurs when size=1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model the proportion of times a chemical concentration exceeds a set standard in a given period of time (e.g., Gilbert, 1987, p.143). The binomial distribution is also used to compute an upper bound on the overall Type I error rate for deciding whether a facility or location is in compliance with some set standard. Assume the null hypothesis is that the facility is in compliance. If a test of hypothesis is conducted periodically over time to test compliance and/or several tests are performed during each time period, and the facility or location is always in compliance, and each single test has a Type I error rate of α, and the result of each test is independent of the result of any other test (usually not a reasonable assumption), then the number of times the facility is declared out of compliance when in fact it is in compliance is a binomial random variable with probability of “success” equal to α, the probability of being declared out of compliance (see USEPA, 2009).
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 3.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
Binomial, prop.test
, binom.test
,
ciBinomHalfWidth
, ciBinomN
,
plotCiBinomDesign
.
# Generate 20 observations from a binomial distribution with # parameters size=1 and prob=0.2, then estimate the 'prob' parameter. # (Note: the call to set.seed simply allows you to reproduce this # example. Also, the only parameter estimated is 'prob'; 'size' is # specified in the call to ebinom. The parameter 'size' is printed # inorder to show all of the parameters associated with the # distribution.) set.seed(251) dat <- rbinom(20, size = 1, prob = 0.2) ebinom(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Binomial # #Estimated Parameter(s): size = 20.0 # prob = 0.1 # #Estimation Method: mle/mme/mvue for 'prob' # #Data: dat # #Sample Size: 20 #---------------------------------------------------------------- # Generate one observation from a binomial distribution with # parameters size=20 and prob=0.2, then estimate the "prob" # parameter and compute a confidence interval: set.seed(763) dat <- rbinom(1, size=20, prob=0.2) ebinom(dat, size = 20, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Binomial # #Estimated Parameter(s): size = 20.00 # prob = 0.35 # #Estimation Method: mle/mme/mvue for 'prob' # #Data: dat # #Sample Size: 20 # #Confidence Interval for: prob # #Confidence Interval Method: Score normal approximation # (With continuity correction) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.1630867 # UCL = 0.5905104 #---------------------------------------------------------------- # Using the data from the last example, compare confidence # intervals based on the various methods ebinom(dat, size = 20, ci = TRUE, ci.method = "score", correct = TRUE)$interval$limits # LCL UCL #0.1630867 0.5905104 ebinom(dat, size = 20, ci = TRUE, ci.method = "score", correct = FALSE)$interval$limits # LCL UCL #0.1811918 0.5671457 ebinom(dat, size = 20, ci = TRUE, ci.method = "exact")$interval$limits # LCL UCL #0.1539092 0.5921885 ebinom(dat, size = 20, ci = TRUE, ci.method = "adjusted Wald")$interval$limits # LCL UCL #0.1799264 0.5684112 ebinom(dat, size = 20, ci = TRUE, ci.method = "Wald", correct = TRUE)$interval$limits # LCL UCL #0.1159627 0.5840373 ebinom(dat, size = 20, ci = TRUE, ci.method = "Wald", correct = FALSE)$interval$limits # LCL UCL #0.1409627 0.5590373 #---------------------------------------------------------------- # Use the cadmium data on page 8-6 of USEPA (1989b) to compute # two-sided 95% confidence intervals for the probability of # detection at background and compliance wells. The data are # stored in EPA.89b.cadmium.df. EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background #... 
#86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance attach(EPA.89b.cadmium.df) # Probability of detection at Background well: #-------------------------------------------- ebinom(!Censored[Well.type=="Background"], ci=TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Binomial # #Estimated Parameter(s): size = 24.0000000 # prob = 0.3333333 # #Estimation Method: mle/mme/mvue for 'prob' # #Data: !Censored[Well.type == "Background"] # #Sample Size: 24 # #Confidence Interval for: prob # #Confidence Interval Method: Score normal approximation # (With continuity correction) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.1642654 # UCL = 0.5530745 # Probability of detection at Compliance well: #-------------------------------------------- ebinom(!Censored[Well.type=="Compliance"], ci=TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Binomial # #Estimated Parameter(s): size = 64.000 # prob = 0.375 # #Estimation Method: mle/mme/mvue for 'prob' # #Data: !Censored[Well.type == "Compliance"] # #Sample Size: 64 # #Confidence Interval for: prob # #Confidence Interval Method: Score normal approximation # (With continuity correction) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.2597567 # UCL = 0.5053034 #---------------------------------------------------------------- # Clean up rm(dat) detach("EPA.89b.cadmium.df")
Produce an empirical cumulative distribution function plot.
ecdfPlot(x, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = 0.375, plot.it = TRUE, add = FALSE, ecdf.col = "black", ecdf.lwd = 3 * par("cex"), ecdf.lty = 1, curve.fill = FALSE, curve.fill.col = "cyan", ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions (empirical probabilities).
Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is |
plot.it |
logical scalar indicating whether to produce a plot or add to the current plot (see |
add |
logical scalar indicating whether to add the empirical cdf to the current plot ( |
ecdf.col |
a numeric scalar or character string determining the color of the empirical cdf line or points.
The default value is |
ecdf.lwd |
a numeric scalar determining the width of the empirical cdf line. The default value is
|
ecdf.lty |
a numeric scalar determining the line type of the empirical cdf line. The default value is
|
curve.fill |
a logical scalar indicating whether to fill in the area below the empirical cdf curve with the
color specified by |
curve.fill.col |
a numeric scalar or character string indicating what color to use to fill in the area below the
empirical cdf curve. The default value is |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The cumulative distribution function (cdf) of a random variable X is the function F such that

F(x) = Pr(X ≤ x)

for all values of x. That is, if p = F(x), then p is the proportion of the population that is less than or equal to x, and x is called the p'th quantile, or the 100p'th percentile. A plot of the quantiles on the x-axis (i.e., the possible values of the random variable X) vs. the fraction of the population less than or equal to that number on the y-axis is called the cumulative distribution function plot, and the y-axis is usually labeled as the “cumulative probability” or “cumulative frequency”.
When we have a sample of data from some population, we usually do not know what percentiles our observations correspond to because we do not know the form of the cumulative distribution function F, so we have to use the sample data to estimate the cdf F. An empirical cumulative distribution function (ecdf) plot, also called a quantile plot, is a plot of the observed quantiles (i.e., the ordered observations) on the x-axis vs. the estimated cumulative probabilities on the y-axis (Chambers et al., 1983, pp. 11-19; Cleveland, 1993, pp. 17-20; Cleveland, 1994, pp. 136-139; Helsel and Hirsch, 1992, pp. 21-24). (Note: Some authors (e.g., Chambers et al., 1983, pp.11-16; Cleveland, 1993, pp.17-20) reverse the axes on a quantile plot, i.e., the observed order statistics from the random sample are on the y-axis and the estimated cumulative probabilities are on the x-axis.)
The empirical cumulative distribution function (ecdf) is an estimate of the cdf based on a random sample of n observations from the distribution. Let x_1, x_2, ..., x_n denote the n observations, and let x_(1), x_(2), ..., x_(n) denote the ordered observations (i.e., the order statistics). The cdf is usually estimated by either the empirical probabilities estimator or the plotting-position estimator. The empirical probabilities estimator is given by:

F̂[x_(i)] = (number of observations less than or equal to x_(i)) / n

The plotting-position estimator is given by:

F̂[x_(i)] = (i - a) / (n - 2a + 1)

where 0 ≤ a ≤ 1 denotes the plotting position constant (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8, 25).
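As an informal illustration (not part of the original help file), the two estimators can be computed by hand for a small sample and compared with ppoints(), which uses the same plotting-position formula:

set.seed(250)
x <- sort(rnorm(10))
n <- length(x)
i <- seq_len(n)

p.emp <- i / n                      # empirical probabilities estimator
a <- 0.375                          # plotting position constant (default in ecdfPlot)
p.pos <- (i - a) / (n - 2 * a + 1)  # plotting-position estimator

cbind(x, p.emp, p.pos)
all.equal(p.pos, ppoints(n, a = 0.375))   # ppoints() uses the same formula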
For any value x such that x_(1) ≤ x ≤ x_(n), the ecdf is usually defined as either a step function:

F̂(x) = F̂[x_(i)],  for x_(i) ≤ x < x_(i+1)

(e.g., D'Agostino, 1986a), or linear interpolation between order statistics is used:

F̂(x) = (1 - r) F̂[x_(i)] + r F̂[x_(i+1)],  for x_(i) ≤ x ≤ x_(i+1),  where r = [x - x_(i)] / [x_(i+1) - x_(i)]

(e.g., Chambers et al., 1983). For the step function version, the ecdf stays flat until it hits a value on the x-axis corresponding to one of the order statistics, then it makes a jump. For the linear interpolation version, the ecdf plot looks like lines connecting the points.
By default, the function ecdfPlot uses the step function version when discrete=TRUE, and the linear interpolation version when discrete=FALSE. The user may override these defaults by supplying the graphics parameter type (type="s" for a step function, type="l" for linear interpolation, type="p" for points only, etc.).
The empirical probabilities estimator is intuitively appealing. This is the estimator used when
prob.method="emp.probs"
. The disadvantage of this estimator is that it implies the largest
observed value is the maximum possible value of the distribution (i.e., the 100'th percentile). This
may be satisfactory if the underlying distribution is known to be discrete, but it is usually not
satisfactory if the underlying distribution is known to be continuous.
The plotting-position estimator with various values of a is often used when the goal is to produce a probability plot (see qqPlot) rather than an empirical cdf plot. It is used to compute the estimated expected values or medians of the order statistics for a probability plot. This is the estimator used when prob.method="plot.pos". The argument plot.pos.con refers to the constant a. Based on certain principles from statistical theory, certain values of the constant a make sense for specific underlying distributions (see the help file for qqPlot for more information).
Because x is a random sample, the empirical cdf changes from sample to sample and the variability in these estimates can be dramatic for small sample sizes.
ecdfPlot
invisibly returns a list with the following components:
Order.Statistics |
numeric vector of the ordered observations. |
Cumulative.Probabilities |
numeric vector of the associated plotting positions. |
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
Chambers et al. (1983, pp.11-16) plot the observed order statistics on the y-axis vs. the ecdf on the x-axis and call this a quantile plot.
Empirical cumulative distribution function (ecdf) plots are often plotted with
theoretical cdf plots (see cdfPlot
and cdfCompare
) to
graphically assess whether a sample of observations comes from a particular
distribution. The Kolmogorov-Smirnov goodness-of-fit test
(see gofTest
) is the statistical companion of this kind of
comparison; it is based on the maximum vertical distance between the empirical
cdf plot and the theoretical cdf plot. More often, however,
quantile-quantile (Q-Q) plots are used instead of ecdf plots to graphically assess
departures from an assumed distribution (see qqPlot
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
ppoints
, cdfPlot
, cdfCompare
,
qqPlot
, ecdfPlotCensored
.
# Generate 20 observations from a normal distribution with # mean=0 and sd=1 and create an ecdf plot. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rnorm(20) dev.new() ecdfPlot(x) #---------- # Repeat the above example, but fill in the area under the # empirical cdf curve. dev.new() ecdfPlot(x, curve.fill = TRUE) #---------- # Repeat the above example, but plot only the points. dev.new() ecdfPlot(x, type = "p") #---------- # Repeat the above example, but force a step function. dev.new() ecdfPlot(x, type = "s") #---------- # Clean up rm(x) #------------------------------------------------------------------------------------- # The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. # # Create an empirical CDF plot for the reference area data. dev.new() with(EPA.94b.tccb.df, ecdfPlot(TcCB[Area == "Reference"], xlab = "TcCB (ppb)")) #========== # Clean up #--------- graphics.off()
Produce an empirical cumulative distribution function plot for Type I left-censored or right-censored data.
ecdfPlotCensored(x, censored, censoring.side = "left", discrete = FALSE, prob.method = "michael-schucany", plot.pos.con = 0.375, plot.it = TRUE, add = FALSE, ecdf.col = 1, ecdf.lwd = 3 * par("cex"), ecdf.lty = 1, include.cen = FALSE, cen.pch = ifelse(censoring.side == "left", 6, 2), cen.cex = par("cex"), cen.col = 4, ..., type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions (empirical probabilities).
Possible values are The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is |
plot.it |
logical scalar indicating whether to produce a plot or add to the current plot (see |
add |
logical scalar indicating whether to add the empirical cdf to the current plot ( |
ecdf.col |
a numeric scalar or character string determining the color of the empirical cdf line or points.
The default value is |
ecdf.lwd |
a numeric scalar determining the width of the empirical cdf line. The default value is
|
ecdf.lty |
a numeric scalar determining the line type of the empirical cdf line. The default value is
|
include.cen |
logical scalar indicating whether to include censored values in the plot. The default value is
|
cen.pch |
numeric scalar or character string indicating the plotting character to use to plot censored values.
The default value is |
cen.cex |
numeric scalar that determines the size of the plotting character used to plot censored values.
The default value is the current value of the cex graphics parameter. See the entry for |
cen.col |
numeric scalar or character string that determines the color of the plotting character used to
plot censored values. The default value is |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The function ecdfPlotCensored
does exactly the same thing as
ecdfPlot
, except it calls the function ppointsCensored
to compute the plotting positions (estimated cumulative probabilities) for the
uncensored observations.
If plot.it=TRUE
, the estimated cumulative probabilities for the uncensored
observations are plotted against the uncensored observations. By default, the
function ecdfPlotCensored
plots a step function when discrete=TRUE
,
and plots a straight line between points when discrete=FALSE
. The user may
override these defaults by supplying the graphics parameter
type (type="s"
for a step function, type="l"
for linear interpolation,
type="p"
for points only, etc.).
If include.cen=TRUE
, censored observations are included on the plot as points. The arguments
cen.pch
, cen.cex
, and cen.col
control the appearance of these points.
In cases where x is a random sample, the empirical cdf will change from sample to sample and the variability in these estimates can be dramatic for small sample sizes. Caution must be used in interpreting the empirical cdf when a large percentage of the observations are censored.
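A minimal sketch (illustrative only, with argument names as described in the help file for ppointsCensored) of obtaining the plotting positions directly from ppointsCensored(), which ecdfPlotCensored() uses internally:

library(EnvStats)

set.seed(333)
x <- rnorm(15, mean = 20, sd = 5)
censored <- x < 18
x[censored] <- 18    # single left-censoring level at 18

pp <- ppointsCensored(x, censored, censoring.side = "left",
    prob.method = "kaplan-meier")
str(pp)              # ordered observations, plotting positions, censoring indicators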
ecdfPlotCensored
returns a list with the following components:
Order.Statistics |
numeric vector of the “ordered” observations. |
Cumulative.Probabilities |
numeric vector of the associated plotting positions. |
Censored |
logical vector indicating which of the ordered observations are censored. |
Censoring.Side |
character string indicating whether the data are left- or right-censored.
This is same value as the argument |
Prob.Method |
character string indicating what method was used to compute the plotting positions.
This is the same value as the argument |
Optional Component (only present when prob.method="michael-schucany"
or prob.method="hirsch-stedinger"
):
Plot.Pos.Con |
numeric scalar containing the value of the plotting position constant that was used.
This is the same as the argument |
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data.
Censored observations complicate the procedures used to graphically explore data. Techniques from
survival analysis and life testing have been developed to generalize the procedures for constructing
plotting positions, empirical cdf plots, and q-q plots to data sets with censored observations
(see ppointsCensored
).
Empirical cumulative distribution function (ecdf) plots are often plotted with theoretical cdf plots
(see cdfPlot
and cdfCompareCensored
) to graphically assess whether a
sample of observations comes from a particular distribution. More often, however, quantile-quantile
(Q-Q) plots are used instead (see qqPlot
and qqPlotCensored
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
ppoints
, ppointsCensored
, ecdfPlot
,
qqPlot
, qqPlotCensored
, cdfPlot
,
cdfCompareCensored
.
# Generate 20 observations from a normal distribution with mean=20 and sd=5, # censor all observations less than 18, then generate an empirical cdf plot # for the complete data set and the censored data set. Note that the empirical # cdf plot for the censored data set starts at the first ordered uncensored # observation, and that for values of x > 18 the two emprical cdf plots are # exactly the same. This is because there is only one censoring level and # no uncensored observations fall below the censored observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(333) x <- rnorm(20, mean=20, sd=5) censored <- x < 18 sum(censored) #[1] 7 new.x <- x new.x[censored] <- 18 dev.new() ecdfPlot(x, xlim = range(pretty(x)), main = "Empirical CDF Plot for\nComplete Data Set") dev.new() ecdfPlotCensored(new.x, censored, xlim = range(pretty(x)), main="Empirical CDF Plot for\nCensored Data Set") # Clean up #--------- rm(x, censored, new.x) #------------------------------------------------------------------------------------ # Example 15-1 of USEPA (2009, page 15-10) gives an example of # computing plotting positions based on censored manganese # concentrations (ppb) in groundwater collected at 5 monitoring # wells. The data for this example are stored in # EPA.09.Ex.15.1.manganese.df. Here we will create an empirical # CDF plot based on the Kaplan-Meier method. EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #4 4 Well.1 21.6 21.6 FALSE #5 5 Well.1 <2 2.0 TRUE #... #21 1 Well.5 17.9 17.9 FALSE #22 2 Well.5 22.7 22.7 FALSE #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE dev.new() with(EPA.09.Ex.15.1.manganese.df, ecdfPlotCensored(Manganese.ppb, Censored, prob.method = "kaplan-meier", ecdf.col = "blue", main = "Empirical CDF of Manganese Data\nBased on Kaplan-Meier")) #========== # Clean up #--------- graphics.off()
Estimate the location and scale parameters of an extreme value distribution, and optionally construct a confidence interval for one of the parameters.
eevd(x, method = "mle", pwme.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), ci = FALSE, ci.parameter = "location", ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are "mle" (maximum likelihood; the default), "mme" (method of moments), "mmue" (method of moments based on the unbiased estimator of variance), and "pwme" (probability-weighted moments). See the DETAILS section for more information on these estimation methods. |
pwme.method |
character string specifying what method to use to compute the
probability-weighted moments when |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
ci |
logical scalar indicating whether to compute a confidence interval for the
location or scale parameter. The default value is |
ci.parameter |
character string indicating the parameter for which the confidence interval is
desired. The possible values are |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the location or scale parameter. Currently, the only possible value is
|
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let x = (x_1, x_2, ..., x_n) be a vector of n observations from an extreme value distribution with parameters location=η and scale=θ.
Estimation
Maximum Likelihood Estimation (method="mle")
The maximum likelihood estimators (mle's) of η and θ are the solutions of the simultaneous equations (Forbes et al., 2011):

η̂ = -θ̂ log[ (1/n) Σ exp(-x_i/θ̂) ]

θ̂ = x̄ - [ Σ x_i exp(-x_i/θ̂) ] / [ Σ exp(-x_i/θ̂) ]

where x̄ denotes the sample mean.
Method of Moments Estimation (method="mme")
The method of moments estimators (mme's) of η and θ are given by (Johnson et al., 1995, p.27):

η̂ = x̄ - γ θ̂

θ̂ = (√6 / π) s_m

where γ ≈ 0.5772 denotes Euler's constant, x̄ denotes the sample mean, and s_m denotes the square root of the method of moments estimator of variance:

s_m^2 = (1/n) Σ (x_i - x̄)^2
Method of Moments Estimators Based on the Unbiased Estimator of Variance (method="mmue")
These estimators are the same as the method of moments estimators except that the method of moments estimator of variance is replaced with the unbiased estimator of variance:

s^2 = [1/(n - 1)] Σ (x_i - x̄)^2
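As an informal check of these formulas (not part of the original examples), the following sketch computes the mme and mmue estimates by hand and compares them with eevd():

library(EnvStats)

set.seed(250)
x <- revd(20, location = 2, scale = 1)

s.mme  <- sqrt(mean((x - mean(x))^2))   # method of moments sd (divide by n)
s.mmue <- sd(x)                         # unbiased-variance version (divide by n - 1)
euler  <- -digamma(1)                   # Euler's constant, about 0.5772

c(location = mean(x) - euler * s.mme * sqrt(6) / pi,
  scale    = s.mme * sqrt(6) / pi)      # compare with eevd(x, method = "mme")$parameters
c(location = mean(x) - euler * s.mmue * sqrt(6) / pi,
  scale    = s.mmue * sqrt(6) / pi)     # compare with eevd(x, method = "mmue")$parameters

eevd(x, method = "mme")$parameters
eevd(x, method = "mmue")$parameters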
Probability-Weighted Moments Estimation (method="pwme")
Greenwood et al. (1979) show that the relationship between the distribution parameters η and θ and the probability-weighted moments is given by:

η = M(1, 0, 0) - γ θ

θ = [2 M(1, 1, 0) - M(1, 0, 0)] / log(2)

where M(1, j, k) = E[X F(X)^j (1 - F(X))^k] denotes a probability-weighted moment and γ denotes Euler's constant. The probability-weighted moment estimators (pwme's) of η and θ are computed by simply replacing the M's in the above two equations with estimates of the M's (and, for the estimate of η, replacing θ with its estimated value). See the help file for pwMoment for more information on how to estimate the M's. Also, see Landwehr et al. (1979) for an example of this method of estimation using the unbiased (U-statistic type) probability-weighted moment estimators. Hosking et al. (1985) note that this method of estimation using the U-statistic type probability-weighted moments is equivalent to Downton's (1966) linear estimates with linear coefficients.
Confidence Intervals
When ci=TRUE, an approximate 100(1 - α)% confidence interval for η can be constructed assuming the distribution of the estimator of η is approximately normally distributed. A two-sided confidence interval is constructed as:

[ η̂ - t(n - 1, 1 - α/2) σ̂(η̂),  η̂ + t(n - 1, 1 - α/2) σ̂(η̂) ]

where t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom, and the quantity σ̂(η̂) denotes the estimated asymptotic standard deviation of the estimator of η.
Similarly, a two-sided confidence interval for θ is constructed as:

[ θ̂ - t(n - 1, 1 - α/2) σ̂(θ̂),  θ̂ + t(n - 1, 1 - α/2) σ̂(θ̂) ]

One-sided confidence intervals for η and θ are computed in a similar fashion.
Maximum Likelihood (method="mle")
Downton (1966) shows that the estimated asymptotic variances of the mle's of η and θ are given by:

Var(η̂) = (θ̂^2 / n) [1 + 6(1 - γ)^2 / π^2]

Var(θ̂) = (θ̂^2 / n) (6 / π^2)

where γ ≈ 0.5772 denotes Euler's constant.
Method of Moments (method="mme" or method="mmue")
Tiago de Oliveira (1963) and Johnson et al. (1995, p.27) give the estimated asymptotic variances of the mme's of η and θ; the expressions involve the skew and kurtosis of the extreme value distribution and Euler's constant γ. The estimated asymptotic variances of the mmue's of η and θ are the same, except that the method of moments estimator of variance is replaced with the unbiased estimator of variance.
Probability-Weighted Moments (method="pwme")
As stated above, Hosking et al. (1985) note that this method of estimation using the U-statistic type probability-weighted moments is equivalent to Downton's (1966) linear estimates with linear coefficients. Downton (1966) provides exact values of the variances of the estimates of the location and scale parameters for the smallest extreme value distribution. For the largest extreme value distribution, the formula for the estimate of scale is the same, but the formula for the estimate of location must be modified; the modified form of Downton's (1966) equation (3.4) involves Euler's constant γ and quantities defined in Downton (1966, p.8). Using Downton's (1966) equations (3.9)-(3.12), the exact variance of the pwme of η can be derived. Note that when method="pwme" and pwme.method="plotting.position", these are only the asymptotically correct variances.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
There are three families of extreme value distributions. The one described here is the Type I, also called the Gumbel extreme value distribution or simply Gumbel distribution. The name “extreme value” comes from the fact that this distribution is the limiting distribution (as $n$ approaches infinity) of the greatest value among $n$ independent random variables each having the same continuous distribution.
The Gumbel extreme value distribution is related to the exponential distribution as follows. Let $W$ be an exponential random variable with parameter rate=1. Then $Y = \eta - \theta \log(W)$ has an extreme value distribution with parameters location=$\eta$ and scale=$\theta$.
The distribution described above and assumed by eevd is the largest extreme value distribution. The smallest extreme value distribution is the limiting distribution (as $n$ approaches infinity) of the smallest value among $n$ independent random variables each having the same continuous distribution. If $X$ has a largest extreme value distribution with parameters location=$\eta$ and scale=$\theta$, then $Y = -X$ has a smallest extreme value distribution with parameters location=$-\eta$ and scale=$\theta$. The smallest extreme value distribution is related to the Weibull distribution as follows. Let $W$ be a Weibull random variable with parameters shape=$\beta$ and scale=$\alpha$. Then $Y = \log(W)$ has a smallest extreme value distribution with parameters location=$\log(\alpha)$ and scale=$1/\beta$.
The extreme value distribution has been used extensively to model the distribution of streamflow, flooding, rainfall, temperature, wind speed, and other meteorological variables, as well as material strength and life data.
Steven P. Millard ([email protected])
Castillo, E. (1988). Extreme Value Theory in Engineering. Academic Press, New York, pp.184–198.
Downton, F. (1966). Linear Estimates of Parameters in the Extreme Value Distribution. Technometrics 8(1), 3–17.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Tiago de Oliveira, J. (1963). Decision Results for the Parameters of the Extreme Value (Gumbel) Distribution Based on the Mean and Standard Deviation. Trabajos de Estadistica 14, 61–81.
Extreme Value Distribution, Euler's Constant.
# Generate 20 observations from an extreme value distribution with # parameters location=2 and scale=1, then estimate the parameters # and construct a 90% confidence interval for the location parameter. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- revd(20, location = 2) eevd(dat, ci = TRUE, conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Extreme Value # #Estimated Parameter(s): location = 1.9684093 # scale = 0.7481955 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: location # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Interval: LCL = 1.663809 # UCL = 2.273009 #---------- #Compare the values of the different types of estimators: eevd(dat, method = "mle")$parameters # location scale #1.9684093 0.7481955 eevd(dat, method = "mme")$parameters # location scale #1.9575980 0.8339256 eevd(dat, method = "mmue")$parameters # location scale #1.9450932 0.8555896 eevd(dat, method = "pwme")$parameters # location scale #1.9434922 0.8583633 #---------- # Clean up #--------- rm(dat)
Estimate the rate parameter of an exponential distribution, and optionally construct a confidence interval for the rate parameter.
eexp(x, method = "mle/mme", ci = FALSE, ci.type = "two-sided", ci.method = "exact", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Currently the only possible value is "mle/mme" (maximum likelihood/method of moments; the default). |

ci |
logical scalar indicating whether to compute a confidence interval for the rate parameter. The default value is ci=FALSE. |

ci.type |
character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". |

ci.method |
character string indicating what method to use to construct the confidence interval for the rate parameter. Currently, the only possible value is "exact" (the default). |

conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from an exponential distribution with parameter rate=$\lambda$.

Estimation

The maximum likelihood estimator (mle) of $\lambda$ is given by:

$$\hat{\lambda} = \frac{1}{\bar{x}}$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

(Forbes et al., 2011). That is, the mle is the reciprocal of the sample mean.
Sometimes the exponential distribution is parameterized with a scale parameter instead of a rate parameter. The scale parameter is the reciprocal of the rate parameter, and the sample mean is both the mle and the minimum variance unbiased estimator (mvue) of the scale parameter.
Confidence Interval

When ci=TRUE, an exact $(1-\alpha)100\%$ confidence interval for $\lambda$ can be constructed based on the relationship between the exponential distribution, the gamma distribution, and the chi-square distribution. An exponential distribution with parameter rate=$\lambda$ is equivalent to a gamma distribution with parameters shape=1 and scale=$1/\lambda$. The sum of $n$ iid gamma random variables with parameters shape=1 and scale=$1/\lambda$ is a gamma random variable with parameters shape=$n$ and scale=$1/\lambda$. Finally, a gamma distribution with parameters shape=$n$ and scale=$1/\lambda$ is equivalent to $1/(2\lambda)$ times a chi-square distribution with degrees of freedom df=$2n$. Thus, the quantity

$$2 \lambda \sum_{i=1}^{n} x_i = 2 n \lambda \bar{x}$$

has a chi-square distribution with degrees of freedom df=$2n$.

A two-sided $(1-\alpha)100\%$ confidence interval for $\lambda$ is therefore constructed as:

$$\left[\frac{\chi^2(2n, \alpha/2)}{2 n \bar{x}}, \;\; \frac{\chi^2(2n, 1-\alpha/2)}{2 n \bar{x}}\right]$$

where $\chi^2(\nu, p)$ is the $p$'th quantile of a chi-square distribution with $\nu$ degrees of freedom.

One-sided confidence intervals are computed in a similar fashion.
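A minimal sketch of this exact chi-square-based interval, which should match the interval returned by eexp in the example below; object names are arbitrary.

# A minimal sketch of the exact confidence interval for the rate parameter
set.seed(250)
dat <- rexp(20, rate = 2)
n <- length(dat)

# Two-sided 90% confidence interval for the rate parameter
qchisq(c(0.05, 0.95), df = 2 * n) / (2 * sum(dat))
# Should match the interval reported by eexp(dat, ci = TRUE, conf = 0.9)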
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The exponential distribution is a special case of the gamma distribution, and takes on positive real values. A major use of the exponential distribution is in life testing where it is used to model the lifetime of a product, part, person, etc.
The exponential distribution is the only continuous distribution with a “lack of memory” property. That is, if the lifetime of a part follows the exponential distribution, then the distribution of the time until failure is the same as the distribution of the time until failure given that the part has survived to time $t$.
The exponential distribution is related to the double exponential (also called Laplace) distribution, and to the extreme value distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
# Generate 20 observations from an exponential distribution with parameter # rate=2, then estimate the parameter and construct a 90% confidence interval. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rexp(20, rate = 2) eexp(dat, ci=TRUE, conf = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Exponential # #Estimated Parameter(s): rate = 2.260587 # #Estimation Method: mle/mme # #Data: dat # #Sample Size: 20 # #Confidence Interval for: rate # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Interval: LCL = 1.498165 # UCL = 3.151173 #---------- # Clean up #--------- rm(dat)
Estimate the shape and scale parameters (or the mean and coefficient of variation) of a Gamma distribution.
egamma(x, method = "mle", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", normal.approx.transform = "kulkarni.powar", conf.level = 0.95) egammaAlt(x, method = "mle", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", normal.approx.transform = "kulkarni.powar", conf.level = 0.95)
x |
numeric vector of non-negative observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |

method |
character string specifying the method of estimation. The possible values are "mle" (maximum likelihood; the default), "bcmle" (bias-corrected mle), "mme" (method of moments), and "mmue" (method of moments based on the unbiased estimator of variance). See the DETAILS section for more information. |

ci |
logical scalar indicating whether to compute a confidence interval for the mean. The default value is ci=FALSE. |

ci.type |
character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". |

ci.method |
character string indicating which method to use to construct the confidence interval. Possible values are "normal.approx" (the default), "profile.likelihood", "chisq.approx", and "chisq.adj". |

normal.approx.transform |
character string indicating which power transformation to use when ci.method="normal.approx". Possible values are "kulkarni.powar" (the default), "cube.root", and "fourth.root". |

conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a random sample of $n$ observations from a gamma distribution with parameters shape=$\kappa$ and scale=$\theta$. The relationship between these parameters and the mean (mean=$\mu$) and coefficient of variation (cv=$\tau$) of this distribution is given by:

$$\mu = \kappa \theta \;\;\;\; (1)$$

$$\tau = \kappa^{-1/2} \;\;\;\; (2)$$

$$\kappa = \tau^{-2} \;\;\;\; (3)$$

$$\theta = \mu \tau^2 \;\;\;\; (4)$$

The function egamma returns estimates of the shape and scale parameters. The function egammaAlt returns estimates of the mean ($\mu$) and coefficient of variation ($\tau$) based on the estimates of the shape and scale parameters.
Estimation

Maximum Likelihood Estimation (method="mle")

The maximum likelihood estimators (mle's) of the shape and scale parameters $\kappa$ and $\theta$ are solutions of the simultaneous equations:

$$\log(\hat{\kappa}_{mle}) - \psi(\hat{\kappa}_{mle}) = \log(\bar{x}) - \frac{1}{n} \sum_{i=1}^{n} \log(x_i) \;\;\;\; (5)$$

$$\hat{\theta}_{mle} = \frac{\bar{x}}{\hat{\kappa}_{mle}} \;\;\;\; (6)$$

where $\psi()$ denotes the digamma function, and $\bar{x}$ denotes the sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

(Forbes et al., 2011, chapter 22; Johnson et al., 1994, chapter 17).
Bias-Corrected Maximum Likelihood Estimation (method="bcmle")

The “bias-corrected” maximum likelihood estimator of the shape parameter is based on the suggestion of Anderson and Ray (1975; see also Johnson et al., 1994, p.366 and Singh et al., 2010b, p.48), who noted that the bias of the maximum likelihood estimator of the shape parameter can be considerable when the sample size is small. This estimator is given by:

$$\hat{\kappa}_{bcmle} = \frac{n-3}{n} \, \hat{\kappa}_{mle} + \frac{2}{3n}$$

When method="bcmle", Equation (6) above is modified so that the estimate of the scale parameter is based on the “bias-corrected” maximum likelihood estimator of the shape parameter:

$$\hat{\theta}_{bcmle} = \frac{\bar{x}}{\hat{\kappa}_{bcmle}}$$
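The following sketch solves the likelihood equation for the shape parameter numerically with uniroot and then applies the small-sample bias correction just described. It is an illustration of the equations only, not the internal code of egamma; the object names and search interval are arbitrary.

# A sketch of the mle and bias-corrected mle of the gamma shape and scale
set.seed(250)
x <- rgamma(20, shape = 3, scale = 2)
n <- length(x)

# mle of shape: solve log(k) - digamma(k) = log(mean(x)) - mean(log(x))
rhs <- log(mean(x)) - mean(log(x))
shape.mle <- uniroot(function(k) log(k) - digamma(k) - rhs,
  interval = c(0.01, 100))$root
scale.mle <- mean(x) / shape.mle

# bias-corrected mle of shape and the corresponding estimate of scale
shape.bcmle <- (n - 3) / n * shape.mle + 2 / (3 * n)
scale.bcmle <- mean(x) / shape.bcmle

c(shape = shape.mle,   scale = scale.mle)    # compare with egamma(x)$parameters
c(shape = shape.bcmle, scale = scale.bcmle)  # compare with egamma(x, method = "bcmle")$parameters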
Method of Moments Estimation (method="mme")

The method of moments estimators (mme's) of the shape and scale parameters $\kappa$ and $\theta$ are:

$$\hat{\kappa}_{mme} = \left(\frac{\bar{x}}{s_m}\right)^2$$

$$\hat{\theta}_{mme} = \frac{s_m^2}{\bar{x}}$$

where $s_m^2$ denotes the method of moments estimator of variance:

$$s_m^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Method of Moments Estimation Based on the Unbiased Estimator of Variance (method="mmue")

The method of moments estimators based on the unbiased estimator of variance are exactly the same as the method of moments estimators, except that the method of moments estimator of variance is replaced with the unbiased estimator of variance:

$$\hat{\kappa}_{mmue} = \left(\frac{\bar{x}}{s}\right)^2$$

$$\hat{\theta}_{mmue} = \frac{s^2}{\bar{x}}$$

where $s^2$ denotes the unbiased estimator of variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
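A minimal sketch of the two moment-based estimators described above; object names are arbitrary.

# Method of moments estimators for the gamma shape and scale
set.seed(250)
x <- rgamma(20, shape = 3, scale = 2)

s2.m <- mean((x - mean(x))^2)                          # mme of variance
c(shape = mean(x)^2 / s2.m, scale = s2.m / mean(x))    # compare with egamma(x, method = "mme")$parameters

s2.u <- var(x)                                         # unbiased estimator of variance
c(shape = mean(x)^2 / s2.u, scale = s2.u / mean(x))    # compare with egamma(x, method = "mmue")$parameters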
Confidence Intervals

This section discusses how confidence intervals for the mean $\mu$ are computed.

Normal Approximation (ci.method="normal.approx")

The normal approximation method is based on the method of Kulkarni and Powar (2010), who use a power transformation of the original data to approximate a sample from a normal distribution, compute the confidence interval for the mean on the transformed scale using the usual formula for a confidence interval for the mean of a normal distribution, and then transform the limits back to the original space using equations based on the expected value of a gamma random variable raised to a power.
The particular power used for the normal approximation is defined by the argument normal.approx.transform. The value normal.approx.transform="cube.root" uses the cube root transformation suggested by Wilson and Hilferty (1931), and the value "fourth.root" uses the fourth root transformation suggested by Hawkins and Wixley (1986). The default value "kulkarni.powar" uses the “Optimum Power Normal Approximation Method” of Kulkarni and Powar (2010), who show this method performs the best in terms of maintaining coverage and minimizing confidence interval width compared to eight other methods. The “optimum” power $r$ is determined by:

$$r = -0.0705 - 0.178 \hat{\kappa} + 0.475 \sqrt{\hat{\kappa}} \;\; \mbox{ if } \hat{\kappa} \le 1.5$$

$$r = 0.246 \;\; \mbox{ if } \hat{\kappa} > 1.5 \;\;\;\; (16)$$

where $\hat{\kappa}$ denotes the estimate of the shape parameter. Kulkarni and Powar (2010) derived this equation by determining what power transformation yields a skew closest to 0 and a kurtosis closest to 3 for a gamma random variable with a given shape parameter.
Although Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to
determine the power to use to induce approximate normality, for the functions
egamma
and egammaAlt
the power is based on whatever estimate of
shape is used (e.g., method="mle"
, method="bcmle"
, etc.).
Likelihood Profile (ci.method="profile.likelihood")

This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988) introduced an efficient method of computation. This method is also discussed by Stryhn and Christensen (2003) and Royston (2007).

The idea behind this method is to invert the likelihood-ratio test to obtain a confidence interval for the mean $\mu$ while treating the coefficient of variation $\tau$ as a nuisance parameter. The likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = \prod_{i=1}^{n} \frac{x_i^{\kappa - 1} \, e^{-x_i/\theta}}{\theta^{\kappa} \, \Gamma(\kappa)}$$

where $\kappa$, $\theta$, $\mu$, and $\tau$ are defined in Equations (1)-(4) above, and $\Gamma(t)$ denotes the Gamma function evaluated at $t$.

Following Stryhn and Christensen (2003), denote the maximum likelihood estimates of the mean and coefficient of variation by $(\mu^*, \tau^*)$. The likelihood ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$ (where $\mu_0$ is a fixed value) equals the drop in $2 \log(L)$ between the “full” model and the reduced model with $\mu$ fixed at $\mu_0$, i.e.,

$$G^2 = 2 \left[\log L(\mu^*, \tau^*) - \log L(\mu_0, \tau_0^*)\right]$$

where $\tau_0^*$ is the maximum likelihood estimate of $\tau$ for the reduced model (i.e., when $\mu = \mu_0$). Under the null hypothesis, the test statistic $G^2$ follows a chi-squared distribution with 1 degree of freedom.

Alternatively, we may express the test statistic in terms of the profile likelihood function $L_1$ for the mean $\mu$, which is obtained from the usual likelihood function by maximizing over the parameter $\tau$, i.e.,

$$L_1(\mu) = \max_{\tau} L(\mu, \tau)$$

Then we have

$$G^2 = 2 \left[\log L_1(\mu^*) - \log L_1(\mu_0)\right]$$

A two-sided $(1-\alpha)100\%$ confidence interval for the mean $\mu$ consists of all values of $\mu_0$ for which the test is not significant at level $\alpha$:

$$\left\{\mu_0 : G^2 \le \chi^2_{1, 1-\alpha}\right\} \;\;\;\; (21)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the chi-squared distribution with $\nu$ degrees of freedom. One-sided lower and one-sided upper confidence intervals are computed in a similar fashion, except that the quantity $\chi^2_{1, 1-\alpha}$ in Equation (21) is replaced with $\chi^2_{1, 1-2\alpha}$.
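The following sketch illustrates the profile-likelihood construction by writing the gamma log-likelihood in terms of the mean and coefficient of variation, profiling out the coefficient of variation with optimize, and inverting the likelihood-ratio test with uniroot. It is an illustration only (not the internal code of egamma); the object names and search intervals are arbitrary choices.

# A sketch of the profile-likelihood confidence interval for the mean
set.seed(250)
x <- rgamma(20, shape = 3, scale = 2)

# Gamma log-likelihood written in terms of the mean (mu) and cv (tau)
loglik <- function(mu, tau, x) {
  shape <- 1 / tau^2
  scale <- mu * tau^2
  sum(dgamma(x, shape = shape, scale = scale, log = TRUE))
}

# Profile log-likelihood for the mean: maximize over tau for fixed mu
profile.loglik <- function(mu, x) {
  optimize(function(tau) loglik(mu, tau, x), interval = c(0.01, 10),
    maximum = TRUE)$objective
}

mu.mle <- mean(x)                       # the mle of the mean is the sample mean
max.ll <- profile.loglik(mu.mle, x)
crit <- qchisq(0.95, df = 1)            # two-sided 95% interval

# The confidence limits are the values of mu where the deviance equals crit
dev <- function(mu) 2 * (max.ll - profile.loglik(mu, x)) - crit
lcl <- uniroot(dev, lower = min(x), upper = mu.mle)$root
ucl <- uniroot(dev, lower = mu.mle, upper = 3 * mu.mle)$root
c(LCL = lcl, UCL = ucl)
# Compare with egamma(x, ci = TRUE, ci.method = "profile.likelihood")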
Chi-Square Approximation (ci.method="chisq.approx")

This method is based on the relationship between the sample mean of the gamma distribution and the chi-squared distribution (Grice and Bain, 1980):

$$\frac{2 n \kappa \bar{x}}{\mu} \sim \chi^2_{2 n \kappa}$$

Therefore, an exact one-sided upper $(1-\alpha)100\%$ confidence interval for the mean $\mu$ is given by:

$$\left[0, \;\; \frac{2 n \kappa \bar{x}}{\chi^2_{2 n \kappa, \, \alpha}}\right] \;\;\;\; (23)$$

an exact one-sided lower $(1-\alpha)100\%$ confidence interval is given by:

$$\left[\frac{2 n \kappa \bar{x}}{\chi^2_{2 n \kappa, \, 1-\alpha}}, \;\; \infty\right] \;\;\;\; (24)$$

and a two-sided $(1-\alpha)100\%$ confidence interval is given by:

$$\left[\frac{2 n \kappa \bar{x}}{\chi^2_{2 n \kappa, \, 1-\alpha/2}}, \;\; \frac{2 n \kappa \bar{x}}{\chi^2_{2 n \kappa, \, \alpha/2}}\right] \;\;\;\; (25)$$

Because this method is exact only when the shape parameter $\kappa$ is known, the method used here is called the “chi-square approximation” method because the estimate of the shape parameter, $\hat{\kappa}$, is used in place of $\kappa$ in Equations (23)-(25) above. The Chi-Square Approximation method is not the method proposed by Grice and Bain (1980), in which the confidence interval is adjusted to account for the fact that the shape parameter $\kappa$ is estimated (see the explanation of the Chi-Square Adjusted method below). The Chi-Square Approximation method used by egamma and egammaAlt is equivalent to the “approximate gamma” method of ProUCL (USEPA, 2015, equation (2-34), p.62).
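A minimal sketch of the chi-square approximation limit described above, substituting an estimate of the shape parameter for the unknown true value; object names are arbitrary, and this is not the internal code of egamma.

# A sketch of the chi-square approximation upper confidence limit for the mean
set.seed(250)
x <- rgamma(20, shape = 3, scale = 2)
n <- length(x)
shape.hat <- unname(egamma(x)$parameters["shape"])   # mle of shape

# One-sided upper 90% confidence limit for the mean
2 * n * shape.hat * mean(x) / qchisq(0.10, df = 2 * n * shape.hat)
# Compare with egamma(x, ci = TRUE, ci.type = "upper", conf.level = 0.9,
#   ci.method = "chisq.approx")$interval$limits["UCL"]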
Chi-Square Adjusted (ci.method="chisq.adj")

This is the same method as the Chi-Square Approximation method discussed above, except that the value of $\alpha$ is adjusted to account for the fact that the shape parameter $\kappa$ is estimated rather than known. Grice and Bain (1980) performed Monte Carlo simulations to determine how to adjust $\alpha$, and the values in their Table 2 are given in the matrix Grice.Bain.80.mat. This method requires that the sample size $n$ is at least 5 and the confidence level is between 75% and 99.5% (except when $n = 5$, in which case the confidence level must be less than 99%). For values of the sample size $n$ and/or $\alpha$ that are not listed in the table, linear interpolation is used (when the sample size $n$ is greater than 40, linear interpolation on $1/n$ is used, as recommended by Grice and Bain (1980)). The Chi-Square Adjusted method used by egamma and egammaAlt is equivalent to the “adjusted gamma” method of ProUCL (USEPA, 2015, equation (2-35), p.63).
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
When ci=TRUE
and ci.method="normal.approx"
, it is possible for the
lower confidence limit based on the transformed data to be less than 0.
In this case, the lower confidence limit on the original scale is set to 0 and a warning is
issued stating that the normal approximation is not accurate in this case.
The gamma distribution takes values on the positive real line. Special cases of the gamma are the exponential distribution and the chi-square distributions. Applications of the gamma include life testing, statistical ecology, queuing theory, inventory control, and precipitation processes. A gamma distribution starts to resemble a normal distribution as the shape parameter tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
Steven P. Millard ([email protected])
Anderson, C.W., and W.D. Ray. (1975). Improved Maximum Likelihood Estimators for the Gamma Distribution. Communications in Statistics, 4, 437–448.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Grice, J.V., and L.J. Bain. (1980). Inferences Concerning the Mean of the Gamma Distribution. Journal of the American Statistical Association, 75, 929-933.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
GammaDist
, estimate.object
, eqgamma
,
predIntGamma
, tolIntGamma
.
# Generate 20 observations from a gamma distribution with parameters # shape=3 and scale=2, then estimate the parameters. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rgamma(20, shape = 3, scale = 2) egamma(dat, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.203862 # scale = 2.174928 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: mean # #Confidence Interval Method: Optimum Power Normal Approximation # of Kulkarni & Powar (2010) # using mle of 'shape' # #Normal Transform Power: 0.246 # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 3.361652 # UCL = 6.746794 # Clean up rm(dat) #==================================================================== # Using the reference area TcCB data in EPA.94b.tccb.df, assume a # gamma distribution, estimate the parameters based on the # bias-corrected mle of shape, and compute a one-sided upper 90% # confidence interval for the mean. #---------- # First test to see whether the data appear to follow a gamma # distribution. with(EPA.94b.tccb.df, gofTest(TcCB[Area == "Reference"], dist = "gamma", est.arg.list = list(method = "bcmle")) ) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 4.5695247 # scale = 0.1309788 # #Estimation Method: bcmle # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Test Statistic: W = 0.9703827 # #Test Statistic Parameter: n = 47 # #P-value: 0.2739512 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. #---------- # Now estimate the paramters and compute the upper confidence limit. with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "bcmle", ci = TRUE, ci.type = "upper", conf.level = 0.9) ) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 4.5695247 # scale = 0.1309788 # #Estimation Method: bcmle # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Optimum Power Normal Approximation # of Kulkarni & Powar (2010) # using bcmle of 'shape' # #Normal Transform Power: 0.246 # #Confidence Interval Type: upper # #Confidence Level: 90% # #Confidence Interval: LCL = 0.0000000 # UCL = 0.6561838 #------------------------------------------------------------------ # Repeat the above example but use the alternative parameterization. 
with(EPA.94b.tccb.df, egammaAlt(TcCB[Area == "Reference"], method = "bcmle", ci = TRUE, ci.type = "upper", conf.level = 0.9) ) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): mean = 0.5985106 # cv = 0.4678046 # #Estimation Method: bcmle of 'shape' # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Optimum Power Normal Approximation # of Kulkarni & Powar (2010) # using bcmle of 'shape' # #Normal Transform Power: 0.246 # #Confidence Interval Type: upper # #Confidence Level: 90% # #Confidence Interval: LCL = 0.0000000 # UCL = 0.6561838 #------------------------------------------------------------------ # Compare the upper confidence limit based on # 1) the default method: # normal approximation method based on Kulkarni and Powar (2010) # 2) Profile Likelihood # 3) Chi-Square Approximation # 4) Chi-Square Adjusted # Default Method #--------------- with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "bcmle", ci = TRUE, ci.type = "upper", conf.level = 0.9)$interval$limits["UCL"] ) # UCL #0.6561838 # Profile Likelihood #------------------- with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "mle", ci = TRUE, ci.type = "upper", conf.level = 0.9, ci.method = "profile.likelihood")$interval$limits["UCL"] ) # UCL #0.6527009 # Chi-Square Approximation #------------------------- with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "mle", ci = TRUE, ci.type = "upper", conf.level = 0.9, ci.method = "chisq.approx")$interval$limits["UCL"] ) # UCL #0.6532188 # Chi-Square Adjusted #-------------------- with(EPA.94b.tccb.df, egamma(TcCB[Area == "Reference"], method = "mle", ci = TRUE, ci.type = "upper", conf.level = 0.9, ci.method = "chisq.adj")$interval$limits["UCL"] ) # UCL #0.65467
Estimate the mean and coefficient of variation of a gamma distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
egammaAltCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ci.sample.size = sum(!censored))
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |

censored |
numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to censored elements of x. If the mode of censored is "numeric", a value of 1 indicates a censored observation and a value of 0 an uncensored one. |

method |
character string specifying the method of estimation. Currently, the only available method is maximum likelihood (method="mle"; the default). |

censoring.side |
character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right". |

ci |
logical scalar indicating whether to compute a confidence interval for the mean. The default value is ci=FALSE. |

ci.method |
character string indicating what method to use to construct the confidence interval for the mean. The possible values are "profile.likelihood" (profile likelihood; the default), "normal.approx" (normal approximation), and "bootstrap" (based on bootstrapping). |

ci.type |
character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". |

conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. |

n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the confidence interval for the mean when ci.method="bootstrap". The default value is n.bootstraps=1000. |

pivot.statistic |
character string indicating which pivot statistic to use in the construction of the confidence interval for the mean when ci.method="normal.approx". The possible values are "z" (the default) and "t". |

ci.sample.size |
numeric scalar indicating what sample size to assume to construct the confidence interval for the mean if ci.method="normal.approx". The default value is the number of uncensored observations, ci.sample.size=sum(!censored). |
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a gamma distribution with parameters shape=$\kappa$ and scale=$\theta$. The relationship between these parameters and the mean $\mu$ and coefficient of variation $\tau$ of this distribution is given by:

$$\mu = \kappa \theta \;\;\;\; (1)$$

$$\tau = \kappa^{-1/2} \;\;\;\; (2)$$
Assume $n_1$ ($0 < n_1 < n$) of these observations are known and $c$ ($c = n - n_1$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \;\; k \ge 1 \;\;\;\; (3)$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n_1$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n_1$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^{k} c_j = c \;\;\;\; (4)$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first. Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.

Finally, let $\Omega$ (omega) denote the set of $n_1$ subscripts in the “ordered” sample that correspond to uncensored observations.
Estimation

Maximum Likelihood Estimation (method="mle")

For Type I left censored data, the likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = {n \choose c_1 c_2 \ldots c_k n_1} \prod_{j=1}^{k} [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (7)$$

where $f$ and $F$ denote the probability density function (pdf) and cumulative distribution function (cdf) of the population (Cohen, 1963; Cohen, 1991, pp.6, 50). That is,

$$f(t) = \frac{t^{\kappa - 1} \, e^{-t/\theta}}{\theta^{\kappa} \, \Gamma(\kappa)}$$

(Johnson et al., 1994, p.343), where $\kappa$ and $\theta$ are defined in terms of $\mu$ and $\tau$ by Equations (1) and (2) above.

For left singly censored data, Equation (7) simplifies to:

$$L(\mu, \tau | \underline{x}) = {n \choose c} [F(T)]^{c} \prod_{i=c+1}^{n} f[x_{(i)}]$$

Similarly, for Type I right censored data, the likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = {n \choose c_1 c_2 \ldots c_k n_1} \prod_{j=1}^{k} [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (10)$$

and for right singly censored data this simplifies to:

$$L(\mu, \tau | \underline{x}) = {n \choose c} [1 - F(T)]^{c} \prod_{i=1}^{n_1} f[x_{(i)}]$$

The maximum likelihood estimators are computed by minimizing the negative log-likelihood function.
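The following sketch illustrates direct minimization of the negative log-likelihood for Type I left-censored gamma data using optim, parameterized in terms of the mean and coefficient of variation. It is an illustration only (not the internal code of egammaAltCensored); the starting values and object names are arbitrary, and the data frame is the one used in the example below.

# A sketch of maximum likelihood estimation for left-censored gamma data
neg.loglik <- function(par, x, censored) {
  mu <- exp(par[1]); tau <- exp(par[2])     # optimize on the log scale
  shape <- 1 / tau^2
  scale <- mu * tau^2
  # Uncensored values contribute the density; left-censored values contribute
  # the cdf evaluated at the censoring level. (The combinatorial constant in
  # the likelihood does not affect the maximization and is omitted.)
  -sum(dgamma(x[!censored], shape = shape, scale = scale, log = TRUE)) -
    sum(pgamma(x[censored], shape = shape, scale = scale, log.p = TRUE))
}

with(EPA.09.Ex.15.1.manganese.df, {
  start <- log(c(mean(Manganese.ppb), sd(Manganese.ppb) / mean(Manganese.ppb)))
  fit <- optim(start, neg.loglik, x = Manganese.ppb, censored = Censored)
  setNames(exp(fit$par), c("mean", "cv"))
  # Compare with egammaAltCensored(Manganese.ppb, Censored)$parameters
})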
Confidence Intervals

This section explains how confidence intervals for the mean $\mu$ are computed.

Likelihood Profile (ci.method="profile.likelihood")

This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988) introduced an efficient method of computation. This method is also discussed by Stryhn and Christensen (2003) and Royston (2007).

The idea behind this method is to invert the likelihood-ratio test to obtain a confidence interval for the mean $\mu$ while treating the coefficient of variation $\tau$ as a nuisance parameter. Equation (7) above shows the form of the likelihood function $L(\mu, \tau | \underline{x})$ for multiply left-censored data, where the censoring levels $T_j$ and censoring counts $c_j$ are defined by Equations (3) and (4), and Equation (10) shows the function for multiply right-censored data.

Following Stryhn and Christensen (2003), denote the maximum likelihood estimates of the mean and coefficient of variation by $(\mu^*, \tau^*)$. The likelihood ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$ (where $\mu_0$ is a fixed value) equals the drop in $2 \log(L)$ between the “full” model and the reduced model with $\mu$ fixed at $\mu_0$, i.e.,

$$G^2 = 2 \left[\log L(\mu^*, \tau^*) - \log L(\mu_0, \tau_0^*)\right]$$

where $\tau_0^*$ is the maximum likelihood estimate of $\tau$ for the reduced model (i.e., when $\mu = \mu_0$). Under the null hypothesis, the test statistic $G^2$ follows a chi-squared distribution with 1 degree of freedom.

Alternatively, we may express the test statistic in terms of the profile likelihood function $L_1$ for the mean $\mu$, which is obtained from the usual likelihood function by maximizing over the parameter $\tau$, i.e.,

$$L_1(\mu) = \max_{\tau} L(\mu, \tau)$$

Then we have

$$G^2 = 2 \left[\log L_1(\mu^*) - \log L_1(\mu_0)\right]$$

A two-sided $(1-\alpha)100\%$ confidence interval for the mean $\mu$ consists of all values of $\mu_0$ for which the test is not significant at level $\alpha$:

$$\left\{\mu_0 : G^2 \le \chi^2_{1, 1-\alpha}\right\} \;\;\;\; (15)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the chi-squared distribution with $\nu$ degrees of freedom. One-sided lower and one-sided upper confidence intervals are computed in a similar fashion, except that the quantity $\chi^2_{1, 1-\alpha}$ in Equation (15) is replaced with $\chi^2_{1, 1-2\alpha}$.
Normal Approximation (ci.method="normal.approx")

This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$ is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t_{1-\alpha/2, \, m-1} \, \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} + t_{1-\alpha/2, \, m-1} \, \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $m$ denotes the assumed sample size for the confidence interval, and $t_{p, \nu}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

The argument ci.sample.size determines the value of $m$ and by default is equal to the number of uncensored observations. This is simply an ad-hoc method of constructing confidence intervals and is not based on any published theoretical results.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.

The standard deviation of the mle of $\mu$ is estimated based on the inverse of the Fisher Information matrix.
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate confidence interval for the population mean $\mu$, the bootstrap can be broken down into the following steps:
1. Create a bootstrap sample by taking a random sample of size $n$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample (e.g., the number of censored observations in the bootstrap sample will often differ from the number of censored observations in the original sample).

2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$.

3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function egammaAltCensored, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.

4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of this estimator of $\mu$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.
The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$[\hat{G}^{-1}(\alpha/2), \;\; \hat{G}^{-1}(1-\alpha/2)] \;\;\;\; (17)$$

where $\hat{G}(t)$ denotes the empirical cdf of the bootstrapped estimates of $\mu$ evaluated at $t$, and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha), \;\; \infty] \;\;\;\; (18)$$

and a one-sided upper confidence interval is computed as:

$$[0, \;\; \hat{G}^{-1}(1-\alpha)] \;\;\;\; (19)$$

The function egammaAltCensored calls the R function quantile to compute the empirical quantiles used in Equations (17)-(19).
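A minimal sketch of the percentile bootstrap described above, resampling (value, censoring indicator) pairs and re-estimating the mean on each resample. It is an illustration only: the number of bootstraps is kept small to keep the example fast, and degenerate resamples (e.g., with no censored values) are simply skipped via tryCatch rather than handled more carefully.

# A sketch of the percentile bootstrap confidence interval for the mean
set.seed(47)
with(EPA.09.Ex.15.1.manganese.df, {
  n <- length(Manganese.ppb)
  boot.means <- replicate(200, {
    i <- sample(n, replace = TRUE)          # resample (value, censoring indicator) pairs
    tryCatch(
      egammaAltCensored(Manganese.ppb[i], Censored[i])$parameters["mean"],
      error = function(e) NA)
  })
  quantile(boot.means, c(0.025, 0.975), na.rm = TRUE)   # two-sided 95% percentile interval
})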
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{n}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/n$ instead of $k/\sqrt{n}$. The two-sided bias-corrected and accelerated confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)]$$

where

$$\alpha_1 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right] \;\;\;\; (21)$$

$$\alpha_2 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right] \;\;\;\; (22)$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^{n} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6 \left[\sum_{i=1}^{n} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2\right]^{3/2}}$$

where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$[\hat{G}^{-1}(\alpha_1), \;\; \infty]$$

and a one-sided upper confidence interval is given by:

$$[0, \;\; \hat{G}^{-1}(\alpha_2)]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$ in Equations (21) and (22).

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.

When ci.method="bootstrap", the function egammaAltCensored computes both the percentile method and bias-corrected and accelerated method bootstrap confidence intervals.
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation. Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators for parameters of a normal or lognormal distribution based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation (or coefficient of variation), rather than rely on a single point-estimate of the mean. Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and coefficient of variation of a gamma distribution when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard ([email protected])
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
egammaCensored
, GammaDist, egamma
,
estimateCensored.object
.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and coefficient of variation # ON THE ORIGINAL SCALE using the MLE and # assuming a gamma distribution. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean and coefficient of variation # using the MLE, and compute a confidence interval # for the mean using the profile-likelihood method. #--------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, egammaAltCensored(Manganese.ppb, Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Gamma # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 19.664797 # cv = 1.252936 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 12.25151 # UCL = 34.35332 #---------- # Compare the confidence interval for the mean # based on assuming a lognormal distribution versus # assuming a gamma distribution. with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, ci = TRUE))$interval$limits # LCL UCL #12.37629 69.87694 with(EPA.09.Ex.15.1.manganese.df, egammaAltCensored(Manganese.ppb, Censored, ci = TRUE))$interval$limits # LCL UCL #12.25151 34.35332
Estimate the shape and scale parameters of a gamma distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
egammaCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ci.sample.size = sum(!censored))
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |

censored |
numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to censored elements of x. If the mode of censored is "numeric", a value of 1 indicates a censored observation and a value of 0 an uncensored one. |

method |
character string specifying the method of estimation. Currently, the only available method is maximum likelihood (method="mle"; the default). |

censoring.side |
character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right". |

ci |
logical scalar indicating whether to compute a confidence interval for the mean. The default value is ci=FALSE. |

ci.method |
character string indicating what method to use to construct the confidence interval for the mean. The possible values are "profile.likelihood" (profile likelihood; the default), "normal.approx" (normal approximation), and "bootstrap" (based on bootstrapping). |

ci.type |
character string indicating what kind of confidence interval to compute. The possible values are "two-sided" (the default), "lower", and "upper". |

conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval. The default value is conf.level=0.95. |

n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the confidence interval for the mean when ci.method="bootstrap". The default value is n.bootstraps=1000. |

pivot.statistic |
character string indicating which pivot statistic to use in the construction of the confidence interval for the mean when ci.method="normal.approx". The possible values are "z" (the default) and "t". |

ci.sample.size |
numeric scalar indicating what sample size to assume to construct the confidence interval for the mean if ci.method="normal.approx". The default value is the number of uncensored observations, ci.sample.size=sum(!censored). |
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a gamma distribution with parameters shape=$\kappa$ and scale=$\theta$. The relationship between these parameters and the mean $\mu$ and coefficient of variation $\tau$ of this distribution is given by:

$$\mu = \kappa \theta \;\;\;\; (1)$$

$$\tau = \kappa^{-1/2} \;\;\;\; (2)$$
Assume $n_1$ ($0 < n_1 < n$) of these observations are known and $c$ ($c = n - n_1$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \;\; k \ge 1 \;\;\;\; (3)$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n_1$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n_1$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^{k} c_j = c \;\;\;\; (4)$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first. Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.

Finally, let $\Omega$ (omega) denote the set of $n_1$ subscripts in the “ordered” sample that correspond to uncensored observations.
Estimation

Maximum Likelihood Estimation (method="mle")

For Type I left censored data, the likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = {n \choose c_1 c_2 \ldots c_k n_1} \prod_{j=1}^{k} [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (7)$$

where $f$ and $F$ denote the probability density function (pdf) and cumulative distribution function (cdf) of the population (Cohen, 1963; Cohen, 1991, pp.6, 50). That is,

$$f(t) = \frac{t^{\kappa - 1} \, e^{-t/\theta}}{\theta^{\kappa} \, \Gamma(\kappa)}$$

(Johnson et al., 1994, p.343). For left singly censored data, Equation (7) simplifies to:

$$L(\mu, \tau | \underline{x}) = {n \choose c} [F(T)]^{c} \prod_{i=c+1}^{n} f[x_{(i)}]$$

Similarly, for Type I right censored data, the likelihood function is given by:

$$L(\mu, \tau | \underline{x}) = {n \choose c_1 c_2 \ldots c_k n_1} \prod_{j=1}^{k} [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (10)$$

and for right singly censored data this simplifies to:

$$L(\mu, \tau | \underline{x}) = {n \choose c} [1 - F(T)]^{c} \prod_{i=1}^{n_1} f[x_{(i)}]$$

The maximum likelihood estimators are computed by minimizing the negative log-likelihood function.
Confidence Intervals
This section explains how confidence intervals for the mean are
computed.
Likelihood Profile (ci.method="profile.likelihood"
)
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean while treating the coefficient of variation
as a nuisance parameter. Equation (7) above
shows the form of the likelihood function
for
multiply left-censored data and Equation (10) shows the function for
multiply right-censored data, where
and
are defined by
Equations (3) and (4).
Following Stryhn and Christensen (2003), denote the maximum likelihood estimates
of the mean and coefficient of variation by $(\hat{\mu}, \hat{\tau})$. The likelihood
ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$
(where $\mu_0$ is a fixed value) equals the drop in $2 \log(L)$ between the
“full” model and the reduced model with $\mu$ fixed at $\mu_0$, i.e.,

$$G^2 = 2 \left\{ \log[L(\hat{\mu}, \hat{\tau})] - \log[L(\mu_0, \hat{\tau}_0)] \right\}$$

where $\hat{\tau}_0$ is the maximum likelihood estimate of $\tau$ for the
reduced model (i.e., when $\mu = \mu_0$). Under the null hypothesis,
the test statistic $G^2$ follows a
chi-squared distribution with 1 degree of freedom.
Alternatively, we may
express the test statistic in terms of the profile likelihood function $L_1$
for the mean $\mu$, which is obtained from the usual likelihood function by
maximizing over the parameter $\tau$, i.e.,

$$L_1(\mu) = \max_{\tau} L(\mu, \tau)$$

Then we have

$$G^2 = 2 \left\{ \log[L_1(\hat{\mu})] - \log[L_1(\mu_0)] \right\}$$

A two-sided $(1-\alpha)100\%$ confidence interval for the mean $\mu$
consists of all values of $\mu_0$ for which the test is not significant at
level $\alpha$:

$$\left\{ \mu_0 : G^2 \le \chi^2_{1, 1-\alpha} \right\} \;\;\;\;\;\; (15)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the
chi-squared distribution with $\nu$ degrees of freedom.
One-sided lower and one-sided upper confidence intervals are computed in a similar
fashion, except that the quantity $1-\alpha$ in Equation (15) is replaced with
$1-2\alpha$.
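A minimal R sketch of the profile-likelihood idea (again not EnvStats' implementation) reparameterizes the likelihood in terms of the mean and coefficient of variation, maximizes over the coefficient of variation for each fixed value of the mean, and then locates the values of the mean at which the likelihood-ratio statistic reaches the chi-squared cutoff; the search interval for the cv is an arbitrary assumption.

loglik <- function(mean, cv, x, censored) {
  shape <- 1 / cv^2; scale <- mean * cv^2
  sum(pgamma(x[censored], shape = shape, scale = scale, log.p = TRUE)) +
    sum(dgamma(x[!censored], shape = shape, scale = scale, log = TRUE))
}
profile.loglik <- function(mean, x, censored) {
  optimize(function(cv) loglik(mean, cv, x, censored),
    interval = c(0.01, 10), maximum = TRUE)$objective
}
# The confidence limits are the two values of the mean (one on either side of the
# MLE) at which
#   2 * (maximized log-likelihood - profile.loglik(mean)) = qchisq(conf.level, df = 1)
# which can be located with uniroot().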
Normal Approximation (ci.method="normal.approx"
)
This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$
based on the assumption that the estimator of $\mu$ is
approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$
confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t(m-1, 1-\alpha/2) \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} + t(m-1, 1-\alpha/2) \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$,
$\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard
deviation of the estimator of $\mu$, $m$ denotes the assumed sample
size for the confidence interval, and $t(\nu, p)$ denotes the $p$'th
quantile of Student's t-distribution with $\nu$
degrees of freedom. One-sided confidence intervals are computed in a
similar fashion.

The argument ci.sample.size
determines the value of $m$ and by
default is equal to the number of uncensored observations.
This is simply an ad-hoc method of constructing
confidence intervals and is not based on any published theoretical results.

When pivot.statistic="z"
, the $p$'th quantile from the
standard normal distribution is used in place of the
$p$'th quantile from Student's t-distribution.

The standard deviation of the mle of $\mu$ is
estimated based on the inverse of the Fisher Information matrix.
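For illustration, here is a hedged R sketch of the normal-approximation interval, assuming the point estimate of the mean, its estimated standard deviation, and the assumed sample size are already available (all numbers below are hypothetical).

mean.hat <- 19.7     # estimated mean (hypothetical)
sd.mean.hat <- 5.2   # estimated standard deviation of the estimator (hypothetical)
m <- 19              # ci.sample.size: number of uncensored observations (hypothetical)
alpha <- 0.05
mean.hat + c(-1, 1) * qt(1 - alpha/2, df = m - 1) * sd.mean.hat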
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate confidence
interval for the population mean $\mu$, the bootstrap can be broken down into the
following steps:

Create a bootstrap sample by taking a random sample of size $N$ from
the observations in $\underline{x}$, where sampling is done with
replacement. Note that because sampling is done with replacement, the same
element of $\underline{x}$ can appear more than once in the bootstrap
sample. Thus, the bootstrap sample will usually not look exactly like the
original sample (e.g., the number of censored observations in the bootstrap
sample will often differ from the number of censored observations in the
original sample).

Estimate $\mu$ based on the bootstrap sample created in Step 1, using
the same method that was used to estimate $\mu$ using the original
observations in $\underline{x}$. Because the bootstrap sample usually
does not match the original sample, the estimate of $\mu$ based on the
bootstrap sample will usually differ from the original estimate based on
$\underline{x}$.

Repeat Steps 1 and 2 $B$ times, where $B$ is some large number.
For the function
egammaCensored
, the number of bootstraps is
determined by the argument
n.bootstraps
(see the section ARGUMENTS above).
The default value of n.bootstraps
is 1000
.
Use the $B$ estimated values of $\mu$ to compute the empirical
cumulative distribution function of this estimator of $\mu$ (see
ecdfPlot
), and then create a confidence interval for $\mu$
based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$\left[\hat{G}^{-1}\left(\frac{\alpha}{2}\right), \;\; \hat{G}^{-1}\left(1 - \frac{\alpha}{2}\right)\right] \;\;\;\;\;\; (17)$$

where $\hat{G}(t)$ denotes the empirical cdf evaluated at $t$ and thus
$\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile, that is,
the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower
confidence interval is computed as:

$$\left[\hat{G}^{-1}(\alpha), \;\; \infty\right] \;\;\;\;\;\; (18)$$

and a one-sided upper confidence interval is computed as:

$$\left[0, \;\; \hat{G}^{-1}(1 - \alpha)\right] \;\;\;\;\;\; (19)$$

The function egammaCensored
calls the R function quantile
to compute the empirical quantiles used in Equations (17)-(19).
The percentile method bootstrap confidence interval is only first-order
accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability
that the confidence interval will contain the true value of $\mu$ can be
off by $k_1/\sqrt{N}$, where $k_1$ is some constant. Efron and Tibshirani
(1993, pp.184-188) proposed a bias-corrected and accelerated interval that is
second-order accurate, meaning that the probability that the confidence interval
will contain the true value of $\mu$ may be off by $k_2/N$
instead of $k_1/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is
computed as:

$$\left[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)\right] \;\;\;\;\;\; (20)$$

where

$$\alpha_1 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right] \;\;\;\;\;\; (21)$$

$$\alpha_2 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right] \;\;\;\;\;\; (22)$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6\left[\sum_{i=1}^N (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2\right]^{3/2}}$$

Here $\Phi$ denotes the cumulative distribution function of the standard normal
distribution and $z_p$ denotes the $p$'th quantile of the standard normal
distribution. The quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using
all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{N}\sum_{i=1}^N \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$\left[\hat{G}^{-1}(\alpha_1), \;\; \infty\right]$$

and a one-sided upper confidence interval is given by:

$$\left[0, \;\; \hat{G}^{-1}(\alpha_2)\right]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence
interval, except $\alpha/2$ is replaced with $\alpha$
in Equations (21) and (22).

The constant $\hat{z}_0$ incorporates the bias correction, and the constant
$\hat{a}$ is the acceleration constant. The term “acceleration” refers
to the rate of change of the standard error of the estimate of $\mu$ with
respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a
normal (Gaussian) distribution, the standard error of the estimate of $\mu$
does not depend on the value of $\mu$, hence the acceleration constant is not
really necessary.
When ci.method="bootstrap"
, the function egammaCensored
computes both
the percentile method and bias-corrected and accelerated method
bootstrap confidence intervals.
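The mechanics of the percentile interval can be sketched directly in R (this is illustrative only and not how egammaCensored is implemented internally); it resamples the manganese data used in the EXAMPLES section below and recomputes the estimated mean (shape times scale) for each bootstrap sample.

set.seed(47)
x <- EPA.09.Ex.15.1.manganese.df$Manganese.ppb
cens <- EPA.09.Ex.15.1.manganese.df$Censored
boot.means <- replicate(200, {   # a small number of bootstraps to keep the sketch fast
  i <- sample(length(x), replace = TRUE)
  est <- egammaCensored(x[i], cens[i])$parameters
  unname(est["shape"] * est["scale"])
})
quantile(boot.means, c(0.025, 0.975))   # two-sided 95% percentile interval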
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation. Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators for parameters of a normal or lognormal distribution based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation (or coefficient of variation), rather than rely on a single point-estimate of the mean. Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and coefficient of variation of a gamma distribution when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard ([email protected])
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
egammaAltCensored
, GammaDist, egamma
,
estimateCensored.object
.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the shape and scale parameters using # the data ON THE ORIGINAL SCALE, using the MLE and # assuming a gamma distribution. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the shape and scale parameters # using the MLE, and compute a confidence interval # for the mean using the profile-likelihood method. #--------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, egammaCensored(Manganese.ppb, Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Gamma # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): shape = 0.6370043 # scale = 30.8707533 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 12.25151 # UCL = 34.35332 #---------- # Compare the confidence interval for the mean # based on assuming a lognormal distribution versus # assuming a gamma distribution. with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, ci = TRUE))$interval$limits # LCL UCL #12.37629 69.87694 with(EPA.09.Ex.15.1.manganese.df, egammaCensored(Manganese.ppb, Censored, ci = TRUE))$interval$limits # LCL UCL #12.25151 34.35332
Estimate the probability parameter of a geometric distribution.
egeom(x, method = "mle/mme")
x |
vector of non-negative integers indicating the number of trials that took place
before the first “success” occurred. (The total number of trials
that took place is |
method |
character string specifying the method of estimation. Possible values are |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ independent observations from a geometric distribution
with parameter prob=$p$.

It can be shown (e.g., Forbes et al., 2011) that if $y$ is defined as:

$$y = \sum_{i=1}^n x_i$$

then $y$ is an observation from a
negative binomial distribution with
parameters prob=$p$ and size=$n$.
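A quick R sketch (illustrative only) of this relationship compares the empirical distribution of such sums with the corresponding negative binomial distribution function:

set.seed(1)
n <- 5; p <- 0.3
sums <- replicate(2000, sum(rgeom(n, prob = p)))
cbind(empirical = ecdf(sums)(c(5, 10, 20)),
  theoretical = pnbinom(c(5, 10, 20), size = n, prob = p))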
Estimation
The maximum likelihood and method of moments estimator (mle/mme) of $p$
is given by:

$$\hat{p}_{mle} = \frac{1}{1 + \bar{x}} = \frac{n}{n + \sum_{i=1}^n x_i}$$

and the minimum variance unbiased estimator (mvue) of $p$ is given by:

$$\hat{p}_{mvue} = \frac{n - 1}{n + \sum_{i=1}^n x_i - 1}$$

(Forbes et al., 2011). Note that the mvue of $p$ is not defined for
$n = 1$.
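These two estimators are easy to compute directly; the following R sketch reproduces the value returned by egeom(dat, method = "mvue") in the EXAMPLES section below (the data are the same three values shown in the second example).

x <- c(0, 1, 2)
n <- length(x)
p.mle  <- n / (n + sum(x))              # mle/mme: 1 / (1 + mean(x))
p.mvue <- (n - 1) / (n + sum(x) - 1)    # mvue (requires n > 1)
c(mle = p.mle, mvue = p.mvue)
# mle mvue
# 0.5  0.4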
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The geometric distribution with parameter prob=$p$ is a special case of the
negative binomial distribution with parameters size=1 and prob=p.
The negative binomial distribution has its roots in a gambling game where participants would bet on the number of tosses of a coin necessary to achieve a fixed number of heads. The negative binomial distribution has been applied in a wide variety of fields, including accident statistics, birth-and-death processes, and modeling spatial distributions of biological organisms.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 5.
Geometric, enbinom
, NegBinomial.
# Generate an observation from a geometric distribution with parameter # prob=0.2, then estimate the parameter prob. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgeom(1, prob = 0.2) dat #[1] 4 egeom(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Geometric # #Estimated Parameter(s): prob = 0.2 # #Estimation Method: mle/mme # #Data: dat # #Sample Size: 1 #---------- # Generate 3 observations from a geometric distribution with parameter # prob=0.2, then estimate the parameter prob with the mvue. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(200) dat <- rgeom(3, prob = 0.2) dat #[1] 0 1 2 egeom(dat, method = "mvue") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Geometric # #Estimated Parameter(s): prob = 0.4 # #Estimation Method: mvue # #Data: dat # #Sample Size: 3 #---------- # Clean up #--------- rm(dat)
Estimate the location, scale and shape parameters of a generalized extreme value distribution, and optionally construct a confidence interval for one of the parameters.
egevd(x, method = "mle", pwme.method = "unbiased", tsoe.method = "med", plot.pos.cons = c(a = 0.35, b = 0), ci = FALSE, ci.parameter = "location", ci.type = "two-sided", ci.method = "normal.approx", information = "observed", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
|
pwme.method |
character string specifying what method to use to compute the
probability-weighted moments when |
tsoe.method |
character string specifying the robust function to apply in the second stage of
the two-stage order-statistics estimator when |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
ci |
logical scalar indicating whether to compute a confidence interval for the
location, scale, or shape parameter. The default value is |
ci.parameter |
character string indicating the parameter for which the confidence interval is
desired. The possible values are |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the location or scale parameter. Currently, the only possible value is
|
information |
character string indicating which kind of Fisher information to use when
computing the variance-covariance matrix of the maximum likelihood estimators.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from a generalized extreme value distribution with
parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$.
Estimation
Maximum Likelihood Estimation (method="mle"
)
The log likelihood function is given by:

$$\log L(\eta, \theta, \kappa \mid \underline{x}) = -n \log(\theta) + \left(\frac{1}{\kappa} - 1\right) \sum_{i=1}^n \log(z_i) - \sum_{i=1}^n z_i^{1/\kappa}$$

where

$$z_i = 1 - \frac{\kappa (x_i - \eta)}{\theta}$$

(see, for example, Jenkinson, 1969; Prescott and Walden, 1980; Prescott and Walden,
1983; Hosking, 1985; MacLeod, 1989). The maximum likelihood estimators (MLE's) of
$\eta$, $\theta$, and $\kappa$ are those values that maximize the
likelihood function, subject to the following constraints:

$$\theta > 0$$

$$\kappa \le 1$$

$$z_i > 0, \;\; i = 1, 2, \ldots, n$$

Although in theory the value of $\kappa$ may lie anywhere in the interval
$(-\infty, \infty)$ (see GEVD), the constraint $\kappa \le 1$ is
imposed because when $\kappa > 1$ the likelihood can be made infinite and
thus the MLE does not exist (Castillo and Hadi, 1994). Hence, this method of
estimation is not valid when the true value of $\kappa$ is larger than 1.
Hosking (1985) and Hosking et al. (1985) note that in practice the value of $\kappa$
tends to lie in the interval $-1/2 < \kappa < 1/2$.
The value of the negative log-likelihood is minimized using the R function
nlminb
.
Prescott and Walden (1983) give formulas for the gradient and Hessian. Only
the gradient is supplied in the call to nlminb
. The values of
the PWME (see below) are used as the starting values. If the starting value of $\kappa$
is less than 0.001 in absolute value, it is reset to
sign(k) * 0.001
, as suggested by Hosking (1985).
Probability-Weighted Moments Estimation (method="pwme"
)
The idea of probability-weighted moments was introduced by Greenwood et al. (1979).
Landwehr et al. (1979) derived probability-weighted moment estimators (PWME's) for
the parameters of the Type I (Gumbel) extreme value distribution.
Hosking et al. (1985) extended these results to the generalized extreme value
distribution. See the abstract for Hosking et al. (1985)
for details on how these estimators are computed.
Two-Stage Order Statistics Estimation (method="tsoe"
)
The two-stage order statistics estimator (TSOE) was introduced by
Castillo and Hadi (1994) as an alternative to the MLE and PWME. Unlike the
MLE and PWME, the TSOE of exists for all combinations of sample
values and possible values of
. See the
abstract for Castillo and Hadi (1994) for details
on how these estimators are computed. In the second stage,
Castillo and Hadi (1994) suggest using either the median or the least median of
squares as the robust function. The function
egevd
allows three options
for the robust function: median (tsoe.method="med"
; see the R help file for
median
), least median of squares (tsoe.method="lms"
;
see the help file for lmsreg
in the package MASS),
and least trimmed squares (tsoe.method="lts"
; see the help file for
ltsreg
in the package MASS).
Confidence Intervals
When ci=TRUE
, an approximate $(1-\alpha)100\%$ confidence interval
for $\eta$, $\theta$, or $\kappa$ can be constructed assuming the distribution of the estimator of
the parameter is approximately normally distributed. A two-sided confidence
interval for $\eta$ is constructed as:

$$[\hat{\eta} - t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\eta}}, \;\; \hat{\eta} + t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\eta}}]$$

where $t(\nu, p)$ is the $p$'th quantile of Student's t-distribution with $\nu$
degrees of freedom, and the quantity $\hat{\sigma}_{\hat{\eta}}$
denotes the estimated asymptotic standard deviation of the estimator of $\eta$.
Similarly, a two-sided confidence interval for $\theta$ is constructed as:

$$[\hat{\theta} - t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\theta}}, \;\; \hat{\theta} + t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\theta}}]$$

and a two-sided confidence interval for $\kappa$ is constructed as:

$$[\hat{\kappa} - t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\kappa}}, \;\; \hat{\kappa} + t(n-1, 1-\alpha/2) \hat{\sigma}_{\hat{\kappa}}]$$

One-sided confidence intervals for $\eta$, $\theta$, and $\kappa$ are
computed in a similar fashion.
Maximum Likelihood Estimator (method="mle"
)
Prescott and Walden (1980) derive the elements of the Fisher information matrix
(the expected information). The inverse of this matrix, evaluated at the values
of the MLE, is the estimated asymptotic variance-covariance matrix of the MLE.
This method is used to estimate the standard deviations of the estimated
distribution parameters when information="expected"
. The necessary
regularity conditions hold for $\kappa < 1/2$. Thus, this method of
constructing confidence intervals is not valid when the true value of $\kappa$
is greater than or equal to 1/2.
Prescott and Walden (1983) derive expressions for the observed information matrix
(i.e., the Hessian). This matrix is used to compute the estimated asymptotic
variance-covariance matrix of the MLE when information="observed"
.
In computer simulations, Prescott and Walden (1983) found that the
variance-covariance matrix based on the observed information gave slightly more
accurate estimates of the variance of the MLE of $\kappa$ compared to the
estimated variance based on the expected information.
Probability-Weighted Moments Estimator (method="pwme"
)
Hosking et al. (1985) show that these estimators are asymptotically multivariate
normal and derive the asymptotic variance-covariance matrix. See the
abstract for Hosking et al. (1985) for details on how
this matrix is computed.
Two-Stage Order Statistics Estimator (method="tsoe"
)
Currently there is no built-in method in EnvStats for computing confidence
intervals when method="tsoe"
. Castillo and Hadi (1994) suggest
using the bootstrap or jackknife method.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
Two-parameter extreme value distributions (EVD) have been applied extensively since the 1930's to several fields of study, including the distributions of hydrological and meteorological variables, human lifetimes, and strength of materials. The three-parameter generalized extreme value distribution (GEVD) was introduced by Jenkinson (1955) to model annual maximum and minimum values of meteorological events. Since then, it has been used extensively in the hydrological and meteorological fields.
The three families of EVDs are all special kinds of GEVDs. When the shape
parameter $\kappa = 0$, the GEVD reduces to the Type I extreme value (Gumbel)
distribution. (The function
zTestGevdShape
allows you to test
the null hypothesis $H_0: \kappa = 0$.) When $\kappa < 0$, the GEVD is
the same as the Type II extreme value distribution, and when $\kappa > 0$
it is the same as the Type III extreme value distribution.
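As a quick numerical check of the first special case, here is a sketch using the EnvStats density functions dgevd and devd:

x <- seq(-2, 6, by = 0.5)
all.equal(dgevd(x, location = 1, scale = 2, shape = 0),
  devd(x, location = 1, scale = 2))
# should be TRUE: the two densities coincide when shape = 0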
Hosking et al. (1985) compare the asymptotic and small-sample statistical
properties of the PWME with the MLE and Jenkinson's (1969) method of sextiles.
Castillo and Hadi (1994) compare the small-sample statistical properties of the
MLE, PWME, and TSOE. Hosking and Wallis (1995) compare the small-sample properties
of unbiased $L$-moment estimators vs. plotting-position $L$-moment
estimators. (PWMEs can be written as linear combinations of $L$-moments and
thus have equivalent statistical properties.) Hosking and Wallis (1995) conclude
that unbiased estimators should be used for almost all applications.
Steven P. Millard ([email protected])
Castillo, E., and A. Hadi. (1994). Parameter and Quantile Estimation for the Generalized Extreme-Value Distribution. Environmetrics 5, 417–432.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hosking, J.R.M. (1984). Testing Whether the Shape Parameter is Zero in the Generalized Extreme-Value Distribution. Biometrika 71(2), 367–374.
Hosking, J.R.M. (1985). Algorithm AS 215: Maximum-Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 34(3), 301–310.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Jenkinson, A.F. (1969). Statistics of Extremes. Technical Note 98, World Meteorological Office, Geneva.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Macleod, A.J. (1989). Remark AS R76: A Remark on Algorithm AS 215: Maximum Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 38(1), 198–199.
Prescott, P., and A.T. Walden. (1980). Maximum Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Biometrika 67(3), 723–724.
Prescott, P., and A.T. Walden. (1983). Maximum Likelihood Estimation of the Three-Parameter Generalized Extreme-Value Distribution from Censored Samples. Journal of Statistical Computing and Simulation 16, 241–250.
Generalized Extreme Value Distribution,
zTestGevdShape
, Extreme Value Distribution,
eevd
.
# Generate 20 observations from a generalized extreme value distribution # with parameters location=2, scale=1, and shape=0.2, then compute the # MLE and construct a 90% confidence interval for the location parameter. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(498) dat <- rgevd(20, location = 2, scale = 1, shape = 0.2) egevd(dat, ci = TRUE, conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Generalized Extreme Value # #Estimated Parameter(s): location = 1.6144631 # scale = 0.9867007 # shape = 0.2632493 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: location # #Confidence Interval Method: Normal Approximation # (t Distribution) based on # observed information # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Interval: LCL = 1.225249 # UCL = 2.003677 #---------- # Compare the values of the different types of estimators: egevd(dat, method = "mle")$parameters # location scale shape #1.6144631 0.9867007 0.2632493 egevd(dat, method = "pwme")$parameters # location scale shape #1.5785779 1.0187880 0.2257948 egevd(dat, method = "pwme", pwme.method = "plotting.position")$parameters # location scale shape #1.5509183 0.9804992 0.1657040 egevd(dat, method = "tsoe")$parameters # location scale shape #1.5372694 1.0876041 0.2927272 egevd(dat, method = "tsoe", tsoe.method = "lms")$parameters #location scale shape #1.519469 1.081149 0.284863 egevd(dat, method = "tsoe", tsoe.method = "lts")$parameters # location scale shape #1.4840198 1.0679549 0.2691914 #---------- # Clean up #--------- rm(dat)
Estimate $M$, the number of white balls in the urn, or $T = M + N$, the total number of balls in the urn, for a
hypergeometric distribution.
ehyper(x, m = NULL, total = NULL, k, method = "mle")
x |
non-negative integer indicating the number of white balls out of a sample of
size |
m |
non-negative integer indicating the number of white balls in the urn.
You must supply |
total |
positive integer indicating the total number of balls in the urn (i.e.,
|
k |
positive integer indicating the number of balls drawn without replacement from the
urn. Missing values ( |
method |
character string specifying the method of estimation. Possible values are
|
Missing (NA
), undefined (NaN
), and infinite (Inf
, -Inf
)
values are not allowed.
Let $x$ be an observation from a
hypergeometric distribution with
parameters m=$M$, n=$N$, and k=$K$.
In R nomenclature, $x$ represents the number of white balls drawn out of a
sample of $K$ balls drawn without replacement from an urn containing $M$
white balls and $N$ black balls. The total number of balls in the
urn is thus $M + N$. Denote the total number of balls by $T = M + N$.
Estimation
Estimating M, Given T and K are known
When $T$ and $K$ are known, the maximum likelihood estimator (mle) of $M$
is given by (Forbes et al., 2011):

$$\hat{M}_{mle} = \lfloor (T + 1) x / K \rfloor$$

where $\lfloor y \rfloor$ represents the floor function.
That is, $\lfloor y \rfloor$ is the largest integer less than or equal to $y$.
If the quantity $(T+1)x/K$ is an integer, then the mle of $M$
is also given by (Johnson et al., 1992, p.263):

$$\hat{M}_{mle} = \frac{(T + 1) x}{K} - 1$$

which is what the function ehyper
uses for this case.
The minimum variance unbiased estimator (mvue) of $M$ is given by
(Forbes et al., 2011):

$$\hat{M}_{mvue} = \frac{T x}{K}$$
Estimating T, given M and K are known
When $M$ and $K$ are known, the maximum likelihood estimator (mle) of $T$
is given by (Forbes et al., 2011):

$$\hat{T}_{mle} = \lfloor K M / x \rfloor$$
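A small R sketch of these point estimators (floor-function formulas; the numbers mirror the EXAMPLES section below, where x = 1 white ball is seen in k = 5 draws):

x <- 1; k <- 5
T.total <- 40                            # total number of balls, when known
M.mle  <- floor((T.total + 1) * x / k)   # 8, as in the first example below
M.mvue <- T.total * x / k                # minimum variance unbiased estimator
M.known <- 10                            # number of white balls, when known
T.mle <- floor(k * M.known / x)          # 50, so n = T - M = 40 in the second example
c(M.mle = M.mle, M.mvue = M.mvue, T.mle = T.mle)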
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The hypergeometric distribution can be described by
an urn model with $M$ white balls and $N$ black balls. If $K$ balls
are drawn with replacement, then the number of white balls in the sample
of size $K$ follows a binomial distribution with
parameters size=$K$ and prob=$M/(M+N)$. If $K$ balls are
drawn without replacement, then the number of white balls in the sample of
size $K$ follows a hypergeometric distribution
with parameters m=$M$, n=$N$, and k=$K$.
The name “hypergeometric” comes from the fact that the probabilities associated with this distribution can be written as successive terms in the expansion of a function of a Gaussian hypergeometric series.
The hypergeometric distribution is applied in a variety of fields, including quality control and estimation of animal population size. It is also the distribution used to compute probabilities for Fisher's exact test for a 2x2 contingency table.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 6.
# Generate an observation from a hypergeometric distribution with # parameters m=10, n=30, and k=5, then estimate the parameter m. # Note: the call to set.seed simply allows you to reproduce this example. # Also, the only parameter actually estimated is m; once m is estimated, # n is computed by subtracting the estimated value of m (8 in this example) # from the given of value of m+n (40 in this example). The parameters # n and k are shown in the output in order to provide information on # all of the parameters associated with the hypergeometric distribution. set.seed(250) dat <- rhyper(nn = 1, m = 10, n = 30, k = 5) dat #[1] 1 ehyper(dat, total = 40, k = 5) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Hypergeometric # #Estimated Parameter(s): m = 8 # n = 32 # k = 5 # #Estimation Method: mle for 'm' # #Data: dat # #Sample Size: 1 #---------- # Use the same data as in the previous example, but estimate m+n instead. # Note: The only parameter estimated is m+n. Once this is estimated, # n is computed by subtracting the given value of m (10 in this case) # from the estimated value of m+n (50 in this example). ehyper(dat, m = 10, k = 5) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Hypergeometric # #Estimated Parameter(s): m = 10 # n = 40 # k = 5 # #Estimation Method: mle for 'm+n' # #Data: dat # #Sample Size: 1 #---------- # Clean up #--------- rm(dat)
Estimate the mean and standard deviation parameters of the logarithm of a lognormal distribution, and optionally construct a confidence interval for the mean.
elnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "exact", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
|
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean or variance. The only possible value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $X$ denote a random variable with a
lognormal distribution with
parameters meanlog=$\mu$ and sdlog=$\sigma$. Then $Y = \log(X)$
has a normal (Gaussian) distribution with
parameters mean=$\mu$ and sd=$\sigma$. Thus, the function
elnorm
simply calls the function enorm
using the
log-transformed values of x
.
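Because of this relationship, the same estimates (and the same confidence interval for the mean of the log-transformed distribution) can be obtained by applying enorm to the log-transformed data; for example, using the Reference-area TcCB data from the EXAMPLES section below:

with(EPA.94b.tccb.df,
  enorm(log(TcCB[Area == "Reference"]), ci = TRUE))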
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The normal and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean or variance. This is done with confidence intervals.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, Chapter 5.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 2.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Limpert, E., W.A. Stahel, and M. Abbt. (2001). Log-Normal Distributions Across the Sciences: Keys and Clues. BioScience 51, 341–352.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Lognormal, LognormalAlt, Normal.
# Using the Reference area TcCB data in the data frame EPA.94b.tccb.df, # estimate the mean and standard deviation of the log-transformed distribution, # and construct a 95% confidence interval for the mean. with(EPA.94b.tccb.df, elnorm(TcCB[Area == "Reference"], ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.6195712 # sdlog = 0.4679530 # #Estimation Method: mvue # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.7569673 # UCL = -0.4821751
Estimate the mean, standard deviation, and threshold parameters for a three-parameter lognormal distribution, and optionally construct a confidence interval for the threshold or the median of the distribution.
elnorm3(x, method = "lmle", ci = FALSE, ci.parameter = "threshold", ci.method = "avar", ci.type = "two-sided", conf.level = 0.95, threshold.lb.sd = 100, evNormOrdStats.method = "royston")
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are:
See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for either
the threshold or median of the distribution. The default value is |
ci.parameter |
character string indicating the parameter for which the confidence interval is
desired. The possible values are |
ci.method |
character string indicating the method to use to construct the confidence interval.
The possible values are |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
threshold.lb.sd |
a positive numeric scalar specifying the range over which to look for the
local maximum likelihood ( |
evNormOrdStats.method |
character string indicating which method to use in the call to
|
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $X$ denote a random variable from a
three-parameter lognormal distribution with
parameters meanlog=$\mu$, sdlog=$\sigma$, and threshold=$\gamma$. Let
$\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$
observations from this distribution. Furthermore, let $x_{(i)}$ denote
the $i$'th order statistic in the sample, so that $x_{(1)}$ denotes the
smallest value and $x_{(n)}$ denotes the largest value in $\underline{x}$.
Finally, denote the sample mean and variance by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$$

Note that the sample variance is the unbiased version. Denote the method of moments estimator of variance by:

$$s^2_m = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$$
Estimation
Local Maximum Likelihood Estimation (method="lmle"
)
Hill (1963) showed that the likelihood function approaches infinity as
approaches
, so that the global maximum likelihood estimators of
are
, which are
inadmissible, since
must be smaller than
. Cohen (1951)
suggested using local maximum likelihood estimators (lmle's), derived by equating
partial derivatives of the log-likelihood function to zero. These estimators were
studied by Harter and Moore (1966), Calitz (1973), Cohen and Whitten (1980), and
Griffiths (1980), and appear to possess most of the desirable properties ordinarily
associated with maximum likelihood estimators.
Cohen (1951) showed that the lmle of is given by the solution to the
following equation:
where
and that the lmle's of and
then follow as:
Unfortunately, while equation (4) simplifies the task of computing the lmle's,
for certain data sets there still may be convergence problems (Calitz, 1973), and
occasionally multiple roots of equation (4) may exist. When multiple roots to
equation (4) exist, Cohen and Whitten (1980) recommend using the one that results
in closest agreement between the mle of (equation (7)) and the sample
mean (equation (1)).
On the other hand, Griffiths (1980) showed that for a given value of the threshold
parameter , the maximized value of the log-likelihood (the
“profile likelihood” for
) is given by:
where the estimates of and
are defined in equations (7)
and (8), so the lmle of
reduces to an iterative search over the values
of
. Griffiths (1980) noted that the distribution of the lmle of
is far from normal and that
is not quadratic
near the lmle of
. He suggested a better parameterization based on
Thus, once the lmle of is found using equations (9) and (10), the lmle of
is given by:
When method="lmle"
, the function elnorm3
uses the function
nlminb
to search for the minimum of , using the
modified method of moments estimator (
method="mmme"
; see below) as the
starting value for . Equation (11) is then used to solve for the
lmle of
, and equation (4) is used to “fine tune” the estimated
value of
. The lmle's of
and
are then computed
using equations (6)-(8).
Method of Moments Estimation (method="mme"
)
Denote the 'th sample central moment by:
and note that
Equating the sample first moment (the sample mean) with its population value (the population mean), and equating the second and third sample central moments with their population values yields (Johnson et al., 1994, p.228):
where
Combining equations (15) and (16) yields:
The quantity on the left-hand side of equation (19) is the usual estimator of
skewness. Solving equation (19) for yields:
where
Using equation (18), the method of moments estimator of is then
computed as:
Combining equations (15) and (17), the method of moments estimator of
is computed as:
Finally, using equations (14), (17), and (18), the method of moments estimator of
is computed as:
There are two major problems with using method of moments estimators for the
three-parameter lognormal distribution. First, they are subject to very large
sampling error due to the use of second and third sample moments
(Cohen, 1988, p.121; Johnson et al., 1994, p.228). Second, Heyde (1963) showed
that the lognormal distribution is not uniquely determined by its moments.
Method of Moments Estimators Using an Unbiased Estimate of Variance (method="mmue"
)
This method of estimation is exactly the same as the method of moments
(method="mme"
), except that the unbiased estimator of variance (equation (3))
is used in place of the method of moments one (equation (4)). This modification is
given in Cohen (1988, pp.119-120).
Modified Method of Moments Estimation (method="mmme"
)
This method of estimation is described by Cohen (1988, pp.125-132). It was
introduced by Cohen and Whitten (1980; their MME-II with r=1) and was further
investigated by Cohen et al. (1985). It is motivated by the fact that the first
order statistic in the sample, , contains more information about
the threshold parameter
than any other observation and often more
information than all of the other observations combined (Cohen, 1988, p.125).
The first two sets of equations are the same as for the method of moments
estimators based on the unbiased estimate of variance (method="mmue"
), i.e., equations (14) and (15) with the
unbiased estimator of variance (equation (3)) used in place of the method of
moments one (equation (4)). The third equation replaces equation (16)
by equating a function of the first order statistic with its expected value:
where denotes the expected value of the
'th order
statistic in a random sample of
observations from a standard normal
distribution. (See the help file for
evNormOrdStats
for information
on how is computed.) Using equations (17) and (18),
equation (26) can be rewritten as:
Combining equations (14), (15), (17), (18), and (27) yields the following equation
for the estimate of :
After equation (28) is solved for , the estimate of
is again computed using equation (23), and the estimate of
is computed
using equation (24), where the unbiased estimate of variance is used in place of
the biased one (just as for
method="mmue"
).
Zero-Skewness Estimation (method="zero.skew"
)
This method of estimation was introduced by Griffiths (1980), and elaborated upon
by Royston (1992b). The idea is that if the threshold parameter $\gamma$ were
known, then the distribution of:

$$Y = \log(X - \gamma)$$

is normal, so the skew of $Y$ is 0. Thus, the threshold parameter $\gamma$
is estimated as that value that forces the sample skew (defined in equation (19)) of
the observations defined in equation (6) to be 0. That is, the zero-skewness
estimator of $\gamma$ is the value that satisfies the following equation:

$$0 = \frac{\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^3}{\left[\frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2\right]^{3/2}} \;\;\;\;\;\; (30)$$

where

$$y_i = \log(x_i - \gamma), \;\; i = 1, 2, \ldots, n \;\;\;\;\;\; (31)$$

Note that since the denominator in equation (30) is always positive (assuming
there are at least two unique values in $\underline{x}$), only the numerator
needs to be used to determine the value of $\gamma$.
Once the value of $\gamma$ has been determined, $\mu$ and $\sigma$
are estimated using equations (7) and (8), except the unbiased estimator of variance
is used in equation (8).
Royston (1992b) developed a modification of the Shapiro-Wilk goodness-of-fit test
for normality based on transforming the data using equation (6) and the zero-skewness
estimator of $\gamma$ (see
gofTest
).
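The zero-skewness idea is easy to prototype; the following R sketch (not EnvStats' implementation, and assuming the skewness changes sign over the chosen search interval) finds the threshold with uniroot and then estimates meanlog and sdlog from the log-transformed, shifted data:

skew.num <- function(y) mean((y - mean(y))^3)   # numerator of the sample skew
zero.skew.threshold <- function(x) {
  uniroot(function(gamma) skew.num(log(x - gamma)),
    lower = min(x) - 100 * sd(x),
    upper = min(x) - .Machine$double.eps^0.25)$root
}
set.seed(20)
x <- rlnorm3(50, meanlog = 1, sdlog = 0.5, threshold = 10)
gamma.hat <- zero.skew.threshold(x)
c(threshold = gamma.hat, meanlog = mean(log(x - gamma.hat)),
  sdlog = sd(log(x - gamma.hat)))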
Estimators Based on Royston's Index of Skewness (method="royston.skew"
)
This method of estimation is discussed by Royston (1992b), and is similar to the
zero-skewness method discussed above, except a different measure of skewness is used.
Royston's (1992b) index of skewness is given by:
where denotes the
'th order statistic of
and
is defined in equation (31) above, and
denotes the median of
.
Royston (1992b) shows that the value of
that yields a value of
is given by:
Again, as for the zero-skewness method, once the value of has
been determined,
and
are estimated using equations (7) and (8),
except the unbiased estimator of variance is used in equation (8).
Royston (1992b) developed this estimator as a quick way to estimate .
Confidence Intervals
This section explains three different methods for constructing confidence intervals
for the threshold parameter , or the median of the three-parameter
lognormal distribution, which is given by:
Normal Approximation Based on Asymptotic Variances and Covariances (ci.method="avar"
)
Formulas for asymptotic variances and covariances for the three-parameter lognormal
distribution, based on the information matrix, are given in Cohen (1951), Cohen and
Whitten (1980), Cohen et al., (1985), and Cohen (1988). The relevant quantities for
and the median are:
where
A two-sided confidence interval for
is computed as:
where denotes the
'th quantile of
Student's t-distribution with
degrees of freedom, and the
quantity
is computed using equations (35) and (38)
and substituting estimated values of
,
, and
.
One-sided confidence intervals are computed in a similar manner.
A two-sided confidence interval for the median (see equation
(34) above) is computed as:
where
is computed using equations (35)-(38) and substituting estimated values of
,
, and
. One-sided confidence intervals are
computed in a similar manner.
This method of constructing confidence intervals is analogous to using the Wald test (e.g., Silvey, 1975, pp.115-118) to test hypotheses on the parameters.
Because of the regularity problems associated with the global maximum likelihood estimators, it is questionable whether the asymptotic variances and covariances shown above apply to local maximum likelihood estimators. Simulation studies, however, have shown that these estimates of variance and covariance perform reasonably well (Harter and Moore, 1966; Cohen and Whitten, 1980).
Note that this method of constructing confidence intervals can be used with
estimators other than the lmle's. Cohen and Whitten (1980) and Cohen et al. (1985)
found that the asymptotic variances and covariances are reasonably close to
corresponding simulated variances and covariances for the modified method of moments
estimators (method="mmme"
).
Likelihood Profile (ci.method="likelihood.profile"
)
Griffiths (1980) suggested constructing confidence intervals for the threshold
parameter based on the profile likelihood function given in equations
(9) and (10). Royston (1992b) further elaborated upon this procedure. A
two-sided
confidence interval for
is constructed as:
by finding the two values of (one larger than the lmle of
and
one smaller than the lmle of
) that satisfy:
where denotes the
'th quantile of the
chi-square distribution with
degrees of freedom.
Once these values are found, the two-sided confidence for
is computed as:
where
One-sided intervals are constructed in a similar manner.
This method of constructing confidence intervals is analogous to using the likelihood-ratio test (e.g., Silvey, 1975, pp.108-115) to test hypotheses on the parameters.
To construct a two-sided confidence interval for the median
(see equation (34)), Royston (1992b) suggested the following procedure:
Construct a confidence interval for using the likelihood
profile procedure.
Construct a confidence interval for as:
Construct the confidence interval for the median as:
Royston (1992b) actually suggested using the quantile from the standard normal
distribution instead of Student's t-distribution in step 2 above. The function
elnorm3
, however, uses the Student's t quantile.
Note that this method of constructing confidence intervals can be used with
estimators other than the lmle's.
Royston's Confidence Interval Based on Significant Skewness (ci.method="skewness"
)
Royston (1992b) suggested constructing confidence intervals for the threshold
parameter based on the idea behind the zero-skewness estimator
(
method="zero.skew"
). A two-sided confidence interval
for
is constructed by finding the two values of
that yield
a p-value of
for the test of zero-skewness on the observations
defined in equation (6) (see
gofTest
). One-sided
confidence intervals are constructed in a similar manner.
To construct confidence intervals for the median
(see equation (34)), the exact same procedure is used as for
ci.method="likelihood.profile"
, except that the confidence interval for
is based on the zero-skewness method just described instead of the
likelihood profile method.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The problem of estimating the parameters of a three-parameter lognormal distribution has been extensively discussed by Aitchison and Brown (1957, Chapter 6), Calitz (1973), Cohen (1951), Cohen (1988), Cohen and Whitten (1980), Cohen et al. (1985), Griffiths (1980), Harter and Moore (1966), Hill (1963), and Royston (1992b). Stedinger (1980) and Hoshi et al. (1984) discuss fitting the three-parameter lognormal distribution to hydrologic data.
The global maximum likelihood estimates are inadmissible. In the past, several
researchers have found that the local maximum likelihood estimates (lmle's)
occasionally fail because of convergence problems, but they were not using the
likelihood profile and reparameterization of Griffiths (1980). Cohen (1988)
recommends the modified method of moments estimators over lmle's because they are
easy to compute, they are unbiased with respect to the mean and standard deviation
on the log-scale, their variances are minimal or near minimal, and they do not
suffer from regularity problems.
Because the distribution of the lmle of the threshold parameter is far
from normal for moderate sample sizes (Griffiths, 1980), it is questionable whether
confidence intervals for the threshold or the median based on asymptotic variances
and covariances will perform well. Cohen and Whitten (1980) and Cohen et al. (1985),
however, found that the asymptotic variances and covariances are reasonably close to
corresponding simulated variances and covariances for the modified method of moments
estimators (method="mmme"). In a simulation study (5,000 Monte Carlo trials),
Royston (1992b) found that the coverage of confidence intervals for the threshold
based on the likelihood profile (ci.method="likelihood.profile") was very
close to the nominal level (94.1% for a nominal level of 95%), although not
symmetric. Royston (1992b) also found that the coverage of confidence intervals
for the threshold based on the skewness method (ci.method="skewness") was also
very close (95.4%) and symmetric.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, Chapter 5.
Calitz, F. (1973). Maximum Likelihood Estimation of the Parameters of the Three-Parameter Lognormal Distribution–a Reconsideration. Australian Journal of Statistics 15(3), 185–190.
Cohen, A.C. (1951). Estimating Parameters of Logarithmic-Normal Distributions by Maximum Likelihood. Journal of the American Statistical Association 46, 206–212.
Cohen, A.C. (1988). Three-Parameter Estimation. In Crow, E.L., and K. Shimizu, eds. Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 4.
Cohen, A.C., and B.J. Whitten. (1980). Estimation in the Three-Parameter Lognormal Distribution. Journal of the American Statistical Association 75, 399–404.
Cohen, A.C., B.J. Whitten, and Y. Ding. (1985). Modified Moment Estimation for the Three-Parameter Lognormal Distribution. Journal of Quality Technology 17, 92–99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 2.
Griffiths, D.A. (1980). Interval Estimation for the Three-Parameter Lognormal Distribution via the Likelihood Function. Applied Statistics 29, 58–68.
Harter, H.L., and A.H. Moore. (1966). Local-Maximum-Likelihood Estimation of the Parameters of Three-Parameter Lognormal Populations from Complete and Censored Samples. Journal of the American Statistical Association 61, 842–851.
Heyde, C.C. (1963). On a Property of the Lognormal Distribution. Journal of the Royal Statistical Society, Series B 25, 392–393.
Hill, B.M. (1963). The Three-Parameter Lognormal Distribution and Bayesian Analysis of a Point-Source Epidemic. Journal of the American Statistical Association 58, 72–84.
Hoshi, K., J.R. Stedinger, and J. Burges. (1984). Estimation of Log-Normal Quantiles: Monte Carlo Results and First-Order Approximations. Journal of Hydrology 71, 1–30.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897–912.
Stedinger, J.R. (1980). Fitting Lognormal Distributions to Hydrologic Data. Water Resources Research 16(3), 481–490.
Lognormal3, Lognormal, LognormalAlt, Normal.
# Generate 20 observations from a 3-parameter lognormal distribution # with parameters meanlog=1.5, sdlog=1, and threshold=10, then use # Cohen and Whitten's (1980) modified moments estimators to estimate # the parameters, and construct a confidence interval for the # threshold based on the estimated asymptotic variance. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnorm3(20, meanlog = 1.5, sdlog = 1, threshold = 10) elnorm3(dat, method = "mmme", ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: 3-Parameter Lognormal # #Estimated Parameter(s): meanlog = 1.5206664 # sdlog = 0.5330974 # threshold = 9.6620403 # #Estimation Method: mmme # #Data: dat # #Sample Size: 20 # #Confidence Interval for: threshold # #Confidence Interval Method: Normal Approximation # Based on Asymptotic Variance # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 6.985258 # UCL = 12.338823 #---------- # Repeat the above example using the other methods of estimation # and compare. round(elnorm3(dat, "lmle")$parameters, 1) #meanlog sdlog threshold # 1.3 0.7 10.5 round(elnorm3(dat, "mme")$parameters, 1) #meanlog sdlog threshold # 2.1 0.3 6.0 round(elnorm3(dat, "mmue")$parameters, 1) #meanlog sdlog threshold # 2.2 0.3 5.8 round(elnorm3(dat, "mmme")$parameters, 1) #meanlog sdlog threshold # 1.5 0.5 9.7 round(elnorm3(dat, "zero.skew")$parameters, 1) #meanlog sdlog threshold # 1.3 0.6 10.3 round(elnorm3(dat, "royston")$parameters, 1) #meanlog sdlog threshold # 1.4 0.6 10.1 #---------- # Compare methods for computing a two-sided 95% confidence interval # for the threshold: # modified method of moments estimator using asymptotic variance, # lmle using asymptotic variance, # lmle using likelihood profile, and # zero-skewness estimator using the skewness method. elnorm3(dat, method = "mmme", ci = TRUE, ci.method = "avar")$interval$limits # LCL UCL # 6.985258 12.338823 elnorm3(dat, method = "lmle", ci = TRUE, ci.method = "avar")$interval$limits # LCL UCL # 9.017223 11.980107 elnorm3(dat, method = "lmle", ci = TRUE, ci.method="likelihood.profile")$interval$limits # LCL UCL # 3.699989 11.266029 elnorm3(dat, method = "zero.skew", ci = TRUE, ci.method = "skewness")$interval$limits # LCL UCL #-25.18851 11.18652 #---------- # Now construct a confidence interval for the median of the distribution # based on using the modified method of moments estimator for threshold # and the asymptotic variances and covariances. Note that the true median # is given by threshold + exp(meanlog) = 10 + exp(1.5) = 14.48169. 
elnorm3(dat, method = "mmme", ci = TRUE, ci.parameter = "median") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: 3-Parameter Lognormal # #Estimated Parameter(s): meanlog = 1.5206664 # sdlog = 0.5330974 # threshold = 9.6620403 # #Estimation Method: mmme # #Data: dat # #Sample Size: 20 # #Confidence Interval for: median # #Confidence Interval Method: Normal Approximation # Based on Asymptotic Variance # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 11.20541 # UCL = 17.26922 #---------- # Compare methods for computing a two-sided 95% confidence interval # for the median: # modified method of moments estimator using asymptotic variance, # lmle using asymptotic variance, # lmle using likelihood profile, and # zero-skewness estimator using the skewness method. elnorm3(dat, method = "mmme", ci = TRUE, ci.parameter = "median", ci.method = "avar")$interval$limits # LCL UCL #11.20541 17.26922 elnorm3(dat, method = "lmle", ci = TRUE, ci.parameter = "median", ci.method = "avar")$interval$limits # LCL UCL #12.28326 15.87233 elnorm3(dat, method = "lmle", ci = TRUE, ci.parameter = "median", ci.method = "likelihood.profile")$interval$limits # LCL UCL # 6.314583 16.165525 elnorm3(dat, method = "zero.skew", ci = TRUE, ci.parameter = "median", ci.method = "skewness")$interval$limits # LCL UCL #-22.38322 16.33569 #---------- # Clean up #--------- rm(dat)
Estimate the mean and coefficient of variation of a lognormal distribution, and optionally construct a confidence interval for the mean.
elnormAlt(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "land", conf.level = 0.95, parkin.list = NULL)
elnormAlt(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "land", conf.level = 0.95, parkin.list = NULL)
x |
numeric vector of positive observations. |
method |
character string specifying the method of estimation. Possible values are
|
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
parkin.list |
a list containing arguments for the function |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let x = (x1, x2, …, xn) be a vector of n observations from a lognormal distribution with parameters mean=θ and cv=τ. Let η denote the standard deviation of this distribution, so that η = θτ. Set y = log(x). Then y is a vector of n observations from a normal distribution with parameters mean=μ and sd=σ. See the help file for LognormalAlt for the relationship between θ, τ, η, μ, and σ.
Estimation
This section explains how each of the estimators of mean=θ and cv=τ is computed. The approach is to first compute estimates of θ and η² (the mean and variance of the lognormal distribution) and then compute the estimate of the cv as the ratio of the estimated standard deviation to the estimated mean.
Minimum Variance Unbiased Estimation (method="mvue"
)
The minimum variance unbiased estimators (mvue's) of θ and η² were derived
by Finney (1941) and are discussed in Gilbert (1987, pp. 164-167) and Cohn et al. (1989).
These estimators are computed from the sample mean and variance of the log-transformed
observations using Finney's g function. The expected value and variance of the mvue of θ
are given by Bradu and Mundlak (1970) and Cohn et al. (1989).
Maximum Likelihood Estimation (method="mle"
)
The maximum likelihood estimators (mle's) of θ and η² are obtained by replacing μ and σ²
with their mle's in the expressions for the lognormal mean and variance, so the mle of θ
is the antilog of the mle of μ plus half the mle of σ². The expected value and variance of
the mle of θ are given by Cohn et al. (1989). As can be seen from equation (12), the
expected value of the mle of θ does not exist when σ² is too large relative to the sample
size n; in general, higher-order moments of the mle of θ fail to exist for even smaller
values of σ².
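As a concrete sketch of this relationship, the code below computes the mle of the mean and cv directly from the log-scale mle's, using the standard lognormal moment formulas (mean = exp(meanlog + sdlog^2/2), cv = sqrt(exp(sdlog^2) - 1)); it should agree with elnormAlt(..., method = "mle") under these assumptions.

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
y <- log(x)
mu.hat <- mean(y)
sigma2.hat <- mean((y - mu.hat)^2)         # mle of the log-scale variance (divisor n)
c(mean = exp(mu.hat + sigma2.hat / 2),     # mle of the lognormal mean
  cv   = sqrt(exp(sigma2.hat) - 1))        # mle of the coefficient of variation
elnormAlt(x, method = "mle")$parameters    # compare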
Quasi Maximum Likelihood Estimation (method="qmle"
)
The quasi maximum likelihood estimators (qmle's; Cohn et al., 1989; Gilbert, 1987, p.167) of
θ and η² are the same as the mle's, except that the mle of σ² in equations (8) and (10) is
replaced with the more commonly used mvue of σ² shown in equation (4) (i.e., the usual
unbiased sample variance of the log-transformed observations).
The expected value and variance of the qmle of θ are given by Cohn et al. (1989). As can be
seen from equation (17), the expected value of the qmle of θ does not exist when σ² is too
large relative to the sample size n; in general, higher-order moments of the qmle of θ fail
to exist for even smaller values of σ².
Note that Gilbert (1987, p. 167) incorrectly presents equation (12) rather than
equation (17) as the expected value of the qmle of θ. For large values of n relative to σ²,
however, equations (12) and (17) are virtually identical.
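A minimal sketch of the qmle under the description above (i.e., the mle formulas with the unbiased log-scale variance substituted):

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
y <- log(x)
s2 <- var(y)                               # unbiased (divisor n-1) log-scale variance
c(mean = exp(mean(y) + s2 / 2),            # qmle of the lognormal mean
  cv   = sqrt(exp(s2) - 1))                # qmle of the coefficient of variation
elnormAlt(x, method = "qmle")$parameters   # compare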
Method of Moments Estimation (method="mme"
)
The method of moments estimators (mme's) of θ and η² are found by equating
the sample mean and variance with their population values: the mme of θ is the sample mean,
and the mme of η² is the sample variance computed with divisor n.
Note that the estimator of variance in equation (20) is biased.
The expected value and variance of the mme of θ are the usual ones for a sample mean
(expected value θ and variance η²/n).
Method of Moments Estimation Based on the Unbiased Estimate of Variance (method="mmue"
)
These estimators are exactly the same as the method of moments estimators described above, except
that the usual unbiased estimate of variance (divisor n-1) is used.
Since the mmue of θ is equivalent to the mme of θ (both are the sample mean), so are its mean and variance.
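A short sketch of the moment estimators as defined above (sample mean for the mean; divisor-n variance for "mme" and divisor-(n-1) variance for "mmue"):

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
s2.mme  <- mean((x - mean(x))^2)    # biased (divisor n) sample variance
s2.mmue <- var(x)                   # unbiased (divisor n-1) sample variance
rbind(mme  = c(mean = mean(x), cv = sqrt(s2.mme)  / mean(x)),
      mmue = c(mean = mean(x), cv = sqrt(s2.mmue) / mean(x)))
elnormAlt(x, method = "mme")$parameters    # compare
elnormAlt(x, method = "mmue")$parameters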
Confidence Intervals
This section explains the different methods for constructing confidence intervals
for , the mean of the lognormal distribution.
Land's Method (ci.method="land"
)
Land (1971, 1975) derived a method for computing one-sided (lower or upper)
uniformly most accurate unbiased confidence intervals for θ. A
two-sided confidence interval can be constructed by combining an optimal lower
confidence limit with an optimal upper confidence limit. This procedure for
two-sided confidence intervals is only asymptotically optimal, but for most
purposes should be acceptable (Land, 1975, p.387).
As shown in equation (3) in the help file for LognormalAlt, the mean θ of a lognormal
random variable is related to the mean μ and standard deviation σ of the log-transformed
random variable by the relationship θ = exp(β), where β = μ + σ²/2. Land (1971) developed
confidence bounds for the quantity β; the mvue of β is the sample mean of the
log-transformed observations plus half the mvue of σ² shown in equation (4).
The two-sided confidence interval for β, as well as the one-sided upper and one-sided
lower intervals, are computed from the mvue of β, the estimate of σ² (see equation (4)
above), and a factor given in tables in Land (1975). Thus, by equations (25)-(30), the
two-sided and one-sided confidence intervals for θ are obtained by exponentiating the
corresponding limits for β.
Note that Gilbert (1987, pp. 169-171, 264-265) denotes this factor by H and reproduces
a subset of Land's (1975) tables. Some guidance documents (e.g., USEPA, 1992d) refer to
this quantity as the H-statistic.
Zou et al.'s Method (ci.method="zou"
)
Zou et al. (2009) proposed the following approximation for the two-sided
confidence interval for θ. The lower and upper limits (equations (34) and (35)) are
computed from the sample mean and variance of the log-transformed observations together
with a quantile of the standard normal distribution and quantiles of the chi-square
distribution with n-1 degrees of freedom. The one-sided lower confidence
limit and one-sided upper confidence limit are given by equations (34) and (35),
respectively, with α/2 replaced by α.
Parkin et al.'s Method (ci.method="parkin"
)
This method was developed by Parkin et al. (1990). It can be shown that the
mean of a lognormal distribution corresponds to the p'th quantile of the distribution,
where p = Φ(σ/2) (equation (36)) and Φ denotes the cumulative distribution function of
the standard normal distribution. Parkin et al. (1990) suggested estimating p by
replacing σ in equation (36) with the estimate as computed in equation (4). Once an
estimate of p is obtained, a nonparametric confidence interval can be constructed for
the p'th quantile, assuming p is equal to its estimated value (see
eqnpar
).
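A hedged sketch of this idea is shown below; the nonparametric interval method assumed here (the eqnpar defaults) may differ from the one elnormAlt uses internally, so the limits may not match the ci.method="parkin" output exactly.

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
sigma.hat <- sd(log(x))          # estimate of the log-scale standard deviation
p.hat <- pnorm(sigma.hat / 2)    # estimated probability associated with the mean
eqnpar(x, p = p.hat, ci = TRUE)$interval$limits   # nonparametric CI for that quantile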
Cox's Method (ci.method="cox"
)
This method was suggested by Professor D.R. Cox and is illustrated in Land (1972).
El-Shaarawi (1989) adapts this method to the case of censored water quality data.
Cox's idea is to construct an approximate confidence interval for the quantity β
defined in equation (26) above assuming the estimate of β is approximately normally
distributed, and then exponentiate the confidence limits. That is, a two-sided
confidence interval for θ is constructed by exponentiating the limits of the interval
formed by the estimate of β plus or minus the appropriate quantile of Student's
t-distribution times the estimated standard deviation of the estimator of β.
Note that this method, unlike the normal approximation method discussed below,
guarantees a positive value for the lower confidence limit. One-sided confidence
intervals are computed in a similar fashion.
An estimator of β can be defined as the estimate of μ plus half the estimate of σ², and
its variance follows from the variances and covariance of those two estimates. The function
elnormAlt
follows Land (1972) and uses the minimum variance unbiased estimator for β shown in
equation (27) above, so the variance and estimated variance of this estimator follow
accordingly (equation (41)). Note that El-Shaarawi (1989, equation 5) simply replaces the
value of σ² in equation (41) with some estimator of σ² (the mle or mvue of σ²), rather than
using the mvue of the variance of the estimator of β as shown in equation (41).
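A hedged sketch of the Cox-type interval: form an approximate interval for the quantity beta = meanlog + sdlog^2/2 on the log scale and exponentiate. The simple plug-in standard error used here is a common large-sample form, not the mvue-based variance that elnormAlt uses, so the limits will differ slightly from the ci.method="cox" output shown in the EXAMPLES section.

x <- with(EPA.94b.tccb.df, TcCB[Area == "Reference"])
y <- log(x); n <- length(y); s2 <- var(y)
beta.hat <- mean(y) + s2 / 2                       # estimate of beta
se.beta <- sqrt(s2 / n + s2^2 / (2 * (n - 1)))     # plug-in standard error (assumption)
exp(beta.hat + c(-1, 1) * qt(0.975, n - 1) * se.beta)   # approximate two-sided 95% CI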
Normal Approximation (ci.method="normal.approx"
)
This method constructs approximate confidence intervals for θ based on the assumption
that the estimator of θ is approximately normally distributed. That is, a two-sided
confidence interval for θ is constructed as the estimate of θ plus or minus the
appropriate quantile times the estimated standard deviation of the estimator of θ
(equation (42)). One-sided confidence intervals are computed in a similar fashion.
When method="mvue"
is used to estimate θ, an unbiased estimate of the variance of the estimator of θ is
used in equation (42) (Bradu and Mundlak, 1970, equation 4.3; Gilbert, 1987, equation 13.5).
When method="mle"
is used to estimate θ, the estimate of the variance of the estimator of θ is computed
by replacing μ and σ² in equation (13) with their mle's.
When method="qmle"
is used to estimate θ, the estimate of the variance of the estimator of θ is computed
by replacing μ and σ² in equation (18) with their mvue's. Note that equation (45) is
exactly the same as Gilbert's (1987, p. 167) equation 13.8a, except for one quantity
that Gilbert (1987) presents erroneously; for large values of n relative to σ², however,
this makes little difference.
When method="mme"
, the estimate of the variance of the estimator of θ is computed by replacing η² in
equation (22) with the mme of η² defined in equation (20).
When method="mmue"
, the estimate of the variance of the estimator of θ is computed by replacing η² in
equation (22) with the mmue of η² defined in equation (24).
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The normal and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean or variance. This is done with confidence intervals.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
USEPA (1992d) directs persons involved in risk assessment for Superfund sites to
use Land's (1971, 1975) method (ci.method="land"
) for computing the upper
95% confidence interval for the mean, assuming the data follow a lognormal
distribution (the guidance document cites Gilbert (1987) as a source of descriptions
and tables for this method). The last example in the EXAMPLES section below
reproduces an example from this guidance document.
In the past, some authors suggested using the geometric mean, also called the
"rating curve" estimator (Cohn et al., 1989), as the estimator of the mean θ. This
estimator is the antilog of the sample mean of the log-transformed observations.
Cohn et al. (1989) cite several authors who have pointed out that this estimator is
biased and is not even a consistent estimator of the mean. In fact, it is the
maximum likelihood estimator of the median of the distribution
(see eqlnorm).
Finney (1941) computed the efficiency of the method of moments estimators of the
mean (θ) and variance (η²) of the lognormal distribution
(equations (19)-(20)) relative to the mvue's (equations (1)-(2)) as a function of
σ² (the variance of the log-transformed observations), and found that
while the mme of θ is reasonably efficient compared to the mvue of θ,
the mme of η² performs quite poorly relative to the mvue of η².
Cohn et al. (1989) and Parkin et al. (1988) have shown that the qmle and the mle of the mean can be severely biased for typical environmental data, and suggest always using the mvue.
Parkin et al. (1990) studied the performance of various methods for constructing a
confidence interval for the mean via Monte Carlo simulation. They compared
approximate methods to Land's optimal method (ci.method="land"
). They used
four parent lognormal distributions to generate observations; all had mean 10, but
differed in coefficient of variation: 50, 100, 200, and 500%. They also generated
sample sizes from 6 to 100 in increments of 2. For each combination of parent
distribution and sample size, they generated 25,000 Monte Carlo trials.
Parkin et al. found that for small sample sizes, none of the
approximate methods ("parkin", "cox", "normal.approx") worked
very well. For larger sample sizes, their method ("parkin") provided reasonably
accurate coverage. Cox's method ("cox") also worked well for larger sample sizes, and
performed slightly better than Parkin et al.'s method ("parkin") for highly
skewed populations.
Zou et al. (2009) used Monte Carlo simulation to compare the performance of their method with the CGI method of Krishnamoorthy and Mathew (2003) and the modified Cox method of Armstrong (1992) and El-Shaarawi and Lin (2007). Performance was assessed based on 1) percentage of times the interval contained the parameter value (coverage%), 2) balance between left and right tail errors, and 3) confidence interval width. All three methods showed acceptable coverage percentages. The modified Cox method showed unbalanced tail errors, and Zou et al.'s method showed consistently narrower average width.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, Chapter 5.
Armstrong, B.G. (1992). Confidence Intervals for Arithmetic Means of Lognormally Distributed Exposures. American Industrial Hygiene Association Journal 53, 481–485.
Bradu, D., and Y. Mundlak. (1970). Estimation in Lognormal Linear Models. Journal of the American Statistical Association 65, 198–211.
Cohn, T.A., L.L. DeLong, E.J. Gilroy, R.M. Hirsch, and D.K. Wells. (1989). Estimating Constituent Loads. Water Resources Research 25(5), 937–942.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 2.
El-Shaarawi, A.H., and J. Lin. (2007). Interval Estimation for Log-Normal Mean with Applications to Water Quality. Environmetrics 18, 1–10.
El-Shaarawi, A.H., and R. Viveros. (1997). Inference About the Mean in Log-Regression with Environmental Applications. Environmetrics 8, 569–582.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Finney, D.J. (1941). On the Distribution of a Variate Whose Logarithm is Normally Distributed. Supplement to the Journal of the Royal Statistical Society 7, 155–161.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Krishnamoorthy, K., and T.P. Mathew. (2003). Inferences on the Means of Lognormal Distributions Using Generalized p-Values and Generalized Confidence Intervals. Journal of Statistical Planning and Inference 115, 103–121.
Land, C.E. (1971). Confidence Intervals for Linear Functions of the Normal Mean and Variance. The Annals of Mathematical Statistics 42(4), 1187–1205.
Land, C.E. (1972). An Evaluation of Approximate Confidence Interval Estimation Methods for Lognormal Means. Technometrics 14(1), 145–158.
Land, C.E. (1973). Standard Confidence Limits for Linear Functions of the Normal Mean and Variance. Journal of the American Statistical Association 68(344), 960–963.
Land, C.E. (1975). Tables of Confidence Limits for Linear Functions of the Normal Mean and Variance, in Selected Tables in Mathematical Statistics, Vol. III. American Mathematical Society, Providence, RI, pp. 385–419.
Likes, J. (1980). Variance of the MVUE for Lognormal Variance. Technometrics 22(2), 253–258.
Limpert, E., W.A. Stahel, and M. Abbt. (2001). Log-Normal Distributions Across the Sciences: Keys and Clues. BioScience 51, 341–352.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Parkin, T.B., J.J. Meisinger, S.T. Chester, J.L. Starr, and J.A. Robinson. (1988). Evaluation of Statistical Estimation Methods for Lognormally Distributed Variables. Journal of the Soil Science Society of America 52, 323–329.
Parkin, T.B., S.T. Chester, and J.A. Robinson. (1990). Calculating Confidence Intervals for the Mean of a Lognormally Distributed Variable. Journal of the Soil Science Society of America 54, 321–326.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (1992d). Supplemental Guidance to RAGS: Calculating the Concentration Term. Publication 9285.7-081, May 1992. Intermittent Bulletin, Volume 1, Number 1. Office of Emergency and Remedial Response, Hazardous Site Evaluation Division, OS-230. Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zou, G.Y., C.Y. Huo, and J. Taleban. (2009). Simple Confidence Intervals for Lognormal Means and their Differences with Environmental Applications. Environmetrics 20, 172–180.
LognormalAlt, Lognormal, Normal.
# Using the Reference area TcCB data in the data frame EPA.94b.tccb.df, # estimate the mean and coefficient of variation, # and construct a 95% confidence interval for the mean. with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): mean = 0.5989072 # cv = 0.4899539 # #Estimation Method: mvue # #Data: TcCB[Area == "Reference"] # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Land # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.5243787 # UCL = 0.7016992 #---------- # Compare the different methods of estimating the distribution parameters using the # Reference area TcCB data. with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue"))$parameters # mean cv #0.5989072 0.4899539 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "qmle"))$parameters # mean cv #0.6004468 0.4947791 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mle"))$parameters # mean cv #0.5990497 0.4888968 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mme"))$parameters # mean cv #0.5985106 0.4688423 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mmue"))$parameters # mean cv #0.5985106 0.4739110 #---------- # Compare the different methods of constructing the confidence interval for # the mean using the Reference area TcCB data. with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "land"))$interval$limits # LCL UCL #0.5243787 0.7016992 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "zou"))$interval$limits # LCL UCL #0.5230444 0.6962071 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "parkin"))$interval$limits # LCL UCL #0.50 0.74 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "cox"))$interval$limits # LCL UCL #0.5196213 0.6938444 with(EPA.94b.tccb.df, elnormAlt(TcCB[Area == "Reference"], method = "mvue", ci = TRUE, ci.method = "normal.approx"))$interval$limits # LCL UCL #0.5130160 0.6847984 #---------- # Reproduce the example in Highlights 7 and 8 of USEPA (1992d). This example shows # how to compute the upper 95% confidence limit of the mean of a lognormal distribution # and compares it to the result of computing the upper 95% confidence limit assuming a # normal distribution. The data for this example are chromium concentrations (mg/kg) in # soil samples collected randomly over a Superfund site, and are stored in the data frame # EPA.92d.chromium.vec. # First look at the data EPA.92d.chromium.vec # [1] 10 13 20 36 41 59 67 110 110 136 140 160 200 230 1300 stripChart(EPA.92d.chromium.vec, ylab = "Chromium (mg/kg)") # Note there is one very large "outlier" (1300). 
# Perform a goodness-of-fit test to determine whether a lognormal distribution # is appropriate: gof.list <- gofTest(EPA.92d.chromium.vec, dist = 'lnormAlt') gof.list #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): mean = 159.855185 # cv = 1.493994 # #Estimation Method: mvue # #Data: EPA.92d.chromium.vec # #Sample Size: 15 # #Test Statistic: W = 0.9607179 # #Test Statistic Parameter: n = 15 # #P-value: 0.7048747 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. plot(gof.list, digits = 2) # The lognormal distribution seems to provide an adequate fit, although the largest # observation (1300) is somewhat suspect, and given the small sample size there is # not much power to detect any kind of mild deviation from a lognormal distribution. # Now compute the one-sided 95% upper confidence limit for the mean. # Note that the value of 502 mg/kg shown in Hightlight 7 of USEPA (1992d) is a bit # larger than the exact value of 496.6 mg/kg shown below. # This is simply due to rounding error. elnormAlt(EPA.92d.chromium.vec, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): mean = 159.855185 # cv = 1.493994 # #Estimation Method: mvue # #Data: EPA.92d.chromium.vec # #Sample Size: 15 # #Confidence Interval for: mean # #Confidence Interval Method: Land # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0 # UCL = 496.6282 # Now compare this result with the upper 95% confidence limit based on assuming # a normal distribution. Again note that the value of 325 mg/kg shown in # Hightlight 8 is slightly larger than the exact value of 320.3 mg/kg shown below. # This is simply due to rounding error. enorm(EPA.92d.chromium.vec, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 175.4667 # sd = 318.5440 # #Estimation Method: mvue # #Data: EPA.92d.chromium.vec # #Sample Size: 15 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 320.3304 #---------- # Clean up #--------- rm(gof.list)
Estimate the mean and coefficient of variation of a lognormal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
elnormAltCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ...)
elnormAltCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", ...)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
method |
character string specifying the method of estimation. For singly censored data, the possible values are: For multiply censored data, the possible values are: See the DETAILS section for more information. |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are
The confidence interval method See the DETAILS section for more information.
This argument is ignored if |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when |
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when |
... |
additional arguments to pass to other functions.
|
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let x = (x1, x2, …, xN) denote a vector of N observations from a lognormal distribution with parameters mean=θ and cv=τ. Let η denote the standard deviation of this distribution, so that η = θτ. Set y = log(x). Then y is a vector of N observations from a normal distribution with parameters mean=μ and sd=σ. See the help file for LognormalAlt for the relationship between θ, τ, η, μ, and σ.
Assume n (0 < n < N) of these N observations are known and c (c = N - n) of the
observations are all censored below (left-censored) or all censored above
(right-censored) at k fixed censoring levels T1, T2, …, Tk (k ≥ 1).
For the case when k ≥ 2, the data are said to be Type I
multiply censored. For the case when k = 1, set T = T1. If the data are left-censored
and all n known observations are greater than or equal to T, or if the data are
right-censored and all n known observations are less than or equal to T, then the data
are said to be Type I singly censored (Nelson, 1982, p.7), otherwise
they are considered to be Type I multiply censored.
Let cj denote the number of observations censored below or above censoring
level Tj for j = 1, 2, …, k, so that the cj sum to c.
Let x(1), x(2), …, x(N) denote the “ordered” observations,
where now “observation” means either the actual observation (for uncensored
observations) or the censoring level (for censored observations). For
right-censored data, if a censored observation has the same value as an
uncensored one, the uncensored observation should be placed first.
For left-censored data, if a censored observation has the same value as an
uncensored one, the censored observation should be placed first.
Note that in this case the quantity x(i) does not necessarily represent
the i'th “largest” observation from the (unknown) complete sample.
Finally, let Ω (omega) denote the set of n subscripts in the
“ordered” sample that correspond to uncensored observations.
ESTIMATION
This section explains how each of the estimators of mean=θ and cv=τ is computed. The approach is to first compute estimates of θ and η² (the mean and variance of the lognormal distribution) and then compute the estimate of the cv as the ratio of the estimated standard deviation to the estimated mean.
Maximum Likelihood Estimation (method="mle"
)
The maximum likelihood estimators of θ, τ, and η are computed from the maximum
likelihood estimators of μ and σ fitted to the censored log-transformed observations,
using the standard relationships between the lognormal parameters (θ, τ, η) and the
normal parameters (μ, σ). See the help file for
enormCensored
for information on how the mle's of μ and σ are computed.
Quasi Minimum Variance Unbiased Estimation Based on the MLE's (method="qmvue"
)
The maximum likelihood estimators of θ and η² are biased.
Even for complete (uncensored) samples these estimators are biased
(see equation (12) in the help file for
elnormAlt
).
The bias tends to 0 as the sample size increases, but it can be considerable for
small sample sizes.
(Cohn et al., 1989, demonstrate the bias for complete data sets.)
For the case of complete samples, the minimum variance unbiased estimators (mvue's)
of θ and η² were derived by Finney (1941) and are discussed in
Gilbert (1987, pp.164-167) and Cohn et al. (1989). These estimators are computed from
the sample mean and variance of the log-transformed observations using Finney's
g function (see the help file for elnormAlt).
For Type I censored samples, the quasi minimum variance unbiased estimators
(qmvue's) of θ and η² are computed using equations (6) and (7)
and estimating μ and σ with their mle's (see
elnormCensored
).
For singly censored data, this is apparently the LM method of Gilliom and Helsel
(1986, p.137) (it is not clear from their description on page 137 whether their
LM method is the straight method="mle"
described above or
method="qmvue"
described here). This method was also used by
Newman et al. (1989, p.915, equations 10-11).
For multiply censored data, this is apparently the MM method of Helsel and Cohn
(1988, p.1998). (It is not clear from their description on page 1998 and the
description in Gilliom and Helsel, 1986, page 137 whether Helsel and Cohn's (1988)
MM method is the straight method="mle"
described above or method="qmvue"
described here.)
Bias-Corrected Maximum Likelihood Estimation (method="bcmle"
)
This method was derived by El-Shaarawi (1989) and can be applied to complete or
censored data sets. For complete data, the exact relative bias of the mle of
the mean is given as:
(see equation (12) in the help file for elnormAlt
).
For the case of complete or censored data, El-Shaarawi (1989) proposed the following “bias-corrected” maximum likelihood estimator:
where
and denotes the asymptotic variance-covariance of the mle's of
and
, which is based on the observed information matrix, formulas for
which are given in Cohen (1991). El-Shaarawi (1989) does not propose a
bias-corrected estimator of the variance
, so the mle of
is computed when
method="bcmle"
.
Robust Regression on Order Statistics (method="rROS"
) or
Imputation Using Quantile-Quantile Regression (method ="impute.w.qq.reg"
)
This is the robust Regression on Order Statistics (rROS) method discussed in USEPA (2009)
and Helsel (2012). This method involves using quantile-quantile regression on the
log-transformed observations to fit a regression line (and thus initially estimate the
mean μ and standard deviation σ in log-space), imputing the log-transformed values of the
censored observations by predicting them from the regression equation, transforming the
log-scale imputed values back to the original scale, and then computing the method of
moments estimates of the mean and standard deviation based on the observed and imputed values.
The steps are:
Estimate μ and σ by computing the least-squares estimates in a regression of the
log-transformed uncensored observations on the normal scores of their plotting positions,
where the plotting position associated with the i'th largest value depends on a constant
a (0 ≤ a ≤ 1; the default value is 0.375), Φ denotes the cumulative distribution function
(cdf) of the standard normal distribution, and Ω denotes the set of subscripts associated
with the uncensored observations in the ordered sample. The plotting positions are
computed by calling the function
ppointsCensored
.
Compute the log-scale imputed values for the censored observations by predicting them
from the fitted regression line.
Retransform the log-scale imputed values by exponentiating them.
Compute the usual method of moments estimates of the mean and variance from the
uncensored observations and the imputed values (a simplified sketch of these steps is
given at the end of this subsection).
Note that the estimate of variance is actually the usual unbiased one (not the method of moments one) in the case of complete data.
For singly censored data, this method is discussed by Hashimoto and Trussell (1983), Gilliom and Helsel (1986), and El-Shaarawi (1989), and is referred to as the LR (Log-Regression) or Log-Probability Method.
For multiply censored data, this is the MR method of Helsel and Cohn (1988, p.1998).
They used it with the probability method of Hirsch and Stedinger (1987) and
Weibull plotting positions (i.e., prob.method="hirsch-stedinger"
and
plot.pos.con=0
).
The argument plot.pos.con
(see the entry for ... in the ARGUMENTS
section above) determines the value of the plotting positions computed in
equations (14) and (15) when method
equals "hirsch-stedinger"
or
"michael-schucany"
. The default value is plot.pos.con=0.375
.
See the help file for ppointsCensored
for more information.
The arguments lb.impute
and ub.impute
(see the entry for ... in
the ARGUMENTS section above) determine the lower and upper bounds for the
imputed values. Imputed values smaller than lb.impute
are set to this
value. Imputed values larger than ub.impute
are set to this value.
The default values are lb.impute=0
and ub.impute=Inf
.
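The following simplified, hypothetical sketch illustrates the rROS steps for left singly censored data. It uses crude Blom-type plotting positions for all observations instead of ppointsCensored, so its results will only roughly agree with elnormAltCensored(..., method = "rROS").

rROS.sketch <- function(x, censored, plot.pos.con = 0.375) {
  n <- length(x)
  # crude plotting positions (the real method calls ppointsCensored instead)
  pp <- (rank(x, ties.method = "first") - plot.pos.con) / (n - 2 * plot.pos.con + 1)
  z <- qnorm(pp)                                      # normal scores
  fit <- lm(log(x) ~ z, subset = !censored)           # q-q regression on uncensored obs
  y.cen <- coef(fit)[1] + coef(fit)[2] * z[censored]  # impute log-scale values
  x.new <- x
  x.new[censored] <- exp(y.cen)                       # back-transform imputed values
  c(mean = mean(x.new), cv = sd(x.new) / mean(x.new))
}
set.seed(47)
x <- rlnormAlt(25, mean = 10, cv = 1)
censored <- x < 5
x[censored] <- 5                                      # left-censor at 5
rROS.sketch(x, censored)
elnormAltCensored(x, censored, method = "rROS")$parameters   # compare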
Imputation Using Quantile-Quantile Regression Including the Censoring Level
(method ="impute.w.qq.reg.w.cen.level"
)
This method is only available for singly censored data. It was
proposed by El-Shaarawi (1989), who denoted it the Modified LR Method.
It is exactly the same method as imputation
using quantile-quantile regression (method="impute.w.qq.reg"
), except that
the quantile-quantile regression includes the censoring level: for left singly
censored data, a point corresponding to the censoring level is added to the plot
before fitting the least-squares line, and for right singly censored data the
analogous point is added before fitting the least-squares line.
Imputation Using Maximum Likelihood (method ="impute.w.mle"
)
This method is only available for singly censored data.
This is exactly the same method as robust Regression on Order Statistics (i.e.,
the same as using method="rROS"
or method="impute.w.qq.reg"
),
except that the maximum likelihood method (method="mle"
) is used to compute
the initial estimates of the mean and standard deviation.
In the context of lognormal data, this method is discussed
by El-Shaarawi (1989), which he denotes as the Modified Maximum Likelihood Method.
Setting Censored Observations to Half the Censoring Level (method="half.cen.level"
)
This method is applicable only to left censored data that is bounded below by 0.
This method involves simply replacing all the censored observations with half their
detection limit, and then computing the usual moment estimators of the mean and
variance. That is, all censored observations are imputed to be half the detection
limit, and then Equations (17) and (18) are used to estimate the mean and variance.
This method is included only to allow comparison of this method to other methods.
Setting left-censored observations to half the censoring level is not
recommended. In particular, El-Shaarawi and Esterby (1992) show that these
estimators are biased and inconsistent (i.e., the bias remains even as the sample
size increases).
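For comparison only (this substitution method is not recommended), the sketch below shows what the half-the-censoring-level substitution amounts to; the variance divisor assumed here (n-1) may differ from the one the function uses.

set.seed(47)
x <- rlnormAlt(25, mean = 10, cv = 1)
censored <- x < 5
x[censored] <- 5                    # left-censor at 5
x.sub <- x
x.sub[censored] <- 5 / 2            # replace censored values with half the censoring level
c(mean = mean(x.sub), cv = sd(x.sub) / mean(x.sub))
elnormAltCensored(x, censored, method = "half.cen.level")$parameters   # compare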
CONFIDENCE INTERVALS
This section explains how confidence intervals for the mean are
computed.
Likelihood Profile (ci.method="profile.likelihood"
)
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean while treating the coefficient of
variation
as a nuisance parameter.
For Type I left censored data, the likelihood function is given by:
where and
denote the probability density function (pdf) and
cumulative distribution function (cdf) of the population. That is,
where
and and
denote the pdf and cdf of the standard normal
distribution, respectively (Cohen, 1963; 1991, pp.6, 50). For left singly
censored data, equation (3) simplifies to:
Similarly, for Type I right censored data, the likelihood function is given by:
and for right singly censored data this simplifies to:
Following Stryhn and Christensen (2003), denote the maximum likelihood estimates
of the mean and coefficient of variation by .
The likelihood ratio test statistic (
) of the hypothesis
(where
is a fixed value) equals the
drop in
between the “full” model and the reduced model with
fixed at
, i.e.,
where is the maximum likelihood estimate of
for the
reduced model (i.e., when
). Under the null hypothesis,
the test statistic
follows a
chi-squared distribution with 1 degree of freedom.
Alternatively, we may
express the test statistic in terms of the profile likelihood function
for the mean
, which is obtained from the usual likelihood function by
maximizing over the parameter
, i.e.,
Then we have
A two-sided confidence interval for the mean
consists of all values of
for which the test is not significant at
level
:
where denotes the
'th quantile of the
chi-squared distribution with
degrees of freedom.
One-sided lower and one-sided upper confidence intervals are computed in a similar
fashion, except that the quantity
in Equation (30) is replaced with
.
Direct Normal Approximations (ci.method="delta"
or ci.method="normal.approx"
)
An approximate confidence interval for
can be
constructed assuming the distribution of the estimator of
is
approximately normally distributed. That is, a two-sided
confidence interval for
is constructed as:
where denotes the estimate of
,
denotes the estimated asymptotic standard
deviation of the estimator of
,
denotes the assumed sample
size for the confidence interval, and
denotes the
'th
quantile of Student's t-distribution with
degrees of freedom. One-sided confidence intervals are computed in a
similar fashion.
The argument ci.sample.size
determines the assumed sample size (see the entry for ... in the ARGUMENTS section above).
When
method
equals "mle"
, "qmvue"
, or "bcmle"
and the data are singly censored, the default value is the
expected number of uncensored observations; otherwise it is n,
the observed number of uncensored observations. This is simply an ad-hoc
method of constructing confidence intervals and is not based on any
published theoretical results.
When pivot.statistic="z"
, the corresponding quantile from the standard normal distribution is used in place
of the quantile from Student's t-distribution.
Direct Normal Approximation Based on the Delta Method (ci.method="delta"
)
This method is usually applied with the maximum likelihood estimators
(method="mle"
). It should also work approximately for the quasi minimum
variance unbiased estimators (method="qmvue"
) and the bias-corrected maximum
likelihood estimators (method="bcmle"
).
When method="mle"
, the variance of the mle of can be estimated
based on the variance-covariance matrix of the mle's of
and
(denoted
), and the delta method:
where
(Shumway et al., 1989). The variance-covariance matrix of the mle's of
and
is estimated based on the inverse of the observed Fisher
Information matrix, formulas for which are given in Cohen (1991).
Direct Normal Approximation Based on the Moment Estimators (ci.method="normal.approx")

This method is valid only for the moment estimators based on imputed values
(i.e., method="impute.w.qq.reg" or method="half.cen.level"). For
these cases, the standard deviation of the estimated mean is assumed to be
approximated by

σ̂_{θ̂} = σ̂ / √m

where, as already noted, m denotes the assumed sample size.
This is simply an ad-hoc method of constructing confidence intervals and is not
based on any published theoretical results.
Cox's Method (ci.method="cox")

This method may be applied with the maximum likelihood estimators
(method="mle"), the quasi minimum variance unbiased estimators
(method="qmvue"), and the bias-corrected maximum likelihood estimators
(method="bcmle").

This method was proposed by El-Shaarawi (1989) and is an extension of the
method derived by Cox and presented in Land (1972) for the case of
complete data (see the explanation of ci.method="cox" in the help file
for elnormAlt). The idea is to construct an approximate
(1 − α)100% confidence interval for the quantity

β = log(θ) = μ + σ²/2

assuming the estimate of β is approximately normally distributed, and then
exponentiate the confidence limits.
That is, a two-sided (1 − α)100% confidence interval for θ is constructed as:

[exp(β̂ − t_{1−α/2, m−1} σ̂_{β̂}),  exp(β̂ + t_{1−α/2, m−1} σ̂_{β̂})]

where

β̂ = μ̂ + σ̂²/2

and σ̂_{β̂} denotes the estimated asymptotic standard
deviation of the estimator of β, m denotes the assumed sample
size for the confidence interval, and t_{p, ν} denotes the p'th
quantile of Student's t-distribution with ν degrees of freedom.

El-Shaarawi (1989) shows that the standard deviation of the mle of β can
be estimated by:

σ̂_{β̂} = √( V̂_11 + 2 σ̂ V̂_12 + σ̂² V̂_22 )

where V denotes the variance-covariance matrix of the mle's of μ and σ
and is estimated based on the inverse of the Fisher Information matrix.
One-sided confidence intervals are computed in a similar fashion.
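A hedged numerical illustration of Cox's method as described above (not EnvStats internals); mu.hat, sigma.hat, V, and m are hypothetical values chosen only to show the arithmetic:

# Hedged sketch: construct the interval for beta = mu + sigma^2/2, exponentiate.
mu.hat    <- 2.2
sigma.hat <- 1.36
V         <- matrix(c(0.08, 0.01, 0.01, 0.04), 2, 2)  # hypothetical cov of mle's
m         <- 19
alpha     <- 0.05
beta.hat  <- mu.hat + sigma.hat^2 / 2
se.beta   <- sqrt(V[1, 1] + 2 * sigma.hat * V[1, 2] + sigma.hat^2 * V[2, 2])
exp(beta.hat + c(-1, 1) * qt(1 - alpha/2, df = m - 1) * se.beta)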
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate (1 − α)100% confidence interval
for the population mean θ, the bootstrap can be broken down into the
following steps:

1. Create a bootstrap sample by taking a random sample of size N from
the observations in x, where sampling is done with
replacement. Note that because sampling is done with replacement, the same
element of x can appear more than once in the bootstrap
sample. Thus, the bootstrap sample will usually not look exactly like the
original sample (e.g., the number of censored observations in the bootstrap
sample will often differ from the number of censored observations in the
original sample).

2. Estimate θ based on the bootstrap sample created in Step 1, using
the same method that was used to estimate θ using the original
observations in x. Because the bootstrap sample usually
does not match the original sample, the estimate of θ based on the
bootstrap sample will usually differ from the original estimate based on x.

3. Repeat Steps 1 and 2 B times, where B is some large number.
The number of bootstraps B is determined by the argument
n.bootstraps (see the section ARGUMENTS above).
The default value of n.bootstraps is 1000.

4. Use the B estimated values of θ to compute the empirical
cumulative distribution function of this estimator of θ (see
ecdfPlot), and then create a confidence interval for θ
based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

[Ĝ⁻¹(α/2),  Ĝ⁻¹(1 − α/2)]    (42)

where Ĝ(t) denotes the empirical cdf evaluated at t and thus
Ĝ⁻¹(p) denotes the p'th empirical quantile, that is,
the p'th quantile associated with the empirical cdf. Similarly, a one-sided lower
confidence interval is computed as:

[Ĝ⁻¹(α),  ∞]    (43)

and a one-sided upper confidence interval is computed as:

[0,  Ĝ⁻¹(1 − α)]    (44)
The function elnormAltCensored
calls the R function quantile
to compute the empirical quantiles used in Equations (42)-(44).
The percentile method bootstrap confidence interval is only first-order
accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability
that the confidence interval will contain the true value of θ can be
off by k/√N, where k is some constant. Efron and Tibshirani
(1993, pp.184-188) proposed a bias-corrected and accelerated interval that is
second-order accurate, meaning that the probability that the confidence interval
will contain the true value of θ may be off by k/N
instead of k/√N. The two-sided bias-corrected and accelerated confidence interval is
computed as:

[Ĝ⁻¹(α_1),  Ĝ⁻¹(α_2)]    (45)

where

α_1 = Φ[ ẑ_0 + (ẑ_0 + z_{α/2}) / (1 − â(ẑ_0 + z_{α/2})) ]    (46)

α_2 = Φ[ ẑ_0 + (ẑ_0 + z_{1−α/2}) / (1 − â(ẑ_0 + z_{1−α/2})) ]    (47)

ẑ_0 = Φ⁻¹[Ĝ(θ̂)]    (48)

â = Σ_{i=1}^{N} (θ̂_(·) − θ̂_(i))³ / { 6 [ Σ_{i=1}^{N} (θ̂_(·) − θ̂_(i))² ]^{3/2} }    (49)

where z_p denotes the p'th quantile of the standard normal distribution,
the quantity θ̂_(i) denotes the estimate of θ using
all the values in x except the i'th one, and

θ̂_(·) = (1/N) Σ_{i=1}^{N} θ̂_(i)    (50)

A one-sided lower confidence interval is given by:

[Ĝ⁻¹(α_1),  ∞]    (51)

and a one-sided upper confidence interval is given by:

[0,  Ĝ⁻¹(α_2)]    (52)

where α_1 and α_2 are computed as for a two-sided confidence
interval, except that α/2 is replaced with α in Equations (46) and (47).

The constant ẑ_0 incorporates the bias correction, and the constant
â is the acceleration constant. The term “acceleration” refers
to the rate of change of the standard error of the estimate of θ with
respect to the true value of θ (Efron and Tibshirani, 1993, p.186). For a
normal (Gaussian) distribution, the standard error of the estimate of θ
does not depend on the value of θ, hence the acceleration constant is not
really necessary.
When ci.method="bootstrap"
, the function elnormAltCensored
computes both
the percentile method and bias-corrected and accelerated method bootstrap confidence
intervals.
This method of constructing confidence intervals for censored data was studied by Shumway et al. (1989).
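A minimal, hedged sketch of the percentile bootstrap for the manganese example used below (not the EnvStats implementation): resample (value, censoring indicator) pairs with replacement, re-estimate the mean with elnormAltCensored, and take empirical quantiles of the bootstrap estimates. The number of bootstraps is kept small here only for speed, and tryCatch() skips the rare resample that cannot be fit.

# Hedged percentile-bootstrap sketch for the mean, assuming the EnvStats data
# frame EPA.09.Ex.15.1.manganese.df is available (it is used in the examples).
set.seed(47)
dat  <- EPA.09.Ex.15.1.manganese.df
B    <- 200
boot <- rep(NA_real_, B)
for (b in 1:B) {
  idx <- sample(nrow(dat), replace = TRUE)
  boot[b] <- tryCatch(
    elnormAltCensored(dat$Manganese.ppb[idx], dat$Censored[idx])$parameters["mean"],
    error = function(e) NA_real_)
}
quantile(boot, c(0.025, 0.975), na.rm = TRUE)   # two-sided 95% percentile interval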
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation on the original scale, not the log-scale, when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard ([email protected])
Bain, L.J., and M. Engelhardt. (1991). Statistical Analysis of Reliability and Life-Testing Models. Marcel Dekker, New York, 496pp.
Cohen, A.C. (1959). Simplified Estimators for the Normal Distribution When Samples are Singly Censored or Truncated. Technometrics 1(3), 217–237.
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
El-Shaarawi, A.H. (1989). Inferences About the Mean from Censored Water Quality Data. Water Resources Research 25(4) 685–690.
El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.
El-Shaarawi, A.H., and S.R. Esterby. (1992). Replacement of Censored Observations by a Constant: An Evaluation. Water Research 26(6), 835–844.
El-Shaarawi, A.H., and A. Naderi. (1991). Statistical Inference from Multiply Censored Environmental Data. Environmental Monitoring and Assessment 17, 339–347.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Gleit, A. (1985). Estimation for Small Normal Data Sets with Detection Limits. Environmental Science and Technology 19, 1201–1206.
Haas, C.N., and P.A. Scheff. (1990). Estimation of Averages in Truncated Samples. Environmental Science and Technology 24(6), 912–919.
Hashimoto, L.K., and R.R. Trussell. (1983). Evaluating Water Quality Data Near the Detection Limit. Paper presented at the Advanced Technology Conference, American Water Works Association, Las Vegas, Nevada, June 5-9, 1983.
Helsel, D.R. (1990). Less than Obvious: Statistical Treatment of Data Below the Detection Limit. Environmental Science and Technology 24(12), 1766–1774.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715–727.
Korn, L.R., and D.E. Tyler. (2001). Robust Estimation for Chemical Concentration Data Subject to Detection Limits. In Fernholz, L., S. Morgenthaler, and W. Stahel, eds. Statistics in Genetics and in the Environmental Sciences. Birkhauser Verlag, Basel, pp.41–63.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461–496.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Newman, M.C., P.M. Dixon, B.B. Looney, and J.E. Pinder. (1989). Estimating Mean and Variance for Environmental Samples with Below Detection Limit Observations. Water Resources Bulletin 25(4), 905–916.
Pettitt, A. N. (1983). Re-Weighted Least Squares Estimation with Censored and Grouped Data: An Application of the EM Algorithm. Journal of the Royal Statistical Society, Series B 47, 253–260.
Regal, R. (1982). Applying Order Statistic Censored Normal Confidence Intervals to Time Censored Data. Unpublished manuscript, University of Minnesota, Duluth, Department of Mathematical Sciences.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Saw, J.G. (1961b). The Bias of the Maximum Likelihood Estimators of Location and Scale Parameters Given a Type II Censored Normal Sample. Biometrika 48, 448–451.
Schmee, J., D.Gladstein, and W. Nelson. (1985). Confidence Limits for Parameters of a Normal Distribution from Singly Censored Samples, Using Maximum Likelihood. Technometrics 27(2) 119–128.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, New York, 273pp.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Travis, C.C., and M.L. Land. (1990). Estimating the Mean of Data Sets with Nondetectable Values. Environmental Science and Technology 24, 961–962.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
LognormalAlt
, elnormAlt
,
elnormCensored
, enormCensored
,
estimateCensored.object
.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and coefficient of variation # ON THE ORIGINAL SCALE using the MLE, QMVUE, # and robust ROS (imputation with Q-Q regression). # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean and coefficient of variation # using the MLE: #--------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 23.003987 # cv = 2.300772 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # Now compare the MLE with the QMVUE and the # estimator based on robust ROS #------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored))$parameters # mean cv #23.003987 2.300772 with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, method = "qmvue"))$parameters # mean cv #21.566945 1.841366 with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, method = "rROS"))$parameters # mean cv #19.886180 1.298868 #---------- # The method used to estimate quantiles for a Q-Q plot is # determined by the argument prob.method. For the function # elnormCensoredAlt, for any estimation method that involves # Q-Q regression, the default value of prob.method is # "hirsch-stedinger" and the default value for the # plotting position constant is plot.pos.con=0.375. # Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger # probability method but set the plotting position constant to 0. with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, method = "rROS", plot.pos.con = 0))$parameters # mean cv #19.827673 1.304725 #---------- # Using the same data as above, compute a confidence interval # for the mean using the profile-likelihood method. 
with(EPA.09.Ex.15.1.manganese.df, elnormAltCensored(Manganese.ppb, Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 23.003987 # cv = 2.300772 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 12.37629 # UCL = 69.87694
Estimate the mean and standard deviation parameters of the logarithm of a lognormal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
elnormCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", nmc = 1000, seed = NULL, ...)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
method |
character string specifying the method of estimation. For singly censored data, the possible values are: For multiply censored data, the possible values are: See the DETAILS section for more information. |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are: See the DETAILS section for more information.
This argument is ignored if |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when |
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when |
nmc |
numeric scalar indicating the number of Monte Carlo simulations to run when
|
seed |
integer supplied to the function |
... |
additional arguments to pass to other functions.
|
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let X denote a random variable with a lognormal distribution with
parameters meanlog=μ and sdlog=σ. Then Y = log(X)
has a normal (Gaussian) distribution with parameters mean=μ and sd=σ.
Thus, the function elnormCensored simply calls the function enormCensored
using the log-transformed values of x.
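Because of this relationship, the log-scale estimates can be checked directly. A small illustration using the manganese example data shown below (it assumes the default MLE method for both functions): the parameters component of the second call should agree with the meanlog and sdlog values reported by the first.

# elnormCensored on x should match enormCensored on log(x) for the same method.
with(EPA.09.Ex.15.1.manganese.df,
  elnormCensored(Manganese.ppb, Censored))$parameters
with(EPA.09.Ex.15.1.manganese.df,
  enormCensored(log(Manganese.ppb), Censored))$parameters
# Same values, reported as meanlog/sdlog in the first call and mean/sd in the second.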
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Schmee et al. (1985) studied Type II censoring for a normal distribution and
noted that the bias and variances of the maximum likelihood estimators are of the
order 1/N, and that the bias is negligible for N = 100 and as much as
90% censoring. (If the proportion of censored observations is less than 90%,
the bias becomes negligible for smaller sample sizes.) For small samples with
moderate to high censoring, however, the bias of the mle's causes confidence
intervals based on them using a normal approximation (e.g.,
method="mle" and ci.method="normal.approx") to be too short. Schmee et al. (1985)
provide tables for exact confidence intervals for sample sizes up to N = 100
that were created based on Monte Carlo simulation. Schmee et al. (1985) state
that these tables should work well for Type I censored data as well.
Shumway et al. (1989) evaluated the coverage of 90% confidence intervals for the mean based on using a Box-Cox transformation to induce normality, computing the mle's based on the normal distribution, then computing the mean in the original scale. They considered three methods of constructing confidence intervals: the delta method, the bootstrap, and the bias-corrected bootstrap. Shumway et al. (1989) used three parent distributions in their study: Normal(3,1), the square of this distribution, and the exponentiation of this distribution (i.e., a lognormal distribution). Based on sample sizes of 10 and 50 with a censoring level at the 10'th or 20'th percentile, Shumway et al. (1989) found that the delta method performed quite well and was superior to the bootstrap method.
Millard et al. (2014; in preparation) show that the coverage of the profile likelihood method is excellent.
Steven P. Millard ([email protected])
Bain, L.J., and M. Engelhardt. (1991). Statistical Analysis of Reliability and Life-Testing Models. Marcel Dekker, New York, 496pp.
Cohen, A.C. (1959). Simplified Estimators for the Normal Distribution When Samples are Singly Censored or Truncated. Technometrics 1(3), 217–237.
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
El-Shaarawi, A.H. (1989). Inferences About the Mean from Censored Water Quality Data. Water Resources Research 25(4) 685–690.
El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.
El-Shaarawi, A.H., and S.R. Esterby. (1992). Replacement of Censored Observations by a Constant: An Evaluation. Water Research 26(6), 835–844.
El-Shaarawi, A.H., and A. Naderi. (1991). Statistical Inference from Multiply Censored Environmental Data. Environmental Monitoring and Assessment 17, 339–347.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Gleit, A. (1985). Estimation for Small Normal Data Sets with Detection Limits. Environmental Science and Technology 19, 1201–1206.
Haas, C.N., and P.A. Scheff. (1990). Estimation of Averages in Truncated Samples. Environmental Science and Technology 24(6), 912–919.
Hashimoto, L.K., and R.R. Trussell. (1983). Evaluating Water Quality Data Near the Detection Limit. Paper presented at the Advanced Technology Conference, American Water Works Association, Las Vegas, Nevada, June 5-9, 1983.
Helsel, D.R. (1990). Less than Obvious: Statistical Treatment of Data Below the Detection Limit. Environmental Science and Technology 24(12), 1766–1774.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715–727.
Korn, L.R., and D.E. Tyler. (2001). Robust Estimation for Chemical Concentration Data Subject to Detection Limits. In Fernholz, L., S. Morgenthaler, and W. Stahel, eds. Statistics in Genetics and in the Environmental Sciences. Birkhauser Verlag, Basel, pp.41–63.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461–496.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Newman, M.C., P.M. Dixon, B.B. Looney, and J.E. Pinder. (1989). Estimating Mean and Variance for Environmental Samples with Below Detection Limit Observations. Water Resources Bulletin 25(4), 905–916.
Pettitt, A. N. (1983). Re-Weighted Least Squares Estimation with Censored and Grouped Data: An Application of the EM Algorithm. Journal of the Royal Statistical Society, Series B 47, 253–260.
Regal, R. (1982). Applying Order Statistic Censored Normal Confidence Intervals to Time Censored Data. Unpublished manuscript, University of Minnesota, Duluth, Department of Mathematical Sciences.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Saw, J.G. (1961b). The Bias of the Maximum Likelihood Estimators of Location and Scale Parameters Given a Type II Censored Normal Sample. Biometrika 48, 448–451.
Schmee, J., D.Gladstein, and W. Nelson. (1985). Confidence Limits for Parameters of a Normal Distribution from Singly Censored Samples, Using Maximum Likelihood. Technometrics 27(2) 119–128.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, New York, 273pp.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Travis, C.C., and M.L. Land. (1990). Estimating the Mean of Data Sets with Nondetectable Values. Environmental Science and Technology 24, 961–962.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
enormCensored
, Lognormal, elnorm
,
estimateCensored.object
.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and standard deviation using the MLE, # Q-Q regression (also called parametric regression on order statistics # or ROS; e.g., USEPA, 2009 and Helsel, 2012), and imputation with Q-Q # regression (also called robust ROS or rROS). # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean and standard deviation on the log-scale # using the MLE: #--------------------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # Now compare the MLE with the estimators based on # Q-Q regression (ROS) and imputation with Q-Q regression (rROS) #--------------------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored))$parameters # meanlog sdlog #2.215905 1.356291 with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored, method = "ROS"))$parameters # meanlog sdlog #2.293742 1.283635 with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored, method = "rROS"))$parameters # meanlog sdlog #2.298656 1.238104 #---------- # The method used to estimate quantiles for a Q-Q plot is # determined by the argument prob.method. For the functions # enormCensored and elnormCensored, for any estimation # method that involves Q-Q regression, the default value of # prob.method is "hirsch-stedinger" and the default value for the # plotting position constant is plot.pos.con=0.375. # Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger # probability method but set the plotting position constant to 0. with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored, method = "rROS", plot.pos.con = 0))$parameters # meanlog sdlog #2.277175 1.261431 #---------- # Using the same data as above, compute a confidence interval # for the mean on the log-scale using the profile-likelihood # method. 
with(EPA.09.Ex.15.1.manganese.df, elnormCensored(Manganese.ppb, Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: meanlog # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 1.595062 # UCL = 2.771197
Estimate the location and scale parameters of a logistic distribution, and optionally construct a confidence interval for the location parameter.
elogis(x, method = "mle", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
|
ci |
logical scalar indicating whether to compute a confidence interval for the
location or scale parameter. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the location or scale parameter. Currently, the only possible value is
|
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let x = (x_1, x_2, …, x_n) be a vector of n observations from a logistic distribution with
parameters location=η and scale=θ.

Estimation

Maximum Likelihood Estimation (method="mle")

The maximum likelihood estimators (mle's) of η and θ are
the solutions of the simultaneous equations (Forbes et al., 2011):

Σ_{i=1}^{n} 1/(1 + exp(z_i)) = n/2    (1)

Σ_{i=1}^{n} z_i [ (exp(z_i) − 1) / (exp(z_i) + 1) ] = n    (2)

where

z_i = (x_i − η̂_mle) / θ̂_mle    (3)

Method of Moments Estimation (method="mme")

The method of moments estimators (mme's) of η and θ are
given by:

η̂_mme = x̄    (4)

θ̂_mme = (√3 / π) s_m    (5)

where

x̄ = (1/n) Σ_{i=1}^{n} x_i    (6)

s_m² = (1/n) Σ_{i=1}^{n} (x_i − x̄)²    (7)

that is, s_m denotes the square root of the method of moments estimator
of variance.

Method of Moments Estimators Based on the Unbiased Estimator of Variance (method="mmue")

These estimators are exactly the same as the method of moments estimators given in
equations (4-7) above, except that the method of moments estimator of variance in
equation (7) is replaced with the unbiased estimator of variance:

s² = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)²    (8)

Confidence Intervals

When ci=TRUE, an approximate (1 − α)100% confidence interval
for η can be constructed assuming the distribution of the estimator of
η is approximately normally distributed. A two-sided confidence
interval is constructed as:

[η̂ − t_{1−α/2, n−1} σ̂_{η̂},  η̂ + t_{1−α/2, n−1} σ̂_{η̂}]

where t_{p, n−1} is the p'th quantile of
Student's t-distribution with n−1 degrees of freedom, and the quantity σ̂_{η̂}
denotes the estimated asymptotic standard deviation of the estimator of η.
One-sided confidence intervals for η and θ are computed in
a similar fashion.
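A small, hedged check of the method-of-moments formulas above (not the elogis internals): the mme of the location is the sample mean, and the mme of the scale follows from Var(X) = θ²π²/3.

# Hedged check of the mme formulas, compared with elogis(..., method = "mme").
set.seed(250)
dat <- rlogis(20)
s.m <- sqrt(mean((dat - mean(dat))^2))        # square root of the mme of variance
c(location = mean(dat), scale = sqrt(3) * s.m / pi)
elogis(dat, method = "mme")$parameters        # should agree with the line above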
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The logistic distribution is defined on the real line and is unimodal and symmetric about its location parameter (the mean). It has longer tails than a normal (Gaussian) distribution. It is used to model growth curves and bioassay data.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
# Generate 20 observations from a logistic distribution with # parameters location=0 and scale=1, then estimate the parameters # and construct a 90% confidence interval for the location parameter. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlogis(20) elogis(dat, ci = TRUE, conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Logistic # #Estimated Parameter(s): location = -0.2181845 # scale = 0.8152793 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: location # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Interval: LCL = -0.7899382 # UCL = 0.3535693 #---------- # Clean up #--------- rm(dat)
Density, distribution function, quantile function, and random generation for the empirical distribution based on a set of observations
demp(x, obs, discrete = FALSE, density.arg.list = NULL) pemp(q, obs, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = 0.375) qemp(p, obs, discrete = FALSE, prob.method = ifelse(discrete, "emp.probs", "plot.pos"), plot.pos.con = 0.375) remp(n, obs)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
obs |
numeric vector of observations. Missing ( |
discrete |
logical scalar indicating whether the assumed parent distribution of |
density.arg.list |
list with arguments to the R |
prob.method |
character string indicating what method to use to compute the empirical
probabilities. Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant. The default value is |
Let x_1, x_2, …, x_n denote a random sample of n observations
from some unknown probability distribution (i.e., the elements of the argument
obs), and let x_(i) denote the i'th order statistic, that is,
the i'th smallest observation, for i = 1, 2, …, n.
Estimating Density
The function demp computes the empirical probability density function. If
the observations are assumed to come from a discrete distribution, the probability
density (mass) function is estimated by:

f̂(x) = (1/n) Σ_{i=1}^{n} I_[x_i = x]

where I_[x_i = x] is the indicator function:

I_[x_i = x] = 1 if x_i = x, and I_[x_i = x] = 0 if x_i ≠ x.

That is, the estimated probability of observing the value x is simply the
observed proportion of observations equal to x.
If the observations are assumed to come from a continuous distribution, the
function demp
calls the R function density
to compute the
estimated density based on the values specified in the argument obs
,
and then uses linear interpolation to estimate the density at the values
specified in the argument x
. See the R help file for
density
for more information on how the empirical density is
computed in the continuous case.
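The continuous-case behavior just described can be mimicked with base R: estimate the density from obs with density(), then linearly interpolate at the values of x. Assuming the same default bandwidth choice, the interpolated values should be close to what demp returns.

# Hedged illustration of the continuous case: density() plus linear interpolation.
set.seed(3)
obs <- rgamma(100, shape = 4, scale = 5)
x <- c(5, 10, 20)
d <- density(obs)
cbind(demp = demp(x, obs), interp = approx(d$x, d$y, xout = x)$y)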
Estimating Probabilities
The function pemp
computes the estimated cumulative distribution function
(cdf), also called the empirical cdf (ecdf). If the observations are assumed to
come from a discrete distribution, the value of the cdf evaluated at the i'th
order statistic is usually estimated by:

F̂[x_(i)] = p̂_i = (1/n) Σ_{j=1}^{n} I_(−∞, x_(i)](x_j)

where:

I_(−∞, t](x_j) = 1 if x_j ≤ t, and I_(−∞, t](x_j) = 0 if x_j > t

(D'Agostino, 1986a). That is, the estimated value of the cdf at the i'th
order statistic is simply the observed proportion of observations less than or
equal to the i'th order statistic. This estimator is sometimes called the
“empirical probabilities” estimator and is intuitively appealing.
The function pemp uses the above equations to compute the empirical cdf when
prob.method="emp.probs".

For any general value of x, when the observations are assumed to come from a
discrete distribution, the value of the cdf is estimated by:

F̂(x) = 0 if x < x_(1),
F̂(x) = p̂_i if x_(i) ≤ x < x_(i+1),
F̂(x) = 1 if x ≥ x_(n)

The function pemp uses the above equation when discrete=TRUE.
If the observations are assumed to come from a continuous distribution, the value
of the cdf evaluated at the i'th order statistic is usually estimated by:

F̂[x_(i)] = p̂_i = (i − a) / (n − 2a + 1)

where a denotes the plotting position constant and 0 ≤ a ≤ 1
(Cleveland, 1993, p.18; D'Agostino, 1986a, pp.8,25). The estimators defined by
the above equation are called plotting positions and are used to construct
probability plots. The function pemp uses the above equation
when prob.method="plot.pos".

For any general value of x, the value of the cdf is estimated by linear
interpolation:

F̂(x) = p̂_1 if x ≤ x_(1),
F̂(x) = (1 − r) p̂_i + r p̂_(i+1) if x_(i) ≤ x ≤ x_(i+1),
F̂(x) = p̂_n if x ≥ x_(n)

where

r = (x − x_(i)) / (x_(i+1) − x_(i))

(Chambers et al., 1983). The function pemp uses the above two equations
when discrete=FALSE.
Estimating Quantiles
The function qemp computes the estimated quantiles based on the observed
data. If the observations are assumed to come from a discrete distribution, the
p'th quantile is usually estimated by:

x̂_p = x_(1) if p ≤ p̂_1,
x̂_p = x_(i) if p̂_(i−1) < p ≤ p̂_i,
x̂_p = x_(n) if p > p̂_n

The function qemp uses the above equation when discrete=TRUE.

If the observations are assumed to come from a continuous distribution, the
p'th quantile is usually estimated by linear interpolation:

x̂_p = x_(1) if p ≤ p̂_1,
x̂_p = (1 − r) x_(i−1) + r x_(i) if p̂_(i−1) < p ≤ p̂_i,
x̂_p = x_(n) if p > p̂_n

where

r = (p − p̂_(i−1)) / (p̂_i − p̂_(i−1))

The function qemp uses the above two equations when discrete=FALSE.
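A small, hedged check of the plotting-position formula given above for the continuous case: evaluated at the order statistics, pemp() should return (i − a)/(n − 2a + 1) with the default plotting position constant a = 0.375.

# Compare the plotting positions with pemp() evaluated at the sorted data.
set.seed(3)
obs <- rnorm(10)
n <- length(obs)
a <- 0.375
cbind(plot.pos = ((1:n) - a) / (n - 2 * a + 1),
      pemp     = pemp(sort(obs), obs, plot.pos.con = a))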
Generating Random Numbers From the Empirical Distribution
The function remp
simply calls the R function sample
to
sample the elements of obs
with replacement.
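Since remp is documented above as a call to sample() with replacement, it behaves like the base R idiom shown here (element-for-element agreement between the two calls is not guaranteed and is not the point):

set.seed(1)
obs <- rgamma(20, shape = 4, scale = 5)
remp(5, obs)
sample(obs, 5, replace = TRUE)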
density (demp
), probability (pemp
), quantile (qemp
), or
random sample (remp
) for the empirical distribution based on the data
contained in the vector obs
.
The function demp lets you perform nonparametric density estimation.
The function pemp
computes the value of the empirical cumulative
distribution function (ecdf) for user-specified quantiles. The ecdf is a
nonparametric estimate of the true cdf (see ecdfPlot
). The
function qemp
computes nonparametric estimates of quantiles
(see the help files for eqnpar
and quantile
).
The function remp lets you sample a set of observations with replacement,
which is often done while bootstrapping or performing some other kind of
Monte Carlo simulation.
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11–16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7–62.
Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley and Sons, New York.
Sheather, S. J. and Jones M. C. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society B, 683–690.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Wegman, E.J. (1972). Nonparametric Probability Density Estimation. Technometrics 14, 533-546.
density
, approx
, epdfPlot
,
ecdfPlot
, cdfCompare
, qqplot
,
eqnpar
, quantile
, sample
, simulateVector
, simulateMvMatrix
.
# Create a set of 100 observations from a gamma distribution with # parameters shape=4 and scale=5. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(3) obs <- rgamma(100, shape=4, scale=5) # Now plot the empirical distribution (with a histogram) and the true distribution: dev.new() hist(obs, col = "cyan", xlim = c(0, 65), freq = FALSE, ylab = "Relative Frequency") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) box() # Now plot the empirical distribution (based on demp) with the # true distribution: x <- qemp(p = seq(0, 1, len = 100), obs = obs) y <- demp(x, obs) dev.new() plot(x, y, xlim = c(0, 65), type = "n", xlab = "Value of Random Variable", ylab = "Relative Frequency") lines(x, y, lwd = 2, col = "cyan") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) # Alternatively, you can create the above plot with the function # epdfPlot: dev.new() epdfPlot(obs, xlim = c(0, 65), epdf.col = "cyan", xlab = "Value of Random Variable", main = "Empirical and Theoretical PDFs") pdfPlot('gamma', list(shape = 4, scale = 5), add = TRUE) # Clean Up #--------- rm(obs, x, y)
Estimate the probability parameter of a negative binomial distribution.
enbinom(x, size, method = "mle/mme")
x |
vector of non-negative integers indicating the number of trials that took place
before |
size |
vector of positive integers indicating the number of “successes” that
must be observed before the trials are stopped. Missing ( |
method |
character string specifying the method of estimation. Possible values are: |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let x = (x_1, x_2, …, x_n) be a vector of n
independent observations from negative binomial distributions
with parameters prob=p and size=k, where
k = (k_1, k_2, …, k_n) is a vector of n
(possibly different) values.

It can be shown (e.g., Forbes et al., 2011) that if X is defined as:

X = Σ_{i=1}^{n} x_i

then X is an observation from a negative binomial distribution with
parameters prob=p and size=K, where

K = Σ_{i=1}^{n} k_i

Estimation

The maximum likelihood and method of moments estimator (mle/mme) of p
is given by:

p̂_mle = K / (X + K)

and the minimum variance unbiased estimator (mvue) of p is given by:

p̂_mvue = (K − 1) / (X + K − 1)

(Forbes et al., 2011). Note that the mvue of p is not defined for K = 1.
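A quick check of the formulas above, using the same data as the second example below: with X = sum(x) and K = sum(size), the mle/mme of prob is K/(X + K) and the mvue is (K − 1)/(X + K − 1).

# Compare the closed-form estimates with the values returned by enbinom().
set.seed(250)
size.vec <- 2:4
dat <- rnbinom(3, size = size.vec, prob = 0.2)
X <- sum(dat)
K <- sum(size.vec)
c(mle.mme = K / (X + K), mvue = (K - 1) / (X + K - 1))
enbinom(dat, size = size.vec)$parameters["prob"]                    # mle/mme
enbinom(dat, size = size.vec, method = "mvue")$parameters["prob"]   # mvue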
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The negative binomial distribution has its roots in a gambling game where participants would bet on the number of tosses of a coin necessary to achieve a fixed number of heads. The negative binomial distribution has been applied in a wide variety of fields, including accident statistics, birth-and-death processes, and modeling spatial distributions of biological organisms.
The geometric distribution with parameter prob=$p$ is a special case of the negative binomial distribution with parameters size=1 and prob=$p$.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 5.
NegBinomial, egeom, Geometric.
# Generate an observation from a negative binomial distribution with # parameters size=2 and prob=0.2, then estimate the parameter prob. # Note: the call to set.seed simply allows you to reproduce this example. # Also, the only parameter that is estimated is prob; the parameter # size is supplied in the call to enbinom. The parameter size is printed in # order to show all of the parameters associated with the distribution. set.seed(250) dat <- rnbinom(1, size = 2, prob = 0.2) dat #[1] 5 enbinom(dat, size = 2) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Negative Binomial # #Estimated Parameter(s): size = 2.0000000 # prob = 0.2857143 # #Estimation Method: mle/mme for 'prob' # #Data: dat, 2 # #Sample Size: 1 #---------- # Generate 3 observations from negative binomial distributions with # parameters size=c(2,3,4) and prob=0.2, then estimate the parameter # prob using the mvue. # (Note: the call to set.seed simply allows you to reproduce this example.) size.vec <- 2:4 set.seed(250) dat <- rnbinom(3, size = size.vec, prob = 0.2) dat #[1] 5 19 12 enbinom(dat, size = size.vec, method = "mvue") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Negative Binomial # #Estimated Parameter(s): size = 9.0000000 # prob = 0.1818182 # #Estimation Method: mvue for 'prob' # #Data: dat, size.vec # #Sample Size: 3 #---------- # Clean up #--------- rm(dat)
Estimate the mean and standard deviation parameters of a normal (Gaussian) distribution, and optionally construct a confidence interval for the mean or the variance.
enorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "exact", conf.level = 0.95, ci.param = "mean")
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
"mvue" (minimum variance unbiased; the default) and "mle/mme" (maximum likelihood/method of moments). |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is ci=FALSE. |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are "two-sided" (the default), "lower", and "upper". |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean or variance. The only possible value is "exact" (the default). |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is conf.level=0.95. |
ci.param |
character string indicating which parameter to create a confidence interval for.
The possible values are ci.param="mean" (the default) and ci.param="variance". |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ be a vector of $n$ observations from a normal (Gaussian) distribution with parameters mean=$\mu$ and sd=$\sigma$.

Estimation

Minimum Variance Unbiased Estimation (method="mvue")

The minimum variance unbiased estimators (mvue's) of the mean and variance are:

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \;\;\;\; (1)$$

$$\hat{\sigma}^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \;\;\;\; (2)$$

(Johnson et al., 1994; Forbes et al., 2011). Note that when method="mvue", the estimated standard deviation is the square root of the mvue of the variance, but is not itself an mvue.

Maximum Likelihood/Method of Moments Estimation (method="mle/mme")

The maximum likelihood estimator (mle) and method of moments estimator (mme) of the mean are both the same as the mvue of the mean given in equation (1) above. The mle and mme of the variance is given by:

$$\hat{\sigma}^2_{mle} = s^2_m = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \;\;\;\; (3)$$

When method="mle/mme", the estimated standard deviation is the square root of the mle of the variance, and is itself an mle.

Confidence Intervals

Confidence Interval for the Mean (ci.param="mean")

When ci=TRUE and ci.param="mean", the usual confidence interval for $\mu$ is constructed as follows. If ci.type="two-sided", a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is given by:

$$[\hat{\mu} - t_{n-1, 1-\alpha/2} \frac{\hat{\sigma}}{\sqrt{n}}, \; \hat{\mu} + t_{n-1, 1-\alpha/2} \frac{\hat{\sigma}}{\sqrt{n}}] \;\;\;\; (4)$$

where $t_{\nu, p}$ is the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom (Zar, 2010; Gilbert, 1987; Ott, 1995; Helsel and Hirsch, 1992).

If ci.type="lower", the $(1-\alpha)100\%$ confidence interval for $\mu$ is given by:

$$[\hat{\mu} - t_{n-1, 1-\alpha} \frac{\hat{\sigma}}{\sqrt{n}}, \; \infty] \;\;\;\; (5)$$

and if ci.type="upper", the confidence interval is given by:

$$[-\infty, \; \hat{\mu} + t_{n-1, 1-\alpha} \frac{\hat{\sigma}}{\sqrt{n}}] \;\;\;\; (6)$$

Confidence Interval for the Variance (ci.param="variance")

When ci=TRUE and ci.param="variance", the usual confidence interval for $\sigma^2$ is constructed as follows. A two-sided $(1-\alpha)100\%$ confidence interval for $\sigma^2$ is given by:

$$[\frac{(n-1)s^2}{\chi^2_{n-1, 1-\alpha/2}}, \; \frac{(n-1)s^2}{\chi^2_{n-1, \alpha/2}}] \;\;\;\; (7)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the chi-squared distribution with $\nu$ degrees of freedom. Similarly, a one-sided upper $(1-\alpha)100\%$ confidence interval for the population variance is given by:

$$[0, \; \frac{(n-1)s^2}{\chi^2_{n-1, \alpha}}] \;\;\;\; (8)$$

and a one-sided lower $(1-\alpha)100\%$ confidence interval for the population variance is given by:

$$[\frac{(n-1)s^2}{\chi^2_{n-1, 1-\alpha}}, \; \infty] \;\;\;\; (9)$$

(van Belle et al., 2004; Zar, 2010).
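As a quick check of equation (4), the following minimal sketch (reusing the simulated data from the first example below, so the particular numbers are only illustrative) computes the exact two-sided 95% confidence interval for the mean by hand; it should agree with the interval reported by enorm(dat, ci = TRUE):

set.seed(250)
dat <- rnorm(20, mean = 3, sd = 2)
n <- length(dat)
xbar <- mean(dat)                                  # mvue of the mean
s <- sd(dat)                                       # square root of the mvue of the variance
half.width <- qt(0.975, df = n - 1) * s / sqrt(n)
c(LCL = xbar - half.width, UCL = xbar + half.width)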
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The normal and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean or variance. This is done with confidence intervals.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# Generate 20 observations from a N(3, 2) distribution, then estimate # the parameters and create a 95% confidence interval for the mean. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) enorm(dat, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 2.308798 # UCL = 3.413523 #---------- # Using the same data, construct an upper 90% confidence interval for # the variance. enorm(dat, ci = TRUE, ci.type = "upper", ci.param = "variance")$interval #Confidence Interval for: variance # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.000000 # UCL = 2.615963 #---------- # Clean up #--------- rm(dat) #---------- # Using the Reference area TcCB data in the data frame EPA.94b.tccb.df, # estimate the mean and standard deviation of the log-transformed data, # and construct a 95% confidence interval for the mean. with(EPA.94b.tccb.df, enorm(log(TcCB[Area == "Reference"]), ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = -0.6195712 # sd = 0.4679530 # #Estimation Method: mvue # #Data: log(TcCB[Area == "Reference"]) # #Sample Size: 47 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.7569673 # UCL = -0.4821751
Estimate the mean and standard deviation of a normal (Gaussian) distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
enormCensored(x, censored, method = "mle", censoring.side = "left", ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided", conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z", nmc = 1000, seed = NULL, ...)
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
censored |
numeric or logical vector indicating which values of x are censored. This must be the same length as x. |
method |
character string specifying the method of estimation. For singly censored data, the possible values are "mle" (the default), "bcmle", "ROS" (or its synonym "qq.reg"), "rROS" (or its synonym "impute.w.qq.reg"), "qq.reg.w.cen.level", "impute.w.qq.reg.w.cen.level", "impute.w.mle", "iterative.impute.w.qq.reg", "m.est", and "half.cen.level". For multiply censored data, the possible values are "mle" (the default), "ROS" (or "qq.reg"), "rROS" (or "impute.w.qq.reg"), and "half.cen.level". See the DETAILS section for more information. |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are "left" (the default) and "right". |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is ci=FALSE. |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are "profile.likelihood" (the default), "normal.approx", "normal.approx.w.cov" (singly censored data only), "bootstrap", and "gpq". See the DETAILS section for more information. This argument is ignored if ci=FALSE. |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are "two-sided" (the default), "lower", and "upper". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is conf.level=0.95. |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when ci.method="bootstrap". The default value is n.bootstraps=1000. |
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when ci.method="normal.approx" or ci.method="normal.approx.w.cov". The possible values are "z" (the default) and "t". |
nmc |
numeric scalar indicating the number of Monte Carlo simulations to run when
ci.method="gpq". The default value is nmc=1000. |
seed |
integer supplied to the function set.seed and used when ci.method="bootstrap" or ci.method="gpq". The default value is seed=NULL. |
... |
additional arguments to pass to other functions (e.g., prob.method, plot.pos.con, lb.impute, ub.impute, ci.sample.size, tol, convergence, t.df; see the DETAILS section).
|
If x or censored contain any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a vector of $N$ observations from a normal distribution with mean $\mu$ and standard deviation $\sigma$. Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c = N - n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \;\; k \ge 1 \;\;\;\; (1)$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^{k} c_j = c \;\;\;\; (2)$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.

Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.

Finally, let $\Omega$ (omega) denote the set of $n$ subscripts in the “ordered” sample that correspond to uncensored observations.
ESTIMATION
Estimation Methods for Multiply and Singly Censored Data
The following methods are available for multiply and singly censored data.
Maximum Likelihood Estimation (method="mle"
)
For Type I left censored data, the likelihood function is given by:

$$L(\mu, \sigma | \underline{x}) = \frac{N!}{c_1! c_2! \cdots c_k! \, n!} \prod_{j=1}^{k} [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (3)$$

where $f$ and $F$ denote the probability density function (pdf) and cumulative distribution function (cdf) of the population. That is,

$$f(t) = \frac{1}{\sigma} \phi(\frac{t - \mu}{\sigma}) \;\;\;\; (4)$$

$$F(t) = \Phi(\frac{t - \mu}{\sigma}) \;\;\;\; (5)$$

where $\phi$ and $\Phi$ denote the pdf and cdf of the standard normal distribution, respectively (Cohen, 1963; 1991, pp.6, 50). For left singly censored data, Equation (3) simplifies to:

$$L(\mu, \sigma | \underline{x}) = {N \choose c} [F(T)]^{c} \prod_{i = c+1}^{N} f[x_{(i)}] \;\;\;\; (6)$$

Similarly, for Type I right censored data, the likelihood function is given by:

$$L(\mu, \sigma | \underline{x}) = \frac{N!}{c_1! c_2! \cdots c_k! \, n!} \prod_{j=1}^{k} [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (7)$$

and for right singly censored data this simplifies to:

$$L(\mu, \sigma | \underline{x}) = {N \choose c} [1 - F(T)]^{c} \prod_{i = 1}^{n} f[x_{(i)}] \;\;\;\; (8)$$

The maximum likelihood estimators are computed by maximizing the likelihood function. For right-censored data, Cohen (1963; 1991, pp.50-51) shows that taking partial derivatives of the log-likelihood function with respect to $\mu$ and $\sigma$ and setting these to 0 produces the following two simultaneous equations:

$$\bar{x} - \hat{\mu} = -\hat{\sigma} \, h \, Q(\hat{\xi}) \;\;\;\; (9)$$

$$s^2 + (\bar{x} - \hat{\mu})^2 = \hat{\sigma}^2 [1 - h \, \hat{\xi} \, Q(\hat{\xi})] \;\;\;\; (10)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i \in \Omega} x_{(i)} \;\;\;\; (11)$$

$$s^2 = \frac{1}{n} \sum_{i \in \Omega} (x_{(i)} - \bar{x})^2 \;\;\;\; (12)$$

$$\hat{\xi} = \frac{T - \hat{\mu}}{\hat{\sigma}} \;\;\;\; (13)$$

$$h = \frac{c}{n} \;\;\;\; (14)$$

$$Q(t) = \frac{\phi(t)}{1 - \Phi(t)} \;\;\;\; (15)$$

Note that the quantity defined in Equation (11) is simply the mean of the uncensored observations, the quantity defined in Equation (12) is simply the method of moments estimator of variance based on the uncensored observations, and the function $Q$ defined in Equation (15) is the hazard function for the standard normal distribution.

For left-censored data, Equations (9) and (10) stay the same, except $Q(\hat{\xi})$ is replaced with $-Q(-\hat{\xi})$.
The function enormCensored
computes the maximum likelihood estimators by
solving Equations (9) and (10) and uses the quantile-quantile regression
estimators (see below) as initial values.
Regression on Order Statistics (method="ROS"
) or
Quantile-Quantile Regression (method="qq.reg"
)
This method is sometimes called the probability plot method
(Nelson, 1982, Chapter 3; Gilbert, 1987, pp.134-136;
Helsel and Hirsch, 1992, p. 361), and more recently also called
parametric regression on order statistics or ROS
(USEPA, 2009; Helsel, 2012). In the case of no censoring, it is well known
(e.g., Nelson, 1982, p.113; Cleveland, 1993, p.31) that for the standard
normal (Gaussian) quantile-quantile plot (i.e., the plot of the sorted observations
(empirical quantiles) versus standard normal quantiles; see qqPlot
),
the intercept and slope of the fitted least-squares line estimate the mean and
standard deviation, respectively. Specifically, the estimates of $\mu$ and $\sigma$ are found by computing the least-squares estimates in the following model:

$$x_{(i)} = \mu + \sigma \Phi^{-1}(p_i) + \epsilon_i, \;\; i = 1, 2, \ldots, N \;\;\;\; (16)$$

where

$$p_i = \frac{i - a}{N - 2a + 1} \;\;\;\; (17)$$

denotes the plotting position associated with the $i$'th largest value, $a$ is a constant such that $0 \le a \le 1$ (the plotting position constant), and $\Phi$ denotes the cumulative distribution function (cdf) of the standard normal distribution. The default value of $a$ is 0.375 (see below).

This method can be adapted to the case of left (right) singly censored data as follows. Plot the $n$ uncensored observations against the $n$ largest (smallest) normal quantiles, where the normal quantiles are computed based on a sample size of $N$, fit the least-squares line to this plot, and estimate the mean and standard deviation from the intercept and slope, respectively. That is, use Equations (16) and (17), but for right singly censored data use $i = 1, 2, \ldots, n$, and for left singly censored data use $i = c+1, c+2, \ldots, N$.

The argument plot.pos.con (see the entry for ... in the ARGUMENTS section above) determines the value of the plotting position constant $a$ used in Equation (17). The default value is plot.pos.con=0.375.
See the help file for qqPlot
for more information.
This method is discussed by Haas and Scheff (1990). In the context of lognormal data, Travis and Land (1990) suggest exponentiating the predicted 50'th percentile from this fit to estimate the geometric mean (i.e., the median of the lognormal distribution).
This method is easily extended to multiply censored data. Equation (16) becomes

$$x_{(i)} = \mu + \sigma \Phi^{-1}(p_i) + \epsilon_i, \;\; i \in \Omega \;\;\;\; (18)$$

where $\Omega$ denotes the set of $n$ subscripts associated with the uncensored observations in the ordered sample. The plotting positions are computed by calling the EnvStats function ppointsCensored.
The argument prob.method
determines the method of computing the plotting
positions (default is prob.method="hirsch-stedinger"
), and the argument
plot.pos.con
determines the plotting position constant (default is
plot.pos.con=0.375
). (See the entry for ... in the ARGUMENTS section above.)
Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger probability
method but set the plotting position constant to 0.
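To illustrate the idea behind the Q-Q regression (ROS) estimators just described, here is a minimal sketch using complete, uncensored simulated data (the sample and parameter values are assumptions chosen only for illustration): the intercept and slope of the least-squares line through a normal Q-Q plot recover the mean and standard deviation.

set.seed(1)
x <- rnorm(50, mean = 10, sd = 3)
p <- ppoints(length(x), a = 0.375)   # plotting positions with constant a = 0.375
fit <- lm(sort(x) ~ qnorm(p))        # regress ordered observations on standard normal quantiles
coef(fit)                            # intercept estimates the mean, slope estimates the sd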
Robust Regression on Order Statistics (method="rROS"
) or
Imputation Using Quantile-Quantile Regression (method="impute.w.qq.reg"
)
This is the robust Regression on Order Statistics (rROS) method discussed in
USEPA (2009) and Helsel (2012). It involves using the quantile-quantile regression
method (method="qq.reg"
or method="ROS"
) to fit a regression line
(and thus initially estimate the mean and standard deviation), and then imputing the
values of the censored observations by predicting them from the regression equation.
The final estimates of the mean and standard deviation are then computed using
the usual formulas (see enorm
) based on the observed and imputed
values.
The imputed values are computed as:

$$\hat{x}_{(i)} = \hat{\mu}_{QQ} + \hat{\sigma}_{QQ} \Phi^{-1}(p_i), \;\; i \notin \Omega \;\;\;\; (19)$$

where $\hat{\mu}_{QQ}$ and $\hat{\sigma}_{QQ}$ denote the estimates of the mean and standard deviation based on the Q-Q regression fit.
See the help file for ppointsCensored
for information on how the
plotting positions for the censored observations are computed.
The argument prob.method
determines the method of computing the plotting
positions (default is prob.method="hirsch-stedinger"
), and the argument
plot.pos.con
determines the plotting position constant (default is
plot.pos.con=0.375
). (See the entry for ... in the ARGUMENTS section above.)
Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger probability
method but set the plotting position constant to 0.
The arguments lb.impute
and ub.impute
determine the lower and upper
bounds for the imputed values. Imputed values smaller than lb.impute
are
set to this value. Imputed values larger than ub.impute
are set to this
value. The default values are lb.impute=-Inf
and ub.impute=Inf
.
See the entry for ... in the ARGUMENTS section above.
For singly censored data, this is the NR method of Gilliom and Helsel (1986, p. 137). In the context of lognormal data, this method is discussed by Hashimoto and Trussell (1983), Gilliom and Helsel (1986), and El-Shaarawi (1989), and is referred to as the LR or Log-Probability Method.
For multiply censored data, this method was developed in the context of
lognormal data by Helsel and Cohn (1988) using the formulas for plotting
positions given in Hirsch and Stedinger (1987) and Weibull plotting positions
(i.e., prob.method="hirsch-stedinger"
and plot.pos.con=0
).
Setting Censored Observations to Half the Censoring Level (method="half.cen.level"
)
This method is applicable only to left censored data that is bounded below by 0.
This method involves simply replacing all the censored observations with half their
detection limit, and then computing the mean and standard deviation with the usual
formulas (see enorm
).
This method is included only to allow comparison of this method to other methods. Setting left-censored observations to half the censoring level is not recommended.
For singly censored data, this method is discussed by Gleit (1985), Haas and Scheff (1990), and El-Shaarawi and Esterby (1992). El-Shaarawi and Esterby (1992) show that these estimators are biased and inconsistent (i.e., the bias remains even as the sample size increases).
For multiply censored data, this method was studied by Helsel and Cohn
(1988).
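As a concrete illustration of the substitution approach just described (the observations and detection limit below are made up purely for illustration), the computation amounts to nothing more than:

x <- c(2.3, 4.1, 5.0, 5.0, 7.8, 9.2)            # 5.0 is the hypothetical detection limit
censored <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
x.sub <- ifelse(censored, x / 2, x)             # replace censored values with half the censoring level
c(mean = mean(x.sub), sd = sd(x.sub))           # usual formulas applied to the substituted data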
Estimation Methods for Singly Censored Data
The following methods are available only for singly censored data.
Bias-Corrected Maximum Likelihood Estimation (method="bcmle"
)
The maximum likelihood estimates of $\mu$ and $\sigma$ are biased. The bias tends to 0 as the sample size increases, but it can be considerable for small sample sizes, especially in the case of a large percentage of censored observations (Saw, 1961b). Schmee et al. (1985) note that bias and variances of the mle's are of the order $1/N$ (see for example, Bain and Engelhardt, 1991), and that for 90% censoring the bias is negligible if $N$ is at least 100. (For less intense censoring, even fewer observations are needed.)

The exact bias of each estimator is extremely difficult to compute. Saw (1961b), however, derived the first-order term (i.e., the term of order $1/N$) in the bias of the mle's of $\mu$ and $\sigma$ and
proposed bias-corrected mle's. His bias-corrected estimators were derived for
the case of Type II singly censored data. Schneider (1986, p.110) and
Haas and Scheff (1990), however, state that this bias correction should
reduce the bias of the estimators in the case of Type I censoring as well.
Based on the tables of bias-correction terms given in Saw (1961b), Schneider (1986, pp.107-110) performed a least-squares fit to produce the following computational formulas for right-censored data:
For left-censored data, Equation (22) becomes:
Quantile-Quantile Regression Including the Censoring Level (method="qq.reg.w.cen.level"
)
This is a modification of the quantile-quantile regression method and was proposed
by El-Shaarawi (1989) in the context of lognormal data. El-Shaarawi's idea is to
include the censoring level and an associated plotting position, along with the
uncensored observations and their associated plotting positions, in order to include
information about the value of the censoring level .
For left singly censored data, the modification involves adding the point
to the plot before fitting the least-squares line.
For right singly censored data, the point
is added to the plot before fitting the least-squares line.
El-Shaarawi (1989) also proposed replacing the estimated normal quantiles with the
exact expected values of normal order statistics, and using the values in their
variance-covariance matrix to perform a weighted least least-squared regression.
These last two modifications are not incorporated here.
Imputation Using Quantile-Quantile Regression Including the Censoring Level
(method ="impute.w.qq.reg.w.cen.level"
)
This is exactly the same method as imputation using quantile-quantile regression
(method="impute.w.qq.reg"
), except that the quantile-quantile regression
including the censoring level method (method="qq.reg.w.cen.level"
) is used
to fit the regression line. In the context of lognormal data, this method is
discussed by El-Shaarawi (1989), which he denotes as the Modified LR Method.
Imputation Using Maximum Likelihood (method ="impute.w.mle"
)
This is exactly the same method as imputation with quantile-quantile regression
(method="impute.w.qq.reg"
), except that the maximum likelihood method
(method="mle"
) is used to compute the initial estimates of the mean and
standard deviation. In the context of lognormal data, this method is discussed
by El-Shaarawi (1989), which he denotes as the Modified Maximum Likelihood Method.
Iterative Imputation Using Quantile-Quantile Regression (method="iterative.impute.w.qq.reg"
)
This method is similar to the imputation with quantile-quantile regression method
(method="impute.w.qq.reg"
), but iterates until the estimates of the mean
and standard deviation converge. The algorithm is:
1. Compute the initial estimates of $\mu$ and $\sigma$ using the "impute.w.qq.reg" method. (Actually, any suitable estimates will do.)

2. Using the current values of $\mu$ and $\sigma$ and Equation (19), compute new imputed values of the censored observations.

3. Use the new imputed values along with the uncensored observations to compute new estimates of $\mu$ and $\sigma$ based on the usual formulas (see enorm).

4. Repeat Steps 2 and 3 until the estimates converge (the convergence criterion is determined by the arguments tol and convergence; see the entry for ... in the ARGUMENTS section above).
This method is discussed by Gleit (1985), which he denotes as
“Fill-In with Expected Values”.
M-Estimators (method="m.est"
)
This method was contributed by Leo R. Korn (Korn and Tyler, 2001).
This method finds location and scale estimates that are consistent at the
normal model and robust to deviations from the normal model, including both
outliers on the right and outliers on the left above and below the limit of
detection. The estimates are found by solving the simultaneous equations:
where
and and
denote the probability density function
(pdf) and cumulative distribution function (cdf) of
Student's t-distribution with
degrees of freedom.
This results in an M-estimating equation based on the t-density function (Korn and Tyler., 2001). Since the t-density has heavier tails than the normal density, this M-estimator will tend to down-weight values that are far away from the center of the data. When censoring is present, neither the location nor the scale estimates are consistent at the normal model. A computational correction is performed that converts the above M-estimator to another M-estimator that is consistent at the normal model, even under censoring.
The degrees of freedom parameter is set by the argument
t.df
and
may be viewed as a tuning parameter that will determine the robustness and
efficiency properties. When t.df
is large, the estimator is similar to
the usual mle and the output will then be very close to that when method="mle"
.
As t.df
decreases, the efficiency will decline and the outlier rejection
property will increase in strength. Choosing t.df=3
(the default) provides
a good combination of efficiency and robustness. A reasonable strategy is to
transform the data so that they are approximately symmetric (often the log
transformation for environmental data is appropriate) and then apply the
M-estimator using t.df=3
.
CONFIDENCE INTERVALS
This section explains how confidence intervals for the mean are
computed.
Likelihood Profile (ci.method="profile.likelihood"
)
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean while treating the standard deviation
as a nuisance parameter. Equation (3) above
shows the form of the likelihood function
for
multiply left-censored data, and Equation (7) shows the function for
multiply right-censored data.
Following Stryhn and Christensen (2003), denote the maximum likelihood estimates of the mean and standard deviation by $(\mu^*, \sigma^*)$. The likelihood ratio test statistic ($G^2$) of the hypothesis $H_0: \mu = \mu_0$ (where $\mu_0$ is a fixed value) equals the drop in $2 \log(L)$ between the “full” model and the reduced model with $\mu$ fixed at $\mu_0$, i.e.,

$$G^2 = 2 \{\log[L(\mu^*, \sigma^*)] - \log[L(\mu_0, \sigma_0^*)]\} \;\;\;\; (30)$$

where $\sigma_0^*$ is the maximum likelihood estimate of $\sigma$ for the reduced model (i.e., when $\mu = \mu_0$). Under the null hypothesis, the test statistic $G^2$ follows a chi-squared distribution with 1 degree of freedom.

Alternatively, we may express the test statistic in terms of the profile likelihood function $L_1$ for the mean $\mu$, which is obtained from the usual likelihood function by maximizing over the parameter $\sigma$, i.e.,

$$L_1(\mu) = \max_{\sigma} L(\mu, \sigma) \;\;\;\; (31)$$

Then we have

$$G^2 = 2 \{\log[L_1(\mu^*)] - \log[L_1(\mu_0)]\} \;\;\;\; (32)$$

A two-sided $(1-\alpha)100\%$ confidence interval for the mean $\mu$ consists of all values of $\mu_0$ for which the test is not significant at level $\alpha$:

$$\mu_0: \; G^2 \le \chi^2_{1, 1-\alpha} \;\;\;\; (33)$$

where $\chi^2_{\nu, p}$ denotes the $p$'th quantile of the chi-squared distribution with $\nu$ degrees of freedom. One-sided lower and one-sided upper confidence intervals are computed in a similar fashion, except that the quantity $1-\alpha$ in Equation (33) is replaced with $1-2\alpha$.
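To make the mechanics concrete, here is a minimal, self-contained sketch of inverting the likelihood-ratio test for a left-singly-censored normal sample. The simulated data, detection limit, and search grid are assumptions chosen only for illustration, and this is not EnvStats' own implementation (enormCensored with ci.method="profile.likelihood" handles this internally):

set.seed(42)
x <- rnorm(30, mean = 5, sd = 2)
DL <- 4                                   # hypothetical detection limit
cens <- x < DL
x[cens] <- DL                             # left-censor at DL
loglik <- function(mu, sigma) {
  # censored-normal log-likelihood: density for uncensored values, cdf for censored values
  sum(dnorm(x[!cens], mu, sigma, log = TRUE)) +
    sum(cens) * pnorm(DL, mu, sigma, log.p = TRUE)
}
prof <- function(mu) {
  # profile log-likelihood: maximize over sigma for a fixed value of the mean
  optimize(function(s) loglik(mu, s), interval = c(0.01, 20), maximum = TRUE)$objective
}
fit <- optim(c(mean(x), log(sd(x))), function(p) -loglik(p[1], exp(p[2])))
logL.max <- -fit$value                    # maximized log-likelihood of the full model
mu.grid <- seq(3, 7, by = 0.01)
keep <- 2 * (logL.max - sapply(mu.grid, prof)) <= qchisq(0.95, 1)
range(mu.grid[keep])                      # approximate 95% profile-likelihood CI for the mean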
Normal Approximation (ci.method="normal.approx"
)
This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$ is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t_{m-1, 1-\alpha/2} \, \hat{\sigma}_{\hat{\mu}}, \; \hat{\mu} + t_{m-1, 1-\alpha/2} \, \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $m$ denotes the assumed sample size for the confidence interval, and $t_{\nu, p}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

The argument ci.sample.size determines the value of $m$ (see the entry for ... in the ARGUMENTS section above). When method equals "mle" or "bcmle", the default value is the expected number of uncensored observations, otherwise it is the observed number of uncensored observations. This is simply an ad-hoc method of constructing confidence intervals and is not based on any published theoretical results.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.
Approximate Confidence Interval Based on Maximum Likelihood Estimators
When method="mle", the standard deviation of the mle of $\mu$ is estimated based on the inverse of the Fisher Information matrix. The estimated variance-covariance matrix for the estimates of $\mu$ and $\sigma$ is based on the observed information matrix, formulas for which are given in Cohen (1991).
Approximate Confidence Interval Based on Bias-Corrected Maximum Likelihood Estimators
When method="bcmle"
(available only for singly censored data),
the same procedures are used to construct the
confidence interval as for method="mle"
. The true variance of the bias-corrected mle of $\mu$ is necessarily larger than the variance of the mle of $\mu$ (although the difference in the variances goes to 0 as the sample size gets large). Hence this method of constructing a confidence interval leads to intervals that are too short for small sample sizes, but these intervals should be better centered about the true value of $\mu$.
Approximate Confidence Interval Based on Other Estimators
When method is some value other than "mle"
, the standard deviation of the estimated mean is approximated by

$$\hat{\sigma}_{\hat{\mu}} = \frac{\hat{\sigma}}{\sqrt{m}}$$

where, as already noted, $m$ denotes the assumed sample size. This is simply an ad-hoc method of constructing confidence intervals and is not based on any published theoretical results.
Normal Approximation Using Covariance (ci.method="normal.approx.w.cov"
)
This method is only available for singly censored data and only applicable when
method="mle"
or method="bcmle"
. It was proposed by Schneider
(1986, pp. 191-193) for the case of Type II censoring, but is applicable to any
situation where the estimated mean and standard deviation are consistent estimators
and are correlated. In particular, the mle's of and
are
correlated under Type I censoring as well.
Schneider's idea is to determine two positive quantities such that
so that
is a confidence interval for
.
For cases where the estimators of and
are independent
(e.g., complete samples), it is well known that setting
yields an exact confidence interval and setting
where denotes the
'th quantile of the standard normal distribution
yields an approximate confidence interval that is asymptotically correct.
For the general case, Schneider (1986) considers the random variable
and provides formulas for and
.
Note that the resulting confidence interval for the mean is not symmetric about
the estimated mean. Also note that the quantity is a random variable for
Type I censoring, while Schneider (1986) assumed it to be fixed since he derived
the result for Type II censoring (in which case
).
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate confidence interval
for the population mean
, the bootstrap can be broken down into the
following steps:
1. Create a bootstrap sample by taking a random sample of size $N$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample (e.g., the number of censored observations in the bootstrap sample will often differ from the number of censored observations in the original sample).

2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$.

3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function enormCensored, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.

4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of this estimator of $\mu$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.
The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$[\hat{G}^{-1}(\frac{\alpha}{2}), \; \hat{G}^{-1}(1 - \frac{\alpha}{2})]$$

where $\hat{G}(t)$ denotes the empirical cdf of the bootstrap estimates of $\mu$ evaluated at $t$, and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha), \; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[-\infty, \; \hat{G}^{-1}(1 - \alpha)]$$

The function enormCensored calls the R function quantile to compute the empirical quantiles used in these intervals.
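For intuition, here is a minimal sketch of the percentile interval for the mean using complete (uncensored) simulated data; the sample, distribution, and number of bootstraps are assumptions chosen purely for illustration:

set.seed(123)
x <- rlnorm(25, meanlog = 1, sdlog = 0.5)                # hypothetical skewed sample
boot.means <- replicate(1000, mean(sample(x, replace = TRUE)))
quantile(boot.means, probs = c(0.025, 0.975))            # two-sided 95% percentile interval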
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{N}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/N$ instead of $k/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha_1), \; \hat{G}^{-1}(\alpha_2)]$$

where

$$\alpha_1 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}]$$

$$\alpha_2 = \Phi[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}]$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^{N} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6 [\sum_{i=1}^{N} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2]^{3/2}}$$

where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$[\hat{G}^{-1}(\alpha_1), \; \infty]$$

and a one-sided upper confidence interval is given by:

$$[-\infty, \; \hat{G}^{-1}(\alpha_2)]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$.

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.
When ci.method="bootstrap"
, the function enormCensored
computes both
the percentile method and bias-corrected and accelerated method
bootstrap confidence intervals.
This method of constructing confidence intervals for censored data was studied by
Shumway et al. (1989).
Generalized Pivotal Quantity (ci.method="gpq"
)
This method was introduced by Schmee et al. (1985) and is discussed by
Krishnamoorthy and Mathew (2009). The idea is essentially to use a parametric
bootstrap to estimate the correct pivotal quantities in Equation (38) above. For singly censored data, these quantities are computed as follows:

1. Generate a random sample of $N$ observations from a standard normal (i.e., N(0,1)) distribution and let $z_{(1)}, z_{(2)}, \ldots, z_{(N)}$ denote the ordered (sorted) observations.

2. Set the smallest $c$ observations to be censored.

3. Compute the estimates of $\mu$ and $\sigma$ using the method specified by the method argument, and denote these estimates as $\hat{\mu}^*$ and $\hat{\sigma}^*$.

4. Compute the t-like pivotal quantity $\hat{t} = \hat{\mu}^* / \hat{\sigma}^*$.

5. Repeat steps 1-4 nmc times to produce an empirical distribution of the t-like pivotal quantity.
The function enormCensored
calls the function
gpqCiNormSinglyCensored
to generate the distribution of
pivotal quantities in the case of singly censored data.
A two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is then computed as:

$$[\hat{\mu} - \hat{t}_{1-\alpha/2} \, \hat{\sigma}, \; \hat{\mu} - \hat{t}_{\alpha/2} \, \hat{\sigma}]$$

where $\hat{t}_p$ denotes the $p$'th empirical quantile of the nmc generated values of the pivotal quantity.
Schmee et al. (1985) derived this method in the context of Type II singly censored data (for which these limits are exact within Monte Carlo error), but state that according to Regal (1982) this method produces confidence intervals that are close approximations to the correct limits for Type I censored data.
For multiply censored data, this method has been extended as follows. The
algorithm stays the same, except that Step 2 becomes:
2. Set the $i$'th ordered generated observation to be censored or not censored according to whether the $i$'th observed observation in the original data is censored or not censored.
The function enormCensored
calls the function
gpqCiNormMultiplyCensored
to generate the distribution of
pivotal quantities in the case of multiply censored data.
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Schmee et al. (1985) studied Type II censoring for a normal distribution and noted that the bias and variances of the maximum likelihood estimators are of the order $1/N$, and that the bias is negligible for $N \ge 100$ and as much as 90% censoring. (If the proportion of censored observations is less than 90%, the bias becomes negligible for smaller sample sizes.) For small samples with moderate to high censoring, however, the bias of the mle's causes confidence intervals based on them using a normal approximation (e.g., method="mle" and ci.method="normal.approx") to be too short. Schmee et al. (1985) provide tables of exact confidence intervals for small sample sizes that were created based on Monte Carlo simulation. Schmee et al. (1985) state that these tables should work well for Type I censored data as well.
Shumway et al. (1989) evaluated the coverage of 90% confidence intervals for the mean based on using a Box-Cox transformation to induce normality, computing the mle's based on the normal distribution, then computing the mean in the original scale. They considered three methods of constructing confidence intervals: the delta method, the bootstrap, and the bias-corrected bootstrap. Shumway et al. (1989) used three parent distributions in their study: Normal(3,1), the square of this distribution, and the exponentiation of this distribution (i.e., a lognormal distribution). Based on sample sizes of 10 and 50 with a censoring level at the 10'th or 20'th percentile, Shumway et al. (1989) found that the delta method performed quite well and was superior to the bootstrap method.
Millard et al. (2015; in preparation) show that the coverage of the profile likelihood method is excellent.
Steven P. Millard ([email protected])
Bain, L.J., and M. Engelhardt. (1991). Statistical Analysis of Reliability and Life-Testing Models. Marcel Dekker, New York, 496pp.
Cohen, A.C. (1959). Simplified Estimators for the Normal Distribution When Samples are Singly Censored or Truncated. Technometrics 1(3), 217–237.
Cohen, A.C. (1963). Progressively Censored Samples in Life Testing. Technometrics 5, 327–339
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
El-Shaarawi, A.H. (1989). Inferences About the Mean from Censored Water Quality Data. Water Resources Research 25(4) 685–690.
El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.
El-Shaarawi, A.H., and S.R. Esterby. (1992). Replacement of Censored Observations by a Constant: An Evaluation. Water Research 26(6), 835–844.
El-Shaarawi, A.H., and A. Naderi. (1991). Statistical Inference from Multiply Censored Environmental Data. Environmental Monitoring and Assessment 17, 339–347.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Gleit, A. (1985). Estimation for Small Normal Data Sets with Detection Limits. Environmental Science and Technology 19, 1201–1206.
Haas, C.N., and P.A. Scheff. (1990). Estimation of Averages in Truncated Samples. Environmental Science and Technology 24(6), 912–919.
Hashimoto, L.K., and R.R. Trussell. (1983). Evaluating Water Quality Data Near the Detection Limit. Paper presented at the Advanced Technology Conference, American Water Works Association, Las Vegas, Nevada, June 5-9, 1983.
Helsel, D.R. (1990). Less than Obvious: Statistical Treatment of Data Below the Detection Limit. Environmental Science and Technology 24(12), 1766–1774.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715–727.
Korn, L.R., and D.E. Tyler. (2001). Robust Estimation for Chemical Concentration Data Subject to Detection Limits. In Fernholz, L., S. Morgenthaler, and W. Stahel, eds. Statistics in Genetics and in the Environmental Sciences. Birkhauser Verlag, Basel, pp.41–63.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461–496.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Newman, M.C., P.M. Dixon, B.B. Looney, and J.E. Pinder. (1989). Estimating Mean and Variance for Environmental Samples with Below Detection Limit Observations. Water Resources Bulletin 25(4), 905–916.
Pettitt, A. N. (1983). Re-Weighted Least Squares Estimation with Censored and Grouped Data: An Application of the EM Algorithm. Journal of the Royal Statistical Society, Series B 47, 253–260.
Regal, R. (1982). Applying Order Statistic Censored Normal Confidence Intervals to Time Censored Data. Unpublished manuscript, University of Minnesota, Duluth, Department of Mathematical Sciences.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Saw, J.G. (1961b). The Bias of the Maximum Likelihood Estimators of Location and Scale Parameters Given a Type II Censored Normal Sample. Biometrika 48, 448–451.
Schmee, J., D.Gladstein, and W. Nelson. (1985). Confidence Limits for Parameters of a Normal Distribution from Singly Censored Samples, Using Maximum Likelihood. Technometrics 27(2) 119–128.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, New York, 273pp.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Travis, C.C., and M.L. Land. (1990). Estimating the Mean of Data Sets with Nondetectable Values. Environmental Science and Technology 24, 961–962.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
Normal, enorm, estimateCensored.object.
# Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and standard deviation using the MLE, # Q-Q regression (also called parametric regression on order statistics # or ROS; e.g., USEPA, 2009 and Helsel, 2012), and imputation with Q-Q # regression (also called robust regression on order statistics or rROS). # We will log-transform the original observations and then call # enormCensored. Alternatively, we could have more simply called # elnormCensored. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean and standard deviation on the log-scale # using the MLE: #--------------------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 0.6931472 1.6094379 # #Estimated Parameter(s): mean = 2.215905 # sd = 1.356291 # #Estimation Method: MLE # #Data: log(Manganese.ppb) # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # Now compare the MLE with the estimators based on # Q-Q regression and imputation with Q-Q regression #-------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored))$parameters # mean sd #2.215905 1.356291 with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored, method = "ROS"))$parameters # mean sd #2.293742 1.283635 with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored, method = "rROS"))$parameters # mean sd #2.298656 1.238104 #---------- # The method used to estimate quantiles for a Q-Q plot is # determined by the argument prob.method. For the functions # enormCensored and elnormCensored, for any estimation # method that involves Q-Q regression, the default value of # prob.method is "hirsch-stedinger" and the default value for the # plotting position constant is plot.pos.con=0.375. # Both Helsel (2012) and USEPA (2009) also use the Hirsch-Stedinger # probability method but set the plotting position constant to 0. with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored, method = "rROS", plot.pos.con = 0))$parameters # mean sd #2.277175 1.261431 #---------- # Using the same data as above, compute a confidence interval # for the mean on the log-scale using the profile-likelihood # method. 
with(EPA.09.Ex.15.1.manganese.df, enormCensored(log(Manganese.ppb), Censored, ci = TRUE)) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 0.6931472 1.6094379 # #Estimated Parameter(s): mean = 2.215905 # sd = 1.356291 # #Estimation Method: MLE # #Data: log(Manganese.ppb) # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 1.595062 # UCL = 2.771197
Estimate the mean, standard deviation, and standard error of the mean nonparametrically given a sample of data, and optionally construct a confidence interval for the mean.
enpar(x, ci = FALSE, ci.method = "bootstrap", ci.type = "two-sided", conf.level = 0.95, pivot.statistic = "z", n.bootstraps = 1000, seed = NULL)
x |
numeric vector of observations.
Missing ( |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are
|
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
pivot.statistic |
character string indicating which statistic to use for the confidence interval
for the mean when |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean. This argument is ignored if
|
seed |
integer supplied to the function |
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from some distribution with mean $\mu$ and standard deviation $\sigma$.

Estimation

Unbiased and consistent estimators of the mean and variance are:

$$\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

$$\hat{\sigma}^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

A consistent (but not unbiased) estimate of the standard deviation is given by the square root of the estimated variance above:

$$\hat{\sigma} = s$$

It can be shown that the variance of the sample mean is given by:

$$Var(\bar{x}) = \sigma^2_{\bar{x}} = \frac{\sigma^2}{n}$$

so the standard deviation of the sample mean (usually called the standard error) can be estimated by:

$$\hat{\sigma}_{\bar{x}} = \frac{s}{\sqrt{n}}$$
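A minimal sketch in base R of the estimators just described (the data values are hypothetical and used only for illustration):

# Sample mean, standard deviation, and standard error of the mean,
# computed directly from the formulas above (hypothetical data).
x <- c(2.1, 3.5, 4.2, 5.8, 7.0)

n       <- length(x)
mean.x  <- sum(x) / n                     # unbiased estimate of the mean
var.x   <- sum((x - mean.x)^2) / (n - 1)  # unbiased estimate of the variance
sd.x    <- sqrt(var.x)                    # consistent estimate of the standard deviation
se.mean <- sd.x / sqrt(n)                 # standard error of the mean

c(mean = mean.x, sd = sd.x, se.mean = se.mean)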
Confidence Intervals

This section explains how confidence intervals for the mean $\mu$ are computed.

Normal Approximation (ci.method="normal.approx")

This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$, i.e., the sample mean, is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t_{1-\alpha/2, n-1} \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} + t_{1-\alpha/2, n-1} \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $n$ denotes the assumed sample size for the confidence interval, and $t_{p,\nu}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.
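A minimal sketch in base R of the two-sided interval above using the Student's t pivot (hypothetical data), which is the computation this method performs:

# Two-sided normal-approximation confidence interval for the mean (t pivot).
x <- c(2.1, 3.5, 4.2, 5.8, 7.0)   # hypothetical data
conf.level <- 0.95
alpha      <- 1 - conf.level

n       <- length(x)
mean.x  <- mean(x)
se.mean <- sd(x) / sqrt(n)
t.quant <- qt(1 - alpha / 2, df = n - 1)

c(LCL = mean.x - t.quant * se.mean, UCL = mean.x + t.quant * se.mean)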
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap")

The bootstrap is a nonparametric method of estimating the distribution (and associated distribution parameters and quantiles) of a sample statistic, regardless of the distribution of the population from which the sample was drawn. The bootstrap was introduced by Efron (1979) and a general reference is Efron and Tibshirani (1993).

In the context of deriving an approximate $(1-\alpha)100\%$ confidence interval for the population mean $\mu$, the bootstrap can be broken down into the following steps:

1. Create a bootstrap sample by taking a random sample of size $n$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample.

2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$. For the bootstrap-t method (see below), this step also involves estimating the standard error of the estimate of the mean and computing the statistic $T^* = (\hat{\mu}_B - \hat{\mu}) / \hat{\sigma}_{\hat{\mu}_B}$, where $\hat{\mu}$ denotes the estimate of the mean based on the original sample, and $\hat{\mu}_B$ and $\hat{\sigma}_{\hat{\mu}_B}$ denote the estimate of the mean and the estimate of the standard error of the estimate of the mean based on the bootstrap sample.

3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function enpar, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.

4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of the estimator of $\mu$, or to compute the empirical cumulative distribution function of the statistic $T^*$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$[\hat{G}^{-1}(\tfrac{\alpha}{2}), \;\; \hat{G}^{-1}(1 - \tfrac{\alpha}{2})]$$

where $\hat{G}(t)$ denotes the empirical cdf of $\hat{\mu}_B$ evaluated at $t$ and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile of the distribution of $\hat{\mu}_B$, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha), \;\; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[-\infty, \;\; \hat{G}^{-1}(1 - \alpha)]$$

The function enpar calls the R function quantile to compute the empirical quantiles used in the percentile intervals above.
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{n}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/n$ instead of $k/\sqrt{n}$. The two-sided bias-corrected and accelerated confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)]$$

where

$$\alpha_1 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right]$$

$$\alpha_2 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right]$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^{n} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6 \left[\sum_{i=1}^{n} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2\right]^{3/2}}$$

where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$[\hat{G}^{-1}(\alpha_1), \;\; \infty]$$

and a one-sided upper confidence interval is given by:

$$[-\infty, \;\; \hat{G}^{-1}(\alpha_2)]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$ in the equations for $\alpha_1$ and $\alpha_2$ above.

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.
For the bootstrap-t method, the two-sided confidence interval (Efron and Tibshirani, 1993, p.160) is computed as:

$$[\hat{\mu} - t^*_{1-\alpha/2} \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} - t^*_{\alpha/2} \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ and $\hat{\sigma}_{\hat{\mu}}$ denote the estimate of the mean and standard error of the estimate of the mean based on the original sample, and $t^*_p$ denotes the $p$'th empirical quantile of the bootstrap distribution of the statistic $T^*$. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{\mu} - t^*_{1-\alpha} \hat{\sigma}_{\hat{\mu}}, \;\; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[-\infty, \;\; \hat{\mu} - t^*_{\alpha} \hat{\sigma}_{\hat{\mu}}]$$
When ci.method="bootstrap"
, the function enpar
computes
the percentile method, bias-corrected and accelerated method, and bootstrap-t
bootstrap confidence intervals. The percentile method is transformation respecting,
but not second-order accurate. The bootstrap-t method is second-order accurate, but not
transformation respecting. The bias-corrected and accelerated method is both
transformation respecting and second-order accurate (Efron and Tibshirani, 1993, p.188).
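A minimal sketch in base R of Steps 1-4 for the percentile interval described above (hypothetical data; enpar() with ci.method="bootstrap" additionally reports the BCa and bootstrap-t limits):

# Percentile bootstrap confidence interval for the mean (hypothetical data).
set.seed(476)
x <- c(2.1, 3.5, 4.2, 5.8, 7.0, 9.3, 12.4)
B <- 1000
conf.level <- 0.95
alpha      <- 1 - conf.level

# Steps 1-3: B bootstrap estimates of the mean
boot.means <- replicate(B, mean(sample(x, replace = TRUE)))

# Step 4: percentile interval from the empirical cdf of the bootstrap means
quantile(boot.means, probs = c(alpha / 2, 1 - alpha / 2))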
a list of class "estimate"
containing the estimated parameters
and other information. See estimate.object
for details.
The function enpar
is related to the companion function
enparCensored
for censored data. To estimate the median and
compute a confidence interval, use eqnpar
.
The result of the call to enpar
with ci.method="normal.approx"
and pivot.statistic="t"
produces the same result as the call to
enorm
with ci.param="mean"
.
Steven P. Millard ([email protected])
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
enparCensored
, eqnpar
, enorm
,
mean
, sd
, estimate.object
.
# The data frame ACE.13.TCE.df contains observations on # Trichloroethylene (TCE) concentrations (mg/L) at # 10 groundwater monitoring wells before and after remediation. # # Compute the mean concentration for each period along with # a 95% bootstrap BCa confidence interval for the mean. # # NOTE: Use of the argument "seed" is necessary to reproduce this example. # # Before remediation: 21.6 [14.2, 30.1] # After remediation: 3.6 [ 1.6, 5.7] with(ACE.13.TCE.df, enpar(TCE.mg.per.L[Period=="Before"], ci = TRUE, seed = 476)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Parameter(s): mean = 21.62400 # sd = 13.51134 # se.mean = 4.27266 # #Estimation Method: Sample Mean # #Data: TCE.mg.per.L[Period == "Before"] # #Sample Size: 10 # #Confidence Interval for: mean # #Confidence Interval Method: Bootstrap # #Number of Bootstraps: 1000 # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: Pct.LCL = 13.95560 # Pct.UCL = 29.79510 # BCa.LCL = 14.16080 # BCa.UCL = 30.06848 # t.LCL = 12.41945 # t.UCL = 32.47306 #---------- with(ACE.13.TCE.df, enpar(TCE.mg.per.L[Period=="After"], ci = TRUE, seed = 543)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Parameter(s): mean = 3.632900 # sd = 3.554419 # se.mean = 1.124006 # #Estimation Method: Sample Mean # #Data: TCE.mg.per.L[Period == "After"] # #Sample Size: 10 # #Confidence Interval for: mean # #Confidence Interval Method: Bootstrap # #Number of Bootstraps: 1000 # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: Pct.LCL = 1.833843 # Pct.UCL = 5.830230 # BCa.LCL = 1.631655 # BCa.UCL = 5.677514 # t.LCL = 1.683791 # t.UCL = 8.101829
Estimate the mean, standard deviation, and standard error of the mean nonparametrically given a sample of data from a positive-valued distribution that has been subjected to left- or right-censoring, and optionally construct a confidence interval for the mean.
enparCensored(x, censored, censoring.side = "left", correct.se = TRUE, restricted = FALSE, left.censored.min = "Censoring Level", right.censored.max = "Censoring Level", ci = FALSE, ci.method = "normal.approx", ci.type = "two-sided", conf.level = 0.95, pivot.statistic = "t", ci.sample.size = "Total", n.bootstraps = 1000, seed = NULL, warn = FALSE)
x |
numeric vector of positive-valued observations.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
correct.se |
logical scalar indicating whether to multiply the estimated standard error
by a factor to correct for bias. The default value is |
restricted |
logical scalar indicating whether to compute the restricted mean in the case when
the smallest censored value is less than or equal to the smallest uncensored value
(left-censored data) or the largest censored value is greater than or equal to the
largest uncensored value (right-censored data). The default value is
|
left.censored.min |
Only relevant for the case when |
right.censored.max |
Only relevant for the case when |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are
|
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
pivot.statistic |
character string indicating which statistic to use for the confidence interval
for the mean when |
ci.sample.size |
character string indicating what sample size to assume when
computing the confidence interval for the mean when |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when |
seed |
integer supplied to the function |
warn |
logical scalar indicating whether to issue a notification in the case when a
restricted mean will be estimated, but setting the smallest censored value(s)
to an uncensored value (left-censored data) or setting the largest censored
value(s) to an uncensored value (right-censored data) results in no censored
values in the data. In this case, the function |
Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a vector of $N$ observations from some positive-valued distribution with mean $\mu$ and standard deviation $\sigma$. Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c = N - n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ censoring levels

$$T_1, T_2, \ldots, T_k$$

Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ denote the $n$ ordered uncensored observations, and let $r_1, r_2, \ldots, r_n$ denote the order of these uncensored observations within the context of all the observations (censored and uncensored). For example, if the left-censored data are {<10, 14, 14, <15, 20}, then $x_{(1)} = 14$, $x_{(2)} = 14$, $x_{(3)} = 20$, and $r_1 = 2$, $r_2 = 3$, $r_3 = 5$.

Let $x'_{(1)} < x'_{(2)} < \cdots < x'_{(m)}$ denote the $m$ ordered distinct uncensored observations, let $d_j$ denote the number of detects at $x'_{(j)}$ ($j = 1, 2, \ldots, m$), and let $n_j$ denote the number of $x_i \le x'_{(j)}$, i.e., the number of observations (censored and uncensored) less than or equal to $x'_{(j)}$ ($j = 1, 2, \ldots, m$). For example, if the left-censored data are {<10, 14, 14, <15, 20}, then $m = 2$, $x'_{(1)} = 14$, $x'_{(2)} = 20$, $d_1 = 2$, $d_2 = 1$, and $n_1 = 3$, $n_2 = 5$.
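A minimal sketch in base R of this bookkeeping for the small example above, assuming the non-detects are coded at their censoring levels (the helper objects are illustrative, not part of the package):

# Left-censored example {<10, 14, 14, <15, 20}, non-detects coded at censoring levels.
x        <- c(10, 14, 14, 15, 20)
censored <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

x.uncen <- sort(x[!censored])                         # ordered uncensored observations
r       <- rank(x, ties.method = "first")[!censored]  # order among all observations: 2, 3, 5
x.dist  <- sort(unique(x.uncen))                      # distinct uncensored values: 14, 20
m       <- length(x.dist)                             # m = 2
d       <- sapply(x.dist, function(v) sum(x == v & !censored))  # detects at each value: 2, 1
n.j     <- sapply(x.dist, function(v) sum(x <= v))    # observations <= each value: 3, 5

list(r = r, m = m, d = d, n.j = n.j)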
Estimation

This section explains how the mean $\mu$, standard deviation $\sigma$, and standard error of the mean are estimated, as well as the restricted mean.

Estimating the Mean

It can be shown that the mean of a positive-valued distribution is equal to the area under the survival curve (Klein and Moeschberger, 2003, p.33):

$$\mu = \int_0^\infty [1 - F(t)] \, dt = \int_0^\infty S(t) \, dt$$

where $F(t)$ denotes the cumulative distribution function evaluated at $t$ and $S(t) = 1 - F(t)$ denotes the survival function evaluated at $t$. When the Kaplan-Meier estimator is used to construct the survival function, you can use the area under this curve to estimate the mean of the distribution, and the estimator can be as efficient or more efficient than parametric estimators of the mean (Meier, 2004; Helsel, 2012; Lee and Wang, 2003). Let $\hat{F}(t)$ denote the Kaplan-Meier estimator of the empirical cumulative distribution function (ecdf) evaluated at $t$, and let $\hat{S}(t) = 1 - \hat{F}(t)$ denote the estimated survival function evaluated at $t$. (See the help files for ecdfPlotCensored and qqPlotCensored for an explanation of how the Kaplan-Meier estimator of the ecdf is computed.)

The formula for the estimated mean is given by (Lee and Wang, 2003, p. 74):

$$\hat{\mu} = \sum_{i=1}^{n} \hat{S}(x_{(i-1)}) [x_{(i)} - x_{(i-1)}]$$

where $x_{(0)} = 0$ and $\hat{S}(x_{(0)}) = 1$ by definition. It can be shown that this formula is equivalent to:

$$\hat{\mu} = \sum_{i=1}^{n} x_{(i)} [\hat{F}(x_{(i)}) - \hat{F}(x_{(i-1)})]$$

where $\hat{F}(x_{(0)}) = 0$ by definition, and this is equivalent to:

$$\hat{\mu} = \sum_{j=1}^{m} x'_{(j)} [\hat{F}(x'_{(j)}) - \hat{F}(x'_{(j-1)})]$$

(USEPA, 2009, pp. 15-7 to 15-12; Beal, 2010; USEPA, 2022, pp. 128-129).
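A minimal sketch in base R of this estimator for the small left-censored example above, computed as a weighted sum of the distinct detected values (the data and helper objects are illustrative, not from the guidance documents):

# Kaplan-Meier estimate of the mean for {<10, 14, 14, <15, 20},
# with non-detects coded at their censoring levels.
x        <- c(10, 14, 14, 15, 20)
censored <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

x.dist <- sort(unique(x[!censored]))                           # distinct detected values: 14, 20
d      <- sapply(x.dist, function(v) sum(x == v & !censored))  # detects at each value
n.j    <- sapply(x.dist, function(v) sum(x <= v))              # observations at or below each value
m      <- length(x.dist)

# Kaplan-Meier estimate of the cdf at each distinct detected value
F.hat <- sapply(seq_len(m), function(j) {
  if (j == m) 1 else prod((n.j[(j + 1):m] - d[(j + 1):m]) / n.j[(j + 1):m])
})

# Estimated mean: each distinct detected value times its estimated probability mass
mass    <- diff(c(0, F.hat))
mean.km <- sum(x.dist * mass)
mean.km   # 15.2 for this example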
Estimating the Standard Deviation

The formula for the estimated standard deviation is:

$$\hat{\sigma} = \left\{ \sum_{i=1}^{n} [x_{(i)} - \hat{\mu}]^2 [\hat{F}(x_{(i)}) - \hat{F}(x_{(i-1)})] \right\}^{1/2}$$

which is equivalent to:

$$\hat{\sigma} = \left\{ \sum_{j=1}^{m} [x'_{(j)} - \hat{\mu}]^2 [\hat{F}(x'_{(j)}) - \hat{F}(x'_{(j-1)})] \right\}^{1/2}$$

(USEPA, 2009, p. 15-10; Beal, 2010).
Estimating the Standard Error of the Mean

For left-censored data, the formula for the estimated standard error of the mean is:

$$\hat{\sigma}_{\hat{\mu}} = \left[ \sum_{j=1}^{m} A_j^2 \, \frac{d_j}{n_j (n_j - d_j)} \right]^{1/2}$$

where $A_j$ denotes the area under the estimated cumulative distribution function to the left of $x'_{(j)}$:

$$A_j = \sum_{i=1}^{j-1} \hat{F}(x'_{(i)}) [x'_{(i+1)} - x'_{(i)}]$$

(Beal, 2010; USEPA, 2022, pp. 128-129).

For right-censored data, the estimated standard error of the mean has the analogous form, with $A_j$ equal to the area under the estimated survival curve to the right of $x'_{(j)}$ and $n_j$ equal to the number of observations greater than or equal to $x'_{(j)}$ (Lee and Wang, 2003, p. 74).
Kaplan and Meier suggest using a bias correction of $n/(n-1)$ for the estimated variance of the mean, where $n$ denotes the number of uncensored observations (Lee and Wang, 2003, p.75):

$$\widehat{Var}_{BC}(\hat{\mu}) = \frac{n}{n-1} \, \widehat{Var}(\hat{\mu})$$

When correct.se=TRUE (the default), this bias-corrected estimate of the variance (and hence of the standard error) of the mean is used. Beal (2010), ProUCL 5.2.0 (USEPA, 2022), and the kmms function in the STAND package (Frome and Frome, 2015) all compute the bias-corrected estimate of the standard error of the mean as well.
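As a minimal illustration of the bias correction applied when correct.se=TRUE (the object names and values are hypothetical):

# Bias-corrected standard error of the mean, given an uncorrected standard
# error and the number of uncensored observations.
n.uncensored   <- 19
se.uncorrected <- 100
se.corrected   <- se.uncorrected * sqrt(n.uncensored / (n.uncensored - 1))
se.corrected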
Estimating the Restricted Mean
If the smallest value for left-censored data is censored and less than or equal to
the smallest uncensored value, then the estimated mean will be biased high, and
if the largest value for right-censored data is censored and greater than or equal to
the largest uncensored value, then the estimated mean will be biased low. One solution
to this problem is to instead estimate what is called the restricted mean
(Miller, 1981; Lee and Wang, 2003, p. 74; Meier, 2004; Barker, 2009).
To compute the restricted mean (restricted=TRUE) for left-censored data, the smallest censored observation(s) are treated as observed and set to the smallest censoring level (left.censored.min="Censoring Level") or to some other value that is less than the smallest censoring level and greater than 0, and the above formulas are then applied. To compute the restricted mean for right-censored data, the largest censored observation(s) are treated as observed and set to the censoring level (right.censored.max="Censoring Level") or to some value greater than the largest censoring level.
ProUCL 5.2.0 (USEPA, 2022, pp. 128–129) and Beal (2010) do not compute the restricted
mean in cases where it could be applied, whereas USEPA (2009, pp. 15–7 to 15–12) and
the kmms
function in Version 2.0 of the R package STAND
(Frome and Frome, 2015) do compute the restricted mean and set the smallest
censored observation(s) equal to the censoring level (i.e., what
enparCensored
does when restricted=TRUE
and
left.censored.min="Censoring Level"
).
To be consistent with ProUCL 5.2.0, by default the function enparCensored
does not compute the restricted mean (i.e., restricted=FALSE
). It should
be noted that when the restricted mean is computed, the number of uncensored
observations increases because the smallest (left-censored) or largest
(right-censored) censored observation(s) is/are set to a specified value and
treated as uncensored. The kmms
function in Version 2.0 of the
STAND package (Frome and Frome, 2015) is inconsistent in how it treats the
number of uncensored observations when computing estimates associated with the
restricted mean. Although kmms
sets the smallest censored observations to the
observed censoring level and treats them as not censored, when it computes
the bias correction factor for the standard error of the mean, it assumes those
observations are still censored (see the EXAMPLES section below).
In the unusual case when a restricted mean will be estimated and setting the
smallest censored value(s) to an uncensored value (left-censored data), or
setting the largest censored value(s) to an uncensored value
(right-censored data), results in no censored values in the data, the Kaplan-Meier
estimate of the mean reduces to the sample mean, so the function
enpar
is called and, if warn=TRUE
, a warning is returned.
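An illustrative call (using the small hypothetical example from the Details above, not data from the guidance documents) showing the unrestricted and restricted estimates when the smallest observation is censored below the smallest detect:

# Left-censored toy example {<10, 14, 14, <15, 20}
x        <- c(10, 14, 14, 15, 20)
censored <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Unrestricted mean (the default, consistent with ProUCL 5.2.0)
enparCensored(x = x, censored = censored)$parameters

# Restricted mean, setting the smallest censored value to its censoring level
enparCensored(x = x, censored = censored, restricted = TRUE,
  left.censored.min = "Censoring Level")$parameters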
Confidence Intervals

This section explains how confidence intervals for the mean $\mu$ are computed.

Normal Approximation (ci.method="normal.approx")

This method constructs approximate $(1-\alpha)100\%$ confidence intervals for $\mu$ based on the assumption that the estimator of $\mu$ is approximately normally distributed. That is, a two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is constructed as:

$$[\hat{\mu} - t_{1-\alpha/2, n^*-1} \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} + t_{1-\alpha/2, n^*-1} \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ denotes the estimate of $\mu$, $\hat{\sigma}_{\hat{\mu}}$ denotes the estimated asymptotic standard deviation of the estimator of $\mu$, $n^*$ denotes the assumed sample size for the confidence interval, and $t_{p,\nu}$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom. One-sided confidence intervals are computed in a similar fashion.

The argument ci.sample.size determines the value of $n^*$. The possible values are the total number of observations, $N$ (ci.sample.size="Total"), or the number of uncensored observations, $n$ (ci.sample.size="Uncensored"). To be consistent with ProUCL 5.2.0, in enparCensored the default value is the total number of observations. The kmms function in the STAND package, on the other hand, uses the number of uncensored observations.

When pivot.statistic="z", the $p$'th quantile from the standard normal distribution is used in place of the $p$'th quantile from Student's t-distribution.
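A minimal sketch in base R of the one-sided upper confidence limit under this method, using the Kaplan-Meier estimates from the Beal (2010) lead example shown in the EXAMPLES section below (ci.sample.size="Total", pivot.statistic="t"):

# Upper confidence limit for the mean based on the t pivot and the total sample size.
mean.km <- 325.3396   # Kaplan-Meier estimate of the mean
se.km   <- 315.0023   # bias-corrected standard error of the mean
N       <- 29         # total number of observations
mean.km + qt(0.95, df = N - 1) * se.km   # approximately 861.2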
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap")

The bootstrap is a nonparametric method of estimating the distribution (and associated distribution parameters and quantiles) of a sample statistic, regardless of the distribution of the population from which the sample was drawn. The bootstrap was introduced by Efron (1979) and a general reference is Efron and Tibshirani (1993).

In the context of deriving an approximate $(1-\alpha)100\%$ confidence interval for the population mean $\mu$, the bootstrap can be broken down into the following steps:

1. Create a bootstrap sample by taking a random sample of size $N$ from the observations in $\underline{x}$, where sampling is done with replacement. Note that because sampling is done with replacement, the same element of $\underline{x}$ can appear more than once in the bootstrap sample. Thus, the bootstrap sample will usually not look exactly like the original sample (e.g., the number of censored observations in the bootstrap sample will often differ from the number of censored observations in the original sample).

2. Estimate $\mu$ based on the bootstrap sample created in Step 1, using the same method that was used to estimate $\mu$ using the original observations in $\underline{x}$. Because the bootstrap sample usually does not match the original sample, the estimate of $\mu$ based on the bootstrap sample will usually differ from the original estimate based on $\underline{x}$. For the bootstrap-t method (see below), this step also involves estimating the standard error of the estimate of the mean and computing the statistic $T^* = (\hat{\mu}_B - \hat{\mu}) / \hat{\sigma}_{\hat{\mu}_B}$, where $\hat{\mu}$ denotes the estimate of the mean based on the original sample, and $\hat{\mu}_B$ and $\hat{\sigma}_{\hat{\mu}_B}$ denote the estimate of the mean and the estimate of the standard error of the estimate of the mean based on the bootstrap sample.

3. Repeat Steps 1 and 2 $B$ times, where $B$ is some large number. For the function enparCensored, the number of bootstraps $B$ is determined by the argument n.bootstraps (see the section ARGUMENTS above). The default value of n.bootstraps is 1000.

4. Use the $B$ estimated values of $\mu$ to compute the empirical cumulative distribution function of the estimator of $\mu$, or to compute the empirical cumulative distribution function of the statistic $T^*$ (see ecdfPlot), and then create a confidence interval for $\mu$ based on this estimated cdf.

The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:

$$[\hat{G}^{-1}(\tfrac{\alpha}{2}), \;\; \hat{G}^{-1}(1 - \tfrac{\alpha}{2})]$$

where $\hat{G}(t)$ denotes the empirical cdf of $\hat{\mu}_B$ evaluated at $t$ and thus $\hat{G}^{-1}(p)$ denotes the $p$'th empirical quantile of the distribution of $\hat{\mu}_B$, that is, the $p$'th quantile associated with the empirical cdf. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha), \;\; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[0, \;\; \hat{G}^{-1}(1 - \alpha)]$$

The function enparCensored calls the R function quantile to compute the empirical quantiles used in the percentile intervals above.
The percentile method bootstrap confidence interval is only first-order accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability that the confidence interval will contain the true value of $\mu$ can be off by $k/\sqrt{N}$, where $k$ is some constant. Efron and Tibshirani (1993, pp.184-188) proposed a bias-corrected and accelerated interval that is second-order accurate, meaning that the probability that the confidence interval will contain the true value of $\mu$ may be off by $k/N$ instead of $k/\sqrt{N}$. The two-sided bias-corrected and accelerated confidence interval is computed as:

$$[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)]$$

where

$$\alpha_1 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right]$$

$$\alpha_2 = \Phi\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right]$$

$$\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\mu})]$$

$$\hat{a} = \frac{\sum_{i=1}^{N} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^3}{6 \left[\sum_{i=1}^{N} (\hat{\mu}_{(\cdot)} - \hat{\mu}_{(i)})^2\right]^{3/2}}$$

where the quantity $\hat{\mu}_{(i)}$ denotes the estimate of $\mu$ using all the values in $\underline{x}$ except the $i$'th one, and

$$\hat{\mu}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mu}_{(i)}$$

A one-sided lower confidence interval is given by:

$$[\hat{G}^{-1}(\alpha_1), \;\; \infty]$$

and a one-sided upper confidence interval is given by:

$$[0, \;\; \hat{G}^{-1}(\alpha_2)]$$

where $\alpha_1$ and $\alpha_2$ are computed as for a two-sided confidence interval, except $\alpha/2$ is replaced with $\alpha$ in the equations for $\alpha_1$ and $\alpha_2$ above.

The constant $\hat{z}_0$ incorporates the bias correction, and the constant $\hat{a}$ is the acceleration constant. The term “acceleration” refers to the rate of change of the standard error of the estimate of $\mu$ with respect to the true value of $\mu$ (Efron and Tibshirani, 1993, p.186). For a normal (Gaussian) distribution, the standard error of the estimate of $\mu$ does not depend on the value of $\mu$, hence the acceleration constant is not really necessary.
For the bootstrap-t method, the two-sided confidence interval (Efron and Tibshirani, 1993, p.160) is computed as:

$$[\hat{\mu} - t^*_{1-\alpha/2} \hat{\sigma}_{\hat{\mu}}, \;\; \hat{\mu} - t^*_{\alpha/2} \hat{\sigma}_{\hat{\mu}}]$$

where $\hat{\mu}$ and $\hat{\sigma}_{\hat{\mu}}$ denote the estimate of the mean and standard error of the estimate of the mean based on the original sample, and $t^*_p$ denotes the $p$'th empirical quantile of the bootstrap distribution of the statistic $T^*$. Similarly, a one-sided lower confidence interval is computed as:

$$[\hat{\mu} - t^*_{1-\alpha} \hat{\sigma}_{\hat{\mu}}, \;\; \infty]$$

and a one-sided upper confidence interval is computed as:

$$[0, \;\; \hat{\mu} - t^*_{\alpha} \hat{\sigma}_{\hat{\mu}}]$$
When ci.method="bootstrap"
, the function enparCensored
computes
the percentile method, bias-corrected and accelerated method, and bootstrap-t
bootstrap confidence intervals. The percentile method is transformation respecting,
but not second-order accurate. The bootstrap-t method is second-order accurate, but not
transformation respecting. The bias-corrected and accelerated method is both
transformation respecting and second-order accurate (Efron and Tibshirani, 1993, p.188).
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation, rather than rely on a single point-estimate of the mean. Since confidence intervals and regions depend on the properties of the estimators for both the mean and standard deviation, the results of studies that simply evaluated the performance of the mean and standard deviation separately cannot be readily extrapolated to predict the performance of various methods of constructing confidence intervals and regions. Furthermore, for several of the methods that have been proposed to estimate the mean based on type I left-censored data, standard errors of the estimates are not available, hence it is not possible to construct confidence intervals (El-Shaarawi and Dolan, 1989).
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring. See, for example, Singh et al. (2006).
Steven P. Millard ([email protected])
Barker, C. (2009). The Mean, Median, and Confidence Intervals of the Kaplan-Meier Survival Estimate – Computations and Applications. The American Statistician 63(1), 78–80.
Beal, D. (2010). A Macro for Calculating Summary Statistics on Left Censored Environmental Data Using the Kaplan-Meier Method. Paper SDA-09, presented at Southeast SAS Users Group 2010, September 26-28, Savannah, GA. https://analytics.ncsu.edu/sesug/2010/SDA09.Beal.pdf.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
El-Shaarawi, A.H., and D.M. Dolan. (1989). Maximum Likelihood Estimation of Water Quality Concentrations from Censored Data. Canadian Journal of Fisheries and Aquatic Sciences 46, 1033–1039.
Frome E.L., and D.P. Frome (2015). STAND: Statistical Analysis of Non-Detects. R package version 2.0, https://CRAN.R-project.org/package=STAND.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Irwin, J.O. (1949). The Standard Error of an Estimate of Expectation of Life, with Special Reference to Expectation of Tumourless Life in Experiments with Mice. Journal of Hygiene 47, 188–189.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Klein, J.P., and M.L. Moeschberger. (2003). Survival Analysis: Techniques for Censored and Truncated Data, Second Edition. Springer, New York, 537pp.
Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.
Meier, P., T. Karrison, R. Chappell, and H. Xie. (2004). The Price of Kaplan-Meier. Journal of the American Statistical Association 99(467), 890–896.
Miller, R.G. (1981). Survival Analysis. John Wiley and Sons, New York.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2022). ProUCL Version 5.2.0 Technical Guide: Statistical Software for Environmental Applications for Data Sets with and without Nondetect Observations. Prepared by: Neptune and Company, Inc., 1435 Garrison Street, Suite 201, Lakewood, CO 80215. pp. 128–129, 143. https://www.epa.gov/land-research/proucl-software.
ppointsCensored
, ecdfPlotCensored
,
qqPlotCensored
,estimateCensored.object
,
enpar
.
# Using the lead concentration data from soil samples shown in # Beal (2010), compute the Kaplan-Meier estimators of the mean, # standard deviation, and standard error of the mean, as well as # a 95% upper confidence limit for the mean. Compare these # results to those given in Beal (2010), and also to the results # produced by ProUCL 5.2.0. # First look at the data: #----------------------- head(Beal.2010.Pb.df) # Pb.char Pb Censored #1 <1 1.0 TRUE #2 <1 1.0 TRUE #3 2 2.0 FALSE #4 2.5 2.5 FALSE #5 2.8 2.8 FALSE #6 <3 3.0 TRUE tail(Beal.2010.Pb.df) # Pb.char Pb Censored #24 <10 10 TRUE #25 10 10 FALSE #26 15 15 FALSE #27 49 49 FALSE #28 200 200 FALSE #29 9060 9060 FALSE # enparCensored Results: #----------------------- Beal.unrestricted <- with(Beal.2010.Pb.df, enparCensored(x = Pb, censored = Censored, ci = TRUE, ci.type = "upper")) Beal.unrestricted #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: None # #Censoring Side: left # #Censoring Level(s): 1 3 4 6 9 10 # #Estimated Parameter(s): mean = 325.3396 # sd = 1651.0950 # se.mean = 315.0023 # #Estimation Method: Kaplan-Meier # (Bias-corrected se.mean) # #Data: Pb # #Censoring Variable: Censored # #Sample Size: 29 # #Percent Censored: 34.48276% # #Confidence Interval for: mean # #Assumed Sample Size: 29 # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.0000 # UCL = 861.1996 c(Beal.unrestricted$parameters, Beal.unrestricted$interval$limits) # mean sd se.mean LCL UCL # 325.3396 1651.0950 315.0023 0.0000 861.1996 # Beal (2010) published results: #------------------------------- # Mean Std. Dev. SE of Mean # 325.34 1651.09 315.00 # ProUCL 5.2.0 results: #---------------------- # Mean Std. Dev. SE of Mean 95% UCL # 325.2 1651 315 861.1 #---------- # Now compute the restricted mean and associated quantities, # and compare these results with those produced by the # kmms() function in the STAND package. 
#----------------------------------------------------------- Beal.restricted <- with(Beal.2010.Pb.df, enparCensored(x = Pb, censored = Censored, restricted = TRUE, ci = TRUE, ci.type = "upper")) Beal.restricted #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: None # #Censoring Side: left # #Censoring Level(s): 1 3 4 6 9 10 # #Estimated Parameter(s): mean = 325.2011 # sd = 1651.1221 # se.mean = 314.1774 # #Estimation Method: Kaplan-Meier (Restricted Mean) # Smallest censored value(s) # set to Censoring Level # (Bias-corrected se.mean) # #Data: Pb # #Censoring Variable: Censored # #Sample Size: 29 # #Percent Censored: 34.48276% # #Confidence Interval for: mean # #Assumed Sample Size: 29 # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.000 # UCL = 859.658 c(Beal.restricted$parameters, Beal.restricted$interval$limits) # mean sd se.mean LCL UCL # 325.2011 1651.1221 314.1774 0.0000 859.6580 # kmms() results: #---------------- # KM.mean KM.LCL KM.UCL KM.se gamma # 325.2011 -221.0419 871.4440 315.0075 0.9500 # NOTE: as pointed out above, the kmms() function treats the # smallest censored observations (<1 and <1) as NOT # censored when computing the mean and uncorrected # standard error of the mean, but assumes these # observations ARE censored when computing the corrected # standard error of the mean. #-------------------------------------------------------------- Beal.restricted$parameters["se.mean"] * sqrt((20/21)) * sqrt((19/18)) # se.mean # 315.0075 #========== # Repeat the above example, estimating the unrestricted mean and # computing an upper confidence limit based on the bootstrap # instead of on the normal approximation with a t pivot statistic. # Compare results to those from ProUCL 5.2.0. # Note: Setting the seed argument lets you reproduce this example. #------------------------------------------------------------------ Beal.unrestricted.boot <- with(Beal.2010.Pb.df, enparCensored(x = Pb, censored = Censored, ci = TRUE, ci.type = "upper", ci.method = "bootstrap", seed = 923)) Beal.unrestricted.boot #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: None # #Censoring Side: left # #Censoring Level(s): 1 3 4 6 9 10 # #Estimated Parameter(s): mean = 325.3396 # sd = 1651.0950 # se.mean = 315.0023 # #Estimation Method: Kaplan-Meier # (Bias-corrected se.mean) # #Data: Pb # #Censoring Variable: Censored # #Sample Size: 29 # #Percent Censored: 34.48276% # #Confidence Interval for: mean # #Assumed Sample Size: 29 # #Confidence Interval Method: Bootstrap # #Number of Bootstraps: 1000 # #Number of Bootstrap Samples #With No Censored Values: 0 # #Number of Times Bootstrap #Repeated Because Too Few #Uncensored Observations: 0 # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: Pct.LCL = 0.0000 # Pct.UCL = 948.7342 # BCa.LCL = 0.0000 # BCa.UCL = 942.6596 # t.LCL = 0.0000 # t.UCL = 62121.8909 c(Beal.unrestricted.boot$interval$limits) # Pct.LCL Pct.UCL BCa.LCL BCa.UCL t.LCL t.UCL # 0.0000 948.7342 0.0000 942.6596 0.0000 62121.8909 # ProUCL 5.2.0 results: #---------------------- # Pct.LCL Pct.UCL BCa.LCL BCa.UCL t.LCL t.UCL # 0.0000 944.3 0.0000 947.8 0.0000 62169 #========== # Clean up #--------- rm(Beal.unrestricted, Beal.restricted, Beal.unrestricted.boot)
Daily measurements of ozone concentration, wind speed, temperature, and solar radiation in New York City for 153 consecutive days between May 1 and September 30, 1973.
Environmental.df Air.df
The data frame Environmental.df
has 153 observations on the following 4 variables.
ozone
Average ozone concentration (of hourly measurements) in parts per billion.
radiation
Solar radiation (from 08:00 to 12:00) in langleys.
temperature
Maximum daily temperature in degrees Fahrenheit.
wind
Average wind speed (at 07:00 and 10:00) in miles per hour.
Row names are the dates the data were collected.
The data frame Air.df
is the same as Environmental.df
except that the
column ozone
is the cube root of average ozone concentration.
Data on ozone (ppb), solar radiation (langleys), temperature (degrees Fahrenheit), and wind speed (mph)
for 153 consecutive days between May 1 and September 30, 1973. These data are a superset of the data
contained in the data frame environmental
in the package lattice.
Chambers et al. (1983), pp. 347-349.
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, 395pp.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
Cleveland, W.S. (1994). The Elements of Graphing Data. Revised Edition. Hobart Press, Summit, New Jersey, 297pp.
# Scatterplot matrix pairs(Environmental.df) pairs(Air.df) # Time series plot for ozone attach(Environmental.df) dates <- as.Date(row.names(Environmental.df), format = "%m/%d/%Y") plot(dates, ozone, type = "l", xlab = "Time (Year = 1973)", ylab = "Ozone (ppb)", main = "Time Series Plot of Daily Ozone Measures") detach("Environmental.df") rm(dates)
Concentrations (µg/L) from an exposure unit.
data(EPA.02d.Ex.2.ug.per.L.vec)
a numeric vector of concentrations (µg/L)
USEPA. (2002d). Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites. OSWER 9285.6-10, December 2002. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., p. 9.
Concentrations (mg/kg) from an exposure unit.
data(EPA.02d.Ex.4.mg.per.kg.vec)
a numeric vector of concentrations (mg/kg)
USEPA. (2002d). Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites. OSWER 9285.6-10, December 2002. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., p. 11.
Concentrations (mg/kg) from an exposure unit.
data(EPA.02d.Ex.6.mg.per.kg.vec)
a numeric vector of concentrations (mg/kg)
USEPA. (2002d). Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites. OSWER 9285.6-10, December 2002. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., p. 13.
Concentrations (mg/L) from an exposure unit.
data(EPA.02d.Ex.9.mg.per.L.vec)
a numeric vector of concentrations (mg/L)
USEPA. (2002d). Calculating Upper Confidence Limits for Exposure Point Concentrations at Hazardous Waste Sites. OSWER 9285.6-10, December 2002. Office of Emergency and Remedial Response, U.S. Environmental Protection Agency, Washington, D.C., p. 16.
Nickel concentrations (ppb) from four wells (five observations per year for each well). The Guidance Document uses the label “Year” instead of “Well”; this was corrected in the Errata.
EPA.09.Ex.10.1.nickel.df
A data frame with 20 observations on the following 3 variables.
Month
a numeric vector indicating the month the sample was taken
Well
a factor indicating the well number
Nickel.ppb
a numeric vector of nickel concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery, Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C., p.10-12.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Arsenic concentrations (ppb) at six wells (four observations per well).
EPA.09.Ex.11.1.arsenic.df
A data frame with 24 observations on the following 3 variables.
Arsenic.ppb
a numeric vector of arsenic concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.11-3.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Carbon tetrachloride (CCL4) concentrations (ppb) at five background wells (four measures at each well).
EPA.09.Ex.12.1.ccl4.df
A data frame with 20 observations on the following 2 variables.
Well
a factor indicating the well number
CCL4.ppb
a numeric vector of CCL4 concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.12-3.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Naphthalene concentrations (ppb) at five background wells (five quarterly measures at each well).
EPA.09.Ex.12.4.naphthalene.df
A data frame with 25 observations on the following 3 variables.
Quarter
a numeric vector indicating the quarter the sample was taken
Well
a factor indicating the well number
Naphthalene.ppb
a numeric vector of naphthalene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.12-12.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Dissolved iron (Fe) concentrations (ppm) at six upgradient wells (four quarterly measures at each well).
EPA.09.Ex.13.1.iron.df
A data frame with 24 observations on the following 4 variables.
Month
a numeric vector indicating the month the sample was taken
Year
a numeric vector indicating the year the sample was taken
Well
a factor indicating the well number
Iron.ppm
a numeric vector of iron concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.13-3.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Manganese concentrations (ppm) at four background wells (eight quarterly measures at each well).
EPA.09.Ex.14.1.manganese.df
A data frame with 32 observations on the following 3 variables.
Quarter
a numeric vector indicating the quarter the sample was taken
Well
a factor indicating the well number
Manganese.ppm
a numeric vector of manganese concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.14-5.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Alkalinity measures (mg/L) collected from leachate at a solid waste landfill during a four and a half year period.
EPA.09.Ex.14.3.alkalinity.df
A data frame with 54 observations on the following 2 variables.
Date
a Date object indicating the date of collection
Alkalinity.mg.per.L
a numeric vector of alkalinity measures (mg/L)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.14-14.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Sixteen quarterly measures of arsenic concentrations (ppb).
EPA.09.Ex.14.4.arsenic.df
A data frame with 16 observations on the following 4 variables.
Sample.Date
a factor indicating the month and year of collection
Month
a factor indicating the month of collection
Year
a factor indicating the year of collection
Arsenic.ppb
a numeric vector of arsenic concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.14-18.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Monthly unadjusted and adjusted analyte concentrations over a 3-year period. Adjusted concentrations are computed by subtracting the monthly mean and adding the overall mean.
EPA.09.Ex.14.8.df
A data frame with 36 observations on the following 4 variables.
Month
a factor indicating the month of collection
Year
a numeric vector indicating the year of collection
Unadj.Conc
a numeric vector of unadjusted concentrations
Adj.Conc
a numeric vector of adjusted concentrations
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.14-32.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Manganese concentrations (ppb) at five background wells (five measures at each well).
EPA.09.Ex.15.1.manganese.df
EPA.09.Ex.15.1.manganese.df
A data frame with 25 observations on the following 5 variables.
Sample
a numeric vector indicating the sample number (1-5)
Well
a factor indicating the well number
Manganese.Orig.ppb
a character vector of the original manganese concentrations (ppb)
Manganese.ppb
a numeric vector of manganese concentrations with non-detects coded to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.15-10.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
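A minimal sketch (not part of the guidance document) of summarizing the extent of censoring in this data set, assuming the EnvStats package is attached:

library(EnvStats)
df <- EPA.09.Ex.15.1.manganese.df

# Overall proportion of nondetects.
mean(df$Censored)

# Censoring status broken down by well.
with(df, table(Well, Censored))

# Estimation that accounts for the nondetects could then be carried out with,
# for example, one of the EnvStats *Censored functions, such as
# elnormCensored(df$Manganese.ppb, df$Censored).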
Sulfate concentrations (ppm) at one background well and one downgradient well (eight quarterly measures at each well).
EPA.09.Ex.16.1.sulfate.df
EPA.09.Ex.16.1.sulfate.df
A data frame with 16 observations on the following 4 variables.
Month
a factor indicating the month of collection
Year
a factor indicating the year of collection
Well.type
a factor indicating the well type (background vs. downgradient)
Sulfate.ppm
a numeric vector of sulfate concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.16-6.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Benzene concentrations (ppb) at one background and one downgradient well (eight monthly measures at each well).
EPA.09.Ex.16.2.benzene.df
EPA.09.Ex.16.2.benzene.df
A data frame with 16 observations on the following 3 variables.
Month
a factor indicating the month of collection
Well.type
a factor indicating the well type (background vs. downgradient)
Benzene.ppb
a numeric vector of benzene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.16-9.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
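As an illustrative (and not authoritative) two-sample comparison of the downgradient and background wells, assuming the EnvStats package is attached:

library(EnvStats)
df <- EPA.09.Ex.16.2.benzene.df

# Nonparametric comparison of the two well types.
wilcox.test(Benzene.ppb ~ Well.type, data = df)

# Side-by-side summaries.
with(df, tapply(Benzene.ppb, Well.type, summary))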
Copper concentrations (ppb) at two background wells and one compliance well (six measures at each well).
EPA.09.Ex.16.4.copper.df
EPA.09.Ex.16.4.copper.df
A data frame with 18 observations on the following 4 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Copper.ppb
a numeric vector of copper concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.16-19.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Tetrachloroethylene (PCE) concentrations (ppb) at one background well and one compliance well.
EPA.09.Ex.16.5.PCE.df
EPA.09.Ex.16.5.PCE.df
A data frame with 14 observations on the following 4 variables.
Well.type
a factor with levels Background and Compliance
PCE.Orig.ppb
a character vector of original PCE concentrations (ppb)
PCE.ppb
a numeric vector of PCE concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.16-22.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Log-transformed lead concentrations (ppb) at two background and four compliance wells (four quarterly measures at each well).
EPA.09.Ex.17.1.loglead.df
EPA.09.Ex.17.1.loglead.df
A data frame with 24 observations on the following 4 variables.
Month
a factor indicating the month of collection; 1 = Jan, 2 = Apr, 3 = Jul, 4 = Oct
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
LogLead
a numeric vector of log-transformed lead concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-7.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Toluene concentrations (ppb) at two background and three compliance wells (five monthly measures at each well).
EPA.09.Ex.17.2.toluene.df
EPA.09.Ex.17.2.toluene.df
A data frame with 25 observations on the following 6 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Toluene.ppb.orig
a character vector of original toluene concentrations (ppb)
Toluene.ppb
a numeric vector of toluene concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-13.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Chrysene concentrations (ppb) at two background and three compliance wells (four monthly measures at each well).
EPA.09.Ex.17.3.chrysene.df
EPA.09.Ex.17.3.chrysene.df
A data frame with 20 observations on the following 4 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Chrysene.ppb
a numeric vector of chrysene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-17.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Log-transformed chrysene concentrations (ppb) at two background and three compliance wells (four monthly measures at each well).
EPA.09.Ex.17.3.log.chrysene.df
EPA.09.Ex.17.3.log.chrysene.df
A data frame with 20 observations on the following 4 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Log.Chrysene.ppb
a numeric vector of log-transformed chrysene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-18.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
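A purely illustrative way to use the raw and log-transformed chrysene data sets together is to compare a one-way analysis of variance across wells on both scales; this sketch assumes the EnvStats package is attached.

library(EnvStats)

# ANOVA across wells on the original scale ...
summary(aov(Chrysene.ppb ~ Well, data = EPA.09.Ex.17.3.chrysene.df))

# ... and on the log scale, where the equal-variance assumption is
# usually more plausible for concentration data.
summary(aov(Log.Chrysene.ppb ~ Well, data = EPA.09.Ex.17.3.log.chrysene.df))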
Copper concentrations (ppb) at three background and two compliance wells (eight monthly measures at the background wells, four monthly measures at the compliance wells).
EPA.09.Ex.17.4.copper.df
EPA.09.Ex.17.4.copper.df
A data frame with 40 observations on the following 6 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Copper.ppb.orig
a character vector of original copper concentrations (ppb)
Copper.ppb
a numeric vector of copper concentrations with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-21.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Chloride concentrations (ppm) collected over a five-year period at a solid waste landfill.
EPA.09.Ex.17.5.chloride.df
EPA.09.Ex.17.5.chloride.df
A data frame with 19 observations on the following 4 variables.
Date
a Date object indicating the date of collection
Chloride.ppm
a numeric vector of chloride concentrations (ppm)
Elapsed.Days
a numeric vector indicating the number of days since January 1, 2002
Residuals
a numeric vector of residuals from a linear regression trend fit
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-26.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
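Because the Residuals column is described as coming from a linear regression trend fit, it can be reproduced (up to rounding) with a simple lm() fit. The sketch below is illustrative and assumes the EnvStats package is attached.

library(EnvStats)
df <- EPA.09.Ex.17.5.chloride.df

# Fit a linear trend of chloride on elapsed time.
fit <- lm(Chloride.ppm ~ Elapsed.Days, data = df)

# Compare freshly computed residuals with the Residuals column
# (small differences may arise from rounding in the source table).
head(cbind(residuals(fit), df$Residuals))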
Sulfate concentrations (ppm) collected over several years.
The date of collection is simply indicated by month and year of collection.
The column Date
is a Date object where the day of the month has been arbitrarily set to 1.
EPA.09.Ex.17.6.sulfate.df
EPA.09.Ex.17.6.sulfate.df
A data frame with 23 observations on the following 6 variables.
Sample.No
a numeric vector indicating the sample number
Year
a numeric vector indicating the year of collection
Month
a numeric vector indicating the month of collection
Sampling.Date
a numeric vector indicating the year and month of collection
Date
a Date object indicating the date of collection, where the day of the month is arbitrarily set to 1
Sulfate.ppm
a numeric vector of sulfate concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-33.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Sodium concentrations (ppm) collected over several years. The sample dates are recorded as the year of collection (2-digit format) plus a fractional part indicating when during the year the sample was collected.
EPA.09.Ex.17.7.sodium.df
EPA.09.Ex.17.7.sodium.df
A data frame with 10 observations on the following 2 variables.
Year
a numeric vector indicating the year of collection (a fractional number)
Sodium.ppm
a numeric vector of sodium concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-36.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Arsenic concentrations (ppb) in a single well at a solid waste landfill. Four observations per year over four years. Years 1-3 are the background period and Year 4 is the compliance period.
EPA.09.Ex.18.1.arsenic.df
EPA.09.Ex.18.1.arsenic.df
A data frame with 16 observations on the following 3 variables.
Year
a factor indicating the year of collection
Sampling.Period
a factor indicating the sampling period (background vs. compliance)
Arsenic.ppb
a numeric vector of arsenic concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.18-10.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
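A hedged sketch of how the background-period observations might be used to build an upper prediction limit against which the compliance-period values are compared. The factor level names ("Background", "Compliance") and the use of the EnvStats function predIntNorm with these particular settings are assumptions of this sketch, not a restatement of the guidance document's calculation.

library(EnvStats)
df <- EPA.09.Ex.18.1.arsenic.df

bkgd <- df$Arsenic.ppb[df$Sampling.Period == "Background"]  # assumed level name
comp <- df$Arsenic.ppb[df$Sampling.Period == "Compliance"]  # assumed level name

# Upper 95% prediction limit for the next 4 observations,
# based on the background data.
predIntNorm(bkgd, k = 4, pi.type = "upper", conf.level = 0.95)

# Compliance-period values to compare against the limit.
comp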
Chrysene concentrations (ppb) at two background wells and one compliance well (four monthly measures at each well).
EPA.09.Ex.18.2.chrysene.df
EPA.09.Ex.18.2.chrysene.df
A data frame with 12 observations on the following 4 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Chrysene.ppb
a numeric vector of chrysene concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.18-15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Trichloroethylene (TCE) concentrations (ppb) at three background wells and one compliance well. Six monthly measures at each background well, three monthly measures at the compliance well.
EPA.09.Ex.18.3.TCE.df
EPA.09.Ex.18.3.TCE.df
A data frame with 24 observations on the following 6 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
TCE.ppb.orig
a character vector of original TCE concentrations (ppb)
TCE.ppb
a numeric vector of TCE concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.18-19.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Xylene concentrations (ppb) at three background wells and one compliance well. Eight monthly measures at each background well; three monthly measures at the compliance well.
EPA.09.Ex.18.4.xylene.df
EPA.09.Ex.18.4.xylene.df
A data frame with 32 observations on the following 6 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Xylene.ppb.orig
a character vector of original xylene concentrations (ppb)
Xylene.ppb
a numeric vector of xylene concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.18-22.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Sulfate concentrations (mg/L) at four background wells.
EPA.09.Ex.19.1.sulfate.df
EPA.09.Ex.19.1.sulfate.df
A data frame with 25 observations on the following 7 variables.
Well
a factor indicating the well number
Month
a numeric vector indicating the month of collection
Day
a numeric vector indicating the day of the month of collection
Year
a numeric vector indicating the year of collection
Date
a Date object indicating the date of collection
Sulfate.mg.per.l
a numeric vector of sulfate concentrations (mg/L)
log.Sulfate.mg.per.l
a numeric vector of log-transformed sulfate concentrations (mg/L)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.19-17.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Chloride concentrations (mg/L) at 10 compliance wells at a solid waste landfill. One year of quarterly measures at each well.
EPA.09.Ex.19.2.chloride.df
EPA.09.Ex.19.2.chloride.df
A data frame with 40 observations on the following 2 variables.
Well
a factor indicating the well number
Chloride.mg.per.l
a numeric vector of chloride concentrations (mg/L)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.19-19.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Mercury concentrations (ppb) at four background and two compliance wells.
EPA.09.Ex.19.5.mercury.df
EPA.09.Ex.19.5.mercury.df
A data frame with 36 observations on the following 6 variables.
Event
a factor indicating the time of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
Mercury.ppb.orig
a character vector of original mercury concentrations (ppb)
Mercury.ppb
a numeric vector of mercury concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.19-33.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Nickel concentrations (ppb) at a single well. Eight monthly measures during the background period and eight monthly measures during the compliance period.
EPA.09.Ex.20.1.nickel.df
EPA.09.Ex.20.1.nickel.df
A data frame with 16 observations on the following 4 variables.
Month
a factor indicating the month of collection
Year
a factor indicating the year of collection
Period
a factor indicating the period (baseline vs. compliance)
Nickel.ppb
a numeric vector of nickel concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.20-4.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
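An illustrative sketch (not the control-chart procedure of the guidance document) that compares the compliance-period nickel values against a crude limit derived from the baseline period. The factor level names ("Baseline", "Compliance") are assumed; the EnvStats package is assumed to be attached.

library(EnvStats)
df <- EPA.09.Ex.20.1.nickel.df

base <- df$Nickel.ppb[df$Period == "Baseline"]    # assumed level name
comp <- df$Nickel.ppb[df$Period == "Compliance"]  # assumed level name

# A crude Shewhart-style upper limit from the baseline data.
upper.limit <- mean(base) + 3 * sd(base)

# Flag any compliance observations exceeding the limit.
data.frame(Nickel.ppb = comp, Exceeds = comp > upper.limit)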
Aldicarb concentrations (ppb) at three compliance wells (four monthly measures at each well).
EPA.09.Ex.21.1.aldicarb.df
EPA.09.Ex.21.1.aldicarb.df
A data frame with 12 observations on the following 3 variables.
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Aldicarb.ppb
a numeric vector of aldicarb concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-4.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Benzene concentrations (ppb) collected at a landfill that previously handled smelter waste and is now undergoing remediation efforts.
EPA.09.Ex.21.2.benzene.df
EPA.09.Ex.21.2.benzene.df
A data frame with 8 observations on the following 4 variables.
Month
a numeric vector indicating the month of collection
Benzene.ppb.orig
a character vector of original benzene concentrations (ppb)
Benzene.ppb
a numeric vector of benzene concentrations (ppb) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-7.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Beryllium concentrations (ppb) at one well (four years of quarterly measures).
data(EPA.09.Ex.21.5.beryllium.df)
data(EPA.09.Ex.21.5.beryllium.df)
A data frame with 16 observations on the following 3 variables.
Year
a factor indicating the year of collection
Quarter
a factor indicating the quarter of collection
Beryllium.ppb
a numeric vector of beryllium concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-18.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Nitrate concentrations (mg/L) at a well used for drinking water.
EPA.09.Ex.21.6.nitrate.df
EPA.09.Ex.21.6.nitrate.df
A data frame with 12 observations on the following 5 variables.
Sampling.Date
a character vector indicating the sampling date
Date
a Date object indicating the sampling date
Nitrate.mg.per.l.orig
a character vector of original nitrate concentrations (mg/L)
Nitrate.mg.per.l
a numeric vector of nitrate concentrations (mg/L) with nondetects set to their detection limit
Censored
a logical vector indicating which observations are censored
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-22.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Trichloroethylene (TCE) concentrations (ppb) at a site undergoing remediation.
EPA.09.Ex.21.7.TCE.df
EPA.09.Ex.21.7.TCE.df
A data frame with 10 observations on the following 2 variables.
Month
a numeric vector indicating the month of collection
TCE.ppb
a numeric vector of TCE concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-26.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
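A minimal, illustrative sketch of checking whether TCE is declining over the course of remediation, assuming the EnvStats package is attached:

library(EnvStats)
df <- EPA.09.Ex.21.7.TCE.df

# Plot the concentrations against month of collection.
plot(df$Month, df$TCE.ppb, type = "b",
  xlab = "Month", ylab = "TCE (ppb)")

# One-sided Kendall test for a decreasing trend.
cor.test(df$Month, df$TCE.ppb, method = "kendall", alternative = "less")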
Vinyl Chloride (VC) concentrations (ppb) during detection monitoring for two compliance wells. Four years of quarterly measures at each well. Compliance monitoring began with Year 2 of the sampling record.
EPA.09.Ex.22.1.VC.df
EPA.09.Ex.22.1.VC.df
A data frame with 32 observations on the following 5 variables.
Year
a factor indicating the year of collection
Quarter
a factor indicating the quarter of collection
Period
a factor indicating the period (background vs. compliance)
Well
a factor indicating the well number
VC.ppb
a numeric vector of VC concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.22-6.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Specific conductance (µmho) collected over several years at two wells at a hazardous waste facility.
EPA.09.Ex.22.2.Specific.Conductance.df
EPA.09.Ex.22.2.Specific.Conductance.df
A data frame with 43 observations on the following 3 variables.
Well
a factor indicating the well number
Date
a Date object indicating the date of collection
Specific.Conductance.umho
a numeric vector of specific conductance (µmho)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.22-11.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Sulfate concentrations (ppm) at two background wells (five quarterly measures at each well).
EPA.09.Ex.6.3.sulfate.df
EPA.09.Ex.6.3.sulfate.df
A data frame with 10 observations on the following 4 variables.
Month
a numeric vector indicating the month the observation was taken
Year
a numeric vector indicating the year the observation was taken
Well
a factor indicating the well number
Sulfate.ppm
a numeric vector of sulfate concentrations (ppm)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-20.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Arsenic concentrations (µg/L) at a single well, consisting of:
8 historical observations,
4 future observations for Case 1, and
4 future observations for Case 2.
EPA.09.Ex.7.1.arsenic.df
EPA.09.Ex.7.1.arsenic.df
A data frame with 16 observations on the following 2 variables.
Data.Source
a factor with levels Historical, Case.1, and Case.2
Arsenic.ug.per.l
a numeric vector of arsenic concentrations (µg/L)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.7-26.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Time series of trichloroethene (TCE) concentrations (mg/L) taken at 2 separate
wells. Some observations are annotated with a data qualifier of U
(nondetect)
or J
(estimated detected concentration).
EPA.09.Table.9.1.TCE.df
EPA.09.Table.9.1.TCE.df
A data frame with 30 observations on the following 5 variables.
Date.Collected
a factor indicating the date of collection
Date
a Date object indicating the date of collection
Well
a factor indicating the well number
TCE.mg.per.L
a numeric vector of TCE concentrations (mg/L)
Data.Qualifier
a factor indicating the data qualifier
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.9-3.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Arsenic, mercury, and strontium concentrations (mg/L) from a single well
collected approximately quarterly. Nondetects are indicated by the
data qualifier U
.
EPA.09.Table.9.3.df
EPA.09.Table.9.3.df
A data frame with 15 observations on the following 8 variables.
Date.Collected
a factor indicating the date of collection
Date
a Date object indicating the date of collection
Arsenic.mg.per.L
a numeric vector of arsenic concentrations (mg/L)
Arsenic.Data.Qualifier
a factor indicating the data qualifier for arsenic
Mercury.mg.per.L
a numeric vector of mercury concentrations (mg/L)
Mercury.Data.Qualifier
a factor indicating the data qualifier for mercury
Strontium.mg.per.L
a numeric vector of strontium concentrations (mg/L)
Strontium.Data.Qualifier
a factor indicating the data qualifier for strontium
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.9-13.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Nickel concentrations (ppb) from a single well.
EPA.09.Table.9.4.nickel.vec
EPA.09.Table.9.4.nickel.vec
a numeric vector of nickel concentrations (ppb)
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.9-18.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Aldicarb concentrations (ppb) at three compliance wells (four monthly samples at each well).
EPA.89b.aldicarb1.df
EPA.89b.aldicarb1.df
A data frame with 12 observations on the following 3 variables.
Aldicarb
Aldicarb concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.6-4.
Aldicarb concentrations (ppm) at three compliance wells (four monthly samples at each well).
EPA.89b.aldicarb2.df
EPA.89b.aldicarb2.df
A data frame with 12 observations on the following 3 variables.
Aldicarb
Aldicarb concentrations (ppm)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.6-13.
Benzene concentrations (ppm) at one background and five compliance wells (four monthly samples for each well).
EPA.89b.benzene.df
EPA.89b.benzene.df
A data frame with 24 observations on the following 6 variables.
Benzene.orig
a character vector of the original observations
Benzene
a numeric vector with <1
observations coded as 1
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.5-18.
Cadmium concentrations (mg/L) at one set of background and one set of compliance wells. Nondetects reported as "BDL". Detection limit not given.
EPA.89b.cadmium.df
EPA.89b.cadmium.df
A data frame with 88 observations on the following 4 variables.
Cadmium.orig
a character vector of the original cadmium observations (mg/L)
Cadmium
a numeric vector with BDL
coded as 0
Censored
a logical vector indicating which observations are censored
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.8-6.
Chlordane concentrations (ppm) in 24 water samples. Two possible phases: dissolved (18 observations) and immiscible (6 observations).
EPA.89b.chlordane1.df
EPA.89b.chlordane1.df
A data frame with 24 observations on the following 2 variables.
Chlordane
Chlordane concentrations (ppm)
Phase
a factor indicating the phase (dissolved vs. immiscible)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.4-8.
Chlordane concentrations (ppb) at one background and one compliance well. Observations taken during four separate months over two years. Four replicates taken for each “month/year/well type” combination.
data(EPA.89b.chlordane2.df)
data(EPA.89b.chlordane2.df)
A data frame with 32 observations on the following 5 variables.
Chlordane
Chlordane concentration (ppb)
Month
a factor indicating the month of collection
Year
a numeric vector indicating the year of collection (85 or 86)
Replicate
a factor indicating the replicate number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.5-27.
EDB concentrations (ppb) at three compliance wells (four monthly samples at each well).
EPA.89b.edb.df
EPA.89b.edb.df
A data frame with 12 observations on the following 3 variables.
EDB
EDB concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.6-6.
Lead concentrations (ppm) at two background and four compliance wells (four monthly samples for each well).
EPA.89b.lead.df
EPA.89b.lead.df
A data frame with 24 observations on the following 4 variables.
Lead
Lead concentrations (ppm)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.5-23.
Log-transformed lead concentrations (µg/L) at two background and four
compliance wells (four monthly samples for each well).
EPA.89b.loglead.df
EPA.89b.loglead.df
A data frame with 24 observations on the following 4 variables.
LogLead
Natural logarithm of lead concentrations (µg/L)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.5-11.
Manganese concentrations at six monitoring wells (four monthly samples for each well).
EPA.89b.manganese.df
EPA.89b.manganese.df
A data frame with 24 observations on the following 3 variables.
Manganese
Manganese concentrations
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.4-19.
Sulfate concentrations (mg/L). Nondetects reported as <1450
.
data(EPA.89b.sulfate.df)
data(EPA.89b.sulfate.df)
A data frame with 24 observations on the following 3 variables.
Sulfate.orig
a character vector of original sulfate concentration (mg/L)
Sulfate
a numeric vector of sulfate concentrations with <1450
coded as 1450
Censored
a logical vector indicating which observations are censored
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.8-9.
T-29 concentrations (ppm) at two compliance wells (four monthly samples at each well, four replicates within each month). Detection limit is not given.
EPA.89b.t29.df
EPA.89b.t29.df
A data frame with 32 observations on the following 6 variables.
T29.orig
a character vector of the original T-29 concentrations (ppm)
T29
a numeric vector of T-29 concentrations with <?
coded as 0
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Replicate
a factor indicating the replicate number
Well
a factor indicating the well number
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.6-10.
Numeric vector containing total organic carbon (TOC) concentrations (mg/L).
EPA.89b.toc.vec
EPA.89b.toc.vec
A numeric vector with 19 elements containing TOC concentrations (mg/L).
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.8-13.
Arsenic concentrations (ppm) at six monitoring wells (four monthly samples for each well).
EPA.92c.arsenic1.df
EPA.92c.arsenic1.df
A data frame with 24 observations on the following 3 variables.
Arsenic
Arsenic concentrations (ppm)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.21.
Arsenic concentrations (ppb) at three background wells and one compliance well
(six monthly samples for each well; first four missing at compliance well). Nondetects
reported as <5
.
EPA.92c.arsenic2.df
EPA.92c.arsenic2.df
A data frame with 24 observations on the following 6 variables.
Arsenic.orig
a character vector of original arsenic concentrations (ppb)
Arsenic
a numeric vector of arsenic concentrations with <5
coded as 5
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.60.
Arsenic concentrations at one background and one compliance monitoring well. Three years of observations for the background well, two years of observations for the compliance well, four samples per year for each well.
EPA.92c.arsenic3.df
EPA.92c.arsenic3.df
A data frame with 20 observations on the following 3 variables.
Arsenic
a numeric vector of arsenic concentrations
Year
a factor indicating the year of collection
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
Benzene concentrations (ppb) at six background wells
(six monthly samples for each well). Nondetects reported as <2
.
EPA.92c.benzene1.df
EPA.92c.benzene1.df
A data frame with 36 observations on the following 5 variables.
Benzene.orig
a character vector of original benzene concentrations (ppb)
Benzene
a numeric vector of benzene concentrations with <2
coded as 2
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.36.
Benzene concentrations (ppb) at one background and one compliance well. Four observations per month for each well. Background well sampled in months 1, 2, and 3; compliance well sampled in months 4 and 5.
EPA.92c.benzene2.df
EPA.92c.benzene2.df
A data frame with 20 observations on the following 3 variables.
Benzene
a numeric vector of benzene concentrations (ppb)
Month
a factor indicating the month of collection
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.56.
Carbon tetrachloride (CCL4) concentrations (ppb) at five wells (four monthly samples at each well).
EPA.92c.ccl4.df
EPA.92c.ccl4.df
A data frame with 20 observations on the following 3 variables.
CCL4
a numeric vector of carbon tetrachloride concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.80.
Chrysene concentrations (ppb) at five compliance wells (four monthly samples for each well).
EPA.92c.chrysene.df
EPA.92c.chrysene.df
A data frame with 20 observations on the following 3 variables.
Chrysene
a numeric vector of chrysene concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.52.
Copper concentrations (ppb) at two background wells and one compliance well (six monthly samples for each well).
EPA.92c.copper1.df
EPA.92c.copper1.df
A data frame with 18 observations on the following 4 variables.
Copper
a numeric vector of copper concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.47.
Copper concentrations (ppb) at three background and two compliance wells
(eight monthly samples for each well; first four missing at compliance wells).
Nondetects reported as <5
.
EPA.92c.copper2.df
EPA.92c.copper2.df
A data frame with 40 observations on the following 6 variables.
Copper.orig
a character vector of original copper concentrations (ppb)
Copper
a numeric vector of copper concentrations with <5
coded as 5
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.55.
Log-transformed nickel concentrations (ppb) at four monitoring wells (five monthly samples for each well).
EPA.92c.lognickel1.df
EPA.92c.lognickel1.df
A data frame with 20 observations on the following 3 variables.
LogNickel
a numeric vector of log-transformed nickel concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.15.
Nickel concentrations (ppb) at four monitoring wells (five monthly samples for each well).
EPA.92c.nickel1.df
EPA.92c.nickel1.df
A data frame with 20 observations on the following 3 variables.
Nickel
a numeric vector of nickel concentrations (ppb)
Month
a factor indicating the month of collection
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.7.
Nickel concentrations (ppb) at a monitoring well (eight months of samples, two samples for each sampling occasion).
EPA.92c.nickel2.df
EPA.92c.nickel2.df
A data frame with 16 observations on the following 3 variables.
Nickel
a numeric vector of nickel concentrations (ppb)
Month
a factor indicating the month of collection
Sample
a factor indicating the sample (replicate) number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.78.
Toluene concentrations (ppb) at two background and three compliance wells
(five monthly samples at each well). Nondetects reported as <5
.
EPA.92c.toluene.df
EPA.92c.toluene.df
A data frame with 25 observations on the following 6 variables.
Toluene.orig
a character vector of original toluene concentrations (ppb)
Toluene
a numeric vector of toluene concentrations with <5
coded as 5
Censored
a logical vector indicating which observations are censored
Month
a factor indicating the month of collection
Well
a factor indicating the well number
Well.type
a factor indicating the well type (background vs. compliance)
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.43.
Zinc concentrations (ppb) at five background wells
(eight samples for each well). Nondetects reported as <7
.
EPA.92c.zinc.df
EPA.92c.zinc.df
A data frame with 40 observations on the following 5 variables.
Zinc.orig
a character vector of original zinc concentrations (ppb)
Zinc
a numeric vector of zinc concentrations with <7
coded as 7
Censored
a logical vector indicating which observations are censored
Sample
a factor indicating the sample number
Well
a factor indicating the well number
USEPA. (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C. p.30.
Chromium concentrations (mg/kg) in soil samples collected randomly over a Superfund site.
EPA.92d.chromium.df
EPA.92d.chromium.df
A data frame with 15 observations on the following variable.
Cr
a numeric vector of chromium concentrations (mg/kg)
USEPA. (1992d). Supplemental Guidance to RAGS: Calculating the Concentration Term. Publication 9285.7-081, May 1992. Intermittent Bulletin, Volume 1, Number 1. Office of Emergency and Remedial Response, Hazardous Site Evaluation Division, OS-230. Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Chromium concentrations (mg/kg) in soil samples collected randomly over a Superfund site.
EPA.92d.chromium.vec
EPA.92d.chromium.vec
A numeric vector with 15 observations.
USEPA. (1992d). Supplemental Guidance to RAGS: Calculating the Concentration Term. Publication 9285.7-081, May 1992. Intermittent Bulletin, Volume 1, Number 1. Office of Emergency and Remedial Response, Hazardous Site Evaluation Division, OS-230. Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Lead concentrations (mg/Kg) in soil samples at a reference area and a
cleanup area. Nondetects reported as <39
. There are 14 observations
for each area.
EPA.94b.lead.df
EPA.94b.lead.df
A data frame with 28 observations on the following 4 variables.
Lead.orig
a character vector of original lead concentrations (mg/Kg)
Lead
a numeric vector of lead concentrations with <39
coded as 39
Censored
a logical vector indicating which observations are censored
Area
a factor indicating the area (cleanup vs. reference)
USEPA. (1994b). Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media. EPA/230-R-94-004. Office of Policy, Planning, and Evaluation, U.S. Environmental Protection Agency, Washington, D.C. pp.6.20–6.21.
1,2,3,4-Tetrachlorobenzene (TcCB) concentrations (ppb) in soil samples at a
reference area and a cleanup area. There are 47 observations for the reference area
and 77 for the cleanup area. There is only one nondetect in the dataset (it's in the
cleanup area), and it is reported as ND
. Here it is assumed the nondetect is
less than the smallest reported value, which is 0.09 ppb. Note that on page 6.23 of
USEPA (1994b), a value of 25.5 for the Cleanup Unit was erroneously omitted.
EPA.94b.tccb.df
EPA.94b.tccb.df
A data frame with 124 observations on the following 4 variables.
TcCB.orig
a character vector with the original tetrachlorobenzene concentrations (ppb)
TcCB
a numeric vector of TcCB concentrations with the nondetect coded as 0.09 (the smallest reported value)
Censored
a logical vector indicating which observations are censored
Area
a factor indicating the area (cleanup vs. reference)
USEPA. (1994b). Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media. EPA/230-R-94-004. Office of Policy, Planning, and Evaluation, U.S. Environmental Protection Agency, Washington, D.C. pp.6.22-6.25.
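As a quick, non-authoritative illustration of comparing the cleanup and reference areas in this data set (assuming the EnvStats package is attached), the concentrations are typically examined on the log scale:

library(EnvStats)
df <- EPA.94b.tccb.df

# Summary statistics by area.
with(df, tapply(TcCB, Area, summary))

# Boxplots of log-transformed concentrations for the two areas.
boxplot(log(TcCB) ~ Area, data = df, ylab = "log(TcCB) (log ppb)")

# A simple nonparametric comparison of the two areas.
wilcox.test(TcCB ~ Area, data = df)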
Calibration data for cadmium at mass 111 (ng/L; method 1638 ICPMS) that appeared in Gibbons et al. (1997b) and were provided to them by the U.S. EPA.
EPA.97.cadmium.111.df
EPA.97.cadmium.111.df
A data frame with 35 observations on the following 2 variables.
Cadmium
Observed concentration of cadmium (ng/L)
Spike
“True” concentration of cadmium taken from a standard (ng/L)
Gibbons, R.D., D.E. Coleman, and R.F. Maddalone. (1997b). Response to Comment on "An Alternative Minimum Level Definition for Analytical Quantification". Environmental Science and Technology, 31(12), 3729–3731.
Estimate the location and shape parameters of a Pareto distribution.
epareto(x, method = "mle", plot.pos.con = 0.375)
epareto(x, method = "mle", plot.pos.con = 0.375)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Possible values are
"mle" (maximum likelihood; the default) and "lse" (least-squares). |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the values of the empirical cdf (used only when method="lse"). The default value is
plot.pos.con=0.375. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let be a vector of
observations from a Pareto distribution with
parameters
location=
and
shape=
.
Maximum Likelihood Estimatation (method="mle"
)
The maximum likelihood estimators (mle's) of and
are
given by (Evans et al., 1993; p.122; Johnson et al., 1994, p.581):
where denotes the first order statistic (i.e., the minimum value).
Least-Squares Estimation (method="lse")
The least-squares estimators (lse's) of \(\eta\) and \(\theta\) are derived as
follows. Let \(X\) denote a Pareto random variable with parameters
location=\(\eta\) and shape=\(\theta\). It can be shown that
\[\log[1 - F(x)] = \theta \log(\eta) - \theta \log(x)\]
where \(F\) denotes the cumulative distribution function of \(X\). Set
\[y_i = \log[1 - \hat{F}(x_i)], \qquad z_i = \log(x_i)\]
where \(\hat{F}(x_i)\) denotes the empirical cumulative distribution function
evaluated at \(x_i\). The least-squares estimates of \(\eta\) and \(\theta\)
are obtained by solving the regression equation
\[y_i = \beta_0 + \beta_1 z_i + \epsilon_i\]
and setting
\[\hat{\theta}_{lse} = -\hat{\beta}_1, \qquad \hat{\eta}_{lse} = \exp\!\left( \frac{\hat{\beta}_0}{\hat{\theta}_{lse}} \right)\]
(Johnson et al., 1994, p.580).
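As a rough illustration of the least-squares procedure described above, here is a minimal sketch in base R (not the internal code of epareto, and assuming the plotting-position formula (i - a)/(n + 1 - 2a) with the default constant a = 0.375 for the empirical cdf):

# Hand-rolled least-squares fit, for comparison with epareto(..., method = "lse")
set.seed(479)
x <- rpareto(25, location = 2, shape = 1.5)
n <- length(x)
a <- 0.375                                   # plotting position constant (assumed formula)
F.hat <- (rank(x) - a) / (n + 1 - 2 * a)     # empirical cdf at each observation
fit <- lm(log(1 - F.hat) ~ log(x))           # regression of y on z
theta.lse <- -coef(fit)[[2]]
eta.lse <- exp(coef(fit)[[1]] / theta.lse)
c(location = eta.lse, shape = theta.lse)
epareto(x, method = "lse", plot.pos.con = a)$parameters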
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The Pareto distribution is named after Vilfredo Pareto (1848-1923), a professor
of economics. It is derived from Pareto's law, which states that the number of
persons \(N\) having income at least \(x\) is given by:
\[N = A x^{-\theta}\]
where \(\theta\) denotes Pareto's constant and is the shape parameter for the
probability distribution.
The Pareto distribution takes values on the positive real line. All values must be
larger than the “location” parameter \(\eta\), which is really a threshold
parameter. There are three kinds of Pareto distributions. The one described here
is the Pareto distribution of the first kind. Stable Pareto distributions have
\(0 < \theta < 2\). Note that the \(r\)'th moment only exists if \(r < \theta\).
The Pareto distribution is related to the
exponential distribution and
logistic distribution as follows.
Let \(X\) denote a Pareto random variable with
location=\(\eta\) and shape=\(\theta\).
Then \(\log(X/\eta)\) has an exponential distribution
with parameter rate=\(\theta\), and
\(-\log\{[(X/\eta)^{\theta}] - 1\}\)
has a logistic distribution with parameters
location=\(0\) and scale=\(1\).
The Pareto distribution has a very long right-hand tail. It is often applied in the study of socioeconomic data, including the distribution of income, firm size, population, and stock price fluctuations.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
# Generate 30 observations from a Pareto distribution with parameters
# location=1 and shape=1 then estimate the parameters.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- rpareto(30, location = 1, shape = 1)
epareto(dat)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Pareto
#
#Estimated Parameter(s):          location = 1.009046
#                                 shape    = 1.079850
#
#Estimation Method:               mle
#
#Data:                            dat
#
#Sample Size:                     30

#----------

# Compare the results of using the least-squares estimators:

epareto(dat, method="lse")$parameters

#location    shape
#1.085924 1.144180

#----------

# Clean up
#---------
rm(dat)
Produces an empirical probability density function plot.
epdfPlot(x, discrete = FALSE, density.arg.list = NULL, plot.it = TRUE,
    add = FALSE, epdf.col = "black", epdf.lwd = 3 * par("cex"), epdf.lty = 1,
    curve.fill = FALSE, curve.fill.col = "cyan", ...,
    type = ifelse(discrete, "h", "l"),
    main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. Missing ( |
discrete |
logical scalar indicating whether the assumed parent distribution of |
density.arg.list |
list with arguments to the |
plot.it |
logical scalar indicating whether to produce a plot or add to the current plot (see |
add |
logical scalar indicating whether to add the empirical pdf to the current plot
( |
epdf.col |
a numeric scalar or character string determining the color of the empirical pdf
line or points. The default value is |
epdf.lwd |
a numeric scalar determining the width of the empirical pdf line.
The default value is |
epdf.lty |
a numeric scalar determining the line type of the empirical pdf line.
The default value is |
curve.fill |
a logical scalar indicating whether to fill in the area below the empirical pdf
curve with the
color specified by |
curve.fill.col |
a numeric scalar or character string indicating what color to use to fill in the
area below the empirical pdf curve. The default value is
|
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
When a distribution is discrete and can only take on a finite number of values,
the empirical pdf plot is the same as the standard relative frequency histogram;
that is, each bar of the histogram represents the proportion of the sample
equal to that particular number (or category). When a distribution is continuous,
the function epdfPlot
calls the R function density
to
compute the estimated probability density at a number of evenly spaced points
between the minimum and maximum values.
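The following minimal sketch (an illustration using base R functions, not the internals of epdfPlot) shows the two computations described above:

# Discrete case: the empirical pdf is just the relative frequency of each value
set.seed(123)
x.disc <- rpois(50, lambda = 4)
rel.freq <- table(x.disc) / length(x.disc)
rel.freq

# Continuous case: a kernel density estimate over the range of the data
x.cont <- rlnorm(50)
dens <- density(x.cont)
head(cbind(x = dens$x, f.x = dens$y))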
epdfPlot
invisibly returns a list with the following components:
x |
numeric vector of ordered quantiles. |
f.x |
numeric vector of the associated estimated values of the pdf. |
An empirical probability density function (epdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms and boxplots to assess the characteristics of a set of data.
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA.
See the REFERENCES section in the help file for density
.
Empirical, pdfPlot
, ecdfPlot
,
cdfPlot
, cdfCompare
, qqPlot
.
# Using Reference Area TcCB data in EPA.94b.tccb.df,
# create a histogram of the log-transformed observations,
# then superimpose the empirical pdf plot.

dev.new()
log.TcCB <- with(EPA.94b.tccb.df, log(TcCB[Area == "Reference"]))

hist(log.TcCB, freq = FALSE, xlim = c(-2, 1), col = "cyan",
    xlab = "log [ TcCB (ppb) ]", ylab = "Relative Frequency",
    main = "Reference Area TcCB with Empirical PDF")

epdfPlot(log.TcCB, add = TRUE)

#==========

# Generate 20 observations from a Poisson distribution with
# parameter lambda = 10, and plot the empirical PDF.

set.seed(875)
x <- rpois(20, lambda = 10)
dev.new()
epdfPlot(x, discrete = TRUE)

#==========

# Clean up
#---------
rm(log.TcCB, x)
graphics.off()
Estimate the mean of a Poisson distribution, and optionally construct a confidence interval for the mean.
epois(x, method = "mle/mme/mvue", ci = FALSE, ci.type = "two-sided", ci.method = "exact", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Currently the only possible
value is |
ci |
logical scalar indicating whether to compute a confidence interval for the
location or scale parameter. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the location or scale parameter. Possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let \(\underline{x} = (x_1, x_2, \ldots, x_n)\) be a vector of \(n\) observations from a Poisson distribution with
parameter lambda=\(\lambda\). It can be shown (e.g., Forbes et al., 2011)
that if \(y\) is defined as:
\[y = \sum_{i=1}^{n} x_i \;\;\;\; (1)\]
then \(y\) is an observation from a Poisson distribution with parameter
lambda=\(n\lambda\).

Estimation
The maximum likelihood, method of moments, and minimum variance unbiased estimator
(mle/mme/mvue) of \(\lambda\) is given by:
\[\hat{\lambda} = \bar{x} \;\;\;\; (2)\]
where
\[\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{y}{n} \;\;\;\; (3)\]
Confidence Intervals
There are three possible ways to construct a confidence interval for
\(\lambda\): based on the exact distribution of the estimator of \(\lambda\)
(ci.method="exact"), based on an approximation of
Pearson and Hartley (ci.method="pearson.hartley.approx"), or based on the
normal approximation (ci.method="normal.approx").
Exact Confidence Interval (ci.method="exact")
If ci.type="two-sided", an exact \((1-\alpha)100\%\) confidence interval
for \(\lambda\) can be constructed as \([LCL, UCL]\), where the confidence
limits are computed such that:
\[Pr[Y \ge y \,|\, \lambda = LCL] = \frac{\alpha}{2} \;\;\;\; (4)\]
\[Pr[Y \le y \,|\, \lambda = UCL] = \frac{\alpha}{2} \;\;\;\; (5)\]
where \(y\) is defined in equation (1) and \(Y\) denotes a Poisson random
variable with parameter lambda=\(n\lambda\).
If ci.type="lower", \(\alpha/2\) is replaced with \(\alpha\) in
equation (4) and \(UCL\) is set to \(\infty\).
If ci.type="upper", \(\alpha/2\) is replaced with \(\alpha\) in
equation (5) and \(LCL\) is set to 0.
Note that an exact upper confidence bound can be computed even when all
observations are 0.
Pearson-Hartley Approximation (ci.method="pearson.hartley.approx")
For a two-sided \((1-\alpha)100\%\) confidence interval for \(\lambda\), the
Pearson and Hartley approximation (Zar, 2010, p.587; Pearson and Hartley, 1970, p.81)
is given by:
\[\frac{\chi^2_{2y,\, \alpha/2}}{2n} \le \lambda \le \frac{\chi^2_{2(y+1),\, 1-\alpha/2}}{2n} \;\;\;\; (6)\]
where \(\chi^2_{\nu, p}\) denotes the \(p\)'th quantile of the
chi-square distribution with \(\nu\) degrees of freedom.
One-sided confidence intervals are computed in a similar fashion.
Normal Approximation (ci.method="normal.approx")
An approximate \((1-\alpha)100\%\) confidence interval for \(\lambda\) can be
constructed assuming the distribution of the estimator of \(\lambda\) is
approximately normally distributed. A two-sided confidence interval is constructed
as:
\[[\hat{\lambda} - z_{1-\alpha/2}\,\hat{\sigma}_{\hat{\lambda}}, \;\; \hat{\lambda} + z_{1-\alpha/2}\,\hat{\sigma}_{\hat{\lambda}}] \;\;\;\; (7)\]
where \(z_p\) is the \(p\)'th quantile of the standard normal distribution, and
the quantity
\[\hat{\sigma}_{\hat{\lambda}} = \sqrt{\hat{\lambda}/n} \;\;\;\; (8)\]
denotes the estimated asymptotic standard deviation of the estimator of \(\lambda\).
One-sided confidence intervals are constructed in a similar manner.
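The limits above can also be computed directly in base R. The sketch below illustrates the formulas (using the chi-square representation of the Poisson tail probabilities for the exact limits); it is an illustration of the equations, not a reproduction of the internal code of epois:

set.seed(250)
dat <- rpois(20, lambda = 2)
n <- length(dat)
y <- sum(dat)                       # equation (1)
lambda.hat <- y / n                 # equations (2)-(3)
alpha <- 0.10

# Exact-style limits via chi-square quantiles (cf. equation (6))
c(LCL = qchisq(alpha/2, 2 * y) / (2 * n),
  UCL = qchisq(1 - alpha/2, 2 * (y + 1)) / (2 * n))

# Normal-approximation limits (equations (7)-(8))
lambda.hat + c(-1, 1) * qnorm(1 - alpha/2) * sqrt(lambda.hat / n)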
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The Poisson distribution is named after Poisson, who
derived this distribution as the limiting distribution of the
binomial distribution with parameters size=\(n\) and
prob=\(p\), where \(n\) tends to infinity,
\(p\) tends to 0, and \(np = \lambda\) stays constant.
In this context, the Poisson distribution was used by Bortkiewicz (1898) to model
the number of deaths (per annum) from kicks by horses in Prussian Army Corps. In
this case, \(p\), the probability of death from this cause, was small, but the
number of soldiers exposed to this risk, \(n\), was large.
The Poisson distribution has been applied in a variety of fields, including quality control (modeling number of defects produced in a process), ecology (number of organisms per unit area), and queueing theory. Gibbons (1987b) used the Poisson distribution to model the number of detected compounds per scan of the 32 volatile organic priority pollutants (VOC), and also to model the distribution of chemical concentration (in ppb).
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gibbons, R.D. (1987b). Statistical Models for the Analysis of Volatile Organic Compounds in Waste Disposal Sites. Ground Water 25, 572-580.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 4.
Pearson, E.S., and H.O. Hartley, eds. (1970). Biometrika Tables for Statisticians, Volume 1. Cambridge University Press, New York, p.81.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, pp. 585–586.
# Generate 20 observations from a Poisson distribution with parameter
# lambda=2, then estimate the parameter and construct a 90% confidence
# interval.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- rpois(20, lambda = 2)
epois(dat, ci = TRUE, conf.level = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Poisson
#
#Estimated Parameter(s):          lambda = 1.8
#
#Estimation Method:               mle/mme/mvue
#
#Data:                            dat
#
#Sample Size:                     20
#
#Confidence Interval for:         lambda
#
#Confidence Interval Method:      exact
#
#Confidence Interval Type:        two-sided
#
#Confidence Level:                90%
#
#Confidence Interval:             LCL = 1.336558
#                                 UCL = 2.377037

#----------

# Compare the different ways of constructing confidence intervals for
# lambda using the same data as in the previous example:

epois(dat, ci = TRUE, ci.method = "pearson", conf.level = 0.9)$interval$limits
#     LCL      UCL
#1.336558 2.377037

epois(dat, ci = TRUE, ci.method = "normal.approx", conf.level = 0.9)$interval$limits
#     LCL      UCL
#1.306544 2.293456

#----------

# Clean up
#---------
rm(dat)
Estimate the mean of a Poisson distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for the mean.
epoisCensored(x, censored, method = "mle", censoring.side = "left",
    ci = FALSE, ci.method = "profile.likelihood", ci.type = "two-sided",
    conf.level = 0.95, n.bootstraps = 1000, pivot.statistic = "z",
    ci.sample.size = sum(!censored))
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
method |
character string specifying the method of estimation. The possible values are:
|
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean or variance. The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. The possible values are |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
n.bootstraps |
numeric scalar indicating how many bootstraps to use to construct the
confidence interval for the mean when |
pivot.statistic |
character string indicating which pivot statistic to use in the construction
of the confidence interval for the mean when |
ci.sample.size |
numeric scalar indicating what sample size to assume to construct the
confidence interval for the mean if |
If x
or censored
contain any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let \(\underline{x} = (x_1, x_2, \ldots, x_N)\) denote a vector of \(N\)
observations from a Poisson distribution with mean lambda=\(\lambda\).
Assume \(n\) (\(0 < n < N\)) of these observations are known and
\(c\) (\(c = N - n\)) of these observations are
all censored below (left-censored) or all censored above (right-censored) at
\(k\) fixed censoring levels
\[T_1, T_2, \ldots, T_k; \;\; k \ge 1 \;\;\;\; (1)\]
For the case when \(k \ge 2\), the data are said to be Type I
multiply censored. For the case when \(k = 1\),
set \(T = T_1\). If the data are left-censored
and all \(n\) known observations are greater
than or equal to \(T\), or if the data are right-censored and all \(n\)
known observations are less than or equal to \(T\), then the data are
said to be Type I singly censored (Nelson, 1982, p.7), otherwise
they are considered to be Type I multiply censored.
Let \(c_j\) denote the number of observations censored below or above censoring
level \(T_j\) for \(j = 1, 2, \ldots, k\), so that
\[\sum_{j=1}^{k} c_j = c \;\;\;\; (2)\]
Let \(x_{(1)}, x_{(2)}, \ldots, x_{(N)}\) denote the “ordered” observations,
where now “observation” means either the actual observation (for uncensored
observations) or the censoring level (for censored observations). For
right-censored data, if a censored observation has the same value as an
uncensored one, the uncensored observation should be placed first.
For left-censored data, if a censored observation has the same value as an
uncensored one, the censored observation should be placed first.
Note that in this case the quantity \(x_{(i)}\) does not necessarily represent
the \(i\)'th “largest” observation from the (unknown) complete sample.
Finally, let \(\Omega\) (omega) denote the set of \(n\) subscripts in the
“ordered” sample that correspond to uncensored observations.
Estimation
Maximum Likelihood Estimation (method="mle")
For Type I left censored data, the likelihood function is given by:
\[L(\lambda \,|\, \underline{x}) = \frac{N!}{c_1! c_2! \cdots c_k! \, n!} \prod_{j=1}^{k} [F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (3)\]
where \(f\) and \(F\) denote the probability density function (pdf) and
cumulative distribution function (cdf) of the population
(Cohen, 1963; Cohen, 1991, pp.6, 50). That is,
\[f(t) = \frac{e^{-\lambda} \lambda^{t}}{t!} \;\;\;\; (4)\]
\[F(t) = \sum_{j=0}^{t} f(j) \;\;\;\; (5)\]
(Johnson et al., 1992, p.151). For left singly censored data, equation (3) simplifies to:
\[L(\lambda \,|\, \underline{x}) = {N \choose c} [F(T)]^{c} \prod_{i=1}^{n} f[x_{(i)}] \;\;\;\; (6)\]
Similarly, for Type I right censored data, the likelihood function is given by:
\[L(\lambda \,|\, \underline{x}) = \frac{N!}{c_1! c_2! \cdots c_k! \, n!} \prod_{j=1}^{k} [1 - F(T_j)]^{c_j} \prod_{i \in \Omega} f[x_{(i)}] \;\;\;\; (7)\]
and for right singly censored data this simplifies to:
\[L(\lambda \,|\, \underline{x}) = {N \choose c} [1 - F(T)]^{c} \prod_{i=1}^{n} f[x_{(i)}] \;\;\;\; (8)\]
The maximum likelihood estimators are computed by maximizing the likelihood function.
For right-censored data, taking the derivative of the log-likelihood function
with respect to \(\lambda\) and setting this to 0 produces the following equation:
\[\frac{\bar{x}}{\lambda} - 1 + \frac{1}{n} \sum_{j=1}^{k} c_j \frac{f(T_j)}{1 - F(T_j)} = 0 \;\;\;\; (9)\]
where
\[\bar{x} = \frac{1}{n} \sum_{i \in \Omega} x_{(i)} \;\;\;\; (10)\]
Note that the quantity defined in equation (10) is simply the mean of the uncensored observations.
For left-censored data, taking the derivative of the log-likelihood function with
respect to \(\lambda\) and setting this to 0 produces the following equation:
\[\frac{\bar{x}}{\lambda} - 1 - \frac{1}{n} \sum_{j=1}^{k} c_j \frac{f(T_j)}{F(T_j)} = 0 \;\;\;\; (11)\]
The function epoisCensored computes the maximum likelihood estimator
of \(\lambda\) by solving Equation (9) (right-censored data) or
Equation (11) (left-censored data); it uses the sample mean of
the uncensored observations as the initial value.
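A minimal sketch of the idea, maximizing a left-censored Poisson log-likelihood directly (this assumes a censored value contributes ppois(T, lambda), i.e., that it is known only to be at or below its censoring level; the internal details of epoisCensored may differ):

loglik <- function(lambda, x, censored) {
  # uncensored values contribute the pdf, censored values the cdf at their level
  sum(dpois(x[!censored], lambda, log = TRUE)) +
    sum(ppois(x[censored], lambda, log.p = TRUE))
}
set.seed(300)
x <- rpois(30, lambda = 10)
censored <- x < 8
x[censored] <- 8                     # censoring level
optimize(loglik, interval = c(0.01, 50), x = x, censored = censored,
    maximum = TRUE)$maximum          # compare with epoisCensored(x, censored)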
Setting Censored Observations to Half the Censoring Level (method="half.cen.level"
)
This method is applicable only to left censored data.
This method involves simply replacing all the censored observations with half their
detection limit, and then computing the mean and standard deviation with the usual
formulas (see epois
).
This method is included only to allow comparison of this method to other methods.
Setting left-censored observations to half the censoring level is not
recommended.
Confidence Intervals
This section explains how confidence intervals for the mean \(\lambda\) are
computed.

Likelihood Profile (ci.method="profile.likelihood")
This method was proposed by Cox (1970, p.88), and Venzon and Moolgavkar (1988)
introduced an efficient method of computation. This method is also discussed by
Stryhn and Christensen (2003) and Royston (2007).
The idea behind this method is to invert the likelihood-ratio test to obtain a
confidence interval for the mean \(\lambda\). Equation (3) above
shows the form of the likelihood function \(L(\lambda \,|\, \underline{x})\) for
multiply left-censored data, and Equation (7) shows the function for
multiply right-censored data.
Following Stryhn and Christensen (2003), denote the maximum likelihood estimate
of the mean by \(\lambda^{*}\). The likelihood
ratio test statistic (\(G^2\)) of the hypothesis
\(H_0: \lambda = \lambda_0\) (where \(\lambda_0\) is a fixed value) equals the drop in
\(2 \log(L)\) between the “full” model and the reduced model with
\(\lambda\) fixed at \(\lambda_0\), i.e.,
\[G^2 = 2 \{\log[L(\lambda^{*})] - \log[L(\lambda_0)]\}\]
Under the null hypothesis, the test statistic \(G^2\) follows a
chi-squared distribution with 1 degree of freedom.
A two-sided \((1-\alpha)100\%\) confidence interval for the mean \(\lambda\)
consists of all values of \(\lambda_0\) for which the test is not significant at
level \(\alpha\):
\[\lambda_0 : G^2 \le \chi^2_{1,\, 1-\alpha} \;\;\;\; (12)\]
where \(\chi^2_{\nu, p}\) denotes the \(p\)'th quantile of the
chi-squared distribution with \(\nu\) degrees of freedom.
One-sided lower and one-sided upper confidence intervals are computed in a similar
fashion, except that the quantity \(1-\alpha\) in Equation (12) is replaced with
\(1-2\alpha\).
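To make the inversion concrete, here is a minimal sketch for the simple uncensored case (an illustration of equation (12) only, not the code used by epoisCensored):

profile.ci <- function(x, conf.level = 0.95) {
  loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
  lambda.hat <- mean(x)
  crit <- qchisq(conf.level, df = 1)
  # G^2 minus the chi-square critical value; roots give the interval endpoints
  g2.minus.crit <- function(lambda) 2 * (loglik(lambda.hat) - loglik(lambda)) - crit
  LCL <- uniroot(g2.minus.crit, c(1e-8, lambda.hat))$root
  UCL <- uniroot(g2.minus.crit, c(lambda.hat, lambda.hat + 10 * sqrt(lambda.hat) + 1))$root
  c(LCL = LCL, UCL = UCL)
}
set.seed(42)
profile.ci(rpois(25, lambda = 3))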
Normal Approximation (ci.method="normal.approx")
This method constructs approximate \((1-\alpha)100\%\) confidence intervals for
\(\lambda\) based on the assumption that the estimator of \(\lambda\) is
approximately normally distributed. That is, a two-sided \((1-\alpha)100\%\)
confidence interval for \(\lambda\) is constructed as:
\[[\hat{\lambda} - t_{1-\alpha/2,\, m-1}\,\hat{\sigma}_{\hat{\lambda}}, \;\; \hat{\lambda} + t_{1-\alpha/2,\, m-1}\,\hat{\sigma}_{\hat{\lambda}}] \;\;\;\; (13)\]
where \(\hat{\lambda}\) denotes the estimate of \(\lambda\),
\(\hat{\sigma}_{\hat{\lambda}}\) denotes the estimated asymptotic standard
deviation of the estimator of \(\lambda\), \(m\) denotes the assumed sample
size for the confidence interval, and \(t_{p,\nu}\) denotes the \(p\)'th
quantile of Student's t-distribution with \(\nu\)
degrees of freedom. One-sided confidence intervals are computed in a
similar fashion.
The argument ci.sample.size determines the value of \(m\) and by
default is equal to the number of uncensored observations.
This is simply an ad-hoc method of constructing
confidence intervals and is not based on any published theoretical results.
When pivot.statistic="z", the \(p\)'th quantile from the
standard normal distribution is used in place of the
\(p\)'th quantile from Student's t-distribution.
When \(\lambda\) is estimated with the maximum likelihood estimator
(method="mle"), the variance of \(\hat{\lambda}\) is
estimated based on the inverse of the Fisher Information matrix. When
\(\lambda\) is estimated using the half-censoring-level method
(method="half.cen.level"), the variance of \(\hat{\lambda}\) is
estimated as:
\[\hat{\sigma}^2_{\hat{\lambda}} = \frac{\hat{\lambda}}{m} \;\;\;\; (14)\]
where \(m\) denotes the assumed sample size (see above).
Bootstrap and Bias-Corrected Bootstrap Approximation (ci.method="bootstrap"
)
The bootstrap is a nonparametric method of estimating the distribution
(and associated distribution parameters and quantiles) of a sample statistic,
regardless of the distribution of the population from which the sample was drawn.
The bootstrap was introduced by Efron (1979) and a general reference is
Efron and Tibshirani (1993).
In the context of deriving an approximate \((1-\alpha)100\%\) confidence interval
for the population mean \(\lambda\), the bootstrap can be broken down into the
following steps:

1. Create a bootstrap sample by taking a random sample of size \(N\) from
the observations in \(\underline{x}\), where sampling is done with
replacement. Note that because sampling is done with replacement, the same
element of \(\underline{x}\) can appear more than once in the bootstrap
sample. Thus, the bootstrap sample will usually not look exactly like the
original sample (e.g., the number of censored observations in the bootstrap
sample will often differ from the number of censored observations in the
original sample).

2. Estimate \(\lambda\) based on the bootstrap sample created in Step 1, using
the same method that was used to estimate \(\lambda\) using the original
observations in \(\underline{x}\). Because the bootstrap sample usually
does not match the original sample, the estimate of \(\lambda\) based on the
bootstrap sample will usually differ from the original estimate based on
\(\underline{x}\).

3. Repeat Steps 1 and 2 \(B\) times, where \(B\) is some large number.
For the function epoisCensored, the number of bootstraps \(B\) is
determined by the argument n.bootstraps (see the section ARGUMENTS above).
The default value of n.bootstraps is 1000.

4. Use the \(B\) estimated values of \(\lambda\) to compute the empirical
cumulative distribution function of this estimator of \(\lambda\) (see
ecdfPlot), and then create a confidence interval for \(\lambda\)
based on this estimated cdf.
The two-sided percentile interval (Efron and Tibshirani, 1993, p.170) is computed as:
\[[\hat{G}^{-1}(\alpha/2), \;\; \hat{G}^{-1}(1-\alpha/2)] \;\;\;\; (15)\]
where \(\hat{G}(t)\) denotes the empirical cdf evaluated at \(t\) and thus
\(\hat{G}^{-1}(p)\) denotes the \(p\)'th empirical quantile, that is,
the \(p\)'th quantile associated with the empirical cdf. Similarly, a one-sided lower
confidence interval is computed as:
\[[\hat{G}^{-1}(\alpha), \;\; \infty] \;\;\;\; (16)\]
and a one-sided upper confidence interval is computed as:
\[[0, \;\; \hat{G}^{-1}(1-\alpha)] \;\;\;\; (17)\]
The function epoisCensored calls the R function quantile
to compute the empirical quantiles used in Equations (15)-(17).
The percentile method bootstrap confidence interval is only first-order
accurate (Efron and Tibshirani, 1993, pp.187-188), meaning that the probability
that the confidence interval will contain the true value of \(\lambda\) can be
off by \(k/\sqrt{N}\), where \(k\) is some constant. Efron and Tibshirani
(1993, pp.184-188) proposed a bias-corrected and accelerated interval that is
second-order accurate, meaning that the probability that the confidence interval
will contain the true value of \(\lambda\) may be off by \(k/N\)
instead of \(k/\sqrt{N}\). The two-sided bias-corrected and accelerated confidence interval is
computed as:
\[[\hat{G}^{-1}(\alpha_1), \;\; \hat{G}^{-1}(\alpha_2)] \;\;\;\; (18)\]
where
\[\alpha_1 = \Phi\!\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right] \;\;\;\; (19)\]
\[\alpha_2 = \Phi\!\left[\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right] \;\;\;\; (20)\]
\[\hat{z}_0 = \Phi^{-1}[\hat{G}(\hat{\lambda})] \;\;\;\; (21)\]
\[\hat{a} = \frac{\sum_{i=1}^{N} (\hat{\lambda}_{(\cdot)} - \hat{\lambda}_{(i)})^3}{6 \left[ \sum_{i=1}^{N} (\hat{\lambda}_{(\cdot)} - \hat{\lambda}_{(i)})^2 \right]^{3/2}} \;\;\;\; (22)\]
where the quantity \(\hat{\lambda}_{(i)}\) denotes the estimate of \(\lambda\) using
all the values in \(\underline{x}\) except the \(i\)'th one, and
\[\hat{\lambda}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\lambda}_{(i)} \;\;\;\; (23)\]
A one-sided lower confidence interval is given by:
\[[\hat{G}^{-1}(\alpha_1), \;\; \infty] \;\;\;\; (24)\]
and a one-sided upper confidence interval is given by:
\[[0, \;\; \hat{G}^{-1}(\alpha_2)] \;\;\;\; (25)\]
where \(\alpha_1\) and \(\alpha_2\) are computed as for a two-sided confidence
interval, except \(\alpha/2\) is replaced with \(\alpha\) in Equations (19) and (20).
The constant \(\hat{z}_0\) incorporates the bias correction, and the constant
\(\hat{a}\) is the acceleration constant. The term “acceleration” refers
to the rate of change of the standard error of the estimate of \(\lambda\) with
respect to the true value of \(\lambda\) (Efron and Tibshirani, 1993, p.186). For a
normal (Gaussian) distribution, the standard error of the estimate of \(\lambda\)
does not depend on the value of \(\lambda\), hence the acceleration constant is not
really necessary.
When ci.method="bootstrap", the function epoisCensored computes both
the percentile method and bias-corrected and accelerated method bootstrap confidence
intervals.
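A bare-bones percentile-bootstrap sketch for left-censored Poisson data is shown below (illustrative only; it reuses the simple likelihood maximization sketched earlier, assumes a censored value contributes ppois(T, lambda), and does not include the bias-correction and acceleration adjustments):

boot.percentile.ci <- function(x, censored, B = 500, conf.level = 0.95) {
  est <- function(xx, cc) {
    loglik <- function(lambda) sum(dpois(xx[!cc], lambda, log = TRUE)) +
      sum(ppois(xx[cc], lambda, log.p = TRUE))
    optimize(loglik, interval = c(0.01, 10 * max(xx)), maximum = TRUE)$maximum
  }
  n <- length(x)
  boot.est <- replicate(B, {
    i <- sample(n, replace = TRUE)      # resample with replacement (Step 1)
    est(x[i], censored[i])              # re-estimate lambda (Step 2)
  })
  alpha <- 1 - conf.level
  quantile(boot.est, c(alpha/2, 1 - alpha/2))   # percentile interval, cf. equation (15)
}
set.seed(300)
x <- rpois(30, lambda = 10); cens <- x < 8; x[cens] <- 8
boot.percentile.ci(x, cens)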
a list of class "estimateCensored"
containing the estimated parameters
and other information. See estimateCensored.object
for details.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation. Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators for parameters of a normal or lognormal distribution based on censored environmental data.
In practice, it is better to use a confidence interval for the mean or a joint confidence region for the mean and standard deviation (or coefficient of variation), rather than rely on a single point-estimate of the mean. Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and coefficient of variation of a Poisson distribution when data are subjected to single or multiple censoring.
Steven P. Millard ([email protected])
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, 312pp.
Cox, D.R. (1970). Analysis of Binary Data. Chapman & Hall, London. 142pp.
Efron, B. (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics 7, 1–26.
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, 436pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, Chapter 4.
Millard, S.P., P. Dixon, and N.K. Neerchal. (2014; in preparation). Environmental Statistics with R. CRC Press, Boca Raton, Florida.
Nelson, W. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, 634pp.
Royston, P. (2007). Profile Likelihood for Estimation and Confidence Intervals. The Stata Journal 7(3), pp. 376–387.
Stryhn, H., and J. Christensen. (2003). Confidence Intervals by the Profile Likelihood Method, with Applications in Veterinary Epidemiology. Contributed paper at ISVEE X (November 2003, Chile). https://gilvanguedes.com/wp-content/uploads/2019/05/Profile-Likelihood-CI.pdf.
Venzon, D.J., and S.H. Moolgavkar. (1988). A Method for Computing Profile-Likelihood-Based Confidence Intervals. Journal of the Royal Statistical Society, Series C (Applied Statistics) 37(1), pp. 87–94.
Poisson, epois
, estimateCensored.object
.
# Generate 20 observations from a Poisson distribution with
# parameter lambda=10, and censor the values less than 10.
# Then generate 20 more observations from the same distribution
# and censor the values less than 20. Then estimate the mean
# using the maximum likelihood method.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(300)

dat.1 <- rpois(20, lambda=10)
censored.1 <- dat.1 < 10
dat.1[censored.1] <- 10

dat.2 <- rpois(20, lambda=10)
censored.2 <- dat.2 < 20
dat.2[censored.2] <- 20

dat <- c(dat.1, dat.2)
censored <- c(censored.1, censored.2)

epoisCensored(dat, censored, ci = TRUE)

#Results of Distribution Parameter Estimation
#Based on Type I Censored Data
#--------------------------------------------
#
#Assumed Distribution:            Poisson
#
#Censoring Side:                  left
#
#Censoring Level(s):              10 20
#
#Estimated Parameter(s):          lambda = 11.05402
#
#Estimation Method:               MLE
#
#Data:                            dat
#
#Censoring Variable:              censored
#
#Sample Size:                     40
#
#Percent Censored:                65%
#
#Confidence Interval for:         lambda
#
#Confidence Interval Method:      Profile Likelihood
#
#Confidence Interval Type:        two-sided
#
#Confidence Level:                95%
#
#Confidence Interval:             LCL =  9.842894
#                                 UCL = 12.846484

#----------

# Clean up
#---------
rm(dat.1, censored.1, dat.2, censored.2, dat, censored)
Estimate quantiles of a beta distribution.
eqbeta(x, p = 0.5, method = "mle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a beta distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the shape and scale
parameters of the distribution. The possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqbeta
returns estimated quantiles as well as
estimates of the shape1 and shape2 parameters.
Quantiles are estimated by 1) estimating the shape1 and shape2 parameters by
calling ebeta
, and then 2) calling the function
qbeta
and using the estimated values for
shape1 and shape2.
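For illustration, the two-step computation can be carried out by hand (a minimal sketch, assuming the default maximum likelihood fit; eqbeta itself also packages the result into an "estimate" object):

set.seed(250)
dat <- rbeta(20, shape1 = 2, shape2 = 4)
fit <- ebeta(dat, method = "mle")            # step 1: estimate shape1 and shape2
qbeta(0.9, shape1 = fit$parameters["shape1"],
    shape2 = fit$parameters["shape2"])       # step 2: plug the estimates into qbeta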
If x
is a numeric vector, eqbeta
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqbeta
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The beta distribution takes real values between 0 and 1. Special cases of the
beta are the Uniform[0,1] when shape1=1
and
shape2=1
, and the arcsin distribution when shape1=0.5
and shape2=0.5
. The arcsin distribution appears in the theory of random walks.
The beta distribution is used in Bayesian analyses as a conjugate to the binomial
distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
# Generate 20 observations from a beta distribution with parameters
# shape1=2 and shape2=4, then estimate the parameters via
# maximum likelihood and estimate the 90'th percentile.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- rbeta(20, shape1 = 2, shape2 = 4)
eqbeta(dat, p = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Beta
#
#Estimated Parameter(s):          shape1 =  5.392221
#                                 shape2 = 11.823233
#
#Estimation Method:               mle
#
#Estimated Quantile(s):           90'th %ile = 0.4592796
#
#Quantile Estimation Method:      Quantile(s) Based on
#                                 mle Estimators
#
#Data:                            dat
#
#Sample Size:                     20

#----------

# Clean up
rm(dat)
Estimate quantiles of a binomial distribution.
eqbinom(x, size = NULL, p = 0.5, method = "mle/mme/mvue", digits = 0)
x |
numeric or logical vector of observations, or an object resulting from a call to an
estimating function that assumes a binomial distribution
(e.g., |
size |
positive integer indicating the number of trials; |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimation. The only possible value is
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqbinom
returns estimated quantiles as well as
estimates of the prob
parameter.
Quantiles are estimated by 1) estimating the prob parameter by
calling ebinom
, and then 2) calling the function
qbinom
and using the estimated value for
prob
.
If x
is a numeric vector, eqbinom
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqbinom
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The binomial distribution is used to model processes with binary (Yes-No, Success-Failure,
Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent
of any other trial, and that the probability of “success”, \(p\), is the same on
each trial. A binomial discrete random variable \(X\) is the number of “successes” in
\(n\) independent trials. A special case of the binomial distribution occurs when \(n = 1\),
in which case \(X\) is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model
the proportion of times a chemical concentration exceeds a set standard in a given period of
time (e.g., Gilbert, 1987, p.143). The binomial distribution is also used to compute an upper
bound on the overall Type I error rate for deciding whether a facility or location is in
compliance with some set standard. Assume the null hypothesis is that the facility is in compliance.
If a test of hypothesis is conducted periodically over time to test compliance and/or several tests
are performed during each time period, and the facility or location is always in compliance, and
each single test has a Type I error rate of \(\alpha\), and the result of each test is
independent of the result of any other test (usually not a reasonable assumption), then the number
of times the facility is declared out of compliance when in fact it is in compliance is a
binomial random variable with probability of “success” \(p = \alpha\) being the
probability of being declared out of compliance (see USEPA, 2009).
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 3.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.6-38.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
ebinom
, Binomial
,
estimate.object
.
# Generate 20 observations from a binomial distribution with
# parameters size=1 and prob=0.2, then estimate the 'prob'
# parameter and the 90'th percentile.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(251)
dat <- rbinom(20, size = 1, prob = 0.2)
eqbinom(dat, p = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Binomial
#
#Estimated Parameter(s):          size = 20.0
#                                 prob =  0.1
#
#Estimation Method:               mle/mme/mvue for 'prob'
#
#Estimated Quantile(s):           90'th %ile = 4
#
#Quantile Estimation Method:      Quantile(s) Based on
#                                 mle/mme/mvue for 'prob' Estimators
#
#Data:                            dat
#
#Sample Size:                     20

#----------

# Clean up
rm(dat)
Estimate quantiles of an extreme value distribution.
eqevd(x, p = 0.5, method = "mle", pwme.method = "unbiased",
    plot.pos.cons = c(a = 0.35, b = 0), digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes an extreme value distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the location and scale
parameters. Possible values are
|
pwme.method |
character string specifying what method to use to compute the
probability-weighted moments when |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqevd
returns estimated quantiles as well as
estimates of the location and scale parameters.
Quantiles are estimated by 1) estimating the location and scale parameters by
calling eevd
, and then 2) calling the function
qevd
and using the estimated values for
location and scale.
If x
is a numeric vector, eqevd
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqevd
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
There are three families of extreme value distributions. The one
described here is the Type I, also called the Gumbel extreme value
distribution or simply Gumbel distribution. The name
“extreme value” comes from the fact that this distribution is
the limiting distribution (as \(n\) approaches infinity) of the
greatest value among \(n\) independent random variables each
having the same continuous distribution.
The Gumbel extreme value distribution is related to the
exponential distribution as follows.
Let \(X\) be an exponential random variable
with parameter rate=\(1\). Then \(Y = \alpha - \beta \log(X)\)
has an extreme value distribution with parameters
location=\(\alpha\) and scale=\(\beta\).
The distribution described above and assumed by eevd is the
largest extreme value distribution. The smallest extreme value
distribution is the limiting distribution (as \(n\) approaches infinity)
of the smallest value among \(n\)
independent random variables each having the same continuous distribution.
If \(X\) has a largest extreme value distribution with parameters
location=\(\alpha\) and scale=\(\beta\), then \(Y = -X\)
has a smallest extreme value distribution with parameters
location=\(-\alpha\) and scale=\(\beta\). The smallest
extreme value distribution is related to the Weibull distribution
as follows. Let \(X\) be a Weibull random variable with
parameters shape=\(\beta\) and scale=\(\alpha\). Then \(Y = \log(X)\)
has a smallest extreme value distribution with parameters
location=\(\log(\alpha)\) and scale=\(1/\beta\).
The extreme value distribution has been used extensively to model the distribution of streamflow, flooding, rainfall, temperature, wind speed, and other meteorological variables, as well as material strength and life data.
Steven P. Millard ([email protected])
Castillo, E. (1988). Extreme Value Theory in Engineering. Academic Press, New York, pp.184–198.
Downton, F. (1966). Linear Estimates of Parameters in the Extreme Value Distribution. Technometrics 8(1), 3–17.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Tiago de Oliveira, J. (1963). Decision Results for the Parameters of the Extreme Value (Gumbel) Distribution Based on the Mean and Standard Deviation. Trabajos de Estadistica 14, 61–81.
eevd
, Extreme Value Distribution,
estimate.object
.
# Generate 20 observations from an extreme value distribution with
# parameters location=2 and scale=1, then estimate the parameters
# and estimate the 90'th percentile.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- revd(20, location = 2)
eqevd(dat, p = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Extreme Value
#
#Estimated Parameter(s):          location = 1.9684093
#                                 scale    = 0.7481955
#
#Estimation Method:               mle
#
#Estimated Quantile(s):           90'th %ile = 3.652124
#
#Quantile Estimation Method:      Quantile(s) Based on
#                                 mle Estimators
#
#Data:                            dat
#
#Sample Size:                     20

#----------

# Clean up
#---------
rm(dat)
Estimate quantiles of an exponential distribution.
eqexp(x, p = 0.5, method = "mle/mme", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes an exponential distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the rate parameter.
Currently the only possible value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqexp
returns estimated quantiles as well as
the estimate of the rate parameter.
Quantiles are estimated by 1) estimating the rate parameter by
calling eexp
, and then 2) calling the function
qexp
and using the estimated value for
rate.
If x
is a numeric vector, eqexp
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqexp
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The exponential distribution is a special case of the gamma distribution, and takes on positive real values. A major use of the exponential distribution is in life testing where it is used to model the lifetime of a product, part, person, etc.
The exponential distribution is the only continuous distribution with a
“lack of memory” property. That is, if the lifetime of a part follows
the exponential distribution, then the distribution of the time until failure
is the same as the distribution of the time until failure given that the part
has survived to time \(t\).
The exponential distribution is related to the double exponential (also called Laplace) distribution, and to the extreme value distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
eexp
, Exponential,
estimate.object
.
# Generate 20 observations from an exponential distribution with parameter
# rate=2, then estimate the parameter and estimate the 90th percentile.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(250)
dat <- rexp(20, rate = 2)
eqexp(dat, p = 0.9)

#Results of Distribution Parameter Estimation
#--------------------------------------------
#
#Assumed Distribution:            Exponential
#
#Estimated Parameter(s):          rate = 2.260587
#
#Estimation Method:               mle/mme
#
#Estimated Quantile(s):           90'th %ile = 1.018578
#
#Quantile Estimation Method:      Quantile(s) Based on
#                                 mle/mme Estimators
#
#Data:                            dat
#
#Sample Size:                     20

#----------

# Clean up
#---------
rm(dat)
Estimate quantiles of a gamma distribution, and optionally construct a confidence interval for a quantile.
eqgamma(x, p = 0.5, method = "mle", ci = FALSE,
    ci.type = "two-sided", conf.level = 0.95,
    normal.approx.transform = "kulkarni.powar", digits = 0)

eqgammaAlt(x, p = 0.5, method = "mle", ci = FALSE,
    ci.type = "two-sided", conf.level = 0.95,
    normal.approx.transform = "kulkarni.powar", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a gamma distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the shape and scale
parameters of the distribution. The possible values are
|
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
normal.approx.transform |
character string indicating which power transformation to use.
Possible values are |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqgamma
returns estimated quantiles as well as
estimates of the shape and scale parameters.
The function eqgammaAlt
returns estimated quantiles as well as
estimates of the mean and coefficient of variation.
Quantiles are estimated by 1) estimating the shape and scale parameters by
calling egamma
, and then 2) calling the function
qgamma
and using the estimated values for shape
and scale.
The confidence interval for a quantile is computed by:
using a power transformation on the original data to induce approximate normality,
using eqnorm
to compute the confidence interval,
and then
back-transforming the interval to create a confidence interval on the original scale.
This is similar to what is done to create tolerance intervals for a gamma distribution
(Krishnamoorthy et al., 2008), and there is a one-to-one relationship between confidence
intervals for a quantile and tolerance intervals (see the DETAILS section of the
help file for eqnorm
). The value normal.approx.transform="cube.root"
uses the cube root transformation suggested by Wilson and Hilferty (1931) and used by
Krishnamoorthy et al. (2008) and Singh et al. (2010b), and the value
normal.approx.transform="fourth.root"
uses the fourth root transformation suggested
by Hawkins and Wixley (1986) and used by Singh et al. (2010b).
The default value normal.approx.transform="kulkarni.powar"
uses the “Optimum Power Normal Approximation Method” of Kulkarni and Powar (2010).
The “optimum” power \(p\) is determined by:
\[p = -0.0705 - 0.178\,\hat{\kappa} + 0.475\,\sqrt{\hat{\kappa}} \qquad \mbox{if } \hat{\kappa} \le 1.5\]
\[p = 0.246 \qquad \mbox{if } \hat{\kappa} > 1.5\]
where \(\hat{\kappa}\) denotes the estimate of the shape parameter. Although
Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to
determine the power \(p\), for the functions eqgamma and
eqgammaAlt the power is based on whatever estimate of shape is used
(e.g., method="mle", method="bcmle", etc.).
If x
is a numeric vector, eqgamma
and eqgammaAlt
return a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqgamma
and
eqgammaAlt
return a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
. In addition, if ci=TRUE
,
the returned list contains a component called interval
containing the
confidence interval information. If x
already has a component called
interval
, this component is replaced with the confidence interval information.
The gamma distribution takes values on the positive real line. Special cases of the gamma are the exponential distribution and the chi-square distributions. Applications of the gamma include life testing, statistical ecology, queuing theory, inventory control, and precipitation processes. A gamma distribution starts to resemble a normal distribution as the shape parameter a tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Krishnamoorthy K., T. Mathew, and S. Mukherjee. (2008). Normal-Based Methods for a Gamma Distribution: Prediction and Tolerance Intervals and Stress-Strength Reliability. Technometrics, 50(1), 69–78.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
egamma
, GammaDist
,
estimate.object
, eqnorm
, tolIntGamma
.
# Generate 20 observations from a gamma distribution with parameters # shape=3 and scale=2, then estimate the 90th percentile and create # a one-sided upper 95% confidence interval for that percentile. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rgamma(20, shape = 3, scale = 2) eqgamma(dat, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.203862 # scale = 2.174928 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 9.113446 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 13.79733 #---------- # Compare these results with the true 90'th percentile: qgamma(p = 0.9, shape = 3, scale = 2) #[1] 10.64464 #---------- # Using the same data as in the previous example, use egammaAlt # to estimate the mean and cv based on the bias-corrected # estimate of shape, and use the cube-root transformation to # normality. eqgammaAlt(dat, p = 0.9, method = "bcmle", ci = TRUE, ci.type = "upper", normal.approx.transform = "cube.root") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): mean = 4.7932408 # cv = 0.7242165 # #Estimation Method: bcmle of 'shape' # #Estimated Quantile(s): 90'th %ile = 9.428 # #Quantile Estimation Method: Quantile(s) Based on # bcmle of 'shape' # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact using # Wilson & Hilferty (1931) cube-root # transformation to Normality # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 12.89643 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and # 95% confidence using chrysene data and assuming a lognormal # distribution. Here we will use the same chrysene data but assume a # gamma distribution. # A beta-content upper tolerance limit with 95% coverage and # 95% confidence is equivalent to the 95% upper confidence limit for # the 95th percentile. attach(EPA.09.Ex.17.3.chrysene.df) Chrysene <- Chrysene.ppb[Well.type == "Background"] eqgamma(Chrysene, p = 0.95, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.806929 # scale = 5.286026 # #Estimation Method: mle # #Estimated Quantile(s): 95'th %ile = 31.74348 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: Chrysene # #Sample Size: 8 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 69.32425 #---------- # Clean up rm(Chrysene) detach("EPA.09.Ex.17.3.chrysene.df")
Estimate quantiles of a geometric distribution.
eqgeom(x, p = 0.5, method = "mle/mme", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a geometric distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the probability parameter.
Possible values are |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqgeom
returns estimated quantiles as well as
the estimate of the rate parameter.
Quantiles are estimated by 1) estimating the probability parameter by
calling egeom
, and then 2) calling the function
qgeom
and using the estimated value for
the probability parameter.
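A minimal sketch of this two-step computation, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(42)
x <- rgeom(10, prob = 0.2)
# Step 1: estimate the probability parameter
p.hat <- egeom(x)$parameters["prob"]
# Step 2: plug the estimate into qgeom() to get the 90th percentile
qgeom(0.9, prob = p.hat)
# eqgeom() performs both steps
eqgeom(x, p = 0.9)$quantiles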
If x
is a numeric vector, eqgeom
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqgeom
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The geometric distribution with parameter prob=p is a special case of the negative binomial distribution with parameters size=1 and prob=p.
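As an illustrative check of this relationship using the base R probability functions:

p <- 0.2
all.equal(dgeom(0:5, prob = p), dnbinom(0:5, size = 1, prob = p))
#[1] TRUE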
The negative binomial distribution has its roots in a gambling game where participants would bet on the number of tosses of a coin necessary to achieve a fixed number of heads. The negative binomial distribution has been applied in a wide variety of fields, including accident statistics, birth-and-death processes, and modeling spatial distributions of biological organisms.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 5.
egeom
, Geometric, enbinom
,
NegBinomial, estimate.object
.
# Generate an observation from a geometric distribution with parameter # prob=0.2, then estimate the parameter prob and the 90'th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgeom(1, prob = 0.2) dat #[1] 4 eqgeom(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Geometric # #Estimated Parameter(s): prob = 0.2 # #Estimation Method: mle/mme # #Estimated Quantile(s): 90'th %ile = 10 # #Quantile Estimation Method: Quantile(s) Based on # mle/mme Estimators # #Data: dat # #Sample Size: 1 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a generalized extreme value distribution.
eqgevd(x, p = 0.5, method = "mle", pwme.method = "unbiased", tsoe.method = "med", plot.pos.cons = c(a = 0.35, b = 0), digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a generalized extreme value distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the location, scale, and
threshold parameters. Possible values are
|
pwme.method |
character string specifying what method to use to compute the
probability-weighted moments when |
tsoe.method |
character string specifying the robust function to apply in the second stage of
the two-stage order-statistics estimator when |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqgevd
returns estimated quantiles as well as
estimates of the location, scale, and shape parameters.
Quantiles are estimated by 1) estimating the location, scale, and shape
parameters by calling egevd
, and then 2) calling the function
qgevd
and using the estimated values for
location, scale, and shape.
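A minimal sketch of this two-step computation, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(498)
x <- rgevd(20, location = 2, scale = 1, shape = 0.2)
# Step 1: estimate location, scale, and shape (maximum likelihood by default)
est <- egevd(x)$parameters
# Step 2: plug the estimates into qgevd() to get the 90th percentile
qgevd(0.9, location = est["location"], scale = est["scale"], shape = est["shape"])
# eqgevd() performs both steps
eqgevd(x, p = 0.9)$quantiles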
If x
is a numeric vector, eqgevd
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqgevd
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
Two-parameter extreme value distributions (EVD) have been applied extensively since the 1930's to several fields of study, including the distributions of hydrological and meteorological variables, human lifetimes, and strength of materials. The three-parameter generalized extreme value distribution (GEVD) was introduced by Jenkinson (1955) to model annual maximum and minimum values of meteorological events. Since then, it has been used extensively in the hydrological and meteorological fields.
The three families of EVDs are all special kinds of GEVDs. When the shape parameter equals 0, the GEVD reduces to the Type I extreme value (Gumbel) distribution. (The function zTestGevdShape allows you to test the null hypothesis that the shape parameter is 0.) When the shape parameter is negative, the GEVD is the same as the Type II extreme value distribution, and when it is positive it is the same as the Type III extreme value distribution.
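As a brief hedged sketch of using zTestGevdShape to check whether a Gumbel model is adequate (made-up data; it is assumed the function returns a standard hypothesis-test object with a p.value component):

library(EnvStats)
set.seed(20)
x <- rgevd(50, location = 2, scale = 1, shape = 0)  # shape = 0, i.e., Gumbel data
# Test H0: shape = 0; a large p-value is consistent with a Gumbel model
zTestGevdShape(x)$p.value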
Hosking et al. (1985) compare the asymptotic and small-sample statistical
properties of the PWME with the MLE and Jenkinson's (1969) method of sextiles.
Castillo and Hadi (1994) compare the small-sample statistical properties of the
MLE, PWME, and TSOE. Hosking and Wallis (1995) compare the small-sample properties
of unbiased L-moment estimators vs. plotting-position L-moment estimators. (PWMEs can be written as linear combinations of L-moments and thus have equivalent statistical properties.) Hosking and Wallis (1995) conclude
that unbiased estimators should be used for almost all applications.
Steven P. Millard ([email protected])
Castillo, E., and A. Hadi. (1994). Parameter and Quantile Estimation for the Generalized Extreme-Value Distribution. Environmetrics 5, 417–432.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hosking, J.R.M. (1984). Testing Whether the Shape Parameter is Zero in the Generalized Extreme-Value Distribution. Biometrika 71(2), 367–374.
Hosking, J.R.M. (1985). Algorithm AS 215: Maximum-Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 34(3), 301–310.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Jenkinson, A.F. (1969). Statistics of Extremes. Technical Note 98, World Meteorological Office, Geneva.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Macleod, A.J. (1989). Remark AS R76: A Remark on Algorithm AS 215: Maximum Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 38(1), 198–199.
Prescott, P., and A.T. Walden. (1980). Maximum Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Biometrika 67(3), 723–724.
Prescott, P., and A.T. Walden. (1983). Maximum Likelihood Estimation of the Three-Parameter Generalized Extreme-Value Distribution from Censored Samples. Journal of Statistical Computing and Simulation 16, 241–250.
egevd
, Generalized Extreme Value Distribution,
Extreme Value Distribution, eevd
, estimate.object
.
# Generate 20 observations from a generalized extreme value distribution # with parameters location=2, scale=1, and shape=0.2, then compute the # MLEs of location, shape,and threshold, and estimate the 90th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(498) dat <- rgevd(20, location = 2, scale = 1, shape = 0.2) eqgevd(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Generalized Extreme Value # #Estimated Parameter(s): location = 1.6144631 # scale = 0.9867007 # shape = 0.2632493 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 3.289912 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a hypergeometric distribution.
eqhyper(x, m = NULL, total = NULL, k = NULL, p = 0.5, method = "mle", digits = 0)
x |
non-negative integer indicating the number of white balls out of a sample of
size |
m |
non-negative integer indicating the number of white balls in the urn.
You must supply |
total |
positive integer indicating the total number of balls in the urn (i.e.,
|
k |
positive integer indicating the number of balls drawn without replacement from the
urn. Missing values ( |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the parameters of the
hypergeometric distribution. Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqhyper
returns estimated quantiles as well as
estimates of the hypergeometric distribution parameters.
Quantiles are estimated by 1) estimating the distribution parameters by
calling ehyper
, and then 2) calling the function
qhyper
and using the estimated values for
the distribution parameters.
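A minimal sketch of this two-step computation, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(250)
x <- rhyper(nn = 1, m = 10, n = 30, k = 5)
# Step 1: estimate m (n is then total - m); total and k are taken as known
est <- ehyper(x, total = 40, k = 5)$parameters
# Step 2: plug the estimates into qhyper() to get the 80th percentile
qhyper(0.8, m = est["m"], n = est["n"], k = est["k"])
# eqhyper() performs both steps
eqhyper(x, total = 40, k = 5, p = 0.8)$quantiles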
If x
is a numeric vector, eqhyper
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqhyper
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The hypergeometric distribution can be described by an urn model with M white balls and N black balls. If K balls are drawn with replacement, then the number of white balls in the sample of size K follows a binomial distribution with parameters size=K and prob=M/(M+N). If K balls are drawn without replacement, then the number of white balls in the sample of size K follows a hypergeometric distribution with parameters m=M, n=N, and k=K.
The name “hypergeometric” comes from the fact that the probabilities associated with this distribution can be written as successive terms in the expansion of a function of a Gaussian hypergeometric series.
The hypergeometric distribution is applied in a variety of fields, including quality control and estimation of animal population size. It is also the distribution used to compute probabilities for Fisher's exact test for a 2x2 contingency table.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 6.
ehyper
, Hypergeometric, estimate.object
.
# Generate an observation from a hypergeometric distribution with # parameters m=10, n=30, and k=5, then estimate the parameter m, and # the 80'th percentile. # Note: the call to set.seed simply allows you to reproduce this example. # Also, the only parameter actually estimated is m; once m is estimated, # n is computed by subtracting the estimated value of m (8 in this example) # from the given of value of m+n (40 in this example). The parameters # n and k are shown in the output in order to provide information on # all of the parameters associated with the hypergeometric distribution. set.seed(250) dat <- rhyper(nn = 1, m = 10, n = 30, k = 5) dat #[1] 1 eqhyper(dat, total = 40, k = 5, p = 0.8) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Hypergeometric # #Estimated Parameter(s): m = 8 # n = 32 # k = 5 # #Estimation Method: mle for 'm' # #Estimated Quantile(s): 80'th %ile = 2 # #Quantile Estimation Method: Quantile(s) Based on # mle for 'm' Estimators # #Data: dat # #Sample Size: 1 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a lognormal distribution, and optionally construct a confidence interval for a quantile.
eqlnorm(x, p = 0.5, method = "qmle", ci = FALSE, ci.method = "exact", ci.type = "two-sided", conf.level = 0.95, digits = 0)
x |
a numeric vector of positive observations, or an object resulting from a call to an
estimating function that assumes a lognormal distribution
(i.e., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string indicating what method to use to estimate the quantile(s).
The possible values are |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the quantile. The possible values are |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Quantiles and their associated confidence intervals are constructed by calling
the function eqnorm
using the log-transformed data and then
exponentiating the quantiles and confidence limits.
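A minimal sketch of this log-scale computation, using made-up data and assuming the EnvStats package is loaded (it is also assumed that, as for other EnvStats estimate objects, the confidence limits are stored in the interval$limits component):

library(EnvStats)
set.seed(47)
x <- rlnorm(20, meanlog = 3, sdlog = 0.5)
# 90th percentile and upper confidence limit on the log scale ...
log.fit <- eqnorm(log(x), p = 0.9, ci = TRUE, ci.type = "upper")
# ... back-transformed to the original scale
exp(log.fit$quantiles)
exp(log.fit$interval$limits)
# eqlnorm() carries out the same steps directly
eqlnorm(x, p = 0.9, ci = TRUE, ci.type = "upper")$quantiles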
In the special case when p=0.5
and method="mvue"
, the estimated
median is computed using the method given in Gilbert (1987, p.172) and
Bradu and Mundlak (1970).
If x
is a numeric vector, eqlnorm
returns a list of class
"estimate"
containing the estimated quantile(s) and other information.
See estimate.object
for details.
If x
is the result of calling an estimation function, eqlnorm
returns a list whose class is the same as x
. The list contains the same
components as x
, as well as components called quantiles
and
quantile.method
. In addition, if ci=TRUE
, the returned list
contains a component called interval
containing the confidence interval
information. If x
already has a component called interval
, this
component is replaced with the confidence interval information.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Bradu, D., and Y. Mundlak. (1970). Estimation in Lognormal Linear Models. Journal of the American Statistical Association 65, 198-211.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.
Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
eqnorm
, Lognormal
, elnorm
,
estimate.object
.
# Generate 20 observations from a lognormal distribution with # parameters meanlog=3 and sdlog=0.5, then estimate the 90th # percentile and create a one-sided upper 95% confidence interval # for that percentile. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(47) dat <- rlnorm(20, meanlog = 3, sdlog = 0.5) eqlnorm(dat, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.9482139 # sdlog = 0.4553215 # #Estimation Method: mvue # #Estimated Quantile(s): 90'th %ile = 34.18312 # #Quantile Estimation Method: qmle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 45.84008 #---------- # Compare these results with the true 90'th percentile: qlnorm(p = 0.9, meanlog = 3, sdlog = 0.5) #[1] 38.1214 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal # distribution. # A beta-content upper tolerance limit with 95% coverage and 95% # confidence is equivalent to the 95% upper confidence limit for the # 95th percentile. attach(EPA.09.Ex.17.3.chrysene.df) Chrysene <- Chrysene.ppb[Well.type == "Background"] eqlnorm(Chrysene, p = 0.95, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.5085773 # sdlog = 0.6279479 # #Estimation Method: mvue # #Estimated Quantile(s): 95'th %ile = 34.51727 # #Quantile Estimation Method: qmle # #Data: Chrysene # #Sample Size: 8 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.0000 # UCL = 90.9247 #---------- # Clean up rm(Chrysene) detach("EPA.09.Ex.17.3.chrysene.df")
Estimate quantiles of a three-parameter lognormal distribution.
eqlnorm3(x, p = 0.5, method = "lmle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a three-parameter lognormal distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Quantiles are estimated by 1) estimating the distribution parameters by
calling elnorm3
, and then 2) calling the function
qlnorm3
and using the estimated distribution
parameters.
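A minimal sketch of this two-step computation, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(250)
x <- rlnorm3(20, meanlog = 1.5, sdlog = 1, threshold = 10)
# Step 1: estimate meanlog, sdlog, and threshold (modified method of moments here)
est <- elnorm3(x, method = "mmme")$parameters
# Step 2: plug the estimates into qlnorm3() to get the 90th percentile
qlnorm3(0.9, meanlog = est["meanlog"], sdlog = est["sdlog"],
  threshold = est["threshold"])
# eqlnorm3() performs both steps
eqlnorm3(x, p = 0.9, method = "mmme")$quantiles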
If x
is a numeric vector, eqlnorm3
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqlnorm3
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The problem of estimating the parameters of a three-parameter lognormal distribution has been extensively discussed by Aitchison and Brown (1957, Chapter 6), Calitz (1973), Cohen (1951), Cohen (1988), Cohen and Whitten (1980), Cohen et al. (1985), Griffiths (1980), Harter and Moore (1966), Hill (1963), and Royston (1992b). Stedinger (1980) and Hoshi et al. (1984) discuss fitting the three-parameter lognormal distribution to hydrologic data.
The global maximum likelihood estimates are inadmissible. In the past, several
researchers have found that the local maximum likelihood estimates (lmle's)
occasionally fail because of convergence problems, but they were not using the
likelihood profile and reparameterization of Griffiths (1980). Cohen (1988)
recommends the modified method of moments estimators over lmle's because they are easy to compute, they are unbiased with respect to the mean and standard deviation on the log-scale, their variances are minimal or near minimal, and they do not suffer from regularity problems.
Because the distribution of the lmle of the threshold parameter is far from normal for moderate sample sizes (Griffiths, 1980), it is questionable whether confidence intervals for the threshold or the median based on asymptotic variances and covariances will perform well. Cohen and Whitten (1980) and Cohen et al. (1985), however, found that the asymptotic variances and covariances are reasonably close to corresponding simulated variances and covariances for the modified method of moments estimators (method="mmme"). In a simulation study (5000 Monte Carlo trials), Royston (1992b) found that the coverage of confidence intervals for the threshold based on the likelihood profile (ci.method="likelihood.profile") was very close to the nominal level (94.1% for a nominal level of 95%), although not symmetric. Royston (1992b) also found that the coverage of confidence intervals for the threshold based on the skewness method (ci.method="skewness") was also very close (95.4%) and symmetric.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, Chapter 5.
Calitz, F. (1973). Maximum Likelihood Estimation of the Parameters of the Three-Parameter Lognormal Distribution–a Reconsideration. Australian Journal of Statistics 15(3), 185–190.
Cohen, A.C. (1951). Estimating Parameters of Logarithmic-Normal Distributions by Maximum Likelihood. Journal of the American Statistical Association 46, 206–212.
Cohen, A.C. (1988). Three-Parameter Estimation. In Crow, E.L., and K. Shimizu, eds. Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 4.
Cohen, A.C., and B.J. Whitten. (1980). Estimation in the Three-Parameter Lognormal Distribution. Journal of the American Statistical Association 75, 399–404.
Cohen, A.C., B.J. Whitten, and Y. Ding. (1985). Modified Moment Estimation for the Three-Parameter Lognormal Distribution. Journal of Quality Technology 17, 92–99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, Chapter 2.
Griffiths, D.A. (1980). Interval Estimation for the Three-Parameter Lognormal Distribution via the Likelihood Function. Applied Statistics 29, 58–68.
Harter, H.L., and A.H. Moore. (1966). Local-Maximum-Likelihood Estimation of the Parameters of Three-Parameter Lognormal Populations from Complete and Censored Samples. Journal of the American Statistical Association 61, 842–851.
Heyde, C.C. (1963). On a Property of the Lognormal Distribution. Journal of the Royal Statistical Society, Series B 25, 392–393.
Hill, B.M. (1963). The Three-Parameter Lognormal Distribution and Bayesian Analysis of a Point-Source Epidemic. Journal of the American Statistical Association 58, 72–84.
Hoshi, K., J.R. Stedinger, and J. Burges. (1984). Estimation of Log-Normal Quantiles: Monte Carlo Results and First-Order Approximations. Journal of Hydrology 71, 1–30.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897–912.
Stedinger, J.R. (1980). Fitting Lognormal Distributions to Hydrologic Data. Water Resources Research 16(3), 481–490.
elnorm3
, Lognormal3, Lognormal,
LognormalAlt, Normal.
# Generate 20 observations from a 3-parameter lognormal distribution # with parameters meanlog=1.5, sdlog=1, and threshold=10, then use # Cohen and Whitten's (1980) modified moments estimators to estimate # the parameters, and estimate the 90th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnorm3(20, meanlog = 1.5, sdlog = 1, threshold = 10) eqlnorm3(dat, method = "mmme", p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: 3-Parameter Lognormal # #Estimated Parameter(s): meanlog = 1.5206664 # sdlog = 0.5330974 # threshold = 9.6620403 # #Estimation Method: mmme # #Estimated Quantile(s): 90'th %ile = 18.72194 # #Quantile Estimation Method: Quantile(s) Based on # mmme Estimators # #Data: dat # #Sample Size: 20 # Clean up #--------- rm(dat)
Estimate quantiles of a lognormal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for a quantile.
eqlnormCensored(x, censored, censoring.side = "left", p = 0.5, method = "mle", ci = FALSE, ci.method = "exact.for.complete", ci.type = "two-sided", conf.level = 0.95, digits = 0, nmc = 1000, seed = NULL)
x |
a numeric vector of positive observations.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the mean and standard deviation on the log-scale. For singly censored data, the possible values are: For multiply censored data, the possible values are: See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the quantile. The possible values are: See the DETAILS section for more
information. This argument is ignored if |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
nmc |
numeric scalar indicating the number of Monte Carlo simulations to run when
|
seed |
integer supplied to the function |
Quantiles and their associated confidence intervals are constructed by calling
the function eqnormCensored
using the log-transformed data and then
exponentiating the quantiles and confidence limits.
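A minimal sketch of this log-scale computation for censored data, using made-up data and assuming the EnvStats package is loaded:

library(EnvStats)
set.seed(47)
x <- rlnorm(20, meanlog = 3, sdlog = 0.5)
censored <- x < 10      # left-censor observations below 10
x[censored] <- 10
# 90th percentile estimated on the log scale, then back-transformed ...
exp(eqnormCensored(log(x), censored, p = 0.9)$quantiles)
# ... which matches the direct call
eqlnormCensored(x, censored, p = 0.9)$quantiles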
eqlnormCensored
returns a list of class "estimateCensored"
containing the estimated quantile(s) and other information.
See estimateCensored.object
for details.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for a percentile, rather than rely on a single point estimate of the percentile. Confidence intervals for percentiles of a normal distribution depend on the properties of the estimators for both the mean and standard deviation.
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring (see, for example, Singh et al., 2006). Studies to evaluate the performance of a confidence interval for a percentile include: Caudill et al. (2007), Hewett and Ganner (2007), Kroll and Stedinger (1996), and Serasinghe (2010).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Caudill, S.P., L.-Y. Wong, W.E. Turner, R. Lee, A. Henderson, D. G. Patterson Jr. (2007). Percentile Estimation Using Variable Censored Data. Chemosphere 68, 169–180.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Hewett, P., and G.H. Ganser. (2007). A Comparison of Several Methods for Analyzing Censored Data. Annals of Occupational Hygiene 51(7), 611–632.
Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kroll, C.N., and J.R. Stedinger. (1996). Estimation of Moments and Quantiles Using Censored Data. Water Resources Research 32(4), 1005–1012.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Serasinghe, S.K. (2010). A Simulation Comparison of Parametric and Nonparametric Estimators of Quantiles from Right Censored Data. A Report submitted in partial fulfillment of the requirements for the degree Master of Science, Department of Statistics, College of Arts and Sciences, Kansas State University, Manhattan, Kansas.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.
Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
eqnormCensored
, enormCensored
,
tolIntNormCensored
,
elnormCensored
, Lognormal
, estimateCensored.object
.
# Generate 15 observations from a lognormal distribution with # parameters meanlog=3 and sdlog=0.5, and censor observations less than 10. # Then generate 15 more observations from this distribution and censor # observations less than 9. # Then estimate the 90th percentile and create a one-sided upper 95% # confidence interval for that percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(47) x.1 <- rlnorm(15, meanlog = 3, sdlog = 0.5) sort(x.1) # [1] 8.051717 9.651611 11.671282 12.271247 12.664108 17.446124 # [7] 17.707301 20.238069 20.487219 21.025510 21.208197 22.036554 #[13] 25.710773 28.661973 54.453557 censored.1 <- x.1 < 10 x.1[censored.1] <- 10 x.2 <- rlnorm(15, meanlog = 3, sdlog = 0.5) sort(x.2) # [1] 6.289074 7.511164 8.988267 9.179006 12.869408 14.130081 # [7] 16.941937 17.060513 19.287572 19.682126 20.363893 22.750203 #[13] 24.744306 28.089325 37.792873 censored.2 <- x.2 < 9 x.2[censored.2] <- 9 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) eqlnormCensored(x, censored, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 9 10 # #Estimated Parameter(s): meanlog = 2.8099300 # sdlog = 0.5137151 # #Estimation Method: MLE # #Estimated Quantile(s): 90'th %ile = 32.08159 # #Quantile Estimation Method: Quantile(s) Based on # MLE Estimators # #Data: x # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 16.66667% # #Confidence Interval for: 90'th %ile # #Assumed Sample Size: 30 # #Confidence Interval Method: Exact for # Complete Data # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.00000 # UCL = 41.38716 #---------- # Compare these results with the true 90'th percentile: qlnorm(p = 0.9, meanlog = 3, sd = 0.5) #[1] 38.1214 #---------- # Clean up rm(x.1, censored.1, x.2, censored.2, x, censored) #-------------------------------------------------------------------- # Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and standard deviation using the MLE, # and then construct an upper 95% confidence limit for the 90th percentile. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... 
#23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean, standard deviation, and 90th percentile # on the log-scale using the MLE, and construct an upper 95% # confidence limit for the 90th percentile: #--------------------------------------------------------------- with(EPA.09.Ex.15.1.manganese.df, eqlnormCensored(Manganese.ppb, Censored, p = 0.9, ci = TRUE, ci.type = "upper")) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Estimated Quantile(s): 90'th %ile = 52.14674 # #Quantile Estimation Method: Quantile(s) Based on # MLE Estimators # #Data: Manganese.ppb # #Censoring Variable: censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: 90'th %ile # #Assumed Sample Size: 25 # #Confidence Interval Method: Exact for # Complete Data # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = 0.0000 # UCL = 110.9305
Estimate quantiles of a logistic distribution.
eqlogis(x, p = 0.5, method = "mle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a logistic distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the distribution parameters.
Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqlogis
returns estimated quantiles as well as
estimates of the location and scale parameters.
Quantiles are estimated by 1) estimating the location and scale parameters by
calling elogis
, and then 2) calling the function
qlogis
and using the estimated values for
location and scale.
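As a minimal sketch of these two steps (assuming, as for other EnvStats estimating functions, that the fitted location and scale are stored in the parameters component of the returned "estimate" object):

set.seed(250)
dat <- rlogis(20)

# Step 1: estimate location and scale
est <- elogis(dat)
params <- est$parameters

# Step 2: plug the estimates into qlogis
qlogis(0.9, location = params["location"], scale = params["scale"])

# This should match the 90'th %ile reported by eqlogis(dat, p = 0.9)
# in the EXAMPLES below.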
If x
is a numeric vector, eqlogis
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqlogis
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The logistic distribution is defined on the real line and is unimodal and symmetric about its location parameter (the mean). It has longer tails than a normal (Gaussian) distribution. It is used to model growth curves and bioassay data.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
elogis
, Logistic, estimate.object
.
# Generate 20 observations from a logistic distribution with # parameters location=0 and scale=1, then estimate the parameters # and estimate the 90th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlogis(20) eqlogis(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Logistic # #Estimated Parameter(s): location = -0.2181845 # scale = 0.8152793 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 1.573167 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 #---------- # Clean up rm(dat)
Estimate quantiles of a negative binomial distribution.
eqnbinom(x, size = NULL, p = 0.5, method = "mle/mme", digits = 0)
x |
vector of non-negative integers indicating the number of trials that took place
before |
size |
vector of positive integers indicating the number of “successes” that
must be observed before the trials are stopped. Missing ( |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the probability parameter.
Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqnbinom
returns estimated quantiles as well as
estimates of the prob
parameter.
Quantiles are estimated by 1) estimating the prob parameter by
calling enbinom
, and then 2) calling the function
qnbinom
and using the estimated value for
prob
.
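A minimal sketch of these two steps, using the single observation from the EXAMPLES below (assuming the fitted size and prob are stored in the parameters component of the returned "estimate" object):

# Step 1: estimate prob (size is supplied, not estimated)
est <- enbinom(5, size = 2)
prob.hat <- est$parameters["prob"]

# Step 2: plug the estimate into qnbinom
qnbinom(0.9, size = 2, prob = prob.hat)

# This should match the 90'th %ile of 11 reported by
# eqnbinom(5, size = 2, p = 0.9) in the EXAMPLES below.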
If x
is a numeric vector, eqnbinom
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqnbinom
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The negative binomial distribution has its roots in a gambling game where participants would bet on the number of tosses of a coin necessary to achieve a fixed number of heads. The negative binomial distribution has been applied in a wide variety of fields, including accident statistics, birth-and-death processes, and modeling spatial distributions of biological organisms.
The geometric distribution with parameter prob=p is a special case of the negative binomial distribution with parameters size=1 and prob=p.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 5.
enbinom
, NegBinomial, egeom
,
Geometric, estimate.object
.
# Generate an observation from a negative binomial distribution with # parameters size=2 and prob=0.2, then estimate the parameter prob # and the 90th percentile. # Note: the call to set.seed simply allows you to reproduce this example. # Also, the only parameter that is estimated is prob; the parameter # size is supplied in the call to enbinom. The parameter size is printed in # order to show all of the parameters associated with the distribution. set.seed(250) dat <- rnbinom(1, size = 2, prob = 0.2) dat #[1] 5 eqnbinom(dat, size = 2, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Negative Binomial # #Estimated Parameter(s): size = 2.0000000 # prob = 0.2857143 # #Estimation Method: mle/mme for 'prob' # #Estimated Quantile(s): 90'th %ile = 11 # #Quantile Estimation Method: Quantile(s) Based on # mle/mme for 'prob' Estimators # #Data: dat, 2 # #Sample Size: 1 #---------- # Clean up rm(dat)
Estimate quantiles of a normal distribution, and optionally construct a confidence interval for a quantile.
eqnorm(x, p = 0.5, method = "qmle", ci = FALSE, ci.method = "exact", ci.type = "two-sided", conf.level = 0.95, digits = 0, warn = TRUE)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a normal (Gaussian) distribution
(i.e., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string indicating what method to use to estimate the quantile(s).
Currently the only possible value is |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the quantile. The possible values are |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
warn |
logical scalar indicating whether to warn in the case when |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Quantiles are estimated by 1) estimating the mean and standard deviation parameters by
calling enorm
with method="mvue"
, and then
2) calling the function qnorm
and using the estimated values
for mean and standard deviation. This estimator of the p'th quantile is sometimes called the quasi-maximum likelihood estimator (qmle; Cohn et al., 1989) because if the maximum likelihood estimator of standard deviation were used in place of the minimum variance unbiased one, then this estimator of the quantile would be the mle of the p'th quantile.
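As a minimal sketch of the qmle computation (here the sample mean and sample standard deviation stand in for the estimates returned by enorm with method="mvue"):

set.seed(47)
dat <- rnorm(20, mean = 10, sd = 2)

# qmle of the 90th percentile: plug the estimated mean and sd into qnorm
qnorm(0.9, mean = mean(dat), sd = sd(dat))

# This should reproduce the 90'th %ile of 12.12693 reported by
# eqnorm(dat, p = 0.9) in the EXAMPLES below.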
When ci=TRUE
and ci.method="exact"
, the confidence interval for a
quantile is computed by using the relationship between a confidence interval for
a quantile and a tolerance interval. Specifically, it can be shown
(e.g., Conover, 1980, pp.119-121) that an upper confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to an upper β-content tolerance interval with coverage 100p% and confidence level 100(1-α)%. Also, a lower confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to a lower β-content tolerance interval with coverage 100(1-p)% and confidence level 100(1-α)%. See the help file for
tolIntNorm
for information on tolerance intervals for a normal distribution.
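A sketch of this equivalence (the tolIntNorm argument and limit names below are assumed to follow the usual EnvStats conventions): the 95% upper confidence limit for the 90th percentile and the upper tolerance limit with 90% coverage and 95% confidence should agree.

set.seed(47)
dat <- rnorm(20, mean = 10, sd = 2)

# 95% upper confidence limit for the 90th percentile
eqnorm(dat, p = 0.9, ci = TRUE, ci.type = "upper")$interval$limits["UCL"]

# Upper tolerance limit with 90% coverage and 95% confidence
tolIntNorm(dat, coverage = 0.9, cov.type = "content",
  ti.type = "upper", conf.level = 0.95)$interval$limits["UTL"]

# Both should equal the UCL of 13.30064 shown in the EXAMPLES below.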
When ci=TRUE
and ci.method="normal.approx"
, the confidence interval for a
quantile is computed by assuming the estimated quantile has an approximately normal
distribution and using the asymptotic variance to construct the confidence interval
(see Stedinger, 1983; Stedinger et al., 1993).
If x
is a numeric vector, eqnorm
returns a list of class
"estimate"
containing the estimated quantile(s) and other information.
See estimate.object
for details.
If x
is the result of calling an estimation function, eqnorm
returns a list whose class is the same as x
. The list contains the same
components as x
, as well as components called quantiles
and
quantile.method
. In addition, if ci=TRUE
, the returned list
contains a component called interval
containing the confidence interval
information. If x
already has a component called interval
, this
component is replaced with the confidence interval information.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.
Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
enorm
, tolIntNorm
, Normal
,
estimate.object
.
# Generate 20 observations from a normal distribution with # parameters mean=10 and sd=2, then estimate the 90th # percentile and create a one-sided upper 95% confidence interval # for that percentile. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(47) dat <- rnorm(20, mean = 10, sd = 2) eqnorm(dat, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 9.792856 # sd = 1.821286 # #Estimation Method: mvue # #Estimated Quantile(s): 90'th %ile = 12.12693 # #Quantile Estimation Method: qmle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 13.30064 #---------- # Compare these results with the true 90'th percentile: qnorm(p = 0.9, mean = 10, sd = 2) #[1] 12.56310 #---------- # Clean up rm(dat) #========== # Example 21-4 of USEPA (2009, p. 21-13) shows how to construct a # 99% lower confidence limit for the 95th percentile using chrysene # data and assuming a lognormal distribution. The data for this # example are stored in EPA.09.Ex.21.1.aldicarb.df. # The facility permit has established an ACL of 30 ppb that should not # be exceeded more than 5% of the time. Thus, if the lower confidence limit # for the 95th percentile is greater than 30 ppb, the well is deemed to be # out of compliance. # Look at the data #----------------- head(EPA.09.Ex.21.1.aldicarb.df) # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #5 1 Well.2 23.7 #6 2 Well.2 21.9 longToWide(EPA.09.Ex.21.1.aldicarb.df, "Aldicarb.ppb", "Month", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 #Month.1 19.9 23.7 5.6 #Month.2 29.6 21.9 3.3 #Month.3 18.7 26.9 2.3 #Month.4 24.2 26.1 6.9 # Estimate the 95th percentile and compute the lower # 99% confidence limit for Well 1. #--------------------------------------------------- with(EPA.09.Ex.21.1.aldicarb.df, eqnorm(Aldicarb.ppb[Well == "Well.1"], p = 0.95, ci = TRUE, ci.type = "lower", conf.level = 0.99)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 23.10000 # sd = 4.93491 # #Estimation Method: mvue # #Estimated Quantile(s): 95'th %ile = 31.2172 # #Quantile Estimation Method: qmle # #Data: Aldicarb.ppb[Well == "Well.1"] # #Sample Size: 4 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: lower # #Confidence Level: 99% # #Confidence Interval: LCL = 25.2855 # UCL = Inf # Now compute the 99% lower confidence limit for each of the three # wells all at once. #------------------------------------------------------------------ LCLs <- with(EPA.09.Ex.21.1.aldicarb.df, sapply(split(Aldicarb.ppb, Well), function(x) eqnorm(x, p = 0.95, method = "qmle", ci = TRUE, ci.type = "lower", conf.level = 0.99)$interval$limits["LCL"])) round(LCLs, 2) #Well.1.LCL Well.2.LCL Well.3.LCL # 25.29 25.66 5.46 LCLs > 30 #Well.1.LCL Well.2.LCL Well.3.LCL # FALSE FALSE FALSE # Clean up #--------- rm(LCLs) #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal # distribution. 
# A beta-content upper tolerance limit with 95% coverage and 95% # confidence is equivalent to the 95% upper confidence limit for the # 95th percentile. # Here we will construct a 95% upper confidence limit for the 95th # percentile based on the log-transformed data, then exponentiate the # result to get the confidence limit on the original scale. Note that # it is easier to just use the function eqlnorm with the original data # to achieve the same result. attach(EPA.09.Ex.17.3.chrysene.df) log.Chrysene <- log(Chrysene.ppb[Well.type == "Background"]) eqnorm(log.Chrysene, p = 0.95, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.5085773 # sd = 0.6279479 # #Estimation Method: mvue # #Estimated Quantile(s): 95'th %ile = 3.54146 # #Quantile Estimation Method: qmle # #Data: log.Chrysene # #Sample Size: 8 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 4.510032 exp(4.510032) #[1] 90.92473 #---------- # Clean up rm(log.Chrysene) detach("EPA.09.Ex.17.3.chrysene.df")
Estimate quantiles of a normal distribution given a sample of data that has been subjected to Type I censoring, and optionally construct a confidence interval for a quantile.
eqnormCensored(x, censored, censoring.side = "left", p = 0.5, method = "mle", ci = FALSE, ci.method = "exact.for.complete", ci.type = "two-sided", conf.level = 0.95, digits = 0, nmc = 1000, seed = NULL)
x |
a numeric vector of observations.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the mean and standard deviation. For singly censored data, the possible values are: For multiply censored data, the possible values are: See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
ci.method |
character string indicating what method to use to construct the confidence interval
for the quantile. The possible values are: See the DETAILS section for more information. This argument is ignored if |
ci.type |
character string indicating what kind of confidence interval for the quantile to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
nmc |
numeric scalar indicating the number of Monte Carlo simulations to run when
|
seed |
integer supplied to the function |
Estimating Quantiles
Quantiles are estimated by:
estimating the mean and standard deviation parameters by
calling enormCensored
, and then
calling the function qnorm
and using the estimated
values for the mean and standard deviation.
The estimated quantile thus depends on the method of estimating the mean and
standard deviation.
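A minimal sketch of these two steps (assuming, as for other EnvStats estimating functions, that the fitted mean and sd are stored in the parameters component of the object returned by enormCensored):

set.seed(47)
x <- rnorm(15, mean = 10, sd = 2)
censored <- x < 8
x[censored] <- 8

# Step 1: estimate mean and sd from the left-censored sample
est <- enormCensored(x, censored, method = "mle")

# Step 2: plug the estimates into qnorm
qnorm(0.9, mean = est$parameters["mean"], sd = est$parameters["sd"])

# This should match the 90'th %ile reported by
# eqnormCensored(x, censored, p = 0.9).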
Confidence Intervals for Quantiles
Exact Method When Data are Complete (ci.method="exact.for.complete"
)
When ci.method="exact.for.complete"
, the function eqnormCensored
calls the function eqnorm
, supplying it with the estimated mean
and standard deviation, and setting the argument ci.method="exact"
. Thus, this
is the exact method for computing a confidence interval for a quantile had the data
been complete. Because the data have been subjected to Type I censoring, this method
of constructing a confidence interval for the quantile is an approximation.
Normal Approximation (ci.method="normal.approx"
)
When ci.method="normal.approx"
, the function eqnormCensored
calls the function eqnorm
, supplying it with the estimated mean
and standard deviation, and setting the argument ci.method="normal.approx"
.
Thus, this is the normal approximation method for computing a confidence interval
for a quantile had the data been complete. Because the data have been subjected
to Type I censoring, this method of constructing a confidence interval for the
quantile is an approximation both because of the normal approximation and because
the estimates of the mean and standard deviation are based on censored, instead of
complete, data.
Generalized Pivotal Quantity (ci.method="gpq"
)
When ci.method="gpq"
, the function eqnormCensored
uses the
relationship between confidence intervals for quantiles and tolerance intervals
and calls the function tolIntNormCensored
with the argument
ti.method="gpq"
to construct the confidence interval.
Specifically, it can be shown (e.g., Conover, 1980, pp.119-121) that an upper confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to an upper β-content tolerance interval with coverage 100p% and confidence level 100(1-α)%. Also, a lower confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to a lower β-content tolerance interval with coverage 100(1-p)% and confidence level 100(1-α)%.
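A sketch of this relationship (the tolIntNormCensored arguments and limit names below, including the seed setting, are assumptions based on the description above rather than a verified signature; because the gpq limits are computed by Monte Carlo simulation, the two calls should agree when the same seed is used):

set.seed(47)
x <- rnorm(15, mean = 10, sd = 2)
censored <- x < 8
x[censored] <- 8

# 95% upper confidence limit for the 90th percentile, gpq method
eqnormCensored(x, censored, p = 0.9, ci = TRUE, ci.type = "upper",
  ci.method = "gpq", seed = 123)$interval$limits["UCL"]

# Upper tolerance limit with 90% coverage and 95% confidence, gpq method
tolIntNormCensored(x, censored, coverage = 0.9, ti.type = "upper",
  conf.level = 0.95, ti.method = "gpq", seed = 123)$interval$limits["UTL"]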
eqnormCensored
returns a list of class "estimateCensored"
containing the estimated quantile(s) and other information.
See estimateCensored.object
for details.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
A sample of data contains censored observations if some of the observations are reported only as being below or above some censoring level. In environmental data analysis, Type I left-censored data sets are common, with values being reported as “less than the detection limit” (e.g., Helsel, 2012). Data sets with only one censoring level are called singly censored; data sets with multiple censoring levels are called multiply or progressively censored.
Statistical methods for dealing with censored data sets have a long history in the field of survival analysis and life testing. More recently, researchers in the environmental field have proposed alternative methods of computing estimates and confidence intervals in addition to the classical ones such as maximum likelihood estimation.
Helsel (2012, Chapter 6) gives an excellent review of past studies of the properties of various estimators based on censored environmental data.
In practice, it is better to use a confidence interval for a percentile, rather than rely on a single point-estimate of percentile. Confidence intervals for percentiles of a normal distribution depend on the properties of the estimators for both the mean and standard deviation.
Few studies have been done to evaluate the performance of methods for constructing confidence intervals for the mean or joint confidence regions for the mean and standard deviation when data are subjected to single or multiple censoring (see, for example, Singh et al., 2006). Studies to evaluate the performance of a confidence interval for a percentile include: Caudill et al. (2007), Hewett and Ganner (2007), Kroll and Stedinger (1996), and Serasinghe (2010).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Caudill, S.P., L.-Y. Wong, W.E. Turner, R. Lee, A. Henderson, D. G. Patterson Jr. (2007). Percentile Estimation Using Variable Censored Data. Chemosphere 68, 169–180.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Hewett, P., and G.H. Ganser. (2007). A Comparison of Several Methods for Analyzing Censored Data. Annals of Occupational Hygiene 51(7), 611–632.
Johnson, N.L., and B.L. Welch. (1940). Applications of the Non-Central t-Distribution. Biometrika 31, 362-389.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kroll, C.N., and J.R. Stedinger. (1996). Estimation of Moments and Quantiles Using Censored Data. Water Resources Research 32(4), 1005–1012.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Serasinghe, S.K. (2010). A Simulation Comparison of Parametric and Nonparametric Estimators of Quantiles from Right Censored Data. A Report submitted in partial fulfillment of the requirements for the degree Master of Science, Department of Statistics, College of Arts and Sciences, Kansas State University, Manhattan, Kansas.
Singh, A., R. Maichle, and S. Lee. (2006). On the Computation of a 95% Upper Confidence Limit of the Unknown Population Mean Based Upon Data Sets with Below Detection Limit Observations. EPA/600/R-06/022, March 2006. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Stedinger, J. (1983). Confidence Intervals for Design Events. Journal of Hydraulic Engineering 109(1), 13-27.
Stedinger, J.R., R.M. Vogel, and E. Foufoula-Georgiou. (1993). Frequency Analysis of Extreme Events. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 18, pp.29-30.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
enormCensored
, tolIntNormCensored
, Normal
,
estimateCensored.object
.
# Generate 15 observations from a normal distribution with # parameters mean=10 and sd=2, and censor observations less than 8. # Then generate 15 more observations from this distribution and censor # observations less than 7. # Then estimate the 90th percentile and create a one-sided upper 95% # confidence interval for that percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(47) x.1 <- rnorm(15, mean = 10, sd = 2) sort(x.1) # [1] 6.343542 7.068499 7.828525 8.029036 8.155088 9.436470 # [7] 9.495908 10.030262 10.079205 10.182946 10.217551 10.370811 #[13] 10.987640 11.422285 13.989393 censored.1 <- x.1 < 8 x.1[censored.1] <- 8 x.2 <- rnorm(15, mean = 10, sd = 2) sort(x.2) # [1] 5.355255 6.065562 6.783680 6.867676 8.219412 8.593224 # [7] 9.319168 9.347066 9.837844 9.918844 10.055054 10.498296 #[13] 10.834382 11.341558 12.528482 censored.2 <- x.2 < 7 x.2[censored.2] <- 7 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) eqnormCensored(x, censored, p = 0.9, ci = TRUE, ci.type = "upper") #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 7 8 # #Estimated Parameter(s): mean = 9.390624 # sd = 1.827156 # #Estimation Method: MLE # #Estimated Quantile(s): 90'th %ile = 11.73222 # #Quantile Estimation Method: Quantile(s) Based on # MLE Estimators # #Data: x # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 16.66667% # #Confidence Interval for: 90'th %ile # #Assumed Sample Size: 30 # #Confidence Interval Method: Exact for # Complete Data # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 12.63808 #---------- # Compare these results with the true 90'th percentile: qnorm(p = 0.9, mean = 10, sd = 2) #[1] 12.56310 #---------- # Clean up rm(x.1, censored.1, x.2, censored.2, x, censored) #========== # Chapter 15 of USEPA (2009) gives several examples of estimating the mean # and standard deviation of a lognormal distribution on the log-scale using # manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will estimate the mean and standard deviation using the MLE, # and then construct an upper 95% confidence limit for the 90th percentile. # We will log-transform the original observations and then call # eqnormCensored. Alternatively, we could have more simply called # eqlnormCensored. # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... 
#23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now estimate the mean, standard deviation, and 90th percentile # on the log-scale using the MLE, and construct an upper 95% # confidence limit for the 90th percentile on the log-scale: #--------------------------------------------------------------- est.list <- with(EPA.09.Ex.15.1.manganese.df, eqnormCensored(log(Manganese.ppb), Censored, p = 0.9, ci = TRUE, ci.type = "upper")) est.list #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 0.6931472 1.6094379 # #Estimated Parameter(s): mean = 2.215905 # sd = 1.356291 # #Estimation Method: MLE # #Estimated Quantile(s): 90'th %ile = 3.954062 # #Quantile Estimation Method: Quantile(s) Based on # MLE Estimators # #Data: log(Manganese.ppb) # #Censoring Variable: censored # #Sample Size: 25 # #Percent Censored: 24% # #Confidence Interval for: 90'th %ile # #Assumed Sample Size: 25 # #Confidence Interval Method: Exact for # Complete Data # #Confidence Interval Type: upper # #Confidence Level: 95% # #Confidence Interval: LCL = -Inf # UCL = 4.708904 # To estimate the 90th percentile on the original scale, # we need to exponentiate the results #------------------------------------------------------- exp(est.list$quantiles) #90'th %ile # 52.14674 exp(est.list$interval$limits) # LCL UCL # 0.0000 110.9305 #---------- # Clean up #--------- rm(est.list)
Estimate quantiles of a distribution, and optionally create confidence intervals for them, without making any assumptions about the form of the distribution.
eqnpar(x, p = 0.5, type = 7, ci = FALSE, lcl.rank = NULL, ucl.rank = NULL, lb = -Inf, ub = Inf, ci.type = "two-sided", ci.method = "interpolate", digits = getOption("digits"), approx.conf.level = 0.95, min.coverage = TRUE, tol = 0)
x |
a numeric vector of observations. Missing ( |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
type |
an integer between 1 and 9 indicating which algorithm to use to estimate the
quantile. The default value is |
ci |
logical scalar indicating whether to compute a confidence interval for the quantile.
The default value is |
lcl.rank , ucl.rank
|
positive integers indicating the ranks of the order statistics that are used
for the lower and upper bounds of the confidence interval for the specified
quantile. Both arguments must be integers between 1 and the number of non-missing
values in |
lb , ub
|
scalars indicating lower and upper bounds on the distribution. By default, |
ci.type |
character string indicating what kind of confidence interval to compute.
The possible values are |
ci.method |
character string indicating the method to use to construct the confidence interval.
The possible values are |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
approx.conf.level |
a scalar between 0 and 1 indicating the desired confidence level of the confidence
interval. The default value is |
min.coverage |
for the case when |
tol |
for the case when |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Estimation
The function eqnpar
calls the R function quantile
to estimate quantiles.
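For example, the point estimate is simply the value returned by quantile with the default type = 7:

set.seed(250)
dat <- rcauchy(20, location = 0, scale = 1)

quantile(dat, probs = 0.75, type = 7)

# This should match the 75'th %ile of 1.524903 reported by
# eqnpar(dat, p = 0.75) in the EXAMPLES below.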
Confidence Intervals
Let x_1, x_2, …, x_n denote a sample of n independent and identically distributed random variables from some arbitrary distribution. Furthermore, let x_(i) denote the i'th order statistic for these n random variables. That is,

x_(1) ≤ x_(2) ≤ … ≤ x_(n)

Finally, let x_p denote the p'th quantile of the distribution, that is:

Pr(X < x_p) ≤ p,    Pr(X ≤ x_p) ≥ p

It can be shown (e.g., Conover, 1980, pp. 114-116) that for the i'th order statistic:

Pr[ X_(i) > x_p ] = F(i - 1; n, p)

for a continuous distribution and

Pr[ X_(i) > x_p ] ≤ F(i - 1; n, p) ≤ Pr[ X_(i) ≥ x_p ]

for a discrete distribution, where F(·; n, p) denotes the cumulative distribution function of a binomial random variable with parameters size=n and prob=p. These facts are used to construct confidence intervals for quantiles (see below).
Two-Sided Confidence Interval (ci.type="two-sided"
)
A two-sided nonparametric confidence interval for the p'th quantile is constructed as:

[ x_(u), x_(v) ]

where 1 ≤ u < v ≤ n. Note that the argument lcl.rank corresponds to u, and the argument ucl.rank corresponds to v.
This confidence interval has an associated confidence level that is at least as large as:

F(v - 1; n, p) - F(u - 1; n, p)

for a discrete distribution and exactly equal to this value for a continuous distribution. This is because, by the results for the order statistics given above:

Pr[ x_(u) ≤ x_p ≤ x_(v) ] ≥ F(v - 1; n, p) - F(u - 1; n, p)

with equality if the distribution is continuous.
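A worked check of this coverage formula: for a continuous distribution, the confidence level associated with the interval [ x_(u), x_(v) ] can be computed directly with pbinom. Using the sample size, probability, and ranks reported for ci.method="exact" in the EXAMPLES below:

n <- 20    # sample size
p <- 0.75  # quantile being estimated
u <- 12    # rank used for the lower confidence limit
v <- 19    # rank used for the upper confidence limit

pbinom(v - 1, size = n, prob = p) - pbinom(u - 1, size = n, prob = p)
#[1] 0.9347622

# This matches the 93.47622% confidence level reported by
# eqnpar(dat, p = 0.75, ci = TRUE, approx.conf.level = 0.9, ci.method = "exact").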
Exact Method (ci.method="exact"
)
When lcl.rank (u) and ucl.rank (v) are not supplied by the user, and ci.method="exact", u and v are initially chosen such that u is the smallest integer, and v the largest integer, whose associated coverage (computed from the binomial distribution function as described above) is consistent with the requested confidence level, where (1 - α) = approx.conf.level. The values of u and v are then each varied (with the restrictions u ≥ 1, u < v, and v ≤ n), and confidence levels computed for each of these combinations. If min.coverage=TRUE, the combination of u and v is selected that provides the closest coverage to approx.conf.level, with coverage greater than or equal to approx.conf.level. If min.coverage=FALSE, the combination of u and v is selected that provides the closest coverage to approx.conf.level, with coverage less than or equal to approx.conf.level + tol.
For this method, the confidence level associated with the confidence interval
is exact if the underlying distribution is continuous.
Approximate Method (ci.method="normal.approx")
Here the term “Approximate” simply refers to the method of initially choosing the ranks for the lower and upper bounds. As for ci.method="exact", the confidence level associated with the confidence interval is exact if the underlying distribution is continuous.
When lcl.rank (u) and ucl.rank (v) are not supplied by the user and ci.method="normal.approx", u and v are initially chosen using a normal approximation to the binomial distribution, based on the appropriate quantile of Student's t-distribution with n - 1 degrees of freedom and (1 - α) = approx.conf.level (Conover, 1980, p. 112). With the restrictions that u ≥ 1 and v ≤ n, u is rounded down to the nearest integer and v is rounded up to the nearest integer. Again, with the restriction that u ≥ 1, if the confidence level obtained after decreasing u by 1 is less than or equal to approx.conf.level, then u is decreased by 1. Once this has been checked, with the restriction that v ≤ n, if the confidence level obtained after increasing v by 1 is less than or equal to approx.conf.level, then v is increased by 1.
Interpolate Method (ci.method="interpolate"
)
Let (1 - α) denote the desired confidence level associated with the confidence interval for the p'th quantile. Based on the work of Hettmansperger and Sheather (1986), Nyblom (1992) showed that when the desired confidence level falls between the confidence levels attained by two adjacent order statistics, a one-sided confidence bound for the p'th quantile can be constructed by interpolating between those two order statistics, and the resulting bound has an associated confidence level approximately equal to the desired level for a wide range of distributions. Thus, to construct an approximate two-sided confidence interval for the p'th quantile with confidence level (1 - α), the lower bound of the two-sided confidence interval is computed by interpolating between the two adjacent order statistics whose one-sided lower confidence levels bracket the required level, and the upper bound of the two-sided confidence interval is computed by interpolating between the two adjacent order statistics whose one-sided upper confidence levels bracket the required level. The ranks of the order statistics used in the interpolation are computed by using ci.method="exact" with the argument min.coverage=TRUE.
One-Sided Lower Confidence Interval (ci.type="lower"
)
A one-sided lower nonparametric confidence interval for the p'th quantile is constructed as:

[ x_(u), ub ]

where ub denotes the value of the ub argument (the user-supplied upper bound).
Exact Method (ci.method="exact"
)
When lcl.rank (u) is not supplied by the user, and ci.method="exact", u is initially chosen as the smallest integer whose associated coverage (computed from the binomial distribution function as described above) is consistent with the requested confidence level, where (1 - α) = approx.conf.level. The value of u is then varied (with the restrictions u ≥ 1 and u ≤ n), and confidence levels computed for each of these values. If min.coverage=TRUE, the value of u is selected that provides the closest coverage to approx.conf.level, with coverage greater than or equal to approx.conf.level. If min.coverage=FALSE, the value of u is selected that provides the closest coverage to approx.conf.level, with coverage less than or equal to approx.conf.level + tol.
For this method, the confidence level associated with the confidence interval
is exact if the underlying distribution is continuous.
Approximate Method (ci.method="normal.approx")
When lcl.rank (u) is not supplied by the user and ci.method="normal.approx", u is initially chosen using a normal approximation to the binomial distribution. With the restrictions that u ≥ 1 and u ≤ n, if p is less than 0.5 then u is rounded up to the nearest integer, otherwise it is rounded down to the nearest integer. With the restriction that u ≥ 1, if the confidence level obtained after decreasing u by 1 is less than or equal to approx.conf.level, then u is decreased by 1.
Interpolate Method (ci.method="interpolate"
)
Let (1 - α) denote the desired confidence level associated with the confidence interval for the p'th quantile. To construct an approximate one-sided lower confidence interval for the p'th quantile with confidence level (1 - α), the lower bound of the confidence interval is computed by interpolating between the two adjacent order statistics whose one-sided lower confidence levels bracket (1 - α). The rank of the order statistic used in the interpolation is computed by using ci.method="exact" with the arguments ci.type="lower" and min.coverage=TRUE.
One-Sided Upper Confidence Interval (ci.type="upper"
)
A one-sided upper nonparametric confidence interval for the p'th quantile is constructed as:

[ lb, x_(v) ]

where lb denotes the value of the lb argument (the user-supplied lower bound).
Exact Method (ci.method="exact"
)
When ucl.rank (v) is not supplied by the user, and ci.method="exact", v is initially chosen as the largest integer whose associated coverage (computed from the binomial distribution function as described above) is consistent with the requested confidence level, where (1 - α) = approx.conf.level. The value of v is then varied (with the restrictions v ≥ 1 and v ≤ n), and confidence levels computed for each of these values. If min.coverage=TRUE, the value of v is selected that provides the closest coverage to approx.conf.level, with coverage greater than or equal to approx.conf.level. If min.coverage=FALSE, the value of v is selected that provides the closest coverage to approx.conf.level, with coverage less than or equal to approx.conf.level + tol.
For this method, the confidence level associated with the confidence interval
is exact if the underlying distribution is continuous.
Approximate Method (ci.method="normal.approx")
When ucl.rank (v) is not supplied by the user and ci.method="normal.approx", v is initially chosen using a normal approximation to the binomial distribution. With the restrictions that v ≥ 1 and v ≤ n, if p is greater than 0.5 then v is rounded down to the nearest integer, otherwise it is rounded up to the nearest integer. With the restriction that v ≤ n, if the confidence level obtained after increasing v by 1 is less than or equal to approx.conf.level, then v is increased by 1.
For this method, the confidence level associated with the confidence interval
is exact if the underlying distribution is continuous.
Interpolate Method (ci.method="interpolate"
)
Let (1 - α) denote the desired confidence level associated with the confidence interval for the p'th quantile. To construct an approximate one-sided upper confidence interval for the p'th quantile with confidence level (1 - α), the upper bound of the confidence interval is computed by interpolating between the two adjacent order statistics whose one-sided upper confidence levels bracket (1 - α). The rank of the order statistic used in the interpolation is computed by using ci.method="exact" with the arguments ci.type="upper" and min.coverage=TRUE.
Note on Value of Confidence Level
Because of the discrete nature of order statistics, when ci.method="exact"
or ci.method="normal.approx"
, the value of the confidence level returned by
eqnpar
will usually differ from the desired confidence level indicated by
the value of the argument approx.conf.level
. When ci.method="interpolate"
,
eqnpar
returns for the confidence level the value of the argument
approx.conf.level
. Nyblom (1992) and Hettmansperger and Sheather (1986) have
shown that the Interpolate method produces confidence intervals with confidence levels
quite close to the assumed confidence level for a wide range of distributions.
a list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) note that England has water quality limits based on the 90th and 95th percentiles of monitoring data not exceeding specified levels. They also note that the U.S. EPA has specifications for air quality monitoring, aquatic standards on toxic chemicals, and maximum daily limits for industrial effluents that are all based on percentiles. Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
It can be shown (e.g., Conover, 1980, pp.119-121) that an upper confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to an upper β-content tolerance interval with coverage 100p% and confidence level 100(1-α)%. Also, a lower confidence interval for the p'th quantile with confidence level 100(1-α)% is equivalent to a lower β-content tolerance interval with coverage 100(1-p)% and confidence level 100(1-α)%. See the help file for tolIntNpar for more information on nonparametric tolerance intervals.
Steven P. Millard ([email protected])
The author is grateful to Michael Höhle,
Department of Mathematics, Stockholm University
(http://www2.math.su.se/~hoehle) for making me aware of the work of Nyblom (1992),
and for suggesting improvements to the algorithm that was used in EnvStats
Version 2.1.1 to construct a confidence interval when ci.method="exact"
.
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, pp.132-136.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Hettmansperger, T.P., and Sheather, S.J. (1986). Confidence Intervals Based on Interpolated Order Statistics. Statistics & Probability Letters, 4, 75–79.
Nyblom, J. (1992). Note on Interpolated Order Statistics. Statistics & Probability Letters, 14, 129–131.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
quantile
, tolIntNpar
,
Estimating Distribution Quantiles,
Tolerance Intervals, estimate.object
.
# The data frame ACE.13.TCE.df contains observations on # Trichloroethylene (TCE) concentrations (mg/L) at # 10 groundwater monitoring wells before and after remediation. # # Compute the median concentration for each period along with # a 95% confidence interval for the median. # # Before remediation: 20.3 [8.8, 35.9] # After remediation: 2.5 [0.8, 5.9] with(ACE.13.TCE.df, eqnpar(TCE.mg.per.L[Period=="Before"], ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- #Assumed Distribution: None #Estimated Quantile(s): Median = 20.3 #Quantile Estimation Method: Nonparametric #Data: TCE.mg.per.L[Period == "Before"] #Sample Size: 10 #Confidence Interval for: 50'th %ile #Confidence Interval Method: interpolate (Nyblom, 1992) #Confidence Interval Type: two-sided #Confidence Level: 95% #Confidence Limit Rank(s): 2 9 # 3 8 #Confidence Interval: LCL = 8.804775 # UCL = 35.874775 #---------- with(ACE.13.TCE.df, eqnpar(TCE.mg.per.L[Period=="After"], ci = TRUE)) #Results of Distribution Parameter Estimation #-------------------------------------------- #Assumed Distribution: None #Estimated Quantile(s): Median = 2.48 #Quantile Estimation Method: Nonparametric #Data: TCE.mg.per.L[Period == "After"] #Sample Size: 10 #Confidence Interval for: 50'th %ile #Confidence Interval Method: interpolate (Nyblom, 1992) #Confidence Interval Type: two-sided #Confidence Level: 95% #Confidence Limit Rank(s): 2 9 # 3 8 #Confidence Interval: LCL = 0.7810901 # UCL = 5.8763063 #========== # Generate 20 observations from a cauchy distribution with parameters # location=0, scale=1. The true 75th percentile of this distribution is 1. # Use eqnpar to estimate the 75th percentile and construct a 90% confidence interval. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rcauchy(20, location = 0, scale = 1) #------------------------------------------------------- # First, use the default method, ci.method="interpolate" #------------------------------------------------------- eqnpar(dat, p = 0.75, ci = TRUE, approx.conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 75'th %ile = 1.524903 # #Quantile Estimation Method: Nonparametric # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 75'th %ile # #Confidence Interval Method: interpolate (Nyblom, 1992) # #Confidence Interval Type: two-sided # #Confidence Level: 90% # #Confidence Limit Rank(s): 12 19 # 13 18 # #Confidence Interval: LCL = 0.8191423 # UCL = 2.1215570 #---------- #------------------------------------------------------------- # Now use ci.method="exact". # Note that the returned confidence level is greater than 90%. #------------------------------------------------------------- eqnpar(dat, p = 0.75, ci = TRUE, approx.conf.level = 0.9, ci.method = "exact") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 75'th %ile = 1.524903 # #Quantile Estimation Method: Nonparametric # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 75'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: two-sided # #Confidence Level: 93.47622% # #Confidence Limit Rank(s): 12 19 # #Confidence Interval: LCL = 0.7494692 # UCL = 2.2156601 #---------- #---------------------------------------------------------- # Now use ci.method="exact" with min.coverage=FALSE. 
# Note that the returned confidence level is less than 90%. #---------------------------------------------------------- eqnpar(dat, p = 0.75, ci = TRUE, approx.conf.level = 0.9, ci.method = "exact", min.coverage = FALSE, ) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 75'th %ile = 1.524903 # #Quantile Estimation Method: Nonparametric # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 75'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: two-sided # #Confidence Level: 89.50169% # #Confidence Limit Rank(s): 13 20 # #Confidence Interval: LCL = 1.018038 # UCL = 5.002399 #---------- #----------------------------------------------------------- # Now supply our own bounds for the confidence interval. # The first example above based on the Interpolate method # used lcl.rank=12, ucl.rank=19 and lcl.rank=13, ucl.rank=18 # and interpolated between these two confidence intervals. # Here we will specify lcl.rank=13 and ucl.rank=18. The # resulting confidence level is 81%. #----------------------------------------------------------- eqnpar(dat, p = 0.75, ci = TRUE, lcl.rank = 13, ucl.rank = 18) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 75'th %ile = 1.524903 # #Quantile Estimation Method: Nonparametric # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 75'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: two-sided # #Confidence Level: 80.69277% # #Confidence Limit Rank(s): 13 18 # #Confidence Interval: LCL = 1.018038 # UCL = 2.071172 #---------- # Clean up rm(dat) #========== # Modify Example 17-4 on page 17-21 of USEPA (2009). This example uses # copper concentrations (ppb) from 3 background wells to set an upper # limit for 2 compliance wells. Here we will attempt to compute an upper # 95% confidence interval for the 95'th percentile of the distribution of # copper concentrations in the background wells. # # The data are stored in EPA.92c.copper2.df. # # Note that even though these data are Type I left singly censored, # it is still possible to compute an estimate of the 95'th percentile. EPA.92c.copper2.df # Copper.orig Copper Censored Month Well Well.type #1 <5 5.0 TRUE 1 1 Background #2 <5 5.0 TRUE 2 1 Background #3 7.5 7.5 FALSE 3 1 Background #... #9 9.2 9.2 FALSE 1 2 Background #10 <5 5.0 TRUE 2 2 Background #11 <5 5.0 TRUE 3 2 Background #... #17 <5 5.0 TRUE 1 3 Background #18 5.4 5.4 FALSE 2 3 Background #19 6.7 6.7 FALSE 3 3 Background #... #29 6.2 6.2 FALSE 5 4 Compliance #30 <5 5.0 TRUE 6 4 Compliance #31 7.8 7.8 FALSE 7 4 Compliance #... #38 <5 5.0 TRUE 6 5 Compliance #39 5.6 5.6 FALSE 7 5 Compliance #40 <5 5.0 TRUE 8 5 Compliance # Because of the small sample size of n=24 observations, it is not possible # to create a nonparametric confidence interval for the 95th percentile # that has an associated confidence level of 95%. If we tried to do this, # we would get an error message: # with(EPA.92c.copper2.df, # eqnpar(Copper[Well.type=="Background"], p = 0.95, ci = TRUE, lb = 0, # ci.type = "upper", approx.conf.level = 0.95)) # #Error in ci.qnpar.interpolate(x = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, : # Minimum coverage of 0.95 is not possible with the given sample size. # So instead, we will use ci.method="exact" with min.coverage=FALSE # to construct the confidence interval. 
Note that the associated # confidence level is only 71%. with(EPA.92c.copper2.df, eqnpar(Copper[Well.type=="Background"], p = 0.95, ci = TRUE, ci.method = "exact", min.coverage = FALSE, ci.type = "upper", lb = 0, approx.conf.level = 0.95)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 95'th %ile = 7.925 # #Quantile Estimation Method: Nonparametric # #Data: Copper[Well.type == "Background"] # #Sample Size: 24 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: upper # #Confidence Level: 70.8011% # #Confidence Limit Rank(s): NA 24 # #Confidence Interval: LCL = 0.0 # UCL = 9.2 #---------- # For the above example, the true confidence level is 71% instead of 95%. # This is a function of the small sample size. In fact, as Example 17-4 on # pages 17-21 of USEPA (2009) shows, the largest quantile for which you can # construct a nonparametric confidence interval that will have associated # confidence level of 95% is the 88'th percentile: with(EPA.92c.copper2.df, eqnpar(Copper[Well.type=="Background"], p = 0.88, ci = TRUE, ci.type = "upper", lb = 0, ucl.rank = 24, approx.conf.level = 0.95)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 88'th %ile = 6.892 # #Quantile Estimation Method: Nonparametric # #Data: Copper[Well.type == "Background"] # #Sample Size: 24 # #Confidence Interval for: 88'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: upper # #Confidence Level: 95.3486% # #Confidence Limit Rank(s): NA 24 # #Confidence Interval: LCL = 0.0 # UCL = 9.2 #========== # Reproduce Example 21-6 on pages 21-21 to 21-22 of USEPA (2009). # Use 12 measurements of nitrate (mg/L) at a well used for drinking water # to determine with 95% confidence whether or not the infant-based, acute # risk standard of 10 mg/L has been violated. Assume that the risk # standard represents an upper 95'th percentile limit on nitrate # concentrations. So what we need to do is construct a one-sided # lower nonparametric confidence interval for the 95'th percentile # that has associated confidence level of no more than 95%, and we will # compare the lower confidence limit with the MCL of 10 mg/L. # # The data for this example are stored in EPA.09.Ex.21.6.nitrate.df. # Look at the data: #------------------ EPA.09.Ex.21.6.nitrate.df # Sampling.Date Date Nitrate.mg.per.l.orig Nitrate.mg.per.l Censored #1 7/28/1999 1999-07-28 <5.0 5.0 TRUE #2 9/3/1999 1999-09-03 12.3 12.3 FALSE #3 11/24/1999 1999-11-24 <5.0 5.0 TRUE #4 5/3/2000 2000-05-03 <5.0 5.0 TRUE #5 7/14/2000 2000-07-14 8.1 8.1 FALSE #6 10/31/2000 2000-10-31 <5.0 5.0 TRUE #7 12/14/2000 2000-12-14 11 11.0 FALSE #8 3/27/2001 2001-03-27 35.1 35.1 FALSE #9 6/13/2001 2001-06-13 <5.0 5.0 TRUE #10 9/16/2001 2001-09-16 <5.0 5.0 TRUE #11 11/26/2001 2001-11-26 9.3 9.3 FALSE #12 3/2/2002 2002-03-02 10.3 10.3 FALSE # Determine what order statistic to use for the lower confidence limit # in order to achieve no more than 95% confidence. 
#--------------------------------------------------------------------- conf.levels <- ciNparConfLevel(n = 12, p = 0.95, lcl.rank = 1:12, ci.type = "lower") names(conf.levels) <- 1:12 round(conf.levels, 2) # 1 2 3 4 5 6 7 8 9 10 11 12 #1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.98 0.88 0.54 # Using the 11'th largest observation for the lower confidence limit # yields a confidence level of 88%. Using the 10'th largest # observation yields a confidence level of 98%. The example in # USEPA (2009) uses the 10'th largest observation. # # The 10'th largest observation is 11 mg/L which exceeds the # MCL of 10 mg/L, so there is evidence of contamination. #-------------------------------------------------------------------- with(EPA.09.Ex.21.6.nitrate.df, eqnpar(Nitrate.mg.per.l, p = 0.95, ci = TRUE, ci.type = "lower", lcl.rank = 10)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Estimated Quantile(s): 95'th %ile = 22.56 # #Quantile Estimation Method: Nonparametric # #Data: Nitrate.mg.per.l # #Sample Size: 12 # #Confidence Interval for: 95'th %ile # #Confidence Interval Method: exact # #Confidence Interval Type: lower # #Confidence Level: 98.04317% # #Confidence Limit Rank(s): 10 NA # #Confidence Interval: LCL = 11 # UCL = Inf #========== # Clean up #--------- rm(conf.levels)
Estimate quantiles of a Pareto distribution.
eqpareto(x, p = 0.5, method = "mle", plot.pos.con = 0.375, digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Pareto distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
Possible values are
|
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the values of the empirical cdf. The default value is
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqpareto
returns estimated quantiles as well as
estimates of the location and shape parameters.
Quantiles are estimated by 1) estimating the location and shape parameters by
calling epareto, and then 2) calling the function qpareto and using the
estimated values for location and shape.
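As a brief sketch of this two-step procedure (assuming, as for other EnvStats estimating functions, that the fitted parameters are stored in the parameters component of the object returned by epareto):

# Illustrative sketch of the two-step procedure described above.
set.seed(250)
dat <- rpareto(30, location = 1, shape = 1)
est <- epareto(dat)                       # step 1: estimate location and shape
parms <- est$parameters
qpareto(0.9, location = parms["location"],
  shape = parms["shape"])                 # step 2: plug-in quantile
# This matches the 90'th percentile reported by eqpareto(dat, p = 0.9)
# in the Examples section below.
rm(dat, est, parms)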
If x
is a numeric vector, eqpareto
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqpareto
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The Pareto distribution is named after Vilfredo Pareto (1848-1923), a professor of economics. It is derived from Pareto's law, which states that the number of persons having income greater than or equal to x is given by:
k x^(-theta)
where theta denotes Pareto's constant and is the shape parameter for the probability distribution.
The Pareto distribution takes values on the positive real line. All values must be larger than the “location” parameter eta, which is really a threshold parameter. There are three kinds of Pareto distributions. The one described here is the Pareto distribution of the first kind. Stable Pareto distributions have 0 < theta < 2. Note that the r'th moment only exists if r < theta.
The Pareto distribution is related to the exponential distribution and logistic distribution as follows. Let X denote a Pareto random variable with location=eta and shape=theta. Then log(X/eta) has an exponential distribution with parameter rate=theta, and -log[(X/eta)^theta - 1] has a logistic distribution with parameters location=0 and scale=1.
The Pareto distribution has a very long right-hand tail. It is often applied in the study of socioeconomic data, including the distribution of income, firm size, population, and stock price fluctuations.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
epareto
, Pareto, estimate.object
.
# Generate 30 observations from a Pareto distribution with # parameters location=1 and shape=1 then estimate the parameters # and the 90'th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rpareto(30, location = 1, shape = 1) eqpareto(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Pareto # #Estimated Parameter(s): location = 1.009046 # shape = 1.079850 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 8.510708 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 30 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a Poisson distribution, and optionally construct a confidence interval for a quantile.
eqpois(x, p = 0.5, method = "mle/mme/mvue", ci = FALSE, ci.method = "exact", ci.type = "two-sided", conf.level = 0.95, digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Poisson distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method to use to estimate the mean. Currently the
only possible value is "mle/mme/mvue" (the maximum likelihood, method of moments, and minimum variance unbiased estimators of the mean all coincide). |
ci |
logical scalar indicating whether to compute a confidence interval for the
specified quantile. The default value is ci=FALSE. |
ci.method |
character string indicating what method to use to construct the confidence
interval for the quantile. The only possible value is "exact" (the default). |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are "two-sided" (the default), "lower", and "upper". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is conf.level=0.95. |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqpois
returns estimated quantiles as well as
the estimate of the mean parameter.
Estimation
Let X denote a Poisson random variable with parameter lambda. Let x(p) denote the p'th quantile of the distribution. That is, x(p) satisfies
Pr[X < x(p)] <= p <= Pr[X <= x(p)]
Note that due to the discrete nature of the Poisson distribution, there will be several values of p associated with one value of x(p). For example, for lambda=2, the value 1 is the p'th quantile for any value of p between 0.14 and 0.406.
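A quick base R check of this example:

# For lambda = 2, the value 1 is the p'th quantile for any
# p in (ppois(0, 2), ppois(1, 2)] = (0.135, 0.406].
ppois(0:1, lambda = 2)
#[1] 0.1353353 0.4060058
qpois(c(0.2, 0.4, 0.41), lambda = 2)
#[1] 1 1 2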
Let x = (x1, x2, ..., xn) denote a vector of n observations from a Poisson distribution with parameter lambda. The p'th quantile is estimated as the p'th quantile from a Poisson distribution assuming the true value of lambda is equal to the estimated value of lambda. That is, the estimated quantile is
q(p, lambda.hat)
where q(p, lambda) denotes the p'th quantile of a Poisson distribution with parameter lambda, and lambda.hat denotes the sample mean.
Because the sample mean is the maximum likelihood estimator of lambda (see the help file for epois), the estimated quantile is also the maximum likelihood estimator.
Quantiles are estimated by 1) estimating the mean parameter by
calling epois
, and then 2) calling the function
qpois
and using the estimated value for
the mean parameter.
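A brief sketch of this in base R (the values shown correspond to the Poisson example in the Examples section below):

# The estimated p'th quantile is the p'th quantile of a Poisson distribution
# with lambda set to the sample mean (the maximum likelihood estimate).
set.seed(250)
dat <- rpois(20, lambda = 2)
lambda.hat <- mean(dat)            # 1.8 for this seed
qpois(0.9, lambda = lambda.hat)
#[1] 4  -- the same value reported by eqpois(dat, p = 0.9) below
rm(dat, lambda.hat)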
Confidence Intervals
It can be shown (e.g., Conover, 1980, pp.119-121) that an upper confidence interval for the p'th quantile with confidence level 100(1-alpha)% is equivalent to an upper beta-content tolerance interval with coverage 100p% and confidence level 100(1-alpha)%. Also, a lower confidence interval for the p'th quantile with confidence level 100(1-alpha)% is equivalent to a lower beta-content tolerance interval with coverage 100(1-p)% and confidence level 100(1-alpha)%.
Thus, based on the theory of tolerance intervals for a Poisson distribution (see tolIntPois), if ci.type="upper", a one-sided upper 100(1-alpha)% confidence interval for the p'th quantile is constructed as:
[0, q(p, UCL)]
where q(p, lambda) denotes the p'th quantile of a Poisson distribution with parameter lambda, and UCL denotes the one-sided upper 100(1-alpha)% confidence limit for lambda (see the help file for epois for information on how UCL is computed).
Similarly, if ci.type="lower", a one-sided lower 100(1-alpha)% confidence interval for the p'th quantile is constructed as:
[q(p, LCL), Inf]
where LCL denotes the one-sided lower 100(1-alpha)% confidence limit for lambda (see the help file for epois for information on how LCL is computed).
Finally, if ci.type="two-sided", a two-sided 100(1-alpha)% confidence interval for the p'th quantile is constructed as:
[q(p, LCL), q(p, UCL)]
where LCL and UCL denote the two-sided lower and upper 100(1-alpha)% confidence limits for lambda (see the help file for epois for information on how LCL and UCL are computed).
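A rough sketch of the two-sided construction (this assumes that epois accepts ci and conf.level arguments analogous to those of eqpois and stores the limits in the interval component of the returned object; see the help file for epois):

# Two-sided limits for the 90'th percentile: the 90'th percentiles of Poisson
# distributions with lambda set to the confidence limits for lambda.
set.seed(250)
dat <- rpois(20, lambda = 2)
lambda.ci <- epois(dat, ci = TRUE, conf.level = 0.95)$interval$limits
qpois(0.9, lambda = lambda.ci)   # compare with LCL = 3, UCL = 5 from eqpois()
rm(dat, lambda.ci)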
If x
is a numeric vector, eqpois
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqpois
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
Percentiles are sometimes used in environmental standards and regulations. For example, Berthouex and Brown (2002, p.71) state:
The U.S. EPA has specifications for air quality monitoring that are, in effect, percentile limitations. ... The U.S. EPA has provided guidance for setting aquatic standards on toxic chemicals that require estimating 99th percentiles and using this statistic to make important decisions about monitoring and compliance. They have also used the 99th percentile to establish maximum daily limits for industrial effluents (e.g., pulp and paper).
Given the importance of these quantities, it is essential to characterize the amount of uncertainty associated with the estimates of these quantities. This is done with confidence intervals.
The Poisson distribution is named after Siméon Denis Poisson, who derived this distribution as the limiting distribution of the binomial distribution with parameters size=n and prob=p, where n tends to infinity, p tends to 0, and n*p stays constant.
In this context, the Poisson distribution was used by Bortkiewicz (1898) to model the number of deaths (per annum) from kicks by horses in Prussian Army Corps. In this case, p, the probability of death from this cause, was small, but the number of soldiers exposed to this risk, n, was large.
The Poisson distribution has been applied in a variety of fields, including quality control (modeling number of defects produced in a process), ecology (number of organisms per unit area), and queueing theory. Gibbons (1987b) used the Poisson distribution to model the number of detected compounds per scan of the 32 volatile organic priority pollutants (VOC), and also to model the distribution of chemical concentration (in ppb).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Berthouex, P.M., and I. Hau. (1991). Difficulties Related to Using Extreme Percentiles for Water Quality Regulations. Research Journal of the Water Pollution Control Federation 63(6), 873–879.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, Chapter 3.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Gibbons, R.D. (1987b). Statistical Models for the Analysis of Volatile Organic Compounds in Waste Disposal Sites. Ground Water 25, 572-580.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 4.
Pearson, E.S., and H.O. Hartley, eds. (1970). Biometrika Tables for Statisticians, Volume 1. Cambridge Universtiy Press, New York, p.81.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
epois
, Poisson, estimate.object
.
# Generate 20 observations from a Poisson distribution with parameter # lambda=2. The true 90'th percentile of this distribution is 4 (actually, # 4 is the p'th quantile for any value of p between 0.86 and 0.947). # Here we will use eqpois to estimate the 90'th percentile and construct a # two-sided 95% confidence interval for this percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rpois(20, lambda = 2) eqpois(dat, p = 0.9, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 1.8 # #Estimation Method: mle/mme/mvue # #Estimated Quantile(s): 90'th %ile = 4 # #Quantile Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Confidence Interval for: 90'th %ile # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 3 # UCL = 5 # Clean up #--------- rm(dat)
Estimate quantiles of a uniform distribution.
equnif(x, p = 0.5, method = "mle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a uniform distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
The possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function equnif
returns estimated quantiles as well as
estimates of the location and scale parameters.
Quantiles are estimated by 1) estimating the location and scale parameters by
calling eunif
, and then 2) calling the function
qunif
and using the estimated values for
location and scale.
If x
is a numeric vector, equnif
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, equnif
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The uniform distribution (also called the rectangular
distribution) with parameters min
and max
takes on values on the
real line between min
and max
with equal probability. It has been
used to represent the distribution of round-off errors in tabulated values. Another
important application is the probability integral transform: for any continuous random variable, the value of its cumulative distribution function (cdf), evaluated at that random variable, follows a uniform distribution with parameters min=0 and max=1.
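A minimal base R sketch of the probability integral transform mentioned above:

# Values of a continuous cdf evaluated at draws from that distribution
# behave like Uniform(0, 1) observations.
set.seed(123)
x <- rnorm(1000, mean = 10, sd = 3)    # any continuous distribution
u <- pnorm(x, mean = 10, sd = 3)       # cdf evaluated at the observations
summary(u)                             # roughly uniform on (0, 1)
ks.test(u, "punif")                    # consistent with uniformity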
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
eunif
, Uniform, estimate.object
.
# Generate 20 observations from a uniform distribution with parameters # min=-2 and max=3, then estimate the parameters via maximum likelihood # and estimate the 90th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- runif(20, min = -2, max = 3) equnif(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Uniform # #Estimated Parameter(s): min = -1.574529 # max = 2.837006 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 2.395852 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 #---------- # Clean up rm(dat)
Estimate quantiles of a Weibull distribution.
eqweibull(x, p = 0.5, method = "mle", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Weibull distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
Possible values are
|
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqweibull
returns estimated quantiles as well as
estimates of the shape and scale parameters.
Quantiles are estimated by 1) estimating the shape and scale parameters by
calling eweibull
, and then 2) calling the function
qweibull
and using the estimated values for
shape and scale.
If x
is a numeric vector, eqweibull
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqweibull
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The Weibull distribution is named after the Swedish physicist Waloddi Weibull, who used this distribution to model breaking strengths of materials. The Weibull distribution has been extensively applied in the fields of reliability and quality control.
The exponential distribution is a special case of the Weibull distribution: a Weibull random variable with parameters shape=1 and scale=beta is equivalent to an exponential random variable with parameter rate=1/beta.
The Weibull distribution is related to the Type I extreme value (Gumbel) distribution as follows: if X is a random variable from a Weibull distribution with parameters shape=alpha and scale=beta, then -log(X) is a random variable from an extreme value distribution with parameters location=-log(beta) and scale=1/alpha.
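These relationships can be checked by comparing quantiles. A minimal sketch (the second check assumes qevd is the EnvStats quantile function for the extreme value distribution, with location and scale arguments):

# 1) Weibull(shape = 1, scale = b) equals Exponential(rate = 1/b):
b <- 3
p <- c(0.1, 0.5, 0.9)
qweibull(p, shape = 1, scale = b)
qexp(p, rate = 1/b)
# 2) If X ~ Weibull(shape = a, scale = b), then -log(X) follows an extreme
#    value distribution with location = -log(b) and scale = 1/a
#    (note -log() is decreasing, so p maps to 1 - p):
a <- 2
-log(qweibull(1 - p, shape = a, scale = b))
qevd(p, location = -log(b), scale = 1/a)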
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
eweibull
, Weibull, Exponential,
EVD, estimate.object
.
# Generate 20 observations from a Weibull distribution with parameters # shape=2 and scale=3, then estimate the parameters via maximum likelihood, # and estimate the 90'th percentile. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rweibull(20, shape = 2, scale = 3) eqweibull(dat, p = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Weibull # #Estimated Parameter(s): shape = 2.673098 # scale = 3.047762 # #Estimation Method: mle # #Estimated Quantile(s): 90'th %ile = 4.163755 # #Quantile Estimation Method: Quantile(s) Based on # mle Estimators # #Data: dat # #Sample Size: 20 #---------- # Clean up #--------- rm(dat)
Estimate quantiles of a zero-modified lognormal distribution or a zero-modified lognormal distribution (alternative parameterization).
eqzmlnorm(x, p = 0.5, method = "mvue", digits = 0) eqzmlnormAlt(x, p = 0.5, method = "mvue", digits = 0)
x |
a numeric vector of non-negative observations, or an object resulting from a call to an estimating function that assumes a zero-modified lognormal distribution (e.g., ezmlnorm or ezmlnormAlt). |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimation. The only possible value is
"mvue" (the minimum variance unbiased estimator; the default). |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The functions eqzmlnorm
and eqzmlnormAlt
return estimated quantiles
as well as estimates of the distribution parameters.
Quantiles are estimated by:
estimating the distribution parameters by calling ezmlnorm
or
ezmlnormAlt
, and then
calling the function qzmlnorm
or
qzmlnormAlt
and using the estimated
distribution parameters.
If x
is a numeric vector, eqzmlnorm
and eqzmlnormAlt
return a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqzmlnorm
and
eqzmlnormAlt
return a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The zero-modified lognormal (delta) distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit” (the nondetects are assumed equal to 0). See, for example, Gilliom and Helsel (1986), Owen and DeRouen (1980), and Gibbons et al. (2009, Chapter 12). USEPA (2009, Chapter 15) recommends this strategy only in specific situations, and Helsel (2012, Chapter 1) strongly discourages this approach to dealing with non-detects.
A variation of the zero-modified lognormal (delta) distribution is the zero-modified normal distribution, in which a normal distribution is mixed with a positive probability mass at 0.
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901–908.
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special reference to its uses in economics). Cambridge University Press, London. pp.94-99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, pp.47–51.
Gibbons, RD., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707–719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
ezmlnorm
, Zero-Modified Lognormal,
ezmlnormAlt
,
Zero-Modified Lognormal (Alternative Parameterization),
Zero-Modified Normal, Lognormal.
# Generate 100 observations from a zero-modified lognormal (delta) # distribution with mean=2, cv=1, and p.zero=0.5, then estimate the # parameters and also the 80'th and 90'th percentiles. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rzmlnormAlt(100, mean = 2, cv = 1, p.zero = 0.5) eqzmlnormAlt(dat, p = c(0.8, 0.9)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Lognormal (Delta) # #Estimated Parameter(s): mean = 1.9604561 # cv = 0.9169411 # p.zero = 0.4500000 # mean.zmlnorm = 1.0782508 # cv.zmlnorm = 1.5307175 # #Estimation Method: mvue # #Estimated Quantile(s): 80'th %ile = 1.897451 # 90'th %ile = 2.937976 # #Quantile Estimation Method: Quantile(s) Based on # mvue Estimators # #Data: dat # #Sample Size: 100 #---------- # Compare the estimated quatiles with the true quantiles qzmlnormAlt(mean = 2, cv = 1, p.zero = 0.5, p = c(0.8, 0.9)) #[1] 1.746299 2.849858 #---------- # Clean up rm(dat)
Estimate quantiles of a zero-modified normal distribution.
eqzmnorm(x, p = 0.5, method = "mvue", digits = 0)
x |
a numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a zero-modified normal distribution
(e.g., |
p |
numeric vector of probabilities for which quantiles will be estimated.
All values of |
method |
character string specifying the method of estimating the distribution parameters.
Currently, the only possible
value is "mvue" (the minimum variance unbiased estimator; the default). |
digits |
an integer indicating the number of decimal places to round to when printing out
the value of |
The function eqzmnorm
returns estimated quantiles as well as
estimates of the distribution parameters.
Quantiles are estimated by 1) estimating the distribution parameters by
calling ezmnorm
, and then 2) calling the function
qzmnorm
and using the estimated values for
the distribution parameters.
If x
is a numeric vector, eqzmnorm
returns a
list of class "estimate"
containing the estimated quantile(s) and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, eqzmnorm
returns a list whose class is the same as x
. The list
contains the same components as x
, as well as components called
quantiles
and quantile.method
.
The zero-modified normal distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit”. See, for example USEPA (1992c, pp.27-34). In most cases, however, the zero-modified lognormal (delta) distribution will be more appropriate, since chemical concentrations are bounded below at 0 (e.g., Gilliom and Helsel, 1986; Owen and DeRouen, 1980).
Once you estimate the parameters of the zero-modified normal distribution, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with a confidence interval.
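A brief sketch of how this might be done (this assumes that ezmnorm accepts ci and conf.level arguments and stores the limits in the interval component of the returned object; see the help file for ezmnorm):

# Estimate the zero-modified normal parameters and request a confidence
# interval for the mean.
set.seed(250)
dat <- rzmnorm(100, mean = 4, sd = 2, p.zero = 0.5)
est <- ezmnorm(dat, ci = TRUE, conf.level = 0.95)
est$interval$limits    # lower and upper confidence limits for the mean
rm(dat, est)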
One way to try to assess whether a
zero-modified lognormal (delta),
zero-modified normal, censored normal, or
censored lognormal is the best model for the data is to construct both
censored and detects-only probability plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901–908.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707–719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
ZeroModifiedNormal, Normal,
ezmlnorm
, ZeroModifiedLognormal, estimate.object
.
# Generate 100 observations from a zero-modified normal distribution # with mean=4, sd=2, and p.zero=0.5, then estimate the parameters and # the 80th and 90th percentiles. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rzmnorm(100, mean = 4, sd = 2, p.zero = 0.5) eqzmnorm(dat, p = c(0.8, 0.9)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Normal # #Estimated Parameter(s): mean = 4.037732 # sd = 1.917004 # p.zero = 0.450000 # mean.zmnorm = 2.220753 # sd.zmnorm = 2.465829 # #Estimation Method: mvue # #Estimated Quantile(s): 80'th %ile = 4.706298 # 90'th %ile = 5.779250 # #Quantile Estimation Method: Quantile(s) Based on # mvue Estimators # #Data: dat # #Sample Size: 100 #---------- # Compare the estimated quantiles with the true quantiles qzmnorm(mean = 4, sd = 2, p.zero = 0.5, p = c(0.8, 0.9)) #[1] 4.506694 5.683242 #---------- # Clean up rm(dat)
Plot pointwise error bars given their upper and lower limits.
The errorBar
function is a modified version of the S function
error.bar
. The EnvStats function errorBar
includes the
additional arguments draw.lower
, draw.upper
, gap.size
,
bar.ends.size
, and col
to determine whether both the lower and
upper error bars are drawn and to control the size of the gaps, the size of the bar
ends, and the color of the bars.
errorBar(x, y = NULL, lower, upper, incr = TRUE, draw.lower = TRUE, draw.upper = TRUE, bar.ends = TRUE, gap = TRUE, add = FALSE, horizontal = FALSE, gap.size = 0.75, bar.ends.size = 1, col = 1, ..., xlab = deparse(substitute(x)), xlim, ylim)
x , y
|
coordinates of points. The coordinates can be given by two vector arguments or by a single vector. Missing values (NA) are allowed. |
lower |
pointwise lower limits of the error bars. This may be a single number or a vector
the same length as |
upper |
pointwise upper limits of the error bars. This may be a single number or a vector the
same length as |
incr |
logical scalar indicating whether the values in |
draw.lower |
logical scalar indicating whether to draw the lower error bar.
The default is |
draw.upper |
logical scalar indicating whether to draw the upper error bar.
The default is |
bar.ends |
logical scalar indicating whether flat bars should be drawn at the endpoints. The
default is |
gap |
logical scalar indicating whether gaps should be left around the points to emphasize
their locations. The default is |
add |
logical scalar indicating whether error bars should be added to the current plot.
If |
horizontal |
logical scalar indicating whether the error bars should be oriented horizontally
( |
gap.size |
numeric scalar controlling the width of the gap. |
bar.ends.size |
numeric scalar controlling the length of the bar ends. |
col |
numeric or character vector indicating the color(s) of the bars. |
xlab , xlim , ylim , ...
|
additional graphical parameters (see |
errorBar
creates a plot of y
versus x
with pointwise error bars.
errorBar
invisibly returns a list with the following components:
group.centers |
numeric vector of values on the group axis (the |
group.stats |
a matrix with the number of rows equal to the number of groups and three columns indicating the group location parameter (Center), the lower limit for the error bar (Lower), and the upper limit for the error bar (Upper). |
Authors of S (for code for error.bar
in S).
Steven P. Millard ([email protected])
Cleveland, W.S. (1994). The Elements of Graphing Data. Hobart Press, Summit, New Jersey.
plot
, segments
, pointwise
,
stripChart
.
# The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. # # Using the log-transformed data, create # # 1. A dynamite plot (bar plot showing mean plus 1 SE) # # 2. A confidence interval plot. TcCB.mat <- summaryStats(TcCB ~ Area, data = EPA.94b.tccb.df, se = TRUE, ci = TRUE) Means <- TcCB.mat[, "Mean"] SEs <- TcCB.mat[, "SE"] LCLs <- TcCB.mat[, "95%.LCL"] UCLs <- TcCB.mat[, "95%.UCL"] # Dynamite Plot #-------------- dev.new() group.centers <- barplot(Means, col = c("red", "blue"), ylim = range(0, Means, Means + SEs), ylab = "TcCB (ppb)", main = "Dynamite Plot for TcCB Data") errorBar(x = as.vector(group.centers), y = Means, lower = SEs, draw.lower = FALSE, gap = FALSE, col = c("red", "blue"), add = TRUE) # Confidence Interval Plot #------------------------- xlim <- par("usr")[1:2] dev.new() errorBar(x = as.vector(group.centers), y = Means, lower = LCLs, upper = UCLs, incr = FALSE, gap = FALSE, col = c("red", "blue"), xlim = xlim, xaxt = "n", xlab = "", ylab = "TcCB (ppb)", main = "Confidence Interval Plot for TcCB Data") axis(1, at = group.centers, labels = dimnames(TcCB.mat)[[1]]) # Clean up #--------- rm(TcCB.mat, Means, SEs, LCLs, UCLs, group.centers, xlim) graphics.off()
Objects of S3 class "estimate"
are returned by any of the
EnvStats functions that estimate the parameters or quantiles of a
probability distribution and optionally construct confidence,
prediction, or tolerance intervals based on a sample of data
assumed to come from that distribution.
Objects of S3 class "estimate"
are lists that contain
information about the estimated distribution parameters,
quantiles, and intervals. The names of the EnvStats
functions that produce objects of class "estimate"
have the following forms:
Form of Function Name | Result |
e abb |
Parameter Estimation |
eq abb |
Quantile Estimation |
predInt Abb |
Prediction Interval |
tolInt Abb |
Tolerance Interval |
where abb denotes the abbreviation of the name of a
probability distribution (see the help file for
Distribution.df
for a list of available probability
distributions and their abbreviations), and Abb denotes the
same thing as abb except the first letter of the abbreviation
for the probability distribution is capitalized.
See the help files Estimating Distribution Parameters and Estimating Distribution Quantiles for lists of functions that estimate distribution parameters and quantiles. See the help files Prediction Intervals and Tolerance Intervals for lists of functions that create prediction and tolerance intervals.
For example:
The function enorm
returns an object of class
"estimate"
(a list) with information about the estimated
mean and standard deviation of the assumed normal (Gaussian)
distribution, as well as an optional confidence interval for
the mean.
The function eqnorm
returns a list of class
"estimate"
with information about the estimated mean and
standard deviation of the assumed normal distribution, the
estimated user-specified quantile(s), and an optional confidence
interval for a single quantile.
The function predIntNorm
returns a list of class
"estimate"
with information about the estimated mean and
standard deviation of the assumed normal distribution, along with a
prediction interval for a user-specified number of future
observations (or means, medians, or sums).
The function tolIntNorm
returns a list of class
"estimate"
with information about the estimated mean and
standard deviation of the assumed normal distribution, along with a
tolerance interval.
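As a minimal sketch of this naming convention (the simulated vector dat below is hypothetical and used only for illustration), each of the following calls returns an object of class "estimate" for the normal distribution:

# Simulated data, for illustration only
set.seed(47)
dat <- rnorm(25, mean = 10, sd = 2)

enorm(dat)            # e + abb:       parameter estimation (mean and sd)
eqnorm(dat, p = 0.9)  # eq + abb:      estimate of the 90th percentile
predIntNorm(dat)      # predInt + Abb: prediction interval for a future observation
tolIntNorm(dat)       # tolInt + Abb:  tolerance interval

class(enorm(dat))
#[1] "estimate"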
Required Components
The following components must be included in a legitimate list of
class "estimate"
.
distribution |
character string indicating the name of the
assumed distribution (this equals |
sample.size |
numeric scalar indicating the sample size used to estimate the parameters or quantiles. |
data.name |
character string indicating the name of the data object used to compute the estimated parameters or quantiles. |
bad.obs |
numeric scalar indicating the number of missing ( |
Optional Components
The following components may optionally be included in a legitimate
list of class "estimate"
.
parameters |
(parametric estimation only) a numeric vector with a names attribute containing the names and values of the estimated distribution parameters. |
n.param.est |
(parametric estimation only) a scalar indicating the number of distribution parameters estimated. |
method |
(parametric estimation only) a character string indicating the method used to compute the estimated parameters. |
quantiles |
a numeric vector of estimated quantiles. |
quantile.method |
a character string indicating the method of quantile estimation. |
interval |
a list of class |
All lists of class "intervalEstimate"
contain the following
component:
name |
a character string indicating the kind of interval.
Possible values are: |
The number and names of the other components in a list of class
"intervalEstimate"
depend on the kind of interval it is.
These components may include:
parameter |
a character string indicating the parameter for
which the interval is constructed (e.g., |
limits |
a numeric vector containing the lower and upper bounds of the interval. |
type |
the type of interval (i.e., |
method |
the method used to construct the interval
(e.g., |
conf.level |
the confidence level associated with the interval. |
sample.size |
the sample size associated with the interval. |
dof |
(parametric intervals only) the degrees of freedom associated with the interval. |
limit.ranks |
(nonparametric intervals only) the rank(s) of the order statistic(s) used to construct the interval. |
m |
(prediction intervals only) the total number of future
observations ( |
k |
(prediction intervals only) the minimum number of future
observations |
n.mean |
(prediction intervals only) the sample size associated with the future averages that should be contained in the interval. |
n.median |
(prediction intervals only) the sample size associated with the future medians that should be contained in the interval. |
n.sum |
(Poisson prediction intervals only) the sample size associated with the future sums that should be contained in the interval. |
rule |
(simultaneous prediction intervals only) the rule used to construct the simultaneous prediction interval. |
delta.over.sigma |
(simultaneous prediction intervals only) numeric
scalar indicating the ratio |
Generic functions that have methods for objects of class
"estimate"
include: print
.
Since objects of class "estimate"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
Estimating Distribution Parameters, Estimating Distribution Quantiles,
Distribution.df
, Prediction Intervals,
Tolerance Intervals, estimateCensored.object
.
# Create an object of class "estimate", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) estimate.obj <- enorm(dat, ci = TRUE) mode(estimate.obj) #[1] "list" class(estimate.obj) #[1] "estimate" names(estimate.obj) #[1] "distribution" "sample.size" "parameters" #[4] "n.param.est" "method" "data.name" #[7] "bad.obs" "interval" names(estimate.obj$interval) #[1] "name" "parameter" "limits" #[4] "type" "method" "conf.level" #[7] "sample.size" "dof" estimate.obj #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 2.308798 # UCL = 3.413523 #---------- # Extract the confidence limits for the mean estimate.obj$interval$limits # LCL UCL #2.308798 3.413523 #---------- # Clean up rm(dat, estimate.obj)
# Create an object of class "estimate", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) estimate.obj <- enorm(dat, ci = TRUE) mode(estimate.obj) #[1] "list" class(estimate.obj) #[1] "estimate" names(estimate.obj) #[1] "distribution" "sample.size" "parameters" #[4] "n.param.est" "method" "data.name" #[7] "bad.obs" "interval" names(estimate.obj$interval) #[1] "name" "parameter" "limits" #[4] "type" "method" "conf.level" #[7] "sample.size" "dof" estimate.obj #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Confidence Interval for: mean # #Confidence Interval Method: Exact # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 2.308798 # UCL = 3.413523 #---------- # Extract the confidence limits for the mean estimate.obj$interval$limits # LCL UCL #2.308798 3.413523 #---------- # Clean up rm(dat, estimate.obj)
Objects of S3 class "estimateCensored"
are returned by any of the
EnvStats functions that estimate the parameters or quantiles of a
probability distribution and optionally construct confidence,
prediction, or tolerance intervals based on a sample of censored
data assumed to come from that distribution.
Objects of S3 class "estimateCensored"
are lists that contain
information about the estimated distribution parameters,
quantiles, and (if present) intervals, as well as the censoring side,
censoring levels and percentage of censored observations.
The names of the EnvStats
functions that produce objects of class "estimateCensored"
have the following forms:
Form of Function Name | Result |
e abbCensored |
Parameter Estimation |
eq abbCensored |
Quantile Estimation |
predInt AbbCensored |
Prediction Interval |
tolInt AbbCensored |
Tolerance Interval |
where abb denotes the abbreviation of the name of a
probability distribution (see the help file for
Distribution.df
for a list of available probability
distributions and their abbreviations), and Abb denotes the
same thing as abb except the first letter of the abbreviation
for the probability distribution is capitalized.
See the sections Estimating Distribution Parameters, Estimating Distribution Quantiles, and Prediction and Tolerance Intervals in the help file EnvStats Functions for Censored Data for a list of functions that estimate distribution parameters, estimate distribution quantiles, create prediction intervals, or create tolerance intervals using censored data.
For example:
The function enormCensored
returns an object of class
"estimateCensored"
(a list) with information about the estimated
mean and standard deviation of the assumed normal (Gaussian)
distribution, information about the amount and side of censoring, and also an
optional confidence interval for the mean.
The function eqnormCensored
returns a list of class
"estimateCensored"
with information about the estimated mean and
standard deviation of the assumed normal distribution, information about the
amount and side of censoring, the
estimated user-specified quantile(s), and an optional confidence
interval for a single quantile.
The function tolIntNormCensored
returns a list of class
"estimateCensored"
with information about the estimated mean and
standard deviation of the assumed normal distribution, information about the amount
and side of censoring, and the computed tolerance interval.
Required Components
The following components must be included in a legitimate list of
class "estimateCensored"
.
distribution |
character string indicating the name of the
assumed distribution (this equals |
sample.size |
numeric scalar indicating the sample size used to estimate the parameters or quantiles. |
censoring.side |
character string indicating whether the data are left- or right-censored. |
censoring.levels |
numeric scalar or vector indicating the censoring level(s). |
percent.censored |
numeric scalar indicating the percent of non-missing observations that are censored. |
data.name |
character string indicating the name of the data object used to compute the estimated parameters or quantiles. |
censoring.name |
character string indicating the name of the data object used to identify which values are censored. |
bad.obs |
numeric scalar indicating the number of missing ( |
Optional Components
The following components may optionally be included in a legitimate
list of class "estimateCensored"
.
parameters |
(parametric estimation only) a numeric vector with a names attribute containing the names and values of the estimated distribution parameters. |
n.param.est |
(parametric estimation only) a scalar indicating the number of distribution parameters estimated. |
method |
(parametric estimation only) a character string indicating the method used to compute the estimated parameters. |
quantiles |
a numeric vector of estimated quantiles. |
quantile.method |
a character string indicating the method of quantile estimation. |
interval |
a list of class |
All lists of class "intervalEstimateCensored"
contain the following
component:
name |
a character string indicating the kind of interval.
Possible values are: |
The number and names of the other components in a list of class
"intervalEstimateCensored"
depend on the kind of interval it is.
These components may include:
parameter |
a character string indicating the parameter for
which the interval is constructed (e.g., |
limits |
a numeric vector containing the lower and upper bounds of the interval. |
type |
the type of interval (i.e., |
method |
the method used to construct the interval
(e.g., |
conf.level |
the confidence level associated with the interval. |
sample.size |
the sample size associated with the interval. |
dof |
(parametric intervals only) the degrees of freedom associated with the interval. |
limit.ranks |
(nonparametric intervals only) the rank(s) of the order statistic(s) used to construct the interval. |
m |
(prediction intervals only) the total number of future
observations ( |
k |
(prediction intervals only) the minimum number of future
observations |
n.mean |
(prediction intervals only) the sample size associated with the future averages that should be contained in the interval. |
n.median |
(prediction intervals only) the sample size associated with the future medians that should be contained in the interval. |
n.sum |
(Poisson prediction intervals only) the sample size associated with the future sums that should be contained in the interval. |
rule |
(simultaneous prediction intervals only) the rule used to construct the simultaneous prediction interval. |
delta.over.sigma |
(simultaneous prediction intervals only) numeric
scalar indicating the ratio |
Generic functions that have methods for objects of class
"estimateCensored"
include: print
.
Since objects of class "estimateCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
EnvStats Functions for Censored Data,
Distribution.df
, estimate.object
.
# Create an object of class "estimateCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 100, sd = 20) censored <- dat < 90 dat[censored] <- 90 estimateCensored.obj <- enormCensored(dat, censored, ci = TRUE) mode(estimateCensored.obj) #[1] "list" class(estimateCensored.obj) #[1] "estimateCensored" names(estimateCensored.obj) # [1] "distribution" "sample.size" "censoring.side" "censoring.levels" # [5] "percent.censored" "parameters" "n.param.est" "method" # [9] "data.name" "censoring.name" "bad.obs" "interval" #[13] "var.cov.params" names(estimateCensored.obj$interval) #[1] "name" "parameter" "limits" "type" "method" "conf.level" estimateCensored.obj #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 90 # #Estimated Parameter(s): mean = 96.52796 # sd = 14.62275 # #Estimation Method: MLE # #Data: dat # #Censoring Variable: censored # #Sample Size: 20 # #Percent Censored: 25% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 88.82415 # UCL = 103.27604 #---------- # Extract the confidence limits for the mean estimateCensored.obj$interval$limits # LCL UCL # 91.7801 103.7839 #---------- # Clean up rm(dat, censored, estimateCensored.obj)
# Create an object of class "estimateCensored", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 100, sd = 20) censored <- dat < 90 dat[censored] <- 90 estimateCensored.obj <- enormCensored(dat, censored, ci = TRUE) mode(estimateCensored.obj) #[1] "list" class(estimateCensored.obj) #[1] "estimateCensored" names(estimateCensored.obj) # [1] "distribution" "sample.size" "censoring.side" "censoring.levels" # [5] "percent.censored" "parameters" "n.param.est" "method" # [9] "data.name" "censoring.name" "bad.obs" "interval" #[13] "var.cov.params" names(estimateCensored.obj$interval) #[1] "name" "parameter" "limits" "type" "method" "conf.level" estimateCensored.obj #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 90 # #Estimated Parameter(s): mean = 96.52796 # sd = 14.62275 # #Estimation Method: MLE # #Data: dat # #Censoring Variable: censored # #Sample Size: 20 # #Percent Censored: 25% # #Confidence Interval for: mean # #Confidence Interval Method: Profile Likelihood # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 88.82415 # UCL = 103.27604 #---------- # Extract the confidence limits for the mean estimateCensored.obj$interval$limits # LCL UCL # 91.7801 103.7839 #---------- # Clean up rm(dat, censored, estimateCensored.obj)
Explanation of Euler's Constant.
Euler's Constant, here denoted $\gamma$, is a real-valued number that can be defined in several ways. Johnson et al. (1992, p. 5) use the definition:

$$\gamma = \lim_{n \to \infty} \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} - \log(n) \right)$$

and note that it can also be expressed as

$$\gamma = -\psi(1)$$

where $\psi()$ is the digamma function (Johnson et al., 1992, p. 8).
The value of Euler's Constant, to 10 decimal places, is 0.5772156649.
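In R, Euler's constant can be verified directly from the digamma function:

print(-digamma(1), digits = 11)
#[1] 0.5772156649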
The expression for the mean of a
Type I extreme value (Gumbel) distribution involves Euler's
constant; hence Euler's constant is used to compute the method of moments
estimators for this distribution (see eevd
).
Steven P. Millard ([email protected])
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.4-8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Extreme Value Distribution, eevd
.
Estimate the minimum and maximum parameters of a uniform distribution.
eunif(x, method = "mle")
eunif(x, method = "mle")
x |
numeric vector of observations. Missing ( |
method |
character string specifying the method of estimation. The possible values are "mle" (maximum likelihood; the default), "mme" (method of moments), and "mmue" (method of moments based on the unbiased estimator of variance). See the DETAILS section for more information. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a uniform distribution with parameters min=$a$ and max=$b$. Also, let $x_{(i)}$ denote the $i$'th order statistic.
Estimation

Maximum Likelihood Estimation (method="mle")
The maximum likelihood estimators (mle's) of $a$ and $b$ are given by (Johnson et al., 1995, p. 286):

$$\hat{a}_{mle} = x_{(1)} = \min(x_1, \ldots, x_n) \;\;\;\; (1)$$

$$\hat{b}_{mle} = x_{(n)} = \max(x_1, \ldots, x_n) \;\;\;\; (2)$$

Method of Moments Estimation (method="mme")
The method of moments estimators (mme's) of $a$ and $b$ are given by (Forbes et al., 2011):

$$\hat{a}_{mme} = \bar{x} - \sqrt{3} \, s_m \;\;\;\; (3)$$

$$\hat{b}_{mme} = \bar{x} + \sqrt{3} \, s_m \;\;\;\; (4)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\; (5)$$

$$s_m^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\; (6)$$

Method of Moments Estimation Based on the Unbiased Estimator of Variance (method="mmue")
The method of moments estimators based on the unbiased estimator of variance are exactly the same as the method of moments estimators given in equations (3)-(6) above, except that the method of moments estimator of variance in equation (6) is replaced with the unbiased estimator of variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\; (7)$$

where $\bar{x}$ is defined in equation (5).
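As a check on the formulas above, the following minimal sketch computes the three sets of estimators by hand and compares them with eunif(), using the same simulated data as in the Examples section below:

set.seed(250)
dat <- runif(20, min = -2, max = 3)

# MLE: sample minimum and maximum
c(min(dat), max(dat))

# MME: sample mean +/- sqrt(3) times the method-of-moments standard deviation
s.m <- sqrt(mean((dat - mean(dat))^2))
c(mean(dat) - sqrt(3) * s.m, mean(dat) + sqrt(3) * s.m)

# MMUE: same, but using the unbiased standard deviation sd()
c(mean(dat) - sqrt(3) * sd(dat), mean(dat) + sqrt(3) * sd(dat))

# Compare with eunif()
eunif(dat, method = "mme")$parameters
eunif(dat, method = "mmue")$parameters

rm(dat, s.m)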
a list of class "estimate"
containing the estimated parameters and other
information. See estimate.object
for details.
The uniform distribution (also called the rectangular
distribution) with parameters min
and max
takes on values on the
real line between min
and max
with equal probability. It has been
used to represent the distribution of round-off errors in tabulated values. Another
important application is that the distribution of the cumulative distribution
function (cdf) of any kind of continuous random variable follows a uniform
distribution with parameters min=0
and max=1
.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
# Generate 20 observations from a uniform distribution with parameters # min=-2 and max=3, then estimate the parameters via maximum likelihood. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- runif(20, min = -2, max = 3) eunif(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Uniform # #Estimated Parameter(s): min = -1.574529 # max = 2.837006 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 #---------- # Compare the three methods of estimation: eunif(dat, method = "mle")$parameters # min max #-1.574529 2.837006 eunif(dat, method = "mme")$parameters # min max #-1.988462 2.650737 eunif(dat, method = "mmue")$parameters # min max #-2.048721 2.710996 #---------- # Clean up #--------- rm(dat)
Density, distribution function, quantile function, and random generation for the (largest) extreme value distribution.
devd(x, location = 0, scale = 1) pevd(q, location = 0, scale = 1) qevd(p, location = 0, scale = 1) revd(n, location = 0, scale = 1)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
location |
vector of location parameters. |
scale |
vector of positive scale parameters. |
Let $X$ be an extreme value random variable with parameters location=$\eta$ and scale=$\theta$. The density function of $X$ is given by:

$$f(x; \eta, \theta) = \frac{1}{\theta} \, e^{-(x-\eta)/\theta} \, \exp\left[-e^{-(x-\eta)/\theta}\right]$$

where $-\infty < x < \infty$, $-\infty < \eta < \infty$, and $\theta > 0$.

The cumulative distribution function of $X$ is given by:

$$F(x; \eta, \theta) = \exp\left[-e^{-(x-\eta)/\theta}\right]$$

The $p$'th quantile of $X$ is given by:

$$x_p = \eta - \theta \, \log[-\log(p)]$$
The mode, mean, variance, skew, and kurtosis of $X$ are given by:

$$Mode(X) = \eta$$

$$E(X) = \eta + \gamma \, \theta$$

$$Var(X) = \frac{\pi^2 \theta^2}{6}$$

$$Skew(X) \approx 1.139547$$

$$Kurtosis(X) = 5.4$$

where $\gamma \approx 0.5772157$ denotes Euler's constant, which is equivalent to -digamma(1).
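A minimal sketch comparing the theoretical mean and variance above with simulated values (the choice location = 5 and scale = 2 is arbitrary):

set.seed(20)
x <- revd(10000, location = 5, scale = 2)
c(theoretical.mean = 5 + 2 * (-digamma(1)), simulated.mean = mean(x))
c(theoretical.var  = (pi^2 / 6) * 2^2,      simulated.var  = var(x))
rm(x)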
density (devd
), probability (pevd
), quantile (qevd
), or
random sample (revd
) for the extreme value distribution with
location parameter(s) determined by location
and scale
parameter(s) determined by scale
.
There are three families of extreme value distributions. The one described here is the Type I, also called the Gumbel extreme value distribution or simply Gumbel distribution. The name “extreme value” comes from the fact that this distribution is the limiting distribution (as $n$ approaches infinity) of the greatest value among $n$ independent random variables each having the same continuous distribution.

The Gumbel extreme value distribution is related to the exponential distribution as follows. Let $Y$ be an exponential random variable with parameter rate=$\lambda$. Then $X = -\log(Y)$ has an extreme value distribution with parameters location=$\log(\lambda)$ and scale=$1$.
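A minimal sketch checking this relationship by simulation (the value rate = 2 is arbitrary):

set.seed(20)
rate <- 2
y <- -log(rexp(100000, rate = rate))
q <- c(-1, 0, 1, 2)
rbind(empirical   = ecdf(y)(q),
      theoretical = pevd(q, location = log(rate), scale = 1))
rm(rate, y, q)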
The distribution described above and used by devd, pevd, qevd, and revd is the largest extreme value distribution. The smallest extreme value distribution is the limiting distribution (as $n$ approaches infinity) of the smallest value among $n$ independent random variables each having the same continuous distribution. If $X$ has a largest extreme value distribution with parameters location=$\eta$ and scale=$\theta$, then $-X$ has a smallest extreme value distribution with parameters location=$-\eta$ and scale=$\theta$. The smallest extreme value distribution is related to the Weibull distribution as follows. Let $W$ be a Weibull random variable with parameters shape=$\alpha$ and scale=$\beta$. Then $Y = \log(W)$ has a smallest extreme value distribution with parameters location=$\log(\beta)$ and scale=$1/\alpha$.
The extreme value distribution has been used extensively to model the distribution of streamflow, flooding, rainfall, temperature, wind speed, and other meteorological variables, as well as material strength and life data.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
eevd
, GEVD
,
Probability Distributions and Random Numbers.
# Density of an extreme value distribution with location=0, scale=1, # evaluated at 0.5: devd(.5) #[1] 0.3307043 #---------- # The cdf of an extreme value distribution with location=1, scale=2, # evaluated at 0.5: pevd(.5, 1, 2) #[1] 0.2769203 #---------- # The 25'th percentile of an extreme value distribution with # location=-2, scale=0.5: qevd(.25, -2, 0.5) #[1] -2.163317 #---------- # Random sample of 4 observations from an extreme value distribution with # location=5, scale=2. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) revd(4, 5, 2) #[1] 9.070406 7.669139 4.511481 5.903675
Compute the expected values of order statistics for a random sample from a standard normal distribution.
evNormOrdStats(n = 1, method = "royston", lower = -9, inc = 0.025, warn = TRUE, alpha = 3/8, nmc = 2000, seed = 47, approximate = NULL) evNormOrdStatsScalar(r = 1, n = 1, method = "royston", lower = -9, inc = 0.025, warn = TRUE, alpha = 3/8, nmc = 2000, conf.level = 0.95, seed = 47, approximate = NULL)
evNormOrdStats(n = 1, method = "royston", lower = -9, inc = 0.025, warn = TRUE, alpha = 3/8, nmc = 2000, seed = 47, approximate = NULL) evNormOrdStatsScalar(r = 1, n = 1, method = "royston", lower = -9, inc = 0.025, warn = TRUE, alpha = 3/8, nmc = 2000, conf.level = 0.95, seed = 47, approximate = NULL)
n |
positive integer indicating the sample size. |
r |
positive integer between 1 and n specifying which order statistic to compute the expected value for. |
method |
character string indicating what method to use. The possible values are "royston" (the default), "blom", and "mc".
See the DETAILS section below. |
lower |
numeric scalar |
inc |
numeric scalar between |
warn |
logical scalar indicating whether to issue a warning when
|
alpha |
numeric scalar between 0 and 0.5 that determines the constant used when |
nmc |
integer |
conf.level |
numeric scalar between 0 and 1 denoting the confidence level of
the confidence interval for the expected value of the normal
order statistic when |
seed |
integer between |
approximate |
logical scalar included for backwards compatibility with versions of
EnvStats prior to version 2.3.0.
When |
Let $\underline{z} = (z_1, z_2, \ldots, z_n)$ denote a vector of $n$ observations from a normal distribution with parameters mean=0 and sd=1. That is, $\underline{z}$ denotes a vector of $n$ observations from a standard normal distribution. Let $z_{(r)}$ denote the $r$'th order statistic of $\underline{z}$, for $r = 1, 2, \ldots, n$. The probability density function of $z_{(r)}$ is given by:

$$f_{r,n}(t) = \frac{n!}{(r-1)! \, (n-r)!} \, [\Phi(t)]^{r-1} \, [1 - \Phi(t)]^{n-r} \, \phi(t) \;\;\;\; (1)$$

where $\Phi()$ and $\phi()$ denote the cumulative distribution function and probability density function of the standard normal distribution, respectively (Johnson et al., 1994, p. 93). Thus, the expected value of $z_{(r)}$ is given by:

$$E(r, n) = E[z_{(r)}] = \int_{-\infty}^{\infty} t \, f_{r,n}(t) \, dt \;\;\;\; (2)$$

It can be shown that if $n$ is odd, then

$$E\left( \frac{n+1}{2}, \, n \right) = 0 \;\;\;\; (3)$$

Also, for all values of $n$,

$$E(r, n) = -E(n-r+1, \, n) \;\;\;\; (4)$$

The function evNormOrdStatsScalar computes the value of $E(r, n)$ for user-specified values of $r$ and $n$.

The function evNormOrdStats computes the values of $E(r, n)$ for all values of $r$ (i.e., for $r = 1, 2, \ldots, n$) for a user-specified value of $n$.
Exact Method Based on Royston's Approximation to the Integral (method="royston")

When method="royston", the integral in Equation (2) above is approximated by computing the value of the integrand between the values of lower and -lower using increments of inc, then summing these values and multiplying by inc. In particular, the integrand is restructured in terms of logarithms to avoid numerical overflow:

$$t \, f_{r,n}(t) = t \, \exp\left\{ \log(n!) - \log[(r-1)!] - \log[(n-r)!] + (r-1)\log[\Phi(t)] + (n-r)\log[1-\Phi(t)] + \log[\phi(t)] \right\}$$

By default, as per Royston (1982), the integrand is evaluated between -9 and 9 in increments of 0.025. The approximation is computed this way for values of $r$ between $1$ and $\lfloor n/2 \rfloor$, where $\lfloor n/2 \rfloor$ denotes the floor of $n/2$. If $r > \lfloor n/2 \rfloor$, then the approximation is computed for $E(n-r+1, n)$ and Equation (4) is used.

Note that Equation (1) in Royston (1982) differs from Equations (1) and (2) above because Royston's paper is based on the $r$'th largest value rather than the $r$'th order statistic (i.e., the $r$'th smallest value).

Royston (1982) states that this algorithm “is accurate to at least seven decimal places on a 36-bit machine,” that it has been validated up to a sample size of $n = 2000$, and that the accuracy for larger sample sizes may be improved by reducing the value of the argument inc. Note that making inc smaller will increase the computation time.
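The same integral can also be approximated with R's built-in integrate() function. The helper below is a hypothetical sketch (it is not the algorithm used by evNormOrdStatsScalar) that evaluates Equation (2) directly:

# Hypothetical helper: numerically integrate Equation (2) for E(r, n)
ev.norm.os <- function(r, n) {
  integrand <- function(t) {
    t * exp(lfactorial(n) - lfactorial(r - 1) - lfactorial(n - r)) *
      pnorm(t)^(r - 1) * (1 - pnorm(t))^(n - r) * dnorm(t)
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

ev.norm.os(r = 1, n = 10)
# approximately -1.5388; compare with evNormOrdStatsScalar(r = 1, n = 10)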
Approximation Based on Blom's Method (method="blom")

When method="blom", the following approximation to $E(r, n)$, proposed by Blom (1958, pp. 68-75), is used:

$$E(r, n) \approx \Phi^{-1}\left( \frac{r - \alpha}{n - 2\alpha + 1} \right)$$

By default, $\alpha = 3/8 = 0.375$. This approximation is quite accurate; for example, for $n = 10$ the Blom scores agree with the exact values to about two decimal places (compare the two calls to evNormOrdStats in the Examples section below).

Harter (1961) discusses appropriate values of $\alpha$ for various sample sizes $n$ and values of $r$.
Approximation Based on Monte Carlo Simulation (method="mc")

When method="mc", Monte Carlo simulation is used to estimate the expected value of the $r$'th order statistic. That is, $N$ = nmc trials are run in which, for each trial, a random sample of $n$ standard normal observations is generated and the $r$'th order statistic is computed. Then, the average value of this order statistic over all $N$ trials is computed, along with a confidence interval for the expected value, assuming an approximately normal distribution for the mean of the order statistic (the confidence interval is computed by supplying the simulated values of the $r$'th order statistic to the function enorm).

NOTE: This method has not been optimized for large sample sizes (i.e., large values of the argument n) and/or a large number of Monte Carlo trials (i.e., large values of the argument nmc) and may take a long time to execute in these cases.
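A minimal sketch of this Monte Carlo approach for r = 1 and n = 10, using 2000 trials (the default value of nmc); it will not reproduce the exact value returned by evNormOrdStatsScalar because the random number streams differ:

set.seed(47)
sims <- replicate(2000, min(rnorm(10)))
mean(sims)              # Monte Carlo estimate of E(1, 10)
sd(sims) / sqrt(2000)   # approximate standard error of that estimate
rm(sims)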
For evNormOrdStats
: a numeric vector of length n
containing the
expected values of all the order statistics for a random sample of n
standard normal deviates.
For evNormOrdStatsScalar
: a numeric scalar containing the expected value
of the r
'th order statistic from a random sample of n
standard
normal deviates. When method="mc"
, the returned object also has a
cont.int
attribute that contains the 95
and a nmc
attribute indicating the number of Monte Carlo trials run.
The expected values of normal order statistics are used to construct normal
quantile-quantile (Q-Q) plots (see qqPlot
) and to compute
goodness-of-fit statistics (see gofTest
). Usually, however,
approximations are used instead of exact values. The functions
evNormOrdStats
and evNormOrdStatsScalar
have been included mainly
because evNormOrdStatsScalar
is called by elnorm3
and
predIntNparSimultaneousTestPower
.
Steven P. Millard ([email protected])
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Harter, H. L. (1961). Expected Values of Normal Order Statistics. Biometrika 48, 151–165.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, pp. 93–99.
Royston, J.P. (1982). Algorithm AS 177. Expected Normal Order Statistics (Exact and Approximate). Applied Statistics 31, 161–165.
Normal, ppoints
, elnorm3
,
predIntNparSimultaneousTestPower
, gofTest
,
qqPlot
.
# Compute the expected value of the minimum for a random sample of size 10 # from a standard normal distribution: # Based on method="royston" #-------------------------- evNormOrdStatsScalar(r = 1, n = 10) #[1] -1.538753 # Based on method="blom" #----------------------- evNormOrdStatsScalar(r = 1, n = 10, method = "blom") #[1] -1.546635 # Based on method="mc" with 10,000 Monte Carlo trials #---------------------------------------------------- evNormOrdStatsScalar(r = 1, n = 10, method = "mc", nmc = 10000) #[1] -1.544318 #attr(,"confint") # 95%LCL 95%UCL #-1.555838 -1.532797 #attr(,"nmc") #[1] 10000 #==================== # Compute the expected values of all of the order statistics # for a random sample of size 10 from a standard normal distribution # based on Royston's (1982) method: #-------------------------------------------------------------------- evNormOrdStats(10) #[1] -1.5387527 -1.0013570 -0.6560591 -0.3757647 -0.1226678 #[6] 0.1226678 0.3757647 0.6560591 1.0013570 1.5387527 # Compare the above with Blom (1958) scores: #------------------------------------------- evNormOrdStats(10, method = "blom") #[1] -1.5466353 -1.0004905 -0.6554235 -0.3754618 -0.1225808 #[6] 0.1225808 0.3754618 0.6554235 1.0004905 1.5466353
Estimate the shape and scale parameters of a Weibull distribution.
eweibull(x, method = "mle")
eweibull(x, method = "mle")
x |
numeric vector of observations. Missing ( |
method |
character string specifying the method of estimation. Possible values are "mle" (maximum likelihood; the default), "mme" (method of moments), and "mmue" (method of moments based on the unbiased estimator of variance). See the DETAILS section for more information. |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a Weibull distribution with parameters shape=$\alpha$ and scale=$\beta$.

Estimation

Maximum Likelihood Estimation (method="mle")
The maximum likelihood estimators (mle's) of $\alpha$ and $\beta$ are the solutions of the simultaneous equations (Forbes et al., 2011):

$$\frac{\sum_{i=1}^n x_i^{\hat{\alpha}} \log(x_i)}{\sum_{i=1}^n x_i^{\hat{\alpha}}} - \frac{1}{\hat{\alpha}} = \frac{1}{n} \sum_{i=1}^n \log(x_i) \;\;\;\; (1)$$

$$\hat{\beta} = \left[ \frac{1}{n} \sum_{i=1}^n x_i^{\hat{\alpha}} \right]^{1/\hat{\alpha}} \;\;\;\; (2)$$

Method of Moments Estimation (method="mme")
The method of moments estimator (mme) of $\alpha$ is computed by solving the equation:

$$\frac{s_m}{\bar{x}} = \left\{ \frac{\Gamma(1 + 2/\hat{\alpha})}{[\Gamma(1 + 1/\hat{\alpha})]^2} - 1 \right\}^{1/2} \;\;\;\; (3)$$

and the method of moments estimator (mme) of $\beta$ is then computed as:

$$\hat{\beta} = \frac{\bar{x}}{\Gamma(1 + 1/\hat{\alpha})} \;\;\;\; (4)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\; (5)$$

$$s_m^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\; (6)$$

and $\Gamma()$ denotes the gamma function.

Method of Moments Estimation Based on the Unbiased Estimator of Variance (method="mmue")
The method of moments estimators based on the unbiased estimator of variance are exactly the same as the method of moments estimators given in equations (3)-(6) above, except that the method of moments estimator of variance in equation (6) is replaced with the unbiased estimator of variance:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\; (7)$$
a list of class "estimate"
containing the estimated parameters and other
information. See estimate.object
for details.
The Weibull distribution is named after the Swedish physicist Waloddi Weibull, who used this distribution to model breaking strengths of materials. The Weibull distribution has been extensively applied in the fields of reliability and quality control.
The exponential distribution is a special case of the Weibull distribution: a Weibull random variable with parameters shape=$1$ and scale=$\beta$ is equivalent to an exponential random variable with parameter rate=$1/\beta$.
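A minimal sketch of this special case, comparing the two cumulative distribution functions at a few quantiles (the value scale = 3 is arbitrary):

q <- c(0.5, 1, 2, 5)
rbind(weibull     = pweibull(q, shape = 1, scale = 3),
      exponential = pexp(q, rate = 1/3))
rm(q)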
The Weibull distribution is related to the Type I extreme value (Gumbel) distribution as follows: if $X$ is a random variable from a Weibull distribution with parameters shape=$\alpha$ and scale=$\beta$, then $Y = -\log(X)$ is a random variable from an extreme value distribution with parameters location=$-\log(\beta)$ and scale=$1/\alpha$.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Weibull, Exponential, EVD,
estimate.object
.
# Generate 20 observations from a Weibull distribution with parameters # shape=2 and scale=3, then estimate the parameters via maximum likelihood. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rweibull(20, shape = 2, scale = 3) eweibull(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Weibull # #Estimated Parameter(s): shape = 2.673098 # scale = 3.047762 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 #---------- # Use the same data as in previous example, and compute the method of # moments estimators based on the unbiased estimator of variance: eweibull(dat, method = "mmue") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Weibull # #Estimated Parameter(s): shape = 2.528377 # scale = 3.052507 # #Estimation Method: mmue # #Data: dat # #Sample Size: 20 #---------- # Clean up #--------- rm(dat)
Estimate the parameters of a zero-modified lognormal distribution or a zero-modified lognormal distribution (alternative parameterization), and optionally construct a confidence interval for the mean.
ezmlnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95) ezmlnormAlt(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
ezmlnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95) ezmlnormAlt(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
x |
numeric vector of observations. Missing ( |
method |
character string specifying the method of estimation. The only possible value is "mvue" (minimum variance unbiased; the default). See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence
interval for the mean. The only possible value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a zero-modified lognormal distribution with parameters meanlog=$\mu$, sdlog=$\sigma$, and p.zero=$p$. Alternatively, let $\underline{x}$ denote a vector of $n$ observations from a zero-modified lognormal distribution (alternative parameterization) with parameters mean=$\theta$, cv=$\tau$, and p.zero=$p$.

Let $r$ denote the number of observations in $\underline{x}$ that are equal to 0, and order the observations so that $x_1, x_2, \ldots, x_r$ denote the $r$ zero observations and $x_{r+1}, x_{r+2}, \ldots, x_n$ denote the $n-r$ non-zero observations.

Note that $\theta$ is not the mean of the zero-modified lognormal distribution; it is the mean of the lognormal part of the distribution. Similarly, $\tau$ is not the coefficient of variation of the zero-modified lognormal distribution; it is the coefficient of variation of the lognormal part of the distribution.

Let $\gamma$, $\delta$, and $\omega$ denote the mean, standard deviation, and coefficient of variation of the overall zero-modified lognormal (delta) distribution. Let $\eta$ denote the standard deviation of the lognormal part of the distribution, so that $\eta = \theta \tau$. Aitchison (1955) shows that:

$$\gamma = (1 - p) \, \theta \;\;\;\; (1)$$

$$\delta^2 = (1 - p) \, \eta^2 + p \, (1 - p) \, \theta^2 \;\;\;\; (2)$$

so that

$$\omega = \frac{\delta}{\gamma} = \frac{\sqrt{(1-p) \, \eta^2 + p \, (1-p) \, \theta^2}}{(1-p) \, \theta} \;\;\;\; (3)$$
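A minimal sketch checking equations (1) and (2) by simulation, using the same parameter values as in the Examples section below (mean = 2, cv = 1, p.zero = 0.5):

set.seed(250)
x <- rzmlnormAlt(100000, mean = 2, cv = 1, p.zero = 0.5)
p <- 0.5; theta <- 2; tau <- 1; eta <- tau * theta
c(theoretical.mean = (1 - p) * theta, simulated.mean = mean(x))
c(theoretical.var  = (1 - p) * eta^2 + p * (1 - p) * theta^2,
  simulated.var    = var(x))
rm(x, p, theta, tau, eta)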
Estimation

Minimum Variance Unbiased Estimation (method="mvue")

Aitchison (1955) shows that the minimum variance unbiased estimators (mvue's) of $\gamma$ and $\delta^2$ are piecewise functions of $r$, the number of zero observations, and of the sample mean and sample variance of the log-transformed non-zero observations; see Aitchison (1955), Aitchison and Brown (1957, pp. 94-99), and Owen and DeRouen (1980) for the explicit formulas. Note that when $r = n-1$ or $r = n$, the estimator of $\gamma$ is simply the sample mean for all observations (including zero values), and the estimator of $\delta^2$ is simply the sample variance for all observations.

The mvue of $\gamma$ is unbiased, and its asymptotic variance is given in Aitchison and Brown (1957, p. 99) and Owen and DeRouen (1980).
Confidence Intervals

Based on Normal Approximation (ci.method="normal.approx")

An approximate $(1-\alpha)100\%$ confidence interval for $\gamma$ is constructed based on the assumption that the estimator of $\gamma$ is approximately normally distributed. Thus, an approximate two-sided $(1-\alpha)100\%$ confidence interval for $\gamma$ is constructed as:

$$[\, \hat{\gamma} - t(n-2, 1-\alpha/2) \, \hat{\sigma}_{\hat{\gamma}}, \;\; \hat{\gamma} + t(n-2, 1-\alpha/2) \, \hat{\sigma}_{\hat{\gamma}} \,]$$

where $t(\nu, q)$ is the $q$'th quantile of Student's t-distribution with $\nu$ degrees of freedom, and $\hat{\sigma}_{\hat{\gamma}}$ is the estimated standard deviation of the mvue of $\gamma$, computed by replacing the values of $\mu$, $\sigma$, and $p$ in the formula for the asymptotic variance with their estimated values and taking the square root.

Note that there must be at least 3 non-missing observations ($n \ge 3$) and at least one observation must be non-zero ($r \le n-1$) in order to construct a confidence interval.

One-sided confidence intervals are computed in a similar fashion.
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
For the function ezmlnorm
, the component called parameters
is a
numeric vector with the following estimated parameters:
Parameter Name | Explanation |
meanlog |
mean of the log of the lognormal part of the distribution. |
sdlog |
standard deviation of the log of the lognormal part of the distribution. |
p.zero |
probability that an observation will be 0. |
mean.zmlnorm |
mean of the overall zero-modified lognormal (delta) distribution. |
sd.zmlnorm |
standard deviation of the overall zero-modified lognormal (delta) distribution. |
For the function ezmlnormAlt
, the component called parameters
is a
numeric vector with the following estimated parameters:
Parameter Name | Explanation |
mean |
mean of the lognormal part of the distribution. |
cv |
coefficient of variation of the lognormal part of the distribution. |
p.zero |
probability that an observation will be 0. |
mean.zmlnorm |
mean of the overall zero-modified lognormal (delta) distribution. |
cv.zmlnorm |
coefficient of variation of the overall zero-modified lognormal (delta) distribution. |
The zero-modified lognormal (delta) distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit” (the nondetects are assumed equal to 0). See, for example, Gilliom and Helsel (1986), Owen and DeRouen (1980), and Gibbons et al. (2009, Chapter 12). USEPA (2009, Chapter 15) recommends this strategy only in specific situations, and Helsel (2012, Chapter 1) strongly discourages this approach to dealing with non-detects.
A variation of the zero-modified lognormal (delta) distribution is the zero-modified normal distribution, in which a normal distribution is mixed with a positive probability mass at 0.
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
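A minimal sketch of this comparison, using the zinc data from USEPA (1992c) that appear elsewhere in this manual (data frame EPA.92c.zinc.df) and assuming a lognormal model:

dat      <- EPA.92c.zinc.df$Zinc
censored <- EPA.92c.zinc.df$Censored

# Censored probability plot
qqPlotCensored(dat, censored, distribution = "lnorm", add.line = TRUE,
  main = "Censored Probability Plot for Zinc")

# Detects-only probability plot for the same data
qqPlot(dat[!censored], distribution = "lnorm", add.line = TRUE,
  main = "Detects-Only Probability Plot for Zinc")

rm(dat, censored)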
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901–908.
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special reference to its uses in economics). Cambridge University Press, London. pp.94-99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, pp.47–51.
Gibbons, RD., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707–719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zero-Modified Lognormal, Zero-Modified Normal, Lognormal.
# Generate 100 observations from a zero-modified lognormal (delta) # distribution with mean=2, cv=1, and p.zero=0.5, then estimate the # parameters. According to equations (1) and (3) above, the overall mean # is mean.zmlnorm=1 and the overall cv is cv.zmlnorm=sqrt(3). # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rzmlnormAlt(100, mean = 2, cv = 1, p.zero = 0.5) ezmlnormAlt(dat, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Lognormal (Delta) # #Estimated Parameter(s): mean = 1.9604561 # cv = 0.9169411 # p.zero = 0.4500000 # mean.zmlnorm = 1.0782508 # cv.zmlnorm = 1.5307175 # #Estimation Method: mvue # #Data: dat # #Sample Size: 100 # #Confidence Interval for: mean.zmlnorm # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.748134 # UCL = 1.408368 #---------- # Clean up rm(dat)
Estimate the mean and standard deviation of a zero-modified normal distribution, and optionally construct a confidence interval for the mean.
ezmnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
ezmnorm(x, method = "mvue", ci = FALSE, ci.type = "two-sided", ci.method = "normal.approx", conf.level = 0.95)
x |
numeric vector of observations. |
method |
character string specifying the method of estimation. Currently, the only possible value is "mvue" (minimum variance unbiased; the default). See the DETAILS section for more information. |
ci |
logical scalar indicating whether to compute a confidence interval for the
mean. The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
ci.method |
character string indicating what method to use to construct the confidence interval
for the mean. Currently the only possible value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the confidence interval.
The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote a vector of $n$ observations from a zero-modified normal distribution with parameters mean=$\mu$, sd=$\sigma$, and p.zero=$p$. Let $r$ denote the number of observations in $\underline{x}$ that are equal to 0, and order the observations so that $x_1, x_2, \ldots, x_r$ denote the $r$ zero observations, and $x_{r+1}, x_{r+2}, \ldots, x_n$ denote the $n-r$ non-zero observations.

Note that $\mu$ is not the mean of the zero-modified normal distribution; it is the mean of the normal part of the distribution. Similarly, $\sigma$ is not the standard deviation of the zero-modified normal distribution; it is the standard deviation of the normal part of the distribution.

Let $\gamma$ and $\delta$ denote the mean and standard deviation of the overall zero-modified normal distribution. Aitchison (1955) shows that:

$$\gamma = (1 - p) \, \mu \;\;\;\; (1)$$

$$\delta^2 = (1 - p) \, \sigma^2 + p \, (1 - p) \, \mu^2 \;\;\;\; (2)$$
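A minimal sketch checking equations (1) and (2) by simulation, using the same parameter values as in the Examples section below (mean = 4, sd = 2, p.zero = 0.5):

set.seed(250)
x <- rzmnorm(100000, mean = 4, sd = 2, p.zero = 0.5)
p <- 0.5; mu <- 4; sigma <- 2
c(theoretical.mean = (1 - p) * mu, simulated.mean = mean(x))
c(theoretical.var  = (1 - p) * sigma^2 + p * (1 - p) * mu^2,
  simulated.var    = var(x))
rm(x, p, mu, sigma)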
Estimation

Minimum Variance Unbiased Estimation (method="mvue")

Aitchison (1955) shows that the minimum variance unbiased estimators (mvue's) of $\gamma$ and $\delta^2$ are piecewise functions of $r$, the number of zero observations, and of the following quantities:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$$

$$\bar{x}_1 = \frac{1}{n-r} \sum_{i=r+1}^n x_i$$

$$s_1^2 = \frac{1}{n-r-1} \sum_{i=r+1}^n (x_i - \bar{x}_1)^2$$

Note that $\bar{x}$ is the sample mean of all observations (including 0 values), $\bar{x}_1$ is the sample mean of all non-zero observations, and $s_1^2$ is the sample variance of all non-zero observations. Also note that in the special cases in which none or all of the observations are zero, the estimator of $\delta^2$ is the sample variance for all observations (including 0 values).
Confidence Intervals

Based on Normal Approximation (ci.method="normal.approx")
An approximate (1 − α)100% confidence interval for γ is constructed based on the assumption that the estimator of γ is approximately normally distributed. Aitchison (1955) shows that the variance of the mvue of γ can be estimated by δ̂²/n. Thus, an approximate two-sided (1 − α)100% confidence interval for γ is constructed as:

[ γ̂ − t(n − 2, 1 − α/2) δ̂/√n ,  γ̂ + t(n − 2, 1 − α/2) δ̂/√n ]

where t(ν, p) is the p'th quantile of Student's t-distribution with ν degrees of freedom.
One-sided confidence intervals are computed in a similar fashion.
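To make the estimation steps concrete, the following minimal sketch (not part of the original help file) computes the quantities in equations (3)-(7) "by hand" and compares them with the output of ezmnorm(). The simulated data and seed mirror the Examples section below.

# Simulate zero-modified normal data (same seed as the Examples section)
set.seed(250)
dat <- rzmnorm(100, mean = 4, sd = 2, p.zero = 0.5)

n <- length(dat)
r <- sum(dat == 0)                      # number of zero observations
x.bar      <- mean(dat)                 # equation (5): mean of all observations
x.bar.star <- mean(dat[dat != 0])       # equation (6): mean of non-zero observations
s2.star    <- var(dat[dat != 0])        # equation (7): variance of non-zero observations

gamma.hat  <- x.bar                                          # equation (3)
delta2.hat <- (n - r - 1)/(n - 1) * s2.star +
              r * (n - r) / (n * (n - 1)) * x.bar.star^2     # equation (4)

c(mean.zmnorm = gamma.hat, sd.zmnorm = sqrt(delta2.hat))
# These values should match the mean.zmnorm and sd.zmnorm components of
# ezmnorm(dat)$parameters shown in the Examples section below.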
a list of class "estimate"
containing the estimated parameters and other information.
See estimate.object
for details.
The component called parameters
is a numeric vector with the following
estimated parameters:
Parameter Name | Explanation |
mean |
mean of the normal (Gaussian) part of the distribution. |
sd |
standard deviation of the normal (Gaussian) part of the distribution. |
p.zero |
probability that an observation will be 0. |
mean.zmnorm |
mean of the overall zero-modified normal distribution. |
sd.zmnorm |
standard deviation of the overall zero-modified normal distribution. |
The zero-modified normal distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit”. See, for example USEPA (1992c, pp.27-34). In most cases, however, the zero-modified lognormal (delta) distribution will be more appropriate, since chemical concentrations are bounded below at 0 (e.g., Gilliom and Helsel, 1986; Owen and DeRouen, 1980).
Once you estimate the parameters of the zero-modified normal distribution, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with a confidence interval.
One way to try to assess whether a
zero-modified lognormal (delta),
zero-modified normal, censored normal, or
censored lognormal is the best model for the data is to construct both
censored and detects-only probability plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901–908.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135–146.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707–719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
ZeroModifiedNormal, Normal,
ezmlnorm
, ZeroModifiedLognormal, estimate.object
.
# Generate 100 observations from a zero-modified normal distribution # with mean=4, sd=2, and p.zero=0.5, then estimate the parameters. # According to equations (1) and (2) above, the overall mean is # mean.zmnorm=2 and the overall standard deviation is sd.zmnorm=sqrt(6). # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rzmnorm(100, mean = 4, sd = 2, p.zero = 0.5) ezmnorm(dat, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Normal # #Estimated Parameter(s): mean = 4.037732 # sd = 1.917004 # p.zero = 0.450000 # mean.zmnorm = 2.220753 # sd.zmnorm = 2.465829 # #Estimation Method: mvue # #Data: dat # #Sample Size: 100 # #Confidence Interval for: mean.zmnorm # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 1.731417 # UCL = 2.710088 #---------- # Following Example 9 on page 34 of USEPA (1992c), compute an # estimate of the mean of the zinc data, assuming a # zero-modified normal distribution. The data are stored in # EPA.92c.zinc.df. head(EPA.92c.zinc.df) # Zinc.orig Zinc Censored Sample Well #1 <7 7.00 TRUE 1 1 #2 11.41 11.41 FALSE 2 1 #3 <7 7.00 TRUE 3 1 #4 <7 7.00 TRUE 4 1 #5 <7 7.00 TRUE 5 1 #6 10.00 10.00 FALSE 6 1 New.Zinc <- EPA.92c.zinc.df$Zinc New.Zinc[EPA.92c.zinc.df$Censored] <- 0 ezmnorm(New.Zinc, ci = TRUE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Zero-Modified Normal # #Estimated Parameter(s): mean = 11.891000 # sd = 1.594523 # p.zero = 0.500000 # mean.zmnorm = 5.945500 # sd.zmnorm = 6.123235 # #Estimation Method: mvue # #Data: New.Zinc # #Sample Size: 40 # #Confidence Interval for: mean.zmnorm # #Confidence Interval Method: Normal Approximation # (t Distribution) # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 3.985545 # UCL = 7.905455 #---------- # Clean up rm(dat, New.Zinc)
Hyperlink list of EnvStats functions by category.
The EnvStats functions listed below are useful for performing calibration and inverse prediction to determine the concentration of a chemical based on a machine signal.
Function Name | Description |
anovaPE |
Compute lack-of-fit and pure error ANOVA table for a |
linear model. | |
calibrate |
Fit a calibration line or curve. |
detectionLimitCalibrate |
Determine detection limit based on using a calibration |
line (or curve) and inverse regression. | |
inversePredictCalibrate |
Predict concentration using a calibration line (or curve) |
and inverse regression. | |
pointwise |
Pointwise confidence limits for predictions. |
predict.lm |
Predict method for linear model fits. |
The EnvStats functions listed below are useful for dealing with Type I censored data.
Data Transformations
Function Name | Description |
boxcoxCensored |
Compute values of an objective for Box-Cox Power |
transformations, or compute optimal transformation, | |
for Type I censored data. | |
print.boxcoxCensored |
Print an object of class "boxcoxCensored" . |
plot.boxcoxCensored |
Plot an object of class "boxcoxCensored" . |
Estimating Distribution Parameters
Function Name | Description |
egammaCensored |
Estimate shape and scale parameters for a gamma distribution |
based on Type I censored data. | |
egammaAltCensored |
Estimate mean and CV for a gamma distribution |
based on Type I censored data. | |
elnormCensored |
Estimate parameters for a lognormal distribution (log-scale) |
based on Type I censored data. | |
elnormAltCensored |
Estimate parameters for a lognormal distribution (original scale) |
based on Type I censored data. | |
enormCensored |
Estimate parameters for a Normal distribution based on Type I |
censored data. | |
epoisCensored |
Estimate parameter for a Poisson distribution based on Type I |
censored data. | |
enparCensored |
Estimate the mean and standard deviation nonparametrically. |
gpqCiNormSinglyCensored |
Generate the generalized pivotal quantity used to construct a |
confidence interval for the mean of a Normal distribution based | |
on Type I singly censored data. | |
gpqCiNormMultiplyCensored |
Generate the generalized pivotal quantity used to construct a |
confidence interval for the mean of a Normal distribution based | |
on Type I multiply censored data. | |
print.estimateCensored |
Print an object of class "estimateCensored" . |
Estimating Distribution Quantiles
Function Name | Description |
eqlnormCensored |
Estimate quantiles of a Lognormal distribution (log-scale) |
based on Type I censored data, and optionally construct | |
a confidence interval for a quantile. | |
eqnormCensored |
Estimate quantiles of a Normal distribution |
based on Type I censored data, and optionally construct | |
a confidence interval for a quantile. | |
All of the functions for computing quantiles (and associated confidence intervals) for complete (uncensored)
data are listed in the help file Estimating Distribution Quantiles. All of these functions, with
the exception of eqnpar
, will accept an object of class
"estimateCensored"
. Thus, you may estimate
quantiles (and construct approximate confidence intervals) for any distribution for which:
There exists a function to estimate distribution parameters using censored data (see the section Estimating Distribution Parameters above).
There exists a function to estimate quantiles for that distribution based on complete data (see the help file Estimating Distribution Quantiles).
Nonparametric estimates of quantiles (and associated confidence intervals) can be constructed from censored
data as long as the order statistics used in the results are above all left-censored observations or below
all right-censored observations. See the help file for eqnpar
for more information and
examples.
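For example, the following minimal sketch (the synthetic data and detection limit of 8 are assumptions of this illustration, not part of the help files) passes an "estimateCensored" object produced by enormCensored to eqnorm to estimate a quantile from Type I left-censored data:

# Simulate data and censor values below an assumed detection limit of 8
set.seed(123)
x <- rnorm(30, mean = 10, sd = 2)
censored <- x < 8
x[censored] <- 8

# Estimate the normal parameters from the censored data, then pass the
# resulting "estimateCensored" object to eqnorm()
est.cen <- enormCensored(x, censored)
eqnorm(est.cen, p = 0.9)    # estimated 90th percentile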
Goodness-of-Fit Tests
Function Name | Description |
gofTestCensored |
Perform a goodness-of-fit test based on Type I left- or |
right-censored data. | |
print.gofCensored |
Print an object of class "gofCensored" . |
plot.gofCensored |
Plot an object of class "gofCensored" . |
Hypothesis Tests
Function Name | Description |
twoSampleLinearRankTestCensored |
Perform two-sample linear rank tests based on |
censored data. | |
print.htestCensored |
Printing method for object of class |
"htestCensored" . |
|
Plotting Probability Distributions
Function Name | Description |
cdfCompareCensored |
Plot two cumulative distribution functions based on Type I |
censored data. | |
ecdfPlotCensored |
Plot an empirical cumulative distribution function based on |
Type I censored data. | |
ppointsCensored |
Compute plotting positions for Type I censored data. |
qqPlotCensored |
Produce quantile-quantile (Q-Q) plots, also called probability |
plots, based on Type I censored data. | |
Prediction and Tolerance Intervals
Function Name | Description |
gpqTolIntNormSinglyCensored |
Generate the generalized pivotal quantity used to construct a |
tolerance interval for a Normal distribution based | |
on Type I singly censored data. | |
gpqTolIntNormMultiplyCensored |
Generate the generalized pivotal quantity used to construct a |
tolerance interval for a Normal distribution based | |
on Type I multiply censored data. | |
tolIntLnormCensored |
Tolerance interval for a lognormal distribution (log-scale) |
based on Type I censored data. | |
tolIntNormCensored |
Tolerance interval for a Normal distribution based on Type I |
censored data. | |
All of the functions for computing prediction and tolerance intervals for complete (uncensored)
data are listed in the help files Prediction Intervals and Tolerance Intervals.
All of these functions, with the exceptions of predIntNpar
and tolIntNpar
,
will accept an object of class "estimateCensored"
. Thus, you
may construct approximate prediction or tolerance intervals for any distribution for which:
There exists a function to estimate distribution parameters using censored data (see the section Estimating Distribution Parameters above).
There exists a function to create a prediction or tolerance interval for that distribution based on complete data (see the help files Prediction Intervals and Tolerance Intervals).
Nonparametric prediction and tolerance intervals can be constructed from censored
data as long as the order statistics used in the results are above all left-censored observations or below
all right-censored observations. See the help files for predIntNpar
,
predIntNparSimultaneous
, and tolIntNpar
for more information and examples.
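As an illustration, the following minimal sketch (reusing the same assumed synthetic censored data as in the quantile example above) supplies an "estimateCensored" object to tolIntNorm to obtain an approximate upper tolerance limit:

set.seed(123)
x <- rnorm(30, mean = 10, sd = 2)
censored <- x < 8                      # assumed detection limit of 8
x[censored] <- 8
est.cen <- enormCensored(x, censored)  # parameter estimates from censored data
tolIntNorm(est.cen, coverage = 0.95, ti.type = "upper", conf.level = 0.95)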
The EnvStats functions listed below are useful for deciding on data transformations.
Function Name | Description |
boxcox |
Compute values of an objective for Box-Cox transformations, or |
compute optimal transformation based on raw observations | |
or residuals from a linear model. | |
boxcoxTransform |
Apply a Box-Cox Power transformation to a set of data. |
plot.boxcox |
Plotting method for an object of class "boxcox" . |
plot.boxcoxLm |
Plotting method for an object of class "boxcoxLm" . |
print.boxcox |
Printing method for an object of class "boxcox" . |
print.boxcoxLm |
Printing method for an object of class "boxcoxLm" . |
The EnvStats functions listed below are useful for estimating distribution parameters and optionally constructing confidence intervals.
Function Name | Description |
ebeta |
Estimate parameters of a Beta distribution |
ebinom |
Estimate parameter of a Binomial distribution |
eexp |
Estimate parameter of an Exponential distribution |
eevd |
Estimate parameters of an Extreme Value distribution |
egamma |
Estimate shape and scale parameters of a Gamma distribution |
egammaAlt |
Estimate mean and CV parameters of a Gamma distribution |
egevd |
Estimate parameters of a Generalized Extreme Value distribution |
egeom |
Estimate parameter of a Geometric distribution |
ehyper |
Estimate parameter of a Hypergeometric distribution |
elogis |
Estimate parameters of a Logistic distribution |
elnorm |
Estimate parameters of a Lognormal distribution (log-scale) |
elnormAlt |
Estimate parameters of a Lognormal distribution (original scale) |
elnorm3 |
Estimate parameters of a Three-Parameter Lognormal distribution |
enbinom |
Estimate parameter of a Negative Binomial distribution |
enorm |
Estimate parameters of a Normal distribution |
epareto |
Estimate parameters of a Pareto distribution |
epois |
Estimate parameter of a Poisson distribution |
eunif |
Estimate parameters of a Uniform distribution |
eweibull |
Estimate parameters of a Weibull distribution |
ezmlnorm |
Estimate parameters of a Zero-Modified Lognormal (Delta) |
distribution (log-Scale) | |
ezmlnormAlt |
Estimate parameters of a Zero-Modified Lognormal (Delta) |
distribution (original Scale) | |
ezmnorm |
Estimate parameters of a Zero-Modified Normal distribution |
The EnvStats functions listed below are useful for estimating distribution quantiles and, for some functions, optionally constructing confidence intervals for a quantile.
Function Name | Description |
eqbeta |
Estimate quantiles of a Beta distribution. |
eqbinom |
Estimate quantiles of a Binomial distribution. |
eqexp |
Estimate quantiles of an Exponential distribution. |
eqevd |
Estimate quantiles of an Extreme Value distribution. |
eqgamma |
Estimate quantiles of a Gamma distribution |
using the Shape and Scale Parameterization, and optionally | |
construct a confidence interval for a quantile. | |
eqgammaAlt |
Estimate quantiles of a Gamma distribution |
using the mean and CV Parameterization, and optionally | |
construct a confidence interval for a quantile. | |
eqgevd |
Estimate quantiles of a Generalized Extreme Value distribution. |
eqgeom |
Estimate quantiles of a Geometric distribution. |
eqhyper |
Estimate quantiles of a Hypergeometric distribution. |
eqlogis |
Estimate quantiles of a Logistic distribution. |
eqlnorm |
Estimate quantiles of a Lognormal distribution (log-scale), |
and optionally construct a confidence interval for a quantile. | |
eqlnorm3 |
Estimate quantiles of a Three-Parameter Lognormal distribution. |
eqnbinom |
Estimate quantiles of a Negative Binomial distribution. |
eqnorm |
Estimate quantiles of a Normal distribution, |
and optionally construct a confidence interval for a quantile. | |
eqpareto |
Estimate quantiles of a Pareto distribution. |
eqpois |
Estimate quantiles of a Poisson distribution, |
and optionally construct a confidence interval for a quantile. | |
equnif |
Estimate quantiles of a Uniform distribution. |
eqweibull |
Estimate quantiles of a Weibull distribution. |
eqzmlnorm |
Estimate quantiles of a Zero-Modified Lognormal (Delta) |
distribution (log-scale). | |
eqzmlnormAlt |
Estimate quantiles of a Zero-Modified Lognormal (Delta) |
distribution (original scale). | |
eqzmnorm |
Estimate quantiles of a Zero-Modified Normal distribution. |
The EnvStats functions listed below are useful for performing goodness-of-fit tests for user-specified probability distributions.
Goodness-of-Fit Tests
Function Name | Description |
gofTest |
Perform a goodness-of-fit test for a specified probability distribution. |
The resulting object is of class "gof" unless the test is the |
|
two-sample Kolmogorov-Smirnov test, in which case the resulting | |
object is of class "gofTwoSample" . |
|
plot.gof |
S3 class method for plotting an object of class "gof" . |
print.gof |
S3 class method for printing an object of class "gof" . |
plot.gofTwoSample |
S3 class method for plotting an object of class "gofTwoSample" . |
print.gofTwoSample |
S3 class method for printing an object of class "gofTwoSample" . |
gofGroupTest |
Perform a goodness-of-fit test to determine whether data in a set of groups |
appear to all come from the same probability distribution | |
(with possibly different parameters for each group). | |
The resulting object is of class "gofGroup" . |
|
plot.gofGroup |
S3 class method for plotting an object of class "gofGroup" . |
print.gofGroup |
S3 class method for printing an object of class "gofGroup" . |
Tests for Outliers
Function Name | Description |
rosnerTest |
Perform Rosner's test for outliers assuming a normal (Gaussian) distribution. |
print.gofOutlier |
S3 class method for printing an object of class "gofOutlier" . |
Choose a Distribution
Function Name | Description |
distChoose |
Choose best fitting distribution based on goodness-of-fit tests. |
print.distChoose |
S3 class method for printing an object of class "distChoose" . |
The EnvStats functions listed below are useful for performing hypothesis tests not already built into R. See Power and Sample Size Calculations for a list of functions you can use to perform power and sample size calculations based on various hypothesis tests.
For goodness-of-fit tests, see Goodness-of-Fit Tests.
Function Name | Description |
chenTTest |
Chen's modified one-sided t-test for skewed |
distributions. | |
kendallTrendTest |
Nonparametric test for monotonic trend |
based on Kendall's tau statistic (and | |
optional confidence interval for slope). | |
kendallSeasonalTrendTest |
Nonparametric test for monotonic trend |
within each season based on Kendall's tau | |
statistic (and optional confidence interval | |
for slope). | |
oneSamplePermutationTest |
Fisher's one-sample randomization |
(permutation) test for location. | |
quantileTest |
Two-sample rank test to detect a shift in |
a proportion of the “treated” population. | |
quantileTestPValue |
Compute p-value associated with a specified |
combination of m, n, r, and k |
for the quantile test. |
Useful for determining r and k for a |
given significance level α. |
serialCorrelationTest |
Test for the presence of serial correlation. |
signTest |
One- or paired-sample sign test on the |
median. | |
twoSampleLinearRankTest |
Two-sample linear rank test to detect a |
shift in the “treated” population. | |
twoSamplePermutationTestLocation |
Two-sample or paired-sample randomization |
(permutation) test for location. | |
twoSamplePermutationTestProportion |
Randomization (permutation) test to compare |
two proportions (Fisher's exact test). | |
varTest |
One-sample test on variance or two-sample |
test to compare variances. | |
varGroupTest |
Test for homogeneity of variance among two |
or more groups. | |
zTestGevdShape |
Estimate the shape parameter of a |
Generalized Extreme Value distribution and | |
test the null hypothesis that the true | |
value is equal to 0. | |
The EnvStats functions listed below are useful for performing Monte Carlo simulations and risk assessment.
Function Name | Description |
Empirical | Empirical distribution based on a set of observations. |
simulateVector |
Simulate a vector of random numbers from a specified theoretical |
probability distribution or empirical probability distribution | |
using either Latin hypercube sampling or simple random sampling. | |
simulateMvMatrix |
Simulate a multivariate matrix of random numbers from specified |
theoretical probability distributions and/or empirical probability | |
distributions based on a specified rank correlation matrix, using | |
either Latin hypercube sampling or simple random sampling. | |
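A minimal sketch of simulateVector, drawing a Latin hypercube sample from a lognormal distribution (the argument names and the "LHS" option shown here are assumptions of this illustration; see the simulateVector help file for the definitive interface):

set.seed(47)
simulateVector(10, distribution = "lnorm",
               param.list = list(meanlog = 0, sdlog = 1),
               sample.method = "LHS")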
The EnvStats functions listed below are useful for plotting probability distributions.
Function Name | Description |
cdfCompare |
Plot two cumulative distribution functions with the same x-axis |
in order to compare them. | |
cdfPlot |
Plot a cumulative distribution function. |
ecdfPlot |
Plot empirical cumulative distribution function. |
epdfPlot |
Plot empirical probability density function. |
pdfPlot |
Plot probability density function. |
qqPlot |
Produce a quantile-quantile (Q-Q) plot, also called a probability plot. |
qqPlotGestalt |
Plot several Q-Q plots from the same distribution in order to |
develop a Gestalt of Q-Q plots for that distribution. | |
The EnvStats functions listed below are useful for creating plots with the ggplot2 package.
Function Name | Description |
geom_stripchart |
Adaptation of the EnvStats function stripChart , |
used to create a strip plot using functions from the package | |
ggplot2. | |
stat_n_text |
Add text indicating the sample size |
to a ggplot2 plot. | |
stat_mean_sd_text |
Add text indicating the mean and standard deviation |
to a ggplot2 plot. | |
stat_median_iqr_text |
Add text indicating the median and interquartile range |
to a ggplot2 plot. | |
stat_test_text |
Add text indicating the results of a hypothesis test |
comparing groups to a ggplot2 plot. | |
The EnvStats functions listed below are useful for power and sample size calculations.
Confidence Intervals
Function Name | Description |
ciTableProp |
Confidence intervals for binomial proportion, or |
difference between two proportions, following Bacchetti (2010) | |
ciBinomHalfWidth |
Compute the half-width of a confidence interval for a |
Binomial proportion or the difference between two proportions. | |
ciBinomN |
Compute the sample size necessary to achieve a specified |
half-width of a confidence interval for a Binomial proportion or | |
the difference between two proportions. | |
plotCiBinomDesign |
Create plots for a sampling design based on a confidence interval |
for a Binomial proportion or the difference between two proportions. | |
ciTableMean |
Confidence intervals for mean of normal distribution, or |
difference between two means, following Bacchetti (2010) | |
ciNormHalfWidth |
Compute the half-width of a confidence interval for the mean of a |
Normal distribution or the difference between two means. | |
ciNormN |
Compute the sample size necessary to achieve a specified half-width |
of a confidence interval for the mean of a Normal distribution or | |
the difference between two means. | |
plotCiNormDesign |
Create plots for a sampling design based on a confidence interval |
for the mean of a Normal distribution or the difference between | |
two means. | |
ciNparConfLevel |
Compute the confidence level associated with a nonparametric |
confidence interval for a percentile. | |
ciNparN |
Compute the sample size necessary to achieve a specified |
confidence level for a nonparametric confidence interval for | |
a percentile. | |
plotCiNparDesign |
Create plots for a sampling design based on a nonparametric |
confidence interval for a percentile. |
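For example, a minimal sketch (the numbers below are arbitrary assumptions) using two of the design functions listed above:

# Half-width of a 95% confidence interval for a normal mean with n = 20,
# assuming an estimated standard deviation of 2
ciNormHalfWidth(n.or.n1 = 20, sigma.hat = 2)

# Sample size needed to achieve a half-width of 1 under the same assumptions
ciNormN(half.width = 1, sigma.hat = 2)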
Hypothesis Tests
Function Name | Description |
aovN |
Compute the sample sizes necessary to achieve a |
specified power for a one-way fixed-effects analysis | |
of variance test. | |
aovPower |
Compute the power of a one-way fixed-effects analysis of |
variance test. | |
plotAovDesign |
Create plots for a sampling design based on a one-way |
analysis of variance. | |
propTestN |
Compute the sample size necessary to achieve a specified |
power for a one- or two-sample proportion test. | |
propTestPower |
Compute the power of a one- or two-sample proportion test. |
propTestMdd |
Compute the minimal detectable difference associated with |
a one- or two-sample proportion test. | |
plotPropTestDesign |
Create plots involving sample size, power, difference, and |
significance level for a one- or two-sample proportion test. | |
tTestAlpha |
Compute the Type I Error associated with specified values for |
for power, sample size(s), and scaled MDD for a one- or | |
two-sample t-test. | |
tTestN |
Compute the sample size necessary to achieve a specified |
power for a one- or two-sample t-test. | |
tTestPower |
Compute the power of a one- or two-sample t-test. |
tTestScaledMdd |
Compute the scaled minimal detectable difference |
associated with a one- or two-sample t-test. | |
plotTTestDesign |
Create plots for a sampling design based on a one- or |
two-sample t-test. | |
tTestLnormAltN |
Compute the sample size necessary to achieve a specified |
power for a one- or two-sample t-test, assuming lognormal | |
data. | |
tTestLnormAltPower |
Compute the power of a one- or two-sample t-test, assuming |
lognormal data. | |
tTestLnormAltRatioOfMeans |
Compute the minimal or maximal detectable ratio of means |
associated with a one- or two-sample t-test, assuming | |
lognormal data. | |
plotTTestLnormAltDesign |
Create plots for a sampling design based on a one- or |
two-sample t-test, assuming lognormal data. | |
linearTrendTestN |
Compute the sample size necessary to achieve a specified |
power for a t-test for linear trend. | |
linearTrendTestPower |
Compute the power of a t-test for linear trend. |
linearTrendTestScaledMds |
Compute the scaled minimal detectable slope for a t-test |
for linear trend. | |
plotLinearTrendTestDesign |
Create plots for a sampling design based on a t-test for |
linear trend. | |
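As a quick illustration (the effect size, power, and significance level below are arbitrary assumptions), tTestN returns the sample size per group for a two-sample t-test:

# Sample size for a two-sample t-test to detect a scaled difference of
# 1 standard deviation with 90% power at a 5% significance level
tTestN(delta.over.sigma = 1, alpha = 0.05, power = 0.9,
       sample.type = "two.sample")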
Prediction Intervals
Normal Distribution Prediction Intervals
Function Name | Description |
predIntNormHalfWidth |
Compute the half-width of a prediction |
interval for a normal distribution. | |
predIntNormK |
Compute the required value of K for |
a prediction interval for a Normal | |
distribution. | |
predIntNormN |
Compute the sample size necessary to |
achieve a specified half-width for a | |
prediction interval for a Normal | |
distribution. | |
plotPredIntNormDesign |
Create plots for a sampling design |
based on the width of a prediction | |
interval for a Normal distribution. | |
predIntNormTestPower |
Compute the probability that at least |
one future observation (or mean) | |
falls outside a prediction interval | |
for a Normal distribution. | |
plotPredIntNormTestPowerCurve |
Create plots for a sampling |
design based on a prediction interval | |
for a Normal distribution. | |
predIntNormSimultaneousTestPower |
Compute the probability that at |
least one set of future observations | |
(or means) violates the given rule | |
based on a simultaneous prediction | |
interval for a Normal distribution. | |
plotPredIntNormSimultaneousTestPowerCurve |
Create plots for a sampling design |
based on a simultaneous prediction | |
interval for a Normal distribution. | |
Lognormal Distribution Prediction Intervals
Function Name | Description |
predIntLnormAltTestPower |
Compute the probability that at least |
one future observation (or geometric | |
mean) falls outside a prediction | |
interval for a lognormal distribution. | |
plotPredIntLnormAltTestPowerCurve |
Create plots for a sampling design |
based on a prediction interval for a | |
lognormal distribution. | |
predIntLnormAltSimultaneousTestPower |
Compute the probability that at least |
one set of future observations (or | |
geometric means) violates the given | |
rule based on a simultaneous | |
prediction interval for a lognormal | |
distribution. | |
plotPredIntLnormAltSimultaneousTestPowerCurve |
Create plots for a sampling design |
based on a simultaneous prediction | |
interval for a lognormal distribution. | |
Nonparametric Prediction Intervals
Function Name | Description |
predIntNparConfLevel |
Compute the confidence level associated with |
a nonparametric prediction interval. | |
predIntNparN |
Compute the required sample size to achieve |
a specified confidence level for a | |
nonparametric prediction interval. | |
plotPredIntNparDesign |
Create plots for a sampling design based on |
the confidence level and sample size of a | |
nonparametric prediction interval. | |
predIntNparSimultaneousConfLevel |
Compute the confidence level associated with |
a simultaneous nonparametric prediction | |
interval. | |
predIntNparSimultaneousN |
Compute the required sample size for a |
simultaneous nonparametric prediction | |
interval. | |
plotPredIntNparSimultaneousDesign |
Create plots for a sampling design based on |
a simultaneous nonparametric prediction | |
interval. | |
predIntNparSimultaneousTestPower |
Compute the probability that at least one |
set of future observations violates the | |
given rule based on a nonparametric | |
simultaneous prediction interval. | |
plotPredIntNparSimultaneousTestPowerCurve |
Create plots for a sampling design based on |
a simultaneous nonparametric prediction | |
interval. | |
Tolerance Intervals
Function Name | Description |
tolIntNormHalfWidth |
Compute the half-width of a tolerance |
interval for a normal distribution. | |
tolIntNormK |
Compute the required value of K for a |
tolerance interval for a Normal distribution. | |
tolIntNormN |
Compute the sample size necessary to achieve a |
specified half-width for a tolerance interval | |
for a Normal distribution. | |
plotTolIntNormDesign |
Create plots for a sampling design based on a |
tolerance interval for a Normal distribution. | |
tolIntNparConfLevel |
Compute the confidence level associated with a |
nonparametric tolerance interval for a specified | |
sample size and coverage. | |
tolIntNparCoverage |
Compute the coverage associated with a |
nonparametric tolerance interval for a specified | |
sample size and confidence level. | |
tolIntNparN |
Compute the sample size required for a nonparametric |
tolerance interval with a specified coverage and | |
confidence level. | |
plotTolIntNparDesign |
Create plots for a sampling design based on a |
nonparametric tolerance interval. | |
The EnvStats functions listed below are useful for computing prediction intervals and simultaneous prediction intervals. See Power and Sample Size for a list of functions useful for computing power and sample size for a design based on a prediction interval width, or a design based on a hypothesis test for future observations falling outside of a prediction interval.
Function Name | Description |
predIntGamma , |
Prediction interval for the next k |
predIntGammaAlt |
observations or next set of k means for a |
Gamma distribution. | |
predIntGammaSimultaneous , |
Construct a simultaneous prediction interval for the |
predIntGammaAltSimultaneous |
next r sampling occasions based on a |
Gamma distribution. | |
predIntLnorm , |
Prediction interval for the next k |
predIntLnormAlt |
observations or k geometric means from a |
Lognormal distribution. | |
predIntLnormSimultaneous , |
Construct a simultaneous prediction interval for the |
predIntLnormAltSimultaneous |
next r sampling occasions based on a |
Lognormal distribution. | |
predIntNorm |
Prediction interval for the next k observations |
or k means from a Normal (Gaussian) distribution. |
predIntNormK |
Compute the value of K for a prediction interval |
for a Normal distribution. | |
predIntNormSimultaneous |
Construct a simultaneous prediction interval for the |
next r sampling occasions based on a |
|
Normal distribution. | |
predIntNormSimultaneousK |
Compute the value of K for a simultaneous |
prediction interval for the next r sampling |
|
occasions based on a Normal distribution. | |
predIntNpar |
Nonparametric prediction interval for the next k |
of m observations. |
|
predIntNparSimultaneous |
Construct a nonparametric simultaneous prediction |
interval for the next r sampling occasions. |
|
predIntPois |
Prediction interval for the next k observations |
or k sums from a Poisson distribution. |
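For example, a minimal sketch (simulated background data are an assumption of this illustration) of predIntNorm from the table above:

# 95% prediction interval for the next single observation from a
# normal distribution, based on a background sample of size 25
set.seed(20)
x <- rnorm(25, mean = 10, sd = 2)
predIntNorm(x, k = 1, pi.type = "two-sided", conf.level = 0.95)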
The EnvStats functions listed below are printing and plotting methods for various S3 classes.
Printing Methods
Function Name | Description |
print.boxcox |
Print an object that inherits from class "boxcox" . |
print.boxcoxCensored |
Print an object that inherits from class |
"boxcoxCensored" . |
|
print.boxcoxLm |
Print an object that inherits from class "boxcoxLm" . |
print.estimate |
Print an object that inherits from class "estimate" . |
print.estimateCensored |
Print an object that inherits from class |
"estimateCensored" . |
|
print.gof |
Print an object that inherits from class "gof" . |
print.gofCensored |
Print an object that inherits from class "gofCensored" . |
print.gofGroup |
Print an object that inherits from class "gofGroup" . |
print.gofTwoSample |
Print an object that inherits from class |
"gofTwoSample" . |
|
print.htest |
Print an object that inherits from class "htest" . |
print.htestCensored |
Print an object that inherits from class |
"htestCensored" . |
|
print.permutationTest |
Print an object that inherits from class |
"permutationTest" . |
|
print.summaryStats |
Print an object that inherits from class |
"summaryStats" . |
|
Plotting Methods
Function Name | Description |
plot.boxcox |
Plot an object that inherits from class "boxcox" . |
plot.boxcoxCensored |
Plot an object that inherits from class "boxcoxCensored" . |
plot.boxcoxLm |
Plot an object that inherits from class "boxcoxLm" . |
plot.gof |
Plot an object that inherits from class "gof" . |
plot.gofCensored |
Plot an object that inherits from class "gofCensored" . |
plot.gofGroup |
Plot an object that inherits from class "gofGroup" . |
plot.gofTwoSample |
Plot an object that inherits from class "gofTwoSample" . |
plot.permutationTest |
Plot an object that inherits from class "permutationTest" . |
Listed below are all of the probability distributions available in R and EnvStats. Distributions with a description in bold are new ones that are part of EnvStats. For each distribution, there are functions for generating: values for the probability density function, values for the cumulative distribution function, quantiles, and random numbers.
The data frame Distribution.df
contains information about
all of these probability distributions.
Distribution Abbreviation | Description |
beta |
Beta distribution. |
binom |
Binomial distribution. |
cauchy |
Cauchy distribution. |
chi |
Chi distribution. |
chisq |
Chi-squared distribution. |
exp |
Exponential distribution. |
evd |
Extreme value distribution. |
f |
F-distribution. |
gamma |
Gamma distribution. |
gammaAlt |
Gamma distribution parameterized with mean and CV. |
gevd |
Generalized extreme value distribution. |
geom |
Geometric distribution. |
hyper |
Hypergeometric distribution. |
logis |
Logistic distribution. |
lnorm |
Lognormal distribution. |
lnormAlt |
Lognormal distribution parameterized with mean and CV. |
lnormMix |
Mixture of two lognormal distributions. |
lnormMixAlt |
Mixture of two lognormal distributions |
parameterized by their means and CVs. | |
lnorm3 |
Three-parameter lognormal distribution. |
lnormTrunc |
Truncated lognormal distribution. |
lnormTruncAlt |
Truncated lognormal distribution |
parameterized by mean and CV. | |
nbinom |
Negative binomial distribution. |
norm |
Normal distribution. |
normMix |
Mixture of two normal distributions. |
normTrunc |
Truncated normal distribution. |
pareto |
Pareto distribution. |
pois |
Poisson distribution. |
t |
Student's t-distribution. |
tri |
Triangular distribution. |
unif |
Uniform distribution. |
weibull |
Weibull distribution. |
wilcox |
Wilcoxon rank sum distribution. |
zmlnorm |
Zero-modified lognormal (delta) distribution. |
zmlnormAlt |
Zero-modified lognormal (delta) distribution |
parameterized with mean and CV. | |
zmnorm |
Zero-modified normal distribution. |
In addition, the functions evNormOrdStats
and
evNormOrdStatsScalar
compute expected values of order statistics
from a standard normal distribution.
The EnvStats functions listed below create summary statistics and plots.
Summary Statistics
R comes with several functions for computing summary statistics, including
mean
, var
, median
, range
,
quantile
, and summary
. The following functions in
EnvStats complement these R functions.
Function Name | Description |
cv |
Coefficient of variation |
geoMean |
Geometric mean |
geoSD |
Geometric standard deviation |
iqr |
Interquartile range |
kurtosis |
Kurtosis |
lMoment |
L-moments |
pwMoment |
Probability-weighted moments |
skewness |
Skew |
summaryFull |
Extensive summary statistics |
summaryStats |
Summary statistics |
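A minimal sketch using the built-in mtcars data (an assumption of this illustration, not an EnvStats data set) for a few of the functions listed above:

cv(mtcars$mpg)             # coefficient of variation
geoMean(mtcars$mpg)        # geometric mean
summaryStats(mtcars$mpg)   # summary statistics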
Summary Plots
R comes with several functions for creating plots to summarize data, including
hist
, barplot
, boxplot
,
dotchart
, stripchart
, and numerous others.
The help file Plotting Probability Distributions lists several EnvStats functions useful for producing summary plots as well.
In addition, the EnvStats function stripChart
is a modification
of stripchart
that allows you to include summary statistics
on the plot itself.
Finally, the help file Plotting Using ggplot2 lists
several EnvStats functions for adding information to plots produced with the
ggplot
function, including the function geom_stripchart
,
which is an adaptation of the EnvStats function stripChart
.
The EnvStats functions listed below are useful for computing tolerance intervals. See Power and Sample Size for a list of functions useful for computing power and sample size for a design based on a tolerance interval width.
Function Name | Description |
tolIntGamma , |
Tolerance interval for a Gamma distribution. |
tolIntGammaAlt |
|
tolIntLnorm , |
Tolerance interval for a lognormal distribution. |
tolIntLnormAlt |
|
tolIntNorm |
Tolerance interval for a Normal (Gaussian) distribution. |
tolIntNormK |
Compute the constant K for a Normal (Gaussian) |
tolerance interval. | |
tolIntNpar |
Nonparametric tolerance interval. |
tolIntPois |
Tolerance interval for a Poisson distribution. |
Density, distribution function, quantile function, and random generation
for the gamma distribution with parameters mean
and cv
.
dgammaAlt(x, mean, cv = 1, log = FALSE) pgammaAlt(q, mean, cv = 1, lower.tail = TRUE, log.p = FALSE) qgammaAlt(p, mean, cv = 1, lower.tail = TRUE, log.p = FALSE) rgammaAlt(n, mean, cv = 1)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If length(n) is larger than 1, then length(n) random values are returned. |
mean |
vector of (positive) means of the distribution of the random variable. |
cv |
vector of (positive) coefficients of variation of the random variable. |
log , log.p
|
logical; if TRUE, probabilities/densities p are returned as log(p). |
lower.tail |
logical; if TRUE (the default), probabilities are P[X <= x], otherwise P[X > x]. |
Let X be a random variable with a gamma distribution with parameters shape=α and scale=β. The relationship between these parameters and the mean (mean=μ) and coefficient of variation (cv=τ) of this distribution is given by:

μ = α β

τ = α^(−1/2)

or, equivalently,

α = τ^(−2)

β = μ / α
Thus, the functions dgammaAlt, pgammaAlt, qgammaAlt, and rgammaAlt call the R functions dgamma, pgamma, qgamma, and rgamma, respectively, using the values for the shape and scale parameters given by: shape <- cv^-2, scale <- mean/shape.
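This relationship can be checked directly; the following minimal sketch reproduces the density value shown in the Examples section below:

mean.val <- 10; cv.val <- 2
shape <- cv.val^-2                           # shape = cv^(-2)
scale <- mean.val/shape                      # scale = mean/shape
dgammaAlt(7, mean = mean.val, cv = cv.val)   # 0.02139335
dgamma(7, shape = shape, scale = scale)      # same value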
dgammaAlt gives the density, pgammaAlt gives the distribution function, qgammaAlt gives the quantile function, and rgammaAlt generates random deviates.

Invalid arguments will result in return value NaN, with a warning.
The gamma distribution takes values on the positive real line. Special cases of
the gamma are the exponential distribution and the
chi-square distribution. Applications of the gamma include
life testing, statistical ecology, queuing theory, inventory control and
precipitation processes. A gamma distribution starts to resemble a normal
distribution as the shape parameter tends to infinity or
the cv parameter
tends to 0.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) discourage using the assumption of a lognormal distribution for some types of environmental data and recommend instead assessing whether the data appear to fit a gamma distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions, Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
GammaDist, egammaAlt
,
Probability Distributions and Random Numbers.
# Density of a gamma distribution with parameters mean=10 and cv=2, # evaluated at 7: dgammaAlt(7, mean = 10, cv = 2) #[1] 0.02139335 #---------- # The cdf of a gamma distribution with parameters mean=10 and cv=2, # evaluated at 12: pgammaAlt(12, mean = 10, cv = 2) #[1] 0.7713307 #---------- # The 25'th percentile of a gamma distribution with parameters # mean=10 and cv=2: qgammaAlt(0.25, mean = 10, cv = 2) #[1] 0.1056871 #---------- # A random sample of 4 numbers from a gamma distribution with # parameters mean=10 and cv=2. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(10) rgammaAlt(4, mean = 10, cv = 2) #[1] 3.772004230 1.889028078 0.002987823 8.179824976
geom_stripchart
is an adaptation of the EnvStats function
stripChart
and is used to create a strip plot
using functions from the package ggplot2.
geom_stripchart
produces one-dimensional scatter plots (also called dot plots),
along with text indicating sample size and estimates of location (mean or median) and scale
(standard deviation or interquartile range), as well as confidence intervals
for the population location parameter, and results of a hypothesis
test comparing group locations.
geom_stripchart(..., seed = 47, paired = FALSE, paired.lines = paired, group = NULL, x.nudge = if (paired && paired.lines) c(-0.3, 0.3) else 0.3, text.box = FALSE, location = "mean", ci = "normal", digits = 1, digit.type = "round", nsmall = ifelse(digit.type == "round", digits, 0), jitter.params = list(), point.params = list(), line.params = list(), location.params = list(), errorbar.params = list(), n.text = TRUE, n.text.box = text.box, n.text.params = list(), location.scale.text = TRUE, location.scale.text.box = text.box, location.scale.text.params = list(), test.text = FALSE, test.text.box = text.box, test = ifelse(location == "mean", "parametric", "nonparametric"), test.text.params = list())
... |
Arguments that can be passed on |
seed |
For the case of non-paired data, the argument |
paired |
For the case of two groups, a logical scalar indicating whether the data
should be considered to be paired. The default value is NOTE: if the argument |
paired.lines |
For the case when there are two groups and the observations are paired
(i.e., |
group |
For the case when there are two groups and the observations are paired
(i.e., |
x.nudge |
A numeric scalar indicating the amount to move the estimates of location and
confidence interval lines on the |
text.box |
A logical scalar indicating whether to surround text indicating sample size,
location/scale estimates, and test results with text boxes (i.e.,
whether to use |
location |
A character string indicating whether to display the mean for each group |
ci |
For the case when NOTE: For the case when |
digits |
Integer indicating the number of digits to use for displaying text indicating the
location and scale estimates and, for the case of one or two groups,
the number of digits to use for displaying text indicating the confidence interval
associated with the test of hypothesis. When For location/scale estimates, you can override the value of this argument by
including a component named |
digit.type |
Character string indicating whether the For location/scale estimates, you can override the value of this argument by
including a component named |
nsmall |
Integer passed to the function |
jitter.params |
A list containing arguments to the function This argument is ignored when there are two groups and both |
point.params |
For the case when there are two groups and both |
line.params |
For the case when there are two groups and both |
location.params |
A list containing arguments to the function |
errorbar.params |
A list containing arguments to the function |
n.text |
A logical scalar indicating whether to display the sample size for each group.
The default is |
n.text.box |
A logical scalar indicating whether to surround the text indicating the sample size for
each group with a text box (i.e., whether to use |
n.text.params |
A list containing arguments to the function |
location.scale.text |
A logical scalar indicating whether to display text indicating the location and scale
for each group. The default is |
location.scale.text.box |
A logical scalar indicating whether to surround the text indicating the
location and scale for each group with a text box (i.e., whether to use
|
location.scale.text.params |
A list containing arguments to the function
|
test.text |
A logical scalar indicating whether to display the results of the hypothesis test
comparing groups. The default is |
test.text.box |
A logical scalar indicating whether to surround the text indicating the
results of the hypothesis test comparing groups with a text box
(i.e., whether to use |
test |
A character string indicating whether to use a standard parametric test
( |
test.text.params |
A list containing arguments to the function |
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html and Chapter 12 of Wickham (2016) for information on how to create a new geom.
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
stat_n_text
, stat_mean_sd_text
,
stat_median_iqr_text
, stat_test_text
,
geom_jitter
, geom_point
,
geom_line
, stat_summary
,
geom_text
, geom_label
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #========== #--------------------- # 3 Independent Groups #--------------------- # Example 1: # Using the built-in data frame mtcars, # create a stipchart of miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) p + geom_stripchart() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== ## Not run: # Example 2: # Repeat Example 1, but include the results of the # standard parametric analysis of variance. #------------------------------------------------- dev.new() p + geom_stripchart(test.text = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 3: # Using Example 2, show explicitly the layering # process that geom_stripchart is using. # # This plot should look identical to the previous one. #----------------------------------------------------- set.seed(47) dev.new() p + theme(legend.position = "none") + geom_jitter(pch = 1, width = 0.15, height = 0) + stat_summary(fun.y = "mean", geom = "point", size = 2, position = position_nudge(x = 0.3)) + stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", size = 0.75, width = 0.075, position = position_nudge(x = 0.3)) + stat_n_text() + stat_mean_sd_text() + stat_test_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 4: # Repeat Example 2, but put all text in a text box. #-------------------------------------------------- dev.new() p + geom_stripchart(text.box = TRUE, test.text = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 5: # Repeat Example 2, but put just the test results # in a text box. #------------------------------------------------ dev.new() p + geom_stripchart(test.text = TRUE, test.text.box = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 6: # Repeat Example 2, but: # 1) plot the median and IQR instead of the mean and the 95 # 2) show text for the median and IQR, and # 3) use the nonparametric test to compare groups. # # Note that following what the ggplot2 stat_summary function # does when you specify a "confidence interval" for the # median (i.e., when you call stat_summary with the arguments # geom="errorbar" and fun.data="median_hilow"), the displayed # error bars show intervals based on estimated quuantiles. # By default, stat_summary with the arguments # geom="errorbar" and fun.data="median_hilow" displays # error bars using the 2.5'th and 97.5'th percentiles. # The function geom_stripchart, however, by default # displays error bars using the 25'th and 75'th percentiles # (see the explanation for the argument ci above). #------------------------------------------------------------ dev.new() p + geom_stripchart(location = "median", test.text = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Clean up #--------- graphics.off() rm(p) #======================================== #--------------------- # 2 Independent Groups #--------------------- # Example 7: # Repeat Example 2, but use only the groups with # 4 and 8 cylinders. 
#----------------------------------------------- dev.new() p <- ggplot(subset(mtcars, cyl %in% c(4, 8)), aes(x = factor(cyl), y = mpg, color = cyl)) p + geom_stripchart(test.text = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 8: # Repeat Example 7, but # 1) facet by transmission type # 2) make the text smaller # 3) put the text for the test results in a text box # and make them blue. dev.new() p + geom_stripchart(test.text = TRUE, test.text.box = TRUE, n.text.params = list(size = 3), location.scale.text.params = list(size = 3), test.text.params = list(size = 3, color = "blue")) + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Clean up #--------- graphics.off() rm(p) #======================================== #--------------------- # 2 Independent Groups #--------------------- # Example 9: # The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. # # First create one-dimensional scatterplots to compare the # TcCB concentrations between the areas and use a nonparametric # test to test for a difference between areas. dev.new() p <- ggplot(EPA.94b.tccb.df, aes(x = Area, y = TcCB, color = Area)) p + geom_stripchart(location = "median", test.text = TRUE) + labs(y = "TcCB (ppb)") #========== # Example 10: # Now log-transform the TcCB data and use a parametric test # to compare the areas. dev.new() p <- ggplot(EPA.94b.tccb.df, aes(x = Area, y = log10(TcCB), color = Area)) p + geom_stripchart(test.text = TRUE) + labs(y = "log10[ TcCB (ppb) ]") #========== # Example 11: # Repeat Example 10, but allow the variances to differ # between Areas. #----------------------------------------------------- dev.new() p + geom_stripchart(test.text = TRUE, test.text.params = list(test.arg.list = list(var.equal=FALSE))) + labs(y = "log10[ TcCB (ppb) ]") #========== # Clean up #--------- graphics.off() rm(p) #======================================== #-------------------- # Paired Observations #-------------------- # Example 12: # The data frame ACE.13.TCE.df contians paired observations of # trichloroethylene (TCE; mg/L) at 10 groundwater monitoring wells # before and after remediation. # # Create one-dimensional scatterplots to compare TCE concentrations # before and after remediation and use a paired t-test to # test for a difference between periods. ACE.13.TCE.df # TCE.mg.per.L Well Period #1 20.900 1 Before #2 9.170 2 Before #3 5.960 3 Before #... ...... .. ...... #18 0.520 8 After #19 3.060 9 After #20 1.900 10 After dev.new() p <- ggplot(ACE.13.TCE.df, aes(x = Period, y = TCE.mg.per.L, color = Period)) p + geom_stripchart(paired = TRUE, group = "Well", test.text = TRUE) + labs(y = "TCE (mg/L)") #========== # Example 13: # Repeat Example 11, but use a one-sided alternative since # remediation should decrease TCE concentration. 
#--------------------------------------------------------- dev.new() p + geom_stripchart(paired = TRUE, group = "Well", test.text = TRUE, test.text.params = list(test.arg.list = list(alternative="less"))) + labs(y = "TCE (mg/L)") #========== # Clean up #--------- graphics.off() rm(p) #======================================== #---------------------------------------- # Paired Observations, Nonparametric Test #---------------------------------------- # Example 14: # The data frame Helsel.Hirsch.02.Mayfly.df contains paired counts # of mayfly nymphs above and below industrial outfalls in 12 streams. # # Create one-dimensional scatterplots to compare the # counts between locations and use a nonparametric test # to compare counts above and below the outfalls. Helsel.Hirsch.02.Mayfly.df # Mayfly.Count Stream Location #1 12 1 Above #2 15 2 Above #3 11 3 Above #... ... .. ..... #22 60 10 Below #23 53 11 Below #24 124 12 Below dev.new() p <- ggplot(Helsel.Hirsch.02.Mayfly.df, aes(x = Location, y = Mayfly.Count, color = Location)) p + geom_stripchart(location = "median", paired = TRUE, group = "Stream", test.text = TRUE) + labs(y = "Number of Mayfly Nymphs") #========== # Clean up #--------- graphics.off() rm(p) ## End(Not run)
Compute the sample geometric mean.
geoMean(x, na.rm = FALSE)
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from |
If x
contains any non-positive values (values less than or equal to 0),
geoMean
returns NA
and issues a warning.
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ observations from some distribution. The sample geometric mean is a measure of central tendency. It is defined as:

$$\bar{x}_G = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$$

that is, it is the $n$'th root of the product of all $n$ observations.

An equivalent way to define the geometric mean is by:

$$\bar{x}_G = e^{\bar{y}}$$

where

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad y_i = \log(x_i), \;\; i = 1, 2, \ldots, n$$

That is, the sample geometric mean is the antilog of the sample mean of the log-transformed observations.
The geometric mean is only defined for positive observations. It can be shown that the geometric mean is less than or equal to the sample arithmetic mean with equality only when all of the observations are the same value.
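As a quick numerical illustration of the two equivalent definitions above, the following sketch (with a made-up vector x) computes the n'th root of the product and the antilog of the mean log; both should agree with geoMean(x):

x <- c(2, 4, 8, 16)        # hypothetical positive observations
prod(x)^(1/length(x))      # n'th root of the product of the observations
exp(mean(log(x)))          # antilog of the mean of the log-transformed observations
# geoMean(x) should return the same value (here, about 5.657)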
A numeric scalar – the sample geometric mean.
The geometric mean is sometimes used to average ratios and percent changes
(Zar, 2010). For the lognormal distribution, the geometric mean is the
maximum likelihood estimator of the median of the distribution,
although it is sometimes used incorrectly to estimate the mean of the
distribution (see the NOTE section in the help file for elnormAlt
).
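A small simulation sketch of that point (the parameter values below are made up for illustration): for a lognormal sample, the geometric mean targets the median exp(meanlog), not the mean.

set.seed(1)
x <- rlnorm(10000, meanlog = 1, sdlog = 0.75)
geoMean(x)   # should be close to exp(1) = 2.72, the true median
mean(x)      # noticeably larger; the true mean is exp(1 + 0.75^2/2), about 3.60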
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
geoSD
, summaryFull
, Summary Statistics
,
mean
, median
.
# Generate 20 observations from a lognormal distribution with parameters # mean=10 and cv=2, and compute the mean, median, and geometric mean. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 2) mean(dat) #[1] 5.339273 median(dat) #[1] 3.692091 geoMean(dat) #[1] 4.095127 #---------- # Clean up rm(dat)
Compute the sample geometric standard deviation.
geoSD(x, na.rm = FALSE, sqrt.unbiased = TRUE)
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from |
sqrt.unbiased |
logical scalar specifying what method to use to compute the sample standard
deviation of the log-transformed observations. If |
If x
contains any non-positive values (values less than or equal to 0),
geoSD
returns NA
and issues a warning.
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ observations from some distribution. The sample geometric standard deviation is a measure of variability. It is defined as:

$$s_G = e^{s_y} \;\;\;\;\;\; (1)$$

where

$$s_y = \left[ \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2}, \qquad y_i = \log(x_i), \;\; i = 1, 2, \ldots, n \;\;\;\;\;\; (2)$$

That is, the sample geometric standard deviation is the antilog of the sample standard deviation of the log-transformed observations.

The sample standard deviation of the log-transformed observations shown in Equation (2) is the square root of the unbiased estimator of variance. (Note that this estimator of standard deviation is not an unbiased estimator.) Sometimes, the square root of the method of moments estimator of variance is used instead:

$$s_{y,m} = \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{1/2} \;\;\;\;\;\; (3)$$

This is the estimator used in Equation (1) when sqrt.unbiased=FALSE.
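The following sketch (made-up data) spells out Equations (1)-(3): by default the geometric standard deviation is the antilog of the usual sample standard deviation of the logs, while sqrt.unbiased=FALSE uses the method-of-moments denominator n instead of n-1:

x <- c(2, 4, 8, 16)                             # hypothetical positive observations
y <- log(x)
exp(sd(y))                                      # should match geoSD(x)
exp(sqrt(sum((y - mean(y))^2) / length(y)))     # should match geoSD(x, sqrt.unbiased = FALSE)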
A numeric scalar – the sample geometric standard deviation.
The geometric standard deviation is only defined for positive observations. It is usually computed only for observations that are assumed to have come from a lognormal distribution.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Leidel, N.A., K.A. Busch, and J.R. Lynch. (1977). Occupational Exposure Sampling Strategy Manual. U.S. Department of Health, Education, and Welfare, Public Health Service, Center for Disease Control, National Institute for Occupational Safety and Health, Cincinnati, Ohio 45226, January, 1977, pp.102–103.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
geoMean
, Lognormal, elnorm
,
summaryFull
, Summary Statistics
.
# Generate 2000 observations from a lognormal distribution with parameters # mean=10 and cv=1, which implies the standard deviation (on the original # scale) is 10. Compute the mean, geometric mean, standard deviation, # and geometric standard deviation. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(2000, mean = 10, cv = 1) mean(dat) #[1] 10.23417 geoMean(dat) #[1] 7.160154 sd(dat) #[1] 9.786493 geoSD(dat) #[1] 2.334358 #---------- # Clean up rm(dat)
Density, distribution function, quantile function, and random generation for the generalized extreme value distribution.
dgevd(x, location = 0, scale = 1, shape = 0) pgevd(q, location = 0, scale = 1, shape = 0) qgevd(p, location = 0, scale = 1, shape = 0) rgevd(n, location = 0, scale = 1, shape = 0)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
location |
vector of location parameters. |
scale |
vector of positive scale parameters. |
shape |
vector of shape parameters. |
Let $X$ be a generalized extreme value random variable with parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$. When the shape parameter $\kappa = 0$, the generalized extreme value distribution reduces to the extreme value distribution. When the shape parameter $\kappa \ne 0$, the cumulative distribution function of $X$ is given by:

$$F(x; \eta, \theta, \kappa) = \exp\left\{ -\left[ 1 - \frac{\kappa (x - \eta)}{\theta} \right]^{1/\kappa} \right\}$$

where $-\infty < \eta, \kappa < \infty$ and $\theta > 0$.

When $\kappa > 0$, the range of $x$ is:

$$-\infty < x \le \eta + \theta / \kappa$$

and when $\kappa < 0$ the range of $x$ is:

$$\eta + \theta / \kappa \le x < \infty$$

The $p$'th quantile of $X$ is given by:

$$x_p = \eta + \frac{\theta \left\{ 1 - [-\log(p)]^{\kappa} \right\}}{\kappa}$$
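The closed-form cdf and quantile expressions above can be checked numerically against pgevd and qgevd; the parameter values below are taken from the examples later in this entry, so the results should agree with the example output (about 0.2795905 and -0.4895683, respectively):

eta <- 1; theta <- 2; kappa <- 0.25
x <- 0.5
exp(-(1 - kappa * (x - eta) / theta)^(1 / kappa))        # closed-form cdf
pgevd(x, location = eta, scale = theta, shape = kappa)   # should agree

p <- 0.9; eta <- -2; theta <- 0.5; kappa <- -0.25
eta + theta * (1 - (-log(p))^kappa) / kappa              # closed-form quantile
qgevd(p, location = eta, scale = theta, shape = kappa)   # should agree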
density (dgevd), probability (pgevd), quantile (qgevd), or random sample (rgevd) for the generalized extreme value distribution with
location parameter(s) determined by location
, scale parameter(s)
determined by scale
, and shape parameter(s) determined by shape
.
Two-parameter extreme value distributions (EVD) have been applied extensively since the 1930's to several fields of study, including the distributions of hydrological and meteorological variables, human lifetimes, and strength of materials. The three-parameter generalized extreme value distribution (GEVD) was introduced by Jenkinson (1955) to model annual maximum and minimum values of meteorological events. Since then, it has been used extensively in the hydrological and meteorological fields.
The three families of EVDs are all special kinds of GEVDs. When the shape parameter $\kappa = 0$, the GEVD reduces to the Type I extreme value (Gumbel) distribution. (The function zTestGevdShape allows you to test the null hypothesis that the shape parameter is equal to 0.) When $\kappa < 0$, the GEVD is the same as the Type II extreme value distribution, and when $\kappa > 0$ it is the same as the Type III extreme value distribution.
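A minimal sketch of the Type I special case (a made-up grid of quantiles, assuming shape = 0 is handled directly by pgevd): with shape = 0 the GEVD cdf should reduce to the Gumbel form exp(-exp(-(x - location)/scale)).

x <- seq(-2, 4, by = 1)
pgevd(x, location = 0, scale = 1, shape = 0)   # GEVD with shape = 0
exp(-exp(-x))                                  # Gumbel (Type I extreme value) cdf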
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Jenkinson, A.F. (1955). The Frequency Distribution of the Annual Maximum (or Minimum) of Meteorological Events. Quarterly Journal of the Royal Meteorological Society, 81, 158–171.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
egevd
, zTestGevdShape
, EVD
,
Probability Distributions and Random Numbers.
# Density of a generalized extreme value distribution with # location=0, scale=1, and shape=0, evaluated at 0.5: dgevd(.5) #[1] 0.3307043 #---------- # The cdf of a generalized extreme value distribution with # location=1, scale=2, and shape=0.25, evaluated at 0.5: pgevd(.5, 1, 2, 0.25) #[1] 0.2795905 #---------- # The 90'th percentile of a generalized extreme value distribution with # location=-2, scale=0.5, and shape=-0.25: qgevd(.9, -2, 0.5, -0.25) #[1] -0.4895683 #---------- # Random sample of 4 observations from a generalized extreme value # distribution with location=5, scale=2, and shape=1. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) rgevd(4, 5, 2, 1) #[1] 6.738692 6.473457 4.446649 5.727085
Alkalinity concentrations (mg/L) in groundwater.
data(Gibbons.et.al.09.Alkilinity.vec)
A numeric vector with 27 elements.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley & Sons, Hoboken. Table 5.5, p. 107.
Vinyl chloride concentrations (µg/L) in groundwater from upgradient
background monitoring wells.
data(Gibbons.et.al.09.Vinyl.Chloride.vec)
A numeric vector with 34 elements.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley & Sons, Hoboken. Table 4.3, p. 87.
Objects of S3 class "gof"
are returned by the EnvStats function
gofTest
when just the x
argument is supplied.
Objects of S3 class "gof"
are lists that contain
information about the assumed distribution, the estimated or
user-supplied distribution parameters, and the test statistic
and p-value.
Required Components
The following components must be included in a legitimate list of
class "gof"
.
distribution |
a character string indicating the name of the
assumed distribution (see |
dist.abb |
a character string containing the abbreviated name
of the distribution (see |
distribution.parameters |
a numeric vector with a names attribute containing the names and values of the estimated or user-supplied distribution parameters associated with the assumed distribution. |
n.param.est |
a scalar indicating the number of distribution
parameters estimated prior to performing the goodness-of-fit
test. The value of this component will be |
estimation.method |
a character string indicating the method
used to compute the estimated parameters. The value of this
component will depend on the available estimation methods
(see |
statistic |
a numeric scalar with a names attribute containing the name and value of the goodness-of-fit statistic. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit test. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the |
z.value |
(except when |
p.value |
numeric scalar containing the p-value associated with the goodness-of-fit statistic. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the
goodness-of-fit test (e.g., |
data |
numeric vector containing the data actually used for the goodness-of-fit test (i.e., the original data without any missing or infinite values). |
data.name |
character string indicating the name of the data object used for the goodness-of-fit test. |
bad.obs |
numeric scalar indicating the number of missing ( |
NOTE: when the function gofTest
is called with
both arguments x
and y
and test="ks"
, it
returns an object of class "gofTwoSample"
.
No specific parametric distribution is assumed, so the value of the component
distribution
is "Equal"
and the following components
are omitted: dist.abb
, distribution.parameters
,
n.param.est
, estimation.method
, and z.value
.
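A short sketch of that case (simulated data; illustrative only): calling gofTest with both y and x performs the two-sample Kolmogorov-Smirnov test, and the result is of class "gofTwoSample" rather than "gof".

set.seed(10)
y1 <- rnorm(20)
y2 <- rnorm(20, mean = 1)
two.samp <- gofTest(y = y1, x = y2, test = "ks")
class(two.samp)   # should be "gofTwoSample"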
Optional Components
The following components are included in the result of
calling gofTest
with the argument test="chisq"
and may be used by the function
plot.gof
:
cut.points |
numeric vector containing the cutpoints used to define the cells. |
counts |
numeric vector containing the observed number of counts for each cell. |
expected |
numeric vector containing the expected number of counts for each cell. |
X2.components |
numeric vector containing the contribution of each cell to the chi-square statistic. |
Generic functions that have methods for objects of class
"gof"
include: print
, plot
.
Since objects of class "gof"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
gofTest
, print.gof
, plot.gof
,
Goodness-of-Fit Tests,
Distribution.df
, gofCensored.object
.
# Create an object of class "gof", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) gof.obj <- gofTest(dat) mode(gof.obj) #[1] "list" class(gof.obj) #[1] "gof" names(gof.obj) # [1] "distribution" "dist.abb" # [3] "distribution.parameters" "n.param.est" # [5] "estimation.method" "statistic" # [7] "sample.size" "parameters" # [9] "z.value" "p.value" #[11] "alternative" "method" #[13] "data" "data.name" #[15] "bad.obs" gof.obj #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Normal # #Estimated Parameter(s): mean = 2.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Test Statistic: W = 0.9640724 # #Test Statistic Parameter: n = 20 # #P-value: 0.6279872 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. #========== # Extract the p-value #-------------------- gof.obj$p.value #[1] 0.6279872 #========== # Plot the results of the test #----------------------------- dev.new() plot(gof.obj) #========== # Clean up #--------- rm(dat, gof.obj) graphics.off()
Objects of S3 class "gofCensored"
are returned by the EnvStats function
gofTestCensored
.
Objects of S3 class "gofCensored"
are lists that contain
information about the assumed distribution, the amount of censoring,
the estimated or user-supplied distribution parameters, and the test
statistic and p-value.
Required Components
The following components must be included in a legitimate list of
class "gofCensored"
.
distribution |
a character string indicating the name of the
assumed distribution (see |
dist.abb |
a character string containing the abbreviated name
of the distribution (see |
distribution.parameters |
a numeric vector with a names attribute containing the names and values of the estimated or user-supplied distribution parameters associated with the assumed distribution. |
n.param.est |
a scalar indicating the number of distribution
parameters estimated prior to performing the goodness-of-fit
test. The value of this component will be |
estimation.method |
a character string indicating the method
used to compute the estimated parameters. The value of this
component will depend on the available estimation methods
(see |
statistic |
a numeric scalar with a names attribute containing the name and value of the goodness-of-fit statistic. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit test. |
censoring.side |
character string indicating whether the data are left- or right-censored. |
censoring.levels |
numeric scalar or vector indicating the censoring level(s). |
percent.censored |
numeric scalar indicating the percent of non-missing observations that are censored. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the |
z.value |
(except when |
p.value |
numeric scalar containing the p-value associated with the goodness-of-fit statistic. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the
goodness-of-fit test (e.g., |
data.name |
character string indicating the name of the data object used for the goodness-of-fit test. |
censored |
logical vector indicating which observations are censored. |
censoring.name |
character string indicating the name of the object used to indicate the censoring. |
Optional Components
The following components are included when the argument keep.data
is
set to TRUE
in the call to the function producing the
object of class "gofCensored"
.
data |
numeric vector containing the data actually used for the goodness-of-fit test (i.e., the original data without any missing or infinite values). |
censored |
logical vector indicating the censoring status of the data actually used for the goodness-of-fit test. |
The following component is included when the data object
contains missing (NA
), undefined (NaN
) and/or infinite
(Inf
, -Inf
) values.
bad.obs |
numeric scalar indicating the number of missing ( |
Generic functions that have methods for objects of class
"gofCensored"
include: print
, plot
.
Since objects of class "gofCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
gofTestCensored
, print.gofCensored
,
plot.gofCensored
,
Censored Data,
Goodness-of-Fit Tests,
Distribution.df
, gof.object
.
# Create an object of class "gofCensored", then print it out. #------------------------------------------------------------ gofCensored.obj <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf")) mode(gofCensored.obj) #[1] "list" class(gofCensored.obj) #[1] "gofCensored" names(gofCensored.obj) # [1] "distribution" "dist.abb" # [3] "distribution.parameters" "n.param.est" # [5] "estimation.method" "statistic" # [7] "sample.size" "censoring.side" # [9] "censoring.levels" "percent.censored" #[11] "parameters" "z.value" #[13] "p.value" "alternative" #[15] "method" "data" #[17] "data.name" "censored" #[19] "censoring.name" "bad.obs" gofCensored.obj #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Francia GOF # (Multiply Censored Data) # #Hypothesized Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 15.23508 # sd = 30.62812 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Test Statistic: W = 0.8368016 # #Test Statistic Parameters: N = 25.00 # DELTA = 0.24 # #P-value: 0.004662658 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. #========== # Extract the p-value #-------------------- gofCensored.obj$p.value #[1] 0.004662658 #========== # Plot the results of the test #----------------------------- dev.new() plot(gofCensored.obj) #========== # Clean up #--------- rm(gofCensored.obj) graphics.off()
Objects of S3 class "gofGroup"
are returned by the EnvStats function
gofGroupTest
.
Objects of S3 class "gofGroup"
are lists that contain
information about the assumed distribution, the estimated or
user-supplied distribution parameters, and the test statistic
and p-value.
Required Components
The following components must be included in a legitimate list of
class "gofGroup"
.
distribution |
a character string indicating the name of the
assumed distribution (see |
dist.abb |
a character string containing the abbreviated name
of the distribution (see |
statistic |
a numeric scalar with a names attribute containing the name and value of the goodness-of-fit statistic. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit test. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the |
p.value |
numeric scalar containing the p-value associated with the goodness-of-fit statistic. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the
goodness-of-fit test (e.g., |
data.name |
character string indicating the name of the data object used for the goodness-of-fit test. |
grouping.variable |
character string indicating the name of the variable defining the groups. |
bad.obs |
numeric vector indicating the number of missing ( |
n.groups |
numeric scalar containing the number of groups. |
group.names |
character vector containing the levels of the grouping variable, i.e., the names of each of the groups. |
group.scores |
numeric vector containing the individual statistics for each group. |
Optional Component
The following component is included when gofGroupTest
is
called with a formula for the first argument and a data
argument.
parent.of.data |
character string indicating the name of the object supplied
in the |
Generic functions that have methods for objects of class
"gofGroup"
include: print
, plot
.
Since objects of class "gofGroup"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
gofGroupTest
, print.gofGroup
, plot.gofGroup
,
Goodness-of-Fit Tests,
Distribution.df
.
# Create an object of class "gofGroup", then print it out. # Example 10-4 of USEPA (2009, page 10-20) gives an example of # simultaneously testing the assumption of normality for nickel # concentrations (ppb) in groundwater collected at 4 monitoring # wells over 5 months. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. gofGroup.obj <- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df) mode(gofGroup.obj) #[1] "list" class(gofGroup.obj) #[1] "gofGroup" names(gofGroup.obj) # [1] "distribution" "dist.abb" "statistic" # [4] "sample.size" "parameters" "p.value" # [7] "alternative" "method" "data.name" #[10] "grouping.variable" "parent.of.data" "bad.obs" #[13] "n.groups" "group.names" "group.scores" gofGroup.obj #Results of Group Goodness-of-Fit Test #------------------------------------- # #Test Method: Wilk-Shapiro GOF (Normal Scores) # #Hypothesized Distribution: Normal # #Data: Nickel.ppb # #Grouping Variable: Well # #Data Source: EPA.09.Ex.10.1.nickel.df # #Number of Groups: 4 # #Sample Sizes: Well.1 = 5 # Well.2 = 5 # Well.3 = 5 # Well.4 = 5 # #Test Statistic: z (G) = -3.658696 # #P-values for #Individual Tests: Well.1 = 0.03510747 # Well.2 = 0.02385344 # Well.3 = 0.01120775 # Well.4 = 0.10681461 # #P-value for #Group Test: 0.0001267509 # #Alternative Hypothesis: At least one group # does not come from a # Normal Distribution. #========== # Extract the p-values #--------------------- gofGroup.obj$p.value # Well.1 Well.2 Well.3 Well.4 z (G) #0.0351074733 0.0238534406 0.0112077511 0.1068146088 0.0001267509 #========== # Plot the results of the test #----------------------------- dev.new() plot(gofGroup.obj) #========== # Clean up #--------- rm(gofGroup.obj) graphics.off()
Perform a goodness-of-fit test to determine whether data in a set of groups appear to all come from the same probability distribution (with possibly different parameters for each group).
gofGroupTest(object, ...) ## S3 method for class 'formula' gofGroupTest(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: gofGroupTest(object, group, test = "sw", distribution = "norm", est.arg.list = NULL, n.classes = NULL, cut.points = NULL, param.list = NULL, estimate.params = ifelse(is.null(param.list), TRUE, FALSE), n.param.est = NULL, correct = NULL, digits = .Options$digits, exact = NULL, ws.method = "normal scores", data.name = NULL, group.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...) ## S3 method for class 'data.frame' gofGroupTest(object, ...) ## S3 method for class 'matrix' gofGroupTest(object, ...) ## S3 method for class 'list' gofGroupTest(object, ...)
object |
an object containing data for 2 or more groups to be compared to the
hypothesized distribution specified by |
data |
when |
subset |
when |
na.action |
when |
group |
when |
test |
character string defining which goodness-of-fit test to perform on each group.
Possible values are:
|
distribution |
a character string denoting the distribution abbreviation. See the help file for
When When When When When |
est.arg.list |
a list of arguments to be passed to the function estimating the distribution parameters
for each group of observations.
For example, if When When When When |
n.classes |
for the case when |
cut.points |
for the case when |
param.list |
for the case when |
estimate.params |
for the case when |
n.param.est |
for the case when |
correct |
for the case when |
digits |
a scalar indicating how many significant digits to print out for the parameters
associated with the hypothesized distribution. The default value is |
exact |
for the case when |
ws.method |
character string indicating which method to use when performing the
Wilk-Shapiro test for a Uniform [0,1] distribution
on the p-values from the goodness-of-fit tests on each group. Possible values
are NOTE: In the case where you are testing whether each group comes from a
Uniform [0,1] distribution (i.e., when you set
|
data.name |
character string indicating the name of the data used for the goodness-of-fit tests.
The default value is |
group.name |
character string indicating the name of the data used to create the groups.
The default value is |
parent.of.data |
character string indicating the source of the data used for the goodness-of-fit tests. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the goodness-of-fit test. |
The function gofGroupTest
performs a goodness-of-fit test for each group of
data by calling the function gofTest
. Using the p-values from these
goodness-of-fit tests, it then calls the function gofTest
with the
argument test="ws"
to test whether the p-values appear to come from a
Uniform [0,1] distribution.
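The sketch below retraces this two-stage procedure by hand for the nickel data used in the examples; the per-group call and the distribution = "unif" argument in the second stage are assumptions based on the description above, not a copy of the function's internal code.

# Stage 1: goodness-of-fit test within each group (well)
p.vals <- with(EPA.09.Ex.10.1.nickel.df,
    sapply(split(Nickel.ppb, Well), function(z) gofTest(z, test = "sw")$p.value))

# Stage 2: Wilk-Shapiro test that the per-group p-values look Uniform [0,1]
gofTest(p.vals, test = "ws", distribution = "unif")$p.value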
a list of class "gofGroup"
containing the results of the group goodness-of-fit test.
Objects of class "gofGroup"
have special printing and plotting methods.
See the help file for gofGroup.object
for details.
The Wilk-Shapiro (1968) tests for a Uniform [0, 1] distribution were introduced in the context
of testing whether several independent samples all come from normal distributions, with
possibly different means and variances. The function gofGroupTest
extends
this idea to allow you to test whether several independent samples come from the same
distribution (e.g., gamma, extreme value, etc.), with possibly different parameters.
Examples of simultaneously assessing whether several groups come from the same distribution are given in USEPA (2009) and Gibbons et al. (2009).
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlot
).
Steven P. Millard ([email protected])
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-17.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
gofTest
, gofGroup.object
, print.gofGroup
,
plot.gofGroup
, qqPlot
.
# Example 10-4 of USEPA (2009, page 10-20) gives an example of # simultaneously testing the assumption of normality for nickel # concentrations (ppb) in groundwater collected at 4 monitoring # wells over 5 months. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #4 8 Well.1 56.0 #5 10 Well.1 8.7 #6 1 Well.2 19.0 #7 3 Well.2 81.5 #8 6 Well.2 331.0 #9 8 Well.2 14.0 #10 10 Well.2 64.4 #11 1 Well.3 39.0 #12 3 Well.3 151.0 #13 6 Well.3 27.0 #14 8 Well.3 21.4 #15 10 Well.3 578.0 #16 1 Well.4 3.1 #17 3 Well.4 942.0 #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Test for a normal distribution at each well: #-------------------------------------------- gofGroup.list <- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df) gofGroup.list #Results of Group Goodness-of-Fit Test #------------------------------------- # #Test Method: Wilk-Shapiro GOF (Normal Scores) # #Hypothesized Distribution: Normal # #Data: Nickel.ppb # #Grouping Variable: Well # #Data Source: EPA.09.Ex.10.1.nickel.df # #Number of Groups: 4 # #Sample Sizes: Well.1 = 5 # Well.2 = 5 # Well.3 = 5 # Well.4 = 5 # #Test Statistic: z (G) = -3.658696 # #P-values for #Individual Tests: Well.1 = 0.03510747 # Well.2 = 0.02385344 # Well.3 = 0.01120775 # Well.4 = 0.10681461 # #P-value for #Group Test: 0.0001267509 # #Alternative Hypothesis: At least one group # does not come from a # Normal Distribution. dev.new() plot(gofGroup.list) #---------- # Test for a lognormal distribution at each well: #----------------------------------------------- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df, dist = "lnorm") #Results of Group Goodness-of-Fit Test #------------------------------------- # #Test Method: Wilk-Shapiro GOF (Normal Scores) # #Hypothesized Distribution: Lognormal # #Data: Nickel.ppb # #Grouping Variable: Well # #Data Source: EPA.09.Ex.10.1.nickel.df # #Number of Groups: 4 # #Sample Sizes: Well.1 = 5 # Well.2 = 5 # Well.3 = 5 # Well.4 = 5 # #Test Statistic: z (G) = 0.2401720 # #P-values for #Individual Tests: Well.1 = 0.6898164 # Well.2 = 0.6700394 # Well.3 = 0.3208299 # Well.4 = 0.5041375 # #P-value for #Group Test: 0.5949015 # #Alternative Hypothesis: At least one group # does not come from a # Lognormal Distribution. #---------- # Clean up rm(gofGroup.list) graphics.off()
Objects of S3 class "gofOutlier"
are returned by the EnvStats function
rosnerTest
.
Objects of S3 class "gofOutlier"
are lists that contain
information about the assumed distribution, the test statistics,
the Type I error level, and the number of outliers detected.
Required Components
The following components must be included in a legitimate list of
class "gofOutlier"
.
distribution |
a character string indicating the name of the
assumed distribution (see |
statistic |
a numeric vector with a names attribute containing the names and values of the outlier test statistic for each outlier tested. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the outlier test. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the |
alpha |
numeric scalar indicating the Type I error level. |
crit.value |
numeric vector containing the critical values associated with the test for each outlier. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the outlier test. |
data |
numeric vector containing the data actually used for the outlier test (i.e., the original data without any missing or infinite values). |
data.name |
character string indicating the name of the data object used for the goodness-of-fit test. |
all.stats |
data frame containing all of the results of the test. |
Optional Components
The following component is included when the data object
contains missing (NA
), undefined (NaN
) and/or infinite
(Inf
, -Inf
) values.
bad.obs |
numeric scalar indicating the number of missing ( |
Generic functions that have methods for objects of class
"gofOutlier"
include: print
.
Since objects of class "gofOutlier"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
rosnerTest
, print.gofOutlier
,
Goodness-of-Fit Tests.
# Create an object of class "gofOutlier", then print it out. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- c(rnorm(30, mean = 3, sd = 2), rnorm(3, mean = 10, sd = 1)) gofOutlier.obj <- rosnerTest(dat, k = 4) mode(gofOutlier.obj) #[1] "list" class(gofOutlier.obj) #[1] "gofOutlier" names(gofOutlier.obj) # [1] "distribution" "statistic" "sample.size" "parameters" # [5] "alpha" "crit.value" "n.outliers" "alternative" # [9] "method" "data" "data.name" "bad.obs" #[13] "all.stats" gofOutlier.obj #Results of Outlier Test #------------------------- # #Test Method: Rosner's Test for Outliers # #Hypothesized Distribution: Normal # #Data: dat # #Sample Size: 33 # #Test Statistics: R.1 = 2.848514 # R.2 = 3.086875 # R.3 = 3.033044 # R.4 = 2.380235 # #Test Statistic Parameter: k = 4 # #Alternative Hypothesis: Up to 4 observations are not # from the same Distribution. # #Type I Error: 5% # #Number of Outliers Detected: 3 # # i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier #1 0 3.549744 2.531011 10.7593656 33 2.848514 2.951949 TRUE #2 1 3.324444 2.209872 10.1460427 31 3.086875 2.938048 TRUE #3 2 3.104392 1.856109 8.7340527 32 3.033044 2.923571 TRUE #4 3 2.916737 1.560335 -0.7972275 25 2.380235 2.908473 FALSE #========== # Extract the data frame with all the test results #------------------------------------------------- gofOutlier.obj$all.stats # i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier #1 0 3.549744 2.531011 10.7593656 33 2.848514 2.951949 TRUE #2 1 3.324444 2.209872 10.1460427 31 3.086875 2.938048 TRUE #3 2 3.104392 1.856109 8.7340527 32 3.033044 2.923571 TRUE #4 3 2.916737 1.560335 -0.7972275 25 2.380235 2.908473 FALSE #========== # Clean up #--------- rm(dat, gofOutlier.obj)
Perform a goodness-of-fit test to determine whether a data set appears to come from a specified probability distribution or if two data sets appear to come from the same distribution.
gofTest(y, ...) ## S3 method for class 'formula' gofTest(y, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: gofTest(y, x = NULL, test = ifelse(is.null(x), "sw", "ks"), distribution = "norm", est.arg.list = NULL, alternative = "two.sided", n.classes = NULL, cut.points = NULL, param.list = NULL, estimate.params = ifelse(is.null(param.list), TRUE, FALSE), n.param.est = NULL, correct = NULL, digits = .Options$digits, exact = NULL, ws.method = "normal scores", warn = TRUE, keep.data = TRUE, data.name = NULL, data.name.x = NULL, parent.of.data = NULL, subset.expression = NULL, ...)
y |
an object containing data for the goodness-of-fit test. In the default
method, the argument |
data |
specifies an optional data frame, list or environment (or object coercible
by |
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
x |
numeric vector of values for the first sample in the case of a two-sample
Kolmogorov-Smirnov goodness-of-fit test ( |
test |
character string defining which goodness-of-fit test to perform. Possible values are:
When the argument |
distribution |
a character string denoting the distribution abbreviation. See the help file for
When When When When When When |
est.arg.list |
a list of arguments to be passed to the function estimating the distribution parameters.
For example, if When When When When |
alternative |
for the case when |
n.classes |
for the case when |
cut.points |
for the case when |
param.list |
for the case when |
estimate.params |
for the case when |
n.param.est |
for the case when |
correct |
for the case when |
digits |
for the case when |
exact |
for the case when |
ws.method |
for the case when |
warn |
logical scalar indicating whether to print a warning message when
observations with |
keep.data |
logical scalar indicating whether to return the data used for the goodness-of-fit test.
The default value is |
data.name |
character string indicating the name of the data used for argument |
data.name.x |
character string indicating the name of the data used for argument |
parent.of.data |
character string indicating the source of the data used for the goodness-of-fit test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the goodness-of-fit test. |
Shapiro-Wilk Goodness-of-Fit Test (test="sw"
).
The Shapiro-Wilk goodness-of-fit test (Shapiro and Wilk, 1965; Royston, 1992a)
is one of the most commonly used goodness-of-fit tests for normality.
You can use it to test the following hypothesized distributions:
Normal, Lognormal, Three-Parameter Lognormal,
Zero-Modified Normal, or
Zero-Modified Lognormal (Delta).
In addition, you can also use it to test the null hypothesis of any
continuous distribution that is available (see the help file for
Distribution.df
, and see explanation below).
Shapiro-Wilk W-Statistic and P-Value for Testing Normality
Let $X$ denote a random variable with cumulative distribution function (cdf) $F$. Suppose we want to test the null hypothesis that $F$ is the cdf of a normal (Gaussian) distribution with some arbitrary mean $\mu$ and standard deviation $\sigma$ against the alternative hypothesis that $F$ is the cdf of some other distribution. The table below shows the random variable for which $F$ is the assumed cdf, given the value of the argument distribution.

| Value of distribution | Distribution Name | Random Variable for which $F$ is the cdf |
|---|---|---|
| "norm" | Normal | $X$ |
| "lnorm" | Lognormal (Log-space) | $\log(X)$ |
| "lnormAlt" | Lognormal (Untransformed) | $\log(X)$ |
| "lnorm3" | Three-Parameter Lognormal | $\log(X - \gamma)$ |
| "zmnorm" | Zero-Modified Normal | $X \mid X > 0$ |
| "zmlnorm" | Zero-Modified Lognormal (Log-space) | $\log(X) \mid X > 0$ |
| "zmlnormAlt" | Zero-Modified Lognormal (Untransformed) | $\log(X) \mid X > 0$ |

Note that for the three-parameter lognormal distribution, the symbol $\gamma$ denotes the threshold parameter.
Let $\underline{x} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ denote the vector of $n$ ordered observations assumed to come from a normal distribution.

The Shapiro-Wilk W-Statistic

Shapiro and Wilk (1965) introduced the following statistic to test the null hypothesis that $F$ is the cdf of a normal distribution:

$$W = \frac{\left[ \sum_{i=1}^n a_i x_{(i)} \right]^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (1)$$

where the quantity $a_i$ is the $i$'th element of the vector $\underline{a}$ defined by:

$$\underline{a} = \frac{\underline{m}^T V^{-1}}{\left[ \underline{m}^T V^{-1} V^{-1} \underline{m} \right]^{1/2}} \;\;\;\;\;\; (2)$$

where $T$ denotes the transpose operator, and $\underline{m}$ is the vector of expected values and $V$ is the variance-covariance matrix of the order statistics of a random sample of size $n$ from a standard normal distribution. That is, the values of $\underline{a}$ are the expected values of the standard normal order statistics weighted by their variance-covariance matrix, and normalized so that

$$\underline{a}^T \underline{a} = 1 \;\;\;\;\;\; (3)$$

It can be shown that the coefficients $a_i$ are antisymmetric, that is,

$$a_i = -a_{n-i+1} \;\;\;\;\;\; (4)$$

and for odd $n$,

$$a_{(n+1)/2} = 0 \;\;\;\;\;\; (5)$$

Now because

$$\sum_{i=1}^n a_i = 0 \;\;\;\;\;\; (6)$$

and

$$\sum_{i=1}^n a_i^2 = 1 \;\;\;\;\;\; (7)$$

the $W$-statistic in Equation (1) is the same as the square of the sample product-moment correlation between the vectors $\underline{a}$ and $\underline{x}$:

$$W = r(\underline{a}, \underline{x})^2 \;\;\;\;\;\; (8)$$

where

$$r(\underline{x}, \underline{y}) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\left[ \sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2 \right]^{1/2}} \;\;\;\;\;\; (9)$$

(see the R help file for cor).

The Shapiro-Wilk $W$-statistic is also simply the ratio of two estimators of variance, and can be rewritten as

$$W = \frac{\hat{\sigma}^2_{BLUE}}{\hat{\sigma}^2_{MVUE}} \;\;\;\;\;\; (10)$$

where the numerator is the square of the best linear unbiased estimate (BLUE) of the standard deviation, and the denominator is the minimum variance unbiased estimator (MVUE) of the variance:

$$\hat{\sigma}_{BLUE} = \frac{\sum_{i=1}^n a_i x_{(i)}}{\sqrt{n-1}} \;\;\;\;\;\; (11)$$

$$\hat{\sigma}^2_{MVUE} = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} \;\;\;\;\;\; (12)$$
Small values of $W$ indicate the null hypothesis is probably not true. Shapiro and Wilk (1965) computed the values of the coefficients $\underline{a}$ and the percentage points for $W$ (based on smoothing the empirical null distribution of $W$) for sample sizes up to 50. Computation of the $W$-statistic for larger sample sizes can be cumbersome, since computation of the coefficients $\underline{a}$ requires storage of at least $n(n+1)/2$ reals followed by an $n \times n$ matrix inversion (Royston, 1992a).
The Shapiro-Francia W'-Statistic

Shapiro and Francia (1972) introduced a modification of the $W$-test that depends only on the expected values of the order statistics ($\underline{m}$) and not on the variance-covariance matrix ($V$):

$$W' = \frac{\left[ \sum_{i=1}^n b_i x_{(i)} \right]^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (13)$$

where the quantity $b_i$ is the $i$'th element of the vector $\underline{b}$ defined by:

$$\underline{b} = \frac{\underline{m}}{\left[ \underline{m}^T \underline{m} \right]^{1/2}} \;\;\;\;\;\; (14)$$

Several authors, including Ryan and Joiner (1973), Filliben (1975), and Weisberg and Bingham (1975), note that the $W'$-statistic is intuitively appealing because it is the squared Pearson correlation coefficient associated with a normal probability plot. That is, it is the squared correlation between the ordered sample values $x_{(i)}$ and the expected normal order statistics $m_i$:

$$W' = r(\underline{m}, \underline{x})^2 \;\;\;\;\;\; (15)$$

Shapiro and Francia (1972) present a table of empirical percentage points for $W'$ based on a Monte Carlo simulation. It can be shown that the asymptotic null distributions of $W$ and $W'$ are identical, but convergence is very slow (Verrill and Johnson, 1988).
The Weisberg-Bingham Approximation to the W'-Statistic

Weisberg and Bingham (1975) introduced an approximation of the Shapiro-Francia $W'$-statistic that is easier to compute. They suggested using Blom scores (Blom, 1958, pp. 68–75) to approximate the elements of $\underline{m}$:

$$\tilde{W}' = \frac{\left[ \sum_{i=1}^n c_i x_{(i)} \right]^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (16)$$

where the quantity $c_i$ is the $i$'th element of the vector $\underline{c}$ defined by:

$$\underline{c} = \frac{\tilde{\underline{m}}}{\left[ \tilde{\underline{m}}^T \tilde{\underline{m}} \right]^{1/2}} \;\;\;\;\;\; (17)$$

and

$$\tilde{m}_i = \Phi^{-1}\!\left( \frac{i - 3/8}{n + 1/4} \right) \;\;\;\;\;\; (18)$$

and $\Phi$ denotes the standard normal cdf. That is, the values of the elements of $\underline{m}$ in Equation (14) are replaced with their estimates based on the usual plotting positions for a normal distribution.
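The Weisberg-Bingham approximation is easy to compute directly. The following is a minimal sketch (not part of the package examples), assuming only base R; ppoints(n, a = 3/8) produces the Blom plotting positions $(i - 3/8)/(n + 1/4)$ used in Equation (18).

# Minimal sketch: Weisberg-Bingham approximation to the Shapiro-Francia W'-statistic
set.seed(47)
x <- rnorm(20, mean = 3, sd = 2)

n <- length(x)
m.tilde <- qnorm(ppoints(n, a = 3/8))   # Blom scores: Phi^{-1}((i - 3/8)/(n + 1/4))
W.tilde <- cor(sort(x), m.tilde)^2      # squared correlation with the ordered data
W.tilde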
Royston's Approximation to the Shapiro-Wilk W-Test

Royston (1992a) presents an approximation for the coefficients $\underline{a}$ necessary to compute the Shapiro-Wilk $W$-statistic, and also a transformation of the $W$-statistic that has approximately a standard normal distribution under the null hypothesis.

Noting that, up to a constant, the components of $\underline{b}$ in Equation (14) and $\underline{c}$ in Equation (17) differ from those of $\underline{a}$ in Equation (2) mainly in the first and last two components, Royston (1992a) used the approximation $\underline{a} \approx \underline{c}$ as the basis for approximating the last two (and hence, by antisymmetry, the first two) components of $\underline{a}$ by polynomial (quintic) regression analysis in $1/\sqrt{n}$; the explicit polynomial coefficients are given in Royston (1992a). The other components are computed as:

$$a_i = \frac{\tilde{m}_i}{\sqrt{\phi}}$$

for $i = 2, \ldots, n-1$ if $n \le 5$, or $i = 3, \ldots, n-2$ if $n \ge 6$, where $\tilde{m}_i$ is the Blom score defined in Equation (18) and $\phi$ is a normalizing constant whose form depends on whether $n \le 5$ or $n \ge 6$ (see Royston, 1992a).

Royston (1992a) found his approximation to $\underline{a}$ to be accurate to the third decimal place over all values of $n$ and selected values of $i$, and also found that critical percentage points of $W$ based on his approximation agreed closely with the exact critical percentage points calculated by Verrill and Johnson (1988).
Transformation of the Null Distribution of W to Normality

In order to compute a p-value associated with a particular value of $W$, Royston (1992a) approximated the distribution of $(1 - W)$ by a three-parameter lognormal distribution for $4 \le n \le 11$, and the upper half of the distribution of $(1 - W)$ by a two-parameter lognormal distribution for $12 \le n \le 2000$. In both cases the result is a standardized statistic $z$ (a logarithmic transformation of $1 - W$, centered and scaled by quantities that are functions of $n$) that is approximately standard normal under the null hypothesis, so the p-value associated with $W$ is given by:

$$p = 1 - \Phi(z)$$

The quantities necessary to compute $z$ for $4 \le n \le 11$ and for $12 \le n \le 2000$ are given in Royston (1992a). For the last approximation when $12 \le n \le 2000$, Royston (1992a) claims this approximation is actually valid for sample sizes up to $n = 5000$.
Modification for the Three-Parameter Lognormal Distribution

When distribution="lnorm3", the function gofTest assumes the vector $\underline{x}$ is a random sample from a three-parameter lognormal distribution. It estimates the threshold parameter via the zero-skewness method (see elnorm3), and then performs the Shapiro-Wilk goodness-of-fit test for normality on $\log(\underline{x} - \hat{\gamma})$, where $\hat{\gamma}$ is the estimated threshold parameter. Because the threshold parameter has to be estimated, however, the p-value associated with the computed z-statistic will tend to be conservative (larger than it should be under the null hypothesis). Royston (1992b) proposed a correction to the z-statistic that accounts for the estimation of the threshold parameter (the explicit formulas, which depend on the sample size and the estimated threshold, are given in Royston, 1992b). Denoting the corrected statistic by $z'$, the p-value associated with this test is then given by:

$$p = 1 - \Phi(z')$$
Testing Goodness-of-Fit for Any Continuous Distribution

The function gofTest extends the Shapiro-Wilk test to test for goodness-of-fit for any continuous distribution by using the idea of Chen and Balakrishnan (1995), who proposed a general purpose approximate goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality. The function gofTest modifies the approach of Chen and Balakrishnan (1995) by using the same first 2 steps, and then applying the Shapiro-Wilk test (a minimal R sketch of these steps follows the list):

1. Let $\underline{x} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ denote the vector of $n$ ordered observations. Compute cumulative probabilities for each $x_{(i)}$ based on the cumulative distribution function for the hypothesized distribution. That is, compute $p_i = F(x_{(i)}, \hat{\theta})$, where $F(x, \theta)$ denotes the hypothesized cumulative distribution function with parameter(s) $\theta$, and $\hat{\theta}$ denotes the estimated parameter(s).

2. Compute standard normal deviates based on the computed cumulative probabilities: $y_{(i)} = \Phi^{-1}(p_i)$.

3. Perform the Shapiro-Wilk goodness-of-fit test on the $y_{(i)}$'s.
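To make the two preliminary steps concrete, here is a minimal sketch (not from the package) of the Chen-Balakrishnan transform for a hypothesized gamma distribution. The gamma parameter estimates below are simple method-of-moments values used purely for illustration; gofTest itself estimates parameters with whatever method is implied by est.arg.list.

# Minimal sketch of the Chen & Balakrishnan (1995) transform followed by shapiro.test().
set.seed(47)
dat <- rgamma(20, shape = 2, scale = 3)

shape.hat <- mean(dat)^2 / var(dat)      # method-of-moments estimates (illustrative only)
scale.hat <- var(dat) / mean(dat)

p <- pgamma(sort(dat), shape = shape.hat, scale = scale.hat)  # step 1: p_i = F(x_(i), theta-hat)
y <- qnorm(p)                                                 # step 2: standard normal deviates
shapiro.test(y)                                               # step 3: Shapiro-Wilk on the y's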
Shapiro-Francia Goodness-of-Fit Test (test="sf"
).
The Shapiro-Francia goodness-of-fit test (Shapiro and Francia, 1972;
Weisberg and Bingham, 1975; Royston, 1992c) is also one of the most commonly
used goodness-of-fit tests for normality. You can use it to test the following
hypothesized distributions:
Normal, Lognormal, Zero-Modified Normal,
or Zero-Modified Lognormal (Delta). In addition,
you can also use it to test the null hypothesis of any continuous distribution
that is available (see the help file for Distribution.df
). See the
section Testing Goodness-of-Fit for Any Continuous Distribution above for
an explanation of how this is done.
Royston's Transformation of the Shapiro-Francia W'-Statistic to Normality

Equation (13) above gives the formula for the Shapiro-Francia W'-statistic, and Equation (16) above gives the formula for the Weisberg-Bingham approximation to the W'-statistic (denoted $\tilde{W}'$). Royston (1992c) presents an algorithm to transform the $\tilde{W}'$-statistic so that its null distribution is approximately a standard normal. For $5 \le n \le 5000$, Royston (1992c) approximates the distribution of $(1 - \tilde{W}')$ by a lognormal distribution. Setting

$$z = \frac{\log(1 - \tilde{W}') - \mu}{\sigma}$$

the p-value associated with $\tilde{W}'$ is given by:

$$p = 1 - \Phi(z)$$

The quantities $\mu$ and $\sigma$ (both functions of the sample size $n$) necessary to compute $z$ are given in Royston (1992c).
Testing Goodness-of-Fit for Any Continuous Distribution

The function gofTest extends the Shapiro-Francia test to test for goodness-of-fit for any continuous distribution by using the idea of Chen and Balakrishnan (1995), who proposed a general purpose approximate goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality. The function gofTest modifies the approach of Chen and Balakrishnan (1995) by using the same first 2 steps, and then applying the Shapiro-Francia test:

1. Let $\underline{x} = (x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ denote the vector of $n$ ordered observations. Compute cumulative probabilities for each $x_{(i)}$ based on the cumulative distribution function for the hypothesized distribution. That is, compute $p_i = F(x_{(i)}, \hat{\theta})$, where $F(x, \theta)$ denotes the hypothesized cumulative distribution function with parameter(s) $\theta$, and $\hat{\theta}$ denotes the estimated parameter(s).

2. Compute standard normal deviates based on the computed cumulative probabilities: $y_{(i)} = \Phi^{-1}(p_i)$.

3. Perform the Shapiro-Francia goodness-of-fit test on the $y_{(i)}$'s.
Probability Plot Correlation Coefficient (PPCC) Goodness-of-Fit Test (test="ppcc"
).
The PPCC goodness-of-fit test (Filliben, 1975; Looney and Gulledge, 1985) can be
used to test the following hypothesized distributions:
Normal, Lognormal,
Zero-Modified Normal, or
Zero-Modified Lognormal (Delta). In addition,
you can also use it to test the null hypothesis of any continuous distribution that
is available (see the help file for Distribution.df
).
Filliben (1975) proposed using the correlation coefficient $r$ from a normal probability plot to perform a goodness-of-fit test for normality, and he provided a table of critical values for $r$ under the null hypothesis of normality for sample sizes between 3 and 100. Vogel (1986) provided an additional table for sample sizes between 100 and 10,000.
Looney and Gulledge (1985) investigated the characteristics of Filliben's
probability plot correlation coefficient (PPCC) test using the plotting position
formulas given in Filliben (1975), as well as three other plotting position
formulas: Hazen plotting positions, Weibull plotting positions, and Blom plotting
positions (see the help file for qqPlot
for an explanation of these
plotting positions). They concluded that the PPCC test based on Blom plotting
positions performs slightly better than tests based on other plotting positions, and
they provide a table of empirical percentage points for the distribution of
based on Blom plotting positions.
The function gofTest
computes the PPCC test statistic using Blom
plotting positions. It can be shown that the square of this statistic is
equivalent to the Weisberg-Bingham Approximation to the Shapiro-Francia
W'-Test (Weisberg and Bingham, 1975; Royston, 1993). Thus the PPCC
goodness-of-fit test is equivalent to the Shapiro-Francia goodness-of-fit test.
Anderson-Darling Goodness-of-Fit Test (test="ad"
).
The Anderson-Darling goodness-of-fit test (Stephens, 1986a; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="ad"
, the function gofTest
calls the function
ad.test
in the package nortest. Documentation from that
package is as follows:
The Anderson-Darling test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is:

$$A = -n - \frac{1}{n} \sum_{i=1}^{n} [2i - 1]\left[ \ln(p_{(i)}) + \ln(1 - p_{(n-i+1)}) \right]$$

where $p_{(i)} = \Phi\!\left( [x_{(i)} - \bar{x}]/s \right)$. Here, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\bar{x}$ and $s$ are the mean and standard deviation of the data values. The p-value is computed from the modified statistic $Z = A \,(1 + 0.75/n + 2.25/n^2)$ according to Table 4.9 in Stephens [(1986a)].
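Because the documentation above states that gofTest simply wraps nortest::ad.test here, one quick way to see the connection is to call the nortest function directly. This sketch assumes the nortest package is installed; the agreement with gofTest is an expectation implied by the wrapping, not a result reproduced here.

# Minimal sketch (assumes the nortest package is installed)
library(nortest)
set.seed(47)
dat <- rnorm(25, mean = 10, sd = 2)
ad.test(dat)                            # Anderson-Darling test from nortest
# EnvStats::gofTest(dat, test = "ad")   # documented to call ad.test() internally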
Cramer-von Mises Goodness-of-Fit Test (test="cvm"
).
The Cramer-von Mises goodness-of-fit test (Stephens, 1986a; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="cvm"
, the function gofTest
calls the function
cvm.test
in the package nortest. Documentation from that
package is as follows:
The Cramer-von Mises test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is:

$$W = \frac{1}{12n} + \sum_{i=1}^{n} \left( p_{(i)} - \frac{2i - 1}{2n} \right)^2$$

where $p_{(i)} = \Phi\!\left( [x_{(i)} - \bar{x}]/s \right)$. Here, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\bar{x}$ and $s$ are the mean and standard deviation of the data values. The p-value is computed from the modified statistic $Z = W \,(1 + 0.5/n)$ according to Table 4.9 in Stephens [(1986a)].
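As a sanity check on the formula above, the Cramer-von Mises statistic can be computed in a few lines of base R. The comparison with nortest::cvm.test is an assumption based on the wrapping described above.

# Minimal sketch: Cramer-von Mises statistic computed by hand
set.seed(47)
dat <- rnorm(25, mean = 10, sd = 2)

x <- sort(dat)
n <- length(x)
p <- pnorm((x - mean(x)) / sd(x))
W <- 1/(12 * n) + sum((p - (2 * seq_len(n) - 1) / (2 * n))^2)
W
# nortest::cvm.test(dat)$statistic   # should match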
Lilliefors Goodness-of-Fit Test (test="lillie"
).
The Lilliefors goodness-of-fit test (Stephens, 1974; Dallal and Wilkinson, 1986; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="lillie"
, the function gofTest
calls the function
lillie.test
in the package nortest. Documentation from that
package is as follows:
The Lilliefors (Kolmogorov-Smirnov) test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is the maximal absolute difference between the empirical and hypothesized cumulative distribution functions. It may be computed as $D = \max\{D^{+}, D^{-}\}$ with

$$D^{+} = \max_{i=1,\ldots,n} \left\{ \frac{i}{n} - p_{(i)} \right\}, \;\;\;\;\;\; D^{-} = \max_{i=1,\ldots,n} \left\{ p_{(i)} - \frac{i-1}{n} \right\}$$

where $p_{(i)} = \Phi\!\left( [x_{(i)} - \bar{x}]/s \right)$. Here, $\Phi$ is the cumulative distribution function of the standard normal distribution, and $\bar{x}$ and $s$ are the mean and standard deviation of the data values. The p-value is computed from the Dallal-Wilkinson (1986) formula, which is claimed to be only reliable when the p-value is smaller than 0.1. If the Dallal-Wilkinson p-value turns out to be greater than 0.1, then the p-value is computed from the distribution of the modified statistic $Z = D \left( \sqrt{n} - 0.01 + 0.85/\sqrt{n} \right)$; see Stephens (1974), the actual p-value formula being obtained by a simulation and approximation process.
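The same kind of hand computation works for the Lilliefors statistic; again, agreement with nortest::lillie.test is the expectation implied by the wrapper described above rather than something demonstrated here.

# Minimal sketch: Lilliefors D statistic computed by hand
set.seed(47)
dat <- rnorm(25, mean = 10, sd = 2)

x <- sort(dat)
n <- length(x)
p <- pnorm((x - mean(x)) / sd(x))
D.plus  <- max(seq_len(n) / n - p)
D.minus <- max(p - (seq_len(n) - 1) / n)
D <- max(D.plus, D.minus)
D
# nortest::lillie.test(dat)$statistic   # should match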
Zero-Skew Goodness-of-Fit Test (test="skew"
).
The Zero-skew goodness-of-fit test (D'Agostino, 1970) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="skew"
, the function gofTest
tests the null hypothesis
that the skew of the distribution is 0:
where
and the quantity denotes the
'th moment about the mean
(also called the
'th central moment). The quantity
is called the coefficient of skewness, and is estimated by:
where
denotes the 'th sample central moment.
The possible alternative hypotheses are:
which correspond to alternative="two-sided"
, alternative="less"
, and alternative="greater"
, respectively.
To test the null hypothesis of zero skew, D'Agostino (1970) derived an
approximation to the distribution of under the null hypothesis of
zero-skew, assuming the observations comprise a random sample from a normal
(Gaussian) distribution. Based on D'Agostino's approximation, the statistic
shown below is assumed to follow a standard normal distribution and is
used to compute the p-value associated with the test of
:
where
When the sample size is at least 150, a simpler approximation may be
used in which
in Equation (61) is assumed to follow a standard normal
distribution and is used to compute the p-value associated with the hypothesis
test.
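The sample skewness coefficient at the heart of this test is easy to compute directly. The sketch below uses the published D'Agostino (1970) normalizing transformation; whether gofTest uses exactly these intermediate quantities is an assumption, so treat this as an illustration of the idea rather than a re-implementation of the function.

# Minimal sketch: sample skewness and D'Agostino's (1970) transformation to normality.
# Follows the published formulas; not guaranteed to match gofTest() line for line.
set.seed(47)
x <- rnorm(50)
n <- length(x)

m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
sqrt.b1 <- m3 / m2^(3/2)                     # sample coefficient of skewness

Y  <- sqrt.b1 * sqrt((n + 1) * (n + 3) / (6 * (n - 2)))
b2 <- 3 * (n^2 + 27 * n - 70) * (n + 1) * (n + 3) /
      ((n - 2) * (n + 5) * (n + 7) * (n + 9))
W2 <- -1 + sqrt(2 * (b2 - 1))
delta <- 1 / sqrt(log(sqrt(W2)))
alpha <- sqrt(2 / (W2 - 1))
Z <- delta * log(Y / alpha + sqrt((Y / alpha)^2 + 1))

2 * pnorm(-abs(Z))                           # two-sided p-value under H0: zero skew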
Kolmogorov-Smirnov Goodness-of-Fit Test (test="ks"
).
When test="ks"
, the function gofTest
calls the R function
ks.test
to compute the test statistic and p-value. Note that for the one-sample case, the distribution parameters should be pre-specified and not estimated from the data. If the distribution parameters are estimated from the data, you will receive a warning that in this case the test is very conservative (Type I error smaller than assumed; high Type II error).
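For instance, a one-sample Kolmogorov-Smirnov test against a fully specified gamma cdf can be run directly with base R's ks.test; the corresponding gofTest call (shown in the examples further below) is expected to give the same statistic.

# Minimal sketch: one-sample K-S test against a fully specified gamma distribution
set.seed(47)
dat <- rgamma(20, shape = 2, scale = 3)
ks.test(dat, "pgamma", shape = 2, scale = 3)
# gofTest(dat, test = "ks", distribution = "gamma",
#         param.list = list(shape = 2, scale = 3))   # equivalent call (see examples below)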
ProUCL Kolmogorov-Smirnov Goodness-of-Fit Test for Gamma (test="proucl.ks.gamma"
).
When test="proucl.ks.gamma"
, the function gofTest
calls the R function
ks.test
to compute the Kolmogorov-Smirnov test statistic based on the
maximum likelihood estimates of the shape and scale parameters (see egamma
).
The p-value is computed based on the simulated critical values given in
ProUCL.Crit.Vals.for.KS.Test.for.Gamma.array
(USEPA, 2015).
The sample size must be between 5 and 1000, and the value of the maximum likelihood
estimate of the shape parameter must be between 0.025 and 50. The critical value
for the test statistic is computed using the simulated critical values and
linear interpolation.
ProUCL Anderson-Darling Goodness-of-Fit Test for Gamma (test="proucl.ad.gamma"
).
When test="proucl.ad.gamma"
, the function gofTest
computes the
Anderson-Darling test statistic (Stephens, 1986a, p.101) based on the
maximum likelihood estimates of the shape and scale parameters (see egamma
).
The p-value is computed based on the simulated critical values given in
ProUCL.Crit.Vals.for.AD.Test.for.Gamma.array
(USEPA, 2015).
The sample size must be between 5 and 1000, and the value of the maximum likelihood
estimate of the shape parameter must be between 0.025 and 50. The critical value
for the test statistic is computed using the simulated critical values and
linear interpolation.
Chi-Squared Goodness-of-Fit Test (test="chisq"
).
The method used by gofTest
is a modification of what is used for chisq.test
.
If the hypothesized distribution function is completely specified, the degrees of freedom are $m - 1$, where $m$ denotes the number of classes. If any parameters are estimated, the degrees of freedom depend on the method of estimation. The function gofTest follows the convention of computing degrees of freedom as $m - 1 - k$, where $k$ is the number of parameters estimated. It can be shown that if the parameters are estimated by maximum likelihood, the degrees of freedom are bounded between $m - 1 - k$ and $m - 1$. Therefore, especially when the sample size is small, it is important to compare the test statistic to the chi-squared distribution with both $m - 1 - k$ and $m - 1$ degrees of freedom. See Kendall and Stuart (1991, Chapter 30) for a more complete discussion.
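A quick way to act on this advice is to recompute the p-value at both degrees-of-freedom bounds. The statistic and class count below are borrowed from the chi-squared example later in this help file (statistic = 1.2, m = 4 classes, k = 2 estimated parameters) purely for illustration.

# Minimal sketch: compare the chi-square GOF statistic to both df bounds
stat <- 1.2
m <- 4
k <- 2
pchisq(stat, df = m - 1 - k, lower.tail = FALSE)   # df = 1 (parameters estimated)
pchisq(stat, df = m - 1,     lower.tail = FALSE)   # df = 3 (completely specified)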
The distribution theory of chi-square statistics is a large sample theory.
The expected cell counts are assumed to be at least moderately large.
As a rule of thumb, each expected cell count should be at least 5. Although authors have found this rule
to be conservative (especially when the class probabilities are not too different
from each other), the user should regard p-values with caution when expected cell
counts are small.
Wilk-Shapiro Goodness-of-Fit Test for Uniform [0, 1] Distribution (test="ws"
).
Wilk and Shapiro (1968) suggested this test in the context of jointly testing several independent samples for normality simultaneously. If $P_1, P_2, \ldots, P_n$ denote the p-values associated with the tests for normality of $n$ independent samples, then under the null hypothesis that all $n$ samples come from a normal distribution, the p-values are a random sample of $n$ observations from a Uniform [0,1] distribution, that is, a Uniform distribution with minimum 0 and maximum 1. Wilk and Shapiro (1968) suggested two different methods for testing whether the p-values come from a Uniform [0, 1] distribution (a short numerical sketch of both statistics follows these descriptions):

Test Based on Normal Scores. Under the null hypothesis, the normal scores

$$G_i = \Phi^{-1}(P_i), \;\;\;\;\;\; i = 1, 2, \ldots, n$$

are a random sample of $n$ observations from a standard normal distribution. Wilk and Shapiro (1968) denote the $i$'th normal score by $G_i$ and note that under the null hypothesis, the quantity $G$ defined as

$$G = \frac{1}{\sqrt{n}} \sum_{i=1}^n G_i$$

has a standard normal distribution. Wilk and Shapiro (1968) were interested in the alternative hypothesis that some of the $n$ independent samples did not come from a normal distribution and hence would be associated with smaller p-values than expected under the null hypothesis, which translates to the alternative that the cdf for the distribution of the p-values is greater than the cdf of a Uniform [0, 1] distribution (alternative="greater"). In terms of the test statistic $G$, this alternative hypothesis would tend to make $G$ smaller than expected, so the p-value is given by $p = \Phi(G)$. For the one-sided lower alternative that the cdf for the distribution of p-values is less than the cdf for a Uniform [0, 1] distribution, the p-value is given by $p = 1 - \Phi(G)$.
Test Based on Chi-Square Scores. Under the null hypothesis, the chi-square scores

$$C_i = -2 \log(P_i), \;\;\;\;\;\; i = 1, 2, \ldots, n$$

are a random sample of $n$ observations from a chi-square distribution with 2 degrees of freedom (Fisher, 1950). Wilk and Shapiro (1968) denote the $i$'th chi-square score by $C_i$ and note that under the null hypothesis, the quantity $C$ defined as

$$C = \sum_{i=1}^n C_i$$

has a chi-square distribution with $2n$ degrees of freedom. Wilk and Shapiro (1968) were interested in the alternative hypothesis that some of the $n$ independent samples did not come from a normal distribution and hence would be associated with smaller p-values than expected under the null hypothesis, which translates to the alternative that the cdf for the distribution of the p-values is greater than the cdf of a Uniform [0, 1] distribution (alternative="greater"). In terms of the test statistic $C$, this alternative hypothesis would tend to make $C$ larger than expected, so the p-value is given by

$$p = 1 - F_{2n}(C)$$

where $F_{\nu}$ denotes the cumulative distribution function of the chi-square distribution with $\nu$ degrees of freedom. For the one-sided lower alternative that the cdf for the distribution of p-values is less than the cdf for a Uniform [0, 1] distribution, the p-value is given by $p = F_{2n}(C)$.
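For concreteness, both combination statistics can be computed from a vector of per-sample p-values in a couple of lines; the p-values below are made up purely for illustration, and the symbols G and C follow the notation used above.

# Minimal sketch of the two Wilk-Shapiro combination statistics, computed from
# hypothetical p-values for 5 independent normality tests (values are made up).
pvals <- c(0.82, 0.45, 0.10, 0.03, 0.66)
k <- length(pvals)

G <- sum(qnorm(pvals)) / sqrt(k)                     # normal-scores statistic
p.G <- pnorm(G)                                      # p-value, alternative = "greater"

C <- -2 * sum(log(pvals))                            # chi-square-scores (Fisher) statistic
p.C <- pchisq(C, df = 2 * k, lower.tail = FALSE)     # p-value, alternative = "greater"

c(G = G, p.G = p.G, C = C, p.C = p.C)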
a list of class "gof"
containing the results of the goodness-of-fit test, unless
the two-sample
Kolmogorov-Smirnov test is used, in which case the value is a list of
class "gofTwoSample"
. Objects of class "gof"
and "gofTwoSample"
have special printing and plotting methods. See the help files for gof.object
and gofTwoSample.object
for details.
The Shapiro-Wilk test (Shapiro and Wilk, 1965) and the Shapiro-Francia test (Shapiro and Francia, 1972) are probably the two most commonly used hypothesis tests to test departures from normality. The Shapiro-Wilk test is most powerful at detecting short-tailed (platykurtic) and skewed distributions, and least powerful against symmetric, moderately long-tailed (leptokurtic) distributions. Conversely, the Shapiro-Francia test is more powerful against symmetric long-tailed distributions and less powerful against short-tailed distributions (Royston, 1992b; 1993). In general, the Shapiro-Wilk and Shapiro-Francia tests outperform the Anderson-Darling test, which in turn outperforms the Cramer-von Mises test, which in turn outperforms the Lilliefors test (Stephens, 1986a; Razali and Wah, 2011; Romao et al., 2010).
The zero-skew goodness-of-fit test for normality is one of several tests that have
been proposed to test the assumption of a normal distribution (D'Agostino, 1986b).
This test has been included mainly because it is called by elnorm3
.
Usually, the Shapiro-Wilk or Shapiro-Francia test is preferred to this test, unless
the direction of the alternative to normality (e.g., positive skew) is known
(D'Agostino, 1986b, pp. 405–406).
Kolmogorov (1933) introduced a goodness-of-fit test to test the hypothesis that a random sample of $n$ observations $\underline{x}$ comes from a specific hypothesized distribution with cumulative distribution function $H$. This test is now usually called the one-sample Kolmogorov-Smirnov goodness-of-fit test. Smirnov (1939) introduced a goodness-of-fit test to test the hypothesis that a random sample of $n$ observations $\underline{x}$ comes from the same distribution as a random sample of $m$ observations $\underline{y}$. This test is now usually called the two-sample Kolmogorov-Smirnov goodness-of-fit test. Both tests are based on the maximum
vertical distance between two cumulative distribution functions. For the one-sample problem
with a small sample size, the Kolmogorov-Smirnov test may be preferred over the chi-squared
goodness-of-fit test since the KS-test is exact, while the chi-squared test is based on
an asymptotic approximation.
The chi-squared test, introduced by Pearson in 1900, is the oldest and best known goodness-of-fit test. The idea is to reduce the goodness-of-fit problem to a multinomial setting by comparing the observed cell counts with their expected values under the null hypothesis. Grouping the data sacrifices information, especially if the hypothesized distribution is continuous. On the other hand, chi-squared tests can be applied to any type of variable: continuous, discrete, or a combination of these.
The Wilk-Shapiro (1968) tests for a Uniform [0, 1] distribution were introduced in the context
of testing whether several independent samples all come from normal distributions, with
possibly different means and variances. The function gofGroupTest
extends
this idea to allow you to test whether several independent samples come from the same
distribution (e.g., gamma, extreme value, etc.), with possibly different parameters.
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlot
).
Steven P. Millard ([email protected])
Juergen Gross and Uwe Ligges (authors of the Anderson-Darling, Cramer-von Mises, and Lilliefors tests called from the package nortest).
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of g1.
Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality.
Empirical Results for the Distributions of b2 and √b1.
Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of √b1.
Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.
Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell' Istituto Italiano degle Attuari 4, 83-91.
Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). doi:10.18637/jss.v008.i18.
Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, pp.63-95.
Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.
Razali, N.M., and Y.B. Wah. (2011). Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors, and Anderson-Darling Tests. Journal of Statistical Modeling and Analytics 2(1), 21–33.
Romao, X., Delgado, R., and A. Costa. (2010). An Empirical Power Comparison of Univariate Goodness-of-Fit Tests for Normality. Journal of Statistical Computation and Simulation 80(5), 545–591.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.
Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.
Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.
Stephens, M.A. (1974). EDF Statistics for Goodness of Fit and Some Comparisons. Journal of the American Statistical Association 69, 730-737.
Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
Thode Jr., H.C. (2002). Testing for Normality. Marcel Dekker, New York.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
rosnerTest
, gof.object
, print.gof
,
plot.gof
,
shapiro.test
, ks.test
, chisq.test
,
Normal, Lognormal, Lognormal3,
Zero-Modified Normal, Zero-Modified Lognormal (Delta),
enorm
, elnorm
, elnormAlt
,
elnorm3
, ezmnorm
, ezmlnorm
,
ezmlnormAlt
, qqPlot
.
# Generate 20 observations from a gamma distribution with # parameters shape = 2 and scale = 3 then run various # goodness-of-fit tests. # (Note: the call to set.seed lets you reproduce this example.) set.seed(47) dat <- rgamma(20, shape = 2, scale = 3) # Shapiro-Wilk generalized goodness-of-fit test #---------------------------------------------- gof.list <- gofTest(dat, distribution = "gamma") gof.list #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Test Statistic: W = 0.9834958 # #Test Statistic Parameter: n = 20 # #P-value: 0.970903 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. dev.new() plot(gof.list) #---------- # Redo the example above, but use the bias-corrected mle gofTest(dat, distribution = "gamma", est.arg.list = list(method = "bcmle")) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 1.656376 # scale = 4.676680 # #Estimation Method: bcmle # #Data: dat # #Sample Size: 20 # #Test Statistic: W = 0.9834346 # #Test Statistic Parameter: n = 20 # #P-value: 0.9704046 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. #---------- # Komogorov-Smirnov goodness-of-fit test (pre-specified parameters) #------------------------------------------------------------------ gofTest(dat, test = "ks", distribution = "gamma", param.list = list(shape = 2, scale = 3)) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Kolmogorov-Smirnov GOF # #Hypothesized Distribution: Gamma(shape = 2, scale = 3) # #Data: dat # #Sample Size: 20 # #Test Statistic: ks = 0.2313878 # #Test Statistic Parameter: n = 20 # #P-value: 0.2005083 # #Alternative Hypothesis: True cdf does not equal the # Gamma(shape = 2, scale = 3) # Distribution. #---------- # ProUCL Version of Komogorov-Smirnov goodness-of-fit test # for a Gamma Distribution (estimated parameters) #--------------------------------------------------------- gofTest(dat, test = "proucl.ks.gamma", distribution = "gamma") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: ProUCL Kolmogorov-Smirnov Gamma GOF # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: MLE # #Data: dat # #Sample Size: 20 # #Test Statistic: D = 0.0988692 # #Test Statistic Parameter: n = 20 # #Critical Values: D.0.01 = 0.228 # D.0.05 = 0.196 # D.0.10 = 0.180 # #P-value: >= 0.10 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. #---------- # Chi-squared goodness-of-fit test (estimated parameters) #-------------------------------------------------------- gofTest(dat, test = "chisq", distribution = "gamma", n.classes = 4) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Chi-square GOF # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 1.909462 # scale = 4.056819 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Test Statistic: Chi-square = 1.2 # #Test Statistic Parameter: df = 1 # #P-value: 0.2733217 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. 
#---------- # Clean up rm(dat, gof.list) graphics.off() #-------------------------------------------------------------------- # Example 10-2 of USEPA (2009, page 10-14) gives an example of # using the Shapiro-Wilk test to test the assumption of normality # for nickel concentrations (ppb) in groundwater collected over # 4 years. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #4 8 Well.1 56.0 #5 10 Well.1 8.7 #6 1 Well.2 19.0 #7 3 Well.2 81.5 #8 6 Well.2 331.0 #9 8 Well.2 14.0 #10 10 Well.2 64.4 #11 1 Well.3 39.0 #12 3 Well.3 151.0 #13 6 Well.3 27.0 #14 8 Well.3 21.4 #15 10 Well.3 578.0 #16 1 Well.4 3.1 #17 3 Well.4 942.0 #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Test for a normal distribution: #-------------------------------- gof.list <- gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df) gof.list #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Normal # #Estimated Parameter(s): mean = 169.5250 # sd = 259.7175 # #Estimation Method: mvue # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Statistic: W = 0.6788888 # #Test Statistic Parameter: n = 20 # #P-value: 2.17927e-05 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. dev.new() plot(gof.list) #---------- # Test for a lognormal distribution: #----------------------------------- gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df, dist = "lnorm") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): meanlog = 3.918529 # sdlog = 1.801404 # #Estimation Method: mvue # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Statistic: W = 0.978946 # #Test Statistic Parameter: n = 20 # #P-value: 0.9197735 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. #---------- # Test for a lognormal distribution, but use the # Mean and CV parameterization: #----------------------------------------------- gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df, dist = "lnormAlt") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): mean = 213.415628 # cv = 2.809377 # #Estimation Method: mvue # #Data: Nickel.ppb # #Data Source: EPA.09.Ex.10.1.nickel.df # #Sample Size: 20 # #Test Statistic: W = 0.978946 # #Test Statistic Parameter: n = 20 # #P-value: 0.9197735 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. #---------- # Clean up rm(gof.list) graphics.off() #--------------------------------------------------------------------------- # Generate 20 observations from a normal distribution with mean=3 and sd=2, and # generate 10 observaions from a normal distribution with mean=2 and sd=2 then # test whether these sets of observations come from the same distribution. # (Note: the call to set.seed simply allows you to reproduce this example.) 
set.seed(300) dat1 <- rnorm(20, mean = 3, sd = 2) dat2 <- rnorm(10, mean = 1, sd = 2) gofTest(x = dat1, y = dat2, test = "ks") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: 2-Sample K-S GOF # #Hypothesized Distribution: Equal # #Data: x = dat1 # y = dat2 # #Sample Sizes: n.x = 20 # n.y = 10 # #Test Statistic: ks = 0.7 # #Test Statistic Parameters: n = 20 # m = 10 # #P-value: 0.001669561 # #Alternative Hypothesis: The cdf of 'dat1' does not equal # the cdf of 'dat2'. #---------- # Clean up rm(dat1, dat2)
Perform a goodness-of-fit test to determine whether a data set appears to come from a normal distribution, lognormal distribution, or lognormal distribution (alternative parameterization) based on a sample of data that has been subjected to Type I or Type II censoring.
gofTestCensored(x, censored, censoring.side = "left", test = "sf", distribution = "norm", est.arg.list = NULL, prob.method = "hirsch-stedinger", plot.pos.con = 0.375, keep.data = TRUE, data.name = NULL, censoring.name = NULL)
gofTestCensored(x, censored, censoring.side = "left", test = "sf", distribution = "norm", est.arg.list = NULL, prob.method = "hirsch-stedinger", plot.pos.con = 0.375, keep.data = TRUE, data.name = NULL, censoring.name = NULL)
x |
numeric vector of observations.
Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are |
test |
character string defining which goodness-of-fit test to perform. Possible values are:
The Shapiro-Wilk test is only available for singly censored data. See the DETAILS section for more information. |
distribution |
a character string denoting the abbreviation of the assumed distribution.
Only continuous distributions are allowed. See the help file for
The results for the goodness-of-fit test are
identical for Also, the results for the goodness-of-fit test are
identical for |
est.arg.list |
a list of arguments to be passed to the function estimating the distribution
parameters. For example, if |
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities) when The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant to use when |
keep.data |
logical scalar indicating whether to return the original data. The
default value is |
data.name |
optional character string indicating the name for the data used for argument |
censoring.name |
optional character string indicating the name for the data used for argument |
Let $\underline{x} = (x_1, x_2, \ldots, x_N)$ denote a vector of $N$ observations from some distribution with cdf $F$. Suppose we want to test the null hypothesis that $F$ is the cdf of a normal (Gaussian) distribution with some arbitrary mean $\mu$ and standard deviation $\sigma$ against the alternative hypothesis that $F$ is the cdf of some other distribution. The table below shows the random variable for which $F$ is the assumed cdf, given the value of the argument distribution.

| Value of distribution | Distribution Name | Random Variable for which $F$ is the cdf |
|---|---|---|
| "norm" | Normal | $X$ |
| "lnorm" | Lognormal (Log-space) | $\log(X)$ |
| "lnormAlt" | Lognormal (Untransformed) | $\log(X)$ |
Assume $n$ ($0 < n < N$) of these observations are known and $c$ ($c = N - n$) of these observations are all censored below (left-censored) or all censored above (right-censored) at $k$ fixed censoring levels

$$T_1, T_2, \ldots, T_k; \;\;\;\;\;\; k \ge 1$$

For the case when $k \ge 2$, the data are said to be Type I multiply censored. For the case when $k = 1$, set $T = T_1$. If the data are left-censored and all $n$ known observations are greater than or equal to $T$, or if the data are right-censored and all $n$ known observations are less than or equal to $T$, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let $c_j$ denote the number of observations censored below or above censoring level $T_j$ for $j = 1, 2, \ldots, k$, so that

$$\sum_{j=1}^k c_j = c$$

Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first.

Note that in this case the quantity $x_{(i)}$ does not necessarily represent the $i$'th “largest” observation from the (unknown) complete sample.

Note that for singly left-censored data:

$$x_{(1)} = x_{(2)} = \cdots = x_{(c)} = T$$

and for singly right-censored data:

$$x_{(n+1)} = x_{(n+2)} = \cdots = x_{(N)} = T$$

Finally, let $\Omega$ (omega) denote the set of $n$ subscripts in the “ordered” sample that correspond to uncensored observations.
Shapiro-Wilk Goodness-of-Fit Test for Singly Censored Data (test="sw"
)
Equation (8) in the help file for gofTest
shows that for the case of
complete ordered data , the Shapiro-Wilk
-statistic is the same as
the square of the sample product-moment correlation between the vectors
and
:
where
and is defined by:
where denotes the transpose operator, and
is the vector
of expected values and
is the variance-covariance matrix of the order
statistics of a random sample of size
from a standard normal distribution.
That is, the values of
are the expected values of the standard
normal order statistics weighted by their variance-covariance matrix, and
normalized so that
Computing Shapiro-Wilk W-Statistic for Singly Censored Data
For the case of singly censored data, following Smith and Bain (1976) and
Verrill and Johnson (1988), Royston (1993) generalizes the Shapiro-Wilk
-statistic to:
where for left singly-censored data:
and for right singly-censored data:
Just like the function gofTest
,
when test="sw"
, the function gofTestCensored
uses Royston's (1992a)
approximation for the coefficients (see the help file for
gofTest
).
Computing P-Values for the Shapiro-Wilk Test

Verrill and Johnson (1988) show that the asymptotic distribution of the statistic shown above is normal, but the rate of convergence is “surprisingly slow” even for complete samples. They provide a table of empirical percentiles of the distribution for the $W$-statistic shown above for several sample sizes and percentages of censoring. Based on the tables given in Verrill and Johnson (1988), Royston (1993) approximated the 90'th, 95'th, and 99'th percentiles of the distribution of the z-statistic computed from the $W$-statistic. (The distribution of this z-statistic is assumed to be normal, but not necessarily a standard normal.) Denote these percentiles by $z_{0.90}$, $z_{0.95}$, and $z_{0.99}$. The true mean and standard deviation of the z-statistic are estimated by the intercept and slope, respectively, from the linear regression of $z_{\alpha}$ on $\Phi^{-1}(\alpha)$ for $\alpha$ = 0.9, 0.95, and 0.99, where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. The p-value associated with this test is then computed as:

$$p = 1 - \Phi\!\left( \frac{z - \hat{\mu}_z}{\hat{\sigma}_z} \right)$$

Note: Verrill and Johnson (1988) produced their tables based on Type II censoring. Royston's (1993) approximation to the p-value of these tests, however, should be fairly accurate for Type I censored data as well.
Testing Goodness-of-Fit for Any Continuous Distribution

The function gofTestCensored extends the Shapiro-Wilk test that accounts for censoring to test for goodness-of-fit for any continuous distribution by using the idea of Chen and Balakrishnan (1995), who proposed a general purpose approximate goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality. The function gofTestCensored modifies the approach of Chen and Balakrishnan (1995) by using the same first 2 steps, and then applying the Shapiro-Wilk test that accounts for censoring:

1. Let $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ denote the vector of $N$ ordered observations, ignoring censoring status. Compute cumulative probabilities for each $x_{(i)}$ based on the cumulative distribution function for the hypothesized distribution. That is, compute $p_i = F(x_{(i)}, \hat{\theta})$, where $F(x, \theta)$ denotes the hypothesized cumulative distribution function with parameter(s) $\theta$, and $\hat{\theta}$ denotes the estimated parameter(s) using an estimation method that accounts for censoring (e.g., assuming a Gamma distribution with alternative parameterization, call the function egammaAltCensored).

2. Compute standard normal deviates based on the computed cumulative probabilities: $y_{(i)} = \Phi^{-1}(p_i)$.

3. Perform the Shapiro-Wilk goodness-of-fit test (that accounts for censoring) on the $y_{(i)}$'s.
Shapiro-Francia Goodness-of-Fit Test (test="sf"
)
Equation (15) in the help file for gofTest
shows that for the complete
ordered data , the Shapiro-Francia
-statistic is the
same as the squared Pearson correlation coefficient associated with a normal
probability plot.
Computing Shapiro-Francia W'-Statistic for Censored Data
For the case of singly censored data, following Smith and Bain (1976) and
Verrill and Johnson (1988), Royston (1993) extends the computation of the
Weisberg-Bingham Approximation to the -statistic to the case of singly
censored data:
where for left singly-censored data:
and for right singly-censored data:
and is defined as:
where
and denotes the standard normal cdf. Note: Do not confuse the elements
of the vector
with the scalar
which denotes the number
of censored observations. We use
here to be consistent with the
notation in the help file for
gofTest
.
Just like the function gofTest
,
when test="sf"
, the function gofTestCensored
uses Royston's (1992a)
approximation for the coefficients (see the help file for
gofTest
).
In general, the Shapiro-Francia test statistic can be extended to multiply
censored data using Equation (14) with defined as
the orderd values of
associated with uncensored observations, and
defined as the ordered values of
associated with uncensored observations:
and where the plotting positions in Equation (20) are replaced with any of the
plotting positions available in ppointsCensored
(see the description for the argument prob.method
).
Computing P-Values for the Shapiro-Francia Test
Verrill and Johnson (1988) show that the asymptotic distribution of the statistic
in Equation (14) above is normal, but the rate of convergence is
“surprisingly slow” even for complete samples. They provide a table of
empirical percentiles of the distribution for the -statistic shown
in Equation (14) above for several sample sizes and percentages of censoring.
As for the Shapiro-Wilk test, based on the tables given in Verrill and Johnson (1988), Royston (1993) approximated the 90'th, 95'th, and 99'th percentiles of the distribution of the z-statistic computed from the $W'$-statistic. (The distribution of this z-statistic is assumed to be normal, but not necessarily a standard normal.) Denote these percentiles by $z_{0.90}$, $z_{0.95}$, and $z_{0.99}$. The true mean and standard deviation of the z-statistic are estimated by the intercept and slope, respectively, from the linear regression of $z_p$ on $\Phi^{-1}(p)$ for $p$ = 0.9, 0.95, and 0.99, where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. The p-value associated with this test is then computed as:
$p = 1 - \Phi[(z - \hat{\mu}_z) / \hat{\sigma}_z]$
where $\hat{\mu}_z$ and $\hat{\sigma}_z$ denote the estimated mean and standard deviation of the z-statistic.
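A minimal sketch of this percentile-regression idea in base R; the tabulated percentile values and the observed z-statistic below are made up for illustration and are not Royston's values.

```r
# Sketch: estimate the mean and sd of the (assumed normal) z-statistic from three
# tabulated percentiles via linear regression, then compute an approximate p-value.
p <- c(0.90, 0.95, 0.99)
z.perc <- c(1.5, 2.0, 2.9)          # hypothetical tabulated percentiles of z
fit <- lm(z.perc ~ qnorm(p))        # intercept estimates the mean, slope the sd
mu.z <- coef(fit)[1]
sd.z <- coef(fit)[2]
z.obs <- 2.2                        # hypothetical observed z-statistic
1 - pnorm((z.obs - mu.z) / sd.z)    # approximate p-value
```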
Note: Verrill and Johnson (1988) produced their tables based on Type II censoring.
Royston's (1993) approximation to the p-value of these tests, however, should be
fairly accurate for Type I censored data as well, although this is an area that
requires further investigation.
Testing Goodness-of-Fit for Any Continuous Distribution
The function gofTestCensored
extends the Shapiro-Francia test that
accounts for censoring to test for goodness-of-fit for any continuous
distribution by using the idea of Chen and Balakrishnan (1995),
who proposed a general purpose approximate goodness-of-fit test based on
the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality.
The function gofTestCensored
modifies the approach of
Chen and Balakrishnan (1995) by using the same first 2 steps, and then
applying the Shapiro-Francia test that accounts for censoring:
1. Let $\underline{x} = (x_1, x_2, \ldots, x_n)$ denote the vector of $n$ ordered observations, ignoring censoring status.
2. Compute cumulative probabilities for each $x_i$ based on the cumulative distribution function for the hypothesized distribution. That is, compute $p_i = F(x_i, \hat{\theta})$, where $F(x, \theta)$ denotes the hypothesized cumulative distribution function with parameter(s) $\theta$, and $\hat{\theta}$ denotes the estimated parameter(s) using an estimation method that accounts for censoring (e.g., assuming a Gamma distribution with alternative parameterization, call the function egammaAltCensored).
3. Compute standard normal deviates based on the computed cumulative probabilities: $y_i = \Phi^{-1}(p_i)$.
4. Perform the Shapiro-Francia goodness-of-fit test (that accounts for censoring) on the $y_i$'s.
Probability Plot Correlation Coefficient (PPCC) Goodness-of-Fit Test (test="ppcc"
)
The function gofTestCensored
computes the PPCC test statistic using Blom
plotting positions. It can be shown that the square of this statistic is equivalent
to the Weisberg-Bingham approximation to the Shapiro-Francia $W'$-test
(Weisberg and Bingham, 1975; Royston, 1993). Thus the PPCC goodness-of-fit test
is equivalent to the Shapiro-Francia goodness-of-fit test.
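For complete (uncensored) data the PPCC statistic is simply the correlation between the ordered observations and the normal quantiles evaluated at Blom plotting positions, as in this minimal sketch:

```r
# Sketch: PPCC for complete data using Blom plotting positions (i - 3/8)/(n + 1/4).
set.seed(2)
x <- rnorm(20)
pp <- ppoints(20, a = 3/8)      # Blom plotting positions
r  <- cor(sort(x), qnorm(pp))   # probability plot correlation coefficient
r^2                             # approximately the Shapiro-Francia W' statistic
```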
a list of class "gofCensored"
containing the results of the goodness-of-fit
test. See the help files for gofCensored.object
for details.
The Shapiro-Wilk test (Shapiro and Wilk, 1965) and the Shapiro-Francia test (Shapiro and Francia, 1972) are probably the two most commonly used hypothesis tests to test departures from normality. The Shapiro-Wilk test is most powerful at detecting short-tailed (platykurtic) and skewed distributions, and least powerful against symmetric, moderately long-tailed (leptokurtic) distributions. Conversely, the Shapiro-Francia test is more powerful against symmetric long-tailed distributions and less powerful against short-tailed distributions (Royston, 1992b; 1993).
In practice, almost any goodness-of-fit test will not reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p.367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of g1.
Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality. Empirical Results for the Distributions of b2 and √b1. Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of √b1. Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11'th Edition. Hafner Publishing Company, New York, pp.99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
gofTest
, gofCensored.object
,
print.gofCensored
, plot.gofCensored
,
shapiro.test
, Normal, Lognormal,
enormCensored
, elnormCensored
,
elnormAltCensored
, qqPlotCensored
.
# Generate 30 observations from a gamma distribution with # parameters mean=10 and cv=1 and then censor observations less than 5. # Then test the hypothesis that these data came from a gamma # distribution using the Shapiro-Wilk test. # # The p-value for the complete data is p = 0.86, while # the p-value for the censored data is p = 0.52. # (Note: the call to set.seed lets you reproduce this example.) set.seed(598) dat <- sort(rgammaAlt(30, mean = 10, cv = 1)) dat # [1] 0.5313509 1.4741833 1.9936208 2.7980636 3.4509840 # [6] 3.7987348 4.5542952 5.5207531 5.5253596 5.7177872 #[11] 5.7513827 9.1086375 9.8444090 10.6247123 10.9304922 #[16] 11.7925398 13.3432689 13.9562777 14.6029065 15.0563342 #[21] 15.8730642 16.0039936 16.6910715 17.0288922 17.8507891 #[26] 19.1105522 20.2657141 26.3815970 30.2912797 42.8726101 dat.censored <- dat censored <- dat.censored < 5 dat.censored[censored] <- 5 # Results for complete data: #--------------------------- gofTest(dat, test = "sw", dist = "gammaAlt") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): mean = 12.4248552 # cv = 0.7901752 # #Estimation Method: MLE # #Data: dat # #Sample Size: 30 # #Test Statistic: W = 0.981471 # #Test Statistic Parameter: n = 30 # #P-value: 0.8631802 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. # Results for censored data: #--------------------------- gof.list <- gofTestCensored(dat.censored, censored, test = "sw", distribution = "gammaAlt") gof.list #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Wilk GOF # (Singly Censored Data) # Based on Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Censoring Side: left # #Censoring Level(s): 5 # #Estimated Parameter(s): mean = 12.4911448 # cv = 0.7617343 # #Estimation Method: MLE # #Data: dat.censored # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 23.3% # #Test Statistic: W = 0.9613711 # #Test Statistic Parameters: N = 30.0000000 # DELTA = 0.2333333 # #P-value: 0.522329 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. # Plot the results for the censored data #--------------------------------------- dev.new() plot(gof.list) #========== # Continue the above example, but now test the hypothesis that # these data came from a lognormal distribution # (alternative parameterization) using the Shapiro-Wilk test. # # The p-value for the complete data is p = 0.056, while # the p-value for the censored data is p = 0.11. # Results for complete data: #--------------------------- gofTest(dat, test = "sw", dist = "lnormAlt") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): mean = 13.757239 # cv = 1.148872 # #Estimation Method: mvue # #Data: dat # #Sample Size: 30 # #Test Statistic: W = 0.9322226 # #Test Statistic Parameter: n = 30 # #P-value: 0.05626823 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. 
# Results for censored data: #--------------------------- gof.list <- gofTestCensored(dat.censored, censored, test = "sw", distribution = "lnormAlt") gof.list #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Wilk GOF # (Singly Censored Data) # #Hypothesized Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 5 # #Estimated Parameter(s): mean = 13.0382221 # cv = 0.9129512 # #Estimation Method: MLE # #Data: dat.censored # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 23.3% # #Test Statistic: W = 0.9292406 # #Test Statistic Parameters: N = 30.0000000 # DELTA = 0.2333333 # #P-value: 0.114511 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. # Plot the results for the censored data #--------------------------------------- dev.new() plot(gof.list) #---------- # Redo the above example, but specify the quasi-minimum variance # unbiased estimator of the mean. Note that the method of # estimating the parameters has no effect on the goodness-of-fit # test (see the DETAILS section above). gofTestCensored(dat.censored, censored, test = "sw", distribution = "lnormAlt", est.arg.list = list(method = "qmvue")) #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Wilk GOF # (Singly Censored Data) # #Hypothesized Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 5 # #Estimated Parameter(s): mean = 12.8722749 # cv = 0.8712549 # #Estimation Method: Quasi-MVUE # #Data: dat.censored # #Censoring Variable: censored # #Sample Size: 30 # #Percent Censored: 23.3% # #Test Statistic: W = 0.9292406 # #Test Statistic Parameters: N = 30.0000000 # DELTA = 0.2333333 # #P-value: 0.114511 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. #---------- # Clean up rm(dat, dat.censored, censored, gof.list) graphics.off() #========== # Check the assumption that the silver data stored in Helsel.Cohn.88.silver.df # follows a lognormal distribution and plot the goodness-of-fit test results. # Note that the small p-value and the shape of the Q-Q plot # (an inverted S-shape) suggests that the log transformation is not quite strong # enough to "bring in" the tails (i.e., the log-transformed silver data has tails # that are slightly too long relative to a normal distribution). # Helsel and Cohn (1988, p.2002) note that the gross outlier of 560 mg/L tends to # make the shape of the data resemble a gamma distribution. dum.list <- with(Helsel.Cohn.88.silver.df, gofTestCensored(Ag, Censored, test = "sf", dist = "lnorm")) dum.list #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Francia GOF # (Multiply Censored Data) # #Hypothesized Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 0.1 0.2 0.3 0.5 1.0 2.0 2.5 5.0 # 6.0 10.0 20.0 25.0 # #Estimated Parameter(s): meanlog = -1.040572 # sdlog = 2.354847 # #Estimation Method: MLE # #Data: Ag # #Censoring Variable: Censored # #Sample Size: 56 # #Percent Censored: 60.7% # #Test Statistic: W = 0.8957198 # #Test Statistic Parameters: N = 56.0000000 # DELTA = 0.6071429 # #P-value: 0.03490314 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. 
dev.new() plot(dum.list) #---------- # Clean up #--------- rm(dum.list) graphics.off() #========== # Chapter 15 of USEPA (2009) gives several examples of looking # at normal Q-Q plots and estimating the mean and standard deviation # for manganese concentrations (ppb) in groundwater at five background wells. # In EnvStats these data are stored in the data frame # EPA.09.Ex.15.1.manganese.df. # Here we will test whether the data appear to come from a normal # distribution, then we will test to see whether they appear to come # from a lognormal distribution. #-------------------------------------------------------------------- # First look at the data: #----------------------- EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #... #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE longToWide(EPA.09.Ex.15.1.manganese.df, "Manganese.Orig.ppb", "Sample", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 #Sample.1 <5 <5 <5 6.3 17.9 #Sample.2 12.1 7.7 5.3 11.9 22.7 #Sample.3 16.9 53.6 12.6 10 3.3 #Sample.4 21.6 9.5 106.3 <2 8.4 #Sample.5 <2 45.9 34.5 77.2 <2 # Now test whether the data appear to come from # a normal distribution. Note that these data # are multiply censored, so we'll use the # Shapiro-Francia test. #---------------------------------------------- gof.normal <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf")) gof.normal #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Francia GOF # (Multiply Censored Data) # #Hypothesized Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): mean = 15.23508 # sd = 30.62812 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Test Statistic: W = 0.8368016 # #Test Statistic Parameters: N = 25.00 # DELTA = 0.24 # #P-value: 0.004662658 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. # Plot the results: #------------------ dev.new() plot(gof.normal) #---------- # Now test to see whether the data appear to come from # a lognormal distribuiton. #----------------------------------------------------- gof.lognormal <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf", distribution = "lnorm")) gof.lognormal #Results of Goodness-of-Fit Test #Based on Type I Censored Data #------------------------------- # #Test Method: Shapiro-Francia GOF # (Multiply Censored Data) # #Hypothesized Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: Censored # #Sample Size: 25 # #Percent Censored: 24% # #Test Statistic: W = 0.9864426 # #Test Statistic Parameters: N = 25.00 # DELTA = 0.24 # #P-value: 0.9767731 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. # Plot the results: #------------------ dev.new() plot(gof.lognormal) #---------- # Clean up #--------- rm(gof.normal, gof.lognormal) graphics.off()
Objects of S3 class "gofTwoSample"
are returned by the EnvStats function
gofTest
when both the x
and y
arguments are supplied.
Objects of S3 class "gofTwoSample"
are lists that contain
information about the assumed distribution, the estimated or
user-supplied distribution parameters, and the test statistic
and p-value.
Required Components
The following components must be included in a legitimate list of
class "gofTwoSample"
.
distribution |
a character string with the value "Equal". |
statistic |
a numeric scalar with a names attribute containing the name and value of the goodness-of-fit statistic. |
sample.size |
a numeric scalar containing the number of non-missing observations in the sample used for the goodness-of-fit test. |
parameters |
numeric vector with a names attribute containing
the name(s) and value(s) of the parameter(s) associated with the
test statistic given in the statistic component. |
p.value |
numeric scalar containing the p-value associated with the goodness-of-fit statistic. |
alternative |
character string indicating the alternative hypothesis. |
method |
character string indicating the name of the goodness-of-fit test. |
data |
a list of length 2 containing the numeric vectors actually used for the goodness-of-fit test (i.e., the original data but with any missing or infinite values removed). |
data.name |
a character vector of length 2 indicating the name of the data
objects supplied in the x and y arguments. |
Optional Component
The following component is included when the arguments x
and/or y
contain missing (NA
), undefined (NaN
) and/or infinite
(Inf
, -Inf
) values.
bad.obs |
numeric vector of length 2 indicating the number of missing (NA), undefined (NaN) and/or infinite (Inf, -Inf) values removed from the x and y arguments prior to performing the goodness-of-fit test. |
Generic functions that have methods for objects of class
"gofTwoSample"
include: print
, plot
.
Since objects of class "gofTwoSample"
are lists, you may extract
their components with the $
and [[
operators.
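A minimal sketch of creating a "gofTwoSample" object and extracting individual components (component names as listed above):

```r
# Sketch: create a "gofTwoSample" object and extract components with $ and [[.
set.seed(47)
gof <- gofTest(x = rnorm(20), y = rnorm(10, mean = 1), test = "ks")
class(gof)         # "gofTwoSample"
gof$statistic      # the two-sample K-S statistic
gof$p.value        # associated p-value
gof[["method"]]    # name of the goodness-of-fit test
```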
Steven P. Millard ([email protected])
print.gofTwoSample
, plot.gofTwoSample
,
Goodness-of-Fit Tests.
# Create an object of class "gofTwoSample", then print it out. # Generate 20 observations from a normal distribution with mean=3 and sd=2, and # generate 10 observaions from a normal distribution with mean=2 and sd=2 then # test whether these sets of observations come from the same distribution. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(300) dat1 <- rnorm(20, mean = 3, sd = 2) dat2 <- rnorm(10, mean = 1, sd = 2) gofTest(x = dat1, y = dat2, test = "ks") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: 2-Sample K-S GOF # #Hypothesized Distribution: Equal # #Data: x = dat1 # y = dat2 # #Sample Sizes: n.x = 20 # n.y = 10 # #Test Statistic: ks = 0.7 # #Test Statistic Parameters: n = 20 # m = 10 # #P-value: 0.001669561 # #Alternative Hypothesis: The cdf of 'dat1' does not equal # the cdf of 'dat2'. #---------- # Clean up rm(dat1, dat2)
Generate a generalized pivotal quantity (GPQ) for a confidence interval for the mean of a Normal distribution based on singly or multiply censored data.
gpqCiNormSinglyCensored(n, n.cen, probs, nmc, method = "mle", censoring.side = "left", seed = NULL, names = TRUE)
gpqCiNormMultiplyCensored(n, cen.index, probs, nmc, method = "mle", censoring.side = "left", seed = NULL, names = TRUE)
n |
positive integer indicating the sample size. |
n.cen |
for the case of singly censored data, a positive integer indicating the number of
censored observations. The value of |
cen.index |
for the case of multiply censored data, a sorted vector of unique integers
indicating the indices of the censored observations when the observations are
“ordered”. The length of |
probs |
numeric vector of values between 0 and 1 indicating the confidence level(s) associated with the GPQ(s). |
nmc |
positive integer indicating the number of Monte Carlo trials to run. |
method |
character string indicating the method to use for parameter estimation. |
censoring.side |
character string indicating on which side the censoring occurs. The possible
values are "left" (the default) and "right". |
seed |
positive integer to pass to the function set.seed; specify this to reproduce results. |
names |
a logical scalar passed to |
The functions gpqCiNormSinglyCensored
and gpqCiNormMultiplyCensored
are called by enormCensored
when ci.method="gpq"
. They are
used to construct generalized pivotal quantities to create confidence intervals
for the mean of an assumed normal distribution.
This idea was introduced by Schmee et al. (1985) in the context of Type II singly
censored data. The function
gpqCiNormSinglyCensored
generates GPQs using a modification of
Algorithm 12.1 of Krishnamoorthy and Mathew (2009, p. 329). Algorithm 12.1 is
used to generate GPQs for a tolerance interval. The modified algorithm for
generating GPQs for confidence intervals for the mean is as follows:
1. Generate a random sample of $n$ observations from a standard normal (i.e., N(0,1)) distribution and let $z_{(1)}, z_{(2)}, \ldots, z_{(n)}$ denote the ordered (sorted) observations.
2. Set the smallest n.cen observations as censored.
3. Compute the estimates of $\mu$ and $\sigma$ by calling enormCensored using the method specified by the method argument, and denote these estimates as $\hat{\mu}^*$ and $\hat{\sigma}^*$.
4. Compute the t-like pivotal quantity $\hat{t} = \hat{\mu}^* / \hat{\sigma}^*$.
Repeat steps 1-4 nmc times to produce an empirical distribution of the t-like pivotal quantity.
A two-sided $(1-\alpha)100\%$ confidence interval for $\mu$ is then computed as:
$[\hat{\mu} - \hat{t}_{1-\alpha/2}\,\hat{\sigma}, \;\; \hat{\mu} - \hat{t}_{\alpha/2}\,\hat{\sigma}]$
where $\hat{t}_p$ denotes the $p$'th empirical quantile of the nmc generated values of $\hat{t}$.
Schmee et al. (1985) derived this method in the context of Type II singly censored data (for which these limits are exact within Monte Carlo error), but state that according to Regal (1982) this method produces confidence intervals that are close approximations to the correct limits for Type I censored data.
The function
gpqCiNormMultiplyCensored
is an extension of this idea to multiply censored
data. The algorithm is the same as for singly censored data, except
Step 2 changes to:
2. Set the observations whose positions in the ordered sample are given by the argument cen.index as censored.
The functions gpqCiNormSinglyCensored and gpqCiNormMultiplyCensored are computationally intensive and are provided so that users can create their own tables of GPQs.
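A minimal sketch of how the generated GPQ quantiles combine with censored-data estimates of the mean and standard deviation, assuming the interval has the form given above; in practice you would simply call enormCensored with ci=TRUE and ci.method="gpq". The data and the small value of nmc below are for illustration only.

```r
# Sketch: combining GPQ quantiles with censored-data estimates of the mean and sd.
# Assumed interval form: [mu.hat - t(0.975)*sigma.hat, mu.hat - t(0.025)*sigma.hat].
x <- c(8, 8, 8, 8.4, 9.1, 9.7, 10.2, 10.8, 11.3, 12.0, 12.6, 13.1, 14.0, 15.2, 16.8)
censored <- c(TRUE, TRUE, TRUE, rep(FALSE, 12))   # left singly-censored at 8
est <- enormCensored(x, censored, method = "mle")
mu.hat    <- unname(est$parameters["mean"])
sigma.hat <- unname(est$parameters["sd"])
gpq <- gpqCiNormSinglyCensored(n = 15, n.cen = 3, probs = c(0.025, 0.975),
  nmc = 100, seed = 47)                           # nmc kept small for speed
c(LCL = mu.hat - unname(gpq["97.5%"]) * sigma.hat,
  UCL = mu.hat - unname(gpq["2.5%"]) * sigma.hat)
```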
a numeric vector containing the GPQ(s).
Steven P. Millard ([email protected])
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Regal, R. (1982). Applying Order Statistic Censored Normal Confidence Intervals to Time Censored Data. Unpublished manuscript, University of Minnesota, Duluth, Department of Mathematical Sciences.
Schmee, J., D.Gladstein, and W. Nelson. (1985). Confidence Limits for Parameters of a Normal Distribution from Singly Censored Samples, Using Maximum Likelihood. Technometrics 27(2) 119–128.
enormCensored
, estimateCensored.object
.
# Reproduce the entries for n=10 observations with n.cen=6 in Table 4 # of Schmee et al. (1985, p.122). # # Notes: # 1. This table applies to right-censored data, and the # quantity "r" in this table refers to the number of # uncensored observations. # # 2. Passing a value for the argument "seed" simply allows # you to reproduce this example. # NOTE: Here to save computing time for the sake of example, we will specify # just 100 Monte Carlos, whereas Krishnamoorthy and Mathew (2009) # suggest *10,000* Monte Carlos. # Here are the values given in Schmee et al. (1985): Schmee.values <- c(-3.59, -2.60, -1.73, -0.24, 0.43, 0.58, 0.73) probs <- c(0.025, 0.05, 0.1, 0.5, 0.9, 0.95, 0.975) names(Schmee.values) <- paste(probs * 100, "%", sep = "") Schmee.values # 2.5% 5% 10% 50% 90% 95% 97.5% #-3.59 -2.60 -1.73 -0.24 0.43 0.58 0.73 gpqs <- gpqCiNormSinglyCensored(n = 10, n.cen = 6, probs = probs, nmc = 100, censoring.side = "right", seed = 529) round(gpqs, 2) # 2.5% 5% 10% 50% 90% 95% 97.5% #-2.46 -2.03 -1.38 -0.14 0.54 0.65 0.84 # This is what you get if you specify nmc = 1000 with the # same value for seed: #----------------------------------------------- # 2.5% 5% 10% 50% 90% 95% 97.5% #-3.50 -2.49 -1.67 -0.25 0.41 0.57 0.71 # Clean up #--------- rm(Schmee.values, probs, gpqs) #========== # Example of using gpqCiNormMultiplyCensored #------------------------------------------- # Consider the following set of multiply left-censored data: dat <- 12:16 censored <- c(TRUE, FALSE, TRUE, FALSE, FALSE) # Since the data are "ordered" we can identify the indices of the # censored observations in the ordered data as follow: cen.index <- (1:length(dat))[censored] cen.index #[1] 1 3 # Now we can generate a GPQ using gpqCiNormMultiplyCensored. # Here we'll generate a GPQs to use to create a # 95% confidence interval for left-censored data. # NOTE: Here to save computing time for the sake of example, we will specify # just 100 Monte Carlos, whereas Krishnamoorthy and Mathew (2009) # suggest *10,000* Monte Carlos. gpqCiNormMultiplyCensored(n = 5, cen.index = cen.index, probs = c(0.025, 0.975), nmc = 100, seed = 237) # 2.5% 97.5% #-1.315592 1.848513 #---------- # Clean up #--------- rm(dat, censored, cen.index)
Generate a generalized pivotal quantity (GPQ) for a tolerance interval for a Normal distribution based on singly or multiply censored data.
gpqTolIntNormSinglyCensored(n, n.cen, p, probs, nmc, method = "mle", censoring.side = "left", seed = NULL, names = TRUE)
gpqTolIntNormMultiplyCensored(n, cen.index, p, probs, nmc, method = "mle", censoring.side = "left", seed = NULL, names = TRUE)
n |
positive integer indicating the sample size. |
n.cen |
for the case of singly censored data, a positive integer indicating the number of
censored observations. The value of |
cen.index |
for the case of multiply censored data, a sorted vector of unique integers indicating the
indices of the censored observations when the observations are “ordered”.
The length of |
p |
numeric scalar strictly greater than 0 and strictly less than 1 indicating the quantile for which to generate the GPQ(s) (i.e., the coverage associated with a one-sided tolerance interval). |
probs |
numeric vector of values between 0 and 1 indicating the confidence level(s) associated with the GPQ(s). |
nmc |
positive integer indicating the number of Monte Carlo trials to run. |
method |
character string indicating the method to use for parameter estimation. |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right". |
seed |
positive integer to pass to the function set.seed; specify this to reproduce results. |
names |
a logical scalar passed to |
The function gpqTolIntNormSinglyCensored
generates GPQs as described in Algorithm 12.1
of Krishnamoorthy and Mathew (2009, p. 329). The function
gpqTolIntNormMultiplyCensored
is an extension of this idea to multiply censored data.
These functions are called by tolIntNormCensored
when ti.method="gpq"
,
and also by eqnormCensored
when ci=TRUE
and ci.method="gpq"
. See
the help files for these functions for an explanation of GPQs.
Note that technically these are only GPQs if the data are Type II censored. However, Krishnamoorthy and Mathew (2009, p. 328) state that in the case of Type I censored data these quantities should approximate the true GPQs and the results appear to be satisfactory, even for small sample sizes.
The functions gpqTolIntNormSinglyCensored and gpqTolIntNormMultiplyCensored are computationally intensive and are provided so that users can create their own tables of GPQs.
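As a rough sketch, assuming the upper tolerance limit has the usual form mu.hat + k * sigma.hat with k taken from the GPQ (in practice, call tolIntNormCensored with ti.method="gpq" directly); the data and the small value of nmc are illustrative only:

```r
# Sketch: using a GPQ quantile as the k-factor in an approximate upper tolerance
# limit, under the assumed form mu.hat + k * sigma.hat (illustrative data).
x <- c(5, 5, 6.2, 7.9, 8.4, 9.1, 10.3, 11.8, 13.0, 15.6)
censored <- c(TRUE, TRUE, rep(FALSE, 8))          # left singly-censored at 5
est <- enormCensored(x, censored, method = "mle")
k <- gpqTolIntNormSinglyCensored(n = 10, n.cen = 2, p = 0.90, probs = 0.95,
  nmc = 100, seed = 42)                           # nmc kept small for speed
unname(est$parameters["mean"] + k * est$parameters["sd"])  # ~90% coverage, 95% confidence
```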
a numeric vector containing the GPQ(s).
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
tolIntNormCensored
, eqnormCensored
,
enormCensored
, estimateCensored.object
.
# Reproduce the entries for n=10 observations with n.cen=1 in Table 12.2 # of Krishnamoorthy and Mathew (2009, p.331). # # (Note: passing a value for the argument "seed" simply allows you to # reproduce this example.) # # NOTE: Here to save computing time for the sake of example, we will specify # just 100 Monte Carlos, whereas Krishnamoorthy and Mathew (2009) # suggest *10,000* Monte Carlos. gpqTolIntNormSinglyCensored(n = 10, n.cen = 1, p = 0.05, probs = 0.05, nmc = 100, seed = 529) # 5% #-3.483403 gpqTolIntNormSinglyCensored(n = 10, n.cen = 1, p = 0.1, probs = 0.05, nmc = 100, seed = 497) # 5% #-2.66705 gpqTolIntNormSinglyCensored(n = 10, n.cen = 1, p = 0.9, probs = 0.95, nmc = 100, seed = 623) # 95% #2.478654 gpqTolIntNormSinglyCensored(n = 10, n.cen = 1, p = 0.95, probs = 0.95, nmc = 100, seed = 623) # 95% #3.108452 #========== # Example of using gpqTolIntNormMultiplyCensored #----------------------------------------------- # Consider the following set of multiply left-censored data: dat <- 12:16 censored <- c(TRUE, FALSE, TRUE, FALSE, FALSE) # Since the data are "ordered" we can identify the indices of the # censored observations in the ordered data as follow: cen.index <- (1:length(dat))[censored] cen.index #[1] 1 3 # Now we can generate a GPQ using gpqTolIntNormMultiplyCensored. # Here we'll generate a GPQ corresponding to an upper tolerance # interval with coverage 90% with 95% confidence for # left-censored data. # NOTE: Here to save computing time for the sake of example, we will specify # just 100 Monte Carlos, whereas Krishnamoorthy and Mathew (2009) # suggest *10,000* Monte Carlos. gpqTolIntNormMultiplyCensored(n = 5, cen.index = cen.index, p = 0.9, probs = 0.95, nmc = 100, seed = 237) # 95% #3.952052 #========== # Clean up #--------- rm(dat, censored, cen.index)
These data are the results of an experiment in which different groups of rats were exposed to different concentration levels of ethylene thiourea (ETU), which is a decomposition product of a certain class of fungicides that can be found in treated foods (Graham et al., 1975; Rodricks, 1992, p.133). In this experiment, the outcome of concern was the number of rats that developed thyroid tumors.
Graham.et.al.75.etu.df
A data frame with 6 observations on the following 4 variables.
dose
a numeric vector of dose (ppm/day) of ETU.
tumors
a numeric vector indicating number of rats that developed thyroid tumors.
n
a numeric vector indicating the number of rats in the dose group.
proportion
a numeric vector indicating proportion of rats that developed thyroid tumors.
Graham, S.L., K.J. Davis, W.H. Hansen, and C.H. Graham. (1975). Effects of Prolonged Ethylene Thiourea Ingestion on the Thyroid of the Rat. Food and Cosmetics Toxicology, 13(5), 493–499.
Rodricks, J.V. (1992). Calculated Risks: The Toxicity and Human Health Risks of Chemicals in Our Environment. Cambridge University Press, New York, p.133.
Adjusted alpha levels to compute confidence intervals for the mean of a gamma distribution, as presented in Table 2 of Grice and Bain (1980).
data("Grice.Bain.80.mat")
data("Grice.Bain.80.mat")
A matrix of dimensions 5 by 7, with the first dimension indicating the sample size (between 5 and Inf), and the second dimension indicating the assumed significance level associated with the confidence interval (between 0.005 and 0.25). The assumed confidence level is 1 - assumed significance level.
See Grice and Bain (1980) and the help file for egamma
for more information. The data in this matrix are used when
the function egamma
is called with ci.method="chisq.adj"
.
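For example, a minimal sketch (with simulated data) of the call that makes use of these adjusted alpha levels:

```r
# Sketch: egamma looks up the adjusted alpha levels in Grice.Bain.80.mat when
# ci.method = "chisq.adj" is requested (simulated data for illustration only).
set.seed(10)
x <- rgamma(20, shape = 3, scale = 2)
egamma(x, ci = TRUE, ci.method = "chisq.adj")
```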
Grice, J.V., and L.J. Bain. (1980). Inferences Concerning the Mean of the Gamma Distribution. Journal of the American Statistical Association 75, 929-933.
USEPA. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
# Look at Grice.Bain.80.mat Grice.Bain.80.mat # alpha.eq.005 alpha.eq.01 alpha.eq.025 alpha.eq.05 alpha.eq.075 #n.eq.5 0.0000 0.0000 0.0010 0.0086 0.0234 #n.eq.10 0.0003 0.0015 0.0086 0.0267 0.0486 #n.eq.20 0.0017 0.0046 0.0159 0.0380 0.0619 #n.eq.40 0.0030 0.0070 0.0203 0.0440 0.0685 #n.eq.Inf 0.0050 0.0100 0.0250 0.0500 0.0750 # alpha.eq.10 alpha.eq.25 #n.eq.5 0.0432 0.2038 #n.eq.10 0.0724 0.2294 #n.eq.20 0.0866 0.2403 #n.eq.40 0.0934 0.2453 #n.eq.Inf 0.1000 0.2500
Made up multiply left-censored data. There are 9 observations out of a total of 18
that are reported as <DL, where DL denotes a detection limit. There are
2 distinct detection limits.
Helsel.Cohn.88.app.b.df
A data frame with 18 observations on the following 3 variables.
Conc.orig
a character vector of original observations
Conc
a numeric vector of observations with censored values coded to censoring levels
Censored
a logical vector indicating which values are censored
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004, Appendix B.
Silver concentrations (mg/L) from an interlab comparison. There are 34 observations out of a total of 56 that are reported as <DL, where DL denotes a detection limit. There are 12 distinct detection limits.
Helsel.Cohn.88.silver.df
A data frame with 56 observations on the following 4 variables.
Ag.orig
a character vector of original silver concentrations (mg/L)
Ag
a numeric vector with nondetects coded to the detection limit
Censored
a logical vector indicating which observations are censored
log.Ag
the natural logarithm of Ag
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997–2004.
Janzer, V.J. (1986). Report of the U.S. Geological Survey's Analytical Evaluation Program–Standard Reference Water Samples M6, M94, T95, N16, P8, and SED3. Technical Report, Branch of Quality Assurance, U.S. Geological Survey, Arvada, CO.
Counts of mayfly nymphs at low flow in 12 small streams. In each stream, counts were recorded above and below industrial outfalls.
data(Helsel.Hirsch.02.Mayfly.df)
A data frame with 24 observations on the following 3 variables.
Mayfly.Count
Number of mayfly nymphs counted
Stream
a factor indicating the stream number
Location
a factor indicating the location of the count (above vs. below)
Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources Research. Techniques of Water Resources Investigations, Book 4, Chapter A3. U.S. Geological Survey, 139–140. https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf.
Detailed abstract of the manuscript:
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the
Generalized Extreme-Value Distribution by the Method of Probability-Weighted
Moments. Technometrics 27(3), 251–261.
Abstract
Hosking et al. (1985) use the method of probability-weighted moments,
introduced by Greenwood et al. (1979), to estimate the parameters of the
generalized extreme value distribution (GEVD) with parameters
location=$\eta$, scale=$\theta$, and shape=$\kappa$. Hosking et al. (1985) derive the asymptotic
distributions of the probability-weighted moment estimators (PWME), and compare
the asymptotic and small-sample statistical properties (via computer simulation)
of the PWME with maximum likelihood estimators (MLE) and Jenkinson's (1969)
method of sextiles estimators (JSE). They also compare the statistical
properties of quantile estimators (which are based on the distribution parameter
estimators). Finally, they derive a test of the null hypothesis that the
shape parameter is zero, and assess its performance via computer simulation.
Hosking et al. (1985) note that when $\kappa \le -1$, the moments and
probability-weighted moments of the GEVD do not exist. They also note that in
practice the shape parameter usually lies between -1/2 and 1/2.
Hosking et al. (1985) found that the asymptotic efficiency of the PWME (the limit as the sample size approaches infinity of the ratio of the variance of the MLE divided by the variance of the PWME) tends to 0 as the shape parameter approaches 1/2 or -1/2. For values of $\kappa$ within the range $[-0.2, 0.2]$, however, the efficiency of the estimator of the location parameter is close to 100%, and the efficiencies of the estimators of the scale and shape parameters are greater than 70%. Hosking et al. (1985) found that the asymptotic efficiency of the PWME is poor for $\kappa$ outside the range $[-0.2, 0.2]$.
For the small sample results, Hosking et al. (1985) considered several possible
forms of the PWME (see equations (8)-(10) below). The best overall results were
given by the plotting-position PWME defined by equations (9) and (10) with plotting positions $p_{i:n} = (i - 0.35)/n$.
Small sample results for estimating the parameters show that in many cases all three methods give almost identical results; in other cases the results for the different estimators are a bit different, but not dramatically so. The MLE tends to be slightly less biased than the other two methods. For estimating the shape parameter, the MLE has a slightly larger standard deviation, and the PWME consistently has the smallest standard deviation.
Small sample results for estimating large quantiles show that in many cases all three methods are comparable; in other cases the PWME and JSE are comparable and in general have much smaller standard deviations than the MLE. All three methods are very inaccurate for estimating large quantiles in small samples.
Hosking et al. (1985) derive a test of the null hypothesis $H_0: \kappa = 0$ based on the PWME of $\kappa$. The test is performed by computing the statistic:
$z = \hat{\kappa}_{pwme} / \sqrt{0.5633 / n}$
and comparing it to a standard normal distribution (see zTestGevdShape). Based on computer simulations using the plotting-position PWME, they found that a sample size of $n \ge 25$ ensures an adequate normal approximation. They also found this test has power comparable to the modified likelihood-ratio test, which was found by Hosking (1984) to be the best overall test of $H_0: \kappa = 0$ of the thirteen tests he considered.
More Details
Probability-Weighted Moments and Parameters of the GEVD
The definition of a probability-weighted moment, introduced by Greenwood et al. (1979), is as follows. Let $X$ denote a random variable with cdf $F$, and let $x(p)$ denote the $p$'th quantile of the distribution. Then the $(i, j, k)$'th probability-weighted moment is given by:
$M(i, j, k) = E\{X^i [F(X)]^j [1 - F(X)]^k\} = \int_0^1 [x(p)]^i \, p^j (1 - p)^k \, dp$
where $i$, $j$, and $k$ are real numbers. Hosking et al. (1985) set
$\beta_j = M(1, j, 0)$
and Greenwood et al. (1979) show that
$\beta_j = \frac{1}{j+1} E\left[X_{(j+1):(j+1)}\right]$
where $E[X_{(j+1):(j+1)}]$ denotes the expected value of the $(j+1)$'th order statistic (i.e., the maximum) in a sample of size $j+1$. Hosking et al. (1985) show that if $X$ has a GEVD with parameters location=$\eta$, scale=$\theta$, and shape=$\kappa$, where $\kappa \ne 0$, then
$\beta_j = \frac{1}{j+1}\left\{\eta + \frac{\theta\left[1 - (j+1)^{-\kappa}\,\Gamma(1+\kappa)\right]}{\kappa}\right\}$
for $\kappa > -1$, where $\Gamma$ denotes the gamma function. Thus, equations (6)-(8) express the first three probability-weighted moments $\beta_0$, $\beta_1$, and $\beta_2$ in terms of the distribution parameters $\eta$, $\theta$, and $\kappa$.
Estimating Distribution Parameters
Using the results of Landwehr et al. (1979), Hosking et al. (1985) show that, given a random sample of $n$ values from some arbitrary distribution, an unbiased, distribution-free, and parameter-free estimator of the probability-weighted moment $\beta_j$ defined above is given by:
$b_j = \frac{1}{n} \sum_{i=1}^{n} \frac{(i-1)(i-2)\cdots(i-j)}{(n-1)(n-2)\cdots(n-j)} \, x_{(i)}$
where the quantity $x_{(i)}$ denotes the $i$'th order statistic in the random sample of size $n$. Hosking et al. (1985) note that this estimator is closely related to U-statistics (Hoeffding, 1948; Lehmann, 1975, pp. 362-371).
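A minimal sketch of equation (9) written directly in R; EnvStats's pwMoment function can be used in practice, and the helper below is purely illustrative.

```r
# Sketch: the unbiased probability-weighted moment estimator b_j of equation (9).
pwm.unbiased <- function(x, j) {
  x <- sort(x)
  n <- length(x)
  # weight on the i'th order statistic: (i-1)(i-2)...(i-j) / [(n-1)(n-2)...(n-j)]
  w <- sapply(seq_len(n), function(i) prod((i - seq_len(j)) / (n - seq_len(j))))
  mean(w * x)
}
set.seed(5)
x <- rgevd(40, location = 10, scale = 2, shape = 0.1)
c(b0 = pwm.unbiased(x, 0), b1 = pwm.unbiased(x, 1), b2 = pwm.unbiased(x, 2))
```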
An alternative “plotting position” estimator is given by:
$\tilde{b}_j = \frac{1}{n} \sum_{i=1}^{n} (p_{i:n})^j \, x_{(i)}$
where $p_{i:n}$ denotes the plotting position of the $i$'th order statistic in the random sample of size $n$, that is, a distribution-free estimate of the cdf of $X$ evaluated at the $i$'th order statistic. Typically, plotting positions have the form:
$p_{i:n} = \frac{i + a}{n + b}$
for suitable constants $a$ and $b$. For this form of plotting position, the plotting-position estimators in (10) are asymptotically equivalent to the U-statistic estimators in (9).
Although the unbiased and plotting position estimators are asymptotically equivalent (Hosking, 1990), Hosking and Wallis (1995) recommend using the unbiased estimator for almost all applications because of its superior performance in small and moderate samples.
Using equations (6)-(8) above, i.e., the three equations involving $\beta_0$, $\beta_1$, and $\beta_2$, Hosking et al. (1985) define the probability-weighted moment estimators of $\eta$, $\theta$, and $\kappa$ as the solutions to these three simultaneous equations, with the values of the probability-weighted moments replaced by their estimated values (using either the unbiased or plotting-position estimators in (9) and (10) above). Hosking et al. (1985) note that the third equation (equation (8)) must be solved iteratively for the PWME of $\kappa$. Using the unbiased estimators of the PWMEs to solve for $\kappa$, the PWMEs of $\theta$ and $\eta$ are given by:
$\hat{\theta} = \frac{(2 b_1 - b_0)\,\hat{\kappa}}{\Gamma(1 + \hat{\kappa})\,(1 - 2^{-\hat{\kappa}})}, \qquad \hat{\eta} = b_0 + \frac{\hat{\theta}\,[\Gamma(1 + \hat{\kappa}) - 1]}{\hat{\kappa}}$
Hosking et al. (1985) show that when the unbiased estimates of the probability-weighted moments are used, the estimates of $\theta$ and $\kappa$ satisfy the feasibility criteria almost surely.
Hosking et al. (1985) show that the asymptotic distribution of the PWME is
multivariate normal with mean equal to (η, θ, κ), and they
derive the formula for the asymptotic variance-covariance matrix as:

V = G V_b G'                                              (15)

where V_b
denotes the variance-covariance matrix of the estimators of the probability-weighted
moments defined in either equation (9) or (10) above (recall that these two
estimators are asymptotically equivalent), and the matrix G is defined by:

G_ij = ∂φ_i / ∂β_(j-1),   i, j = 1, 2, 3                  (16)

where (φ_1, φ_2, φ_3) = (η, θ, κ). Hosking et al. (1985) provide the necessary formulas
in Appendix C of their manuscript. Note that there is a typographical error in
equation (C.11) (Jon Hosking, personal communication, 1996), which affects one
quantity in the second line of that equation.

The matrix G in equation (16) is not easily computed. Its inverse, however,
is easy to compute and then can be inverted numerically (Jon Hosking, 1996,
personal communication). The inverse of G
is given by:

[ G^(-1) ]_ij = ∂β_(i-1) / ∂φ_j,   i, j = 1, 2, 3

and by equation (5) above these partial derivatives can be written in closed form.
Estimating Distribution Quantiles
If X has a GEVD with parameters
location = η, scale = θ, and shape = κ, where κ ≠ 0,
then the p'th quantile of the distribution is given by:

x_p = η + (θ/κ) [ 1 - (-log(p))^κ ],   0 < p < 1

Given estimated values of the location, scale, and shape
parameters, the p'th quantile of the distribution is estimated as:

x̂_p = η̂ + (θ̂/κ̂) [ 1 - (-log(p))^κ̂ ]
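As an illustration (not part of the original help file, and assuming the EnvStats functions rgevd, egevd, and qgevd behave as documented), the following sketch estimates the GEVD parameters by probability-weighted moments and then plugs the estimates into the quantile formula above to estimate the 99'th percentile:

set.seed(479)
x <- rgevd(60, location = 10, scale = 2, shape = 0.2)

# Probability-weighted moment estimates of location, scale, and shape
est <- egevd(x, method = "pwme")
est$parameters

# Estimated 99th percentile based on the estimated parameters
qgevd(0.99, location = est$parameters["location"],
      scale = est$parameters["scale"],
      shape = est$parameters["shape"])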
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hoeffding, W. (1948). A Class of Statistics with Asymptotically Normal Distribution. Annals of Mathematical Statistics 19, 293–325.
Hosking, J.R.M. (1985). Algorithm AS 215: Maximum-Likelihood Estimation of the Parameters of the Generalized Extreme-Value Distribution. Applied Statistics 34(3), 301–310.
Hosking, J.R.M. (1990). L-Moments: Analysis and Estimation of
Distributions Using Linear Combinations of Order Statistics. Journal of
the Royal Statistical Society, Series B 52(1), 105–124.
Hosking, J.R.M., and J.R. Wallis (1995). A Comparison of Unbiased and
Plotting-Position Estimators of L Moments. Water Resources
Research 31(8), 2019–2025.
Jenkinson, A.F. (1969). Statistics of Extremes. Technical Note 98, World Meteorological Office, Geneva.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.4-8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, 457pp.
Generalized Extreme Value Distribution, egevd
.
This class of objects is returned by functions that perform hypothesis tests
(e.g., the R function t.test
, the EnvStats function
kendallSeasonalTrendTest
, etc.).
Objects of class "htest"
are lists that contain information about the null
and alternative hypotheses, the estimated distribution parameters, the test statistic,
the p-value, and (optionally) confidence intervals for distribution parameters.
Objects of S3 class "htest"
are returned by any of the
EnvStats functions that perform hypothesis tests as listed
here: Hypothesis Tests.
(Note that functions that perform goodness-of-fit tests
return objects of class "gof"
or "gofTwoSample"
.)
Objects of class "htest"
generated by EnvStats functions may
contain additional components called
estimation.method
(method used to estimate the population parameter(s)),
sample.size
, and
bad.obs
(number of missing (NA
), undefined (NaN
), or infinite
(Inf
, -Inf
) values removed prior to performing the hypothesis test),
and interval
(a list with information about a confidence, prediction, or
tolerance interval).
Required Components
The following components must be included in a legitimate list of
class "htest"
.
null.value |
numeric vector containing the value(s) of the population parameter(s) specified by
the null hypothesis. This vector has a names attribute describing its element(s). |
alternative |
character string indicating the alternative hypothesis (the value of the input
argument alternative). |
method |
character string giving the name of the test used. |
estimate |
numeric vector containing the value(s) of the estimated population parameter(s)
involved in the null hypothesis. This vector has a names attribute describing its element(s). |
data.name |
character string containing the actual name(s) of the input data. |
statistic |
numeric scalar containing the value of the test statistic, with a
names attribute indicating the name of the test statistic. |
parameters |
numeric vector containing the parameter(s) associated with the null distribution of
the test statistic. This vector has a names attribute describing its element(s). |
p.value |
numeric scalar containing the p-value for the test under the null hypothesis. |
Optional Components
The following component may optionally be included in an object
of class "htest" generated by R functions that test hypotheses:
conf.int |
numeric vector of length 2 containing lower and upper confidence limits for the
estimated population parameter. This vector has an attribute called
conf.level indicating the confidence level associated with the interval. |
The following components may be included in objects of class "htest"
generated by EnvStats functions:
sample.size |
numeric scalar containing the number of non-missing observations in the sample used for the hypothesis test. |
estimation.method |
character string containing the method used to compute the estimated distribution
parameter(s). The value of this component will depend on the available estimation
methods (see the help file for the relevant estimation function). |
bad.obs |
the number of missing (NA), undefined (NaN), or infinite (Inf, -Inf) values removed prior to performing the hypothesis test. |
interval |
a list containing information about a confidence, prediction, or tolerance interval. |
Generic functions that have methods for objects of class
"htest"
include: print
.
Since objects of class "htest"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
print.htest
, Hypothesis Tests.
# Create an object of class "htest", then print it out. #------------------------------------------------------ htest.obj <- chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30) mode(htest.obj) #[1] "list" class(htest.obj) #[1] "htest" names(htest.obj) # [1] "statistic" "parameters" "p.value" "estimate" # [5] "null.value" "alternative" "method" "sample.size" # [9] "data.name" "bad.obs" "interval" htest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mean = 30 # #Alternative Hypothesis: True mean is greater than 30 # #Test Name: One-sample t-Test # Modified for # Positively-Skewed Distributions # (Chen, 1995) # #Estimated Parameter(s): mean = 34.566667 # sd = 27.330598 # skew = 2.365778 # #Data: EPA.02d.Ex.9.mg.per.L.vec # #Sample Size: 60 # #Test Statistic: t = 1.574075 # #Test Statistic Parameter: df = 59 # #P-values: z = 0.05773508 # t = 0.06040889 # Avg. of z and t = 0.05907199 # #Confidence Interval for: mean # #Confidence Interval Method: Based on z # #Confidence Interval Type: Lower # #Confidence Level: 95% # #Confidence Interval: LCL = 29.82 # UCL = Inf #========== # Extract the test statistic #--------------------------- htest.obj$statistic # t #1.574075 #========== # Clean up #--------- rm(htest.obj)
# Create an object of class "htest", then print it out. #------------------------------------------------------ htest.obj <- chenTTest(EPA.02d.Ex.9.mg.per.L.vec, mu = 30) mode(htest.obj) #[1] "list" class(htest.obj) #[1] "htest" names(htest.obj) # [1] "statistic" "parameters" "p.value" "estimate" # [5] "null.value" "alternative" "method" "sample.size" # [9] "data.name" "bad.obs" "interval" htest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mean = 30 # #Alternative Hypothesis: True mean is greater than 30 # #Test Name: One-sample t-Test # Modified for # Positively-Skewed Distributions # (Chen, 1995) # #Estimated Parameter(s): mean = 34.566667 # sd = 27.330598 # skew = 2.365778 # #Data: EPA.02d.Ex.9.mg.per.L.vec # #Sample Size: 60 # #Test Statistic: t = 1.574075 # #Test Statistic Parameter: df = 59 # #P-values: z = 0.05773508 # t = 0.06040889 # Avg. of z and t = 0.05907199 # #Confidence Interval for: mean # #Confidence Interval Method: Based on z # #Confidence Interval Type: Lower # #Confidence Level: 95% # #Confidence Interval: LCL = 29.82 # UCL = Inf #========== # Extract the test statistic #--------------------------- htest.obj$statistic # t #1.574075 #========== # Clean up #--------- rm(htest.obj)
This class of objects is returned by EnvStats functions that perform
hypothesis tests based on censored data.
Objects of class "htestCensored"
are lists that contain information about
the null and alternative hypotheses, the censoring side, the censoring levels,
the percentage of observations that are censored,
the estimated distribution parameters (if applicable), the test statistic,
the p-value, and (optionally, if applicable)
confidence intervals for distribution parameters.
Objects of S3 class "htestCensored"
are returned by
the functions listed in the section Hypothesis Tests
in the help file
EnvStats Functions for Censored Data.
Currently, the only function listed is
twoSampleLinearRankTestCensored
.
Required Components
The following components must be included in a legitimate list of
class "htestCensored"
.
statistic |
numeric scalar containing the value of the test statistic, with a
names attribute indicating the name of the test statistic. |
parameters |
numeric vector containing the parameter(s) associated with the null distribution of
the test statistic. This vector has a names attribute describing its element(s). |
p.value |
numeric scalar containing the p-value for the test under the null hypothesis. |
null.value |
numeric vector containing the value(s) of the population parameter(s) specified by
the null hypothesis. This vector has a names attribute describing its element(s). |
alternative |
character string indicating the alternative hypothesis (the value of the input
argument alternative). |
method |
character string giving the name of the test used. |
sample.size |
numeric scalar containing the number of non-missing observations in the sample used for the hypothesis test. |
data.name |
character string containing the actual name(s) of the input data. |
bad.obs |
the number of missing (NA), undefined (NaN), or infinite (Inf, -Inf) values removed prior to performing the hypothesis test. |
censoring.side |
character string indicating whether the data are left- or right-censored. |
censoring.name |
character string indicating the name of the data object used to identify which values are censored. |
censoring.levels |
numeric scalar or vector indicating the censoring level(s). |
percent.censored |
numeric scalar indicating the percent of non-missing observations that are censored. |
Optional Components
The following components may optionally be included in an object
of class "htestCensored":
estimate |
numeric vector containing the value(s) of the estimated population parameter(s)
involved in the null hypothesis. This vector has a names attribute describing its element(s). |
estimation.method |
character string containing the method used to compute the estimated distribution
parameter(s). The value of this component will depend on the available estimation
methods (see the help file for the relevant estimation function). |
interval |
a list containing information about a confidence, prediction, or tolerance interval. |
Generic functions that have methods for objects of class
"htestCensored"
include: print
.
Since objects of class "htestCensored"
are lists, you may extract
their components with the $
and [[
operators.
Steven P. Millard ([email protected])
print.htestCensored
, Censored Data.
# Create an object of class "htestCensored", then print it out. #-------------------------------------------------------------- htestCensored.obj <- with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater")) mode(htestCensored.obj) #[1] "list" class(htestCensored.obj) #[1] "htest" names(htestCensored.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "censoring.side" #[13] "censoring.name" "censoring.levels" "percent.censored" htestCensored.obj #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) > Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Tarone-Ware Test # with Hypergeometric Variance # #Censoring Side: left # #Data: x = PCE.ppb[Well.type == "Compliance"] # y = PCE.ppb[Well.type == "Background"] # #Censoring Variable: x = Censored[Well.type == "Compliance"] # y = Censored[Well.type == "Background"] # #Sample Sizes: nx = 8 # ny = 6 # #Percent Censored: x = 12.5% # y = 50.0% # #Test Statistics: nu = 8.458912 # var.nu = 20.912407 # z = 1.849748 # #P-value: 0.03217495 #========== # Extract the test statistics #---------------------------- htestCensored.obj$statistic # nu var.nu z # 8.458912 20.912407 1.849748 #========== # Clean up #--------- rm(htestCensored.obj)
# Create an object of class "htestCensored", then print it out. #-------------------------------------------------------------- htestCensored.obj <- with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater")) mode(htestCensored.obj) #[1] "list" class(htestCensored.obj) #[1] "htest" names(htestCensored.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "censoring.side" #[13] "censoring.name" "censoring.levels" "percent.censored" htestCensored.obj #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) > Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Tarone-Ware Test # with Hypergeometric Variance # #Censoring Side: left # #Data: x = PCE.ppb[Well.type == "Compliance"] # y = PCE.ppb[Well.type == "Background"] # #Censoring Variable: x = Censored[Well.type == "Compliance"] # y = Censored[Well.type == "Background"] # #Sample Sizes: nx = 8 # ny = 6 # #Percent Censored: x = 12.5% # y = 50.0% # #Test Statistics: nu = 8.458912 # var.nu = 20.912407 # z = 1.849748 # #P-value: 0.03217495 #========== # Extract the test statistics #---------------------------- htestCensored.obj$statistic # nu var.nu z # 8.458912 20.912407 1.849748 #========== # Clean up #--------- rm(htestCensored.obj)
Predict concentration using a calibration line (or curve) and inverse regression.
inversePredictCalibrate(object, obs.y = NULL, n.points = ifelse(is.null(obs.y), 100, length(obs.y)), intervals = FALSE, coverage = 0.99, simultaneous = FALSE, individual = FALSE, trace = FALSE)
object |
an object of class "calibrate" that is the result of calling the function calibrate. |
obs.y |
optional numeric vector of observed values for the machine signal.
The default value is obs.y=NULL. |
n.points |
optional integer indicating the number of points at which to predict concentrations
(i.e., perform inverse regression). The default value is
n.points=ifelse(is.null(obs.y), 100, length(obs.y)). |
intervals |
optional logical scalar indicating whether to compute confidence intervals for
the predicted concentrations. The default value is intervals=FALSE. |
coverage |
optional numeric scalar between 0 and 1 indicating the confidence level associated with
the confidence intervals for the predicted concentrations.
The default value is coverage=0.99. |
simultaneous |
optional logical scalar indicating whether to base the confidence intervals
for the predicted values on simultaneous or non-simultaneous prediction limits.
The default value is simultaneous=FALSE. |
individual |
optional logical scalar indicating whether to base the confidence intervals for the predicted values
on prediction limits for the mean (individual=FALSE) or prediction limits for individual
observations (individual=TRUE). The default value is individual=FALSE. |
trace |
optional logical scalar indicating whether to print out (trace) the progress of
the inverse prediction for each of the specified values of obs.y.
The default value is trace=FALSE. |
A simple and frequently used calibration model is a straight line:

Y = β0 + β1 X + ε

where the response variable Y denotes the signal of the machine, the
predictor variable X denotes the true concentration in the physical
sample, and the error term ε is assumed to follow a normal distribution with
mean 0. Note that the average value of the signal for a blank (X = 0)
is the intercept β0. Other possible calibration models include higher order
polynomial models such as a quadratic or cubic model.
In a typical setup, a small number of samples with known
concentrations are measured and the signal is recorded. A sample with no
chemical in it, called a blank, is also measured. (You have to be careful
to define exactly what you mean by a “blank.” A blank could mean
a container from the lab that has nothing in it but is prepared in a similar
fashion to containers with actual samples in them. Or it could mean a
field blank: the container was taken out to the field and subjected to the
same process that all other containers were subjected to, except a physical
sample of soil or water was not placed in the container.) Usually,
replicate measures at the same known concentrations are taken.
(The term “replicate” must be well defined to distinguish between for
example the same physical samples that are measured more than once vs. two
different physical samples of the same known concentration.)
The function calibrate
initially fits a linear calibration
line or curve. Once the calibration line is fit, samples with unknown
concentrations are measured and their signals are recorded. In order to
produce estimated concentrations, you have to use inverse regression to
map the signals to the estimated concentrations. We can quantify the
uncertainty in the estimated concentration by combining inverse regression
with prediction limits for the signal Y.
A numeric matrix containing the results of the inverse calibration.
The first two columns are labeled obs.y
and pred.x
containing
the values of the argument obs.y
and the predicted values of x
(the concentration), respectively. If intervals=TRUE
, then the matrix also
contains the columns lpl.x
and upl.x
corresponding to the lower and
upper prediction limits for x
. Also, if intervals=TRUE
, then the
matrix has the attributes coverage
(the value of the argument coverage
)
and simultaneous
(the value of the argument simultaneous
).
Almost always the process of determining the concentration of a chemical in
a soil, water, or air sample involves using some kind of machine that
produces a signal, and this signal is related to the concentration of the
chemical in the physical sample. The process of relating the machine signal
to the concentration of the chemical is called calibration
(see calibrate
). Once calibration has been performed,
estimated concentrations in physical samples with unknown concentrations
are computed using inverse regression. The uncertainty in the process used
to estimate the concentration may be quantified with decision, detection,
and quantitation limits.
In practice, only the point estimate of concentration is reported (along
with a possible qualifier), without confidence bounds for the true
concentration . This is most unfortunate because it gives the
impression that there is no error associated with the reported concentration.
Indeed, both the International Organization for Standardization (ISO) and
the International Union of Pure and Applied Chemistry (IUPAC) recommend
always reporting both the estimated concentration and the uncertainty
associated with this estimate (Currie, 1997).
Steven P. Millard ([email protected])
Currie, L.A. (1997). Detection: International Update, and Some Emerging Di-Lemmas Involving Calibration, the Blank, and Multiple Detection Decisions. Chemometrics and Intelligent Laboratory Systems 37, 151–181.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 3 and p.335.
Hubaux, A., and G. Vos. (1970). Decision and Detection Limits for Linear Calibration Curves. Analytical Chemistry 42, 849–855.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.562–575.
pointwise
, calibrate
, detectionLimitCalibrate
, lm
# The data frame EPA.97.cadmium.111.df contains calibration data # for cadmium at mass 111 (ng/L) that appeared in # Gibbons et al. (1997b) and were provided to them by the U.S. EPA. # Here we # 1. Display a plot of these data along with the fitted calibration # line and 99% non-simultaneous prediction limits. # 2. Then based on an observed signal of 60 from a sample with # unknown concentration, we use the calibration line to estimate # the true concentration and use the prediction limits to compute # confidence bounds for the true concentration. # An observed signal of 60 results in an estimated value of cadmium # of 59.97 ng/L and a confidence interval of [53.83, 66.15]. # See Millard and Neerchal (2001, pp.566-569) for more details on # this example. Cadmium <- EPA.97.cadmium.111.df$Cadmium Spike <- EPA.97.cadmium.111.df$Spike calibrate.list <- calibrate(Cadmium ~ Spike, data = EPA.97.cadmium.111.df) newdata <- data.frame(Spike = seq(min(Spike), max(Spike), length.out = 100)) pred.list <- predict(calibrate.list, newdata = newdata, se.fit = TRUE) pointwise.list <- pointwise(pred.list, coverage = 0.99, individual = TRUE) plot(Spike, Cadmium, ylim = c(min(pointwise.list$lower), max(pointwise.list$upper)), xlab = "True Concentration (ng/L)", ylab = "Observed Concentration (ng/L)") abline(calibrate.list, lwd=2) lines(newdata$Spike, pointwise.list$lower, lty=8, lwd=2) lines(newdata$Spike, pointwise.list$upper, lty=8, lwd=2) title(paste("Calibration Line and 99% Prediction Limits", "for US EPA Cadmium 111 Data", sep = "\n")) # Now estimate the true concentration based on # an observed signal of 60 ng/L. inversePredictCalibrate(calibrate.list, obs.y = 60, intervals = TRUE, coverage = 0.99, individual = TRUE) # obs.y pred.x lpl.x upl.x #[1,] 60 59.97301 53.8301 66.15422 #attr(, "coverage"): #[1] 0.99 #attr(, "simultaneous"): #[1] FALSE rm(Cadmium, Spike, calibrate.list, newdata, pred.list, pointwise.list)
Compute the interquartile range for a set of data.
iqr(x, na.rm = FALSE)
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from x prior to
computing the interquartile range. The default value is na.rm=FALSE. |
Let x_1, x_2, ..., x_n denote a random sample of n observations from
some distribution associated with a random variable X. The sample
interquartile range is defined as:

IQR = x̂_0.75 - x̂_0.25

where x_p denotes the p'th quantile of the distribution and
x̂_p denotes the estimate of this quantile (i.e., the sample
p'th quantile).
See the R help file for quantile
for information on how sample
quantiles are computed.
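For example, the following sketch (not part of the original help file, assuming EnvStats is attached so that iqr is available) shows that the sample interquartile range is simply the difference between the sample 75'th and 25'th percentiles as computed by quantile with its default settings:

set.seed(528)
x <- rlnorm(30, meanlog = 2, sdlog = 1)

iqr(x)
diff(as.vector(quantile(x, probs = c(0.25, 0.75))))
# The two values should agree.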
A numeric scalar – the interquartile range.
The interquartile range is a robust estimate of the spread of the
distribution. It is the length of the box (the distance between the lower and
upper quartiles) in a boxplot (see the R help file for boxplot).
For a normal distribution with standard deviation σ it can be shown that:

IQR = 2 z_0.75 σ ≈ 1.349 σ

where z_0.75 denotes the 75'th percentile of the standard normal distribution.
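A quick numerical check of this relationship (again a sketch, not part of the original help file):

2 * qnorm(0.75)
#[1] 1.34898

set.seed(321)
iqr(rnorm(100000, mean = 0, sd = 2)) / 2
# Should be close to 1.349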
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hirsch, R.M., D.R. Helsel, T.A. Cohn, and E.J. Gilroy. (1993). Statistical Analysis of Hydrologic Data. In: Maidment, D.R., ed. Handbook of Hydrology. McGraw-Hill, New York, Chapter 17, pp.5–7.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
Summary Statistics, summaryFull
,
var
, sd
.
# Generate 20 observations from a normal distribution with parameters # mean=10 and sd=2, and compute the standard deviation and # interquartile range. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rnorm(20, mean=10, sd=2) sd(dat) #[1] 1.180226 iqr(dat) #[1] 1.489932 #---------- # Repeat the last example, but add a couple of large "outliers" to the # data. Note that the estimated standard deviation is greatly affected # by the outliers, while the interquartile range is not. summaryStats(dat, quartiles = TRUE) # N Mean SD Median Min Max 1st Qu. 3rd Qu. #dat 20 9.8612 1.1802 9.6978 7.6042 11.8756 9.1618 10.6517 new.dat <- c(dat, 20, 50) sd(dat) #[1] 1.180226 sd(new.dat) #[1] 8.79796 iqr(dat) #[1] 1.489932 iqr(new.dat) #[1] 1.851472 #---------- # Clean up rm(dat, new.dat)
Perform a nonparametric test for a monotonic trend within each season based on Kendall's tau statistic, and optionally compute a confidence interval for the slope across all seasons.
kendallSeasonalTrendTest(y, ...) ## S3 method for class 'formula' kendallSeasonalTrendTest(y, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: kendallSeasonalTrendTest(y, season, year, alternative = "two.sided", correct = TRUE, ci.slope = TRUE, conf.level = 0.95, independent.obs = TRUE, data.name = NULL, season.name = NULL, year.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...) ## S3 method for class 'data.frame' kendallSeasonalTrendTest(y, ...) ## S3 method for class 'matrix' kendallSeasonalTrendTest(y, ...)
y |
an object containing data for the trend test. In the default method,
the argument |
data |
specifies an optional data frame, list or environment (or object coercible by
|
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
season |
numeric or character vector or a factor indicating the seasons in which the observations in
|
year |
numeric vector indicating the years in which the observations in |
alternative |
character string indicating the kind of alternative hypothesis. The
possible values are |
correct |
logical scalar indicating whether to use the correction for continuity in
computing the |
ci.slope |
logical scalar indicating whether to compute a confidence interval for the
slope. The default value is |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated
with the confidence interval for the slope. The default value is
|
independent.obs |
logical scalar indicating whether to assume the observations in |
data.name |
character string indicating the name of the data used for the trend test.
The default value is |
season.name |
character string indicating the name of the data used for the season.
The default value is |
year.name |
character string indicating the name of the data used for the year.
The default value is |
parent.of.data |
character string indicating the source of the data used for the trend test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the test for trend. |
Hirsch et al. (1982) introduced a modification of Kendall's test for trend
(see kendallTrendTest
) that allows for seasonality in observations collected over time.
They call this test the seasonal Kendall test. Their test is appropriate for testing for
trend in each season when the trend is always in the same direction across all seasons.
van Belle and Hughes (1984) introduced a heterogeneity test for trend which is appropriate for testing
for trend in any direction in any season. Hirsch and Slack (1984) proposed an extension to the seasonal
Kendall test that allows for serial dependence in the observations. The function
kendallSeasonalTrendTest
includes all of these tests, as well as an extension of the
van Belle-Hughes heterogeneity test to the case of serial dependence.
Testing for Trend Assuming Serial Independence
The Model
Assume observations are taken over two or more years, and assume a single year
can be divided into two or more seasons. Let p denote the number of seasons.
Let X and Y denote two continuous random variables with some joint
(bivariate) distribution (which may differ from season to season). Let n_j
denote the number of bivariate observations taken in the j'th season (over two
or more years) (j = 1, 2, ..., p), let

(X_{1j}, Y_{1j}), (X_{2j}, Y_{2j}), ..., (X_{n_j j}, Y_{n_j j})

denote the bivariate observations from this distribution for season j,
assume these bivariate observations are mutually independent, and let τ_j
denote the value of Kendall's tau for that season (see kendallTrendTest).
Also, assume all of the observations are independent.
The function kendallSeasonalTrendTest assumes that the X values always
denote the year in which the observation was taken. Note that within any season,
the X values need not be unique. That is, there may be more than one
observation within the same year within the same season. In this case, the
argument y must be a numeric vector, and you must supply the additional
arguments season and year.
If there is only one observation per season per year (missing values allowed), it is
usually easiest to supply the argument y as an n × p matrix or
data frame, where n denotes the number of years. In this case

n_1 = n_2 = ... = n_p = n                                            (2)

and

X_{ij} = i,   i = 1, 2, ..., n;  j = 1, 2, ..., p                    (3)

so if Y denotes the n × p matrix of observed Y's and X denotes the
n × p matrix of the X's, then

Y = [ Y_{11}  Y_{12}  ...  Y_{1p}
      Y_{21}  Y_{22}  ...  Y_{2p}                                    (4)
        .       .            .
      Y_{n1}  Y_{n2}  ...  Y_{np} ]

X = [ 1  1  ...  1
      2  2  ...  2                                                   (5)
      .  .       .
      n  n  ...  n ]
The null hypothesis is that within each season the X and Y random
variables are independent; that is, within each season there is no trend in the
observations over time. This null hypothesis can be expressed as:

H0: τ_1 = τ_2 = ... = τ_p = 0                                        (6)

The Seasonal Kendall Test for Trend

Hirsch et al.'s (1982) extension of Kendall's tau statistic to test the null
hypothesis (6) is based on simply summing together the Kendall S-statistics
for each season and computing the following statistic:

z = S' / sqrt(Var(S'))                                               (7)

or, using the correction for continuity,

z = [ S' - sign(S') ] / sqrt(Var(S'))                                (8)

where

S' = Σ_{j=1}^{p} S_j                                                 (9)

S_j = Σ_{i<k} sign[ (X_{kj} - X_{ij}) (Y_{kj} - Y_{ij}) ]            (10)

and sign() denotes the sign function:

sign(x) =  1   if x > 0
           0   if x = 0                                              (11)
          -1   if x < 0
Note that the quantity defined in Equation (10) is simply the Kendall S-statistic for
season j (j = 1, 2, ..., p) (see Equation (3) in the help file for
kendallTrendTest).

For each season, if the predictor variables (the X's) are strictly increasing
(e.g., Equation (3) above), then Equation (10) simplifies to

S_j = Σ_{i<k} sign(Y_{kj} - Y_{ij})

Under the null hypothesis (6), the quantity z defined in Equation (7) or (8)
is approximately distributed as a standard normal random variable.

Note that there may be missing values in the observations, so let n_j
denote the number of (X, Y) pairs without missing values for season j.

The statistic S' in Equation (9) has mean and variance given by:

E(S') = Σ_{j=1}^{p} E(S_j)                                           (12)

Var(S') = Σ_{j=1}^{p} Var(S_j) + Σ_{g≠h} Cov(S_g, S_h)               (13)

        = Σ_{j=1}^{p} σ_j² + Σ_{g≠h} σ_gh                            (14)

Since all the observations are assumed to be mutually independent,

σ_gh = Cov(S_g, S_h) = 0,   g ≠ h,  g, h = 1, 2, ..., p              (15)

Furthermore, under the null hypothesis (6),

E(S_j) = 0,   j = 1, 2, ..., p                                       (16)

and, in the case of no tied observations,

σ_j² = Var(S_j) = n_j (n_j - 1) (2 n_j + 5) / 18                     (17)

for j = 1, 2, ..., p (see equation (7) in the help file for
kendallTrendTest).
In the case of tied observations,

σ_j² = Var(S_j) =
    [ n_j (n_j - 1)(2 n_j + 5) - Σ_t t(t - 1)(2t + 5) - Σ_u u(u - 1)(2u + 5) ] / 18
    + [ Σ_t t(t - 1)(t - 2) ] [ Σ_u u(u - 1)(u - 2) ] / [ 9 n_j (n_j - 1)(n_j - 2) ]
    + [ Σ_t t(t - 1) ] [ Σ_u u(u - 1) ] / [ 2 n_j (n_j - 1) ]                         (18)

where the sums over t are taken over the tied groups in the X observations for
season j, with t denoting the size of a tied group in the X observations, and the
sums over u are taken over the tied groups in the Y observations for season j, with
u denoting the size of a tied group in the Y observations
(see Equation (9) in the help file for kendallTrendTest).
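To see how Equations (7), (9), and (17) fit together, here is a small sketch (not part of the original help file; it assumes EnvStats is attached, one observation per season per year, and no ties or missing values). It computes the Kendall S-statistic and its null variance within each season, sums them, and forms the z-statistic without the continuity correction; the result can be compared with the output of kendallSeasonalTrendTest, which by default applies the continuity correction:

kendall.S <- function(y) {
  # S = sum over all pairs i < k of sign(y[k] - y[i]), with time order 1:length(y)
  n <- length(y)
  sum(sign(outer(y, y, "-"))[lower.tri(matrix(0, n, n))])
}

set.seed(15)
n.years <- 10
p <- 4
# One observation per season per year, with an increasing trend
y.mat <- matrix(rnorm(n.years * p), ncol = p) + 0.3 * (1:n.years)

S.seasonal   <- apply(y.mat, 2, kendall.S)
var.seasonal <- rep(n.years * (n.years - 1) * (2 * n.years + 5) / 18, p)

sum(S.seasonal) / sqrt(sum(var.seasonal))   # z without continuity correction

kendallSeasonalTrendTest(y.mat)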
Estimating τ, Slope, and Intercept

The function kendallSeasonalTrendTest returns estimated values of
Kendall's τ, the slope, and the intercept for each season, as well as a
single estimate for each of these three quantities combined over all seasons.
The overall estimate of τ is the weighted average of the p seasonal
τ's:

τ̂ = [ Σ_{j=1}^{p} n_j τ̂_j ] / [ Σ_{j=1}^{p} n_j ]                   (19)

where

τ̂_j = 2 S_j / [ n_j (n_j - 1) ]                                      (20)

(see Equation (2) in the help file for kendallTrendTest).

We can compute the estimated slope for season j as:

β̂_{1j} = Median{ (Y_{kj} - Y_{ij}) / (X_{kj} - X_{ij}) :  X_{kj} ≠ X_{ij} }          (21)

for j = 1, 2, ..., p. The overall estimate of slope, however, is
not the median of these p estimates of slope; instead,
following Hirsch et al. (1982, p.117), the overall estimate of slope is the median
of all two-point slopes computed within each season:

β̂_1 = Median{ (Y_{kj} - Y_{ij}) / (X_{kj} - X_{ij}) :  X_{kj} ≠ X_{ij},  j = 1, 2, ..., p }   (22)

(see Equation (15) in the help file for kendallTrendTest).
The overall estimate of intercept is the median of the p seasonal estimates of
intercept:

β̂_0 = Median( β̂_{01}, β̂_{02}, ..., β̂_{0p} )                         (23)

where

β̂_{0j} = Median(Y_{.j}) - β̂_{1j} Median(X_{.j})                      (24)

and Median(X_{.j}) and Median(Y_{.j}) denote the sample medians of the X's
and Y's, respectively, for season j (see Equation (16) in the help file for
kendallTrendTest).
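The following sketch (not part of the original help file; the helper seasonal.sen.slope is hypothetical and assumes one observation per season per year with no missing values) computes the overall slope of Equation (22) by brute force and can be compared with the slope component of the estimate element returned by kendallSeasonalTrendTest:

seasonal.sen.slope <- function(y.mat) {
  # y.mat: n.years x n.seasons matrix; the X's are the year indices 1, 2, ..., n.years.
  # Collect every within-season two-point slope (Y[k, j] - Y[i, j]) / (k - i), i < k.
  n <- nrow(y.mat)
  pairs <- combn(n, 2)
  slopes <- apply(y.mat, 2, function(y)
    (y[pairs[2, ]] - y[pairs[1, ]]) / (pairs[2, ] - pairs[1, ]))
  median(slopes)
}

set.seed(7)
y.mat <- matrix(rnorm(10 * 4), ncol = 4) + 0.5 * (1:10)

seasonal.sen.slope(y.mat)
kendallSeasonalTrendTest(y.mat)$estimate["slope"]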
Confidence Interval for the Slope
Gilbert (1987, p.227-228) extends his method of computing a confidence interval for
the slope to the case of seasonal observations. Let N' denote the number of
defined two-point estimated slopes that are used in Equation (22) above and let

β̂_{1(1)} ≤ β̂_{1(2)} ≤ ... ≤ β̂_{1(N')}

denote the N' ordered slopes. For Gilbert's (1987) method, a
100(1 - α)% two-sided confidence interval for the true over-all
slope across all seasons is given by:

[ β̂_{1(M1)} ,  β̂_{1(M2+1)} ]                                         (25)

where

M1 = (N' - C_α) / 2                                                   (26)

M2 = (N' + C_α) / 2                                                   (27)

C_α = z_{1-α/2} sqrt( Var(S') )                                       (28)

Var(S') is defined in Equation (14), and z_{1-α/2} denotes the
(1 - α/2)'th quantile of the standard normal distribution.
One-sided confidence intervals may be computed in a similar fashion.

Usually the quantities M1 and M2 will not be integers.
Gilbert (1987, p.219) suggests interpolating between adjacent values in this case,
which is what the function kendallSeasonalTrendTest does.
The Van Belle-Hughes Heterogeneity Test for Trend
The seasonal Kendall test described above is appropriate for testing the null
hypothesis (6) against the alternative hypothesis of a trend in at least one season.
All of the trends in each season should be in the same direction.
The seasonal Kendall test is not appropriate for testing for trend when there are
trends in a positive direction in one or more seasons and also negative trends in
one or more seasons. For example, for the following set of observations, the
seasonal Kendall statistic is 0 with an associated two-sided p-value of 1,
even though there is clearly a positive trend in season 1 and a negative trend in
season 2.
Year | Season 1 | Season 2 |
1 | 5 | 8 |
2 | 6 | 7 |
3 | 7 | 6 |
4 | 8 | 5 |
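A quick sketch (not part of the original help file, assuming EnvStats is attached) reproduces this behavior with the data in the table above:

y.mat <- cbind(Season.1 = c(5, 6, 7, 8), Season.2 = c(8, 7, 6, 5))
kendallSeasonalTrendTest(y.mat)
# The z (trend) statistic should be 0 with a two-sided p-value of 1, while the
# chi-square heterogeneity statistic should be large relative to a chi-square
# distribution with 1 degree of freedom.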
Van Belle and Hughes (1984) suggest using the following statistic to test for heterogeneity in trend prior to applying the seasonal Kendall test:

χ²_het = Σ_{j=1}^{p} Z_j² - p Z̄²                                     (29)

where

Z_j = S_j / sqrt( Var(S_j) )                                          (30)

Z̄ = (1/p) Σ_{j=1}^{p} Z_j                                            (31)

Under the null hypothesis (6), the statistic defined in Equation (29) is
approximately distributed as a chi-square random variable with (p - 1)
degrees of freedom. Note that the continuity correction is not used to
compute the Z_j's defined in Equation (30) since using it results in an
unacceptably conservative test (van Belle and Hughes, 1984, p.132). Van Belle and
Hughes (1984) actually call the statistic in (29) a homogeneous chi-square statistic.
Here it is called a heterogeneous chi-square statistic after the alternative
hypothesis it is meant to test.
Van Belle and Hughes (1984) imply that the heterogeneity statistic defined in Equation (29) may be used to test the null hypothesis:

H0: τ_1 = τ_2 = ... = τ_p = τ                                         (32)

where τ is some arbitrary number between -1 and 1. For this case, however,
the distribution of the test statistic in Equation (29) is unknown since it depends
on the unknown value of τ (Equations (16)-(18) above assume τ = 0
and are not correct if τ ≠ 0). The heterogeneity
chi-square statistic of Equation (29) may be assumed to be approximately
distributed as chi-square with (p - 1) degrees of freedom under the null
hypothesis (32), but further study is needed to determine how well this
approximation works.
Testing for Trend Assuming Serial Dependence
The Model
Assume the same model as for the case of serial independence, except now the
observed Y's are not assumed to be independent of one another, but are
allowed to be serially correlated over time. Furthermore, assume one observation
per season per year (Equations (2)-(5) above).
The Seasonal Kendall Test for Trend Modified for Serial Dependence
Hirsch and Slack (1984) introduced a modification of the seasonal Kendall test that
is robust against serial dependence (in terms of Type I error) except when the
observations have a very strong long-term persistence (very large autocorrelation) or
when the sample sizes are small (e.g., 5 years of monthly data). This modification
is based on a multivariate test introduced by Dietz and Killeen (1981).
In the case of serial dependence, Equation (15) is no longer true, so an estimate of
the correct value of σ_gh must be used to compute Var(S') in
Equation (14). Let R denote the n × p matrix of ranks for the Y
observations (Equation (4) above), where the Y's are ranked within
season:

R = [ R_{11}  R_{12}  ...  R_{1p}
      R_{21}  R_{22}  ...  R_{2p}                                     (33)
        .       .            .
      R_{n1}  R_{n2}  ...  R_{np} ]

where

R_{ij} = [ n_j + 1 + Σ_{k=1}^{n} sign(Y_{ij} - Y_{kj}) ] / 2          (34)

(terms involving missing values are set equal to 0),
the sign function is defined in Equation (11) above, and as before n_j denotes
the number of (X, Y) pairs without missing values for season j. Note that
by this definition, missing values are assigned the mid-rank of the non-missing
values.
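Here is a small sketch of the within-season ranking rule in Equation (34) (not part of the original help file; the helper hs.rank is hypothetical). It illustrates that non-missing values receive their ordinary mid-ranks among the non-missing values, while missing values receive the average rank (n_j + 1)/2:

hs.rank <- function(y) {
  # R_i = [n + 1 + sum_k sign(y[i] - y[k])] / 2, where n is the number of
  # non-missing values and terms involving missing values count as 0.
  n <- sum(!is.na(y))
  sgn <- sign(outer(y, y, "-"))
  sgn[is.na(sgn)] <- 0
  (n + 1 + rowSums(sgn)) / 2
}

y <- c(3.2, NA, 5.1, 4.7, NA, 5.1)
hs.rank(y)
# Non-missing values get ranks 1, 3.5, 2, 3.5 (mid-ranks for the tied 5.1's);
# the two missing values get the mid-rank (4 + 1)/2 = 2.5.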
Hirsch and Slack (1984) suggest using the following formula, given by Dietz and Killeen (1981), in the case where there are no missing values:

σ̂_gh = [ K_gh + 4 Σ_{i=1}^{n} R_{ig} R_{ih} - n (n + 1)² ] / 3        (35)

where

K_gh = Σ_{1 ≤ i < j ≤ n} sign[ (Y_{jg} - Y_{ig}) (Y_{jh} - Y_{ih}) ]   (36)

Note that the quantity defined in Equation (36) is Kendall's tau statistic for season
g vs. season h.

For the case of missing values, Hirsch and Slack (1984) derive the following modification of Equation (35):

σ̂_gh = [ K_gh + 4 Σ_{i=1}^{n} R_{ig} R_{ih} - n (n_g + 1)(n_h + 1) ] / 3   (37)
Technically, the estimates in Equations (35) and (37) are not correct estimators of
covariance, and Equations (17) and (18) are not correct estimators of variance,
because the model Dietz and Killeen (1981) use assumes that observations within the
rows of Y (Equation (4) above) may be correlated, but observations between
rows are independent. Serial dependence induces correlation between all of the
Y's. In most cases, however, the serial dependence shows an exponential decay
in correlation across time and so these estimates work fairly well (see more
discussion in the Note section below).
Estimates and Confidence Intervals

The seasonal and over-all estimates of τ, slope, and intercept are computed
using the same methods as in the case of serial independence. Also, the method for
computing the confidence interval for the slope is the same as in the case of serial
independence. Note that the serial dependence is accounted for in the term
Var(S') in Equation (28).
The Van Belle-Hughes Heterogeneity Test for Trend Modified for Serial Dependence
Like its counterpart in the case of serial independence, the seasonal Kendall test
modified for serial dependence described above is appropriate for testing the null
hypothesis (6) against the alternative hypothesis of a trend in at least one season.
All of the trends in each season should be in the same direction.
The modified seasonal Kendall test is not appropriate for testing for trend when there are trends in a positive direction in one or more seasons and also negative trends in one or more seasons. This section describes a modification of the van Belle-Hughes heterogeneity test for trend in the presence of serial dependence.
Let S denote the p × 1 vector of Kendall S-statistics for
each season:

S = ( S_1, S_2, ..., S_p )'

The distribution of S is approximately multivariate normal with mean vector

μ = ( μ_1, μ_2, ..., μ_p )'

and variance-covariance matrix

Σ_S = [ σ_gh ],   g, h = 1, 2, ..., p

where

μ_j = E(S_j),   σ_jj = σ_j² = Var(S_j),   σ_gh = Cov(S_g, S_h) for g ≠ h
Define the p × p diagonal matrix D as

D = diag( 2 / [n_1(n_1 - 1)],  2 / [n_2(n_2 - 1)],  ...,  2 / [n_p(n_p - 1)] )
Then the vector of the p seasonal estimates of τ can be written as:

τ̂ = ( τ̂_1, τ̂_2, ..., τ̂_p )' = D S                                   (43)
so the distribution of the vector in Equation (43) is approximately multivariate normal with

E(τ̂) = D μ

Var(τ̂) = D Σ_S D'

where ' denotes the transpose operator.
Let C denote the (p - 1) × p contrast matrix

C = [ 1  -I ]

where 1 denotes a column vector of (p - 1) ones and I denotes the
(p - 1) × (p - 1) identity matrix. That is,

C = [ 1  -1   0  ...   0
      1   0  -1  ...   0
      .              .
      1   0   0  ...  -1 ]
Then the null hypothesis (32) is equivalent to the null hypothesis:

H0: C τ = 0                                                           (47)

Based on theory for samples from a multivariate normal distribution (Johnson and Wichern, 2007), under the null hypothesis (47) the quantity

χ² = ( C τ̂ )' [ C V̂ C' ]^(-1) ( C τ̂ )                                (48)

has approximately a chi-square distribution with (p - 1) degrees of freedom for
“large” values of seasonal sample sizes, where

V̂ = D Σ̂_S D'                                                         (49)

The estimate of Σ_S in Equation (49) can be computed using the same formulas
that are used for the modified seasonal Kendall test (i.e., Equation (35) or (37)
for the off-diagonal elements and Equation (17) or (18) for the diagonal elements).
As previously noted, the formulas for the variances are actually only valid if
τ = 0 and there is no correlation between the rows of Y. The same is
true of the formulas for the covariances. More work is needed to determine the
goodness of the chi-square approximation for the test statistic in (48). The
pseudo-heterogeneity test statistic of Equation (48), however, should provide some
guidance as to whether the null hypothesis (32) (or equivalently (47)) appears to be
true.
A list of class "htest"
containing the results of the hypothesis
test. See the help file for htest.object
for details.
In addition, the following components are part of the list returned by kendallSeasonalTrendTest
:
seasonal.S |
numeric vector. The value of the Kendall S-statistic for each season. |
var.seasonal.S |
numeric vector. The variance of the Kendall S-statistic for each season.
This component only appears when independent.obs=TRUE. |
var.cov.seasonal.S |
numeric matrix. The estimated variance-covariance matrix of the Kendall
S-statistics for each season. This component only appears when independent.obs=FALSE. |
seasonal.estimates |
numeric matrix. The estimated Kendall's tau, slope, and intercept for each season. |
Kendall's test for independence or trend is a nonparametric test. No assumptions are made about the
distribution of the X and Y variables. Hirsch et al. (1982) introduced the seasonal
Kendall test to test for trend within each season. They note that Kendall's test for trend is easy to
compute, even in the presence of missing values, and can also be used with censored values.
van Belle and Hughes (1984) note that the seasonal Kendall test introduced by Hirsch et al. (1982) is similar to a multivariate extension of the sign test proposed by Jonckheere (1954). Jonckheere's test statistic is based on the unweighted sum of the seasonal tau statistics, while Hirsch et al.'s test is based on the weighted sum (weighted by number of observations within a season) of the seasonal tau statistics.
van Belle and Hughes (1984) also note that Kendall's test for trend is slightly less powerful than the test based on Spearman's rho, but it converges to normality faster. Also, Bradley (1968, p.288) shows that for the case of a linear model with normal (Gaussian) errors, the asymptotic relative efficiency of Kendall's test for trend versus the parametric test for a zero slope is 0.98.
Based on the work of Dietz and Killeen (1981), Hirsch and Slack (1984) describe a modified version of the
seasonal Kendall test that allows for serial dependence in the observations. They performed a Monte Carlo
study to determine the empirical significance level and power of this modified test vs. the test that
assumes independent observations and found a trade-off between power and the correct significance level.
For the numbers of seasons and years they considered, they found the modified test gave correct significance levels
as long as the lag-one autocorrelation was 0.6 or less, while the original test that assumes independent
observations yielded highly inflated significance levels. On the other hand, if in fact the observations
are serially independent, the original test is more powerful than the modified test.
Hirsch and Slack (1984) also looked at the performance of the test for trend introduced by
Dietz and Killeen (1981), which is a weighted sums of squares of the seasonal Kendall S-statistics,
where the matrix of weights is the inverse of the covariance matrix. The Dietz-Killeen test statistic,
unlike the one proposed by Hirsch and Slack (1984), tests for trend in either direction in any season,
and is asymptotically distributed as a chi-square random variable with p (the number of seasons)
degrees of freedom. Hirsch and Slack (1984), however, found that the test based on this statistic is
quite conservative (i.e., the significance level is much smaller than the assumed significance level)
and has poor power even for moderate sample sizes. The chi-square approximation becomes reasonably
close only for relatively large sample sizes, with the required number of years increasing with the number of seasons.
Lettenmaier (1988) notes the poor power of the test proposed by Dietz and Killeen (1981) and states the poor power apparently results from an upward bias in the estimated variance of the statistic, which can be traced to the inversion of the estimated covariance matrix. He suggests an alternative test statistic (to test trend in either direction in any season) that is the sum of the squares of the scaled seasonal Kendall S-statistics (scaled by their standard deviations). Note that this test statistic ignores information about the covariance between the seasonal Kendall S-statistics, although its distribution depends on these covariances. In the case of no serial dependence, Lettenmaier's test statistic is exactly the same as the Dietz-Killeen test statistic. In the case of serial dependence, Lettenmaier (1988) notes his test statistic is a quadratic form of a multivariate normal random variable and therefore all the moments of this random variable are easily computed. Lettenmaier (1988) approximates the distribution of his test statistic as a scaled non-central chi-square distribution (with fractional degrees of freedom). Based on extensive Monte Carlo studies, Lettenmaier (1988) shows that for the case when the trend is the same in all seasons, the seasonal Kendall's test of Hirsch and Slack (1984) is superior to his test and far superior to the Dietz-Killeen test. The power of Lettenmaier's test approached that of the seasonal Kendall test for large trend magnitudes.
Steven P. Millard ([email protected])
Bradley, J.V. (1968). Distribution-Free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, pp.256-272.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 16.
Helsel, D.R. and R.M. Hirsch. (1988). Discussion of Applicability of the t-test for Detecting Trends in Water Quality Variables. Water Resources Bulletin 24(1), 201-204.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, NY.
Helsel, D.R., and R. M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. Available on-line at https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf.
Hirsch, R.M., J.R. Slack, and R.A. Smith. (1982). Techniques of Trend Analysis for Monthly Water Quality Data. Water Resources Research 18(1), 107-121.
Hirsch, R.M. and J.R. Slack. (1984). A Nonparametric Trend Test for Seasonal Data with Serial Dependence. Water Resources Research 20(6), 727-732.
Hirsch, R.M., R.B. Alexander, and R.A. Smith. (1991). Selection of Methods for the Detection and Estimation of Trends in Water Quality. Water Resources Research 27(5), 803-813.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods, Second Edition. John Wiley and Sons, New York.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ.
Kendall, M.G. (1938). A New Measure of Rank Correlation. Biometrika 30, 81-93.
Kendall, M.G. (1975). Rank Correlation Methods. Charles Griffin, London.
Mann, H.B. (1945). Nonparametric Tests Against Trend. Econometrica 13, 245-259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Sen, P.K. (1968). Estimates of the Regression Coefficient Based on Kendall's Tau. Journal of the American Statistical Association 63, 1379-1389.
Theil, H. (1950). A Rank-Invariant Method of Linear and Polynomial Regression Analysis, I-III. Proc. Kon. Ned. Akad. v. Wetensch. A. 53, 386-392, 521-525, 1397-1412.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
van Belle, G., and J.P. Hughes. (1984). Nonparametric Tests for Trend in Water Quality. Water Resources Research 20(1), 127-136.
kendallTrendTest
, htest.object
, cor.test
.
# Reproduce Example 14-10 on page 14-38 of USEPA (2009). This example # tests for trend in analyte concentrations (ppm) collected monthly # between 1983 and 1985. head(EPA.09.Ex.14.8.df) # Month Year Unadj.Conc Adj.Conc #1 January 1983 1.99 2.11 #2 February 1983 2.10 2.14 #3 March 1983 2.12 2.10 #4 April 1983 2.12 2.13 #5 May 1983 2.11 2.12 #6 June 1983 2.15 2.12 tail(EPA.09.Ex.14.8.df) # Month Year Unadj.Conc Adj.Conc #31 July 1985 2.31 2.23 #32 August 1985 2.32 2.24 #33 September 1985 2.28 2.23 #34 October 1985 2.22 2.24 #35 November 1985 2.19 2.25 #36 December 1985 2.22 2.23 # Plot the data #-------------- Unadj.Conc <- EPA.09.Ex.14.8.df$Unadj.Conc Adj.Conc <- EPA.09.Ex.14.8.df$Adj.Conc Month <- EPA.09.Ex.14.8.df$Month Year <- EPA.09.Ex.14.8.df$Year Time <- paste(substring(Month, 1, 3), Year - 1900, sep = "-") n <- length(Unadj.Conc) Three.Yr.Mean <- mean(Unadj.Conc) dev.new() par(mar = c(7, 4, 3, 1) + 0.1, cex.lab = 1.25) plot(1:n, Unadj.Conc, type = "n", xaxt = "n", xlab = "Time (Month)", ylab = "ANALYTE CONCENTRATION (mg/L)", main = "Figure 14-15. Seasonal Time Series Over a Three Year Period", cex.main = 1.1) axis(1, at = 1:n, labels = rep("", n)) at <- rep(c(1, 5, 9), 3) + rep(c(0, 12, 24), each = 3) axis(1, at = at, labels = Time[at]) points(1:n, Unadj.Conc, pch = 0, type = "o", lwd = 2) points(1:n, Adj.Conc, pch = 3, type = "o", col = 8, lwd = 2) abline(h = Three.Yr.Mean, lwd = 2) legend("topleft", c("Unadjusted", "Adjusted", "3-Year Mean"), bty = "n", pch = c(0, 3, -1), lty = c(1, 1, 1), lwd = 2, col = c(1, 8, 1), inset = c(0.05, 0.01)) # Perform the seasonal Kendall trend test #---------------------------------------- kendallSeasonalTrendTest(Unadj.Conc ~ Month + Year, data = EPA.09.Ex.14.8.df) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: All 12 values of tau = 0 # #Alternative Hypothesis: The seasonal taus are not all equal # (Chi-Square Heterogeneity Test) # At least one seasonal tau != 0 # and all non-zero tau's have the # same sign (z Trend Test) # #Test Name: Seasonal Kendall Test for Trend # (with continuity correction) # #Estimated Parameter(s): tau = 0.9722222 # slope = 0.0600000 # intercept = -131.7350000 # #Estimation Method: tau: Weighted Average of # Seasonal Estimates # slope: Hirsch et al.'s # Modification of # Thiel/Sen Estimator # intercept: Median of # Seasonal Estimates # #Data: y = Unadj.Conc # season = Month # year = Year # #Data Source: EPA.09.Ex.14.8.df # #Sample Sizes: January = 3 # February = 3 # March = 3 # April = 3 # May = 3 # June = 3 # July = 3 # August = 3 # September = 3 # October = 3 # November = 3 # December = 3 # Total = 36 # #Test Statistics: Chi-Square (Het) = 0.1071882 # z (Trend) = 5.1849514 # #Test Statistic Parameter: df = 11 # #P-values: Chi-Square (Het) = 1.000000e+00 # z (Trend) = 2.160712e-07 # #Confidence Interval for: slope # #Confidence Interval Method: Gilbert's Modification of # Theil/Sen Method # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.05786914 # UCL = 0.07213086 #========== # Clean up #--------- rm(Unadj.Conc, Adj.Conc, Month, Year, Time, n, Three.Yr.Mean, at) graphics.off()
Perform a nonparametric test for a monotonic trend based on Kendall's tau statistic, and optionally compute a confidence interval for the slope.
kendallTrendTest(y, ...) ## S3 method for class 'formula' kendallTrendTest(y, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: kendallTrendTest(y, x = seq(along = y), alternative = "two.sided", correct = TRUE, ci.slope = TRUE, conf.level = 0.95, warn = TRUE, data.name = NULL, data.name.x = NULL, parent.of.data = NULL, subset.expression = NULL, ...)
y |
an object containing data for the trend test. In the default method,
the argument |
data |
specifies an optional data frame, list or environment (or object coercible by
|
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
x |
numeric vector of "predictor" values. The length of |
alternative |
character string indicating the kind of alternative hypothesis. The
possible values are |
correct |
logical scalar indicating whether to use the correction for continuity in
computing the |
ci.slope |
logical scalar indicating whether to compute a confidence interval for the
slope. The default value is |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated
with the confidence interval for the slope. The default value is
|
warn |
logical scalar indicating whether to print a warning message when
|
data.name |
character string indicating the name of the data used for the trend test.
The default value is |
data.name.x |
character string indicating the name of the data used for the predictor variable x.
If |
parent.of.data |
character string indicating the source of the data used for the trend test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the test for trend. |
kendallTrendTest
performs Kendall's nonparametric test for a monotonic trend,
which is a special case of the test for independence based on Kendall's tau statistic
(see cor.test
). The slope is estimated using the method of Theil (1950) and
Sen (1968). When ci.slope=TRUE
, the confidence interval for the slope is
computed using Gilbert's (1987) Modification of the Theil/Sen Method.
Kendall's test for a monotonic trend is a special case of the test for independence
based on Kendall's tau statistic. The first section below explains the general case
of testing for independence. The second section explains the special case of
testing for monotonic trend. The last section explains how a simple linear
regression model is a special case of a monotonic trend and how the slope may be
estimated.
The General Case of Testing for Independence
Definition of Kendall's Tau
Let X and Y denote two continuous random variables with some joint (bivariate) distribution. Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) denote a set of n bivariate observations from this distribution, and assume these bivariate observations are mutually independent. Kendall (1938, 1975) proposed a test for the hypothesis that the X and Y random variables are independent based on estimating the following quantity:

tau = 2 * Pr[(X_2 - X_1)(Y_2 - Y_1) > 0] - 1        (1)
The quantity tau in Equation (1) is called Kendall's tau, although this term is more often applied to the estimate of tau (see Equation (2) below). If X and Y are independent, then tau = 0. Furthermore, for most distributions of interest, if tau = 0 then the random variables X and Y are independent. (It can be shown that there exist some distributions for which tau = 0 and the random variables X and Y are not independent; see Hollander and Wolfe (1999, p.364).)
Note that Kendall's tau is similar to a correlation coefficient in that -1 <= tau <= 1. If X and Y always vary in the same direction, that is, if X_1 < X_2 always implies Y_1 < Y_2, then tau = 1. If X and Y always vary in the opposite direction, that is, if X_1 < X_2 always implies Y_1 > Y_2, then tau = -1. If tau > 0, this indicates X and Y are positively associated. If tau < 0, this indicates X and Y are negatively associated.
Estimating Kendall's Tau

The quantity tau in Equation (1) can be estimated by:

tau_hat = 2S / [n(n-1)]        (2)

where

S = sum_{i=1}^{n-1} sum_{j=i+1}^{n} sign[(X_j - X_i)(Y_j - Y_i)]        (3)

and sign() denotes the sign function:

sign(x) = -1 if x < 0,  0 if x = 0,  1 if x > 0        (4)
(Hollander and Wolfe, 1999, Chapter 8; Conover, 1980, pp.256–260; Gilbert, 1987, Chapter 16; Helsel and Hirsch, 1992, pp.212–216; Gibbons et al., 2009, Chapter 11). The quantity defined in Equation (2) is called Kendall's rank correlation coefficient or more often Kendall's tau.
Note that the quantity S defined in Equation (3) is equal to the number of concordant pairs minus the number of discordant pairs. Hollander and Wolfe (1999, p.364) and Conover (1980, p.257) each use a different symbol for this statistic.
Testing the Null Hypothesis of Independence

The null hypothesis H0: tau = 0 can be tested using the statistic S defined in Equation (3) above. Tables of the distribution of S for small samples are given in Hollander and Wolfe (1999), Conover (1980, pp.458–459), Gilbert (1987, p.272), Helsel and Hirsch (1992, p.469), and Gibbons et al. (2009, p.210). The function kendallTrendTest uses the large sample approximation to the distribution of S under the null hypothesis, which is given by:

z = [S - E(S)] / sqrt(Var(S))        (5)

where

E(S) = 0        (6)

Var(S) = n(n-1)(2n+5) / 18        (7)

Under the null hypothesis, the quantity z defined in Equation (5) is approximately distributed as a standard normal random variable. Both Kendall (1975) and Mann (1945) show that the normal approximation is excellent even for small samples, provided that the following continuity correction is used:

z = [S - sign(S)] / sqrt(Var(S))        (8)
The function kendallTrendTest
performs the usual one-sample z-test using
the statistic computed in Equation (8) or Equation (5). The argument
correct
determines which equation is used to compute the z-statistic.
By default, correct=TRUE
so Equation (8) is used.
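Below is a minimal sketch (not part of the package; simulated data) that computes S, the no-ties variance from Equation (7), and the continuity-corrected z from Equation (8) by hand, and that should agree with the output of kendallTrendTest:

set.seed(47)
y <- cumsum(rnorm(15, mean = 0.2))   # hypothetical series with an upward drift
x <- seq_along(y)
ij <- combn(length(y), 2)
S <- sum(sign((y[ij[2, ]] - y[ij[1, ]]) * (x[ij[2, ]] - x[ij[1, ]])))  # Equation (3)
n <- length(y)
var.S <- n * (n - 1) * (2 * n + 5) / 18        # Equation (7), no ties
z <- (S - sign(S)) / sqrt(var.S)               # Equation (8), continuity correction
2 * pnorm(-abs(z))                             # two-sided p-value
# Compare with: kendallTrendTest(y)$p.value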
In the case of tied observations in either the observed X's and/or the observed Y's, the formula for the variance of S given in Equation (7) must be modified as follows:

Var(S) = n(n-1)(2n+5)/18
         - sum_{i=1}^{g} t_i (t_i - 1)(2 t_i + 5) / 18
         - sum_{j=1}^{h} u_j (u_j - 1)(2 u_j + 5) / 18
         + [sum_{i=1}^{g} t_i (t_i - 1)(t_i - 2)] [sum_{j=1}^{h} u_j (u_j - 1)(u_j - 2)] / [9 n(n-1)(n-2)]
         + [sum_{i=1}^{g} t_i (t_i - 1)] [sum_{j=1}^{h} u_j (u_j - 1)] / [2 n(n-1)]        (9)

where g is the number of tied groups in the X observations, t_i is the size of the i'th tied group in the X observations, h is the number of tied groups in the Y observations, and u_j is the size of the j'th tied group in the Y observations. In the case of no ties in either the X or Y observations, Equation (9) reduces to Equation (7).
The Special Case of Testing for Monotonic Trend

Often in environmental sampling, observations are taken periodically over time (Hirsch et al., 1982; van Belle and Hughes, 1984; Hirsch and Slack, 1984). In this case, the random variables Y_1, Y_2, ..., Y_n can be thought of as representing the observations, and the variables X_1, X_2, ..., X_n are no longer random but represent the time at which the i'th observation was taken. If the observations are equally spaced over time, then it is useful to make the simplification X_i = i for i = 1, 2, ..., n. This is in fact the default value of the argument x for the function kendallTrendTest.
In the case where the X's represent time and are all distinct, the test for independence between X and Y is equivalent to testing for a monotonic trend in Y, and the test statistic S simplifies to:

S = sum_{i=1}^{n-1} sum_{j=i+1}^{n} sign(Y_j - Y_i)        (10)

Also, the formula for the variance of S in the presence of ties (under the null hypothesis H0: tau = 0) simplifies to:

Var(S) = n(n-1)(2n+5)/18 - sum_{j=1}^{h} u_j (u_j - 1)(2 u_j + 5) / 18        (11)

A form of the test statistic S in Equation (10) was introduced by Mann (1945).
The Special Case of a Simple Linear Model: Estimating the Slope

Consider the simple linear regression model

Y_i = beta_0 + beta_1 X_i + epsilon_i,   i = 1, 2, ..., n        (12)

where beta_0 denotes the intercept, beta_1 denotes the slope, and the epsilon_i's are assumed to be independent and identically distributed random variables from the same distribution. This is a special case of dependence between the X's and the Y's, and the null hypothesis of a zero slope can be tested using Kendall's test statistic S (Equation (3) or (10) above) and the associated z-statistic (Equation (5) or (8) above) (Hollander and Wolfe, 1999, pp.415–420).
Theil (1950) proposed the following nonparametric estimator of the slope:

beta_1_hat = Median[ (Y_j - Y_i) / (X_j - X_i) ;  i < j ]        (13)

Note that the computation of the estimated slope involves computing

N = n(n-1)/2        (14)

“two-point” estimated slopes (assuming no tied X values), and taking the median of these N values.

Sen (1968) generalized this estimator to the case where there are possibly tied observations in the X's. In this case, Sen simply ignores the two-point estimated slopes where the X's are tied and computes the median based on the remaining two-point estimated slopes. That is, Sen's estimator is given by:

beta_1_hat = Median[ (Y_j - Y_i) / (X_j - X_i) ;  i < j, X_i != X_j ]        (15)

(Hollander and Wolfe, 1999, pp.421–422).
Conover (1980, p.267) suggests the following estimator for the intercept:

beta_0_hat = Y_median - beta_1_hat * X_median        (16)

where X_median and Y_median denote the sample medians of the X's and Y's, respectively. With these estimators of slope and intercept, the estimated regression line passes through the point (X_median, Y_median).
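The following minimal sketch (simulated data, not from the package examples) computes the Theil/Sen slope of Equation (15) and Conover's intercept of Equation (16) directly, for comparison with the estimate component returned by kendallTrendTest:

set.seed(23)
x <- 1:12
y <- 0.5 * x + rnorm(12)
ij <- combn(length(y), 2)
two.point.slopes <- (y[ij[2, ]] - y[ij[1, ]]) / (x[ij[2, ]] - x[ij[1, ]])
slope.hat <- median(two.point.slopes)                 # Theil/Sen estimator
intercept.hat <- median(y) - slope.hat * median(x)    # Conover's estimator
c(slope = slope.hat, intercept = intercept.hat)
# Compare with: kendallTrendTest(y ~ x)$estimate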
NOTE: The function kendallTrendTest always returns estimates of slope and intercept assuming a linear model (Equation (12)), while the p-value is based on Kendall's tau, which is testing for the broader alternative of any kind of dependence between the X's and Y's.
Confidence Interval for the Slope

Theil (1950) and Sen (1968) proposed methods to compute a confidence interval for the true slope, assuming the linear model of Equation (12) (see Hollander and Wolfe, 1999, pp.421-422). Gilbert (1987, p.218) illustrates a simpler method than the one given by Sen (1968) that is based on a normal approximation. Gilbert's (1987) method is an extension of the one given in Hollander and Wolfe (1999, p.424) that allows for ties and/or multiple observations per time period. This method is valid for fairly small sample sizes unless there are several tied observations.

Let N' denote the number of defined two-point estimated slopes that are used in Equation (15) above (if there are no tied X values then N' = N), and let beta_1_(1), beta_1_(2), ..., beta_1_(N') denote the N' ordered slopes. For Gilbert's (1987) method, a 100(1 - alpha)% two-sided confidence interval for the true slope runs from the M1'th largest slope to the M2'th largest slope, where

M1 = (N' - C_alpha) / 2,   M2 = (N' + C_alpha) / 2 + 1,   C_alpha = z_{1 - alpha/2} * sqrt(Var(S))

Var(S) is defined in Equations (7), (9), or (11), and z_p denotes the p'th quantile of the standard normal distribution. One-sided confidence intervals may be computed in a similar fashion.

Usually the quantities M1 and M2 will not be integers. Gilbert (1987, p.219) suggests interpolating between adjacent values in this case, which is what the function kendallTrendTest does.
A list of class "htest"
containing the results of the hypothesis
test. See the help file for htest.object
for details.
In addition, the following components are part of the list returned by
kendallTrendTest
:
S |
The value of the Kendall S-statistic. |
var.S |
The variance of the Kendall S-statistic. |
slopes |
A numeric vector of all possible two-point slope estimates.
This component is used by the function |
Kendall's test for independence or trend is a nonparametric test. No assumptions are made about the distribution of the X and Y variables. Hirsch et al. (1982) introduced the "seasonal Kendall test" to test for trend within each season. They note that Kendall's test for trend is easy to compute, even in the presence of missing values, and can also be used with censored values.
van Belle and Hughes (1984) note that Kendall's test for trend is slightly less powerful than the test based on Spearman's rho, but it converges to normality faster. Also, Bradley (1968, p.288) shows that for the case of a linear model with normal (Gaussian) errors, the asymptotic relative efficiency of Kendall's test for trend versus the parametric test for a zero slope is 0.98.
The results of the function kendallTrendTest
are similar to the
results of the built-in R function cor.test
with the
argument method="kendall"
except that cor.test
1) computes exact p-values when the number of pairs is less than 50 and
there are no ties, and 2) does not return a confidence interval for
the slope.
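A quick sketch (simulated data, not from the package) of this comparison; because there are fewer than 50 pairs and no ties, cor.test reports an exact p-value, so the two p-values differ slightly:

set.seed(5)
y <- 0.1 * (1:20) + rnorm(20)
kendallTrendTest(y)$p.value
cor.test(1:20, y, method = "kendall")$p.value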
Steven P. Millard ([email protected])
Bradley, J.V. (1968). Distribution-Free Statistical Tests. Prentice-Hall, Englewood Cliffs, NJ.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, pp.256-272.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 16.
Helsel, D.R. and R.M. Hirsch. (1988). Discussion of Applicability of the t-test for Detecting Trends in Water Quality Variables. Water Resources Bulletin 24(1), 201-204.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, NY.
Helsel, D.R., and R. M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. Available on-line at https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf.
Hirsch, R.M., J.R. Slack, and R.A. Smith. (1982). Techniques of Trend Analysis for Monthly Water Quality Data. Water Resources Research 18(1), 107-121.
Hirsch, R.M. and J.R. Slack. (1984). A Nonparametric Trend Test for Seasonal Data with Serial Dependence. Water Resources Research 20(6), 727-732.
Hirsch, R.M., R.B. Alexander, and R.A. Smith. (1991). Selection of Methods for the Detection and Estimation of Trends in Water Quality. Water Resources Research 27(5), 803-813.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods, Second Edition. John Wiley and Sons, New York.
Kendall, M.G. (1938). A New Measure of Rank Correlation. Biometrika 30, 81-93.
Kendall, M.G. (1975). Rank Correlation Methods. Charles Griffin, London.
Mann, H.B. (1945). Nonparametric Tests Against Trend. Econometrica 13, 245-259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Sen, P.K. (1968). Estimates of the Regression Coefficient Based on Kendall's Tau. Journal of the American Statistical Association 63, 1379-1389.
Theil, H. (1950). A Rank-Invariant Method of Linear and Polynomial Regression Analysis, I-III. Proc. Kon. Ned. Akad. v. Wetensch. A. 53, 386-392, 521-525, 1397-1412.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
van Belle, G., and J.P. Hughes. (1984). Nonparametric Tests for Trend in Water Quality. Water Resources Research 20(1), 127-136.
cor.test
, kendallSeasonalTrendTest
, htest.object
.
# Reproduce Example 17-6 on page 17-33 of USEPA (2009). This example # tests for trend in sulfate concentrations (ppm) collected at various # months between 1989 and 1996. head(EPA.09.Ex.17.6.sulfate.df) # Sample.No Year Month Sampling.Date Date Sulfate.ppm #1 1 89 6 89.6 1989-06-01 480 #2 2 89 8 89.8 1989-08-01 450 #3 3 90 1 90.1 1990-01-01 490 #4 4 90 3 90.3 1990-03-01 520 #5 5 90 6 90.6 1990-06-01 485 #6 6 90 8 90.8 1990-08-01 510 # Plot the data #-------------- dev.new() with(EPA.09.Ex.17.6.sulfate.df, plot(Sampling.Date, Sulfate.ppm, pch = 15, ylim = c(400, 900), xlab = "Sampling Date", ylab = "Sulfate Conc (ppm)", main = "Figure 17-6. Time Series Plot of \nSulfate Concentrations (ppm)") ) Sulfate.fit <- lm(Sulfate.ppm ~ Sampling.Date, data = EPA.09.Ex.17.6.sulfate.df) abline(Sulfate.fit, lty = 2) # Perform the Kendall test for trend #----------------------------------- kendallTrendTest(Sulfate.ppm ~ Sampling.Date, data = EPA.09.Ex.17.6.sulfate.df) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: tau = 0 # #Alternative Hypothesis: True tau is not equal to 0 # #Test Name: Kendall's Test for Trend # (with continuity correction) # #Estimated Parameter(s): tau = 0.7667984 # slope = 26.6666667 # intercept = -1909.3333333 # #Estimation Method: slope: Theil/Sen Estimator # intercept: Conover's Estimator # #Data: y = Sulfate.ppm # x = Sampling.Date # #Data Source: EPA.09.Ex.17.6.sulfate.df # #Sample Size: 23 # #Test Statistic: z = 5.107322 # #P-value: 3.267574e-07 # #Confidence Interval for: slope # #Confidence Interval Method: Gilbert's Modification # of Theil/Sen Method # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 20.00000 # UCL = 35.71182 # Clean up #--------- rm(Sulfate.fit) graphics.off()
Compute the sample coefficient of kurtosis or excess kurtosis.
kurtosis(x, na.rm = FALSE, method = "fisher", l.moment.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), excess = TRUE)
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from |
method |
character string specifying what method to use to compute the sample coefficient
of kurtosis. The possible values are
|
l.moment.method |
character string specifying what method to use to compute the
|
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for
the plotting positions when |
excess |
logical scalar indicating whether to compute the kurtosis ( |
Let x_1, x_2, ..., x_n denote a random sample of n observations from some distribution with mean mu and standard deviation sigma.
Product Moment Coefficient of Kurtosis (method="moment" or method="fisher")

The coefficient of kurtosis of a distribution is the fourth standardized moment about the mean:

beta_2 = mu_4 / sigma^4

where

mu_r = E[ (X - mu)^r ]

denotes the r'th moment about the mean (central moment). The coefficient of excess kurtosis is defined as:

beta_2 - 3
For a normal distribution, the coefficient of kurtosis is 3 and the coefficient of excess kurtosis is 0. Distributions with kurtosis less than 3 (excess kurtosis less than 0) are called platykurtic: they have shorter tails than a normal distribution. Distributions with kurtosis greater than 3 (excess kurtosis greater than 0) are called leptokurtic: they have heavier tails than a normal distribution.
When method="moment"
, the coefficient of kurtosis is estimated using the
method of moments estimator for the fourth central moment and and the method of
moments estimator for the variance:
where
This form of estimation should be used when resampling (bootstrap or jackknife).
When method="fisher"
, the coefficient of kurtosis is estimated using the
unbiased estimator for the fourth central moment (Serfling, 1980, p.73) and the
unbiased estimator for the variance.
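As a quick hedged illustration (simulated data), the method-of-moments estimate above can be computed directly from the sample central moments and compared with kurtosis(..., method = "moment"):

set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 1)
m2 <- mean((x - mean(x))^2)   # second sample central moment
m4 <- mean((x - mean(x))^4)   # fourth sample central moment
m4 / m2^2                     # coefficient of kurtosis
m4 / m2^2 - 3                 # coefficient of excess kurtosis
# Compare with: kurtosis(x, method = "moment", excess = FALSE) and kurtosis(x, method = "moment")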
L-Moment Coefficient of Kurtosis (method="l.moments")

Hosking (1990) defines the L-moment analog of the coefficient of kurtosis as:

tau_4 = lambda_4 / lambda_2

that is, the fourth L-moment divided by the second L-moment. He shows that this quantity lies in the interval (-1, 1).

When l.moment.method="unbiased", the L-kurtosis is estimated by:

t_4 = l_4 / l_2

that is, the unbiased estimator of the fourth L-moment divided by the unbiased estimator of the second L-moment. When l.moment.method="plotting.position", the L-kurtosis is estimated by:

t_4~ = l_4~ / l_2~

that is, the plotting-position estimator of the fourth L-moment divided by the plotting-position estimator of the second L-moment.

See the help file for lMoment for more information on estimating L-moments.
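A brief hedged sketch (simulated data) showing that the sample L-kurtosis is simply the ratio of the fourth to the second sample L-moment:

set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 1)
lMoment(x, 4) / lMoment(x, 2)
# Should equal: kurtosis(x, method = "l.moment", excess = FALSE)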
A numeric scalar – the sample coefficient of kurtosis or excess kurtosis.
Traditionally, the coefficient of kurtosis has been estimated using product
moment estimators. Sometimes an estimate of kurtosis is used in a
goodness-of-fit test for normality (D'Agostino and Stephens, 1986).
Hosking (1990) introduced the idea of L-moments and L-kurtosis. Vogel and Fennessey (1993) argue that L-moment ratios should replace product moment ratios because of their superior performance (they are nearly unbiased and better for discriminating between distributions). They compare product moment diagrams with L-moment diagrams. Hosking and Wallis (1995) recommend using unbiased estimators of L-moments (vs. plotting-position estimators) for almost all applications.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Vogel, R.M., and N.M. Fennessey. (1993). L Moment Diagrams Should Replace Product Moment Diagrams. Water Resources Research 29(6), 1745–1752.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
var
, sd
, cv
,
skewness
, summaryFull
,
Summary Statistics.
# Generate 20 observations from a lognormal distribution with parameters # mean=10 and cv=1, and estimate the coefficient of kurtosis and # coefficient of excess kurtosis. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 1) # Compute standard kurtosis first #-------------------------------- kurtosis(dat, excess = FALSE) #[1] 2.964612 kurtosis(dat, method = "moment", excess = FALSE) #[1] 2.687146 kurtosis(dat, method = "l.moment", excess = FALSE) #[1] 0.1444779 # Now compute excess kurtosis #---------------------------- kurtosis(dat) #[1] -0.0353876 kurtosis(dat, method = "moment") #[1] -0.3128536 kurtosis(dat, method = "l.moment") #[1] -2.855522 #---------- # Clean up rm(dat)
Lin and Evans (1980) reported fecal coliform measures (organisms per 100 ml) from the
Illinois River taken between 1971 and 1976. The object Lin.Evans.80.df
is a
small subset of these data that were reported by Helsel and Hirsch (1992, p.162).
Lin.Evans.80.df
A data frame with 24 observations on the following 2 variables.
Fecal.Coliform
a numeric vector of fecal coliform measure (organisms per 100 ml).
Season
an ordered factor indicating the season of collection
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, p.162.
Lin, S.D., and R.L. Evans. (1980). Coliforms and fecal streptococcus in the Illinois River at Peoria, 1971-1976. Illinois State Water Survey Report of Investigations No. 93. Urbana, IL, 28pp.
Compute the sample size necessary to achieve a specified power for a t-test for linear trend, given the scaled slope and significance level.
linearTrendTestN(slope.over.sigma, alpha = 0.05, power = 0.95, alternative = "two.sided", approx = FALSE, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
slope.over.sigma |
numeric vector specifying the ratio of the true slope to the standard deviation of
the error terms ( |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
round.up |
logical scalar indicating whether to round the computed sample size(s) up to the next whole number. The default value is
|
n.max |
positive integer greater than 2 indicating the maximum sample size.
The default value is |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
If the arguments slope.over.sigma
, alpha
, and power
are not
all the same length, they are replicated to be the same length as the length of
the longest argument.
Formulas for the power of the t-test of linear trend for specified values of
the sample size, scaled slope, and Type I error level are given in
the help file for linearTrendTestPower
. The function
linearTrendTestN
uses the uniroot
search algorithm to
determine the required sample size(s) for specified values of the power,
scaled slope, and Type I error level.
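As a quick hedged check (values not taken from the package documentation), the sample size returned by linearTrendTestN can be plugged back into linearTrendTestPower, which should then report at least the requested power:

n <- linearTrendTestN(slope.over.sigma = 0.1, power = 0.9)
n
linearTrendTestPower(n = n, slope.over.sigma = 0.1)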
a numeric vector of sample sizes.
See the help file for linearTrendTestPower
.
Steven P. Millard ([email protected])
See the help file for linearTrendTestPower
.
linearTrendTestPower
, linearTrendTestScaledMds
,
plotLinearTrendTestDesign
, lm
, summary.lm
, kendallTrendTest
,
Power and Sample Size, Normal, t.test
.
# Look at how the required sample size for the t-test for zero slope # increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 linearTrendTestN(slope.over.sigma = 0.1, power = seq(0.5, 0.9, by = 0.1)) #[1] 18 19 21 22 25 #---------- # Repeat the last example, but compute the sample size based on the approximate # power instead of the exact: linearTrendTestN(slope.over.sigma = 0.1, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) #[1] 18 19 21 22 25 #========== # Look at how the required sample size for the t-test for zero slope decreases # with increasing scaled slope: seq(0.05, 0.2, by = 0.05) #[1] 0.05 0.10 0.15 0.20 linearTrendTestN(slope.over.sigma = seq(0.05, 0.2, by = 0.05)) #[1] 41 26 20 17 #========== # Look at how the required sample size for the t-test for zero slope decreases # with increasing values of Type I error: linearTrendTestN(slope.over.sigma = 0.1, alpha = c(0.001, 0.01, 0.05, 0.1)) #[1] 33 29 26 25
Compute the power of a parametric test for linear trend, given the sample size or predictor variable values, scaled slope, and significance level.
linearTrendTestPower(n, x = lapply(n, seq), slope.over.sigma = 0, alpha = 0.05, alternative = "two.sided", approx = FALSE)
n |
numeric vector of sample sizes. All values of |
x |
numeric vector of predictor variable values, or a list in which each component is
a numeric vector of predictor variable values. Usually, the predictor variable is
time (e.g., days, months, quarters, etc.). The default value is
|
slope.over.sigma |
numeric vector specifying the ratio of the true slope to the standard deviation of
the error terms ( |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
If the argument x
is a vector, it is converted into a list with one
component. If the arguments n
, x
, slope.over.sigma
, and
alpha
are not all the same length, they are replicated to be the same
length as the length of the longest argument.
Basic Model

Consider the simple linear regression model

Y = beta_0 + beta_1 X + epsilon        (1)

where X denotes the predictor variable (observed without error), beta_0 denotes the intercept, beta_1 denotes the slope, and the error term epsilon is assumed to be a random variable from a normal distribution with mean 0 and standard deviation sigma. Let (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) denote n independent observed pairs from the model (1).
Often in environmental data analysis, we are interested in determining whether there
is a trend in some indicator variable over time. In this case, the predictor
variable is time (e.g., day, month, quarter, year, etc.), and the
values of the response variable
represent measurements taken over time.
The slope then represents the change in the average of the response variable per
one unit of time.
When the argument x
is a numeric vector, it represents the
values of the predictor variable. When the argument
x
is a
list, each component of x
is a numeric vector that represents a set values
of the predictor variable (and the number of elements may vary by component).
By default, the argument x
is a list for which the i'th component is simply
the integers from 1 to the value of the i'th element of the argument n
,
representing, for example, Day 1, Day2, ..., Day n[i]
.
In the discussion that follows, be sure not to confuse the intercept and slope coefficients beta_0 and beta_1 with the Type II error of the hypothesis test, which is denoted by beta.
Estimation of Coefficients and Confidence Interval for Slope

The standard least-squares estimators of the slope and intercept are given by:

beta_1_hat = S_xy / S_xx        (3)

beta_0_hat = y_bar - beta_1_hat * x_bar

where

S_xy = sum_{i=1}^{n} (x_i - x_bar)(y_i - y_bar),   S_xx = sum_{i=1}^{n} (x_i - x_bar)^2

(Draper and Smith, 1998, p.25; Zar, 2010, pp.332-334; Berthouex and Brown, 2002, p.297; Helsel and Hirsch, 1992, p.226). The estimator of slope in Equation (3) has a normal distribution with mean equal to the true slope, and variance given by:

Var(beta_1_hat) = sigma^2 / S_xx

(Draper and Smith, 1998, p.35; Zar, 2010, p.341; Berthouex and Brown, 2002, p.299; Helsel and Hirsch, 1992, p.227). Thus, a two-sided 100(1 - alpha)% confidence interval for the slope is given by:

[ beta_1_hat - t_{n-2}(1 - alpha/2) * sigma_hat_{beta_1},  beta_1_hat + t_{n-2}(1 - alpha/2) * sigma_hat_{beta_1} ]

where

sigma_hat_{beta_1} = sigma_hat / sqrt(S_xx)

and t_{nu}(p) denotes the p'th quantile of Student's t-distribution with nu degrees of freedom (Draper and Smith, 1998, p.36; Zar, 2010, p.343; Berthouex and Brown, 2002, p.300; Helsel and Hirsch, 1992, p.240).
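A small hedged sketch (simulated data, assumed variable names) of these estimators and the slope confidence interval, obtained in R with lm and confint:

set.seed(12)
x <- 1:20
y <- 2 + 0.3 * x + rnorm(20)
fit <- lm(y ~ x)
coef(summary(fit))["x", ]        # slope estimate, standard error, t statistic, p-value
confint(fit, "x", level = 0.95)  # two-sided 95% confidence interval for the slope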
Testing for a Non-Zero Slope

Consider the null hypothesis of a zero slope coefficient:

H0: beta_1 = 0        (14)

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

Ha: beta_1 > 0        (15)

the lower one-sided alternative (alternative="less"):

Ha: beta_1 < 0        (16)

and the two-sided alternative (alternative="two.sided"):

Ha: beta_1 != 0        (17)

The test of the null hypothesis (14) versus any of the three alternatives (15)-(17) is based on the Student t-statistic:

t = beta_1_hat / sigma_hat_{beta_1} = beta_1_hat / (sigma_hat / sqrt(S_xx))        (18)

Under the null hypothesis (14), the t-statistic in (18) follows a Student's t-distribution with n - 2 degrees of freedom (Draper and Smith, 1998, p.36; Zar, 2010, p.341; Helsel and Hirsch, 1992, pp.238-239).
The formula for the power of the test of a zero slope depends on which alternative is being tested. The two subsections below describe exact and approximate formulas for the power of the test. Note that none of the equations for the power of the t-test requires knowledge of the values beta_1 or sigma (the population standard deviation of the error terms), only the ratio beta_1 / sigma. The argument slope.over.sigma is this ratio, and it is referred to as the “scaled slope”.
Exact Power Calculations (approx=FALSE)

This subsection describes the exact formulas for the power of the t-test for a zero slope.

Upper one-sided alternative (alternative="greater")

The standard Student's t-test rejects the null hypothesis (14) in favor of the upper alternative hypothesis (15) at level alpha if

t >= t_{nu}(1 - alpha)

where nu = n - 2 and, as noted previously, t_{nu}(p) denotes the p'th quantile of Student's t-distribution with nu degrees of freedom. The power of this test, denoted by 1 - beta, where beta denotes the probability of a Type II error, is given by:

1 - beta = 1 - Pr[ t_{nu, Delta} <= t_{nu}(1 - alpha) ]        (21)

where

Delta = sqrt(S_xx) * (beta_1 / sigma)        (22)

and t_{nu, Delta} denotes a non-central Student's t-random variable with nu degrees of freedom and non-centrality parameter Delta, and Pr[ t_{nu, Delta} <= x ] denotes the cumulative distribution function of this random variable evaluated at x (Johnson et al., 1995, pp.508-510).

Note that when the predictor variable X represents equally-spaced measures of time (e.g., days, months, quarters, etc.) and x_i = i for i = 1, 2, ..., n, then the non-centrality parameter in Equation (22) becomes:

Delta = (beta_1 / sigma) * sqrt( n (n - 1) (n + 1) / 12 )
Lower one-sided alternative (alternative="less"
)
The standard Student's t-test rejects the null hypothesis (1) in favor of the
lower alternative hypothesis (3) at level- if
and the power of this test is given by:
Two-sided alternative (alternative="two.sided"
)
The standard Student's t-test rejects the null hypothesis (14) in favor of the
two-sided alternative hypothesis (17) at level- if
and the power of this test is given by:
The power of the t-test given in Equation (28) can also be expressed in terms of the cumulative distribution function of the non-central F-distribution as follows. Let F_{1, nu, Delta^2} denote a non-central F random variable with 1 and nu degrees of freedom and non-centrality parameter Delta^2, and let H(x; 1, nu, Delta^2) denote the cumulative distribution function of this random variable evaluated at x. Also, let F_{1, nu}(p) denote the p'th quantile of the central F-distribution with 1 and nu degrees of freedom. It can be shown that

( t_{nu, Delta} )^2 =(d)= F_{1, nu, Delta^2}

where =(d)= denotes “equal in distribution”. Thus, it follows that

[ t_{nu}(1 - alpha/2) ]^2 = F_{1, nu}(1 - alpha)

so the formula for the power of the t-test given in Equation (28) can also be written as:

1 - beta = 1 - H[ F_{1, nu}(1 - alpha); 1, nu, Delta^2 ]
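The following hedged sketch (assumed notation, simulated design) evaluates the exact two-sided power of Equation (28) directly with the non-central t cumulative distribution function in R; it should agree with linearTrendTestPower(x = 1:10, slope.over.sigma = 0.1):

x <- 1:10
slope.over.sigma <- 0.1
Delta <- slope.over.sigma * sqrt(sum((x - mean(x))^2))  # non-centrality parameter, Equation (22)
nu <- length(x) - 2
t.crit <- qt(1 - 0.05/2, nu)
1 - pt(t.crit, nu, ncp = Delta) + pt(-t.crit, nu, ncp = Delta)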
Approximate Power Calculations (approx=TRUE)

Zar (2010, pp.115–118) presents an approximation to the power for the t-test given in Equations (21), (26), and (28) above. His approximation to the power can be derived by using the approximation

Delta ≈ t_{nu}(1 - alpha) + t_{nu}(1 - beta)

where ≈ denotes “approximately equal to”. Zar's approximation can be summarized in terms of the cumulative distribution function of the non-central t-distribution as follows:

Pr[ t_{nu, Delta} <= x ] ≈ F(x - Delta; nu)

where F(y; nu) denotes the cumulative distribution function of the central Student's t-distribution with nu degrees of freedom evaluated at y.

The following three subsections explicitly derive the approximation to the power of the t-test for each of the three alternative hypotheses.

Upper one-sided alternative (alternative="greater")

The power for the upper one-sided alternative (15) given in Equation (21) can be approximated as:

1 - beta ≈ 1 - F[ t_{nu}(1 - alpha) - Delta; nu ] = Pr[ t_{nu} >= t_{nu}(1 - alpha) - Delta ]

where t_{nu} denotes a central Student's t-random variable with nu degrees of freedom.

Lower one-sided alternative (alternative="less")

The power for the lower one-sided alternative (16) given in Equation (26) can be approximated as:

1 - beta ≈ F[ t_{nu}(alpha) - Delta; nu ] = Pr[ t_{nu} <= -t_{nu}(1 - alpha) - Delta ]

Two-sided alternative (alternative="two.sided")

The power for the two-sided alternative (17) given in Equation (28) can be approximated as:

1 - beta ≈ 1 - F[ t_{nu}(1 - alpha/2) - Delta; nu ] + F[ -t_{nu}(1 - alpha/2) - Delta; nu ]
a numeric vector of powers.
Often in environmental data analysis, we are interested in determining whether
there is a trend in some indicator variable over time. In this case, the predictor
variable is time (e.g., day, month, quarter, year, etc.), and the
values of the response variable represent measurements taken over time. The slope
then represents the change in the average of the response variable per one unit of
time.
You can use the parametric model (1) to model your data, then use the R function
lm
to fit the regression coefficients and the summary.lm
function to perform a test for the significance of the slope coefficient. The
function linearTrendTestPower
computes the power of this t-test, given a
fixed value of the sample size, scaled slope, and significance level.
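For instance, a brief hedged sketch (simulated data, assumed variable names) of fitting model (1) with lm and reading off the t-test for a zero slope from the coefficients table of summary.lm:

set.seed(3)
dat <- data.frame(time = 1:24)
dat$y <- 5 + 0.2 * dat$time + rnorm(24)
fit <- lm(y ~ time, data = dat)
summary(fit)$coefficients["time", ]   # slope, std. error, t value, Pr(>|t|)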
You can also use Kendall's nonparametric test for trend
if you don't want to assume the error terms are normally distributed. When the
errors are truly normally distributed, the asymptotic relative efficiency of
Kendall's test for trend versus the parametric t-test for a zero slope is 0.98,
and Kendall's test can be more powerful than the parametric t-test when the errors
are not normally distributed. Thus the function linearTrendTestPower
can
also be used to estimate the power of Kendall's test for trend.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled slope if one of the objectives of the sampling program is to determine
whether a trend is occurring. The functions linearTrendTestPower
,
linearTrendTestN
, linearTrendTestScaledMds
, and plotLinearTrendTestDesign
can be used to investigate these
relationships.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 1.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 9.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 28, 31
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
linearTrendTestN
, linearTrendTestScaledMds
,
plotLinearTrendTestDesign
, lm
, summary.lm
, kendallTrendTest
,
Power and Sample Size, Normal, t.test
.
# Look at how the power of the t-test for zero slope increases with increasing # sample size: seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 power <- linearTrendTestPower(n = seq(5, 30, by = 5), slope.over.sigma = 0.1) round(power, 2) #[1] 0.06 0.13 0.34 0.68 0.93 1.00 #---------- # Repeat the last example, but compute the approximate power instead of the # exact: power <- linearTrendTestPower(n = seq(5, 30, by = 5), slope.over.sigma = 0.1, approx = TRUE) round(power, 2) #[1] 0.05 0.11 0.32 0.68 0.93 0.99 #---------- # Look at how the power of the t-test for zero slope increases with increasing # scaled slope: seq(0.05, 0.2, by = 0.05) #[1] 0.05 0.10 0.15 0.20 power <- linearTrendTestPower(15, slope.over.sigma = seq(0.05, 0.2, by = 0.05)) round(power, 2) #[1] 0.12 0.34 0.64 0.87 #---------- # Look at how the power of the t-test for zero slope increases with increasing # values of Type I error: power <- linearTrendTestPower(20, slope.over.sigma = 0.1, alpha = c(0.001, 0.01, 0.05, 0.1)) round(power, 2) #[1] 0.14 0.41 0.68 0.80 #---------- # Show that for a simple regression model, you get a greater power of detecting # a non-zero slope if you take all the observations at two endpoints, rather than # spreading the observations evenly between two endpoints. # (Note: This design usually cannot work with environmental monitoring data taken # over time since usually observations taken close together in time are not # independent.) linearTrendTestPower(x = 1:10, slope.over.sigma = 0.1) #[1] 0.1265976 linearTrendTestPower(x = c(rep(1, 5), rep(10, 5)), slope.over.sigma = 0.1) #[1] 0.2413823 #========== # Clean up #--------- rm(power)
Compute the scaled minimal detectable slope associated with a t-test for linear trend, given the sample size or predictor variable values, power, and significance level.
linearTrendTestScaledMds(n, x = lapply(n, seq), alpha = 0.05, power = 0.95, alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, tol = 1e-07, maxiter = 1000)
n |
numeric vector of sample sizes. All values of |
x |
numeric vector of predictor variable values, or a list in which each component is
a numeric vector of predictor variable values. Usually, the predictor variable is
time (e.g., days, months, quarters, etc.). The default value is
|
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
two.sided.direction |
character string indicating the direction (positive or negative) for the
scaled minimal detectable slope when |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
If the argument x
is a vector, it is converted into a list with one
component. If the arguments n
, x
, alpha
, and
power
are not all the same length, they are replicated to be the same
length as the length of the longest argument.
Formulas for the power of the t-test of linear trend for specified values of
the sample size, scaled slope, and Type I error level are given in
the help file for linearTrendTestPower
. The function
linearTrendTestScaledMds
uses the uniroot
search algorithm to
determine the minimal detectable scaled slope for specified values of the power,
sample size, and Type I error level.
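As a quick hedged check (values not taken from the package documentation), plugging the returned scaled minimal detectable slope back into linearTrendTestPower should reproduce (approximately) the requested power:

mds <- linearTrendTestScaledMds(n = 12, power = 0.9)
mds
linearTrendTestPower(n = 12, slope.over.sigma = mds)   # should be about 0.9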
numeric vector of computed scaled minimal detectable slopes. When
alternative="less"
, or alternative="two.sided"
and
two.sided.direction="less"
, the computed slopes are negative. Otherwise,
the slopes are positive.
See the help file for linearTrendTestPower
.
Steven P. Millard ([email protected])
See the help file for linearTrendTestPower
.
linearTrendTestPower
, linearTrendTestN
,
plotLinearTrendTestDesign
, lm
,
summary.lm
, kendallTrendTest
,
Power and Sample Size, Normal, t.test
.
# Look at how the scaled minimal detectable slope for the t-test for linear # trend increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 scaled.mds <- linearTrendTestScaledMds(n = 10, power = seq(0.5, 0.9, by = 0.1)) round(scaled.mds, 2) #[1] 0.25 0.28 0.31 0.35 0.41 #---------- # Repeat the last example, but compute the scaled minimal detectable slopes # based on the approximate power instead of the exact: scaled.mds <- linearTrendTestScaledMds(n = 10, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) round(scaled.mds, 2) #[1] 0.25 0.28 0.31 0.35 0.41 #========== # Look at how the scaled minimal detectable slope for the t-test for linear trend # decreases with increasing sample size: seq(10, 50, by = 10) #[1] 10 20 30 40 50 scaled.mds <- linearTrendTestScaledMds(seq(10, 50, by = 10), alternative = "greater") round(scaled.mds, 2) #[1] 0.40 0.13 0.07 0.05 0.03 #========== # Look at how the scaled minimal detectable slope for the t-test for linear trend # decreases with increasing values of Type I error: scaled.mds <- linearTrendTestScaledMds(10, alpha = c(0.001, 0.01, 0.05, 0.1), alternative="greater") round(scaled.mds, 2) #[1] 0.76 0.53 0.40 0.34 #---------- # Repeat the last example, but compute the scaled minimal detectable slopes # based on the approximate power instead of the exact: scaled.mds <- linearTrendTestScaledMds(10, alpha = c(0.001, 0.01, 0.05, 0.1), alternative="greater", approx = TRUE) round(scaled.mds, 2) #[1] 0.70 0.52 0.41 0.36 #========== # Clean up #--------- rm(scaled.mds)
L-Moments

Estimate the r'th L-moment from a random sample.
lMoment(x, r = 1, method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), na.rm = FALSE)
x |
numeric vector of observations. |
r |
positive integer specifying the order of the moment. |
method |
character string specifying what method to use to compute the
|
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
na.rm |
logical scalar indicating whether to remove missing values from |
Definitions: L-Moments and L-Moment Ratios

The definition of an L-moment given by Hosking (1990) is as follows. Let X denote a random variable with cdf F, and let x(p) denote the p'th quantile of the distribution. Furthermore, let

x_{1:n} <= x_{2:n} <= ... <= x_{n:n}

denote the order statistics of a random sample of size n drawn from the distribution of X. Then the r'th L-moment is given by:

lambda_r = (1/r) * sum_{k=0}^{r-1} (-1)^k * choose(r-1, k) * E[ X_{r-k:r} ]

for r = 1, 2, 3, ....

Hosking (1990) shows that the above equation can be rewritten as:

lambda_r = integral_0^1 x(u) * P*_{r-1}(u) du

where

P*_r(u) = sum_{k=0}^{r} p*_{r,k} * u^k,    p*_{r,k} = (-1)^{r-k} * choose(r, k) * choose(r+k, k)

The first four L-moments are given by:

lambda_1 = E[X]
lambda_2 = (1/2) E[ X_{2:2} - X_{1:2} ]
lambda_3 = (1/3) E[ X_{3:3} - 2 X_{2:3} + X_{1:3} ]
lambda_4 = (1/4) E[ X_{4:4} - 3 X_{3:4} + 3 X_{2:4} - X_{1:4} ]

Thus, the first L-moment is a measure of location, and the second L-moment is a measure of scale.

Hosking (1990) defines the L-moment ratios of X to be:

tau_r = lambda_r / lambda_2

for r = 3, 4, 5, .... He shows that for a non-degenerate random variable with a finite mean, these quantities lie in the interval (-1, 1). The quantity tau_3 is the L-moment analog of the coefficient of skewness, and the quantity tau_4 is the L-moment analog of the coefficient of kurtosis. Hosking (1990) also defines an L-moment analog of the coefficient of variation (denoted the L-CV) as:

tau = lambda_2 / lambda_1

He shows that for a positive-valued random variable, the L-CV lies in the interval (0, 1).
Relationship Between L-Moments and Probability-Weighted Moments

Hosking (1990) and Hosking and Wallis (1995) show that L-moments can be written as linear combinations of probability-weighted moments:

lambda_{r+1} = sum_{k=0}^{r} p*_{r,k} * beta_k

where

beta_k = E{ X [F(X)]^k }

See the help file for pwMoment for more information on probability-weighted moments.
Estimating L-Moments

The two commonly used methods for estimating L-moments are the “unbiased” method based on U-statistics (Hoeffding, 1948; Lehmann, 1975, pp. 362-371), and the “plotting-position” method. Hosking and Wallis (1995) recommend using the unbiased method for almost all applications.

Unbiased Estimators (method="unbiased")

Using the relationship between L-moments and probability-weighted moments explained above, the unbiased estimator of the r'th L-moment is based on unbiased estimators of the probability-weighted moments and is given by:

l_r = sum_{k=0}^{r-1} p*_{r-1,k} * b_k

where

b_k = (1/n) * sum_{i=k+1}^{n} x_{i:n} * [ (i-1)(i-2)...(i-k) ] / [ (n-1)(n-2)...(n-k) ]

Plotting-Position Estimators (method="plotting.position")

Using the relationship between L-moments and probability-weighted moments explained above, the plotting-position estimator of the r'th L-moment is based on the plotting-position estimators of the probability-weighted moments and is given by:

l~_r = sum_{k=0}^{r-1} p*_{r-1,k} * b~_k

where

b~_k = (1/n) * sum_{i=1}^{n} [ p_{i:n} ]^k * x_{i:n}

and p_{i:n} denotes the plotting position of the i'th order statistic in the random sample of size n, that is, a distribution-free estimate of the cdf of X evaluated at the value of the i'th order statistic. Typically, plotting positions have the form:

p_{i:n} = (i - a) / (n + b)

where a and b are constants (the defaults are a = 0.35 and b = 0; see the argument plot.pos.cons). For this form of plotting position, the plotting-position estimators are asymptotically equivalent to their unbiased estimator counterparts.
Estimating L-Moment Ratios

L-moment ratios are estimated by simply replacing the population L-moments with the estimated L-moments. The estimated ratios based on the unbiased estimators are given by:

t_r = l_r / l_2

and the estimated ratios based on the plotting-position estimators are given by:

t~_r = l~_r / l~_2

In particular, the L-moment skew is estimated by:

t_3 = l_3 / l_2

or

t~_3 = l~_3 / l~_2

and the L-moment kurtosis is estimated by:

t_4 = l_4 / l_2

or

t~_4 = l~_4 / l~_2

Similarly, the L-moment coefficient of variation can be estimated using the unbiased L-moment estimators:

t = l_2 / l_1

or using the plotting-position L-moment estimators:

t~ = l~_2 / l~_1
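A short hedged sketch (simulated data) of these estimators using lMoment: the first sample L-moment equals the sample mean, and the ratios above give the sample L-skewness and L-kurtosis.

set.seed(250)
x <- rgevd(20, location = 10, scale = 2, shape = 0.25)
lMoment(x, 1)                   # equals mean(x)
lMoment(x, 3) / lMoment(x, 2)   # sample L-skewness (t_3)
lMoment(x, 4) / lMoment(x, 2)   # sample L-kurtosis (t_4)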
A numeric scalar – the value of the r'th L-moment as defined by Hosking (1990).
Hosking (1990) introduced the idea of L-moments, which are expectations of certain linear combinations of order statistics, as the basis of a general theory of describing theoretical probability distributions, computing summary statistics from observed data, estimating distribution parameters and quantiles, and performing hypothesis tests. The theory of L-moments parallels the theory of conventional moments. L-moments have several advantages over conventional moments, including:

L-moments can characterize a wider range of distributions because they always exist as long as the distribution has a finite mean.

L-moments are estimated by linear combinations of order statistics, so estimators based on L-moments are more robust to the presence of outliers than estimators based on conventional moments.

Based on the author's and others' experience, L-moment estimators are less biased and approximate their asymptotic distribution more closely in finite samples than estimators based on conventional moments.

L-moment estimators are sometimes more efficient (smaller RMSE) than even maximum likelihood estimators for small samples.

Hosking (1990) presents a table with formulas for the L-moments of common probability distributions. Articles that illustrate the use of L-moments include Fill and Stedinger (1995), Hosking and Wallis (1995), and Vogel and Fennessey (1993).

Hosking (1990) and Hosking and Wallis (1995) show the relationship between probability-weighted moments and L-moments.
Steven P. Millard ([email protected])
Fill, H.D., and J.R. Stedinger. (1995). Moment and Probability Plot
Correlation Coefficient Goodness-of-Fit Tests for the Gumbel Distribution and
Impact of Autocorrelation. Water Resources Research 31(1), 225–229.
Hosking, J.R.M. (1990). L-Moments: Analysis and Estimation of Distributions Using Linear Combinations of Order Statistics. Journal of the Royal Statistical Society, Series B 52(1), 105–124.
Hosking, J.R.M., and J.R. Wallis (1995). A Comparison of Unbiased and
Plotting-Position Estimators of Moments. Water Resources Research
31(8), 2019–2025.
Vogel, R.M., and N.M. Fennessey. (1993). L Moment Diagrams Should Replace Product Moment Diagrams. Water Resources Research 29(6), 1745–1752.
cv
, skewness
, kurtosis
,
pwMoment
.
# Generate 20 observations from a generalized extreme value distribution # with parameters location=10, scale=2, and shape=.25, then compute the # first four L-moments. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgevd(20, location = 10, scale = 2, shape = 0.25) lMoment(dat) #[1] 10.59556 lMoment(dat, 2) #[1] 1.0014 lMoment(dat, 3) #[1] 0.1681165 lMoment(dat, 4) #[1] 0.08732692 #---------- # Now compute some L-moments based on the plotting-position estimators: lMoment(dat, method = "plotting.position") #[1] 10.59556 lMoment(dat, 2, method = "plotting.position") #[1] 1.110264 lMoment(dat, 3, method="plotting.position", plot.pos.cons = c(.325,1)) #[1] -0.4430792 #---------- # Clean up #--------- rm(dat)
Density, distribution function, quantile function, and random generation
for the three-parameter lognormal distribution with parameters meanlog
,
sdlog
, and threshold
.
dlnorm3(x, meanlog = 0, sdlog = 1, threshold = 0) plnorm3(q, meanlog = 0, sdlog = 1, threshold = 0) qlnorm3(p, meanlog = 0, sdlog = 1, threshold = 0) rlnorm3(n, meanlog = 0, sdlog = 1, threshold = 0)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
meanlog |
vector of means of the distribution of the random variable on the log scale.
The default is |
sdlog |
vector of (positive) standard deviations of the random variable on the log scale.
The default is |
threshold |
vector of thresholds of the random variable on the log scale. The default
is |
The three-parameter lognormal distribution is simply the usual two-parameter lognormal distribution with a location shift.
Let X be a random variable with a three-parameter lognormal distribution
with parameters meanlog=μ, sdlog=σ, and threshold=γ. Then the random variable
Y = X − γ has a (two-parameter) lognormal distribution with parameters
meanlog=μ and sdlog=σ. Thus:
dlnorm3 calls dlnorm using the arguments x = x - threshold, meanlog = meanlog, sdlog = sdlog.

plnorm3 calls plnorm using the arguments q = q - threshold, meanlog = meanlog, sdlog = sdlog.

qlnorm3 calls qlnorm using the arguments p = p, meanlog = meanlog, sdlog = sdlog, and then adds the argument threshold to the result.

rlnorm3 calls rlnorm using the arguments n = n, meanlog = meanlog, sdlog = sdlog, and then adds the argument threshold to the result.
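As a concrete illustration of these relationships, here is a minimal sketch (it assumes the EnvStats package is installed and loaded) checking that dlnorm3() and qlnorm3() agree with the base R functions dlnorm() and qlnorm() after shifting by the threshold:

library(EnvStats)

meanlog <- 1; sdlog <- 2; threshold <- 10

# Density: dlnorm3() at x should equal dlnorm() at x - threshold
dlnorm3(15, meanlog, sdlog, threshold)
dlnorm(15 - threshold, meanlog, sdlog)

# Quantile: qlnorm3() should equal qlnorm() plus the threshold
qlnorm3(0.5, meanlog, sdlog, threshold)
qlnorm(0.5, meanlog, sdlog) + threshold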
The threshold parameter affects only the location of the
three-parameter lognormal distribution; it has no effect on the variance
or the shape of the distribution.
Denote the mean, variance, and coefficient of variation of Y = X − γ by:

E(Y) = θ,   Var(Y) = η²,   CV(Y) = τ = η/θ

Then the mean, variance, and coefficient of variation of X are given by:

E(X) = γ + θ,   Var(X) = η²,   CV(X) = η / (γ + θ)

The relationships between the parameters μ, σ, θ, η, and τ are as follows:

θ = exp(μ + σ²/2)
η² = θ² (ω − 1)
τ = sqrt(ω − 1)

where

ω = exp(σ²)

Since quantiles of a distribution are preserved under monotonic transformations,
the median of X is:

Median(X) = γ + exp(μ)
dlnorm3
gives the density, plnorm3
gives the distribution function,
qlnorm3
gives the quantile function, and rlnorm3
generates random
deviates.
The two-parameter lognormal distribution is the distribution of a random variable whose logarithm is normally distributed. The two major characteristics of the two-parameter lognormal distribution are that it is bounded below at 0, and it is skewed to the right. The three-parameter lognormal distribution is a generalization of the two-parameter lognormal distribution in which the distribution is shifted so that the threshold parameter is some arbitrary number, not necessarily 0.
The three-parameter lognormal distribution was introduced by Wicksell (1917) in a study of the distribution of ages at first marriage. Both the two- and three-parameter lognormal distributions have been used in a variety of fields, including economics and business, industry, biology, ecology, atmospheric science, and geology (Crow and Shimizu, 1988). Royston (1992) has discussed the application of the three-parameter lognormal distribution in the field of medicine.
The two-parameter lognormal distribution is often used to characterize chemical concentrations in the environment. Ott (1990) has shown mathematically how a series of successive random dilutions gives rise to a distribution that can be approximated by a two-parameter lognormal distribution.
The three-parameter lognormal distribution starts to resemble a normal
distribution as the parameter σ (the standard deviation of log(X − γ))
tends to 0.
Steven P. Millard ([email protected])
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special references to its uses in economics). Cambridge University Press, London, 176pp.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, 387pp.
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Ott, W.R. (1990). A Physical Explanation of the Lognormality of Pollutant Concentrations. Journal of the Air and Waste Management Association 40, 1378–1383.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 9.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897–912.
Wicksell, S.D. (1917). On Logarithmic Correlation with an Application to the Distribution of Ages at First Marriage. Medd. Lunds. Astr. Obs. 84, 1–21.
Lognormal, elnorm3, Probability Distributions and Random Numbers.
# Density of the three-parameter lognormal distribution with
# parameters meanlog=1, sdlog=2, and threshold=10, evaluated at 10.5:

dlnorm3(10.5, 1, 2, 10)
#[1] 0.278794

#----------

# The cdf of the three-parameter lognormal distribution with
# parameters meanlog=2, sdlog=3, and threshold=5, evaluated at 9:

plnorm3(9, 2, 3, 5)
#[1] 0.4189546

#----------

# The median of the three-parameter lognormal distribution with
# parameters meanlog=2, sdlog=3, and threshold=20:

qlnorm3(0.5, 2, 3, 20)
#[1] 27.38906

#----------

# Random sample of 3 observations from the three-parameter lognormal
# distribution with parameters meanlog=2, sdlog=1, and threshold=-5.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnorm3(3, 2, 1, -5)
#[1] 18.6339749 -0.8873173 39.0561521
Density, distribution function, quantile function, and random generation
for the lognormal distribution with parameters mean and cv.
dlnormAlt(x, mean = exp(1/2), cv = sqrt(exp(1) - 1), log = FALSE)
plnormAlt(q, mean = exp(1/2), cv = sqrt(exp(1) - 1), lower.tail = TRUE, log.p = FALSE)
qlnormAlt(p, mean = exp(1/2), cv = sqrt(exp(1) - 1), lower.tail = TRUE, log.p = FALSE)
rlnormAlt(n, mean = exp(1/2), cv = sqrt(exp(1) - 1))
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean |
vector of (positive) means of the distribution of the random variable. |
cv |
vector of (positive) coefficients of variation of the random variable. |
log , log.p
|
logical; if |
lower.tail |
logical; if |
Let X be a random variable with a lognormal distribution with parameters
meanlog=μ and sdlog=σ. That is, μ and σ denote the mean and standard deviation
of the random variable on the log scale. The relationship between these
parameters and the mean (mean=θ) and coefficient of variation (cv=τ) of the
distribution on the original scale is given by:

θ = exp(μ + σ²/2)
τ = sqrt(exp(σ²) − 1)

μ = log(θ / sqrt(τ² + 1))
σ = sqrt(log(τ² + 1))
Thus, the functions dlnormAlt, plnormAlt, qlnormAlt, and rlnormAlt call the
R functions dlnorm, plnorm, qlnorm, and rlnorm, respectively, using the
following values for the meanlog and sdlog parameters:

sdlog <- sqrt(log(1 + cv^2))
meanlog <- log(mean) - (sdlog^2)/2
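As a quick check of this parameterization, here is a minimal sketch (it assumes the EnvStats package is installed and loaded) that converts the mean and cv to meanlog and sdlog by hand and compares dlnormAlt() with the base R function dlnorm():

library(EnvStats)

mn <- 10     # mean on the original scale
tau <- 1     # coefficient of variation on the original scale

sdlog <- sqrt(log(1 + tau^2))
meanlog <- log(mn) - sdlog^2 / 2

# Both calls should return the same density value
dlnormAlt(5, mean = mn, cv = tau)
dlnorm(5, meanlog = meanlog, sdlog = sdlog)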
dlnormAlt
gives the density, plnormAlt
gives the distribution function,
qlnormAlt
gives the quantile function, and rlnormAlt
generates random
deviates.
The two-parameter lognormal distribution is the distribution of a random variable whose logarithm is normally distributed. The two major characteristics of the lognormal distribution are that it is bounded below at 0, and it is skewed to the right.
Because the empirical distribution of many variables is inherently positive and skewed to the right (e.g., size of organisms, amount of rainfall, size of income, etc.), the lognormal distribution has been widely applied in several fields, including economics, business, industry, biology, ecology, atmospheric science, and geology (Aitchison and Brown, 1957; Crow and Shimizu, 1988).
Gibrat (1930) derived the lognormal distribution from theoretical assumptions, calling it the "law of proportionate effect"; Kapteyn (1903) had earlier described a machine that was its mechanical equivalent. The basic idea follows from the Central Limit Theorem: the sum of several independent random variables tends toward a normal distribution regardless of the underlying distributions, so the logarithm of a product of several independent positive random variables (which is a sum of logarithms) tends toward a normal distribution, and the product itself therefore tends to look like a lognormal distribution.
The lognormal distribution is often used to characterize chemical concentrations in the environment. Ott (1990) has shown mathematically how a series of successive random dilutions gives rise to a distribution that can be approximated by a lognormal distribution.
A lognormal distribution starts to resemble a normal distribution as the
parameter σ (the standard deviation of the log of the distribution)
tends to 0.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) discourage using the assumption of a lognormal distribution for some types of environmental data and recommend instead assessing whether the data appear to fit a gamma distribution.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Limpert, E., W.A. Stahel, and M. Abbt. (2001). Log-Normal Distributions Across the Sciences: Keys and Clues. BioScience 51, 341–352.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Lognormal, elnormAlt, Probability Distributions and Random Numbers.
# Density of the lognormal distribution with parameters
# mean=10 and cv=1, evaluated at 5:

dlnormAlt(5, mean = 10, cv = 1)
#[1] 0.08788173

#----------

# The cdf of the lognormal distribution with parameters mean=2 and cv=3,
# evaluated at 4:

plnormAlt(4, 2, 3)
#[1] 0.8879132

#----------

# The median of the lognormal distribution with parameters
# mean=10 and cv=1:

qlnormAlt(0.5, mean = 10, cv = 1)
#[1] 7.071068

#----------

# Random sample of 3 observations from a lognormal distribution with
# parameters mean=10 and cv=1.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnormAlt(3, mean = 10, cv = 1)
#[1] 18.615797  4.341402 31.265293
Density, distribution function, quantile function, and random generation
for a mixture of two lognormal distributions with parameters
meanlog1, sdlog1, meanlog2, sdlog2, and p.mix.
dlnormMix(x, meanlog1 = 0, sdlog1 = 1, meanlog2 = 0, sdlog2 = 1, p.mix = 0.5)
plnormMix(q, meanlog1 = 0, sdlog1 = 1, meanlog2 = 0, sdlog2 = 1, p.mix = 0.5)
qlnormMix(p, meanlog1 = 0, sdlog1 = 1, meanlog2 = 0, sdlog2 = 1, p.mix = 0.5)
rlnormMix(n, meanlog1 = 0, sdlog1 = 1, meanlog2 = 0, sdlog2 = 1, p.mix = 0.5)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
meanlog1 |
vector of means of the first lognormal random variable on the log scale.
The default is |
sdlog1 |
vector of standard deviations of the first lognormal random variable on
the log scale. The default is |
meanlog2 |
vector of means of the second lognormal random variable on the log scale.
The default is |
sdlog2 |
vector of standard deviations of the second lognormal random variable on
the log scale. The default is |
p.mix |
vector of probabilities between 0 and 1 indicating the mixing proportion.
For |
Let f(x; μ, σ) denote the density of a lognormal random variable with
parameters meanlog=μ and sdlog=σ. The density, h(x), of a lognormal mixture
random variable with parameters meanlog1=μ₁, sdlog1=σ₁, meanlog2=μ₂,
sdlog2=σ₂, and p.mix=p is given by:

h(x) = (1 − p) f(x; μ₁, σ₁) + p f(x; μ₂, σ₂)
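As a minimal sketch of this definition (it assumes the EnvStats package is installed and loaded), the mixture density can be reproduced as a weighted sum of two base R dlnorm() densities, with p.mix weighting the second component:

library(EnvStats)

x <- 1.5
p.mix <- 0.2

# Both expressions should return the same density value
dlnormMix(x, meanlog1 = 0, sdlog1 = 1, meanlog2 = 2, sdlog2 = 3, p.mix = p.mix)
(1 - p.mix) * dlnorm(x, 0, 1) + p.mix * dlnorm(x, 2, 3)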
dlnormMix
gives the density, plnormMix
gives the distribution function,
qlnormMix
gives the quantile function, and rlnormMix
generates random
deviates.
A lognormal mixture distribution is often used to model positive-valued data
that appear to be “contaminated”; that is, most of the values appear to
come from a single lognormal distribution, but a few “outliers” are
apparent. In this case, the value of meanlog2
would be larger than the
value of meanlog1
, and the mixing proportion p.mix
would be fairly
close to 0 (e.g., p.mix=0.1
). The value of the second standard deviation
(sdlog2
) may or may not be the same as the value for the first
(sdlog1
).
Steven P. Millard ([email protected])
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.53-54, and Chapter 8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Lognormal, NormalMix, Probability Distributions and Random Numbers.
# Density of a lognormal mixture with parameters meanlog1=0, sdlog1=1,
# meanlog2=2, sdlog2=3, p.mix=0.5, evaluated at 1.5:

dlnormMix(1.5, meanlog1 = 0, sdlog1 = 1, meanlog2 = 2, sdlog2 = 3, p.mix = 0.5)
#[1] 0.1609746

#----------

# The cdf of a lognormal mixture with parameters meanlog1=0, sdlog1=1,
# meanlog2=2, sdlog2=3, p.mix=0.2, evaluated at 4:

plnormMix(4, 0, 1, 2, 3, 0.2)
#[1] 0.8175281

#----------

# The median of a lognormal mixture with parameters meanlog1=0, sdlog1=1,
# meanlog2=2, sdlog2=3, p.mix=0.2:

qlnormMix(0.5, 0, 1, 2, 3, 0.2)
#[1] 1.156891

#----------

# Random sample of 3 observations from a lognormal mixture with
# parameters meanlog1=0, sdlog1=1, meanlog2=2, sdlog2=3, p.mix=0.2.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnormMix(3, 0, 1, 2, 3, 0.2)
#[1] 0.08975283 1.07591103 7.85482514
Density, distribution function, quantile function, and random generation
for a mixture of two lognormal distributions with parameters
mean1, cv1, mean2, cv2, and p.mix.
dlnormMixAlt(x, mean1 = exp(1/2), cv1 = sqrt(exp(1) - 1),
    mean2 = exp(1/2), cv2 = sqrt(exp(1) - 1), p.mix = 0.5)
plnormMixAlt(q, mean1 = exp(1/2), cv1 = sqrt(exp(1) - 1),
    mean2 = exp(1/2), cv2 = sqrt(exp(1) - 1), p.mix = 0.5)
qlnormMixAlt(p, mean1 = exp(1/2), cv1 = sqrt(exp(1) - 1),
    mean2 = exp(1/2), cv2 = sqrt(exp(1) - 1), p.mix = 0.5)
rlnormMixAlt(n, mean1 = exp(1/2), cv1 = sqrt(exp(1) - 1),
    mean2 = exp(1/2), cv2 = sqrt(exp(1) - 1), p.mix = 0.5)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean1 |
vector of means of the first lognormal random variable. The default is |
cv1 |
vector of coefficients of variation of the first lognormal random variable.
The default is |
mean2 |
vector of means of the second lognormal random variable. The default is |
cv2 |
vector of coefficients of variation of the second lognormal random variable.
The default is |
p.mix |
vector of probabilities between 0 and 1 indicating the mixing proportion.
For |
Let f(x; θ, τ) denote the density of a lognormal random variable with
parameters mean=θ and cv=τ. The density, h(x), of a lognormal mixture
random variable with parameters mean1=θ₁, cv1=τ₁, mean2=θ₂, cv2=τ₂, and
p.mix=p is given by:

h(x) = (1 − p) f(x; θ₁, τ₁) + p f(x; θ₂, τ₂)
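Here is a minimal sketch of this definition (it assumes the EnvStats package is installed and loaded), expressing the mixture density as a weighted sum of two dlnormAlt() densities, with p.mix weighting the second component:

library(EnvStats)

x <- 1.5
p.mix <- 0.2

# Both expressions should return the same density value
dlnormMixAlt(x, mean1 = 2, cv1 = 3, mean2 = 4, cv2 = 5, p.mix = p.mix)
(1 - p.mix) * dlnormAlt(x, mean = 2, cv = 3) + p.mix * dlnormAlt(x, mean = 4, cv = 5)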
The default values for mean1
and cv1
correspond to a
lognormal distribution with parameters
meanlog=0
and sdlog=1
. Similarly for the default values
of mean2
and cv2
.
dlnormMixAlt
gives the density, plnormMixAlt
gives the distribution
function, qlnormMixAlt
gives the quantile function, and
rlnormMixAlt
generates random deviates.
A lognormal mixture distribution is often used to model positive-valued data
that appear to be “contaminated”; that is, most of the values appear to
come from a single lognormal distribution, but a few “outliers” are
apparent. In this case, the value of mean2
would be larger than the
value of mean1
, and the mixing proportion p.mix
would be fairly
close to 0 (e.g., p.mix=0.1
).
Steven P. Millard ([email protected])
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.53-54, and Chapter 8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
LognormalAlt, LognormalMix, Lognormal, NormalMix, Probability Distributions and Random Numbers.
# Density of a lognormal mixture with parameters mean1=2, cv1=3,
# mean2=4, cv2=5, p.mix=0.5, evaluated at 1.5:

dlnormMixAlt(1.5, mean1 = 2, cv1 = 3, mean2 = 4, cv2 = 5, p.mix = 0.5)
#[1] 0.1436045

#----------

# The cdf of a lognormal mixture with parameters mean1=2, cv1=3,
# mean2=4, cv2=5, p.mix=0.5, evaluated at 1.5:

plnormMixAlt(1.5, mean1 = 2, cv1 = 3, mean2 = 4, cv2 = 5, p.mix = 0.5)
#[1] 0.6778064

#----------

# The median of a lognormal mixture with parameters mean1=2, cv1=3,
# mean2=4, cv2=5, p.mix=0.5:

qlnormMixAlt(0.5, 2, 3, 4, 5, 0.5)
#[1] 0.6978355

#----------

# Random sample of 3 observations from a lognormal mixture with
# parameters mean1=2, cv1=3, mean2=4, cv2=5, p.mix=0.5.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnormMixAlt(3, 2, 3, 4, 5, 0.5)
#[1]  0.70672151 14.43226313  0.05521329
Density, distribution function, quantile function, and random generation
for the truncated lognormal distribution with parameters meanlog, sdlog,
min, and max.
dlnormTrunc(x, meanlog = 0, sdlog = 1, min = 0, max = Inf)
plnormTrunc(q, meanlog = 0, sdlog = 1, min = 0, max = Inf)
qlnormTrunc(p, meanlog = 0, sdlog = 1, min = 0, max = Inf)
rlnormTrunc(n, meanlog = 0, sdlog = 1, min = 0, max = Inf)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
meanlog |
vector of means of the distribution of the non-truncated random variable
on the log scale.
The default is |
sdlog |
vector of (positive) standard deviations of the non-truncated random variable
on the log scale.
The default is |
min |
vector of minimum values for truncation on the left. The default value is
|
max |
vector of maximum values for truncation on the right. The default value is
|
See the help file for the lognormal distribution for information about the density and cdf of a lognormal distribution.
Probability Density and Cumulative Distribution Function
Let X denote a random variable with density function f(x) and cumulative
distribution function F(x), and let Y denote the truncated version of X,
where Y is truncated below at min=a and above at max=b. Then the density
function of Y, denoted g(y), is given by:

g(y) = f(y) / [F(b) − F(a)],   for a ≤ y ≤ b

and the cdf of Y, denoted G(y), is given by:

G(y) = 0                                for y < a
G(y) = [F(y) − F(a)] / [F(b) − F(a)]    for a ≤ y ≤ b
G(y) = 1                                for y > b

Quantiles

The p'th quantile y_p of Y is given by:

y_p = a                                  for p = 0
y_p = F⁻¹{F(a) + p [F(b) − F(a)]}        for 0 < p < 1
y_p = b                                  for p = 1

Random Numbers

Random numbers are generated using the inverse transformation method:

y = G⁻¹(u)

where u is a random deviate from a Uniform(0, 1) distribution.
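As a minimal sketch of these formulas (it assumes the EnvStats package is installed and loaded), the truncated density rescales the untruncated density by the probability mass between min and max, and quantiles come from the inverse-cdf relationship:

library(EnvStats)

meanlog <- 1; sdlog <- 0.75; lo <- 0; hi <- 10

# Density: rescale the untruncated density by the mass between lo and hi
dlnormTrunc(2, meanlog, sdlog, lo, hi)
dlnorm(2, meanlog, sdlog) / (plnorm(hi, meanlog, sdlog) - plnorm(lo, meanlog, sdlog))

# Quantile: apply the untruncated quantile function to a rescaled probability
p <- 0.5
qlnormTrunc(p, meanlog, sdlog, lo, hi)
qlnorm(plnorm(lo, meanlog, sdlog) +
       p * (plnorm(hi, meanlog, sdlog) - plnorm(lo, meanlog, sdlog)),
       meanlog, sdlog)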
dlnormTrunc
gives the density, plnormTrunc
gives the distribution function,
qlnormTrunc
gives the quantile function, and rlnormTrunc
generates random
deviates.
A truncated lognormal distribution is sometimes used as an input distribution for probabilistic risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, Chapter 2.
Lognormal, Probability Distributions and Random Numbers.
# Density of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10, evaluated at 2 and 4: dlnormTrunc(c(2, 4), 1, 0.75, 0, 10) #[1] 0.2551219 0.1214676 #---------- # The cdf of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10, evaluated at 2 and 4: plnormTrunc(c(2, 4), 1, 0.75, 0, 10) #[1] 0.3558867 0.7266934 #---------- # The median of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10: qlnormTrunc(.5, 1, 0.75, 0, 10) #[1] 2.614945 #---------- # A random sample of 3 observations from a truncated lognormal distribution # with parameters meanlog=1, sdlog=0.75, min=0, max=10. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) rlnormTrunc(3, 1, 0.75, 0, 10) #[1] 5.754805 4.372218 1.706815
# Density of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10, evaluated at 2 and 4: dlnormTrunc(c(2, 4), 1, 0.75, 0, 10) #[1] 0.2551219 0.1214676 #---------- # The cdf of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10, evaluated at 2 and 4: plnormTrunc(c(2, 4), 1, 0.75, 0, 10) #[1] 0.3558867 0.7266934 #---------- # The median of a truncated lognormal distribution with parameters # meanlog=1, sdlog=0.75, min=0, max=10: qlnormTrunc(.5, 1, 0.75, 0, 10) #[1] 2.614945 #---------- # A random sample of 3 observations from a truncated lognormal distribution # with parameters meanlog=1, sdlog=0.75, min=0, max=10. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(20) rlnormTrunc(3, 1, 0.75, 0, 10) #[1] 5.754805 4.372218 1.706815
Density, distribution function, quantile function, and random generation
for the truncated lognormal distribution with parameters mean, cv,
min, and max.
dlnormTruncAlt(x, mean = exp(1/2), cv = sqrt(exp(1) - 1), min = 0, max = Inf)
plnormTruncAlt(q, mean = exp(1/2), cv = sqrt(exp(1) - 1), min = 0, max = Inf)
qlnormTruncAlt(p, mean = exp(1/2), cv = sqrt(exp(1) - 1), min = 0, max = Inf)
rlnormTruncAlt(n, mean = exp(1/2), cv = sqrt(exp(1) - 1), min = 0, max = Inf)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean |
vector of means of the distribution of the non-truncated random variable.
The default is |
cv |
vector of (positive) coefficients of variation of the non-truncated random variable.
The default is |
min |
vector of minimum values for truncation on the left. The default value is
|
max |
vector of maximum values for truncation on the right. The default value is
|
See the help file for LognormalAlt for information about the density and cdf of a lognormal distribution with this alternative parameterization.
Let X denote a random variable with density function f(x) and cumulative
distribution function F(x), and let Y denote the truncated version of X,
where Y is truncated below at min=a and above at max=b. Then the density
function of Y, denoted g(y), is given by:

g(y) = f(y) / [F(b) − F(a)],   for a ≤ y ≤ b

and the cdf of Y, denoted G(y), is given by:

G(y) = 0                                for y < a
G(y) = [F(y) − F(a)] / [F(b) − F(a)]    for a ≤ y ≤ b
G(y) = 1                                for y > b

The p'th quantile y_p of Y is given by:

y_p = a                                  for p = 0
y_p = F⁻¹{F(a) + p [F(b) − F(a)]}        for 0 < p < 1
y_p = b                                  for p = 1

Random numbers are generated using the inverse transformation method:

y = G⁻¹(u)

where u is a random deviate from a Uniform(0, 1) distribution.
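Here is a minimal sketch relating this parameterization to the meanlog/sdlog one (it assumes the EnvStats package is installed and loaded): converting mean and cv of the non-truncated distribution to meanlog and sdlog and calling dlnormTrunc() should reproduce dlnormTruncAlt():

library(EnvStats)

mn <- 10; tau <- 1          # mean and cv of the *non-truncated* distribution
sdlog <- sqrt(log(1 + tau^2))
meanlog <- log(mn) - sdlog^2 / 2

# Both calls should return the same density value
dlnormTruncAlt(12, mean = mn, cv = tau, min = 0, max = 20)
dlnormTrunc(12, meanlog = meanlog, sdlog = sdlog, min = 0, max = 20)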
dlnormTruncAlt
gives the density, plnormTruncAlt
gives the distribution function,
qlnormTruncAlt
gives the quantile function, and rlnormTruncAlt
generates random
deviates.
A truncated lognormal distribution is sometimes used as an input distribution for probabilistic risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, Chapter 2.
LognormalAlt, Probability Distributions and Random Numbers.
# Density of a truncated lognormal distribution with parameters
# mean=10, cv=1, min=0, max=20, evaluated at 2 and 12:

dlnormTruncAlt(c(2, 12), 10, 1, 0, 20)
#[1] 0.08480874 0.03649884

#----------

# The cdf of a truncated lognormal distribution with parameters
# mean=10, cv=1, min=0, max=20, evaluated at 2 and 12:

plnormTruncAlt(c(2, 12), 10, 1, 0, 20)
#[1] 0.07230627 0.82467603

#----------

# The median of a truncated lognormal distribution with parameters
# mean=10, cv=1, min=0, max=20:

qlnormTruncAlt(.5, 10, 1, 0, 20)
#[1] 6.329505

#----------

# A random sample of 3 observations from a truncated lognormal distribution
# with parameters mean=10, cv=1, min=0, max=20.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rlnormTruncAlt(3, 10, 1, 0, 20)
#[1]  6.685391 17.445387 18.543553
Given a data frame or matrix in long format, convert it to wide format based on
the levels of two variables in the data frame. This is a simplified version of
the R function reshape with the argument direction="wide".
longToWide(x, data.var, row.var, col.var,
    row.labels = levels(factor(x[, row.var])),
    col.labels = levels(factor(x[, col.var])),
    paste.row.name = FALSE, paste.col.name = FALSE, sep = ".",
    check.names = FALSE, ...)
x |
data frame or matrix to convert to wide format. Must have at least 3 columns corresponding to the data variable, row variable, and column variable, respectively. |
data.var |
character string or numeric scalar indicating column variable name in |
row.var |
character string or numeric scalar indicating column variable name in |
col.var |
character string or numeric scalar indicating column variable name in |
row.labels |
optional character vector indicating labels to use for rows. The default value is the levels
of the variable indicated by |
col.labels |
optional character vector indicating labels to use for columns. The default value is the levels
of the variable indicated by |
paste.row.name |
logical scalar indicating whether to paste the name of the variable used to define the row names
(i.e., the value of |
paste.col.name |
logical scalar indicating whether to paste the name of the variable used to define the column names
(i.e., the value of |
sep |
character string separator used when |
check.names |
argument to |
... |
other arguments to |
The combination of values in x[, row.var] and x[, col.var] must yield
n unique values, where n is the number of rows in x.
longToWide
returns a matrix when x
is a matrix and a data frame when x
is a data frame. The number of rows is equal to the number of
unique values in x[, row.var]
and the number of columns is equal to the number of
unique values in x[, col.var]
.
Steven P. Millard ([email protected]), based on a template from Phil Dixon.
EPA.09.Ex.10.1.nickel.df
#   Month   Well Nickel.ppb
#1      1 Well.1       58.8
#2      3 Well.1        1.0
#3      6 Well.1      262.0
#4      8 Well.1       56.0
#5     10 Well.1        8.7
#6      1 Well.2       19.0
#7      3 Well.2       81.5
#8      6 Well.2      331.0
#9      8 Well.2       14.0
#10    10 Well.2       64.4
#11     1 Well.3       39.0
#12     3 Well.3      151.0
#13     6 Well.3       27.0
#14     8 Well.3       21.4
#15    10 Well.3      578.0
#16     1 Well.4        3.1
#17     3 Well.4      942.0
#18     6 Well.4       85.6
#19     8 Well.4       10.0
#20    10 Well.4      637.0

longToWide(EPA.09.Ex.10.1.nickel.df, "Nickel.ppb", "Month", "Well",
    paste.row.name = TRUE)
#         Well.1 Well.2 Well.3 Well.4
#Month.1    58.8   19.0   39.0    3.1
#Month.3     1.0   81.5  151.0  942.0
#Month.6   262.0  331.0   27.0   85.6
#Month.8    56.0   14.0   21.4   10.0
#Month.10    8.7   64.4  578.0  637.0
Copper and zinc concentrations (mg/L) in shallow ground water from two geological
zones (Alluvial Fan and Basin-Trough) in the San Joaquin Valley, CA. There are 68
samples from the Alluvial Fan zone and 50 from the Basin-Trough zone. Some
observations are reported as <DL, where DL denotes a detection limit. There
are multiple detection limits for both the copper and zinc data in each of the
geological zones.
Millard.Deverel.88.df
A data frame with 118 observations on the following 8 variables.
Cu.orig
a character vector of original copper concentrations (mg/L)
Cu
a numeric vector of copper concentrations with nondetects coded to their detection limit
Cu.censored
a logical vector indicating which copper concentrations are censored
Zn.orig
a character vector of original zinc concentrations (mg/L)
Zn
a numeric vector of zinc concentrations with nondetects coded to their detection limit
Zn.censored
a logical vector indicating which zinc concentrations are censored
Zone
a factor indicating the zone (alluvial fan vs. basin trough)
Location
a numeric vector indicating the sampling location
Millard, S.P., and S.J. Deverel. (1988). Nonparametric Statistical Methods for Comparing Two Sites Based on Data With Multiple Nondetect Limits. Water Resources Research, 24(12), 2087-2098.
Deverel, S.J., R.J. Gilliom, R. Fujii, J.A. Izbicki, and J.C. Fields. (1984). Areal Distribution of Selenium and Other Inorganic Constituents in Shallow Ground Water of the San Luis Drain Service Area, San Joaquin, California: A Preliminary Study. U.S. Geological Survey Water Resources Investigative Report 84-4319.
Artificial 1,2,3,4-Tetrachlorobenzene (TcCB) concentrations with censored values;
based on the reference area data stored in EPA.94b.tccb.df
. The data
frame EPA.94b.tccb.df
contains TcCB concentrations (ppb) in soil samples
at a reference area and a cleanup area. The data frame Modified.TcCB.df
contains a modified version of the data from the reference area.
For this data set, the concentrations of TcCB less than 0.5 ppb have been recoded as
<0.5
.
Modified.TcCB.df
A data frame with 47 observations on the following 3 variables.
TcCB.orig
a character vector of original TcCB concentrations (ppb)
TcCB
a numeric vector with censored observations set to their detection level
Censored
a logical vector indicating which observations are censored
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, p.595.
USEPA. (1994b). Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media. EPA/230-R-94-004. Office of Policy, Planning, and Evaluation, U.S. Environmental Protection Agency, Washington, D.C.
Show the NEWS file of the EnvStats package.
newsEnvStats()
The function newsEnvStats
displays the contents of the EnvStats NEWS file in a
separate text window. You can also access the NEWS file with the command
news(package="EnvStats")
, which returns the contents of the file to the R command
window.
None.
Steven P. Millard ([email protected])
news.
Air lead levels collected by the National Institute for Occupational Safety and Health (NIOSH) at 15 different areas within the Alma American Labs, Fairplay, CO, for health hazard evaluation (HETA 89-052) on February 23, 1989.
NIOSH.89.air.lead.vec
A numeric vector with 15 elements containing air lead concentrations (µg/m³).
Krishnamoorthy, K., T. Matthew, and G. Ramachandran. (2006). Generalized P-Values and Confidence Intervals: A Novel Approach for Analyzing Lognormally Distributed Exposure Data. Journal of Occupational and Environmental Hygiene, 3, 642–650.
Zou, G.Y., C.Y. Huo, and J. Taleban. (2009). Simple Confidence Intervals for Lognormal Means and their Differences with Environmental Applications. Environmetrics, 20, 172–180.
Density, distribution function, quantile function, and random generation
for a mixture of two normal distributions with parameters
mean1, sd1, mean2, sd2, and p.mix.
dnormMix(x, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1, p.mix = 0.5)
pnormMix(q, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1, p.mix = 0.5)
qnormMix(p, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1, p.mix = 0.5)
rnormMix(n, mean1 = 0, sd1 = 1, mean2 = 0, sd2 = 1, p.mix = 0.5)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean1 |
vector of means of the first normal random variable.
The default is |
sd1 |
vector of standard deviations of the first normal random variable.
The default is |
mean2 |
vector of means of the second normal random variable.
The default is |
sd2 |
vector of standard deviations of the second normal random variable.
The default is |
p.mix |
vector of probabilities between 0 and 1 indicating the mixing proportion.
For |
Let f(x; μ, σ) denote the density of a normal random variable with
parameters mean=μ and sd=σ. The density, h(x), of a normal mixture random
variable with parameters mean1=μ₁, sd1=σ₁, mean2=μ₂, sd2=σ₂, and p.mix=p is
given by:

h(x) = (1 − p) f(x; μ₁, σ₁) + p f(x; μ₂, σ₂)
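As a minimal sketch of this definition (it assumes the EnvStats package is installed and loaded), the mixture cdf is the corresponding weighted sum of two base R pnorm() values, so the cdf example below can be reproduced by hand:

library(EnvStats)

# Both expressions should return the same probability
pnormMix(15, 10, 2, 20, 2, 0.1)
(1 - 0.1) * pnorm(15, 10, 2) + 0.1 * pnorm(15, 20, 2)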
dnormMix
gives the density, pnormMix
gives the distribution function,
qnormMix
gives the quantile function, and rnormMix
generates random
deviates.
A normal mixture distribution is sometimes used to model data
that appear to be “contaminated”; that is, most of the values appear to
come from a single normal distribution, but a few “outliers” are
apparent. In this case, the value of mean2
would be larger than the
value of mean1
, and the mixing proportion p.mix
would be fairly
close to 0 (e.g., p.mix=0.1
). The value of the second standard deviation
(sd2
) may or may not be the same as the value for the first
(sd1
).
Another application of the normal mixture distribution is to bi-modal data; that is, data exhibiting two modes.
Steven P. Millard ([email protected])
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, pp.53-54, and Chapter 8.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Normal, LognormalMix, Probability Distributions and Random Numbers.
# Density of a normal mixture with parameters mean1=0, sd1=1,
# mean2=4, sd2=2, p.mix=0.5, evaluated at 1.5:

dnormMix(1.5, mean2=4, sd2=2)
#[1] 0.1104211

#----------

# The cdf of a normal mixture with parameters mean1=10, sd1=2,
# mean2=20, sd2=2, p.mix=0.1, evaluated at 15:

pnormMix(15, 10, 2, 20, 2, 0.1)
#[1] 0.8950323

#----------

# The median of a normal mixture with parameters mean1=10, sd1=2,
# mean2=20, sd2=2, p.mix=0.1:

qnormMix(0.5, 10, 2, 20, 2, 0.1)
#[1] 10.27942

#----------

# Random sample of 3 observations from a normal mixture with
# parameters mean1=0, sd1=1, mean2=4, sd2=2, p.mix=0.5.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rnormMix(3, mean2=4, sd2=2)
#[1] 0.07316778 2.06112801 1.05953620
Density, distribution function, quantile function, and random generation
for the truncated normal distribution with parameters mean, sd,
min, and max.
dnormTrunc(x, mean = 0, sd = 1, min = -Inf, max = Inf)
pnormTrunc(q, mean = 0, sd = 1, min = -Inf, max = Inf)
qnormTrunc(p, mean = 0, sd = 1, min = -Inf, max = Inf)
rnormTrunc(n, mean = 0, sd = 1, min = -Inf, max = Inf)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
mean |
vector of means of the distribution of the non-truncated random variable.
The default is |
sd |
vector of (positive) standard deviations of the non-truncated random variable.
The default is |
min |
vector of minimum values for truncation on the left. The default value is
|
max |
vector of maximum values for truncation on the right. The default value is
|
See the help file for the normal distribution for information about the density and cdf of a normal distribution.
Probability Density and Cumulative Distribution Function
Let X denote a random variable with density function f(x) and cumulative
distribution function F(x), and let Y denote the truncated version of X,
where Y is truncated below at min=a and above at max=b. Then the density
function of Y, denoted g(y), is given by:

g(y) = f(y) / [F(b) − F(a)],   for a ≤ y ≤ b

and the cdf of Y, denoted G(y), is given by:

G(y) = 0                                for y < a
G(y) = [F(y) − F(a)] / [F(b) − F(a)]    for a ≤ y ≤ b
G(y) = 1                                for y > b

Quantiles

The p'th quantile y_p of Y is given by:

y_p = a                                  for p = 0
y_p = F⁻¹{F(a) + p [F(b) − F(a)]}        for 0 < p < 1
y_p = b                                  for p = 1

Random Numbers

Random numbers are generated using the inverse transformation method:

y = G⁻¹(u)

where u is a random deviate from a Uniform(0, 1) distribution.

Mean and Variance

The expected value of a truncated normal random variable with parameters
mean=μ, sd=σ, min=a, and max=b is given by:

E(Y) = μ + σ [φ(a') − φ(b')] / [Φ(b') − Φ(a')]

where φ and Φ denote the density and cdf of the standard normal distribution,
and a' = (a − μ)/σ, b' = (b − μ)/σ
(Johnson et al., 1994, p.156; Schneider, 1986, p.17).

The variance of this random variable is given by:

Var(Y) = σ² { 1 + [a' φ(a') − b' φ(b')] / [Φ(b') − Φ(a')]
              − ( [φ(a') − φ(b')] / [Φ(b') − Φ(a')] )² }

(Johnson et al., 1994, p.158; Schneider, 1986, p.17).
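Here is a minimal sketch (it assumes the EnvStats package is installed and loaded) that checks the expected-value formula against a direct numerical integration of y * g(y) using dnormTrunc():

library(EnvStats)

mu <- 10; sigma <- 2; a <- 8; b <- 13
za <- (a - mu) / sigma
zb <- (b - mu) / sigma

# Closed-form mean of the truncated normal
mu + sigma * (dnorm(za) - dnorm(zb)) / (pnorm(zb) - pnorm(za))

# Numerical check: integrate y * g(y) over [a, b]
integrate(function(y) y * dnormTrunc(y, mu, sigma, a, b), lower = a, upper = b)$value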
dnormTrunc
gives the density, pnormTrunc
gives the distribution function,
qnormTrunc
gives the quantile function, and rnormTrunc
generates random
deviates.
A truncated normal distribution is sometimes used as an input distribution for probabilistic risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Schneider, H. (1986). Truncated and Censored Samples from Normal Populations. Marcel Dekker, New York, Chapter 2.
Normal, Probability Distributions and Random Numbers.
# Density of a truncated normal distribution with parameters
# mean=10, sd=2, min=8, max=13, evaluated at 10 and 11.5:

dnormTrunc(c(10, 11.5), 10, 2, 8, 13)
#[1] 0.2575358 0.1943982

#----------

# The cdf of a truncated normal distribution with parameters
# mean=10, sd=2, min=8, max=13, evaluated at 10 and 11.5:

pnormTrunc(c(10, 11.5), 10, 2, 8, 13)
#[1] 0.4407078 0.7936573

#----------

# The median of a truncated normal distribution with parameters
# mean=10, sd=2, min=8, max=13:

qnormTrunc(.5, 10, 2, 8, 13)
#[1] 10.23074

#----------

# A random sample of 3 observations from a truncated normal distribution
# with parameters mean=10, sd=2, min=8, max=13.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(20)
rnormTrunc(3, 10, 2, 8, 13)
#[1] 11.975223 11.373711  9.361258
Ammonium (NH4) concentration (mg/L) in precipitation measured at
Olympic National Park, Hoh Ranger Station (WA14), weekly or every other week
from January 6, 2009 through December 20, 2011.
Olympic.NH4.df
A data frame with 102 observations on the following 6 variables.
Date.On
Start of collection period. Date on which the sample bucket was installed on the collector.
Date.Off
End of collection period. Date on which the sample bucket was removed from the collector.
Week
a numeric vector indicating the cumulative week number starting from January 1, 2009.
NH4.Orig.mg.per.L
a character vector of the original NH4
concentrations reported either as the observed value or less than some
detection limit. For values reported as less than a detection limit,
the value reported is the actual limit of detection or, in the case of a
diluted sample, the product of the detection limit value and the
dilution factor.
NH4.mg.per.L
a numeric vector of NH4 concentrations with
non-detects coded to their detection limit.
Censored
a logical vector indicating which observations are censored.
Olympic National Park-Hoh Ranger Station (WA14)
Jefferson County, Washington
Latitude: 47.8597
Longitude: -123.9325
Elevation: 182 meters
Owl Mountain
Olympic National Park
NPS-Air Resources Division
National Atmospheric Deposition Program, National Trends Network (NADP/NTN).
https://nadp.slh.wisc.edu/sites/ntn-WA14/
Perform Fisher's one-sample randomization (permutation) test for location.
oneSamplePermutationTest(x, alternative = "two.sided", mu = 0,
    exact = FALSE, n.permutations = 5000, seed = NULL, ...)
x |
numeric vector of observations.
Missing ( |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
mu |
numeric scalar indicating the hypothesized value of the mean.
The default value is |
exact |
logical scalar indicating whether to perform the exact permutation test
(i.e., enumerate all possible permutations) or simply sample from the permutation
distribution. The default value is |
n.permutations |
integer indicating how many times to sample from the permutation distribution when
|
seed |
positive integer to pass to the R function |
... |
arguments that can be supplied to the |
Randomization Tests
In 1935, R.A. Fisher introduced the idea of a randomization test
(Manly, 2007, p. 107; Efron and Tibshirani, 1993, Chapter 15), which is based on
trying to answer the question: “Did the observed pattern happen by chance,
or does the pattern indicate the null hypothesis is not true?” A randomization
test works by simply enumerating all of the possible outcomes under the null
hypothesis, then seeing where the observed outcome fits in. A randomization test
is also called a permutation test, because it involves permuting the
observations during the enumeration procedure (Manly, 2007, p. 3).
In the past, randomization tests have not been used as extensively as they are now
because of the “large” computing resources needed to enumerate all of the
possible outcomes, especially for large sample sizes. The advent of more powerful
personal computers and software has allowed randomization tests to become much
easier to perform. Depending on the sample size, however, it may still be too
time consuming to enumerate all possible outcomes. In this case, the randomization
test can still be performed by sampling from the randomization distribution, and
comparing the observed outcome to this sampled permutation distribution.
Fisher's One-Sample Randomization Test for Location
Let x = (x₁, x₂, …, xₙ) be a vector of n independent
and identically distributed (i.i.d.) observations from some symmetric distribution
with mean μ. Consider the test of the null hypothesis that the mean μ
is equal to some specified value μ₀:

H₀: μ = μ₀        (1)

The three possible alternative hypotheses are the upper one-sided alternative
(alternative="greater")

Hₐ: μ > μ₀        (2)

the lower one-sided alternative (alternative="less")

Hₐ: μ < μ₀        (3)

and the two-sided alternative

Hₐ: μ ≠ μ₀        (4)

To perform the test of the null hypothesis (1) versus any of the three alternatives (2)-(4), Fisher proposed using the test statistic

T = Σ yᵢ  (i = 1, 2, …, n)

where

yᵢ = xᵢ − μ₀

(Manly, 2007, p. 112). The test assumes all of the observations come from the
same distribution that is symmetric about the true population mean
(hence the mean is the same as the median for this distribution).

Under the null hypothesis, the yᵢ's are equally likely to be positive or
negative. Therefore, the permutation distribution of the test statistic T
consists of enumerating all possible ways of permuting the signs of the
yᵢ's and computing the resulting sums. For n observations, there are 2ⁿ
possible permutations of the signs, because each observation can either
be positive or negative.

For a one-sided upper alternative hypothesis (Equation (2)), the p-value is computed
as the proportion of sums in the permutation distribution that are greater than or
equal to the observed sum T. For a one-sided lower alternative hypothesis
(Equation (3)), the p-value is computed as the proportion of sums in the permutation
distribution that are less than or equal to the observed sum T. For a
two-sided alternative hypothesis (Equation (4)), the p-value is computed by using
the permutation distribution of the absolute value of T (i.e., |T|)
and computing the proportion of values in this permutation distribution that are
greater than or equal to the observed value of |T|.
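The following minimal sketch (with made-up data, for illustration only; it does not use EnvStats) enumerates the exact permutation distribution described above for a small sample and computes the upper one-sided p-value:

# Hypothetical data and hypothesized mean (for illustration only)
x <- c(3.2, 5.1, 7.4, 6.0, 4.8)
mu0 <- 4

y <- x - mu0
n <- length(y)

# All 2^n assignments of signs to the y_i's
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))

# Permutation distribution of the test statistic T = sum(y_i)
perm.sums <- as.vector(signs %*% y)

# Upper one-sided p-value: proportion of permuted sums >= observed sum
obs <- sum(y)
mean(perm.sums >= obs)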
Confidence Intervals Based on Permutation Tests
Based on the relationship between hypothesis tests and confidence intervals, it is
possible to construct a two-sided or one-sided (1 − α)100% confidence
interval for the mean μ based on the one-sample permutation test by finding
the values of μ₀ that correspond to obtaining a p-value of α
(Manly, 2007, pp. 18–20, 113). A confidence interval based on the bootstrap,
however, will yield a similar type of confidence interval
(Efron and Tibshirani, 1993, p. 214); see the help file for
boot in the R package boot.
A list of class "permutationTest"
containing the results of the hypothesis
test. See the help file for permutationTest.object
for details.
A frequent question in environmental statistics is “Is the concentration of chemical X greater than Y units?”. For example, in groundwater assessment (compliance) monitoring at hazardous and solid waste sites, the concentration of a chemical in the groundwater at a downgradient well must be compared to a groundwater protection standard (GWPS). If the concentration is “above” the GWPS, then the site enters corrective action monitoring. As another example, soil screening at a Superfund site involves comparing the concentration of a chemical in the soil with a pre-determined soil screening level (SSL). If the concentration is “above” the SSL, then further investigation and possible remedial action is required. Determining what it means for the chemical concentration to be “above” a GWPS or an SSL is a policy decision: it may mean that the mean of the concentration distribution is above the standard, that the median is above it, that the 95th percentile is above it, or something else. Often, the first interpretation is used.
Hypothesis tests you can use to perform tests of location include: Student's t-test, Fisher's randomization test, the Wilcoxon signed rank test, Chen's modified t-test, the sign test, and a test based on a bootstrap confidence interval. For a discussion comparing the performance of these tests, see Millard and Neerchal (2001, pp.408-409).
Steven P. Millard ([email protected])
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, pp.224–227.
Manly, B.F.J. (2007). Randomization, Bootstrap and Monte Carlo Methods in Biology. Third Edition. Chapman & Hall, New York, pp.112-113.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.404–406.
permutationTest.object, Hypothesis Tests, boot.
# Generate 10 observations from a logistic distribution with parameters # location=7 and scale=2, and test the null hypothesis that the true mean # is equal to 5 against the alternative that the true mean is greater than 5. # Use the exact permutation distribution. # (Note: the call to set.seed() allows you to reproduce this example). set.seed(23) dat <- rlogis(10, location = 7, scale = 2) test.list <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) # Print the results of the test #------------------------------ test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 # Plot the results of the test #----------------------------- dev.new() plot(test.list) #========== # The guidance document "Supplemental Guidance to RAGS: Calculating the # Concentration Term" (USEPA, 1992d) contains an example of 15 observations # of chromium concentrations (mg/kg) which are assumed to come from a # lognormal distribution. These data are stored in the vector # EPA.92d.chromium.vec. Here, we will use the permutation test to test # the null hypothesis that the mean (median) of the log-transformed chromium # concentrations is less than or equal to log(100 mg/kg) vs. the alternative # that it is greater than log(100 mg/kg). Note that we *cannot* use the # permutation test to test a hypothesis about the mean on the original scale # because the data are not assumed to be symmetric about some mean, they are # assumed to come from a lognormal distribution. # # We will sample from the permutation distribution. # (Note: setting the argument seed=542 allows you to reproduce this example). test.list <- oneSamplePermutationTest(log(EPA.92d.chromium.vec), mu = log(100), alternative = "greater", seed = 542) test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 4.60517 # #Alternative Hypothesis: True Mean (Median) is greater than 4.60517 # #Test Name: One-Sample Permutation Test # (Based on Sampling # Permutation Distribution # 5000 Times) # #Estimated Parameter(s): Mean = 4.378636 # #Data: log(EPA.92d.chromium.vec) # #Sample Size: 15 # #Test Statistic: Sum(x - 4.60517) = -3.398017 # #P-value: 0.7598 # Plot the results of the test #----------------------------- dev.new() plot(test.list) #---------- # Clean up #--------- rm(test.list) graphics.off()
Ozone concentrations in 41 U.S. cities based on daily maxima collected between June and August 1974.
Ozone.NE.df
A data frame with 41 observations on the following 5 variables.
Median
median of the daily maximum ozone concentrations (ppb).
Quartile
upper quartile (i.e., 75th percentile) of the daily maximum ozone concentrations (ppb).
City
a factor indicating the city
Longitude
negative longitude of the city
Latitude
latitude of the city
Cleveland, W.S., Kleiner, B., McRae, J.E., Warner, J.L., and Pasceri, P.E. (1975). The Analysis of Ground-Level Ozone Data from New Jersey, New York, Connecticut, and Massachusetts: Data Quality Assessment and Temporal and Geographical Properties. Bell Laboratories Memorandum.
The original data were collected by the New Jersey Department of Environmental Protection, the New York State Department of Environmental Protection, the Boyce Thompson Institute (Yonkers, for New York data), the Connecticut Department of Environmental Protection, and the Massachusetts Department of Public Health.
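As a brief illustration (not part of the original help file), the sketch below plots the median daily maximum ozone concentration against latitude for the 41 cities, using only the documented columns of Ozone.NE.df.

# Median daily maximum ozone (ppb) versus latitude for the 41 cities
with(Ozone.NE.df,
  plot(Latitude, Median, pch = 16,
    xlab = "Latitude (degrees N)",
    ylab = "Median Daily Maximum Ozone (ppb)",
    main = "Ozone.NE.df: Median Ozone vs. Latitude"))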
summary(Ozone.NE.df) # Median Quartile City Longitude # Min. : 34.00 Min. : 48.00 Asbury Park: 1 Min. :-74.71 # 1st Qu.: 58.00 1st Qu.: 79.75 Babylon : 1 1st Qu.:-73.74 # Median : 65.00 Median : 90.00 Bayonne : 1 Median :-73.17 # Mean : 68.15 Mean : 95.10 Boston : 1 Mean :-72.94 # 3rd Qu.: 80.00 3rd Qu.:112.25 Bridgeport : 1 3rd Qu.:-72.08 # Max. :100.00 Max. :145.00 Cambridge : 1 Max. :-71.05 # NA's : 1.00 (Other) :35 # Latitude # Min. :40.22 # 1st Qu.:40.97 # Median :41.56 # Mean :41.60 # 3rd Qu.:42.25 # Max. :43.32
Density, distribution function, quantile function, and random generation
for the Pareto distribution with parameters location
and shape
.
dpareto(x, location, shape = 1) ppareto(q, location, shape = 1) qpareto(p, location, shape = 1) rpareto(n, location, shape = 1)
x |
vector of quantiles. |
q |
vector of quantiles. |
p |
vector of probabilities between 0 and 1. |
n |
sample size. If |
location |
vector of (positive) location parameters. |
shape |
vector of (positive) shape parameters. The default is |
Let $X$ be a Pareto random variable with parameters location=$\eta$ and shape=$\theta$.
The density function of $X$ is given by:

$$f(x; \eta, \theta) = \frac{\theta \eta^{\theta}}{x^{\theta + 1}}, \quad x \ge \eta, \; \eta > 0, \; \theta > 0$$

The cumulative distribution function of $X$ is given by:

$$F(x; \eta, \theta) = 1 - \left( \frac{\eta}{x} \right)^{\theta}$$

and the $p$'th quantile of $X$ is given by:

$$x_p = \eta (1 - p)^{-1/\theta}, \quad 0 \le p \le 1$$

The mode, mean, median, variance, and coefficient of variation of $X$ are given by:

$$Mode(X) = \eta$$

$$E(X) = \frac{\theta \eta}{\theta - 1}, \quad \theta > 1$$

$$Median(X) = \eta \, 2^{1/\theta}$$

$$Var(X) = \frac{\theta \eta^2}{(\theta - 1)^2 (\theta - 2)}, \quad \theta > 2$$

$$CV(X) = \left[ \theta (\theta - 2) \right]^{-1/2}, \quad \theta > 2$$
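As a quick numerical check of these formulas (a sketch not in the original help file), the code below compares the closed-form density, cdf, and quantile expressions with the values returned by dpareto, ppareto, and qpareto for location = 2 and shape = 3 (values chosen only for illustration).

eta <- 2; theta <- 3; x <- 4; p <- 0.25

# Density: theta * eta^theta / x^(theta + 1)
c(formula = theta * eta^theta / x^(theta + 1),
  dpareto = dpareto(x, location = eta, shape = theta))

# CDF: 1 - (eta / x)^theta
c(formula = 1 - (eta / x)^theta,
  ppareto = ppareto(x, location = eta, shape = theta))

# Quantile: eta * (1 - p)^(-1 / theta)
c(formula = eta * (1 - p)^(-1 / theta),
  qpareto = qpareto(p, location = eta, shape = theta))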
dpareto
gives the density, ppareto
gives the distribution function,
qpareto
gives the quantile function, and rpareto
generates random
deviates.
The Pareto distribution is named after Vilfredo Pareto (1848-1923), a professor
of economics. It is derived from Pareto's law, which states that the number of
persons having income $\ge x$ is given by:

$$N = A x^{-\theta}$$

where $A$ denotes Pareto's constant and $\theta$ is the shape parameter for the
probability distribution.

The Pareto distribution takes values on the positive real line. All values must be
larger than the “location” parameter $\eta$, which is really a threshold
parameter. There are three kinds of Pareto distributions. The one described here
is the Pareto distribution of the first kind. Stable Pareto distributions have
$0 < \theta < 2$. Note that the $r$'th moment of $X$ only exists if $r < \theta$.
The Pareto distribution is related to the
exponential distribution and
logistic distribution as follows.
Let $X$ denote a Pareto random variable with location=$\eta$ and shape=$\theta$.
Then $\log(X/\eta)$ has an exponential distribution with parameter rate=$\theta$,
and $-\log\{ (X/\eta)^{\theta} - 1 \}$ has a logistic distribution with parameters
location=$0$ and scale=$1$.
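The exponential relationship can be checked empirically; a minimal sketch (assuming location = 3 and shape = 2, chosen only for illustration):

# If X is Pareto(location = eta, shape = theta), then log(X / eta)
# should be exponential with rate = theta.
set.seed(47)
x <- rpareto(10000, location = 3, shape = 2)
y <- log(x / 3)
mean(y)   # should be close to 1 / rate = 0.5
qqplot(qexp(ppoints(length(y)), rate = 2), y,
  xlab = "Exponential(rate = 2) Quantiles",
  ylab = "Quantiles of log(X / 3)")
abline(0, 1)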
The Pareto distribution has a very long right-hand tail. It is often applied in the study of socioeconomic data, including the distribution of income, firm size, population, and stock price fluctuations.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
epareto
, eqpareto
, Exponential,
Probability Distributions and Random Numbers.
# Density of a Pareto distribution with parameters location=1 and shape=1, # evaluated at 2, 3 and 4: dpareto(2:4, 1, 1) #[1] 0.2500000 0.1111111 0.0625000 #---------- # The cdf of a Pareto distribution with parameters location=2 and shape=1, # evaluated at 3, 4, and 5: ppareto(3:5, 2, 1) #[1] 0.3333333 0.5000000 0.6000000 #---------- # The 25'th percentile of a Pareto distribution with parameters # location=1 and shape=1: qpareto(0.25, 1, 1) #[1] 1.333333 #---------- # A random sample of 4 numbers from a Pareto distribution with parameters # location=3 and shape=2. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(10) rpareto(4, 3, 2) #[1] 4.274728 3.603148 3.962862 5.415322
Produce a probability density function (pdf) plot for a user-specified distribution.
pdfPlot(distribution = "norm", param.list = list(mean = 0, sd = 1), left.tail.cutoff = ifelse(is.finite(supp.min), 0, 0.001), right.tail.cutoff = ifelse(is.finite(supp.max), 0, 0.001), plot.it = TRUE, add = FALSE, n.points = 1000, pdf.col = "black", pdf.lwd = 3 * par("cex"), pdf.lty = 1, curve.fill = !add, curve.fill.col = "cyan", x.ticks.at.all.x.max = 15, hist.col = ifelse(add, "black", "cyan"), density = 5, digits = .Options$digits, ..., type = "l", main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
pdfPlot(distribution = "norm", param.list = list(mean = 0, sd = 1), left.tail.cutoff = ifelse(is.finite(supp.min), 0, 0.001), right.tail.cutoff = ifelse(is.finite(supp.max), 0, 0.001), plot.it = TRUE, add = FALSE, n.points = 1000, pdf.col = "black", pdf.lwd = 3 * par("cex"), pdf.lty = 1, curve.fill = !add, curve.fill.col = "cyan", x.ticks.at.all.x.max = 15, hist.col = ifelse(add, "black", "cyan"), density = 5, digits = .Options$digits, ..., type = "l", main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
distribution |
a character string denoting the distribution abbreviation. The default value is
|
param.list |
a list with values for the parameters of the distribution. The default value is
|
left.tail.cutoff |
a numeric scalar indicating what proportion of the left-tail of the probability
distribution to omit from the plot. For densities with a finite support minimum
(e.g., Lognormal) the default value is |
right.tail.cutoff |
a scalar indicating what proportion of the right-tail of the probability
distribution to omit from the plot. For densities with a finite support maximum
(e.g., Binomial) the default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the probability density curve to the
existing plot ( |
n.points |
a numeric scalar specifying at how many evenly-spaced points the probability
density function will be evaluated. The default value is |
pdf.col |
for continuous distributions, a numeric scalar or character string determining
the color of the pdf line in the plot.
The default value is |
pdf.lwd |
for continuous distributions, a numeric scalar determining the width of the pdf
line in the plot.
The default value is |
pdf.lty |
for continuous distributions, a numeric scalar determining the line type of
the pdf line in the plot.
The default value is |
curve.fill |
for continuous distributions, a logical value indicating whether to fill in
the area below the probability density curve with the color specified by
|
curve.fill.col |
for continuous distributions, when |
x.ticks.at.all.x.max |
a numeric scalar indicating the maximum number of ticks marks on the |
hist.col |
for discrete distributions, a numeric scalar or character string indicating
what color to use to fill in the histogram if |
density |
for discrete distributions, a scalar indicting the density of line shading for
the histogram when |
digits |
a scalar indicating how many significant digits to print for the distribution
parameters. The default value is |
type , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters. See |
The probability density function (pdf) of a random variable $X$,
usually denoted $f(x)$, is defined as:

$$f(x) = \frac{dF(x)}{dx} \quad\quad (1)$$

where $F(x)$ is the cumulative distribution function (cdf) of $X$.
That is, $f(x)$ is the derivative of the cdf $F(x)$ with respect to $x$
(where this derivative exists).

For discrete distributions, the probability density function is simply:

$$f(x) = Pr(X = x) \quad\quad (2)$$

In this case, $f(x)$ is sometimes called the probability function or
probability mass function.

The probability that the random variable $X$ takes on a value in the interval
$[a, b]$ is simply the (Lebesgue) integral of the pdf evaluated between
$a$ and $b$. That is,

$$Pr(a \le X \le b) = \int_a^b f(x) \, dx \quad\quad (3)$$

For discrete distributions, Equation (3) translates to summing up the probabilities of all values in this interval:

$$Pr(a \le X \le b) = \sum_{a \le x \le b} Pr(X = x) \quad\quad (4)$$
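As a brief numerical illustration of Equation (3) (not part of the original help file), the following sketch uses base R's dnorm, pnorm, and integrate to show that integrating the standard normal pdf over an interval reproduces the difference of the cdf values:

a <- -1; b <- 2

# Left-hand side of Equation (3): integrate the pdf from a to b
integrate(dnorm, lower = a, upper = b)$value
#[1] 0.8185946

# Equivalent expression in terms of the cdf: F(b) - F(a)
pnorm(b) - pnorm(a)
#[1] 0.8185946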
A probability density function (pdf) plot plots the values of the pdf against quantiles of the specified distribution. Theoretical pdf plots are sometimes plotted along with empirical pdf plots (density plots), histograms or bar graphs to visually assess whether data have a particular distribution.
pdfPlot
invisibly returns a list giving coordinates of the points
that have been or would have been plotted:
Quantiles |
The quantiles used for the plot. |
Probability.Densities |
The values of the pdf associated with the quantiles. |
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Distribution.df
, epdfPlot
, cdfPlot
.
# Plot the pdf of the standard normal distribution #------------------------------------------------- dev.new() pdfPlot() #========== # Plot the pdf of the standard normal distribution # and a N(2, 2) distribution on the sample plot. #------------------------------------------------- dev.new() pdfPlot(param.list = list(mean=2, sd=2), curve.fill = FALSE, ylim = c(0, dnorm(0)), main = "") pdfPlot(add = TRUE, pdf.col = "red") legend("topright", legend = c("N(2,2)", "N(0,1)"), col = c("black", "red"), lwd = 3 * par("cex")) title("PDF Plots for Two Normal Distributions") #========== # Clean up #--------- graphics.off()
This class of objects is returned by functions that perform permutation tests.
Objects of class "permutationTest"
are lists that contain information about
the null and alternative hypotheses, the estimated distribution parameters, the
test statistic and the p-value. They also contain the permutation distribution
of the statistic (or a sample of the permutation distribution).
Objects of S3 class "permutationTest"
are returned by any of the
EnvStats functions that perform permutation tests. Currently, these are:
oneSamplePermutationTest
, twoSamplePermutationTestLocation
, and
twoSamplePermutationTestProportion
.
A legitimate list of class "permutationTest"
includes the components
listed in the help file for htest.object
. In addition, the following
components must be included in a legitimate list of class "permutationTest"
:
Required Components
The following components must be included in a legitimate list of
class "permutationTest"
.
stat.dist |
numeric vector containing values of the statistic for the permutation distribution.
When |
exact |
logical scalar indicating whether the exact permutation distribution was used for
the test ( |
Optional Components
The following components may optionally be included in an object of
class "permutationTest"
:
seed |
integer or vector of integers indicating the seed that was used for sampling the
permutation distribution. This component is present only if |
prob.stat.dist |
numeric vector containing the probabilities associated with each element of
the component |
Generic functions that have methods for objects of class
"permutationTest"
include: print
, plot
.
Since objects of class "permutationTest"
are lists, you may extract
their components with the $
and [[
operators.
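To make the list structure concrete, here is a minimal sketch (reusing the call from the examples below and the documented components "statistic" and "stat.dist") showing how the one-sided p-value can be recovered from the stored permutation distribution; the recomputation is an illustration and should agree with the stored p.value component.

set.seed(23)
dat <- rlogis(10, location = 7, scale = 2)
perm.obj <- oneSamplePermutationTest(dat, mu = 5,
  alternative = "greater", exact = TRUE)

# Observed statistic and full (exact) permutation distribution
obs.stat <- perm.obj$statistic
perm.dist <- perm.obj$stat.dist

# For alternative = "greater", the p-value is the proportion of the
# permutation distribution at least as large as the observed statistic.
mean(perm.dist >= obs.stat)
perm.obj$p.value

rm(dat, perm.obj, obs.stat, perm.dist)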
Steven P. Millard ([email protected])
print.permutationTest
, plot.permutationTest
,
oneSamplePermutationTest
, twoSamplePermutationTestLocation
,
twoSamplePermutationTestProportion
, Hypothesis Tests.
# Create an object of class "permutationTest", then print it and plot it. #------------------------------------------------------------------------ set.seed(23) dat <- rlogis(10, location = 7, scale = 2) permutationTest.obj <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) mode(permutationTest.obj) #[1] "list" class(permutationTest.obj) #[1] "permutationTest" names(permutationTest.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "stat.dist" #[13] "exact" #========== # Print the results of the test #------------------------------ permutationTest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 #========== # Plot the results of the test #----------------------------- dev.new() plot(permutationTest.obj) #========== # Extract the test statistic #--------------------------- permutationTest.obj$statistic #Sum(x - 5) # 49.77294 #========== # Clean up #--------- rm(permutationTest.obj) graphics.off()
# Create an object of class "permutationTest", then print it and plot it. #------------------------------------------------------------------------ set.seed(23) dat <- rlogis(10, location = 7, scale = 2) permutationTest.obj <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) mode(permutationTest.obj) #[1] "list" class(permutationTest.obj) #[1] "permutationTest" names(permutationTest.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "stat.dist" #[13] "exact" #========== # Print the results of the test #------------------------------ permutationTest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 #========== # Plot the results of the test #----------------------------- dev.new() plot(permutationTest.obj) #========== # Extract the test statistic #--------------------------- permutationTest.obj$statistic #Sum(x - 5) # 49.77294 #========== # Clean up #--------- rm(permutationTest.obj) graphics.off()
Plot the results of calling the function boxcox
, which returns an
object of class "boxcox"
. Three different kinds of plots are available.
The function plot.boxcox
is automatically called by plot
when given an object of class "boxcox"
. The names of other functions
associated with Box-Cox transformations are listed under Data Transformations.
## S3 method for class 'boxcox' plot(x, plot.type = "Objective vs. lambda", same.window = TRUE, ask = same.window & plot.type != "Ojective vs. lambda", plot.pos.con = 0.375, estimate.params = FALSE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = TRUE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, cex.main = 1.4 * par("cex"), cex.sub = par("cex"), main = NULL, sub = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
points.col |
numeric scalar determining the color of the points in the plot. The default
value is |
The following arguments can be supplied when plot.type="Q-Q Plots"
, plot.type="Tukey M-D Q-Q Plots"
, or plot.type="All"
(supplied to qqPlot
):
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the Q-Q plots and/or Tukey Mean-Difference Q-Q plots.
The default value is |
estimate.params |
logical scalar indicating whether to compute quantiles based on estimating the
distribution parameters ( |
equal.axes |
logical scalar indicating whether to use the same range on the |
add.line |
logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the plot when |
duplicate.points.method |
a character string denoting how to plot points with duplicate |
line.col |
numeric scalar determining the color of the line in the plot. The default value
is |
line.lwd |
numeric scalar determining the width of the line in the plot. The default value
is |
line.lty |
numeric scalar determining the line type (style) of the line in the plot.
The default value is |
digits |
scalar indicating how many significant digits to print for the distribution
parameters and the value of the objective in the sub-title. The default
value is the current setting of |
Graphics parameters:
cex.main , cex.sub , main , sub , xlab , ylab , xlim , ylim , ...
|
graphics parameters; see |
The function plot.boxcox
is a method for the generic function
plot
for the class "boxcox"
(see boxcox.object
).
It can be invoked by calling plot
and giving it an object of
class "boxcox"
as the first argument, or by calling plot.boxcox
directly, regardless of the class of the object given as the first argument
to plot.boxcox
.
Plots associated with Box-Cox transformations are produced on the current graphics device. These can be one or all of the following:
Objective vs. $\lambda$.
Observed Quantiles vs. Normal Quantiles (Q-Q Plot) for the transformed
observations for each of the values of $\lambda$.
Tukey Mean-Difference Q-Q Plots for the transformed observations for each
of the values of $\lambda$.
See the help files for boxcox
and qqPlot
for more
information.
plot.boxcox
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
qqPlot
, boxcox
, boxcox.object
,
print.boxcox
, Data Transformations, plot
.
# Generate 30 observations from a lognormal distribution with # mean=10 and cv=2, call the function boxcox, and then plot # the results. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x <- rlnormAlt(30, mean = 10, cv = 2) # Plot the results based on the PPCC objective #--------------------------------------------- boxcox.list <- boxcox(x) dev.new() plot(boxcox.list) # Look at Q-Q Plots for the candidate values of lambda #----------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Look at Tukey Mean-Difference Q-Q Plots # for the candidate values of lambda #---------------------------------------- plot(boxcox.list, plot.type = "Tukey M-D Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(x, boxcox.list) graphics.off()
Plot the results of calling the function boxcoxCensored
,
which returns an object of class "boxcoxCensored"
. Three different kinds of plots are available.
The function plot.boxcoxCensored
is automatically called by plot
when given an object of class "boxcoxCensored"
.
## S3 method for class 'boxcoxCensored' plot(x, plot.type = "Objective vs. lambda", same.window = TRUE, ask = same.window & plot.type != "Ojective vs. lambda", prob.method = "michael-schucany", plot.pos.con = 0.375, estimate.params = FALSE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = TRUE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, cex.main = 1.4 * par("cex"), cex.sub = par("cex"), main = NULL, sub = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
points.col |
numeric scalar determining the color of the points in the plot. The default
value is |
The following arguments can be supplied when plot.type="Q-Q Plots"
, plot.type="Tukey M-D Q-Q Plots"
, or plot.type="All"
(supplied to qqPlot
):
prob.method |
character string indicating what method to use to compute the plotting positions
for Q-Q plots or Tukey Mean-Difference Q-Q plots.
Possible values are
The This argument is ignored if |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the Q-Q plots and/or Tukey Mean-Difference Q-Q plots.
The default value is |
estimate.params |
logical scalar indicating whether to compute quantiles based on estimating the
distribution parameters ( |
equal.axes |
logical scalar indicating whether to use the same range on the |
add.line |
logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the plot when |
duplicate.points.method |
a character string denoting how to plot points with duplicate |
line.col |
numeric scalar determining the color of the line in the plot. The default value
is |
line.lwd |
numeric scalar determining the width of the line in the plot. The default value
is |
line.lty |
numeric scalar determining the line type (style) of the line in the plot.
The default value is |
digits |
scalar indicating how many significant digits to print for the distribution
parameters and the value of the objective in the sub-title. The default
value is the current setting of |
Graphics parameters:
cex.main , cex.sub , main , sub , xlab , ylab , xlim , ylim , ...
|
graphics parameters; see |
The function plot.boxcoxCensored
is a method for the generic function
plot
for the class "boxcoxCensored"
(see boxcoxCensored.object
).
It can be invoked by calling plot
and giving it an object of
class "boxcoxCensored"
as the first argument, or by calling
plot.boxcoxCensored
directly, regardless of the class of the object given
as the first argument to plot.boxcoxCensored
.
Plots associated with Box-Cox transformations are produced on the current graphics device. These can be one or all of the following:
Objective vs. $\lambda$.
Observed Quantiles vs. Normal Quantiles (Q-Q Plot) for the transformed
observations for each of the values of $\lambda$.
Tukey Mean-Difference Q-Q Plots for the transformed observations for each
of the values of $\lambda$.
See the help files for boxcoxCensored
and qqPlotCensored
for more information.
plot.boxcoxCensored
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
qqPlotCensored
, boxcoxCensored
,
boxcoxCensored.object
, print.boxcoxCensored
,
Data Transformations, plot
.
# Generate 15 observations from a lognormal distribution with # mean=10 and cv=2 and censor the observations less than 2. # Then generate 15 more observations from this distribution and # censor the observations less than 4. # Then call the function boxcoxCensored, and then plot the results. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) x.1 <- rlnormAlt(15, mean = 10, cv = 2) censored.1 <- x.1 < 2 x.1[censored.1] <- 2 x.2 <- rlnormAlt(15, mean = 10, cv = 2) censored.2 <- x.2 < 4 x.2[censored.2] <- 4 x <- c(x.1, x.2) censored <- c(censored.1, censored.2) # Plot the results based on the PPCC objective #--------------------------------------------- boxcox.list <- boxcoxCensored(x, censored) dev.new() plot(boxcox.list) # Look at Q-Q Plots for the candidate values of lambda #----------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Look at Tukey Mean-Difference Q-Q Plots # for the candidate values of lambda #---------------------------------------- plot(boxcox.list, plot.type = "Tukey M-D Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list) graphics.off()
Plot the results of calling the function boxcox
when the argument
x
supplied to boxcox
is an object of class "lm"
.
Three different kinds of plots are available.
The function plot.boxcoxLm
is automatically called by plot
when given an object of class "boxcoxLm"
. The names of other functions
associated with Box-Cox transformations are listed under Data Transformations.
## S3 method for class 'boxcoxLm' plot(x, plot.type = "Objective vs. lambda", same.window = TRUE, ask = same.window & plot.type != "Ojective vs. lambda", plot.pos.con = 0.375, estimate.params = FALSE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = TRUE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, cex.main = 1.4 * par("cex"), cex.sub = par("cex"), main = NULL, sub = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
points.col |
numeric scalar determining the color of the points in the plot. The default
value is |
The following arguments can be supplied when plot.type="Q-Q Plots"
, plot.type="Tukey M-D Q-Q Plots"
, or plot.type="All"
(supplied to qqPlot
):
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the Q-Q plots and/or Tukey Mean-Difference Q-Q plots.
The default value is |
estimate.params |
logical scalar indicating whether to compute quantiles based on estimating the
distribution parameters ( |
equal.axes |
logical scalar indicating whether to use the same range on the |
add.line |
logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the plot when |
duplicate.points.method |
a character string denoting how to plot points with duplicate |
line.col |
numeric scalar determining the color of the line in the plot. The default value
is |
line.lwd |
numeric scalar determining the width of the line in the plot. The default value
is |
line.lty |
numeric scalar determining the line type (style) of the line in the plot.
The default value is |
digits |
scalar indicating how many significant digits to print for the distribution
parameters and the value of the objective in the sub-title. The default
value is the current setting of |
Graphics parameters:
cex.main , cex.sub , main , sub , xlab , ylab , xlim , ylim , ...
|
graphics parameters; see |
The function plot.boxcoxLm
is a method for the generic function
plot
for the class "boxcoxLm"
(see boxcoxLm.object
).
It can be invoked by calling plot
and giving it an object of
class "boxcoxLm"
as the first argument, or by calling plot.boxcoxLm
directly, regardless of the class of the object given as the first argument
to plot.boxcoxLm
.
Plots associated with Box-Cox transformations are produced on the current graphics device. These can be one or all of the following:
Objective vs. $\lambda$.
Observed Quantiles vs. Normal Quantiles (Q-Q Plot) for the residuals of
the linear model based on transformed values of the response variable
for each of the values of $\lambda$.
Tukey Mean-Difference Q-Q Plots for the residuals of
the linear model based on transformed values of the response variable
for each of the values of $\lambda$.
See the help files for boxcox
and qqPlot
for more
information.
plot.boxcoxLm
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
qqPlot
, boxcox
, boxcoxLm.object
,
print.boxcoxLm
, Data Transformations, plot
.
# Create an object of class "boxcoxLm", then plot the results. # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll model ozone as a # function of temperature. # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) boxcox.list <- boxcox(ozone.fit) # Plot PPCC vs. lambda based on Q-Q plots of residuals #----------------------------------------------------- dev.new() plot(boxcox.list) # Look at Q-Q plots of residuals for the various transformation #-------------------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Look at Tukey Mean-Difference Q-Q plots of residuals # for the various transformation #----------------------------------------------------- plot(boxcox.list, plot.type = "Tukey M-D Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(ozone.fit, boxcox.list) graphics.off()
# Create an object of class "boxcoxLm", then plot the results. # The data frame Environmental.df contains daily measurements of # ozone concentration, wind speed, temperature, and solar radiation # in New York City for 153 consecutive days between May 1 and # September 30, 1973. In this example, we'll model ozone as a # function of temperature. # Fit the model with the raw Ozone data #-------------------------------------- ozone.fit <- lm(ozone ~ temperature, data = Environmental.df) boxcox.list <- boxcox(ozone.fit) # Plot PPCC vs. lambda based on Q-Q plots of residuals #----------------------------------------------------- dev.new() plot(boxcox.list) # Look at Q-Q plots of residuals for the various transformation #-------------------------------------------------------------- plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE) # Look at Tukey Mean-Difference Q-Q plots of residuals # for the various transformation #----------------------------------------------------- plot(boxcox.list, plot.type = "Tukey M-D Q-Q Plots", same.window = FALSE) #========== # Clean up #--------- rm(ozone.fit, boxcox.list) graphics.off()
Plot the results of calling the function gofTest
, which returns an
object of class "gof"
when testing the goodness-of-fit of a set of data
to a distribution (i.e., when supplied with the y
argument but not
the x
argument). Five different kinds of plots are available.
The function plot.gof
is automatically called by plot
when given an object of class "gof"
. The names of other functions
associated with goodness-of-fit test are listed under Goodness-of-Fit Tests.
## S3 method for class 'gof' plot(x, plot.type = "Summary", captions = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL, Results = NULL), x.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), y.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), same.window = FALSE, ask = same.window & plot.type == "All", hist.col = "cyan", fitted.pdf.col = "black", fitted.pdf.lwd = 3 * par("cex"), fitted.pdf.lty = 1, plot.pos.con = switch(dist.abb, norm = , lnorm = , lnormAlt = , lnorm3 = 0.375, evd = 0.44, 0.4), ecdf.col = "cyan", fitted.cdf.col = "black", ecdf.lwd = 3 * par("cex"), fitted.cdf.lwd = 3 * par("cex"), ecdf.lty = 1, fitted.cdf.lty = 2, add.line = TRUE, digits = ifelse(plot.type == "Summary", 2, .Options$digits), test.result.font = 1, test.result.cex = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), test.result.mar = c(0, 0, 3, 0) + 0.1, cex.main = ifelse(plot.type == "Summary", 1.2, 1.5) * par("cex"), cex.axis = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), cex.lab = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add.om.title = TRUE, oma = if (plot.type == "Summary" & add.om.title) c(0, 0, 2.5, 0) else c(0, 0, 0, 0), om.title = NULL, om.font = 2, om.cex.main = 1.75 * par("cex"), om.line = 0.5, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
captions |
a list with 1 to 5 components with the names |
x.labels |
a list of 1 to 4 components with the names |
y.labels |
a list of 1 to 4 components with the names |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
digits |
scalar indicating how many significant digits to print for the distribution
parameters. If |
Arguments associated with plot.type="PDFs: Observed and Fitted"
:
hist.col |
a character string or numeric scalar determining the color of the histogram
used to display the distribution of the observed values. The default value is
|
fitted.pdf.col |
a character string or numeric scalar determining the color of the fitted PDF
(which is displayed as a line for continuous distributions and a histogram for
discrete distributions). The default value is |
fitted.pdf.lwd |
numeric scalar determining the width of the line used to display the fitted PDF.
The default value is |
fitted.pdf.lty |
numeric scalar determining the line type used to display the fitted PDF.
The default value is |
Arguments associated with plot.type="CDFs: Observed and Fitted"
:
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the observed (empirical) CDF. The default value of
NOTE: This argument is also used to determine the value of the
plotting position constant for the Q-Q plot ( |
ecdf.col |
a character string or numeric scalar determining the color of the line
used to display the empirical CDF. The default value is
|
fitted.cdf.col |
a character string or numeric scalar determining the color of the line used
to display the fitted CDF. The default value is |
ecdf.lwd |
numeric scalar determining the width of the line used to display the empirical CDF.
The default value is |
fitted.cdf.lwd |
numeric scalar determining the width of the line used to display the fitted CDF.
The default value is |
ecdf.lty |
numeric scalar determining the line type used to display the empirical CDF.
The default value is |
fitted.cdf.lty |
numeric scalar determining the line type used to display the fitted CDF.
The default value is |
Arguments associated with plot.type="Q-Q Plot"
or plot.type="Tukey M-D Q-Q Plot"
:
As explained above, plot.pos.con
is used for these plot types. Also:
add.line |
logical scalar indicating whether to add a line to the plot. If |
Arguments associated with plot.type="Test Results"
test.result.font |
numeric scalar indicating which font to use to print out the test results.
The default value is |
test.result.cex |
numeric scalar indicating the value of |
test.result.mar |
numeric vector indicating the value of |
Arguments associated with plot.type="Summary"
add.om.title |
logical scalar indicating whether to add a title in the outer margin when |
om.title |
character string containing the outer margin title. The default value is |
om.font |
numeric scalar indicating the font to use for the outer margin. The default
value is |
om.cex.main |
numeric scalar indicating the value of |
om.line |
numeric scalar indicating the line to place the outer margin title on. The
default value is |
Graphics parameters:
cex.main , cex.axis , cex.lab , main , xlab , ylab , xlim , ylim , oma , ...
|
additional graphics parameters. See the help file for |
The function plot.gof
is a method for the generic function
plot
for objects that inherit from class "gof"
(see gof.object
).
It can be invoked by calling plot
and giving it an object of
class "gof"
as the first argument, or by calling plot.gof
directly, regardless of the class of the object given as the first argument
to plot.gof
.
Plots associated with the goodness-of-fit test are produced on the current graphics device. These can be one or all of the following:
Observed distribution overlaid with fitted distribution
(plot.type="PDFs: Observed and Fitted"
). See the help files for
hist
and pdfPlot
.
Observed empirical distribution overlaid with fitted cumulative distribution
(plot.type="CDFs: Observed and Fitted"
). See the help file for
cdfCompare
.
Observed quantiles vs. fitted quantiles (Q-Q Plot)
(plot.type="Q-Q Plot"
). See the help file for qqPlot
.
Tukey mean-difference Q-Q plot (plot.type="Tukey M-D Q-Q Plot"
).
See the help file for qqPlot
.
Results of the goodness-of-fit test (plot.type="Test Results"
).
See the help file for print.gof
.
See the help file for gofTest
for more information.
plot.gof
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
gofTest
, gof.object
, print.gof
,
Goodness-of-Fit Tests, plot
.
# Create an object of class "gof" then plot the results. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) gof.obj <- gofTest(dat) # Summary plot (the default) #--------------------------- dev.new() plot(gof.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gof.obj, captions = list(PDFs = "Compare PDFs", CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary") # Just the Q-Q Plot #------------------ dev.new() plot(gof.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gof.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(dat, gof.obj) graphics.off()
# Create an object of class "gof" then plot the results. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- rnorm(20, mean = 3, sd = 2) gof.obj <- gofTest(dat) # Summary plot (the default) #--------------------------- dev.new() plot(gof.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gof.obj, captions = list(PDFs = "Compare PDFs", CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary") # Just the Q-Q Plot #------------------ dev.new() plot(gof.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gof.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(dat, gof.obj) graphics.off()
Plot the results of calling the function gofTestCensored
, which returns
an object of class "gofCensored"
when testing the goodness-of-fit of a set of
data to a distribution. Five different kinds of plots are available.
The function plot.gofCensored
is automatically called by plot
when given an object of class "gofCensored"
.
## S3 method for class 'gofCensored' plot(x, plot.type = "Summary", captions = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL, Results = NULL), x.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), y.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), same.window = FALSE, ask = same.window & plot.type == "All", hist.col = "cyan", fitted.pdf.col = "black", fitted.pdf.lwd = 3 * par("cex"), fitted.pdf.lty = 1, prob.method = "michael-schucany", plot.pos.con = 0.375, ecdf.col = "cyan", fitted.cdf.col = "black", ecdf.lwd = 3 * par("cex"), fitted.cdf.lwd = 3 * par("cex"), ecdf.lty = 1, fitted.cdf.lty = 2, add.line = TRUE, digits = ifelse(plot.type == "Summary", 2, .Options$digits), test.result.font = 1, test.result.cex = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), test.result.mar = c(0, 0, 3, 0) + 0.1, cex.main = ifelse(plot.type == "Summary", 1.2, 1.5) * par("cex"), cex.axis = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), cex.lab = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add.om.title = TRUE, oma = if (plot.type == "Summary" & add.om.title) c(0, 0, 4, 0) else c(0, 0, 0, 0), om.title = NULL, om.font = 2, om.cex.main = 1.5 * par("cex"), om.line = 0, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
captions |
a list with 1 to 5 components with the names |
x.labels |
a list of 1 to 4 components with the names |
y.labels |
a list of 1 to 4 components with the names |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
digits |
scalar indicating how many significant digits to print for the distribution
parameters. If |
Arguments associated with plot.type="PDFs: Observed and Fitted"
:
hist.col |
a character string or numeric scalar determining the color of the histogram
used to display the distribution of the observed values. The default value is
|
fitted.pdf.col |
a character string or numeric scalar determining the color of the fitted PDF
(which is displayed as a line for continuous distributions and a histogram for
discrete distributions). The default value is |
fitted.pdf.lwd |
numeric scalar determining the width of the line used to display the fitted PDF.
The default value is |
fitted.pdf.lty |
numeric scalar determining the line type used to display the fitted PDF.
The default value is |
Arguments associated with plot.type="CDFs: Observed and Fitted"
:
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities). Possible values are: The default value is The NOTE: This argument is also used to determine the plotting position method
for the Q-Q plot ( |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the observed (empirical) CDF. The default value is
This argument is used only if NOTE: This argument is also used to determine the value of the
plotting position constant for the Q-Q plot ( |
ecdf.col |
a character string or numeric scalar determining the color of the line
used to display the empirical CDF. The default value is
|
fitted.cdf.col |
a character string or numeric scalar determining the color of the line used
to display the fitted CDF. The default value is |
ecdf.lwd |
numeric scalar determining the width of the line used to display the empirical CDF.
The default value is |
fitted.cdf.lwd |
numeric scalar determining the width of the line used to display the fitted CDF.
The default value is |
ecdf.lty |
numeric scalar determining the line type used to display the empirical CDF.
The default value is |
fitted.cdf.lty |
numeric scalar determining the line type used to display the fitted CDF.
The default value is |
Arguments associated with plot.type="Q-Q Plot"
or plot.type="Tukey M-D Q-Q Plot"
:
As explained above, prob.method
and plot.pos.con
are used for these plot
types. Also:
add.line |
logical scalar indicating whether to add a line to the plot. If |
Arguments associated with plot.type="Test Results"
test.result.font |
numeric scalar indicating which font to use to print out the test results.
The default value is |
test.result.cex |
numeric scalar indicating the value of |
test.result.mar |
numeric vector indicating the value of |
Arguments associated with plot.type="Summary"
add.om.title |
logical scalar indicating whether to add a title in the outer margin when |
om.title |
character string containing the outer margin title. The default value is |
om.font |
numeric scalar indicating the font to use for the outer margin. The default
value is |
om.cex.main |
numeric scalar indicating the value of |
om.line |
numeric scalar indicating the line to place the outer margin title on. The
default value is |
Graphics parameters:
cex.main , cex.axis , cex.lab , main , xlab , ylab , xlim , ylim , oma , ...
|
additional graphics parameters. See the help file for |
The function plot.gofCensored
is a method for the generic function
plot
for objects that inherit from the class "gofCensored"
(see gofCensored.object
).
It can be invoked by calling plot
and giving it an object of
class "gofCensored"
as the first argument, or by calling
plot.gofCensored
directly, regardless of the class of the object given
as the first argument to plot.gofCensored
.
Plots associated with the goodness-of-fit test are produced on the current graphics device. These can be one or all of the following:
Observed distribution overlaid with fitted distribution
(plot.type="PDFs: Observed and Fitted"
). See the help files for
hist
and pdfPlot
. Note: This kind of
plot is only available for singly-censored data.
Observed empirical distribution overlaid with fitted cumulative distribution
(plot.type="CDFs: Observed and Fitted"
). See the help file for
cdfCompareCensored
.
Observed quantiles vs. fitted quantiles (Q-Q Plot)
(plot.type="Q-Q Plot"
). See the help file for qqPlotCensored
.
Tukey mean-difference Q-Q plot (plot.type="Tukey M-D Q-Q Plot"
).
See the help file for qqPlotCensored
.
Results of the goodness-of-fit test (plot.type="Test Results"
).
See the help file for print.gofCensored
.
See the help file for gofTestCensored
for more information.
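As a brief sketch of selecting one of these plot types (using plot.type and the CDF display arguments documented above, and continuing with the gof.cens object created earlier):

# Compare the observed (empirical) and fitted CDFs with custom colors
plot(gof.cens, plot.type = "CDFs: Observed and Fitted",
  ecdf.col = "blue", fitted.cdf.col = "red",
  ecdf.lwd = 2, fitted.cdf.lwd = 2)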
plot.gofCensored
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
gofTestCensored
, gofCensored.object
,
print.gofCensored
, Censored Data, plot
.
# Create an object of class "gofCensored", then plot the results. #---------------------------------------------------------------- gofCensored.obj <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf")) mode(gofCensored.obj) #[1] "list" class(gofCensored.obj) #[1] "gofCensored" # Summary plot (the default) #--------------------------- dev.new() plot(gofCensored.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gofCensored.obj, captions = list(CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary") # Just the Q-Q Plot #------------------ dev.new() plot(gofCensored.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gofCensored.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(gofCensored.obj) graphics.off()
# Create an object of class "gofCensored", then plot the results. #---------------------------------------------------------------- gofCensored.obj <- with(EPA.09.Ex.15.1.manganese.df, gofTestCensored(Manganese.ppb, Censored, test = "sf")) mode(gofCensored.obj) #[1] "list" class(gofCensored.obj) #[1] "gofCensored" # Summary plot (the default) #--------------------------- dev.new() plot(gofCensored.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gofCensored.obj, captions = list(CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary") # Just the Q-Q Plot #------------------ dev.new() plot(gofCensored.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gofCensored.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(gofCensored.obj) graphics.off()
Plot the results of calling the function gofGroupTest
,
which returns an object of class "gofGroup"
when performing a
goodness-of-fit test to determine whether data in a set of
groups appear to all come from the same probability distribution
(with possibly different parameters for each group).
Five different kinds of plots are available.
The function plot.gofGroup
is automatically called by plot
when given an object of class "gofGroup"
. The names of other functions
associated with goodness-of-fit tests are listed under Goodness-of-Fit Tests.
## S3 method for class 'gofGroup' plot(x, plot.type = "Summary", captions = list(QQ = NULL, MDQQ = NULL, ScoresQQ = NULL, ScoresMDQQ = NULL, Results = NULL), x.labels = list(QQ = NULL, MDQQ = NULL, ScoresQQ = NULL, ScoresMDQQ = NULL), y.labels = list(QQ = NULL, MDQQ = NULL, ScoresQQ = NULL, ScoresMDQQ = NULL), same.window = FALSE, ask = same.window & plot.type == "All", add.line = TRUE, digits = ifelse(plot.type == "Summary", 2, .Options$digits), test.result.font = 1, test.result.cex = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), test.result.mar = c(0, 0, 3, 0) + 0.1, individual.p.values = FALSE, cex.main = ifelse(plot.type == "Summary", 1.2, 1.5) * par("cex"), cex.axis = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), cex.lab = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add.om.title = TRUE, oma = if (plot.type == "Summary" & add.om.title) c(0, 0, 5, 0) else c(0, 0, 0, 0), om.title = NULL, om.font = 2, om.cex.main = 1.5 * par("cex"), om.line = 1, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
captions |
a list with 1 to 5 components with the names |
x.labels |
a list of 1 to 4 components with the names |
y.labels |
a list of 1 to 4 components with the names |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
add.line |
logical scalar indicating whether to add a line to the plot. If |
Arguments associated with plot.type="Test Results"
digits |
scalar indicating how many significant digits to print for the test results
when |
individual.p.values |
logical scalar indicating whether to display the p-values associated with
each individual group. The default value is |
test.result.font |
numeric scalar indicating which font to use to print out the test results.
The default value is |
test.result.cex |
numeric scalar indicating the value of |
test.result.mar |
numeric vector indicating the value of |
Arguments associated with plot.type="Summary"
add.om.title |
logical scalar indicating whether to add a title in the outer margin when |
om.title |
character string containing the outer margin title. The default value is |
om.font |
numeric scalar indicating the font to use for the outer margin. The default
value is |
om.cex.main |
numeric scalar indicating the value of |
om.line |
numeric scalar indicating the line to place the outer margin title on. The
default value is |
Graphics parameters:
cex.main , cex.axis , cex.lab , main , xlab , ylab , xlim , ylim , oma , ...
|
additional graphics parameters. See the help file for |
The function plot.gofGroup
is a method for the generic function
plot
for the class "gofGroup"
(see
gofGroup.object
).
It can be invoked by calling plot
and giving it an object of
class "gofGroup"
as the first argument, or by calling
plot.gofGroup
directly, regardless of the class of the object given
as the first argument to plot.gofGroup
.
Plots associated with the goodness-of-fit test are produced on the current graphics device. These can be one or all of the following:
plot.type="Q-Q Plot"
.
Q-Q Plot of observed p-values vs. quantiles from a
Uniform [0,1] distribution.
See the help file for qqPlot
.
plot.type="Tukey M-D Q-Q Plot"
.
Tukey mean-difference Q-Q plot for observed p-values and
quantiles from a Uniform [0,1] distribution.
See the help file for qqPlot
.
plot.type="Scores Q-Q Plot"
.
Q-Q Plot of Normal scores vs. quantiles from a
Normal(0,1) distribution or
Q-Q Plot of Chisquare scores vs. quantiles from a
Chisquare distribution with 2 degrees of freedom.
See the help file for qqPlot
.
plot.type="Scores Tukey M-D Q-Q Plot"
.
Tukey mean-difference Q-Q plot based on Normal scores or
Chisquare scores.
See the help file for qqPlot
.
Results of the goodness-of-fit test (plot.type="Test Results"
).
See the help file for print.gofGroup
.
See the help file for gofGroupTest
for more information.
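A short sketch of requesting individual plot types by name (the plot.type values are those documented above; the nickel data set is the one used in the Examples below):

gofGroup.obj <- gofGroupTest(Nickel.ppb ~ Well,
  data = EPA.09.Ex.10.1.nickel.df)

# Q-Q plot of the group-wise p-values against Uniform(0,1) quantiles
plot(gofGroup.obj, plot.type = "Q-Q Plot")

# Q-Q plot based on the group scores (normal or chi-square)
plot(gofGroup.obj, plot.type = "Scores Q-Q Plot")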
plot.gofGroup
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
gofGroupTest
, gofGroup.object
,
print.gofGroup
,
Goodness-of-Fit Tests, plot
.
# Create an object of class "gofGroup" then plot it. # Example 10-4 of USEPA (2009, page 10-20) gives an example of # simultaneously testing the assumption of normality for nickel # concentrations (ppb) in groundwater collected at 4 monitoring # wells over 5 months. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #... #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Test for a normal distribution at each well: #-------------------------------------------- gofGroup.obj <- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df) dev.new() plot(gofGroup.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gofGroup.obj, captions = list(QQ = "Q-Q Plot", ScoresQQ = "Scores Q-Q Plot", Results = "Results"), om.title = "Summary Plot") # Just the Q-Q Plot #------------------ dev.new() plot(gofGroup.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gofGroup.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(gofGroup.obj) graphics.off()
# Create an object of class "gofGroup" then plot it. # Example 10-4 of USEPA (2009, page 10-20) gives an example of # simultaneously testing the assumption of normality for nickel # concentrations (ppb) in groundwater collected at 4 monitoring # wells over 5 months. The data for this example are stored in # EPA.09.Ex.10.1.nickel.df. EPA.09.Ex.10.1.nickel.df # Month Well Nickel.ppb #1 1 Well.1 58.8 #2 3 Well.1 1.0 #3 6 Well.1 262.0 #... #18 6 Well.4 85.6 #19 8 Well.4 10.0 #20 10 Well.4 637.0 # Test for a normal distribution at each well: #-------------------------------------------- gofGroup.obj <- gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df) dev.new() plot(gofGroup.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gofGroup.obj, captions = list(QQ = "Q-Q Plot", ScoresQQ = "Scores Q-Q Plot", Results = "Results"), om.title = "Summary Plot") # Just the Q-Q Plot #------------------ dev.new() plot(gofGroup.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gofGroup.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(gofGroup.obj) graphics.off()
Plot the results of calling the function gofTest
to compare
two samples. gofTest
returns an object of class "gofTwoSample"
when supplied with both the arguments y
and x
.
plot.gofTwoSample
provides five different kinds of plots.
The function plot.gofTwoSample
is automatically called by plot
when given an object of class "gofTwoSample"
. The names of other functions
associated with goodness-of-fit tests are listed under Goodness-of-Fit Tests.
## S3 method for class 'gofTwoSample' plot(x, plot.type = "Summary", captions = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL, Results = NULL), x.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), y.labels = list(PDFs = NULL, CDFs = NULL, QQ = NULL, MDQQ = NULL), same.window = FALSE, ask = same.window & plot.type == "All", x.points.col = "blue", y.points.col = "black", points.pch = 1, jitter.points = TRUE, discrete = FALSE, plot.pos.con = 0.375, x.ecdf.col = "blue", y.ecdf.col = "black", x.ecdf.lwd = 3 * par("cex"), y.ecdf.lwd = 3 * par("cex"), x.ecdf.lty = 1, y.ecdf.lty = 4, add.line = TRUE, digits = ifelse(plot.type == "Summary", 2, .Options$digits), test.result.font = 1, test.result.cex = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), test.result.mar = c(0, 0, 3, 0) + 0.1, cex.main = ifelse(plot.type == "Summary", 1.2, 1.5) * par("cex"), cex.axis = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), cex.lab = ifelse(plot.type == "Summary", 0.9, 1) * par("cex"), main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, add.om.title = TRUE, oma = if (plot.type == "Summary" & add.om.title) c(0, 0, 4, 0) else c(0, 0, 0, 0), om.title = NULL, om.font = 2, om.cex.main = 1.5 * par("cex"), om.line = 0, ...)
x |
an object of class |
plot.type |
character string indicating what kind of plot to create. Only one particular
plot type will be created, unless |
captions |
a list with 1 to 5 components with the names |
x.labels |
a list of 1 to 4 components with the names |
y.labels |
a list of 1 to 4 components with the names |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
Arguments associated with plot.type="PDFs: Observed"
:
x.points.col |
a character string or numeric scalar determining the color of the plotting symbol
used to display the distribution of the observed |
y.points.col |
a character string or numeric scalar determining the color of the plotting symbol
used to display the distribution of the observed |
points.pch |
a character string or numeric scalar determining the plotting symbol
used to display the distribution of the observed |
jitter.points |
logical scalar indicating whether to jitter the points in the strip chart.
The default value is |
Arguments associated with plot.type="CDFs: Observed"
:
discrete |
logical scalar indicating whether the two distributions are considered to be
discrete ( |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position
constant used to construct the observed (empirical) CDFs. The default value
is .... NOTE: This argument is also used to determine the value of the
plotting position constant for the Q-Q plot ( |
x.ecdf.col |
a character string or numeric scalar determining the color of the line
used to display the empirical CDF for the |
y.ecdf.col |
a character string or numeric scalar determining the color of the line
used to display the empirical CDF for the |
x.ecdf.lwd |
numeric scalar determining the width of the line used to display the empirical CDF
for the |
y.ecdf.lwd |
numeric scalar determining the width of the line used to display the empirical CDF
for the |
x.ecdf.lty |
numeric scalar determining the line type used to display the empirical CDF for the
|
y.ecdf.lty |
numeric scalar determining the line type used to display the empirical CDF for the
|
Arguments associated with plot.type="Q-Q Plot"
or plot.type="Tukey M-D Q-Q Plot"
:
As explained above, plot.pos.con
is used for these plot types. Also:
add.line |
logical scalar indicating whether to add a line to the plot. If |
Arguments associated with plot.type="Test Results"
digits |
scalar indicating how many significant digits to print for the test results
when |
test.result.font |
numeric scalar indicating which font to use to print out the test results.
The default value is |
test.result.cex |
numeric scalar indicating the value of |
test.result.mar |
numeric vector indicating the value of |
Arguments associated with plot.type="Summary"
add.om.title |
logical scalar indicating whether to add a title in the outer margin when |
om.title |
character string containing the outer margin title. The default value is |
om.font |
numeric scalar indicating the font to use for the outer margin. The default
value is |
om.cex.main |
numeric scalar indicating the value of |
om.line |
numeric scalar indicating the line to place the outer margin title on. The
default value is |
Graphics parameters:
cex.main , cex.axis , cex.lab , main , xlab , ylab , xlim , ylim , oma , ...
|
additional graphics parameters. See the help file for |
The function plot.gofTwoSample
is a method for the generic function
plot
for the class "gofTwoSample"
(see gofTwoSample.object
).
It can be invoked by calling plot
and giving it an object of
class "gofTwoSample"
as the first argument, or by calling
plot.gofTwoSample
directly, regardless of the class of the object given
as the first argument to plot.gofTwoSample
.
Plots associated with the goodness-of-fit test are produced on the current graphics device. These can be one or all of the following:
Observed distributions (plot.type="PDFs: Observed"
).
Observed CDFs (plot.type="CDFs: Observed"
).
See the help file for cdfCompare
.
Q-Q Plot (plot.type="Q-Q Plot"
). See the help file for
qqPlot
.
Tukey mean-difference Q-Q plot (plot.type="Tukey M-D Q-Q Plot"
).
See the help file for qqPlot
.
Results of the goodness-of-fit test (plot.type="Test Results"
).
See the help file for print.gofTwoSample
.
See the help file for gofTest
for more information.
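For illustration, a small sketch that requests the observed-PDFs display using the point arguments documented above (data simulated as in the Examples below):

set.seed(300)
dat1 <- rnorm(20, mean = 3, sd = 2)
dat2 <- rnorm(10, mean = 1, sd = 2)
gof.obj <- gofTest(x = dat1, y = dat2)

# Strip charts of the two observed samples, with jittered, colored points
plot(gof.obj, plot.type = "PDFs: Observed",
  x.points.col = "red", y.points.col = "blue", jitter.points = TRUE)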
plot.gofTwoSample
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
gofTest
, gofTwoSample.object
,
print.gofTwoSample
,
Goodness-of-Fit Tests, plot
.
# Create an object of class "gofTwoSample" then plot the results. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(300) dat1 <- rnorm(20, mean = 3, sd = 2) dat2 <- rnorm(10, mean = 1, sd = 2) gof.obj <- gofTest(x = dat1, y = dat2) # Summary plot (the default) #--------------------------- dev.new() plot(gof.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gof.obj, captions = list(PDFs = "Compare PDFs", CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary Plot") # Just the Q-Q Plot #------------------ dev.new() plot(gof.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gof.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(dat1, dat2, gof.obj) graphics.off()
# Create an object of class "gofTwoSample" then plot the results. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(300) dat1 <- rnorm(20, mean = 3, sd = 2) dat2 <- rnorm(10, mean = 1, sd = 2) gof.obj <- gofTest(x = dat1, y = dat2) # Summary plot (the default) #--------------------------- dev.new() plot(gof.obj) # Make your own titles for the summary plot #------------------------------------------ dev.new() plot(gof.obj, captions = list(PDFs = "Compare PDFs", CDFs = "Compare CDFs", QQ = "Q-Q Plot", Results = "Results"), om.title = "Summary Plot") # Just the Q-Q Plot #------------------ dev.new() plot(gof.obj, plot.type="Q-Q") # Make your own title for the Q-Q Plot #------------------------------------- dev.new() plot(gof.obj, plot.type="Q-Q", main = "Q-Q Plot") #========== # Clean up #--------- rm(dat1, dat2, gof.obj) graphics.off()
Plot the results of calling functions that return an object of class
"permutationTest"
. Currently, the EnvStats functions that perform
permutation tests and produce objects of class "permutationTest"
are: oneSamplePermutationTest
,
twoSamplePermutationTestLocation
, and twoSamplePermutationTestProportion
.
The function plot.permutationTest
is automatically called by
plot
when given an object of class "permutationTest"
.
## S3 method for class 'permutationTest' plot(x, hist.col = "cyan", stat.col = "black", stat.lwd = 3 * par("cex"), stat.lty = 1, cex.main = par("cex"), digits = .Options$digits, main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL, ...)
x |
an object of class |
hist.col |
a character string or numeric scalar determining the color of the histogram
used to display the permutation distribution. The default
value is |
stat.col |
a character string or numeric scalar determining the color of the line indicating
the value of the observed test statistic. The default value is
|
stat.lwd |
numeric scalar determining the width of the line indicating the value of the
observed test statistic. The default value is |
stat.lty |
numeric scalar determining the line type used to display the value of the
observed test statistic. The default value is |
digits |
scalar indicating how many significant digits to print for the distribution
parameters. The default value is |
cex.main , main , xlab , ylab , xlim , ylim , ...
|
graphics parameters. See the help file for |
Produces a plot displaying the permutation distribution (exact=TRUE
) or a
sample of the permutation distribution (exact=FALSE
), and a line indicating
the observed value of the test statistic. The title in the plot includes
information on the data used, null hypothesis, and p-value.
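As a sketch of how the display arguments above can be combined (the test object is built as in the Examples below):

set.seed(23)
dat <- rlogis(10, location = 7, scale = 2)
perm.obj <- oneSamplePermutationTest(dat, mu = 5,
  alternative = "greater", exact = TRUE)

# Histogram of the permutation distribution, with a dashed red line
# marking the observed value of the test statistic
plot(perm.obj, hist.col = "lightgray", stat.col = "red",
  stat.lwd = 2, stat.lty = 2)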
The function plot.permutationTest
is a method for the generic function
plot
for the class "permutationTest"
(see permutationTest.object
). It can be invoked by calling
plot
and giving it an object of
class "permutationTest"
as the first argument, or by calling plot.permutationTest
directly, regardless of the class of the object given
as the first argument to plot.permutationTest
.
plot.permutationTest
invisibly returns the first argument, x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
permutationTest.object
, print.permutationTest
,
oneSamplePermutationTest
, twoSamplePermutationTestLocation
,
twoSamplePermutationTestProportion
,
Hypothesis Tests, plot
.
# Create an object of class "permutationTest", then print it and plot it. # (Note: the call to set.seed() allows you to reproduce this example.) #------------------------------------------------------------------------ set.seed(23) dat <- rlogis(10, location = 7, scale = 2) permutationTest.obj <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) mode(permutationTest.obj) #[1] "list" class(permutationTest.obj) #[1] "permutationTest" names(permutationTest.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "stat.dist" #[13] "exact" #========== # Print the results of the test #------------------------------ permutationTest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 #========== # Plot the results of the test #----------------------------- dev.new() plot(permutationTest.obj) #========== # Extract the test statistic #--------------------------- permutationTest.obj$statistic #Sum(x - 5) # 49.77294 #========== # Clean up #--------- rm(permutationTest.obj) graphics.off()
# Create an object of class "permutationTest", then print it and plot it. # (Note: the call to set.seed() allows you to reproduce this example.) #------------------------------------------------------------------------ set.seed(23) dat <- rlogis(10, location = 7, scale = 2) permutationTest.obj <- oneSamplePermutationTest(dat, mu = 5, alternative = "greater", exact = TRUE) mode(permutationTest.obj) #[1] "list" class(permutationTest.obj) #[1] "permutationTest" names(permutationTest.obj) # [1] "statistic" "parameters" "p.value" # [4] "estimate" "null.value" "alternative" # [7] "method" "estimation.method" "sample.size" #[10] "data.name" "bad.obs" "stat.dist" #[13] "exact" #========== # Print the results of the test #------------------------------ permutationTest.obj #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Mean (Median) = 5 # #Alternative Hypothesis: True Mean (Median) is greater than 5 # #Test Name: One-Sample Permutation Test # (Exact) # #Estimated Parameter(s): Mean = 9.977294 # #Data: dat # #Sample Size: 10 # #Test Statistic: Sum(x - 5) = 49.77294 # #P-value: 0.001953125 #========== # Plot the results of the test #----------------------------- dev.new() plot(permutationTest.obj) #========== # Extract the test statistic #--------------------------- permutationTest.obj$statistic #Sum(x - 5) # 49.77294 #========== # Clean up #--------- rm(permutationTest.obj) graphics.off()
Create plots involving sample size, power, scaled difference, and significance level for a one-way fixed-effects analysis of variance.
plotAovDesign(x.var = "n", y.var = "power", range.x.var = NULL, n.vec = c(25, 25), mu.vec = c(0, 1), sigma = 1, alpha = 0.05, power = 0.95, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 50, plot.col = 1, plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, main = NULL, xlab = NULL, ylab = NULL, type = "l", ...)
x.var |
character string indicating what variable to use for the x-axis. Possible values are
|
y.var |
character string indicating what variable to use for the y-axis. Possible values are
|
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n.vec |
numeric vector indicating the sample size for each group. The default value is
|
mu.vec |
numeric vector indicating the population mean for each group. The default value is
|
sigma |
numeric scalar indicating the population standard deviation for all groups. The default
value is |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level associated with the
hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power associated with the hypothesis
test. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next largest integer. The default value is FALSE. This argument
is ignored unless |
n.max |
for the case when |
tol |
for the case when |
maxiter |
for the case when |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the existing plot
( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot. There are
|
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for aovPower
and aovN
for information on how to compute the power and sample size for a
one-way fixed-effects analysis of variance.
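A minimal sketch of those two companion functions (argument names are assumed from their help files, which are not reproduced here):

# Power of a one-way ANOVA with 3 groups of 10 observations each
aovPower(n.vec = c(10, 10, 10), mu.vec = c(0, 0.5, 1),
  sigma = 1, alpha = 0.05)

# Per-group sample size needed to achieve 95% power for the same design
aovN(mu.vec = c(0, 0.5, 1), sigma = 1, alpha = 0.05, power = 0.95)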
plotAovDesign
invisibly returns a list with components:
x.var |
x-coordinates of the points that have been or would have been plotted |
y.var |
y-coordinates of the points that have been or would have been plotted |
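For example, the coordinates can be captured without drawing anything by setting plot.it = FALSE (a sketch based on the arguments described above):

# Get the (sample size, power) pairs without producing a plot
pow.curve <- plotAovDesign(x.var = "n", y.var = "power",
  mu.vec = c(0, 0.5, 1), sigma = 1, alpha = 0.05, plot.it = FALSE)
names(pow.curve)
#[1] "x.var" "y.var"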
The normal and lognormal distributions are probably the two most frequently used distributions to model environmental data. Sometimes it is necessary to compare several means to determine whether any are significantly different from each other (e.g., USEPA, 2009, p.6-38). In this case, assuming normally distributed data, you perform a one-way parametric analysis of variance.
In the course of designing a sampling program, an environmental
scientist may wish to determine the relationship between sample
size, Type I error level, power, and differences in means if
one of the objectives of the sampling program is to determine
whether a particular mean differs from a group of means. The
functions aovPower
, aovN
, and
plotAovDesign
can be used to investigate these
relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapter 17.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 27, 29, 30.
Scheffe, H. (1959). The Analysis of Variance. John Wiley and Sons, New York, 477pp.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 10.
# Look at the relationship between power and sample size # for a one-way ANOVA, assuming k=2 groups, group means of # 0 and 1, a population standard deviation of 1, and a # 5% significance level: dev.new() plotAovDesign() #-------------------------------------------------------------------- # Plot power vs. sample size for various levels of significance: dev.new() plotAovDesign(mu.vec = c(0, 0.5, 1), ylim=c(0, 1), main="") plotAovDesign(mu.vec = c(0, 0.5, 1), alpha=0.1, add=TRUE, plot.col=2) plotAovDesign(mu.vec = c(0, 0.5, 1), alpha=0.2, add=TRUE, plot.col=3) legend(35, 0.6, c("20%", "10%", " 5%"), lty=1, lwd = 3, col=3:1, bty = "n") mtext("Power vs. Sample Size for One-Way ANOVA", line = 3, cex = 1.25) mtext(expression(paste("with ", mu, "=(0, 0.5, 1), ", sigma, "=1, and Various Significance Levels", sep="")), line = 1.5, cex = 1.25) #-------------------------------------------------------------------- # The example on pages 5-11 to 5-14 of USEPA (1989b) shows # log-transformed concentrations of lead (mg/L) at two # background wells and four compliance wells, where # observations were taken once per month over four months # (the data are stored in EPA.89b.loglead.df). # Assume the true mean levels at each well are # 3.9, 3.9, 4.5, 4.5, 4.5, and 5, respectively. Plot the # power vs. sample size of a one-way ANOVA to test for mean # differences between wells. Use alpha=0.05, and assume the # true standard deviation is equal to the one estimated # from the data in this example. names(EPA.89b.loglead.df) #[1] "LogLead" "Month" "Well" "Well.type" # Perform the ANOVA and get the estimated sd aov.list <- aov(LogLead ~ Well, data=EPA.89b.loglead.df) summary(aov.list) # Df Sum Sq Mean Sq F value Pr(>F) #Well 5 5.7447 1.14895 3.3469 0.02599 * #Residuals 18 6.1791 0.34328 #--- #Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '' 1 # Now create the plot dev.new() plotAovDesign(range.x.var = c(2, 20), mu.vec = c(3.9,3.9,4.5,4.5,4.5,5), sigma=sqrt(0.34), ylim = c(0, 1), digits=2) # Clean up #--------- rm(aov.list) graphics.off()
Create plots for a sampling design based on a confidence interval for a binomial proportion or the difference between two proportions.
plotCiBinomDesign(x.var = "n", y.var = "half.width", range.x.var = NULL, n.or.n1 = 25, p.hat.or.p1.hat = 0.5, n2 = n.or.n1, p2.hat = 0.4, ratio = 1, half.width = 0.05, conf.level = 0.95, sample.type = "one.sample", ci.method = "score", correct = TRUE, warn = TRUE, n.or.n1.min = 2, n.or.n1.max = 10000, tol.half.width = 0.005, tol.p.hat = 0.005, maxiter = 10000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = 1, plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, main = NULL, xlab = NULL, ylab = NULL, type = "l", ...)
x.var |
character string indicating what variable to use for the x-axis. Possible values are
|
y.var |
character string indicating what variable to use for the y-axis. Possible values are
|
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n.or.n1 |
numeric scalar indicating the sample size. The default value is |
p.hat.or.p1.hat |
numeric scalar indicating an estimated proportion. |
n2 |
numeric scalar indicating the sample size for group 2. The default value is the value of |
p2.hat |
numeric scalar indicating the estimated proportion for group 2.
Missing ( |
ratio |
numeric vector indicating the ratio of sample size in group 2 to sample size in group 1 ( |
half.width |
positive numeric scalar indicating the half-width of the confidence interval.
The default value is |
conf.level |
a numeric scalar between 0 and 1 indicating the confidence level associated with the confidence intervals.
The default value is |
sample.type |
character string indicating whether this is a one-sample or two-sample confidence interval.
When |
ci.method |
character string indicating which method to use to construct the confidence interval.
Possible values are |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning when |
n.or.n1.min |
for the case when |
n.or.n1.max |
for the case when |
tol.half.width |
for the case when |
tol.p.hat |
for the case when |
maxiter |
for the case when |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see description of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot
( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for ciBinomHalfWidth
and ciBinomN
for information on how to compute a one-sample confidence interval for
a single binomial proportion or a two-sample confidence interval for the difference between
two proportions, how the half-width is computed when other quantities are fixed, and how
the sample size is computed when other quantities are fixed.
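A minimal sketch of those two companion functions (argument names follow this help file; the values are illustrative only):

# Half-width of a 95% one-sample CI for a proportion, n = 25, p-hat = 0.5
ciBinomHalfWidth(n.or.n1 = 25, p.hat.or.p1.hat = 0.5, conf.level = 0.95)

# Sample size needed so that the half-width is roughly 0.05
ciBinomN(half.width = 0.05, p.hat.or.p1.hat = 0.5, conf.level = 0.95)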
plotCiBinomDesign
invisibly returns a list with components:
x.var |
x-coordinates of the points that have been or would have been plotted |
y.var |
y-coordinates of the points that have been or would have been plotted |
The binomial distribution is used to model processes with binary
(Yes-No, Success-Failure, Heads-Tails, etc.) outcomes. It is assumed that the outcome of any
one trial is independent of any other trial, and that the probability of “success”, p, is the same on each trial. A binomial discrete random variable X is the number of “successes” in n independent trials. A special case of the binomial distribution occurs when n = 1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model
the proportion of times a chemical concentration exceeds a set standard in a given period of time
(e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a
background well (e.g., USEPA, 1989b, Chapter 8, p.3-7). (However, USEPA 2009, p.8-27
recommends using the Wilcoxon rank sum test (wilcox.test
) instead of
comparing proportions.)
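For instance, a hedged sketch of estimating the proportion of detects at the background wells from the cadmium data used in the Examples below (ebinom is described in its own help file):

# Proportion of detects (non-censored values) at the background wells,
# with a 95% confidence interval
with(EPA.89b.cadmium.df,
  ebinom(as.numeric(!Censored[Well.type == "Background"]), ci = TRUE))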
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives of
the sampling program is to produce confidence intervals. The functions ciBinomHalfWidth
,
ciBinomN
, and plotCiBinomDesign
can be used to investigate these
relationships for the case of binomial proportions.
Steven P. Millard ([email protected])
Agresti, A., and B.A. Coull. (1998). Approximate is Better than "Exact" for Interval Estimation of Binomial Proportions. The American Statistician, 52(2), 119–126.
Agresti, A., and B. Caffo. (2000). Simple and Effective Confidence Intervals for Proportions and Differences of Proportions Result from Adding Two Successes and Two Failures. The American Statistician, 54(4), 280–288.
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapters 2 and 15.
Cochran, W.G. (1977). Sampling Techniques. John Wiley and Sons, New York, Chapter 3.
Fisher, R.A., and F. Yates. (1963). Statistical Tables for Biological, Agricultural, and Medical Research. 6th edition. Hafner, New York, 146pp.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY, Chapter 11.
Newcombe, R.G. (1998a). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872.
Newcombe, R.G. (1998b). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL, Chapter 4.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
ciBinomHalfWidth
, ciBinomN
,
ebinom
, binom.test
, prop.test
,
par
.
# Look at the relationship between half-width and sample size # for a one-sample confidence interval for a binomial proportion, # assuming an estimated proportion of 0.5 and a confidence level of # 95%. The jigsaw appearance of the plot is the result of using the # score method: dev.new() plotCiBinomDesign() #---------- # Redo the example above, but use the traditional (and inaccurate) # Wald method. dev.new() plotCiBinomDesign(ci.method = "Wald") #-------------------------------------------------------------------- # Plot sample size vs. the estimated proportion for various half-widths, # using a 95% confidence level and the adjusted Wald method: # NOTE: This example takes several seconds to run so it has been # commented out. Simply remove the pound signs (#) from in front # of the R commands to run it. #dev.new() #plotCiBinomDesign(x.var = "p.hat", y.var = "n", # half.width = 0.04, ylim = c(0, 600), main = "", # xlab = expression(hat(p))) # #plotCiBinomDesign(x.var = "p.hat", y.var = "n", # half.width = 0.05, add = TRUE, plot.col = 2) # #plotCiBinomDesign(x.var = "p.hat", y.var = "n", # half.width = 0.06, add = TRUE, plot.col = 3) # #legend(0.5, 150, paste("Half-Width =", c(0.04, 0.05, 0.06)), # lty = rep(1, 3), lwd = rep(2, 3), col=1:3, bty = "n") # #mtext(expression(paste("Sample Size vs. ", hat(p), # " for Confidence Interval for p")), line = 2.5, cex = 1.25) #mtext("with Confidence=95% and Various Values of Half-Width", # line = 1.5, cex = 1.25) #mtext(paste("CI Method = Score Normal Approximation", # "with Continuity Correction"), line = 0.5) #-------------------------------------------------------------------- # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), # look at the relationship between half-width and sample size # for a 95% confidence interval for the difference between the # proportion of detects at the background and compliance wells. # Use the estimated proportion of detects from the original data. # (The data are stored in EPA.89b.cadmium.df.) # Assume equal sample sizes at each well. EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background # .......................................... #86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance p.hat.back <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Background"])) p.hat.back #[1] 0.3333333 p.hat.comp <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Compliance"])) p.hat.comp #[1] 0.375 dev.new() plotCiBinomDesign(p.hat.or.p1.hat = p.hat.back, p2.hat = p.hat.comp, digits=3) #========== # Clean up #--------- rm(p.hat.back, p.hat.comp) graphics.off()
Create plots involving sample size, half-width, estimated standard deviation, and confidence level for a confidence interval for the mean of a normal distribution or the difference between two means.
plotCiNormDesign(x.var = "n", y.var = "half.width", range.x.var = NULL, n.or.n1 = 25, n2 = n.or.n1, half.width = sigma.hat/2, sigma.hat = 1, conf.level = 0.95, sample.type = ifelse(missing(n2), "one.sample", "two.sample"), round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, main = NULL, xlab = NULL, ylab = NULL, type = "l", ...)
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n.or.n1 |
numeric scalar indicating the sample size. The default value is |
n2 |
numeric scalar indicating the sample size for group 2.
The default value is the value of |
half.width |
positive numeric scalar indicating the half-width of the confidence interval.
The default value is |
sigma.hat |
positive numeric scalar specifying the estimated standard deviation.
The default value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the confidence interval.
The default value is |
sample.type |
character string indicating whether this is a one-sample or two-sample confidence interval. |
round.up |
logical scalar indicating whether to round up the computed sample sizes to the next largest integer.
The default value is |
n.max |
for the case when |
tol |
for the case when |
maxiter |
for the case when |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for ciNormHalfWidth
and ciNormN
for information on how to compute a one-sample confidence interval for the mean of
a normal distribution or a two-sample confidence interval for the difference between
two means, how the half-width is computed when other quantities are fixed, and how the
sample size is computed when other quantities are fixed.
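A minimal sketch of those two companion functions (argument names follow this help file; the values are illustrative only):

# Half-width of a 95% CI for a single mean, n = 25, estimated SD = 1
ciNormHalfWidth(n.or.n1 = 25, sigma.hat = 1, conf.level = 0.95)

# Sample size needed so that the half-width is at most 0.5
ciNormN(half.width = 0.5, sigma.hat = 1, conf.level = 0.95)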
plotCiNormDesign
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
The normal distribution and lognormal distribution are probably the two most frequently used distributions to model environmental data. In order to make any kind of probability statement about a normally-distributed population (of chemical concentrations for example), you have to first estimate the mean and standard deviation (the population parameters) of the distribution. Once you estimate these parameters, it is often useful to characterize the uncertainty in the estimate of the mean. This is done with confidence intervals.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, confidence level, and half-width if one of the objectives
of the sampling program is to produce confidence intervals. The functions
ciNormHalfWidth
, ciNormN
, and plotCiNormDesign
can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.21-3.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapters 7 and 8.
ciNormHalfWidth
, ciNormN
, Normal
,
enorm
, t.test
,
Estimating Distribution Parameters.
# Look at the relationship between half-width and sample size # for a one-sample confidence interval for the mean, assuming # an estimated standard deviation of 1 and a confidence level of 95%. dev.new() plotCiNormDesign() #-------------------------------------------------------------------- # Plot sample size vs. the estimated standard deviation for # various levels of confidence, using a half-width of 0.5. dev.new() plotCiNormDesign(x.var = "sigma.hat", y.var = "n", main = "") plotCiNormDesign(x.var = "sigma.hat", y.var = "n", conf.level = 0.9, add = TRUE, plot.col = 2) plotCiNormDesign(x.var = "sigma.hat", y.var = "n", conf.level = 0.8, add = TRUE, plot.col = 3) legend(0.25, 60, c("95%", "90%", "80%"), lty = 1, lwd = 3, col = 1:3) mtext("Sample Size vs. Estimated SD for Confidence Interval for Mean", font = 2, cex = 1.25, line = 2.75) mtext("with Half-Width=0.5 and Various Confidence Levels", font = 2, cex = 1.25, line = 1.25) #-------------------------------------------------------------------- # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), # look at the relationship between half-width and sample size for a # 95% confidence interval for the mean level of Aldicarb at the # first compliance well. Use the estimated standard deviation from # the first four months of data. # (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #... mu.hat <- with(EPA.09.Ex.21.1.aldicarb.df, mean(Aldicarb.ppb[Well=="Well.1"])) mu.hat #[1] 23.1 sigma.hat <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well=="Well.1"])) sigma.hat #[1] 4.93491 dev.new() plotCiNormDesign(sigma.hat = sigma.hat, digits = 2, range.x.var = c(2, 25)) #========== # Clean up #--------- rm(mu.hat, sigma.hat) graphics.off()
Create plots involving sample size, quantile, and confidence level for a nonparametric confidence interval for a quantile.
plotCiNparDesign(x.var = "n", y.var = "conf.level", range.x.var = NULL, n = 25, p = 0.5, conf.level = 0.95, ci.type = "two.sided", lcl.rank = ifelse(ci.type == "upper", 0, 1), n.plus.one.minus.ucl.rank = ifelse(ci.type == "lower", 0, 1), plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is
|
p |
numeric scalar specifying the quantile. The value of this argument must be
between 0 and 1. The default value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the confidence interval.
The default value is |
ci.type |
character string indicating what kind of confidence interval to compute. The
possible values are |
lcl.rank , n.plus.one.minus.ucl.rank
|
numeric vectors of non-negative integers indicating the ranks of the
order statistics that are used for the lower and upper bounds of the
confidence interval for the specified quantile(s). When |
plot.it |
a logical scalar indicating whether to create a plot or add to the
existing plot (see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for eqnpar
, ciNparConfLevel
,
and ciNparN
for information on how to compute a
nonparametric confidence interval for a quantile, how the confidence level
is computed when other quantities are fixed, and how the sample size is
computed when other quantities are fixed.
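For orientation, here is a minimal sketch (not part of the original examples) of the per-point computations that underlie this plot, assuming ciNparConfLevel and ciNparN accept the arguments shown with their default order-statistic ranks:
# Confidence level achieved by a two-sided nonparametric confidence interval
# for the median based on the extreme order statistics of n = 25 observations:
ciNparConfLevel(n = 25, p = 0.5, ci.type = "two.sided")
# Smallest sample size giving at least 95% confidence for a two-sided
# nonparametric confidence interval for the 90th percentile:
ciNparN(p = 0.9, ci.type = "two.sided", conf.level = 0.95)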
plotCiNparDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that
have been or would have been plotted.
See the help file for eqnpar
.
Steven P. Millard ([email protected])
See the help file for eqnpar
.
eqnpar
, ciNparConfLevel
,
ciNparN
.
# Look at the relationship between confidence level and sample size for # a two-sided nonparametric confidence interval for the 90'th percentile. dev.new() plotCiNparDesign(p = 0.9) #---------- # Plot sample size vs. quantile for various levels of confidence: dev.new() plotCiNparDesign(x.var = "p", y.var = "n", range.x.var = c(0.8, 0.95), ylim = c(0, 60), main = "") plotCiNparDesign(x.var = "p", y.var = "n", conf.level = 0.9, add = TRUE, plot.col = 2, plot.lty = 2) plotCiNparDesign(x.var = "p", y.var = "n", conf.level = 0.8, add = TRUE, plot.col = 3, plot.lty = 3) legend("topleft", c("95%", "90%", "80%"), lty = 1:3, col = 1:3, lwd = 3 * par('cex'), bty = 'n') title(main = paste("Sample Size vs. Quantile for ", "Nonparametric CI for \nQuantile, with ", "Various Confidence Levels", sep="")) #========== # Clean up #--------- graphics.off()
Create plots involving sample size, power, scaled difference, and significance level for a t-test for linear trend.
plotLinearTrendTestDesign(x.var = "n", y.var = "power", range.x.var = NULL, n = 12, slope.over.sigma = switch(alternative, greater = 0.1, less = -0.1, two.sided = ifelse(two.sided.direction == "greater", 0.1, -0.1)), alpha = 0.05, power = 0.95, alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = ifelse(x.var == "n", diff(range.x.var) + 1, 50), plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is
|
slope.over.sigma |
numeric scalar specifying the ratio of the true slope ( |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level associated
with the hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power associated with the
hypothesis test. The default value is |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
two.sided.direction |
character string indicating the direction (positive or negative) for the
scaled minimal detectable slope when |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed
sample size(s) to the next largest integer. The default value is
|
n.max |
for the case when |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
plot.it |
a logical scalar indicating whether to create a new plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for linearTrendTestPower
,
linearTrendTestN
, and linearTrendTestScaledMds
for
information on how to compute the power, sample size, or scaled minimal detectable
slope for a t-test for linear trend.
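As a minimal sketch (not part of the original examples), and assuming the companion functions accept the arguments shown, each point on the plot corresponds to a call such as:
# Power of the t-test for linear trend based on n = 12 equally spaced
# observations, a scaled slope of 0.1 per time unit, and a 5% significance level:
linearTrendTestPower(n = 12, slope.over.sigma = 0.1, alpha = 0.05)
# Sample size needed to achieve 95% power for the same scaled slope:
linearTrendTestN(slope.over.sigma = 0.1, alpha = 0.05, power = 0.95)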
plotLinearTrendTestDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that have
been or would have been plotted.
See the help files for linearTrendTestPower
.
Steven P. Millard ([email protected])
See the help file for linearTrendTestPower
.
linearTrendTestPower
, linearTrendTestN
,
linearTrendTestScaledMds
.
# Look at the relationship between power and sample size for the t-test for # linear trend, assuming a scaled slope of 0.1 and a 5% significance level: dev.new() plotLinearTrendTestDesign() #========== # Plot sample size vs. the scaled minimal detectable slope for various # levels of power, using a 5% significance level: dev.new() plotLinearTrendTestDesign(x.var = "slope.over.sigma", y.var = "n", ylim = c(0, 30), main = "") plotLinearTrendTestDesign(x.var = "slope.over.sigma", y.var = "n", power = 0.9, add = TRUE, plot.col = "red") plotLinearTrendTestDesign(x.var = "slope.over.sigma", y.var = "n", power = 0.8, add = TRUE, plot.col = "blue") legend("topright", c("95%", "90%", "80%"), lty = 1, bty = "n", lwd = 3 * par("cex"), col = c("black", "red", "blue")) title(main = paste("Sample Size vs. Scaled Slope for t-Test for Linear Trend", "with Alpha=0.05 and Various Powers", sep="\n")) #========== # Clean up #--------- graphics.off()
Plot power vs. the ratio of means for a sampling design for a test based on a
simultaneous prediction interval for a lognormal distribution.
plotPredIntLnormAltSimultaneousTestPowerCurve(n = 8, df = n - 1, n.geomean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", cv = 1, range.ratio.of.means = c(1, 5), pi.type = "upper", conf.level = 0.95, r.shifted = r, K.tol = .Machine$double.eps^(1/2), integrate.args.list = NULL, plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer greater than 2 indicating the sample size upon which
the prediction interval is based. The default value is |
df |
positive integer indicating the degrees of freedom associated with
the sample size. The default value is |
n.geomean |
positive integer specifying the sample size associated with the future geometric
mean(s). The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
cv |
positive value specifying the coefficient of variation for
both the population that was sampled to construct the prediction interval and
the population that will be sampled to produce the future observations. The
default value is |
range.ratio.of.means |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
r.shifted |
positive integer between |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
integrate.args.list |
a list of arguments to supply to the |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntLnormAltSimultaneousTestPower
for
information on how to compute the power of a hypothesis test for the difference
between two means of lognormal distributions based on a simultaneous prediction
interval for a lognormal distribution.
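The following minimal sketch (not part of the original examples) shows the single-point power computation that each point on the curve is based on, assuming predIntLnormAltSimultaneousTestPower accepts the arguments shown:
# Power of a 1-of-3 resampling plan based on n = 25 background samples,
# r = 2 future sampling occasions, and cv = 1, when the mean at the
# compliance well is 4 times the background mean:
predIntLnormAltSimultaneousTestPower(n = 25, k = 1, m = 3, r = 2,
  rule = "k.of.m", ratio.of.means = 4, cv = 1, pi.type = "upper",
  conf.level = 0.95)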
plotPredIntLnormAltSimultaneousTestPowerCurve
invisibly returns a list with
components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
ratio of means if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntLnormAltSimultaneousTestPower
and plotPredIntLnormAltSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of lognormally-distributed
observations.
Steven P. Millard ([email protected])
See the help file for predIntNormSimultaneous
.
predIntLnormAltSimultaneousTestPower
,
predIntLnormAltSimultaneous
, predIntLnormAlt
,
predIntLnormAltTestPower
, Prediction Intervals,
LognormalAlt.
# USEPA (2009) contains an example on page 19-23 that involves monitoring # nw=100 compliance wells at a large facility with minimal natural spatial # variation every 6 months for nc=20 separate chemicals. # There are n=25 background measurements for each chemical to use to create # simultaneous prediction intervals. We would like to determine which kind of # resampling plan based on normal distribution simultaneous prediction intervals to # use (1-of-m, 1-of-m based on means, or Modified California) in order to have # adequate power of detecting an increase in chemical concentration at any of the # 100 wells while at the same time maintaining a site-wide false positive rate # (SWFPR) of 10% per year over all 4,000 comparisons # (100 wells x 20 chemicals x semi-annual sampling). # The function predIntNormSimultaneousTestPower includes the argument "r" # that is the number of future sampling occasions (r=2 in this case because # we are performing semi-annual sampling), so to compute the individual test # Type I error level alpha.test (and thus the individual test confidence level), # we only need to worry about the number of wells (100) and the number of # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)). The individual # confidence level is simply 1-alpha.test. Plugging in 0.1 for alpha, # 100 for nw, and 20 for nc yields an individual test confidence level of # 1-alpha.test = 0.9999473. nc <- 20 nw <- 100 conf.level <- (1 - 0.1)^(1 / (nc * nw)) conf.level #[1] 0.9999473 # The help file for predIntNormSimultaneousTestPower shows how to # create the results below for various sampling plans: # Rule k m N.Mean K Power Total.Samples #1 k.of.m 1 2 1 3.16 0.39 2 #2 k.of.m 1 3 1 2.33 0.65 3 #3 k.of.m 1 4 1 1.83 0.81 4 #4 Modified.CA 1 4 1 2.57 0.71 4 #5 k.of.m 1 1 2 3.62 0.41 2 #6 k.of.m 1 2 2 2.33 0.85 4 #7 k.of.m 1 1 3 2.99 0.71 3 # The above table shows the K-multipliers for each prediction interval, along with # the power of detecting a change in concentration of three standard deviations at # any of the 100 wells during the course of a year, for each of the sampling # strategies considered. The last three rows of the table correspond to sampling # strategies that involve using the mean of two or three observations. # Here we will create a variation of this example based on # using a lognormal distribution and plotting power versus ratio of the # means assuming cv=1. # Here is the power curve for the 1-of-4 sampling strategy: dev.new() plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", conf.level = conf.level, ylim = c(0, 1), main = "") title(main = paste("Power Curves for 1-of-4 Sampling Strategy Based on 25 Background", "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) mtext("Assuming Lognormal Data with CV=1", line = 0) #---------- # Here are the power curves for the first four sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. 
#dev.new() #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, # rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, ylim = c(0, 1), main = "") #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 3, r = 2, # rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, add = TRUE, plot.col = "red", plot.lty = 2) #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, r = 2, # rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, add = TRUE, plot.col = "blue", plot.lty = 3) #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, r = 2, rule="Modified.CA", # range.ratio.of.means = c(1, 10), pi.type = "upper", conf.level = conf.level, # add = TRUE, plot.col = "green3", plot.lty = 4) #legend("topleft", c("1-of-4", "Modified CA", "1-of-3", "1-of-2"), # col = c("black", "green3", "red", "blue"), lty = c(1, 4, 2, 3), # lwd = 3 * par("cex"), bty = "n") #title(main = paste("Power Curves for 4 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #mtext("Assuming Lognormal Data with CV=1", line = 0) #---------- # Here are the power curves for the last 3 sampling strategies: # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. #dev.new() #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, n.geomean = 2, # r = 2, rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, ylim = c(0, 1), main = "") #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.geomean = 2, # r = 2, rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, add = TRUE, plot.col = "red", plot.lty = 2) #plotPredIntLnormAltSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.geomean = 3, # r = 2, rule="k.of.m", range.ratio.of.means = c(1, 10), pi.type = "upper", # conf.level = conf.level, add = TRUE, plot.col = "blue", plot.lty = 3) #legend("topleft", c("1-of-2, Order 2", "1-of-1, Order 3", "1-of-1, Order 2"), # col = c("black", "blue", "red"), lty = c(1, 3, 2), lwd = 3 * par("cex"), # bty="n") #title(main = paste("Power Curves for 3 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #mtext("Assuming Lognormal Data with CV=1", line = 0) #========== # Clean up #--------- rm(nc, nw, conf.level) graphics.off()
Plot power vs. the ratio of means for a sampling design for a test based on a
prediction interval for a lognormal distribution.
plotPredIntLnormAltTestPowerCurve(n = 8, df = n - 1, n.geomean = 1, k = 1, cv = 1, range.ratio.of.means = c(1, 5), pi.type = "upper", conf.level = 0.95, plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer greater than 2 indicating the sample size upon which
the prediction interval is based. The default value is |
df |
positive integer indicating the degrees of freedom associated with
the sample size. The default value is |
n.geomean |
positive integer specifying the sample size associated with the future geometric
mean(s). The default value is |
k |
positive integer specifying the number of future observations that the
prediction interval should contain with confidence level |
cv |
positive value specifying the coefficient of variation for both the population
that was sampled to construct the prediction interval and the population that
will be sampled to produce the future observations. The default value is
|
range.ratio.of.means |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntLnormAltTestPower
for information on how to
compute the power of a hypothesis test for the ratio of two means of lognormal
distributions based on a prediction interval for a lognormal distribution.
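A single point on such a curve can be computed directly; the call below is a minimal sketch (not part of the original examples) and assumes predIntLnormAltTestPower accepts the arguments shown:
# Power of detecting a four-fold increase in the mean with k = 1 future
# observation, n = 8 background observations, cv = 1, and a 95% upper
# prediction limit:
predIntLnormAltTestPower(n = 8, k = 1, ratio.of.means = 4, cv = 1,
  pi.type = "upper", conf.level = 0.95)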
plotPredIntLnormAltTestPowerCurve
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help files for predIntNormTestPower
.
Steven P. Millard ([email protected])
See the help files for predIntNormTestPower
and
tTestLnormAltPower
.
predIntLnormAltTestPower
, predIntLnormAlt
, predIntNorm
, predIntNormK
, plotPredIntNormTestPowerCurve
, predIntLnormAltSimultaneous
, predIntLnormAltSimultaneousTestPower
,
Prediction Intervals, LognormalAlt.
# Plot power vs. ratio of means for k=1 future observation for # various sample sizes using a 5% significance level and assuming cv=1. dev.new() plotPredIntLnormAltTestPowerCurve(n = 8, k = 1, range.ratio.of.means=c(1, 10), ylim = c(0, 1), main = "") plotPredIntLnormAltTestPowerCurve(n = 16, k = 1, range.ratio.of.means = c(1, 10), add = TRUE, plot.col = "red") plotPredIntLnormAltTestPowerCurve(n = 32, k = 1, range.ratio.of.means=c(1, 10), add = TRUE, plot.col = "blue") legend("topleft", c("n=32", "n=16", "n=8"), lty = 1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Ratio of Means for Upper Prediction Interval", "with k=1, Confidence=95%, and Various Sample Sizes", sep="\n")) mtext("Assuming a Lognormal Distribution with CV = 1", line = 0) #========== ## Not run: # Pages 6-16 to 6-17 of USEPA (2009) present EPA Reference Power Curves (ERPC) # for groundwater monitoring: # # "Since effect sizes discussed in the next section often cannot or have not been # quantified, the Unified Guidance recommends using the ERPC as a suitable basis # of comparison for proposed testing procedures. Each reference power curve # corresponds to one of three typical yearly statistical evaluation schedules - # quarterly, semi-annual, or annual - and represents the cumulative power # achievable during a single year at one well-constituent pair by a 99 # (normal) prediction limit based on n = 10 background measurements and one new # measurement from the compliance well. # # Here we will create a variation of Figure 6-3 on page 6-17 based on # using a lognormal distribution and plotting power versus ratio of the # means assuming cv=1. dev.new() plotPredIntLnormAltTestPowerCurve(n = 10, k = 1, cv = 1, conf.level = 0.99, range.ratio.of.means = c(1, 10), ylim = c(0, 1), main="") plotPredIntLnormAltTestPowerCurve(n = 10, k = 2, cv = 1, conf.level = 0.99, range.ratio.of.means = c(1, 10), add = TRUE, plot.col = "red", plot.lty = 2) plotPredIntLnormAltTestPowerCurve(n = 10, k = 4, cv = 1, conf.level = 0.99, range.ratio.of.means = c(1, 10), add = TRUE, plot.col = "blue", plot.lty = 3) legend("topleft", c("Quarterly", "Semi-Annual", "Annual"), lty = 3:1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Ratio of Means for Upper Prediction Interval with", "n=10, Confidence=99%, and Various Sampling Frequencies", sep="\n")) mtext("Assuming a Lognormal Distribution with CV = 1", line = 0) ## End(Not run) #========== # Clean up #--------- graphics.off()
Create plots involving sample size, number of future observations, half-width,
estimated standard deviation, and confidence level for a prediction interval for
the next k observations from a normal distribution.
plotPredIntNormDesign(x.var = "n", y.var = "half.width", range.x.var = NULL, n = 25, k = 1, n.mean = 1, half.width = 4 * sigma.hat, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n |
positive integer greater than 1 indicating the sample size upon
which the prediction interval is based. The default value is |
k |
positive integer specifying the number of future observations
or averages the prediction interval should contain with confidence level
|
n.mean |
positive integer specifying the sample size associated with the |
half.width |
positive scalar indicating the half-widths of the prediction interval.
The default value is |
sigma.hat |
numeric scalar specifying the value of the estimated standard deviation.
The default value is |
method |
character string specifying the method to use if the number of future observations
( |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
round.up |
for the case when |
n.max |
for the case when |
tol |
numeric scalar indicating the tolerance to use in the |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for predIntNorm
, predIntNormK
,
predIntNormHalfWidth
, and predIntNormN
for
information on how to compute a prediction interval for the next
observations or averages from a normal distribution, how the half-width is
computed when other quantities are fixed, and how the
sample size is computed when other quantities are fixed.
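For orientation, here is a minimal sketch (not part of the original examples) of the two helper computations the plot is built on, assuming predIntNormHalfWidth and predIntNormN accept the arguments shown:
# Half-width of a 95% prediction interval for k = 1 future observation
# based on n = 25 observations and an estimated standard deviation of 1:
predIntNormHalfWidth(n = 25, k = 1, sigma.hat = 1, conf.level = 0.95)
# Sample size needed to achieve a half-width of 3 under the same settings:
predIntNormN(half.width = 3, k = 1, sigma.hat = 1, conf.level = 0.95)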
plotPredIntNormDesign
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for predIntNorm
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, confidence level, and half-width
if one of the objectives of the sampling program is to produce prediction intervals.
The functions predIntNormHalfWidth
, predIntNormN
, and
plotPredIntNormDesign
can be used to investigate these relationships for the
case of normally-distributed observations.
Steven P. Millard ([email protected])
See the help file for predIntNorm
.
predIntNorm
, predIntNormK
,
predIntNormHalfWidth
, predIntNormN
,
Normal
.
# Look at the relationship between half-width and sample size for a # prediction interval for k=1 future observation, assuming an estimated # standard deviation of 1 and a confidence level of 95%: dev.new() plotPredIntNormDesign() #========== # Plot sample size vs. the estimated standard deviation for various levels # of confidence, using a half-width of 4: dev.new() plotPredIntNormDesign(x.var = "sigma.hat", y.var = "n", range.x.var = c(1, 2), ylim = c(0, 90), main = "") plotPredIntNormDesign(x.var = "sigma.hat", y.var = "n", range.x.var = c(1, 2), conf.level = 0.9, add = TRUE, plot.col = "red") plotPredIntNormDesign(x.var = "sigma.hat", y.var = "n", range.x.var = c(1, 2), conf.level = 0.8, add = TRUE, plot.col = "blue") legend("topleft", c("95%", "90%", "80%"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Sample Size vs. Sigma Hat for Prediction Interval for", "k=1 Future Obs, Half-Width=4, and Various Confidence Levels", sep = "\n")) #========== # The data frame EPA.92c.arsenic3.df contains arsenic concentrations (ppb) # collected quarterly for 3 years at a background well and quarterly for # 2 years at a compliance well. Using the data from the background well, # plot the relationship between half-width and sample size for a two-sided # 90% prediction interval for k=4 future observations. EPA.92c.arsenic3.df # Arsenic Year Well.type #1 12.6 1 Background #2 30.8 1 Background #3 52.0 1 Background #... #18 3.8 5 Compliance #19 2.6 5 Compliance #20 51.9 5 Compliance mu.hat <- with(EPA.92c.arsenic3.df, mean(Arsenic[Well.type=="Background"])) mu.hat #[1] 27.51667 sigma.hat <- with(EPA.92c.arsenic3.df, sd(Arsenic[Well.type=="Background"])) sigma.hat #[1] 17.10119 dev.new() plotPredIntNormDesign(x.var = "n", y.var = "half.width", range.x.var = c(4, 50), k = 4, sigma.hat = sigma.hat, conf.level = 0.9) #========== # Clean up #--------- rm(mu.hat, sigma.hat) graphics.off()
Plot power vs. the scaled minimal detectable difference for a sampling design for a
test based on a simultaneous prediction interval for a normal distribution.
plotPredIntNormSimultaneousTestPowerCurve(n = 8, df = n - 1, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", range.delta.over.sigma = c(0, 5), pi.type = "upper", conf.level = 0.95, r.shifted = r, K.tol = .Machine$double.eps^(1/2), integrate.args.list = NULL, plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer greater than 2 indicating the sample size upon which
the prediction interval is based. The default value is |
df |
positive integer indicating the degrees of freedom associated with
the sample size. The default value is |
n.mean |
positive integer specifying the sample size associated with the future average(s).
The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
range.delta.over.sigma |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
r.shifted |
positive integer between |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
integrate.args.list |
a list of arguments to supply to the |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNormSimultaneousTestPower
for
information on how to compute the power of a hypothesis test for the difference
between two means of normal distributions based on a simultaneous prediction
interval for a normal distribution.
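The following minimal sketch (not part of the original examples) shows the single-point power computation that each point on the curve is based on, assuming predIntNormSimultaneousTestPower accepts the arguments shown:
# Power of a 1-of-3 resampling plan based on n = 25 background samples and
# r = 2 future sampling occasions when the true mean at the compliance well
# has shifted upward by 3 standard deviations:
predIntNormSimultaneousTestPower(n = 25, k = 1, m = 3, r = 2,
  rule = "k.of.m", delta.over.sigma = 3, pi.type = "upper",
  conf.level = 0.95)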
plotPredIntNormSimultaneousTestPowerCurve
invisibly returns a list with
components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNormSimultaneousTestPower
and plotPredIntNormSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
Steven P. Millard ([email protected])
See the help file for predIntNormSimultaneous
.
predIntNormSimultaneous
, predIntNormSimultaneousK
,
predIntNormSimultaneousTestPower
,
predIntNorm
, predIntNormK
,
predIntNormTestPower
, Prediction Intervals,
Normal.
# USEPA (2009) contains an example on page 19-23 that involves monitoring # nw=100 compliance wells at a large facility with minimal natural spatial # variation every 6 months for nc=20 separate chemicals. # There are n=25 background measurements for each chemical to use to create # simultaneous prediction intervals. We would like to determine which kind of # resampling plan based on normal distribution simultaneous prediction intervals to # use (1-of-m, 1-of-m based on means, or Modified California) in order to have # adequate power of detecting an increase in chemical concentration at any of the # 100 wells while at the same time maintaining a site-wide false positive rate # (SWFPR) of 10% per year over all 4,000 comparisons # (100 wells x 20 chemicals x semi-annual sampling). # The function predIntNormSimultaneousTestPower includes the argument "r" # that is the number of future sampling occasions (r=2 in this case because # we are performing semi-annual sampling), so to compute the individual test # Type I error level alpha.test (and thus the individual test confidence level), # we only need to worry about the number of wells (100) and the number of # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)). The individual # confidence level is simply 1-alpha.test. Plugging in 0.1 for alpha, # 100 for nw, and 20 for nc yields an individual test confidence level of # 1-alpha.test = 0.9999473. nc <- 20 nw <- 100 conf.level <- (1 - 0.1)^(1 / (nc * nw)) conf.level #[1] 0.9999473 # The help file for predIntNormSimultaneousTestPower shows how to # create the results below for various sampling plans: # Rule k m N.Mean K Power Total.Samples #1 k.of.m 1 2 1 3.16 0.39 2 #2 k.of.m 1 3 1 2.33 0.65 3 #3 k.of.m 1 4 1 1.83 0.81 4 #4 Modified.CA 1 4 1 2.57 0.71 4 #5 k.of.m 1 1 2 3.62 0.41 2 #6 k.of.m 1 2 2 2.33 0.85 4 #7 k.of.m 1 1 3 2.99 0.71 3 # The above table shows the K-multipliers for each prediction interval, along with # the power of detecting a change in concentration of three standard deviations at # any of the 100 wells during the course of a year, for each of the sampling # strategies considered. The last three rows of the table correspond to sampling # strategies that involve using the mean of two or three observations. # Here is the power curve for the 1-of-4 sampling strategy: dev.new() plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, xlab = "SD Units Above Background", main = "") title(main = paste( "Power Curve for 1-of-4 Sampling Strategy Based on 25 Background", "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #---------- # Here are the power curves for the first four sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. 
#dev.new() #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, # xlab = "SD Units Above Background", main = "") #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 3, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "red", plot.lty = 2) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "blue", plot.lty = 3) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, r = 2, rule="Modified.CA", # pi.type = "upper", conf.level = conf.level, add = TRUE, plot.col = "green3", # plot.lty = 4) #legend(0, 1, c("1-of-4", "Modified CA", "1-of-3", "1-of-2"), # col = c("black", "green3", "red", "blue"), lty = c(1, 4, 2, 3), # lwd = 3 * par("cex"), bty = "n") #title(main = paste("Power Curves for 4 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #---------- # Here are the power curves for the last 3 sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. #dev.new() #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, n.mean = 2, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, # xlab = "SD Units Above Background", main = "") #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.mean = 2, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "red", plot.lty = 2) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.mean = 3, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "blue", plot.lty = 3) #legend(0, 1, c("1-of-2, Order 2", "1-of-1, Order 3", "1-of-1, Order 2"), # col = c("black", "blue", "red"), lty = c(1, 3, 2), lwd = 3 * par("cex"), # bty="n") #title(main = paste("Power Curves for 3 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #========== # Clean up #--------- rm(nc, nw, conf.level) graphics.off()
# USEPA (2009) contains an example on page 19-23 that involves monitoring # nw=100 compliance wells at a large facility with minimal natural spatial # variation every 6 months for nc=20 separate chemicals. # There are n=25 background measurements for each chemical to use to create # simultaneous prediction intervals. We would like to determine which kind of # resampling plan based on normal distribution simultaneous prediction intervals to # use (1-of-m, 1-of-m based on means, or Modified California) in order to have # adequate power of detecting an increase in chemical concentration at any of the # 100 wells while at the same time maintaining a site-wide false positive rate # (SWFPR) of 10% per year over all 4,000 comparisons # (100 wells x 20 chemicals x semi-annual sampling). # The function predIntNormSimultaneousTestPower includes the argument "r" # that is the number of future sampling occasions (r=2 in this case because # we are performing semi-annual sampling), so to compute the individual test # Type I error level alpha.test (and thus the individual test confidence level), # we only need to worry about the number of wells (100) and the number of # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)). The individual # confidence level is simply 1-alpha.test. Plugging in 0.1 for alpha, # 100 for nw, and 20 for nc yields an individual test confidence level of # 1-alpha.test = 0.9999473. nc <- 20 nw <- 100 conf.level <- (1 - 0.1)^(1 / (nc * nw)) conf.level #[1] 0.9999473 # The help file for predIntNormSimultaneousTestPower shows how to # create the results below for various sampling plans: # Rule k m N.Mean K Power Total.Samples #1 k.of.m 1 2 1 3.16 0.39 2 #2 k.of.m 1 3 1 2.33 0.65 3 #3 k.of.m 1 4 1 1.83 0.81 4 #4 Modified.CA 1 4 1 2.57 0.71 4 #5 k.of.m 1 1 2 3.62 0.41 2 #6 k.of.m 1 2 2 2.33 0.85 4 #7 k.of.m 1 1 3 2.99 0.71 3 # The above table shows the K-multipliers for each prediction interval, along with # the power of detecting a change in concentration of three standard deviations at # any of the 100 wells during the course of a year, for each of the sampling # strategies considered. The last three rows of the table correspond to sampling # strategies that involve using the mean of two or three observations. # Here is the power curve for the 1-of-4 sampling strategy: dev.new() plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, xlab = "SD Units Above Background", main = "") title(main = paste( "Power Curve for 1-of-4 Sampling Strategy Based on 25 Background", "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #---------- # Here are the power curves for the first four sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. 
#dev.new() #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 4, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, # xlab = "SD Units Above Background", main = "") #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 3, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "red", plot.lty = 2) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, r = 2, # rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "blue", plot.lty = 3) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, r = 2, rule="Modified.CA", # pi.type = "upper", conf.level = conf.level, add = TRUE, plot.col = "green3", # plot.lty = 4) #legend(0, 1, c("1-of-4", "Modified CA", "1-of-3", "1-of-2"), # col = c("black", "green3", "red", "blue"), lty = c(1, 4, 2, 3), # lwd = 3 * par("cex"), bty = "n") #title(main = paste("Power Curves for 4 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #---------- # Here are the power curves for the last 3 sampling strategies. # Because this takes several seconds to run, here we have commented out # the R commands. To run this example, just remove the pound signs (#) # from in front of the R commands. #dev.new() #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 2, n.mean = 2, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, # xlab = "SD Units Above Background", main = "") #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.mean = 2, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "red", plot.lty = 2) #plotPredIntNormSimultaneousTestPowerCurve(n = 25, k = 1, m = 1, n.mean = 3, # r = 2, rule="k.of.m", pi.type = "upper", conf.level = conf.level, add = TRUE, # plot.col = "blue", plot.lty = 3) #legend(0, 1, c("1-of-2, Order 2", "1-of-1, Order 3", "1-of-1, Order 2"), # col = c("black", "blue", "red"), lty = c(1, 3, 2), lwd = 3 * par("cex"), # bty="n") #title(main = paste("Power Curves for 3 Sampling Strategies Based on 25 Background", # "Samples, SWFPR=10%, and 2 Future Sampling Periods", sep = "\n")) #========== # Clean up #--------- rm(nc, nw, conf.level) graphics.off()
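As a rough cross-check of the table above, the K-multiplier and power for a single plan can also be computed directly. The following is a minimal sketch, assuming the delta.over.sigma argument and the other argument names documented for predIntNormSimultaneousK and predIntNormSimultaneousTestPower; it should approximately reproduce the 1-of-4 row (K = 1.83, power = 0.81).

nc <- 20
nw <- 100
conf.level <- (1 - 0.1)^(1 / (nc * nw))

# K-multiplier for the 1-of-4 plan based on n = 25 background samples
# and r = 2 future sampling occasions (semi-annual sampling)
predIntNormSimultaneousK(n = 25, k = 1, m = 4, r = 2, rule = "k.of.m",
  pi.type = "upper", conf.level = conf.level)

# Power to detect an increase of 3 standard deviations above background
predIntNormSimultaneousTestPower(n = 25, k = 1, m = 4, r = 2,
  rule = "k.of.m", delta.over.sigma = 3, pi.type = "upper",
  conf.level = conf.level)

rm(nc, nw, conf.level)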
Plot power vs. delta/sigma (scaled minimal detectable difference) for a
sampling design for a test based on a prediction interval for a normal distribution.
plotPredIntNormTestPowerCurve(n = 8, df = n - 1, n.mean = 1, k = 1, range.delta.over.sigma = c(0, 5), pi.type = "upper", conf.level = 0.95, plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer greater than 2 indicating the sample size upon which
the prediction interval is based. The default value is |
df |
positive integer indicating the degrees of freedom associated with
the sample size. The default value is |
n.mean |
positive integer specifying the sample size associated with the future average(s).
The default value is |
k |
positive integer specifying the number of future observations that the
prediction interval should contain with confidence level |
range.delta.over.sigma |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNormTestPower
for information on how to
compute the power of a hypothesis test for the difference between two means of
normal distributions based on a prediction interval for a normal distribution.
plotPredIntNormTestPowerCurve
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help files for predIntNorm
and
predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNormTestPower
and plotPredIntNormTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations. In the case of a simple shift between the two means, the test based
on a prediction interval is not as powerful as the two-sample t-test. However, the
test based on a prediction interval is more efficient at detecting a shift in the
tail.
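Because plotPredIntNormTestPowerCurve invisibly returns the plotted coordinates, individual power values can also be obtained without drawing anything by setting plot.it = FALSE. The following is a minimal sketch using only arguments shown in the Usage section above; the interpolated power at a scaled difference of 3 is approximate because it is based on the 20 plotted points.

# Compute, but do not plot, the power curve based on n = 10 background
# measurements and a 99% upper prediction limit
pow <- plotPredIntNormTestPowerCurve(n = 10, k = 1, conf.level = 0.99,
  plot.it = FALSE)

# Approximate power at delta/sigma = 3, interpolated from the curve
approx(pow$x.var, pow$y.var, xout = 3)$y

rm(pow)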
Steven P. Millard ([email protected])
See the help files for predIntNorm
and
predIntNormSimultaneous
.
predIntNorm
, predIntNormK
,
predIntNormTestPower
, predIntNormSimultaneous
, predIntNormSimultaneousK
,
predIntNormSimultaneousTestPower
, Prediction Intervals,
Normal.
# Pages 6-16 to 6-17 of USEPA (2009) present EPA Reference Power Curves (ERPC) # for groundwater monitoring: # # "Since effect sizes discussed in the next section often cannot or have not been # quantified, the Unified Guidance recommends using the ERPC as a suitable basis # of comparison for proposed testing procedures. Each reference power curve # corresponds to one of three typical yearly statistical evaluation schedules - # quarterly, semi-annual, or annual - and represents the cumulative power # achievable during a single year at one well-constituent pair by a 99% upper # (normal) prediction limit based on n = 10 background measurements and one new # measurement from the compliance well." # # Here we will reproduce Figure 6-3 on page 6-17. dev.new() plotPredIntNormTestPowerCurve(n = 10, k = 1, conf.level = 0.99, ylim = c(0, 1), main="") plotPredIntNormTestPowerCurve(n = 10, k = 2, conf.level = 0.99, add = TRUE, plot.col = "red", plot.lty = 2) plotPredIntNormTestPowerCurve(n = 10, k = 4, conf.level = 0.99, add = TRUE, plot.col = "blue", plot.lty = 3) legend("topleft", c("Quarterly", "Semi-Annual", "Annual"), lty = 3:1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Delta/Sigma for Upper Prediction Interval with", "n=10, Confidence=99%, and Various Sampling Frequencies", sep="\n")) #========== ## Not run: # Plot power vs. scaled minimal detectable difference for various sample sizes # using a 5% significance level. dev.new() plotPredIntNormTestPowerCurve(n = 8, k = 1, ylim = c(0, 1), main="") plotPredIntNormTestPowerCurve(n = 16, k = 1, add = TRUE, plot.col = "red") plotPredIntNormTestPowerCurve(n = 32, k = 1, add = TRUE, plot.col = "blue") legend("bottomright", c("n=32", "n=16", "n=8"), lty = 1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Delta/Sigma for Upper Prediction Interval with", "k=1, Confidence=95%, and Various Sample Sizes", sep="\n")) #========== # Clean up #--------- graphics.off() ## End(Not run)
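Single points on the reference power curves above can also be computed directly with predIntNormTestPower. The following is a minimal sketch, assuming the delta.over.sigma argument documented for that function.

# Power at a true upward shift of 3 standard deviations for the
# annual (k = 1) and quarterly (k = 4) evaluation schedules
predIntNormTestPower(n = 10, k = 1, delta.over.sigma = 3,
  pi.type = "upper", conf.level = 0.99)
predIntNormTestPower(n = 10, k = 4, delta.over.sigma = 3,
  pi.type = "upper", conf.level = 0.99)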
Create plots involving sample size (n), number of future observations
(m), minimum number of future observations the interval should contain
(k), and confidence level ((1-alpha)100%) for a nonparametric prediction
interval.
plotPredIntNparDesign(x.var = "n", y.var = "conf.level", range.x.var = NULL, n = max(25, lpl.rank + n.plus.one.minus.upl.rank + 1), k = 1, m = ifelse(x.var == "k", ceiling(max.x), 1), conf.level = 0.95, pi.type = "two.sided", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), n.max = 5000, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is |
k |
positive integer specifying the minimum number of future observations out of |
m |
positive integer specifying the number of future observations. The default value is
|
conf.level |
numeric scalar between 0 and 1 indicating the confidence level
associated with the prediction interval. The default value is
|
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
lpl.rank |
non-negative integer indicating the rank of the order statistic to use for
the lower bound of the prediction interval. If |
n.plus.one.minus.upl.rank |
non-negative integer related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of
|
n.max |
for the case when |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
plot.it |
a logical scalar indicating whether to create a plot or add to the
existing plot (see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNpar
, predIntNparConfLevel
,
and predIntNparN
for information on how to compute a
nonparametric prediction interval, how the confidence level
is computed when other quantities are fixed, and how the sample size is
computed when other quantities are fixed.
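As a small numerical check, the quantities that this function plots can be computed one at a time. The following is a minimal sketch, assuming the argument names documented for predIntNparConfLevel and predIntNparN.

# Confidence level of a two-sided nonparametric prediction interval for
# the next m = 1 observation, based on the minimum and maximum of
# n = 25 observations; this should equal (n - 1)/(n + 1) = 24/26
predIntNparConfLevel(n = 25, k = 1, m = 1, pi.type = "two.sided")

# Smallest sample size giving at least 95% confidence for the same interval
predIntNparN(k = 1, m = 1, pi.type = "two.sided", conf.level = 0.95)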
plotPredIntNparDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that
have been or would have been plotted.
See the help file for predIntNpar
.
Steven P. Millard ([email protected])
See the help file for predIntNpar
.
predIntNpar
, predIntNparConfLevel
,
predIntNparN
.
# Look at the relationship between confidence level and sample size for a # two-sided nonparametric prediction interval for the next m=1 future observation. dev.new() plotPredIntNparDesign() #========== # Plot confidence level vs. sample size for various values of number of # future observations (m): dev.new() plotPredIntNparDesign(k = 1, m = 1, ylim = c(0, 1), main = "") plotPredIntNparDesign(k = 2, m = 2, add = TRUE, plot.col = "red") plotPredIntNparDesign(k = 3, m = 3, add = TRUE, plot.col = "blue") legend("bottomright", c("m=1", "m=2", "m=3"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Confidence Level vs. Sample Size for Nonparametric PI", "with Various Values of m", sep="\n")) #========== # Example 18-3 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # 4 future observations of trichloroethylene (TCE) at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.3.TCE.df. # There are 6 monthly observations of TCE (ppb) at 3 background wells, # and 4 monthly observations of TCE at a compliance well. # # Modify this example by creating a plot to look at confidence level versus # sample size (i.e., number of observations at the background wells) for # predicting the next m = 4 future observations when constructing a one-sided # upper prediction interval based on the maximum value. dev.new() plotPredIntNparDesign(k = 4, m = 4, pi.type = "upper") #========== # Clean up #--------- graphics.off()
Create plots involving sample size (n), number of future observations
(m), minimum number of future observations the interval should contain
(k), number of future sampling occasions (r), and confidence level
for a simultaneous nonparametric prediction interval.
plotPredIntNparSimultaneousDesign(x.var = "n", y.var = "conf.level", range.x.var = NULL, n = max(25, lpl.rank + n.plus.one.minus.upl.rank + 1), n.median = 1, k = 1, m = ifelse(x.var == "k", ceiling(max.x), 1), r = 2, rule = "k.of.m", conf.level = 0.95, pi.type = "upper", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), n.max = 5000, maxiter = 1000, integrate.args.list = NULL, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is |
n.median |
positive odd integer specifying the sample size associated with the future medians.
The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
conf.level |
numeric scalar between 0 and 1 indicating the confidence level
associated with the prediction interval. The default value is
|
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
lpl.rank |
non-negative integer indicating the rank of the order statistic to use for
the lower bound of the prediction interval. If |
n.plus.one.minus.upl.rank |
non-negative integer related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of
|
n.max |
numeric scalar indicating the maximum sample size to consider when |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
integrate.args.list |
list of arguments to supply to the |
plot.it |
a logical scalar indicating whether to create a plot or add to the
existing plot (see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNparSimultaneous
,
predIntNparSimultaneousConfLevel
, and predIntNparSimultaneousN
for information on how to compute a
simultaneous nonparametric prediction interval, how the confidence level
is computed when other quantities are fixed, and how the sample size is
computed when other quantities are fixed.
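The quantities plotted by this function can likewise be computed one at a time. A minimal sketch, assuming the argument names documented for predIntNparSimultaneousConfLevel and predIntNparSimultaneousN:

# Confidence level of an upper simultaneous nonparametric prediction
# interval based on the maximum of n = 20 background observations,
# using the 1-of-3 rule at each of r = 20 future sampling occasions
predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 3, r = 20,
  rule = "k.of.m", pi.type = "upper")

# Sample size needed to achieve 95% confidence for the same design
predIntNparSimultaneousN(k = 1, m = 3, r = 20, rule = "k.of.m",
  pi.type = "upper", conf.level = 0.95)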
plotPredIntNparSimultaneousDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that
have been or would have been plotted.
See the help file for predIntNparSimultaneous
.
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
predIntNparSimultaneous
,
predIntNparSimultaneousConfLevel
, predIntNparSimultaneousN
,
predIntNparSimultaneousTestPower
,
predIntNpar
, tolIntNpar
.
# For the 1-of-3 rule with r=20 future sampling occasions, look at the # relationship between confidence level and sample size for a one-sided # upper simultaneous nonparametric prediction interval. dev.new() plotPredIntNparSimultaneousDesign(k = 1, m = 3, r = 20, range.x.var = c(2, 20)) #========== # Plot confidence level vs. sample size for various values of number of # future sampling occasions (r): dev.new() plotPredIntNparSimultaneousDesign(m = 3, r = 10, rule = "CA", ylim = c(0, 1), main = "") plotPredIntNparSimultaneousDesign(m = 3, r = 20, rule = "CA", add = TRUE, plot.col = "red") plotPredIntNparSimultaneousDesign(m = 3, r = 30, rule = "CA", add = TRUE, plot.col = "blue") legend("bottomright", c("r=10", "r=20", "r=30"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Confidence Level vs. Sample Size for Simultaneous", "Nonparametric PI with Various Values of r", sep="\n")) #========== # Modifying Example 19-5 of USEPA (2009, p. 19-33), plot confidence level # versus sample size (number of background observations required) for # a 1-of-3 plan assuming r = 10 compliance wells (future sampling occasions). dev.new() plotPredIntNparSimultaneousDesign(k = 1, m = 3, r = 10, rule = "k.of.m") #========== # Clean up #--------- graphics.off()
Plot power vs. delta/sigma (scaled minimal detectable difference) for a
sampling design for a test based on a nonparametric simultaneous prediction
interval. The power is based on assuming the true distribution of the
observations is normal.
plotPredIntNparSimultaneousTestPowerCurve(n = 8, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "upper", r.shifted = r, integrate.args.list = NULL, method = "approx", NMC = 100, range.delta.over.sigma = c(0, 5), plot.it = TRUE, add = FALSE, n.points = 20, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
n |
positive integer specifying the sample sizes. |
n.median |
positive odd integer specifying the sample size associated with the
future medians. The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
non-negative integer indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
non-negative integer related to the rank of the order statistic to use for
the upper
bound of the prediction interval. A value of |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
r.shifted |
integer between |
integrate.args.list |
list of arguments to supply to the |
method |
character string indicating what method to use to compute the power. The possible
values are |
NMC |
positive integer indicating the number of Monte Carlo trials to run when |
range.delta.over.sigma |
numeric vector of length 2 indicating the range of the x-variable to use for the
plot. The default value is |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for predIntNparSimultaneousTestPower
for
information on how to compute the power of a hypothesis test for the difference
between two means of normal distributions based on a nonparametric simultaneous
prediction interval.
plotPredIntNparSimultaneousTestPowerCurve
invisibly returns a list with
components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for predIntNparSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNparSimultaneousTestPower
and plotPredIntNparSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
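A single point on one of these power curves can also be computed directly. The following is a minimal sketch, assuming the delta.over.sigma argument and the other argument names documented for predIntNparSimultaneousTestPower.

# Approximate power of the 1-of-4 rule, with the 3rd largest of n = 20
# background observations as the upper prediction limit, to detect an
# upward shift of 3 standard deviations at 1 of r = 10 future sampling
# occasions (normal observations assumed)
predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = 10,
  rule = "k.of.m", n.plus.one.minus.upl.rank = 3, pi.type = "upper",
  r.shifted = 1, delta.over.sigma = 3, method = "approx")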
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
Gansecki, M. (2009). Using the Optimal Rank Values Calculator. US Environmental Protection Agency, Region 8, March 10, 2009.
predIntNparSimultaneousTestPower
,
predIntNparSimultaneous
,
predIntNparSimultaneousN
,
predIntNparSimultaneousConfLevel
,
plotPredIntNparSimultaneousDesign
,
predIntNpar
, tolIntNpar
.
# Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year (i.e., r = 10 * 1 = 10). # Here we will reproduce Figure 19-2 on page 19-35. This figure plots the # power of the nonparametric simultaneous prediction interval for 6 different # plans: # Rule Median.n k m Order.Statistic Achieved.alpha BG.Limit #1) k.of.m 1 1 3 Max 0.0055 0.28 #2) k.of.m 1 1 4 Max 0.0009 0.28 #3) Modified.CA 1 1 4 Max 0.0140 0.28 #4) k.of.m 3 1 2 Max 0.0060 0.28 #5) k.of.m 1 1 4 2nd 0.0046 0.25 #6) k.of.m 1 1 4 3rd 0.0135 0.24 # Here is the power curve for the 1-of-4 sampling strategy. dev.new() plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 4, r = 10, rule = "k.of.m", n.plus.one.minus.upl.rank = 3, pi.type = "upper", r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), main = "") title(main = paste( "Power Curve for Nonparametric 1-of-4 Sampling Strategy Based on", "20 Background Samples, SWFPR=10%, and 10 Future Sampling Occasions", sep = "\n"), cex.main = 1.1) #---------- # Here are the power curves for all 6 sampling strategies. # Because these take several seconds to create, here we have commented out # the R commands. To run this example, just remove the pound signs (#) from # in front of the R commands.
#dev.new() #plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 4, r = 10, # rule = "k.of.m", n.plus.one.minus.upl.rank = 3, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), main = "") #plotPredIntNparSimultaneousTestPowerCurve(n = 20, n.median = 3, k = 1, m = 2, # r = 10, rule = "k.of.m", n.plus.one.minus.upl.rank = 1, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), # add = TRUE, plot.col = 2, plot.lty = 2) #plotPredIntNparSimultaneousTestPowerCurve(n = 20, r = 10, rule = "Modified.CA", # n.plus.one.minus.upl.rank = 1, pi.type = "upper", r.shifted = 1, # method = "approx", range.delta.over.sigma = c(0, 5), add = TRUE, # plot.col = 3, plot.lty = 3) #plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 4, r = 10, # rule = "k.of.m", n.plus.one.minus.upl.rank = 2, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), # add = TRUE, plot.col = 4, plot.lty = 4) #plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 3, r = 10, # rule = "k.of.m", n.plus.one.minus.upl.rank = 1, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), # add = TRUE, plot.col = 5, plot.lty = 5) #plotPredIntNparSimultaneousTestPowerCurve(n = 20, k = 1, m = 4, r = 10, # rule = "k.of.m", n.plus.one.minus.upl.rank = 1, pi.type = "upper", # r.shifted = 1, method = "approx", range.delta.over.sigma = c(0, 5), # add = TRUE, plot.col = 6, plot.lty = 6) #legend("topleft", legend = c("1-of-4, 3rd", "1-of-2, Max, Median", "Mod CA", # "1-of-4, 2nd", "1-of-3, Max", "1-of-4, Max"), lwd = 3 * par("cex"), # col = 1:6, lty = 1:6, bty = "n") #title(main = "Figure 19-2. Comparison of Full Power Curves") #========== # Clean up #--------- graphics.off()
Create plots involving sample size, power, difference, and significance level for a one- or two-sample proportion test.
plotPropTestDesign(x.var = "n", y.var = "power", range.x.var = NULL, n.or.n1 = 25, n2 = n.or.n1, ratio = 1, p.or.p1 = switch(alternative, greater = 0.6, less = 0.4, two.sided = ifelse(two.sided.direction == "greater", 0.6, 0.4)), p0.or.p2 = 0.5, alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2) || !missing(ratio), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = TRUE, correct = sample.type == "two.sample", round.up = FALSE, warn = TRUE, n.min = 2, n.max = 10000, tol.alpha = 0.1 * alpha, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 50, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n.or.n1 |
numeric scalar indicating the sample size. The default value is
|
n2 |
numeric scalar indicating the sample size for group 2. The default value
is the value of |
ratio |
numeric vector indicating the ratio of sample size in group 2 to sample
size in group 1 |
p.or.p1 |
numeric vector of proportions. When |
p0.or.p2 |
numeric vector of proportions. When |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level associated
with the hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power associated with the
hypothesis test. The default value is |
sample.type |
character string indicating whether the design is based on a one-sample or
two-sample proportion test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible
values are |
two.sided.direction |
character string indicating the direction (positive or negative) for the minimal
detectable difference when |
approx |
logical scalar indicating whether to compute the power, sample size, or minimal
detectable difference based on the normal approximation to the binomial distribution.
The default value is |
correct |
logical scalar indicating whether to use the continuity correction when |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next smallest integer. The default value is |
warn |
logical scalar indicating whether to issue a warning. The default value is |
n.min |
integer relevant to the case when |
n.max |
integer relevant to the case when |
tol.alpha |
numeric vector relevant to the case when |
tol |
numeric scalar relevant to the case when |
maxiter |
integer relevant to the case when |
plot.it |
a logical scalar indicating whether to create a new plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for propTestPower
, propTestN
, and
propTestMdd
for information on how to compute the power, sample size,
or minimal detectable difference for a one- or two-sample proportion test.
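The quantities plotted by this function can be computed individually as well. A minimal sketch, assuming the argument names documented for propTestPower and propTestN:

# Power of a one-sided, one-sample proportion test of H0: p <= 0.1
# versus H1: p > 0.1 based on n = 40 observations, assuming the true
# proportion is 0.3 and a 5% Type I error level
propTestPower(n.or.n1 = 40, p.or.p1 = 0.3, p0.or.p2 = 0.1, alpha = 0.05,
  sample.type = "one.sample", alternative = "greater")

# Sample size needed for 90% power under the same assumptions
propTestN(p.or.p1 = 0.3, p0.or.p2 = 0.1, alpha = 0.05, power = 0.9,
  sample.type = "one.sample", alternative = "greater")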
plotPropTestDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that have
been or would have been plotted.
See the help files for propTestPower
, propTestN
, and
propTestMdd
.
Steven P. Millard ([email protected])
See the help files for propTestPower
, propTestN
, and
propTestMdd
.
propTestPower
, propTestN
,
propTestMdd
, Binomial,
binom.test
, prop.test
.
# Look at the relationship between power and sample size for a # one-sample proportion test, assuming the true proportion is 0.6, the # hypothesized proportion is 0.5, and a 5% significance level. # Compute the power based on the normal approximation to the binomial # distribution. dev.new() plotPropTestDesign() #---------- # For a two-sample proportion test, plot sample size vs. the minimal detectable # difference for various levels of power, using a 5% significance level and a # two-sided alternative: dev.new() plotPropTestDesign(x.var = "delta", y.var = "n", sample.type = "two", ylim = c(0, 2800), main="") plotPropTestDesign(x.var = "delta", y.var = "n", sample.type = "two", power = 0.9, add = TRUE, plot.col = "red") plotPropTestDesign(x.var = "delta", y.var = "n", sample.type = "two", power = 0.8, add = TRUE, plot.col = "blue") legend("topright", c("95%", "90%", "80%"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Sample Size vs. Minimal Detectable Difference for Two-Sample", "Proportion Test with p2=0.5, Alpha=0.05 and Various Powers", sep = "\n")) #========== # Example 22-3 on page 22-20 of USEPA (2009) involves determining whether more than # 10% of chlorine gas containers are stored at pressures above a compliance limit. # We want to test the one-sided null hypothesis that 10% or fewer of the containers # are stored at pressures greater than the compliance limit versus the alternative # that more than 10% are stored at pressures greater than the compliance limit. # We want to have at least 90% power of detecting a true proportion of 30% or # greater, using a 5% Type I error level. # Here we will modify this example and create a plot of power versus # sample size for various assumed minimal detectable differences, # using a 5% Type I error level. dev.new() plotPropTestDesign(x.var = "n", y.var = "power", sample.type = "one", alternative = "greater", p0.or.p2 = 0.1, p.or.p1 = 0.25, range.x.var = c(20, 50), ylim = c(0.6, 1), main = "") plotPropTestDesign(x.var = "n", y.var = "power", sample.type = "one", alternative = "greater", p0.or.p2 = 0.1, p.or.p1 = 0.3, range.x.var = c(20, 50), add = TRUE, plot.col = "red") plotPropTestDesign(x.var = "n", y.var = "power", sample.type = "one", alternative = "greater", p0.or.p2 = 0.1, p.or.p1 = 0.35, range.x.var = c(20, 50), add = TRUE, plot.col = "blue") legend("bottomright", c("p=0.35", "p=0.3", "p=0.25"), lty = 1, lwd = 3 * par("cex"), col = c("blue", "red", "black"), bty = "n") title(main = paste("Power vs. Sample Size for One-Sided One-Sample Proportion", "Test with p0=0.1, Alpha=0.05 and Various Detectable Differences", sep = "\n")) #========== # Clean up #--------- graphics.off()
Create plots involving sample size, half-width, estimated standard deviation, coverage, and confidence level for a tolerance interval for a normal distribution.
plotTolIntNormDesign(x.var = "n", y.var = "half.width", range.x.var = NULL, n = 25, half.width = ifelse(x.var == "sigma.hat", 3 * max.x, 3 * sigma.hat), sigma.hat = 1, coverage = 0.95, conf.level = 0.95, cov.type = "content", round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 100, plot.col = 1, plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis. Possible values
are |
y.var |
character string indicating what variable to use for the y-axis. Possible values
are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot.
The default value depends on the value of |
n |
positive integer greater than 1 indicating the sample size upon
which the tolerance interval is based. The default value is |
half.width |
positive scalar indicating the half-width of the prediction interval.
The default value depends on the value of |
sigma.hat |
numeric scalar specifying the value of the estimated standard deviation.
The default value is |
coverage |
numeric scalar between 0 and 1 indicating the desired coverage of the
tolerance interval. The default value is |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level of the
tolerance interval. The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval. The
possible values are |
round.up |
for the case when |
n.max |
for the case when |
tol |
for the case when |
maxiter |
for the case when |
plot.it |
a logical scalar indicating whether to create a plot or add to the existing plot
(see explanation of the argument |
add |
a logical scalar indicating whether to add the design plot to the existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted line or points. The default value
is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for tolIntNorm
, tolIntNormK
,
tolIntNormHalfWidth
, and tolIntNormN
for information
on how to compute a tolerance interval for a normal distribution, how the
half-width is computed when other quantities are fixed, and how the sample size
is computed when other quantities are fixed.
plotTolIntNormDesign
invisibly returns a list with components:
x.var |
x-coordinates of points that have been or would have been plotted. |
y.var |
y-coordinates of points that have been or would have been plotted. |
See the help file for tolIntNorm
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, confidence level, and half-width
if one of the objectives of the sampling program is to produce tolerance intervals.
The functions tolIntNormHalfWidth
, tolIntNormN
, and
plotTolIntNormDesign
can be used to investigate these relationships for the
case of normally-distributed observations.
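The half-width and sample-size calculations that underlie these plots can also be run individually. A minimal sketch, assuming the argument names documented for tolIntNormHalfWidth and tolIntNormN:

# Expected half-width of a two-sided 95%-content, 95%-confidence
# tolerance interval based on n = 25 observations and an estimated
# standard deviation of 1
tolIntNormHalfWidth(n = 25, sigma.hat = 1, coverage = 0.95,
  conf.level = 0.95)

# Sample size needed to achieve a half-width of 2.5 under the same
# assumptions
tolIntNormN(half.width = 2.5, sigma.hat = 1, coverage = 0.95,
  conf.level = 0.95)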
Steven P. Millard ([email protected])
See the help file for tolIntNorm
.
tolIntNorm
, tolIntNormK
,
tolIntNormN
, plotTolIntNormDesign
,
Normal
.
# Look at the relationship between half-width and sample size for a # 95% beta-content tolerance interval, assuming an estimated standard # deviation of 1 and a confidence level of 95%: dev.new() plotTolIntNormDesign() #========== # Plot half-width vs. coverage for various levels of confidence: dev.new() plotTolIntNormDesign(x.var = "coverage", y.var = "half.width", ylim = c(0, 3.5), main="") plotTolIntNormDesign(x.var = "coverage", y.var = "half.width", conf.level = 0.9, add = TRUE, plot.col = "red") plotTolIntNormDesign(x.var = "coverage", y.var = "half.width", conf.level = 0.8, add = TRUE, plot.col = "blue") legend("topleft", c("95%", "90%", "80%"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Half-Width vs. Coverage for Tolerance Interval", "with Sigma Hat=1 and Various Confidence Levels", sep = "\n")) #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. # Here we will first take the log of the data and then estimate the # standard deviation based on the two background wells. We will use this # estimate of standard deviation to plot the half-widths of # future tolerance intervals on the log-scale for various sample sizes. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 summary.stats <- summaryStats(log(Chrysene.ppb) ~ Well.type, data = EPA.09.Ex.17.3.chrysene.df) summary.stats # N Mean SD Median Min Max #Background 8 2.5086 0.6279 2.4359 1.7405 3.6687 #Compliance 12 3.4173 0.4361 3.4111 2.7081 4.2195 sigma.hat <- summary.stats["Background", "SD"] sigma.hat #[1] 0.6279 dev.new() plotTolIntNormDesign(x.var = "n", y.var = "half.width", range.x.var = c(5, 40), sigma.hat = sigma.hat, cex.main = 1) #========== # Clean up #--------- rm(summary.stats, sigma.hat) graphics.off()
Create plots involving sample size (n), coverage, and confidence level for a nonparametric tolerance interval.
plotTolIntNparDesign(x.var = "n", y.var = "conf.level", range.x.var = NULL, n = 25, coverage = 0.95, conf.level = 0.95, ti.type = "two.sided", cov.type = "content", ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), plot.it = TRUE, add = FALSE, n.points = 100, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis. Possible values are
|
y.var |
character string indicating what variable to use for the y-axis. Possible values are
|
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use for the plot. The
default value depends on the value of |
n |
numeric scalar indicating the sample size. The default value is |
coverage |
numeric scalar between 0 and 1 specifying the coverage of the tolerance interval. The default
value is |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance interval.
The default value is |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ltl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
plot.it |
a logical scalar indicating whether to create a plot or add to the
existing plot (see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help file for tolIntNpar
, tolIntNparConfLevel
,
tolIntNparCoverage
, and tolIntNparN
for information on how
to compute a nonparametric tolerance interval, how the confidence level
is computed when other quantities are fixed, how the coverage is computed when other
quantities are fixed, and how the sample size is computed when other quantities are fixed.
plotTolIntNparDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that
have been or would have been plotted.
See the help file for tolIntNpar
.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, coverage, and confidence level if one of the objectives of
the sampling program is to produce tolerance intervals. The functions
tolIntNparN
, tolIntNparCoverage
, tolIntNparConfLevel
, and
plotTolIntNparDesign
can be used to investigate these relationships for
constructing nonparametric tolerance intervals.
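For example, a minimal sketch of the two most common design calculations (it assumes the EnvStats package is attached; the numerical comments follow from the fact that, for an upper limit based on the sample maximum, the confidence level is 1 - coverage^n):

# Sample size needed for a one-sided upper nonparametric tolerance limit
# (the sample maximum) with 95% coverage and 95% confidence:
tolIntNparN(coverage = 0.95, conf.level = 0.95, ti.type = "upper")
# smallest n with 1 - 0.95^n >= 0.95, i.e., n = 59

# Confidence level actually achieved by n = 24 observations:
tolIntNparConfLevel(n = 24, coverage = 0.95, ti.type = "upper")
# 1 - 0.95^24, roughly 0.71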
Steven P. Millard ([email protected])
See the help file for tolIntNpar
.
tolIntNpar
, tolIntNparConfLevel
, tolIntNparCoverage
,
tolIntNparN
.
# Look at the relationship between confidence level and sample size for a two-sided # nonparametric tolerance interval. dev.new() plotTolIntNparDesign() #========== # Plot confidence level vs. sample size for various values of coverage: dev.new() plotTolIntNparDesign(coverage = 0.7, ylim = c(0,1), main = "") plotTolIntNparDesign(coverage = 0.8, add = TRUE, plot.col = "red") plotTolIntNparDesign(coverage = 0.9, add = TRUE, plot.col = "blue") legend("bottomright", c("coverage = 70%", "coverage = 80%", "coverage = 90%"), lty=1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Confidence Level vs. Sample Size for Nonparametric TI", "with Various Levels of Coverage", sep = "\n")) #========== # Example 17-4 on page 17-21 of USEPA (2009) uses copper concentrations (ppb) from 3 # background wells to set an upper limit for 2 compliance wells. There are 6 observations # per well, and the maximum value from the 3 wells is set to the 95% confidence upper # tolerance limit, and we need to determine the coverage of this tolerance interval. tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper") #[1] 0.8826538 # Here we will modify the example and look at confidence level versus coverage for # a set sample size of n = 24. dev.new() plotTolIntNparDesign(x.var = "coverage", y.var = "conf.level", n = 24, ti.type = "upper") #========== # Clean up #--------- graphics.off()
Create plots involving sample size, power, scaled difference, and significance level for a one- or two-sample t-test.
plotTTestDesign(x.var = "n", y.var = "power", range.x.var = NULL, n.or.n1 = 25, n2 = n.or.n1, delta.over.sigma = switch(alternative, greater = 0.5, less = -0.5, two.sided = ifelse(two.sided.direction == "greater", 0.5, -0.5)), alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 50, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of |
n.or.n1 |
numeric scalar indicating the sample size. The default value is
|
n2 |
numeric scalar indicating the sample size for group 2. The default value
is the value of |
delta.over.sigma |
numeric scalar specifying the ratio of the true difference ( |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level associated
with the hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power associated with the
hypothesis test. The default value is |
sample.type |
character string indicating whether the design is based on a one-sample or
two-sample t-test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible
values are |
two.sided.direction |
character string indicating the direction (positive or negative) for the scaled
minimal detectable difference when |
approx |
logical scalar indicating whether to compute the power based on an approximation
to the non-central t-distribution. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next smallest integer. The default value is |
n.max |
for the case when |
tol |
numeric scalar relevant to the case when |
maxiter |
numeric scalar relevant to the case when |
plot.it |
a logical scalar indicating whether to create a new plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for tTestPower
, tTestN
, and
tTestScaledMdd
for information on how to compute the power,
sample size, or scaled minimal detectable difference for a one- or two-sample
t-test.
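As a small sketch of how these pieces fit together (assuming EnvStats is attached; plotTTestDesign essentially evaluates these functions over a grid of x-values):

# Sample size per group for a two-sample t-test to detect a scaled
# difference (delta/sigma) of 0.5 at a 5% significance level with 95% power:
tTestN(delta.over.sigma = 0.5, alpha = 0.05, power = 0.95,
  sample.type = "two.sample")
# roughly 105 per group

# Power achieved by n = 25 per group for the same scaled difference:
tTestPower(n.or.n1 = 25, delta.over.sigma = 0.5, alpha = 0.05,
  sample.type = "two.sample")
# roughly 0.4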
plotTTestDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that have
been or would have been plotted.
See the help files for tTestPower
, tTestN
, and
tTestScaledMdd
.
Steven P. Millard ([email protected])
See the help files for tTestPower
, tTestN
, and
tTestScaledMdd
.
tTestPower
, tTestN
,
tTestScaledMdd
, t.test
.
# Look at the relationship between power and sample size for a two-sample t-test, # assuming a scaled difference of 0.5 and a 5% significance level: dev.new() plotTTestDesign(sample.type = "two") #---------- # For a two-sample t-test, plot sample size vs. the scaled minimal detectable # difference for various levels of power, using a 5% significance level: dev.new() plotTTestDesign(x.var = "delta.over.sigma", y.var = "n", sample.type = "two", ylim = c(0, 110), main="") plotTTestDesign(x.var = "delta.over.sigma", y.var = "n", sample.type = "two", power = 0.9, add = TRUE, plot.col = "red") plotTTestDesign(x.var = "delta.over.sigma", y.var = "n", sample.type = "two", power = 0.8, add = TRUE, plot.col = "blue") legend("topright", c("95%", "90%", "80%"), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Sample Size vs. Scaled Difference for", "Two-Sample t-Test, with Alpha=0.05 and Various Powers", sep="\n")) #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), look at # power versus scaled minimal detectable difference for various sample # sizes in the context of the problem of using a one-sample t-test to # compare the mean for the well with the MCL of 7 ppb. Use alpha = 0.01, # assume an upper one-sided alternative (i.e., compliance well mean larger # than 7 ppb). dev.new() plotTTestDesign(x.var = "delta.over.sigma", y.var = "power", range.x.var = c(0.5, 2), n.or.n1 = 8, alpha = 0.01, alternative = "greater", ylim = c(0, 1), main = "") plotTTestDesign(x.var = "delta.over.sigma", y.var = "power", range.x.var = c(0.5, 2), n.or.n1 = 6, alpha = 0.01, alternative = "greater", add = TRUE, plot.col = "red") plotTTestDesign(x.var = "delta.over.sigma", y.var = "power", range.x.var = c(0.5, 2), n.or.n1 = 4, alpha = 0.01, alternative = "greater", add = TRUE, plot.col = "blue") legend("topleft", paste("N =", c(8, 6, 4)), lty = 1, lwd = 3 * par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Power vs. Scaled Difference for One-Sample t-Test", "with Alpha=0.01 and Various Sample Sizes", sep="\n")) #========== # Clean up #--------- graphics.off()
Create plots involving sample size, power, ratio of means, coefficient of variation, and significance level for a one- or two-sample t-test, assuming lognormal data.
plotTTestLnormAltDesign(x.var = "n", y.var = "power", range.x.var = NULL, n.or.n1 = 25, n2 = n.or.n1, ratio.of.means = switch(alternative, greater = 2, less = 0.5, two.sided = ifelse(two.sided.direction == "greater", 2, 0.5)), cv = 1, alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, round.up = FALSE, n.max = 5000, tol = 1e-07, maxiter = 1000, plot.it = TRUE, add = FALSE, n.points = 50, plot.col = "black", plot.lwd = 3 * par("cex"), plot.lty = 1, digits = .Options$digits, cex.main = par("cex"), ..., main = NULL, xlab = NULL, ylab = NULL, type = "l")
x.var |
character string indicating what variable to use for the x-axis.
Possible values are |
y.var |
character string indicating what variable to use for the y-axis.
Possible values are |
range.x.var |
numeric vector of length 2 indicating the range of the x-variable to use
for the plot. The default value depends on the value of
|
n.or.n1 |
numeric scalar indicating the sample size. The default value is
|
n2 |
numeric scalar indicating the sample size for group 2. The default value
is the value of |
ratio.of.means |
numeric scalar specifying the ratio of the first mean to the second mean. When
When |
cv |
numeric scalar: a positive value specifying the coefficient of
variation. When |
alpha |
numeric scalar between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric scalar between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
two.sided.direction |
character string indicating the direction (greater than 1 or less than 1) for the
detectable ratio of means when |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
round.up |
logical scalar indicating whether to round up the values of the computed
sample size(s) to the next smallest integer. The default value is
|
n.max |
for the case when |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
plot.it |
a logical scalar indicating whether to create a new plot or add to the existing plot
(see |
add |
a logical scalar indicating whether to add the design plot to the
existing plot ( |
n.points |
a numeric scalar specifying how many (x,y) pairs to use to produce the plot.
There are |
plot.col |
a numeric scalar or character string determining the color of the plotted
line or points. The default value is |
plot.lwd |
a numeric scalar determining the width of the plotted line. The default value is
|
plot.lty |
a numeric scalar determining the line type of the plotted line. The default value is
|
digits |
a scalar indicating how many significant digits to print out on the plot. The default
value is the current setting of |
cex.main , main , xlab , ylab , type , ...
|
additional graphical parameters (see |
See the help files for tTestLnormAltPower
,
tTestLnormAltN
, and tTestLnormAltRatioOfMeans
for
information on how to compute the power, sample size, or ratio of means for a
one- or two-sample t-test assuming lognormal data.
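A brief sketch of the underlying calls (assuming EnvStats is attached; output is omitted since the exact values depend on whether the exact or approximate power computation is used):

# Sample size per group for a two-sample test on lognormal data to detect
# a doubling of the mean (ratio.of.means = 2) when the coefficient of
# variation is 1, using a 5% significance level and 90% power:
tTestLnormAltN(ratio.of.means = 2, cv = 1, alpha = 0.05, power = 0.9,
  sample.type = "two.sample")

# Power of a one-sample test with n = 10 for the same ratio and cv:
tTestLnormAltPower(n.or.n1 = 10, ratio.of.means = 2, cv = 1, alpha = 0.05,
  sample.type = "one.sample", alternative = "greater")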
plotTTestLnormAltDesign
invisibly returns a list with components
x.var
and y.var
, giving coordinates of the points that have
been or would have been plotted.
See the help files for tTestLnormAltPower
,
tTestLnormAltN
, and tTestLnormAltRatioOfMeans
.
Steven P. Millard ([email protected])
See the help files for tTestLnormAltPower
,
tTestLnormAltN
, and tTestLnormAltRatioOfMeans
.
tTestLnormAltPower
, tTestLnormAltN
,
tTestLnormAltRatioOfMeans
, t.test
.
# Look at the relationship between power and sample size for a two-sample t-test, # assuming lognormal data, a ratio of means of 2, a coefficient of variation # of 1, and a 5% significance level: dev.new() plotTTestLnormAltDesign(sample.type = "two") #---------- # For a two-sample t-test based on lognormal data, plot sample size vs. the # minimal detectable ratio for various levels of power, assuming a coefficient # of variation of 1 and using a 5% significance level: dev.new() plotTTestLnormAltDesign(x.var = "ratio.of.means", y.var = "n", range.x.var = c(1.5, 2), sample.type = "two", ylim = c(20, 120), main="") plotTTestLnormAltDesign(x.var = "ratio.of.means", y.var = "n", range.x.var = c(1.5, 2), sample.type="two", power = 0.9, add = TRUE, plot.col = "red") plotTTestLnormAltDesign(x.var = "ratio.of.means", y.var = "n", range.x.var = c(1.5, 2), sample.type="two", power = 0.8, add = TRUE, plot.col = "blue") legend("topright", c("95%", "90%", "80%"), lty=1, lwd = 3*par("cex"), col = c("black", "red", "blue"), bty = "n") title(main = paste("Sample Size vs. Ratio of Lognormal Means for", "Two-Sample t-Test, with CV=1, Alpha=0.05 and Various Powers", sep="\n")) #========== # The guidance document Soil Screening Guidance: Technical Background Document # (USEPA, 1996c, Part 4) discusses sampling design and sample size calculations # for studies to determine whether the soil at a potentially contaminated site # needs to be investigated for possible remedial action. Let 'theta' denote the # average concentration of the chemical of concern. The guidance document # establishes the following goals for the decision rule (USEPA, 1996c, p.87): # # Pr[Decide Don't Investigate | theta > 2 * SSL] = 0.05 # # Pr[Decide to Investigate | theta <= (SSL/2)] = 0.2 # # where SSL denotes the pre-established soil screening level. # # These goals translate into a Type I error of 0.2 for the null hypothesis # # H0: [theta / (SSL/2)] <= 1 # # and a power of 95% for the specific alternative hypothesis # # Ha: [theta / (SSL/2)] = 4 # # Assuming a lognormal distribution, a coefficient of variation of 2, and the above # values for Type I error and power, create a performance goal diagram # (USEPA, 1996c, p.89) showing the power of a one-sample test versus the minimal # detectable ratio of theta/(SSL/2) when the sample size is 6 and the exact power # calculations are used. dev.new() plotTTestLnormAltDesign(x.var = "ratio.of.means", y.var = "power", range.x.var = c(1, 5), n.or.n1 = 6, cv = 2, alpha = 0.2, alternative = "greater", approx = FALSE, ylim = c(0.2, 1), xlab = "theta / (SSL/2)") #========== # Clean up #--------- graphics.off()
Computes pointwise confidence limits for predictions computed by the function
predict
.
pointwise(results.predict, coverage = 0.99, simultaneous = FALSE, individual = FALSE)
results.predict |
output from a call to |
coverage |
optional numeric scalar between 0 and 1 indicating the confidence level associated with the confidence limits.
The default value is |
simultaneous |
optional logical scalar indicating whether to base the confidence limits for the
predicted values on simultaneous or non-simultaneous prediction limits.
The default value is |
individual |
optional logical scalar indicating whether to base the confidence intervals for the
predicted values on prediction limits for the mean ( |
This function computes pointwise confidence limits for predictions computed by the
function predict
. The limits are computed at those points specified by the argument
newdata
of predict
.
The predict
function is a generic function with methods for several
different classes. The function pointwise
was part of the S language.
The modifications to pointwise
in the package EnvStats involve confidence
limits for predictions for a linear model (i.e., an object of class "lm"
).
Confidence Limits for a Predicted Mean Value (individual=FALSE
).
Consider a standard linear model with one or more predictor variables.
Often, one of the major goals of regression analysis is to predict a future
value of the response variable given known values of the predictor variables.
The equations for the predicted mean value of the response given
fixed values of the predictor variables as well as the equation for a
two-sided (1-alpha)100% confidence interval for the mean value of the
response can be found in Draper and Smith (1998, p.80) and
Millard and Neerchal (2001, p.547).
Technically, this formula is a confidence interval for the mean of
the response for one set of fixed values of the predictor variables and
corresponds to the case when simultaneous=FALSE
. To create simultaneous
confidence intervals over the range of the predictor variables,
the critical t-value in the equation has to be replaced with a critical
F-value and the modified formula is given in Draper and Smith (1998, p. 83),
Miller (1981a, p. 111), and Millard and Neerchal (2001, p. 547).
This formula is used in the case when simultaneous=TRUE
.
Confidence Limits for a Predicted Individual Value (individual=TRUE
).
In the above section we discussed how to create a confidence interval for
the mean of the response given fixed values for the predictor variables.
If instead we want to create a prediction interval for a single
future observation of the response variable, the formula is given in
Miller (1981a, p. 115) and Millard and Neerchal (2001, p. 551).
Technically, this formula is a prediction interval for a single future
observation for one set of fixed values of the predictor variables and
corresponds to the case when simultaneous=FALSE
. Miller (1981a, p. 115)
gives a formula for simultaneous prediction intervals for future
observations. If we are interested in creating an interval that will
encompass all possible future observations over the range of the
predictor variables with some specified probability, however, we need to
create simultaneous tolerance intervals. A formula for such an interval
was developed by Lieberman and Miller (1963) and is given in
Miller (1981a, p. 124). This formula is used in the case when
simultaneous=TRUE
.
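Since with simultaneous=FALSE these are the standard regression intervals, a quick sketch (using the Air.df data set from the example below) shows how they line up with the intervals returned by predict.lm; the agreement is expected rather than guaranteed, so it is worth checking on your own fit:

# Non-simultaneous limits from pointwise() versus predict.lm() intervals
fit <- lm(ozone ~ temperature, data = Air.df)
new.df <- data.frame(temperature = c(70, 90))
pred <- predict(fit, newdata = new.df, se.fit = TRUE)

# Limits for the mean response (individual = FALSE) ...
pointwise(pred, coverage = 0.95)
# ... should match the usual confidence interval for the mean:
predict(fit, newdata = new.df, interval = "confidence", level = 0.95)

# Limits for a single future observation (individual = TRUE) ...
pointwise(pred, coverage = 0.95, individual = TRUE)
# ... should match the usual prediction interval:
predict(fit, newdata = new.df, interval = "prediction", level = 0.95)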
a list with the following components:
upper |
upper limits of pointwise confidence intervals. |
fit |
surface values. This is the same as the component |
lower |
lower limits of pointwise confidence intervals. |
The function pointwise
is called by the functions
detectionLimitCalibrate
and inversePredictCalibrate
, which are used in calibration.
Almost always the process of determining the concentration of a chemical in
a soil, water, or air sample involves using some kind of machine that
produces a signal, and this signal is related to the concentration of the
chemical in the physical sample. The process of relating the machine signal
to the concentration of the chemical is called calibration
(see calibrate
). Once calibration has been performed,
estimated concentrations in physical samples with unknown concentrations
are computed using inverse regression. The uncertainty in the process used
to estimate the concentration may be quantified with decision, detection,
and quantitation limits.
In practice, only the point estimate of concentration is reported (along
with a possible qualifier), without confidence bounds for the true
concentration. This is most unfortunate because it gives the
impression that there is no error associated with the reported concentration.
Indeed, both the International Organization for Standardization (ISO) and
the International Union of Pure and Applied Chemistry (IUPAC) recommend
always reporting both the estimated concentration and the uncertainty
associated with this estimate (Currie, 1997).
Authors of S (for code for pointwise
in S).
Steven P. Millard (for modification to allow the arguments simultaneous and individual) ([email protected]).
Chambers, J.M., and Hastie, T.J., eds. (1992). Statistical Models in S. Chapman and Hall/CRC, Boca Raton, FL.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 3.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.546-553.
Miller, R.G. (1981a). Simultaneous Statistical Inference. Springer-Verlag, New York, pp.111, 124.
predict
, predict.lm
,
lm
, calibrate
,
inversePredictCalibrate
, detectionLimitCalibrate
.
# Using the data in the built-in data frame Air.df, # fit the cube root of ozone as a function of temperature. # Then compute predicted values for ozone at 70 and 90 # degrees F, and compute 95% confidence intervals for the # mean value of ozone at these temperatures. # First create the lm object #--------------------------- ozone.fit <- lm(ozone ~ temperature, data = Air.df) # Now get predicted values and CIs at 70 and 90 degrees #------------------------------------------------------ predict.list <- predict(ozone.fit, newdata = data.frame(temperature = c(70, 90)), se.fit = TRUE) pointwise(predict.list, coverage = 0.95) # $upper # 1 2 # 2.839145 4.278533 # $fit # 1 2 # 2.697810 4.101808 # $lower # 1 2 # 2.556475 3.925082 #-------------------------------------------------------------------- # Continuing with the above example, create a scatterplot of ozone # vs. temperature, and add the fitted line along with simultaneous # 95% confidence bands. x <- Air.df$temperature y <- Air.df$ozone dev.new() plot(x, y, xlab="Temperature (degrees F)", ylab = expression(sqrt("Ozone (ppb)", 3))) abline(ozone.fit, lwd = 2) new.x <- seq(min(x), max(x), length=100) predict.ozone <- predict(ozone.fit, newdata = data.frame(temperature = new.x), se.fit = TRUE) ci.ozone <- pointwise(predict.ozone, coverage=0.95, simultaneous=TRUE) lines(new.x, ci.ozone$lower, lty=2, lwd = 2, col = 2) lines(new.x, ci.ozone$upper, lty=2, lwd = 2, col = 2) title(main=paste("Cube Root Ozone vs. Temperature with Fitted Line", "and Simultaneous 95% Confidence Bands", sep="\n")) #-------------------------------------------------------------------- # Redo the last example by creating non-simultaneous # confidence bounds and prediction bounds as well. dev.new() plot(x, y, xlab = "Temperature (degrees F)", ylab = expression(sqrt("Ozone (ppb)", 3))) abline(ozone.fit, lwd = 2) new.x <- seq(min(x), max(x), length=100) predict.ozone <- predict(ozone.fit, newdata = data.frame(temperature = new.x), se.fit = TRUE) ci.ozone <- pointwise(predict.ozone, coverage=0.95) lines(new.x, ci.ozone$lower, lty=2, col = 2, lwd = 2) lines(new.x, ci.ozone$upper, lty=2, col = 2, lwd = 2) pi.ozone <- pointwise(predict.ozone, coverage = 0.95, individual = TRUE) lines(new.x, pi.ozone$lower, lty=4, col = 4, lwd = 2) lines(new.x, pi.ozone$upper, lty=4, col = 4, lwd = 2) title(main=paste("Cube Root Ozone vs. Temperature with Fitted Line", "and 95% Confidence and Prediction Bands", sep="\n")) #-------------------------------------------------------------------- # Clean up rm(predict.list, ozone.fit, x, y, new.x, predict.ozone, ci.ozone, pi.ozone)
Returns a list of “ordered” observations and associated plotting positions based on Type I left-censored or right-censored data. These plotting positions may be used to construct empirical cumulative distribution plots or quantile-quantile plots, or to estimate distribution parameters.
ppointsCensored(x, censored, censoring.side = "left", prob.method = "michael-schucany", plot.pos.con = 0.375)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities). Possible values are "kaplan-meier", "modified kaplan-meier" (only available when censoring.side="left"), "nelson" (only available when censoring.side="right"), "michael-schucany" (the default), and "hirsch-stedinger". |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is |
Methods for computing plotting positions for complete data sets
(no censored observations) are discussed in D'Agostino, R.B. (1986a) and
Cleveland (1993). For data sets with censored observations, these methods
must be modified. The function ppointsCensored
allows you to compute
plotting positions based on any of the following methods:
Product-limit method of Kaplan and Meier (1958) (prob.method="kaplan-meier"
).
Hazard plotting method of Nelson (1972) (prob.method="nelson"
).
Generalization of the product-limit method due to Michael and Schucany (1986)
(prob.method="michael-schucany"
) (the default).
Generalization of the product-limit method due to Hirsch and Stedinger (1987)
(prob.method="hirsch-stedinger"
).
Let x_1, x_2, ..., x_N denote a random sample of N observations from some distribution. Assume n (0 < n < N) of these observations are known and c (c = N - n) of these observations are all censored below (left-censored) or all censored above (right-censored) at k fixed censoring levels

T_1, T_2, ..., T_k;   k >= 1     (1)

For the case when k >= 2, the data are said to be Type I multiply censored. For the case when k = 1, set T = T_1. If the data are left-censored and all n known observations are greater than or equal to T, or if the data are right-censored and all n known observations are less than or equal to T, then the data are said to be Type I singly censored (Nelson, 1982, p.7), otherwise they are considered to be Type I multiply censored.

Let c_j denote the number of observations censored below or above censoring level T_j for j = 1, 2, ..., k, so that

c_1 + c_2 + ... + c_k = c     (2)

Let x_(1), x_(2), ..., x_(N) denote the “ordered” observations, where now “observation” means either the actual observation (for uncensored observations) or the censoring level (for censored observations). For right-censored data, if a censored observation has the same value as an uncensored one, the uncensored observation should be placed first. For left-censored data, if a censored observation has the same value as an uncensored one, the censored observation should be placed first. Note that in this case the quantity x_(i) does not necessarily represent the i'th “largest” observation from the (unknown) complete sample.

Finally, let Omega denote the set of n subscripts in the “ordered” sample that correspond to uncensored observations.
Product-Limit Method of Kaplan and Meier (prob.method="kaplan-meier"
)
For complete data sets (no censored observations), the empirical probabilities estimator of the cumulative distribution function evaluated at the i'th ordered observation x_(i) is given by (D'Agostino, 1986a, p.8):

phat_i = Fhat[x_(i)] = #{x_j <= x_(i)} / N = i / N     (3)

where #{x_j <= x_(i)} denotes the number of observations less than or equal to x_(i) (see the help file for ecdfPlot).
Kaplan and Meier (1958) extended this method of computing the empirical cdf to
the case of right-censored data.
Right-Censored Data (censoring.side="right"
)
Let S(t) denote the survival function evaluated at t, that is:

S(t) = 1 - F(t) = Pr(X > t)     (4)
Kaplan and Meier (1958) show that a nonparametric estimate of the survival function at the i'th ordered observation that is not censored (i.e., i in Omega), is given by:

Shat[x_(i)] = Prod{j in Omega, j <= i} (n_j - d_j) / n_j     (5)

where n_j is the number of observations (uncensored or censored) with values greater than or equal to x_(j), and d_j denotes the number of uncensored observations exactly equal to x_(j) (if there are no tied uncensored observations then d_j will equal 1 for all values of j).
(See also Lee and Wang, 2003, pp. 64–69; Michael and Schucany, 1986). By convention,
the estimate of the survival function at a censored observation is set equal to
the estimated value of the survival function at the largest uncensored observation
less than or equal to that censoring level. If there are no uncensored observations
less than or equal to a particular censoring level, the estimate of the survival
function is set to 1 for that censoring level.
Thus the Kaplan-Meier plotting position at the i'th ordered observation that is not censored (i.e., i in Omega), is given by:

phat_i = Fhat[x_(i)] = 1 - Shat[x_(i)]     (6)
The plotting position for a censored observation is set equal to the plotting position associated with the largest uncensored observation less than or equal to that censoring level. If there are no uncensored observations less than or equal to a particular censoring level, the plotting position is set to 0 for that censoring level.
Note that for complete data sets, Equation (6) reduces to Equation (3).
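As a small illustration, the following sketch uses a made-up right-censored data set (hypothetical values, chosen only so the arithmetic is easy to follow) together with a hand calculation based on Equations (5) and (6); the call to ppointsCensored should reproduce the hand-computed values:

# Hypothetical right-censored data (a "*" marks a censored value):
#   3, 4, 4*, 5, 6*, 8
x        <- c(3, 4, 4, 5, 6, 8)
censored <- c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

ppointsCensored(x, censored, censoring.side = "right",
  prob.method = "kaplan-meier")

# Hand check with Equations (5) and (6):
#   Shat(3) = 5/6, Shat(4) = (5/6)(4/5) = 2/3, Shat(5) = (2/3)(2/3) = 4/9, Shat(8) = 0,
# so the plotting positions for the uncensored values are
#   1/6, 1/3, 5/9, 1 (about 0.167, 0.333, 0.556, 1.000),
# and each censored value takes the plotting position of the largest
# uncensored value less than or equal to its censoring level.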
Left-Censored Data (censoring.side="left"
)
Gillespie et al. (2010) give formulas for the Kaplan-Meier estimator for the case of left-censoring (censoring.side="left"). In this case, the plotting position for the i'th ordered observation, assuming it is not censored, is computed as:

phat_i = Fhat[x_(i)] = Prod{j in Omega, j > i} (n_j - d_j) / n_j     (7)

where n_j is the number of observations (uncensored or censored) with values less than or equal to x_(j), and d_j denotes the number of uncensored observations exactly equal to x_(j) (if there are no tied uncensored observations then d_j will equal 1 for all values of j).
The plotting position is equal to 1 for the largest uncensored order statistic.
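As a check on Equation (7), the following sketch reproduces by hand the plotting position of the smallest uncensored value (3.3) in the manganese data set used in the Examples section below; in that ordered sample of N = 25 values, the uncensored order statistics above 3.3 occur at positions j = 8, 9, ..., 25 with n_j = j and d_j = 1:

prod((8:25 - 1) / (8:25))
#[1] 0.28
# This matches the value returned by ppointsCensored() with
# prob.method = "kaplan-meier" for the observation 3.3
# (see the Examples section).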
Note that for complete data sets, Equation (7) reduces to Equation (3).
Modified Kaplan-Meier Method (prob.method="modified kaplan-meier"
)
(Left-Censored Data Only.) For left-censored data, the modified Kaplan-Meier
method is the same as the Kaplan-Meier method, except that for the largest
uncensored order statistic, the plotting position is not set to 1 but rather is
set equal to the Blom plotting position (N - 3/8)/(N + 1/4). This method
is useful, for example, when creating Quantile-Quantile plots.
Hazard Plotting Method of Nelson (prob.method="nelson"
)
(Right-Censored Data Only.) For right-censored data, Equation (5) can be re-written as:

Shat[x_(i)] = Prod{j in Omega, j <= i} (N - j) / (N - j + 1)     (8)

Nelson (1972) proposed the following formula for plotting positions for the uncensored observations in the context of estimating the hazard function (see Michael and Schucany, 1986, p.469):

phat_i = 1 - exp[ - Sum{j in Omega, j <= i} 1 / (N - j + 1) ]     (9)
See Lee and Wang (2003) for more information about the hazard function.
As for the Kaplan and Meier (1958) method, the plotting position for a censored
observation is set equal to the plotting position associated with the largest
uncensored observation less than or equal to that censoring level. If there are
no uncensored observations less than or equal to a particular censoring level,
the plotting position is set to 0 for that censoring level.
Generalization of Product-Limit Method, Michael and Schucany
(prob.method="michael-schucany"
)
For complete data sets, the disadvantage of using Equation (3) above to define plotting positions is that it implies the largest observed value is the maximum possible value of the distribution (the 100'th percentile). This may be satisfactory if the underlying distribution is known to be discrete, but it is usually not satisfactory if the underlying distribution is known to be continuous.

A more frequently used formula for plotting positions for complete data sets is given by:

phat_i = (i - a) / (N - 2a + 1)     (10)

where 0 <= a <= 1 (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8,25). The value of a is usually chosen so that the plotting positions are approximately unbiased (i.e., approximate the mean of their distribution) or else approximate the median value of their distribution (see the help file for ecdfPlot). Michael and Schucany (1986) extended this method for both left- and right-censored data sets.
Right-Censored Data (censoring.side="right"
)
For right-censored data sets, the plotting positions for the uncensored observations are computed as:

phat_i = 1 - [(N - a + 1) / (N - 2a + 1)] * Prod{j in Omega, j <= i} (N - j - a + 1) / (N - j - a + 2)     (11)

Note that the plotting positions proposed by Herd (1960) and Johnson (1964) are a special case of Equation (11) with a = 0. Equation (11) reduces to Equation (10)
in the case of complete data sets. Note that unlike the Kaplan-Meier method,
plotting positions associated with tied uncensored observations are not the same
(just as in the case for complete data using Equation (10)).
As for the Kaplan and Meier (1958) method, for right-censored data the plotting
position for a censored observation is set equal to the plotting position associated
with the largest uncensored observation less than or equal to that censoring level.
If there are no uncensored observations less than or equal to a particular censoring
level, the plotting position is set to 0 for that censoring level.
Left-Censored Data (censoring.side="left"
)
For left-censored data sets the plotting positions are computed as:

phat_i = [(N - a + 1) / (N - 2a + 1)] * Prod{j in Omega, j >= i} (j - a) / (j - a + 1)     (12)
Equation (12) reduces to Equation (10) in the case of complete data sets. Note that unlike the Kaplan-Meier method, plotting positions associated with tied uncensored observations are not the same (just as in the case for complete data using Equation (10)).
For left-censored data, the plotting position for a censored observation is set
equal to the plotting position associated with the smallest uncensored observation
greater than or equal to that censoring level. If there are no uncensored
observations greater than or equal to a particular censoring level, the plotting
position is set to 1 for that censoring level.
Generalization of Product-Limit Method, Hirsch and Stedinger
(prob.method="hirsch-stedinger"
)
Hirsch and Stedinger (1987) use a slightly different approach than Kaplan and Meier
(1958) and Michael and Schucany (1986) to derive a nonparametric estimate of the
survival function (probability of exceedance) in the context of left-censored data.
First they estimate the value of the survival function at each of the censoring
levels. The value of the survival function for an uncensored observation between
two adjacent censoring levels is then computed by linear interpolation (in the form
of a plotting position). See also Helsel and Cohn (1988).
The discussion below presents an extension of the method of Hirsch and Stedinger (1987) to the case of right-censored data, and then presents the original derivation due to Hirsch and Stedinger (1987) for left-censored data.
Right-Censored Data (censoring.side="right"
)
For right-censored data, the survival function is estimated as follows. For each censoring level T_j, the value of the survival function at T_j is written in terms of its value at the adjacent censoring level and the conditional probability of an observation falling between the two levels. This conditional probability is estimated by the method of moments as A_j / (A_j + B_j), where A_j denotes the number of uncensored observations lying between the two adjacent censoring levels and B_j denotes the number of observations (censored or uncensored) known to exceed the upper of the two censoring levels. Substituting these estimates yields a set of equations that is solved iteratively for the estimated value of the survival function at each censoring level.

Once the values of the survival function at the censoring levels are computed, the plotting positions for the uncensored observations lying between two adjacent censoring levels are obtained by linear interpolation between the estimated values of the survival function at those levels, using the plotting position constant a and the rank of each observation among the uncensored observations in that interval. (Tied observations are given distinct ranks.)

For the observations censored at censoring level T_j, the plotting positions are computed in an analogous fashion, using the rank of each observation among the observations censored at that censoring level. Note that all the observations censored at the same censoring level are given distinct ranks, even though there is no way to distinguish between them. The formulas parallel those of the original left-censored derivation of Hirsch and Stedinger (1987) given below.
Left-Censored Data (censoring.side="left"
)
For left-censored data, Hirsch and Stedinger (1987) modify the definition of the survival function to

S(t) = Pr(X >= t)

which, for continuous distributions, is identical to the function defined in Equation (4).

Hirsch and Stedinger (1987) estimate the value of this survival function at each censoring level using the recursion

Shat(T_j) = Shat(T_(j+1)) + [A_j / (A_j + B_j)] [1 - Shat(T_(j+1))]

computed for j = k, k-1, ..., 1, with the conventions T_0 = -Inf, Shat(T_0) = 1, T_(k+1) = Inf, and Shat(T_(k+1)) = 0. Here

A_j = number of uncensored observations with values in the interval [T_j, T_(j+1)),
B_j = number of observations (censored or uncensored) known to be less than T_j (i.e., uncensored observations less than T_j plus observations censored at censoring levels less than or equal to T_j),

so that A_j / (A_j + B_j) is the method of moments estimator of the conditional probability Pr(X >= T_j | X < T_(j+1)).

Once the values of the survival function at the censoring levels are computed, the plotting position for an uncensored observation lying in the interval [T_j, T_(j+1)) (j = 0, 1, ..., k) is computed as

phat = [1 - Shat(T_j)] + [Shat(T_j) - Shat(T_(j+1))] (r - a) / (A_j - 2a + 1)

where a denotes the plotting position constant (0 <= a <= 1) and r denotes the rank of the observation among the A_j uncensored observations in that interval. (Tied observations are given distinct ranks.)

For the c_j observations censored at censoring level T_j, the plotting positions are computed as

phat = [1 - Shat(T_j)] (r - a) / (c_j - 2a + 1)

where r denotes the rank of the observation among the c_j observations censored at censoring level T_j. Note that all the observations censored at the same censoring level are given distinct ranks,
even though there is no way to distinguish between them.
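The following sketch traces the recursion and the interpolation formulas by hand for the Helsel and Cohn (1988) Appendix B data used in the Examples section (plotting position constant a = 0); the values agree with the output of ppointsCensored with prob.method="hirsch-stedinger" shown there:

# Helsel and Cohn (1988) Appendix B data: 6 values censored at <1,
# 3 values censored at <10, uncensored values 3, 7, 9 between the
# censoring levels and 12, 15, 20, 27, 33, 50 above them.
# Censoring levels: T1 = 1, T2 = 10.

# Estimated survival function S(t) = Pr[X >= t] at the censoring levels:
S.T2 <- 0 + (6 / (6 + 12)) * (1 - 0)      # A2 = 6 uncensored >= 10, B2 = 12 obs < 10
S.T1 <- S.T2 + (3 / (3 + 6)) * (1 - S.T2) # A1 = 3 uncensored in [1, 10), B1 = 6 obs < 1
c(S.T1, S.T2)
#[1] 0.5555556 0.3333333

# Plotting positions (a = 0) for the uncensored values 3, 7, 9 in [T1, T2):
(1 - S.T1) + (S.T1 - S.T2) * (1:3) / (3 + 1)
#[1] 0.5000000 0.5555556 0.6111111

# Plotting positions for the three values censored at <10:
(1 - S.T2) * (1:3) / (3 + 1)
#[1] 0.1666667 0.3333333 0.5000000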
ppointsCensored
returns a list with the following components:
Order.Statistics |
numeric vector of the “ordered” observations. |
Cumulative.Probabilities |
numeric vector of the associated plotting positions. |
Censored |
logical vector indicating which of the ordered observations are censored. |
Censoring.Side |
character string indicating whether the data are left- or right-censored.
This is same value as the argument |
Prob.Method |
character string indicating what method was used to compute the plotting positions.
This is the same value as the argument |
Optional Component (only present when prob.method="michael-schucany"
or prob.method="hirsch-stedinger"
):
Plot.Pos.Con |
numeric scalar containing the value of the plotting position constant that was used.
This is the same as the argument |
For censored data sets, plotting positions may be used to construct empirical cumulative distribution
plots (see ecdfPlotCensored
), construct quantile-quantile plots
(see qqPlotCensored
), or to estimate distribution parameters
(see FcnsByCatCensoredData
).
The function survfit
in the R package
survival computes the survival function for right-censored, left-censored, or
interval-censored data. Calling survfit
with
type="kaplan-meier"
will produce similar results to calling
ppointsCensored
with prob.method="kaplan-meier"
. Also, calling
survfit
with type="fh2"
will produce similar results
to calling ppointsCensored
with prob.method="nelson"
.
Helsel and Cohn (1988, p.2001) found very little effect of changing the value of the plotting position constant when using the method of Hirsch and Stedinger (1987) to compute plotting positions for multiply left-censored data. In general, there will be very little difference between plotting positions computed by the different methods except in the case of very small samples and a large amount of censoring.
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley and Sons, New York.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
ppoints
, ecdfPlot
, qqPlot
,
ecdfPlotCensored
, qqPlotCensored
,
survfit
.
# Generate 20 observations from a normal distribution with mean=20 and sd=5, # censor all observations less than 18, then compute plotting positions for # this data set. Compare the plotting positions to the plotting positions # for the uncensored data set. Note that the plotting positions for the # censored data set start at the first ordered uncensored observation and # that for values of x > 18 the plotting positions for the two data sets are # exactly the same. This is because there is only one censoring level and # no uncensored observations fall below the censored observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(333) x <- rnorm(20, mean=20, sd=5) censored <- x < 18 censored # [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE #[13] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE sum(censored) #[1] 7 new.x <- x new.x[censored] <- 18 round(sort(new.x),1) # [1] 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.1 18.7 19.6 20.2 20.3 20.6 21.4 #[15] 21.8 21.8 23.2 26.2 26.8 29.7 p.list <- ppointsCensored(new.x, censored) p.list #$Order.Statistics # [1] 18.00000 18.00000 18.00000 18.00000 18.00000 18.00000 18.00000 18.09771 # [9] 18.65418 19.58594 20.21931 20.26851 20.55296 21.38869 21.76359 21.82364 #[17] 23.16804 26.16527 26.84336 29.67340 # #$Cumulative.Probabilities # [1] 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 0.3765432 # [8] 0.3765432 0.4259259 0.4753086 0.5246914 0.5740741 0.6234568 0.6728395 #[15] 0.7222222 0.7716049 0.8209877 0.8703704 0.9197531 0.9691358 # #$Censored # [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE #[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE # #$Censoring.Side #[1] "left" # #$Prob.Method #[1] "michael-schucany" # #$Plot.Pos.Con #[1] 0.375 #---------- # Round off plotting positions to two decimal places # and compare to plotting positions that ignore censoring #-------------------------------------------------------- round(p.list$Cum, 2) # [1] 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.38 0.43 0.48 0.52 0.57 0.62 0.67 #[15] 0.72 0.77 0.82 0.87 0.92 0.97 round(ppoints(x, a=0.375), 2) # [1] 0.03 0.08 0.13 0.18 0.23 0.28 0.33 0.38 0.43 0.48 0.52 0.57 0.62 0.67 #[15] 0.72 0.77 0.82 0.87 0.92 0.97 #---------- # Clean up #--------- rm(x, censored, new.x, p.list) #---------------------------------------------------------------------------- # Reproduce the example in Appendix B of Helsel and Cohn (1988). The data # are stored in Helsel.Cohn.88.appb.df. This data frame contains 18 # observations, of which 9 are censored below one of 2 distinct censoring # levels. Helsel.Cohn.88.app.b.df # Conc.orig Conc Censored #1 <1 1 TRUE #2 <1 1 TRUE #... #17 33 33 FALSE #18 50 50 FALSE p.list <- with(Helsel.Cohn.88.app.b.df, ppointsCensored(Conc, Censored, prob.method="hirsch-stedinger", plot.pos.con=0)) lapply(p.list[1:2], round, 3) #$Order.Statistics # [1] 1 1 1 1 1 1 3 7 9 10 10 10 12 15 20 27 33 50 # #$Cumulative.Probabilities # [1] 0.063 0.127 0.190 0.254 0.317 0.381 0.500 0.556 0.611 0.167 0.333 0.500 #[13] 0.714 0.762 0.810 0.857 0.905 0.952 # Clean up #--------- rm(p.list) #---------------------------------------------------------------------------- # Example 15-1 of USEPA (2009, page 15-10) gives an example of # computing plotting positions based on censored manganese # concentrations (ppb) in groundwater collected at 5 monitoring # wells. The data for this example are stored in # EPA.09.Ex.15.1.manganese.df. 
EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #4 4 Well.1 21.6 21.6 FALSE #5 5 Well.1 <2 2.0 TRUE #... #21 1 Well.5 17.9 17.9 FALSE #22 2 Well.5 22.7 22.7 FALSE #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE p.list.EPA <- with(EPA.09.Ex.15.1.manganese.df, ppointsCensored(Manganese.ppb, Censored, prob.method = "kaplan-meier")) data.frame(Mn = p.list.EPA$Order.Statistics, Censored = p.list.EPA$Censored, CDF = p.list.EPA$Cumulative.Probabilities) # Mn Censored CDF #1 2.0 TRUE 0.21 #2 2.0 TRUE 0.21 #3 2.0 TRUE 0.21 #4 3.3 FALSE 0.28 #5 5.0 TRUE 0.28 #6 5.0 TRUE 0.28 #7 5.0 TRUE 0.28 #8 5.3 FALSE 0.32 #9 6.3 FALSE 0.36 #10 7.7 FALSE 0.40 #11 8.4 FALSE 0.44 #12 9.5 FALSE 0.48 #13 10.0 FALSE 0.52 #14 11.9 FALSE 0.56 #15 12.1 FALSE 0.60 #16 12.6 FALSE 0.64 #17 16.9 FALSE 0.68 #18 17.9 FALSE 0.72 #19 21.6 FALSE 0.76 #20 22.7 FALSE 0.80 #21 34.5 FALSE 0.84 #22 45.9 FALSE 0.88 #23 53.6 FALSE 0.92 #24 77.2 FALSE 0.96 #25 106.3 FALSE 1.00 #---------- # Clean up #--------- rm(p.list.EPA)
The EnvStats function predict
is a generic function for
predictions from the results of various model fitting functions.
The function invokes particular methods
which
depend on the class
of the first argument.
The EnvStats function predict.default
simply calls the R generic
function predict
.
The EnvStats functions predict
and predict.default
have been
created in order to comply with CRAN policies, because EnvStats contains a
modified version of the R function predict.lm
.
predict(object, ...)

## Default S3 method:
predict(object, ...)
object |
a model object for which prediction is desired. |
... |
Further arguments passed to or from other methods. See the R help file for predict. |
See the R help file for predict
.
See the R help file for predict
.
R Development Core Team (for code for R version of predict).
Steven P. Millard (for EnvStats version of predict.default; [email protected]).
Chambers, J.M., and Hastie, T.J., eds. (1992). Statistical Models in S. Chapman and Hall/CRC, Boca Raton, FL.
R help file for predict, predict.lm.
# Using the data from the built-in data frame Air.df, # fit the cube-root of ozone as a function of temperature, # then compute predicted values for ozone at 70 and 90 degrees F, # along with the standard errors of these predicted values. # First look at the data #----------------------- with(Air.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Cube-Root Ozone (ppb)")) # Now create the lm object #------------------------- ozone.fit <- lm(ozone ~ temperature, data = Air.df) # Now get predicted values and CIs at 70 and 90 degrees. # Note the presence of the last component called n.coefs. #-------------------------------------------------------- predict.list <- predict(ozone.fit, newdata = data.frame(temperature = c(70, 90)), se.fit = TRUE) predict.list #$fit # 1 2 #2.697810 4.101808 # #$se.fit # 1 2 #0.07134554 0.08921071 # #$df #[1] 114 # #$residual.scale #[1] 0.5903046 # #$n.coefs #[1] 2 #---------- #Continuing with the above example, create a scatterplot of # cube-root ozone vs. temperature, and add the fitted line # along with simultaneous 95% confidence bands. with(Air.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Cube-Root Ozone (ppb)")) abline(ozone.fit, lwd = 3, col = "blue") new.temp <- with(Air.df, seq(min(temperature), max(temperature), length = 100)) predict.list <- predict(ozone.fit, newdata = data.frame(temperature = new.temp), se.fit = TRUE) ci.ozone <- pointwise(predict.list, coverage = 0.95, simultaneous = TRUE) lines(new.temp, ci.ozone$lower, lty = 2, lwd = 3, col = "magenta") lines(new.temp, ci.ozone$upper, lty = 2, lwd = 3, col = "magenta") title(main=paste("Scatterplot of Cube-Root Ozone vs. Temperature", "with Fitted Line and Simultaneous 95% Confidence Bands", sep="\n")) #---------- # Clean up #--------- rm(ozone.fit, predict.list, new.temp, ci.ozone) graphics.off()
The function predict.lm
in EnvStats is a modified version
of the built-in R function predict.lm
.
The only modification is that for the EnvStats function predict.lm
,
if se.fit=TRUE
, the list returned includes a component called
n.coefs
. The component n.coefs
is used by the function
pointwise
to create simultaneous confidence or prediction limits.
## S3 method for class 'lm'
predict(object, ...)
object |
Object of class inheriting from "lm". |
... |
Further arguments passed to the R function predict.lm. |
See the R help file for predict.lm
.
The function predict.lm
in EnvStats is a modified version
of the built-in R function predict.lm
.
The only modification is that for the EnvStats function predict.lm
,
if se.fit=TRUE
, the list returned includes a component called
n.coefs
. The component n.coefs
is used by the function
pointwise
to create simultaneous confidence or prediction limits.
See the R help file for predict.lm
.
The only modification is that for the EnvStats function predict.lm
,
if se.fit=TRUE
, the list returned includes a component called
n.coefs
, i.e., the function returns a list with the following components:
fit |
vector or matrix as above |
se.fit |
standard error of predicted means |
residual.scale |
residual standard deviations |
df |
degrees of freedom for residual |
n.coefs |
numeric scalar denoting the number of predictor variables used in the model |
R Development Core Team (for code for R version of predict.lm
).
Steven P. Millard (for modification to add component n.coefs
; [email protected])
Chambers, J.M., and Hastie, T.J., eds. (1992). Statistical Models in S. Chapman and Hall/CRC, Boca Raton, FL.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, Chapter 3.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.546-553.
Miller, R.G. (1981a). Simultaneous Statistical Inference. Springer-Verlag, New York, pp.111, 124.
Help file for R function predict, Help file for R function predict.lm, lm, calibrate, inversePredictCalibrate, detectionLimitCalibrate.
# Using the data from the built-in data frame Air.df, # fit the cube-root of ozone as a function of temperature, # then compute predicted values for ozone at 70 and 90 degrees F, # along with the standard errors of these predicted values. # First look at the data #----------------------- with(Air.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Cube-Root Ozone (ppb)")) # Now create the lm object #------------------------- ozone.fit <- lm(ozone ~ temperature, data = Air.df) # Now get predicted values and CIs at 70 and 90 degrees. # Note the presence of the last component called n.coefs. #-------------------------------------------------------- predict.list <- predict(ozone.fit, newdata = data.frame(temperature = c(70, 90)), se.fit = TRUE) predict.list #$fit # 1 2 #2.697810 4.101808 # #$se.fit # 1 2 #0.07134554 0.08921071 # #$df #[1] 114 # #$residual.scale #[1] 0.5903046 # #$n.coefs #[1] 2 #---------- #Continuing with the above example, create a scatterplot of # cube-root ozone vs. temperature, and add the fitted line # along with simultaneous 95% confidence bands. with(Air.df, plot(temperature, ozone, xlab = "Temperature (degrees F)", ylab = "Cube-Root Ozone (ppb)")) abline(ozone.fit, lwd = 3, col = "blue") new.temp <- with(Air.df, seq(min(temperature), max(temperature), length = 100)) predict.list <- predict(ozone.fit, newdata = data.frame(temperature = new.temp), se.fit = TRUE) ci.ozone <- pointwise(predict.list, coverage = 0.95, simultaneous = TRUE) lines(new.temp, ci.ozone$lower, lty = 2, lwd = 3, col = "magenta") lines(new.temp, ci.ozone$upper, lty = 2, lwd = 3, col = "magenta") title(main=paste("Scatterplot of Cube-Root Ozone vs. Temperature", "with Fitted Line and Simultaneous 95% Confidence Bands", sep="\n")) #---------- # Clean up #--------- rm(ozone.fit, predict.list, new.temp, ci.ozone) graphics.off()
Construct a prediction interval for the next k observations or
next set of k transformed means for a gamma distribution.
predIntGamma(x, n.transmean = 1, k = 1, method = "Bonferroni",
    pi.type = "two-sided", conf.level = 0.95, est.method = "mle",
    normal.approx.transform = "kulkarni.powar")

predIntGammaAlt(x, n.transmean = 1, k = 1, method = "Bonferroni",
    pi.type = "two-sided", conf.level = 0.95, est.method = "mle",
    normal.approx.transform = "kulkarni.powar")
x |
numeric vector of non-negative observations. Missing ( |
n.transmean |
positive integer specifying the sample size associated with the |
k |
positive integer specifying the number of future observations or means
the prediction interval should contain with confidence level |
method |
character string specifying the method to use if the number of future observations
or averages ( |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the prediction
interval. The default value is |
est.method |
character string specifying the method of estimation for the shape and scale
distribution parameters. The possible values are
|
normal.approx.transform |
character string indicating which power transformation to use.
Possible values are |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
The function predIntGamma
returns a prediction interval as well as
estimates of the shape and scale parameters.
The function predIntGammaAlt
returns a prediction interval as well as
estimates of the mean and coefficient of variation.
Following Krishnamoorthy et al. (2008), the prediction interval is computed by:
using a power transformation on the original data to induce approximate normality,
calling predIntNorm
with the transformed data to
compute the prediction interval, and then
back-transforming the interval to create a prediction interval on the original scale (a rough sketch of these steps is given below).
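For illustration only, here is a minimal "by hand" version of these three steps, using the cube-root transformation for concreteness. This is a sketch under that assumption, not the exact internal algorithm of predIntGamma, which by default chooses the power via the Kulkarni-Powar method described below.

# Sketch only: gamma prediction interval "by hand" via a normalizing
# power transformation.
set.seed(250)
dat <- rgamma(20, shape = 3, scale = 2)
# Step 1: transform to approximate normality (cube root, Wilson-Hilferty).
# Step 2: prediction interval for the next observation on the transformed scale.
pi.trans <- predIntNorm(dat^(1/3), k = 1, pi.type = "two-sided",
    conf.level = 0.95)
# Step 3: back-transform the limits to the original scale.
# (EnvStats would additionally reset a negative lower limit to 0.)
pi.trans$interval$limits^3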
The argument normal.approx.transform
determines which transformation is used.
The value normal.approx.transform="cube.root"
uses
the cube root transformation suggested by Wilson and Hilferty (1931) and used by
Krishnamoorthy et al. (2008) and Singh et al. (2010b), and the value
normal.approx.transform="fourth.root"
uses the fourth root transformation suggested
by Hawkins and Wixley (1986) and used by Singh et al. (2010b).
The default value normal.approx.transform="kulkarni.powar"
uses the "Optimum Power Normal Approximation Method" of Kulkarni and Powar (2010).
The "optimum" power is determined by:
|
if |
|
if |
where denotes the estimate of the shape parameter. Although
Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to
determine the power
, for the functions
predIntGamma
and predIntGammaAlt
the power is based on whatever estimate of
shape is used (e.g.,
est.method="mle"
, est.method="bcmle"
, etc.).
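For example, a small helper implementing the piecewise rule quoted above (a sketch only; the function name kulkarni.powar.power is hypothetical and is not an exported EnvStats function):

# Hypothetical helper: Kulkarni-Powar "optimum" normalizing power as a
# function of the estimated shape parameter, using the piecewise rule above.
kulkarni.powar.power <- function(shape) {
  ifelse(shape > 1.5, 0.246,
    -0.0705 - 0.178 * shape + 0.475 * sqrt(shape))
}
kulkarni.powar.power(2.203862)  # 0.246, matching the Examples below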
When the argument n.transmean
is larger than 1 (i.e., you are
constructing a prediction interval for future means, not just single
observations), in order to properly compare a future mean with the
prediction limits, you must follow these steps:
Take the observations that will be used to compute the mean and
transform them by raising them to the power given by the value in the component
interval$normal.transform.power
(see the section VALUE below).
Compute the mean of the transformed observations.
Take the mean computed in step 2 above and raise it to the inverse of the power originally used to transform the observations (these steps are sketched in the short example below).
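For example (a minimal sketch using hypothetical simulated data, separate from the Examples section):

# Sketch of steps 1-3: compare a future mean of 2 observations to an
# upper prediction limit for transformed means of order 2.
set.seed(100)
dat <- rgamma(20, shape = 3, scale = 2)
pi.obj <- predIntGamma(dat, n.transmean = 2, k = 1, pi.type = "upper")
new.obs <- rgamma(2, shape = 3, scale = 2)     # two future observations
p <- pi.obj$interval$normal.transform.power    # power used for the transformation
trans.obs <- new.obs^p                         # step 1: transform the future observations
back.mean <- mean(trans.obs)^(1/p)             # steps 2-3: mean, then back-transform
back.mean > pi.obj$interval$limits["UPL"]      # TRUE would signal an exceedance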
A list of class "estimate"
containing the estimated parameters,
the prediction interval, and other information. See estimate.object
for details.
In addition to the usual components contained in an object of class
"estimate"
, the returned value also includes two additional
components within the "interval"
component:
n.transmean |
the value of |
normal.transform.power |
the value of the power used to transform the original data to approximate normality. |
It is possible for the lower prediction limit based on the transformed data to be less than 0. In this case, the lower prediction limit on the original scale is set to 0 and a warning is issued stating that the normal approximation is not accurate in this case.
The gamma distribution takes values on the positive real line. Special cases of the gamma are the exponential distribution and the chi-square distributions. Applications of the gamma include life testing, statistical ecology, queuing theory, inventory control, and precipitation processes. A gamma distribution starts to resemble a normal distribution as the shape parameter a tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
Prediction intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973), and are often discussed in the context of linear regression (Draper and Smith, 1998; Zar, 2010). Prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities. References that discuss prediction intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Evans, M., N. Hastings, and B. Peacock. (1993). Statistical Distributions. Second Edition. John Wiley and Sons, New York, Chapter 18.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Krishnamoorthy K., T. Mathew, and S. Mukherjee. (2008). Normal-Based Methods for a Gamma Distribution: Prediction and Tolerance Intervals and Stress-Strength Reliability. Technometrics, 50(1), 69–78.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, New Jersey.
GammaDist, GammaAlt, estimate.object, egamma, predIntNorm, tolIntGamma.
# Generate 20 observations from a gamma distribution with parameters # shape=3 and scale=2, then create a prediciton interval for the # next observation. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rgamma(20, shape = 3, scale = 2) predIntGamma(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.203862 # scale = 2.174928 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Normal Transform Power: 0.246 # #Prediction Interval Type: two-sided # #Confidence Level: 95% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.5371569 # UPL = 15.5313783 #-------------------------------------------------------------------- # Using the same data as in the previous example, create an upper # one-sided prediction interval for the next three averages of # order 2 (i.e., each mean is based on 2 future observations), and # use the bias-corrected estimate of shape. pred.list <- predIntGamma(dat, n.transmean = 2, k = 3, pi.type = "upper", est.method = "bcmle") pred.list #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 1.906616 # scale = 2.514005 # #Estimation Method: bcmle # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Bonferroni using # Kulkarni & Powar (2010) # transformation to Normality # based on bcmle of 'shape' # #Normal Transform Power: 0.246 # #Prediction Interval Type: upper # #Confidence Level: 95% # #Number of Future #Transformed Means: 3 # #Sample Size for #Transformed Means: 2 # #Prediction Interval: LPL = 0.00000 # UPL = 12.17404 #-------------------------------------------------------------------- # Continuing with the above example, assume the distribution shifts # in the future to a gamma distribution with shape = 5 and scale = 2. # Create 6 future observations from this distribution, and create 3 # means by pairing the observations sequentially. Note we must first # transform these observations using the power 0.246, then compute the # means based on the transformed data, and then transform the means # back to the original scale and compare them to the upper prediction # limit of 12.17 set.seed(427) new.dat <- rgamma(6, shape = 5, scale = 2) p <- pred.list$interval$normal.transform.power p #[1] 0.246 new.dat.trans <- new.dat^p means.trans <- c(mean(new.dat.trans[1:2]), mean(new.dat.trans[3:4]), mean(new.dat.trans[5:6])) means <- means.trans^(1/p) means #[1] 11.74214 17.05299 11.65272 any(means > pred.list$interval$limits["UPL"]) #[1] TRUE #---------- # Clean up rm(dat, pred.list, new.dat, p, new.dat.trans, means.trans, means) #-------------------------------------------------------------------- # Reproduce part of the example on page 73 of # Krishnamoorthy et al. (2008), which uses alkalinity concentrations # reported in Gibbons (1994) and Gibbons et al. (2009) to construct a # one-sided upper 90% prediction limit. 
predIntGamma(Gibbons.et.al.09.Alkilinity.vec, pi.type = "upper", conf.level = 0.9, normal.approx.transform = "cube.root") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 9.375013 # scale = 6.202461 # #Estimation Method: mle # #Data: Gibbons.et.al.09.Alkilinity.vec # #Sample Size: 27 # #Prediction Interval Method: exact using # Wilson & Hilferty (1931) cube-root # transformation to Normality # #Normal Transform Power: 0.3333333 # #Prediction Interval Type: upper # #Confidence Level: 90% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.0000 # UPL = 85.3495
Estimate the shape and scale parameters for a
gamma distribution,
or estimate the mean and coefficient of variation for a
gamma distribution (alternative parameterization),
and construct a simultaneous prediction interval for the next r sampling
occasions, based on one of three possible rules: k-of-m, California, or
Modified California.
predIntGammaSimultaneous(x, n.transmean = 1, k = 1, m = 2, r = 1,
    rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper",
    conf.level = 0.95, K.tol = 1e-07, est.method = "mle",
    normal.approx.transform = "kulkarni.powar")

predIntGammaAltSimultaneous(x, n.transmean = 1, k = 1, m = 2, r = 1,
    rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper",
    conf.level = 0.95, K.tol = 1e-07, est.method = "mle",
    normal.approx.transform = "kulkarni.powar")
x |
numeric vector of non-negative observations. Missing ( |
n.transmean |
positive integer specifying the sample size associated with future transformed
means (see the DETAILS section for an explanation of what the transformation is).
The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
transformed means) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
delta.over.sigma |
numeric scalar indicating the ratio |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
est.method |
character string specifying the method of estimation for the shape and scale
distribution parameters. The possible values are
|
normal.approx.transform |
character string indicating which power transformation to use.
Possible values are |
The function predIntGammaSimultaneous
returns a simultaneous prediction
interval as well as estimates of the shape and scale parameters.
The function predIntGammaAltSimultaneous
returns a simultaneous prediction
interval as well as estimates of the mean and coefficient of variation.
Following Krishnamoorthy et al. (2008), the simultaneous prediction interval is computed by:
using a power transformation on the original data to induce approximate normality,
calling predIntNormSimultaneous
with the transformed data to
compute the simultaneous prediction interval, and then
back-transforming the interval to create a simultaneous prediction interval on the original scale.
The argument normal.approx.transform
determines which transformation is used.
The value normal.approx.transform="cube.root"
uses
the cube root transformation suggested by Wilson and Hilferty (1931) and used by
Krishnamoorthy et al. (2008) and Singh et al. (2010b), and the value
normal.approx.transform="fourth.root"
uses the fourth root transformation
suggested by Hawkins and Wixley (1986) and used by Singh et al. (2010b).
The default value normal.approx.transform="kulkarni.powar"
uses the "Optimum Power Normal Approximation Method" of Kulkarni and Powar (2010).
The "optimum" power is determined by:
|
if |
|
if |
where denotes the estimate of the shape parameter. Although
Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to
determine the power
, for the functions
predIntGammaSimultaneous
and predIntGammaAltSimultaneous
the power
is based on whatever estimate of shape is used
(e.g.,
est.method="mle"
, est.method="bcmle"
, etc.).
When the argument n.transmean
is larger than 1 (i.e., you are
constructing a prediction interval for future means, not just single
observations), in order to properly compare a future mean with the
prediction limits, you must follow these steps:
Take the observations that will be used to compute the mean and
transform them by raising them to the power given by the value in the
component interval$normal.transform.power
(see the section VALUE below).
Compute the mean of the transformed observations.
Take the mean computed in step 2 above and raise it to the inverse of the power originally used to transform the observations.
A list of class "estimate"
containing the estimated parameters,
the simultaneous prediction interval, and other information.
See estimate.object
for details.
In addition to the usual components contained in an object of class
"estimate"
, the returned value also includes two additional
components within the "interval"
component:
n.transmean |
the value of |
normal.transform.power |
the value of the power used to transform the original data to approximate normality. |
It is possible for the lower prediction limit based on the transformed data to be less than 0. In this case, the lower prediction limit on the original scale is set to 0 and a warning is issued stating that the normal approximation is not accurate in this case.
The Gamma Distribution
The gamma distribution takes values on the positive real line.
Special cases of the gamma are the exponential distribution and
the chi-square distributions. Applications of the gamma include
life testing, statistical ecology, queuing theory, inventory control, and precipitation
processes. A gamma distribution starts to resemble a normal distribution as the
shape parameter a tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly
recommend against using a lognormal model for environmental data and recommend trying a
gamma distribution instead.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at
hazardous and solid waste facilities is the requirement of testing several wells and
several constituents at each well on each sampling occasion. This is an obvious
multiple comparisons problem, and the naive approach of using a standard t-test at
a conventional α-level (e.g., 0.05 or 0.01) for each test leads to a
very high probability of at least one significant result on each sampling occasion,
when in fact no contamination has occurred. This problem was pointed out years ago
by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of
controlling the facility-wide false positive rate (FWFPR) while maintaining adequate
power to detect contamination in the groundwater. Because of the ubiquitous presence
of spatial variability, it is usually best to use simultaneous prediction intervals
at each well (Davis, 1998a). That is, by constructing prediction intervals based on
background (pre-landfill) data on each well, and comparing future observations at a
well to the prediction interval for that particular well. In each of these cases,
the individual α-level at each well is equal to the FWFPR divided by the
product of the number of wells and constituents.
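For example (a small sketch with assumed numbers of wells and constituents; the same quantities appear in the Examples below):

# Per-test significance level needed to hold the facility-wide false
# positive rate (FWFPR) at 10% across 50 wells and 10 constituents.
FWFPR <- 0.1
n.wells <- 50
n.constituents <- 10
FWFPR / (n.wells * n.constituents)                 # Bonferroni-style: 0.0002
1 - (1 - FWFPR)^(1 / (n.wells * n.constituents))   # exact:            ~0.00021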
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the
1-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule
For the k-of-m rule, Davis and McNichols (1987) give tables with
“optimal” choices of k (in terms of best power for a given overall
confidence level) for selected values of m, r, and n. They found
that the optimal ratios of k to m (i.e., k/m) are generally small,
in the range of 15-50%.
The California Rule
The California rule was mandated in that state for groundwater monitoring at waste
disposal facilities when resampling verification is part of the statistical program
(Barclay's Code of California Regulations, 1991). The California code mandates a
“California” rule with m ≥ 3. The motivation for this rule may have
been a desire to have a majority of the observations in bounds (Davis, 1998a). For
example, for a k-of-m rule with k = 1 and m = 3, a monitoring
location will pass if the first observation is out of bounds, the second resample
is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations
are out of bounds. For the California rule with m = 3, either the first
observation must be in bounds, or the next 2 observations must be in bounds in order
for the monitoring location to pass.
Davis (1998a) states that if the FWFPR is kept constant, then the California rule
offers little increased power compared to the k-of-m rule, and can
actually decrease the power of detecting contamination.
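To make the pass/fail logic of these rules concrete, here is a hypothetical illustration (the helper functions below are for exposition only and are not part of EnvStats):

# Pass/fail decisions for one monitoring location, given a logical vector
# of the initial observation plus resamples (TRUE = in bounds, i.e.,
# below the upper prediction limit).
passes.1.of.3 <- function(in.bounds) any(in.bounds[1:3])                  # k = 1, m = 3
passes.CA.3 <- function(in.bounds) in.bounds[1] || all(in.bounds[2:3])    # California, m = 3

obs <- c(FALSE, FALSE, TRUE)   # only the final resample is in bounds
passes.1.of.3(obs)  # TRUE:  one in-bounds result suffices under 1-of-3
passes.CA.3(obs)    # FALSE: California needs the first result, or both resamples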
The Modified California Rule
The Modified California Rule was proposed as a compromise between a 1-of-m
rule and the California rule. For a given FWFPR, the Modified California rule
achieves better power than the California rule, and still requires at least as many
observations in bounds as out of bounds, unlike a 1-of-m rule.
Different Notations Between Different References
For the k-of-m rule described in this help file, both
Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable
p instead of k to represent the minimum number
of future observations the interval should contain on each of the r sampling
occasions.
Gibbons et al. (2009, Chapter 1) presents extensive lists of the value of
K for both k-of-m rules and California rules. Gibbons et al.'s
notation reverses the meaning of k and r compared to the notation used
in this help file. That is, in Gibbons et al.'s notation, k represents the
number of future sampling occasions or monitoring wells, and r represents the
minimum number of observations the interval should contain on each sampling occasion.
Steven P. Millard ([email protected])
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Lognormal Population on Each of r Future Occasions.
Technometrics 29, 359–370.
Evans, M., N. Hastings, and B. Peacock. (1993). Statistical Distributions. Second Edition. John Wiley and Sons, New York, Chapter 18.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least p Out of m Future Observations From a Lognormal Population.
Technometrics 19, 167–177.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Lognormal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Lognormal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at Least m Out of k Future Observations.
Technometrics 15, 897–914.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Krishnamoorthy K., T. Mathew, and S. Mukherjee. (2008). Normal-Based Methods for a Gamma Distribution: Prediction and Tolerance Intervals and Stress-Strength Reliability. Technometrics, 50(1), 69–78.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
GammaDist, GammaAlt, predIntNorm, predIntNormSimultaneous, predIntNormSimultaneousTestPower, tolIntGamma, egamma, egammaAlt, estimate.object.
# Generate 8 observations from a gamma distribution with parameters # mean=10 and cv=1, then use predIntGammaAltSimultaneous to estimate the # mean and coefficient of variation of the true distribution and construct an # upper 95% prediction interval to contain at least 1 out of the next # 3 observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(479) dat <- rgammaAlt(8, mean = 10, cv = 1) predIntGammaAltSimultaneous(dat, k = 1, m = 3) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): mean = 13.875825 # cv = 1.049504 # #Estimation Method: MLE # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact using # Kulkarni & Powar (2010) # transformation to Normality # based on MLE of 'shape' # #Normal Transform Power: 0.2204908 # #Prediction Interval Type: upper # #Confidence Level: 95% # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = 0.00000 # UPL = 15.87101 #---------- # Compare the 95% 1-of-3 upper prediction limit to the California and # Modified California upper prediction limits. Note that the upper # prediction limit for the Modified California rule is between the limit # for the 1-of-3 rule and the limit for the California rule. predIntGammaAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #15.87101 predIntGammaAltSimultaneous(dat, m = 3, rule = "CA")$interval$limits["UPL"] # UPL #34.11499 predIntGammaAltSimultaneous(dat, rule = "Modified.CA")$interval$limits["UPL"] # UPL #22.58809 #---------- # Show how the upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. # Here, we'll use the 1-of-3 rule. predIntGammaAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #15.87101 predIntGammaAltSimultaneous(dat, k = 1, m = 3, r = 10)$interval$limits["UPL"] # UPL #37.86825 #---------- # Compare the upper simultaneous prediction limit for the 1-of-3 rule # based on individual observations versus based on transformed means of # order 4. predIntGammaAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #15.87101 predIntGammaAltSimultaneous(dat, n.transmean = 4, k = 1, m = 3)$interval$limits["UPL"] # UPL #14.76528 #========== # Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an # upper simultaneous prediction limit for the 1-of-3 rule for # r = 2 future sampling occasions. The data for this example are # stored in EPA.09.Ex.19.1.sulfate.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 25 to use to construct the prediction limit. # There are 50 compliance wells and we will monitor 10 different # constituents at each well at each of the r=2 future sampling # occasions. To determine the confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the individual Type I Error level at each well to # 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. 
Thus, the confidence level is given by: nc <- 10 nw <- 50 SWFPR <- 0.1 conf.level <- (1 - SWFPR)^(1 / (nc * nw)) conf.level #[1] 0.9997893 #---------- # Look at the data: names(EPA.09.Ex.19.1.sulfate.df) #[1] "Well" "Month" "Day" #[4] "Year" "Date" "Sulfate.mg.per.l" #[7] "log.Sulfate.mg.per.l" EPA.09.Ex.19.1.sulfate.df[, c("Well", "Date", "Sulfate.mg.per.l", "log.Sulfate.mg.per.l")] # Well Date Sulfate.mg.per.l log.Sulfate.mg.per.l #1 GW-01 1999-07-08 63.0 4.143135 #2 GW-01 1999-09-12 51.0 3.931826 #3 GW-01 1999-10-16 60.0 4.094345 #4 GW-01 1999-11-02 86.0 4.454347 #5 GW-04 1999-07-09 104.0 4.644391 #6 GW-04 1999-09-14 102.0 4.624973 #7 GW-04 1999-10-12 84.0 4.430817 #8 GW-04 1999-11-15 72.0 4.276666 #9 GW-08 1997-10-12 31.0 3.433987 #10 GW-08 1997-11-16 84.0 4.430817 #11 GW-08 1998-01-28 65.0 4.174387 #12 GW-08 1999-04-20 41.0 3.713572 #13 GW-08 2002-06-04 51.8 3.947390 #14 GW-08 2002-09-16 57.5 4.051785 #15 GW-08 2002-12-02 66.8 4.201703 #16 GW-08 2003-03-24 87.1 4.467057 #17 GW-09 1997-10-16 59.0 4.077537 #18 GW-09 1998-01-28 85.0 4.442651 #19 GW-09 1998-04-12 75.0 4.317488 #20 GW-09 1998-07-12 99.0 4.595120 #21 GW-09 2000-01-30 75.8 4.328098 #22 GW-09 2000-04-24 82.5 4.412798 #23 GW-09 2000-10-24 85.5 4.448516 #24 GW-09 2002-12-01 188.0 5.236442 #25 GW-09 2003-03-24 150.0 5.010635 # The EPA guidance document constructs the upper simultaneous # prediction limit for the 1-of-3 plan assuming a lognormal # distribution for the sulfate data. Here we will compare # the value of the limit based on assuming a lognormal distribution # versus assuming a gamma distribution. Sulfate <- EPA.09.Ex.19.1.sulfate.df$Sulfate.mg.per.l pred.int.list.lnorm <- predIntLnormSimultaneous(x = Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) pred.int.list.gamma <- predIntGammaSimultaneous(x = Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) pred.int.list.lnorm$interval$limits["UPL"] # UPL #159.5497 pred.int.list.gamma$interval$limits["UPL"] # UPL #153.3232 #========== # Cleanup #-------- rm(dat, nc, nw, SWFPR, conf.level, Sulfate, pred.int.list.lnorm, pred.int.list.gamma)
Estimate the mean and standard deviation on the log-scale for a lognormal distribution, or estimate the mean and coefficient of variation for a lognormal distribution (alternative parameterization), and construct a prediction interval for the next k observations or next k geometric means.
predIntLnorm(x, n.geomean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95) predIntLnormAlt(x, n.geomean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95, est.arg.list = NULL)
x |
For For If |
n.geomean |
positive integer specifying the sample size associated with the |
k |
positive integer specifying the number of future observations or geometric means the
prediction interval should contain with confidence level |
method |
character string specifying the method to use if the number of future observations
( |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
est.arg.list |
for |
The function predIntLnorm
returns a prediction interval as well as
estimates of the meanlog and sdlog parameters.
The function predIntLnormAlt
returns a prediction interval as well as
estimates of the mean and coefficient of variation.
A prediction interval for a lognormal distribution is constructed by taking the
natural logarithm of the observations and constructing a prediction interval
based on the normal (Gaussian) distribution by calling predIntNorm
.
These prediction limits are then exponentiated to produce a prediction interval on
the original scale of the data.
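As a quick illustration of this construction (a sketch of my own, not part of the package examples; the simulated data are arbitrary), the limits returned by predIntLnorm can be reproduced by applying predIntNorm to the log-transformed data and exponentiating the result:

set.seed(23)
x <- rlnorm(15, meanlog = 1, sdlog = 0.5)

# Two-sided 95% prediction limits computed directly on the lognormal scale
predIntLnorm(x, k = 1, pi.type = "two-sided", conf.level = 0.95)$interval$limits

# Same limits via the log scale: normal limits, then exponentiate
exp(predIntNorm(log(x), k = 1, pi.type = "two-sided",
  conf.level = 0.95)$interval$limits)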
If x
is a numeric vector, a list of class
"estimate"
containing the estimated parameters, the prediction interval,
and other information. See the help file for estimate.object
for details.
If x
is the result of calling an estimation function,
predIntLnorm
returns a list whose class is the same as x
.
The list contains the same components as x
, as well as a component called
interval
containing the prediction interval information.
If x
already has a component called interval
, this component is
replaced with the prediction interval information.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Dunnett, C.W. (1955). A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50, 1096-1121.
Dunnett, C.W. (1964). New Tables for Multiple Comparisons with a Control. Biometrics 20, 482-491.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Lognormal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Lognormal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York.
Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. (available on-line at: https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf).
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
elnorm
, elnormAlt
,
predIntNorm
, predIntNormK
,
predIntLnormSimultaneous
, predIntLnormAltSimultaneous
,
tolIntLnorm
, tolIntLnormAlt
,
Lognormal, estimate.object
.
# Generate 20 observations from a lognormal distribution with parameters # meanlog=0 and sdlog=1. The exact two-sided 90% prediction interval for # k=1 future observation is given by: [exp(-1.645), exp(1.645)] = [0.1930, 5.181]. # Use predIntLnorm to estimate the distribution parameters, and construct a # two-sided 90% prediction interval. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(47) dat <- rlnorm(20, meanlog = 0, sdlog = 1) predIntLnorm(dat, conf = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.1035722 # sdlog = 0.9106429 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: two-sided # #Confidence Level: 90% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.1795898 # UPL = 4.5264399 #---------- # Repeat the above example, but do it in two steps. # First create a list called est.list containing information about the # estimated parameters, then create the prediction interval. est.list <- elnorm(dat) est.list #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.1035722 # sdlog = 0.9106429 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 predIntLnorm(est.list, conf = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.1035722 # sdlog = 0.9106429 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: two-sided # #Confidence Level: 90% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.1795898 # UPL = 4.5264399 #---------- # Using the same data from the first example, create a one-sided # upper 99% prediction limit for the next 3 geometric means of order 2 # (i.e., each of the 3 future geometric means is based on a sample size # of 2 future observations). predIntLnorm(dat, n.geomean = 2, k = 3, conf.level = 0.99, pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.1035722 # sdlog = 0.9106429 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Bonferroni # #Prediction Interval Type: upper # #Confidence Level: 99% # #Number of Future #Geometric Means: 3 # #Sample Size for #Geometric Means: 2 # #Prediction Interval: LPL = 0.000000 # UPL = 7.047571 #---------- # Compare the result above that is based on the Bonferroni method # with the exact method predIntLnorm(dat, n.geomean = 2, k = 3, conf.level = 0.99, pi.type = "upper", method = "exact")$interval$limits["UPL"] # UPL #7.00316 #---------- # Clean up rm(dat, est.list) #-------------------------------------------------------------------- # Example 18-2 of USEPA (2009, p.18-15) shows how to construct a 99% # upper prediction interval for the log-scale mean of 4 future observations # (future mean of order 4) assuming a lognormal distribution based on # chrysene concentrations (ppb) in groundwater at 2 background wells. # Data were collected once per month over 4 months at the 2 background # wells, and also at a compliance well. 
# The question to be answered is whether there is evidence of # contamination at the compliance well. # Here we will follow the example, but look at the geometric mean # instead of the log-scale mean. #---------- # The data for this example are stored in EPA.09.Ex.18.2.chrysene.df. EPA.09.Ex.18.2.chrysene.df # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 6.9 #2 2 Well.1 Background 27.3 #3 3 Well.1 Background 10.8 #4 4 Well.1 Background 8.9 #5 1 Well.2 Background 15.1 #6 2 Well.2 Background 7.2 #7 3 Well.2 Background 48.4 #8 4 Well.2 Background 7.8 #9 1 Well.3 Compliance 68.0 #10 2 Well.3 Compliance 48.9 #11 3 Well.3 Compliance 30.1 #12 4 Well.3 Compliance 38.1 Chrysene.bkgd <- with(EPA.09.Ex.18.2.chrysene.df, Chrysene.ppb[Well.type == "Background"]) Chrysene.cmpl <- with(EPA.09.Ex.18.2.chrysene.df, Chrysene.ppb[Well.type == "Compliance"]) #---------- # A Shapiro-Wilks goodness-of-fit test for normality indicates # we should reject the assumption of normality and assume a # lognormal distribution for the background well data: gofTest(Chrysene.bkgd) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Normal # #Estimated Parameter(s): mean = 16.55000 # sd = 14.54441 # #Estimation Method: mvue # #Data: Chrysene.bkgd # #Sample Size: 8 # #Test Statistic: W = 0.7289006 # #Test Statistic Parameter: n = 8 # #P-value: 0.004759859 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. gofTest(Chrysene.bkgd, dist = "lnorm") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.5533006 # sdlog = 0.7060038 # #Estimation Method: mvue # #Data: Chrysene.bkgd # #Sample Size: 8 # #Test Statistic: W = 0.8546352 # #Test Statistic Parameter: n = 8 # #P-value: 0.1061057 # #Alternative Hypothesis: True cdf does not equal the # Lognormal Distribution. #---------- # Here is the one-sided 99% upper prediction limit for # a geometric mean based on 4 future observations: predIntLnorm(Chrysene.bkgd, n.geomean = 4, k = 1, conf.level = 0.99, pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.5533006 # sdlog = 0.7060038 # #Estimation Method: mvue # #Data: Chrysene.bkgd # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99% # #Number of Future #Geometric Means: 1 # #Sample Size for #Geometric Means: 4 # #Prediction Interval: LPL = 0.00000 # UPL = 46.96613 UPL <- predIntLnorm(Chrysene.bkgd, n.geomean = 4, k = 1, conf.level = 0.99, pi.type = "upper")$interval$limits["UPL"] UPL # UPL #46.96613 # Is there evidence of contamination at the compliance well? geoMean(Chrysene.cmpl) #[1] 44.19034 # Since the geometric mean at the compliance well is less than # the upper prediction limit, there is no evidence of contamination. #---------- # Cleanup #-------- rm(Chrysene.bkgd, Chrysene.cmpl, UPL)
Compute the probability that at least one set of future observations violates the given rule based on a simultaneous prediction interval for the next r future sampling occasions for a lognormal distribution. The three possible rules are: k-of-m, California, or Modified California.
predIntLnormAltSimultaneousTestPower(n, df = n - 1, n.geomean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", ratio.of.means = 1, cv = 1, pi.type = "upper", conf.level = 0.95, r.shifted = r, K.tol = .Machine$double.eps^0.5, integrate.args.list = NULL)
n |
vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
vector of positive integers indicating the degrees of freedom associated with
the sample size. The default value is |
n.geomean |
positive integer specifying the sample size associated with the future geometric
means.
The default value is |
k |
for the |
m |
vector of positive integers specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
ratio.of.means |
numeric vector specifying the ratio of the mean of the population that will be
sampled to produce the future observations vs. the mean of the population that
was sampled to construct the prediction interval. See the DETAILS section below
for more information. The default value is |
cv |
numeric vector of positive values specifying the coefficient of variation for
both the population that was sampled to construct the prediction interval and
the population that will be sampled to produce the future observations. The
default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
vector of values between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
r.shifted |
vector of positive integers specifying the number of future sampling occasions for
which the mean is shifted. All values must be integers
between |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
integrate.args.list |
a list of arguments to supply to the |
What is a Simultaneous Prediction Interval?

A prediction interval for some population is an interval on the real line constructed so that it will contain k future observations from that population with some specified probability (1-alpha)100%, where 0 < alpha < 1 and k is some pre-specified positive integer. The quantity (1-alpha)100% is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNorm computes a standard prediction interval based on a sample from a normal distribution.

The function predIntLnormAltSimultaneous computes a simultaneous prediction interval (assuming lognormal observations) that will contain a certain number of future observations with probability (1-alpha)100% for each of r future sampling “occasions”, where r is some pre-specified positive integer. The quantity r may refer to r distinct future sampling occasions in time, or it may for example refer to sampling at r distinct locations on one future sampling occasion, assuming that the population standard deviation is the same at all of the r distinct locations.
The function predIntLnormAltSimultaneous
computes a simultaneous
prediction interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of the next m future observations will fall in the prediction interval with probability (1-alpha)100% on each of the r future sampling occasions. If observations are being taken sequentially, for a particular sampling occasion, up to m observations may be taken, but once k of the observations fall within the prediction interval, sampling can stop. Note: When k=m and r=1, the results of predIntNormSimultaneous are equivalent to the results of predIntNorm.
For the California rule (rule="CA"), with probability (1-alpha)100%, for each of the r future sampling occasions, either the first observation will fall in the prediction interval, or else all of the next m-1 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability (1-alpha)100%, for each of the r future sampling occasions, either the first observation will fall in the prediction interval, or else at least 2 out of the next 3 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, up to 3 more observations must be taken.
Computing Power

The function predIntNormSimultaneousTestPower computes the probability that at least one set of future observations or averages will violate the given rule based on a simultaneous prediction interval for the next r future sampling occasions, under the assumption of normally distributed observations, where the population mean for the future observations is allowed to differ from the population mean for the observations used to construct the prediction interval.
The function predIntLnormAltSimultaneousTestPower assumes all observations are from a lognormal distribution. The observations used to construct the prediction interval are assumed to come from a lognormal distribution with mean theta_1 and coefficient of variation tau. The future observations are assumed to come from a lognormal distribution with mean theta_2 and coefficient of variation tau; that is, the means are allowed to differ between the two populations, but not the coefficient of variation.
The function predIntLnormAltSimultaneousTestPower calls the function predIntNormSimultaneousTestPower, with the argument delta.over.sigma given by:

delta/sigma = log(R) / sqrt(log(tau^2 + 1))

where R is given by:

R = theta_2 / theta_1

and corresponds to the argument ratio.of.means for the function predIntLnormAltSimultaneousTestPower, and tau corresponds to the argument cv.
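To make the relationship above concrete, here is a minimal sketch (my own check, not taken from the help file; the particular values of n, k, m, r, ratio.of.means, and cv are arbitrary) showing that the lognormal-scale power equals the normal-theory power once ratio.of.means and cv are converted to delta.over.sigma:

R <- 3      # ratio.of.means
CV <- 1     # cv

# Power computed directly on the lognormal scale
predIntLnormAltSimultaneousTestPower(n = 8, k = 1, m = 3, r = 2,
  ratio.of.means = R, cv = CV)

# Equivalent normal-theory power after converting to delta.over.sigma
predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, r = 2,
  delta.over.sigma = log(R) / sqrt(log(CV^2 + 1)))

# The two values should agree (up to numerical integration error).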
vector of values between 0 and 1 equal to the probability that the rule will be violated.
See the help files for predIntLnormAltSimultaneous
and
predIntNormSimultaneousTestPower
.
Steven P. Millard ([email protected])
See the help file for predIntLnormAltSimultaneous
.
predIntLnormAltSimultaneous
, plotPredIntLnormAltSimultaneousTestPowerCurve
, predIntNormSimultaneous
, plotPredIntNormSimultaneousTestPowerCurve
,
Prediction Intervals, LognormalAlt.
# For the k-of-m rule with n=4, k=1, m=3, and r=1, show how the power increases # as ratio.of.means increases. Assume a 95% upper prediction interval. predIntLnormAltSimultaneousTestPower(n = 4, m = 3, ratio.of.means = 1:3) #[1] 0.0500000 0.2356914 0.4236723 #---------- # Look at how the power increases with sample size for an upper one-sided # prediction interval using the k-of-m rule with k=1, m=3, r=20, # ratio.of.means=4, and a confidence level of 95%. predIntLnormAltSimultaneousTestPower(n = c(4, 8), m = 3, r = 20, ratio.of.means = 4) #[1] 0.4915743 0.8218175 #---------- # Compare the power for the 1-of-3 rule with the power for the California and # Modified California rules, based on a 95% upper prediction interval and # ratio.of.means=4. Assume a sample size of n=8. Note that in this case the # power for the Modified California rule is greater than the power for the # 1-of-3 rule and California rule. predIntLnormAltSimultaneousTestPower(n = 8, k = 1, m = 3, ratio.of.means = 4) #[1] 0.6594845 predIntLnormAltSimultaneousTestPower(n = 8, m = 3, rule = "CA", ratio.of.means = 4) #[1] 0.5864311 predIntLnormAltSimultaneousTestPower(n = 8, rule = "Modified.CA", ratio.of.means = 4) #[1] 0.691135 #---------- # Show how the power for an upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. Here, we'll use the # 1-of-3 rule with n=8 and ratio.of.means=4. predIntLnormAltSimultaneousTestPower(n = 8, k = 1, m = 3, r = c(1, 2, 5, 10), ratio.of.means = 4) #[1] 0.6594845 0.7529576 0.8180814 0.8302302
Compute the probability that at least one out of k future observations (or geometric means) falls outside a prediction interval for k future observations (or geometric means) for a lognormal distribution.
predIntLnormAltTestPower(n, df = n - 1, n.geomean = 1, k = 1, ratio.of.means = 1, cv = 1, pi.type = "upper", conf.level = 0.95)
n |
vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
vector of positive integers indicating the degrees of freedom associated with
the sample size. The default value is |
n.geomean |
positive integer specifying the sample size associated with the future
geometric means. The default value is |
k |
vector of positive integers specifying the number of future observations that the
prediction interval should contain with confidence level |
ratio.of.means |
numeric vector specifying the ratio of the mean of the population that will be
sampled to produce the future observations vs. the mean of the population that
was sampled to construct the prediction interval. See the DETAILS section below
for more information. The default value is |
cv |
numeric vector of positive values specifying the coefficient of variation for
both the population that was sampled to construct the prediction interval and
the population that will be sampled to produce the future observations. The
default value is |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
A prediction interval for some population is an interval on the real line constructed so that it will contain k future observations or averages from that population with some specified probability (1-alpha)100%, where 0 < alpha < 1 and k is some pre-specified positive integer. The quantity (1-alpha)100% is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNorm computes a standard prediction interval based on a sample from a normal distribution.
The function predIntNormTestPower computes the probability that at least one out of k future observations or averages will not be contained in a prediction interval based on the assumption of normally distributed observations, where the population mean for the future observations is allowed to differ from the population mean for the observations used to construct the prediction interval.
The function predIntLnormAltTestPower assumes all observations are from a lognormal distribution. The observations used to construct the prediction interval are assumed to come from a lognormal distribution with mean theta_1 and coefficient of variation tau. The future observations are assumed to come from a lognormal distribution with mean theta_2 and coefficient of variation tau; that is, the means are allowed to differ between the two populations, but not the coefficient of variation.
The function predIntLnormAltTestPower calls the function predIntNormTestPower, with the argument delta.over.sigma given by:

delta/sigma = log(R) / sqrt(log(tau^2 + 1))

where R is given by:

R = theta_2 / theta_1

and corresponds to the argument ratio.of.means for the function predIntLnormAltTestPower, and tau corresponds to the argument cv.
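As with the simultaneous version, the conversion can be checked directly. The following sketch is my own illustration (not from the original examples; the values of n, k, ratio.of.means, and cv are arbitrary):

R <- 2      # ratio.of.means
CV <- 1     # cv

# Power computed on the lognormal scale
predIntLnormAltTestPower(n = 20, k = 1, ratio.of.means = R, cv = CV)

# Normal-theory power using the equivalent delta.over.sigma
predIntNormTestPower(n = 20, k = 1,
  delta.over.sigma = log(R) / sqrt(log(CV^2 + 1)))

# Both calls should return the same probability.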
vector of numbers between 0 and 1 equal to the probability that at least one of the k future observations or geometric means will fall outside the prediction interval.
See the help files for predIntNormTestPower
.
Steven P. Millard ([email protected])
See the help files for predIntNormTestPower
and
tTestLnormAltPower
.
plotPredIntLnormAltTestPowerCurve
,
predIntLnormAlt
,
predIntNorm
, predIntNormK
, plotPredIntNormTestPowerCurve
,
predIntLnormAltSimultaneous
, predIntLnormAltSimultaneousTestPower
, Prediction Intervals,
LognormalAlt.
# Show how the power increases as ratio.of.means increases. Assume a # 95% upper prediction interval. predIntLnormAltTestPower(n = 4, ratio.of.means = 1:3) #[1] 0.0500000 0.1459516 0.2367793 #---------- # Look at how the power increases with sample size for an upper one-sided # prediction interval with k=3, ratio.of.means=4, and a confidence level of 95%. predIntLnormAltTestPower(n = c(4, 8), k = 3, ratio.of.means = 4) #[1] 0.2860952 0.4533567 #---------- # Show how the power for an upper 95% prediction limit increases as the # number of future observations k increases. Here, we'll use n=20 and # ratio.of.means=2. predIntLnormAltTestPower(n = 20, k = 1:3, ratio.of.means = 2) #[1] 0.1945886 0.2189538 0.2321562
Estimate the mean and standard deviation on the log-scale for a lognormal distribution, or estimate the mean and coefficient of variation for a lognormal distribution (alternative parameterization), and construct a simultaneous prediction interval for the next r sampling occasions, based on one of three possible rules: k-of-m, California, or Modified California.
predIntLnormSimultaneous(x, n.geomean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5) predIntLnormAltSimultaneous(x, n.geomean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5, est.arg.list = NULL)
x |
For For If |
n.geomean |
positive integer specifying the sample size associated with future
geometric means. The default value is |
k |
for the |
m |
positive integer specifying the maximum number of future observations (or
geometric means) on one future sampling “occasion”.
The default value is |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is |
rule |
character string specifying which rule to use. The possible values are
|
delta.over.sigma |
numeric scalar indicating the ratio |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute |
est.arg.list |
a list containing arguments to pass to the function
|
The function predIntLnormSimultaneous
returns a simultaneous prediction
interval as well as estimates of the meanlog and sdlog parameters.
The function predIntLnormAltSimultaneous
returns a prediction interval as
well as estimates of the mean and coefficient of variation.
A simultaneous prediction interval for a lognormal distribution is constructed by
taking the natural logarithm of the observations and constructing a prediction
interval based on the normal (Gaussian) distribution by calling
predIntNormSimultaneous
.
These prediction limits are then exponentiated to produce a prediction interval on
the original scale of the data.
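As a quick illustration of this construction (my own sketch, not part of the package examples; the simulated data and parameter values are arbitrary), the limits returned by predIntLnormSimultaneous can be reproduced by applying predIntNormSimultaneous to the log-transformed data and exponentiating the result:

set.seed(58)
x <- rlnorm(12, meanlog = 2, sdlog = 0.6)

# Simultaneous prediction limit computed directly on the lognormal scale
lims.lnorm <- predIntLnormSimultaneous(x, k = 1, m = 3, r = 2,
  rule = "k.of.m", pi.type = "upper", conf.level = 0.95)$interval$limits

# Same limit via the log scale: normal limits, then exponentiate
lims.norm <- predIntNormSimultaneous(log(x), k = 1, m = 3, r = 2,
  rule = "k.of.m", pi.type = "upper", conf.level = 0.95)$interval$limits

lims.lnorm
exp(lims.norm)   # should match the limits above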
If x
is a numeric vector, predIntLnormSimultaneous
returns a list of
class "estimate"
containing the estimated parameters, the prediction interval,
and other information. See the help file for estimate.object
for details.
If x
is the result of calling an estimation function,
predIntLnormSimultaneous
returns a list whose class is the same as x
.
The list contains the same components as x
, as well as a component called
interval
containing the prediction interval information.
If x
already has a component called interval
, this component is
replaced with the prediction interval information.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at
hazardous and solid waste facilities is the requirement of testing several wells and
several constituents at each well on each sampling occasion. This is an obvious
multiple comparisons problem, and the naive approach of using a standard t-test at
a conventional alpha-level (e.g., 0.05 or 0.01) for each test leads to a
very high probability of at least one significant result on each sampling occasion,
when in fact no contamination has occurred. This problem was pointed out years ago
by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of
controlling the facility-wide false positive rate (FWFPR) while maintaining adequate
power to detect contamination in the groundwater. Because of the ubiquitous presence
of spatial variability, it is usually best to use simultaneous prediction intervals
at each well (Davis, 1998a). That is, by constructing prediction intervals based on
background (pre-landfill) data on each well, and comparing future observations at a
well to the prediction interval for that particular well. In each of these cases,
the individual alpha-level at each well is equal to the FWFPR divided by the
product of the number of wells and constituents.
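As a small numerical illustration (my own, not from the help file), the individual alpha-level implied by a site-wide false positive rate can be computed either by the simple division described above or by the exponent form used in the examples on this page; for small rates the two are nearly identical:

SWFPR <- 0.1
n.wells <- 50
n.constituents <- 10

alpha.divided  <- SWFPR / (n.wells * n.constituents)
alpha.exponent <- 1 - (1 - SWFPR)^(1 / (n.wells * n.constituents))

c(divided = alpha.divided, exponent = alpha.exponent)
# Both are approximately 2e-04, i.e., a per-test confidence level near 0.9998.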
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the 1-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule

For the k-of-m rule, Davis and McNichols (1987) give tables with “optimal” choices of k (in terms of best power for a given overall confidence level) for selected values of m, r, and n. They found that the optimal ratios of k to m (i.e., k/m) are generally small, in the range of 15-50%.
The California Rule

The California rule was mandated in that state for groundwater monitoring at waste disposal facilities when resampling verification is part of the statistical program (Barclay's Code of California Regulations, 1991). The California code mandates a “California” rule with m >= 3. The motivation for this rule may have been a desire to have a majority of the observations in bounds (Davis, 1998a). For example, for a k-of-m rule with k=1 and m=3, a monitoring location will pass if the first observation is out of bounds, the second resample is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations are out of bounds. For the California rule with m=3, either the first observation must be in bounds, or the next 2 observations must be in bounds in order for the monitoring location to pass.
Davis (1998a) states that if the FWFPR is kept constant, then the California rule offers little increased power compared to the 1-of-m rule, and can actually decrease the power of detecting contamination.
The Modified California Rule

The Modified California Rule was proposed as a compromise between a 1-of-m rule and the California rule. For a given FWFPR, the Modified California rule achieves better power than the California rule, and still requires at least as many observations in bounds as out of bounds, unlike a 1-of-m rule.
Different Notations Between Different References

For the k-of-m rule described in this help file, both Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable p instead of k to represent the minimum number of future observations the interval should contain on each of the r sampling occasions.

Gibbons et al. (2009, Chapter 1) presents extensive lists of the value of K for both k-of-m rules and California rules. Gibbons et al.'s notation reverses the meaning of k and r compared to the notation used in this help file. That is, in Gibbons et al.'s notation, k represents the number of future sampling occasions or monitoring wells, and r represents the minimum number of observations the interval should contain on each sampling occasion.
Steven P. Millard ([email protected])
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Lognormal Population on Each of r Future Occasions. Technometrics 29, 359–370.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least p Out of m Future Observations From a Lognormal Population. Technometrics 19, 167–177.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Lognormal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Lognormal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Lognormal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at Least m Out of k Future Observations. Technometrics 15, 897–914.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntLnormAltSimultaneousTestPower
,
predIntNorm
, predIntNormSimultaneous
, predIntNormSimultaneousTestPower
,
tolIntLnorm
, Lognormal, LognormalAlt, estimate.object
, elnorm
, elnormAlt
.
# Generate 8 observations from a lognormal distribution with parameters # mean=10 and cv=1, then use predIntLnormAltSimultaneous to estimate the # mean and coefficient of variation of the true distribution and construct an # upper 95% prediction interval to contain at least 1 out of the next # 3 observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(479) dat <- rlnormAlt(8, mean = 10, cv = 1) predIntLnormAltSimultaneous(dat, k = 1, m = 3) # Compare the 95% 1-of-3 upper prediction limit to the California and # Modified California upper prediction limits. Note that the upper # prediction limit for the Modified California rule is between the limit # for the 1-of-3 rule and the limit for the California rule. predIntLnormAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] predIntLnormAltSimultaneous(dat, m = 3, rule = "CA")$interval$limits["UPL"] predIntLnormAltSimultaneous(dat, rule = "Modified.CA")$interval$limits["UPL"] # Show how the upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. # Here, we'll use the 1-of-3 rule. predIntLnormAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] predIntLnormAltSimultaneous(dat, k = 1, m = 3, r = 10)$interval$limits["UPL"] # Compare the upper simultaneous prediction limit for the 1-of-3 rule # based on individual observations versus based on geometric means of # order 4. predIntLnormAltSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] predIntLnormAltSimultaneous(dat, n.geomean = 4, k = 1, m = 3)$interval$limits["UPL"] #========== # Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an # upper simultaneous prediction limit for the 1-of-3 rule for # r = 2 future sampling occasions. The data for this example are # stored in EPA.09.Ex.19.1.sulfate.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 25 to use to construct the prediction limit. # There are 50 compliance wells and we will monitor 10 different # constituents at each well at each of the r=2 future sampling # occasions. To determine the confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the individual Type I Error level at each well to # 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the confidence level is given by: nc <- 10 nw <- 50 SWFPR <- 0.1 conf.level <- (1 - SWFPR)^(1 / (nc * nw)) conf.level #---------- # Look at the data: names(EPA.09.Ex.19.1.sulfate.df) EPA.09.Ex.19.1.sulfate.df[, c("Well", "Date", "Sulfate.mg.per.l", "log.Sulfate.mg.per.l")] # Construct the upper simultaneous prediction limit for the # 1-of-3 plan assuming a lognormal distribution for the # sulfate data Sulfate <- EPA.09.Ex.19.1.sulfate.df$Sulfate.mg.per.l predIntLnormSimultaneous(x = Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) #========== # NOTE: Two-sided simultaneous prediction intervals computed using # Versions 2.4.0 - 2.8.1 of EnvStats are *NOT* valid. ## Not run: predIntLnormSimultaneous(x = Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "two-sided", conf.level = conf.level) ## End(Not run)
Estimate the mean and standard deviation of a normal distribution, and construct a prediction interval for the next k observations or next k means.
predIntNorm(x, n.mean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95)
x |
a numeric vector of observations, or an object resulting from a call to an estimating
function that assumes a normal (Gaussian) distribution (e.g., |
n.mean |
positive integer specifying the sample size associated with the |
k |
positive integer specifying the number of future observations or averages the
prediction interval should contain with confidence level |
method |
character string specifying the method to use if the number of future observations
( |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
What is a Prediction Interval?

A prediction interval for some population is an interval on the real line constructed so that it will contain k future observations or averages from that population with some specified probability (1-alpha)100%, where 0 < alpha < 1 and k is some pre-specified positive integer. The quantity (1-alpha)100% is called the confidence coefficient or confidence level associated with the prediction interval.
The Form of a Prediction Interval
Let x = c(x_1, x_2, ..., x_n) denote a vector of n observations from a normal
distribution with parameters mean=μ and sd=σ. Also, let m denote the sample size
associated with the k future averages (i.e., n.mean=m). When m=1, each average is
really just a single observation, so in the rest of this help file the term
“averages” will replace the phrase “observations or averages”.

For a normal distribution, the form of a two-sided (1-α)100% prediction interval is:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k). The symbol K is used here to be consistent with the
notation used for tolerance intervals (see tolIntNorm).

Similarly, the form of a one-sided lower prediction interval is:

    [x̄ - Ks, ∞)

and the form of a one-sided upper prediction interval is:

    (-∞, x̄ + Ks]

The value of K is the same for the lower and upper one-sided intervals,
but differs for one-sided versus two-sided prediction intervals.
The derivation of the constant K is explained in the help file for predIntNormK.
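To make the relationship between the interval and the constant K concrete, here is a
small sketch that computes the two-sided limits by hand from the sample mean, the
sample standard deviation, and the value returned by predIntNormK, and compares them
with the limits returned by predIntNorm. The simulated data, seed, and sample size
are arbitrary illustration values.

# Illustration only: the two-sided limits equal xbar +/- K * s.
set.seed(23)
dat <- rnorm(15, mean = 10, sd = 2)
xbar <- mean(dat)
s <- sd(dat)
K <- predIntNormK(n = length(dat), k = 1, pi.type = "two-sided", conf.level = 0.95)
c(LPL = xbar - K * s, UPL = xbar + K * s)
# Should agree with:
predIntNorm(dat, k = 1, pi.type = "two-sided", conf.level = 0.95)$interval$limits
rm(dat, xbar, s, K)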
A Prediction Interval is a Random Interval
A prediction interval is a random interval; that is, the lower and/or upper bounds
are random variables computed based on sample statistics in the baseline sample.
Prior to taking one specific baseline sample, the probability that the prediction
interval will contain the next k averages is (1-α)100%. Once a specific baseline
sample is taken and the prediction interval based on that sample is computed, the
probability that that prediction interval will contain the next k averages is not
necessarily (1-α)100%, but it should be close.
If an experiment is repeated N times, and for each experiment:

1. A sample is taken and a (1-α)100% prediction interval for k=1 future observation
is computed, and

2. One future observation is generated and compared to the prediction interval,

then the number of prediction intervals that actually contain the future observation
generated in step 2 above is a binomial random variable with parameters size=N and
prob=(1-α).

If, on the other hand, only one baseline sample is taken and only one (1-α)100%
prediction interval for k=1 future observation is computed, then the number of
future observations out of a total of N future observations that will be
contained in that one prediction interval is a binomial random variable with
parameters size=N and prob=(1-α*), where α* depends on the true population
parameters and the computed bounds of the prediction interval.
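The binomial behavior described above can be checked by simulation. The following
sketch repeatedly draws a baseline sample, computes a two-sided 95% prediction
interval for one future observation, and records whether a newly generated future
observation is contained in it. The true mean and standard deviation, the sample
size, and the number of repetitions are arbitrary illustration values.

# Illustration only: count of covered future observations is binomial
# with size = N and prob close to 0.95.
set.seed(42)
N <- 1000
contained <- replicate(N, {
  dat <- rnorm(20, mean = 10, sd = 2)           # baseline sample
  lims <- predIntNorm(dat)$interval$limits      # two-sided 95% PI for k = 1
  x.future <- rnorm(1, mean = 10, sd = 2)       # one future observation
  lims["LPL"] <= x.future && x.future <= lims["UPL"]
})
sum(contained)
mean(contained)   # should be close to 0.95
rm(N, contained)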
If x is a numeric vector, predIntNorm returns a list of class "estimate" containing
the estimated parameters, the prediction interval, and other information. See the
help file for estimate.object for details.

If x is the result of calling an estimation function, predIntNorm returns a list
whose class is the same as x. The list contains the same components as x, as well
as a component called interval containing the prediction interval information. If
x already has a component called interval, this component is replaced with the
prediction interval information.
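For example, the components of the returned object can be extracted by name; the
simulated data and arguments below are arbitrary illustration values.

# Illustration only: extracting pieces of the returned "estimate" object.
set.seed(1)
dat <- rnorm(20, mean = 10, sd = 2)
pi.obj <- predIntNorm(dat, k = 2, pi.type = "upper", conf.level = 0.99)
names(pi.obj)              # components of the "estimate" object
pi.obj$interval$limits     # just the prediction limits (LPL and UPL)
rm(dat, pi.obj)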
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Dunnett, C.W. (1955). A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50, 1096-1121.
Dunnett, C.W. (1964). New Tables for Multiple Comparisons with a Control. Biometrics 20, 482-491.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York.
Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. (available on-line at: https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf).
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNormK
, predIntNormSimultaneous
,
predIntLnorm
, tolIntNorm
,
Normal, estimate.object
, enorm
,
eqnorm
.
# Generate 20 observations from a normal distribution with parameters # mean=10 and sd=2, then create a two-sided 95% prediction interval for # the next observation. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(47) dat <- rnorm(20, mean = 10, sd = 2) predIntNorm(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 9.792856 # sd = 1.821286 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: two-sided # #Confidence Level: 95% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 5.886723 # UPL = 13.698988 #---------- # Using the same data from the last example, create a one-sided # upper 99% prediction limit for the next 3 averages of order 2 # (i.e., each of the 3 future averages is based on a sample size # of 2 future observations). predIntNorm(dat, n.mean = 2, k = 3, conf.level = 0.99, pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 9.792856 # sd = 1.821286 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Bonferroni # #Prediction Interval Type: upper # #Confidence Level: 99% # #Number of Future Averages: 3 # #Sample Size for Averages: 2 # #Prediction Interval: LPL = -Inf # UPL = 13.90537 #---------- # Compare the result above that is based on the Bonferroni method # with the exact method predIntNorm(dat, n.mean = 2, k = 3, conf.level = 0.99, pi.type = "upper", method = "exact")$interval$limits["UPL"] # UPL #13.89272 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 18-1 of USEPA (2009, p.18-9) shows how to construct a 95% # prediction interval for 4 future observations assuming a # normal distribution based on arsenic concentrations (ppb) in # groundwater at a solid waste landfill. There were 4 years of # quarterly monitoring, and years 1-3 are considered background. # The question to be answered is whether there is evidence of # contamination in year 4. # The data for this example is stored in EPA.09.Ex.18.1.arsenic.df. EPA.09.Ex.18.1.arsenic.df # Year Sampling.Period Arsenic.ppb #1 1 Background 12.6 #2 1 Background 30.8 #3 1 Background 52.0 #4 1 Background 28.1 #5 2 Background 33.3 #6 2 Background 44.0 #7 2 Background 3.0 #8 2 Background 12.8 #9 3 Background 58.1 #10 3 Background 12.6 #11 3 Background 17.6 #12 3 Background 25.3 #13 4 Compliance 48.0 #14 4 Compliance 30.3 #15 4 Compliance 42.5 #16 4 Compliance 15.0 As.bkgd <- with(EPA.09.Ex.18.1.arsenic.df, Arsenic.ppb[Sampling.Period == "Background"]) As.cmpl <- with(EPA.09.Ex.18.1.arsenic.df, Arsenic.ppb[Sampling.Period == "Compliance"]) # A Shapiro-Wilks goodness-of-fit test for normality indicates # there is no evidence to reject the assumption of normality # for the background data: gofTest(As.bkgd) #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF # #Hypothesized Distribution: Normal # #Estimated Parameter(s): mean = 27.51667 # sd = 17.10119 # #Estimation Method: mvue # #Data: As.bkgd # #Sample Size: 12 # #Test Statistic: W = 0.94695 # #Test Statistic Parameter: n = 12 # #P-value: 0.5929102 # #Alternative Hypothesis: True cdf does not equal the # Normal Distribution. 
# Here is the one-sided 95% upper prediction limit: UPL <- predIntNorm(As.bkgd, k = 4, pi.type = "upper")$interval$limits["UPL"] UPL # UPL #73.67237 # Are any of the compliance observations above the prediction limit? any(As.cmpl > UPL) #[1] FALSE #========== # Cleanup #-------- rm(As.bkgd, As.cmpl, UPL)
Half-Width of a Prediction Interval for the Next k Observations from a Normal Distribution
Compute the half-width of a prediction interval for the next k observations
from a normal distribution.
predIntNormHalfWidth(n, df = n - 1, n.mean = 1, k = 1, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95)
predIntNormHalfWidth(n, df = n - 1, n.mean = 1, k = 1, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95)
n |
numeric vector of positive integers greater than 1 indicating the sample size upon
which the prediction interval is based.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. |
df |
numeric vector of positive integers indicating the degrees of freedom associated
with the prediction interval. The default is df=n-1. |
n.mean |
numeric vector of positive integers specifying the sample size associated with
the k future averages. The default value is n.mean=1 (i.e., individual observations). |
k |
numeric vector of positive integers specifying the number of future observations
or averages the prediction interval should contain with confidence level
conf.level. The default value is k=1. |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s).
The default value is sigma.hat=1. |
method |
character string specifying the method to use if the number of future observations
(k) is greater than 1. The possible values are method="Bonferroni" (the default)
and method="exact". |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is conf.level=0.95. |
If the arguments n, k, n.mean, sigma.hat, and conf.level are not all the same
length, they are replicated to be the same length as the length of the longest
argument.
The help files for predIntNorm and predIntNormK
give formulas for a two-sided (1-α)100% prediction interval based on the sample size,
the observed sample mean and sample standard deviation, and specified confidence level.
Specifically, the two-sided prediction interval is given by:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m (see the help file for
predIntNormK). Thus, the half-width of the prediction interval is
given by:

    HW = Ks
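That is, the half-width returned by predIntNormHalfWidth is simply K times the value
supplied for sigma.hat. A short sketch, using arbitrary illustration values for n, k,
the confidence level, and sigma.hat:

# Illustration only: half-width = K * sigma.hat.
n <- 12
sigma.hat <- 17.1
K <- predIntNormK(n = n, k = 4, pi.type = "two-sided", conf.level = 0.9)
K * sigma.hat
# Should agree with:
predIntNormHalfWidth(n = n, k = 4, sigma.hat = sigma.hat, conf.level = 0.9)
rm(n, sigma.hat, K)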
numeric vector of half-widths.
See the help file for predIntNorm
.
Steven P. Millard ([email protected])
See the help file for predIntNorm
.
predIntNorm
, predIntNormK
,
predIntNormN
, plotPredIntNormDesign
.
# Look at how the half-width of a prediction interval increases with # increasing number of future observations: 1:5 #[1] 1 2 3 4 5 hw <- predIntNormHalfWidth(n = 10, k = 1:5) round(hw, 2) #[1] 2.37 2.82 3.08 3.26 3.41 #---------- # Look at how the half-width of a prediction interval decreases with # increasing sample size: 2:5 #[1] 2 3 4 5 hw <- predIntNormHalfWidth(n = 2:5) round(hw, 2) #[1] 15.56 4.97 3.56 3.04 #---------- # Look at how the half-width of a prediction interval increases with # increasing estimated standard deviation for a fixed sample size: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 hw <- predIntNormHalfWidth(n = 10, sigma.hat = seq(0.5, 2, by = 0.5)) round(hw, 2) #[1] 1.19 2.37 3.56 4.75 #---------- # Look at how the half-width of a prediction interval increases with # increasing confidence level for a fixed sample size: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 hw <- predIntNormHalfWidth(n = 5, conf = seq(0.5, 0.9, by = 0.1)) round(hw, 2) #[1] 0.81 1.03 1.30 1.68 2.34 #========== # The data frame EPA.92c.arsenic3.df contains arsenic concentrations (ppb) # collected quarterly for 3 years at a background well and quarterly for # 2 years at a compliance well. Using the data from the background well, compute # the half-width associated with sample sizes of 12 (3 years of quarterly data), # 16 (4 years of quarterly data), and 20 (5 years of quarterly data) for a # two-sided 90% prediction interval for k=4 future observations. EPA.92c.arsenic3.df # Arsenic Year Well.type #1 12.6 1 Background #2 30.8 1 Background #3 52.0 1 Background #... #18 3.8 5 Compliance #19 2.6 5 Compliance #20 51.9 5 Compliance mu.hat <- with(EPA.92c.arsenic3.df, mean(Arsenic[Well.type=="Background"])) mu.hat #[1] 27.51667 sigma.hat <- with(EPA.92c.arsenic3.df, sd(Arsenic[Well.type=="Background"])) sigma.hat #[1] 17.10119 hw <- predIntNormHalfWidth(n = c(12, 16, 20), k = 4, sigma.hat = sigma.hat, conf.level = 0.9) round(hw, 2) #[1] 46.16 43.89 42.64 #========== # Clean up #--------- rm(hw, mu.hat, sigma.hat)
Compute the Value of K for a Prediction Interval for a Normal Distribution
Compute the value of K (the multiplier of the estimated standard deviation) used
to construct a prediction interval for the next k observations or next set of k
means based on data from a normal distribution. The function
predIntNormK
is called by predIntNorm.
predIntNormK(n, df = n - 1, n.mean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95)
predIntNormK(n, df = n - 1, n.mean = 1, k = 1, method = "Bonferroni", pi.type = "two-sided", conf.level = 0.95)
n |
a positive integer greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
the degrees of freedom associated with the prediction interval. The default is
df=n-1. |
n.mean |
positive integer specifying the sample size associated with the k future averages.
The default value is n.mean=1 (i.e., individual observations). |
k |
positive integer specifying the number of future observations or averages the
prediction interval should contain with confidence level conf.level.
The default value is k=1. |
method |
character string specifying the method to use if the number of future observations
(k) is greater than 1. The possible values are method="Bonferroni" (the default)
and method="exact". This argument is ignored if k=1. |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are pi.type="two-sided" (the default), pi.type="lower",
and pi.type="upper". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95. |
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations or averages from that population
with some specified probability (1-α)100%, where 0 < α < 1 and k is some
pre-specified positive integer. The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval.
Let x = c(x_1, x_2, ..., x_n) denote a vector of n observations from a normal
distribution with parameters mean=μ and sd=σ. Also, let m denote the sample size
associated with the k future averages (i.e., n.mean=m). When m=1, each average is
really just a single observation, so in the rest of this help file the term
“averages” will replace the phrase “observations or averages”.

For a normal distribution, the form of a two-sided (1-α)100% prediction interval is:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k). The symbol K is used here to be consistent with the
notation used for tolerance intervals (see tolIntNorm).

Similarly, the form of a one-sided lower prediction interval is:

    [x̄ - Ks, ∞)

and the form of a one-sided upper prediction interval is:

    (-∞, x̄ + Ks]

The value of K is the same for the lower and upper one-sided intervals,
but differs for one-sided versus two-sided prediction intervals.
The derivation of the constant K is explained below. The function
predIntNormK computes the value of K and is called by predIntNorm.
The Derivation of K for One Future Observation or Average (k = 1)
Let X denote a random variable from a normal distribution
with parameters mean=μ and sd=σ, and let x_p denote the p'th quantile of X.

A true two-sided (1-α)100% prediction interval for the next k=1 observation of X
is given by:

    [x_{α/2}, x_{1-α/2}] = [μ - z_{1-α/2} σ, μ + z_{1-α/2} σ]

where z_p denotes the p'th quantile of a standard normal distribution.

More generally, a true two-sided (1-α)100% prediction interval for the
next k=1 average based on a sample of size m is given by:

    [μ - z_{1-α/2} σ/√m, μ + z_{1-α/2} σ/√m]

Because the values of μ and σ are unknown, they must be
estimated, and a prediction interval then constructed based on the estimated
values of μ and σ.

For a two-sided prediction interval (pi.type="two-sided"),
the constant K for a (1-α)100% prediction interval for the next k=1
average based on a sample size of m is computed as:

    K = t(n-1, 1-α/2) √(1/m + 1/n)

where t(ν, p) denotes the p'th quantile of the
Student's t-distribution with ν degrees of freedom. For a one-sided prediction
interval (pi.type="lower" or pi.type="upper"), the constant K is computed as:

    K = t(n-1, 1-α) √(1/m + 1/n)

The formulas for these prediction intervals are derived as follows. Let ȳ
denote the future average based on m observations. Then
the quantity ȳ - x̄ has a normal distribution with expectation 0
and variance given by:

    Var(ȳ - x̄) = σ²/m + σ²/n = σ² (1/m + 1/n)

so the quantity

    t = (ȳ - x̄) / [s √(1/m + 1/n)]

has a Student's t-distribution with n-1 degrees of freedom.
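For the case k=1, the closed-form expression for K can be checked directly against
predIntNormK; the values of n, m, and the confidence level below are arbitrary
illustration values.

# Illustration only: K = t(n-1, 1-alpha/2) * sqrt(1/m + 1/n) for k = 1.
n <- 20
m <- 2
conf.level <- 0.95
alpha <- 1 - conf.level
qt(1 - alpha/2, df = n - 1) * sqrt(1/m + 1/n)
# Should agree with:
predIntNormK(n = n, n.mean = m, k = 1, pi.type = "two-sided", conf.level = conf.level)
rm(n, m, conf.level, alpha)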
The Derivation of K for More than One Future Observation or Average (k > 1)
When k > 1, the function predIntNormK allows for two ways to compute K:
an exact method due to Dunnett (1955) (method="exact"), and
an approximate (conservative) method based on the Bonferroni inequality
(method="Bonferroni"; see Miller, 1981a, pp.8, 67-70;
Gibbons et al., 2009, p.4). Each of these methods is explained below.
Exact Method Due to Dunnett (1955) (method="exact")
Dunnett (1955) derived the value of K in the context of the multiple
comparisons problem of comparing several treatment means to one control mean.
In this case K is the solution of an equation that depends on the sample size n,
the number of future observations (averages) k, the sample size associated with
the future averages m, and the confidence level (1-α)100%.

When pi.type="lower" or pi.type="upper", K is the value that satisfies an
integral equation involving the cumulative distribution function and probability
density function of the standard normal distribution and the probability density
function of a chi random variable with n-1 degrees of freedom
(Gupta and Sobel, 1957; Hahn, 1970a). When pi.type="two-sided", K is the value
that satisfies the analogous two-sided equation.
Approximate Method Based on the Bonferroni Inequality (method="Bonferroni")
As shown above, when k=1, the value of K is given by the formulas above for
two-sided or one-sided prediction intervals, respectively. When k > 1,
a conservative way to construct a (1-α)100% prediction
interval for the next k observations or averages is to use a Bonferroni
correction (Miller, 1981a, p.8) and replace α with α/k in those formulas
(Chew, 1968). This value of K will be conservative in that the computed
prediction intervals will be wider than the exact prediction intervals.

Hahn (1969, 1970a) compared the exact values of K with those based on the
Bonferroni inequality for the case of m=1 and found the approximation to be
quite satisfactory except when n is small, k is large, and α is large.
For example, Gibbons (1987a) notes that for a 99% prediction interval
(i.e., α=0.01) for the next k observations, the bias of K is never greater
than 1%, no matter what the value of k, unless the sample size n is very small.
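A short sketch of the Bonferroni approximation for a one-sided upper limit: replace
α with α/k in the one-sided formula for K given above. The values of n, m, k, and
the confidence level are arbitrary illustration values.

# Illustration only: one-sided Bonferroni K for k > 1 future averages.
n <- 20
m <- 2
k <- 3
conf.level <- 0.99
alpha <- 1 - conf.level
qt(1 - alpha/k, df = n - 1) * sqrt(1/m + 1/n)
# Should agree with the Bonferroni method (the default):
predIntNormK(n = n, n.mean = m, k = k, pi.type = "upper", conf.level = conf.level)
# and be slightly larger (more conservative) than the exact method:
predIntNormK(n = n, n.mean = m, k = k, method = "exact", pi.type = "upper",
  conf.level = conf.level)
rm(n, m, k, conf.level, alpha)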
A numeric scalar equal to K, the multiplier of the estimated standard
deviation that is used to construct the prediction interval.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Dunnett, C.W. (1955). A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50, 1096-1121.
Dunnett, C.W. (1964). New Tables for Multiple Comparisons with a Control. Biometrics 20, 482-491.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York.
Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. (available on-line at: https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf).
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNorm
, predIntNormSimultaneous
,
predIntLnorm
, tolIntNorm
,
Normal, estimate.object
, enorm
, eqnorm
.
# Compute the value of K for a two-sided 95% prediction interval # for the next observation given a sample size of n=20. predIntNormK(n = 20) #[1] 2.144711 #-------------------------------------------------------------------- # Compute the value of K for a one-sided upper 99% prediction limit # for the next 3 averages of order 2 (i.e., each of the 3 future # averages is based on a sample size of 2 future observations) given a # sample size of n=20. predIntNormK(n = 20, n.mean = 2, k = 3, pi.type = "upper", conf.level = 0.99) #[1] 2.258026 #---------- # Compare the result above that is based on the Bonferroni method # with the exact method. predIntNormK(n = 20, n.mean = 2, k = 3, method = "exact", pi.type = "upper", conf.level = 0.99) #[1] 2.251084 #-------------------------------------------------------------------- # Example 18-1 of USEPA (2009, p.18-9) shows how to construct a 95% # prediction interval for 4 future observations assuming a # normal distribution based on arsenic concentrations (ppb) in # groundwater at a solid waste landfill. There were 4 years of # quarterly monitoring, and years 1-3 are considered background, # so the sample size for the prediction limit is n = 12, # and the number of future samples is k = 4. predIntNormK(n = 12, k = 4, pi.type = "upper") #[1] 2.698976
Sample Size for a Specified Half-Width of a Prediction Interval for the Next k Observations from a Normal Distribution
Compute the sample size necessary to achieve a specified half-width of a
prediction interval for the next k observations from a normal distribution.
predIntNormN(half.width, n.mean = 1, k = 1, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
predIntNormN(half.width, n.mean = 1, k = 1, sigma.hat = 1, method = "Bonferroni", conf.level = 0.95, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
half.width |
numeric vector of (positive) half-widths.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. |
n.mean |
numeric vector of positive integers specifying the sample size associated with
the k future averages. The default value is n.mean=1 (i.e., individual observations). |
k |
numeric vector of positive integers specifying the number of future observations
or averages the prediction interval should contain with confidence level
conf.level. The default value is k=1. |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s).
The default value is sigma.hat=1. |
method |
character string specifying the method to use if the number of future observations
(k) is greater than 1. The possible values are method="Bonferroni" (the default)
and method="exact". |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is conf.level=0.95. |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next largest integer. The default value is round.up=TRUE. |
n.max |
positive integer greater than 1 indicating the maximum possible sample size. The
default value is n.max=5000. |
tol |
numeric scalar indicating the tolerance to use in the uniroot search algorithm.
The default value is tol=1e-7. |
maxiter |
positive integer indicating the maximum number of iterations to use in the
uniroot search algorithm. The default value is maxiter=1000. |
If the arguments half.width, k, n.mean, sigma.hat, and conf.level are not all the
same length, they are replicated to be the same length as the length of the longest
argument.
The help files for predIntNorm and predIntNormK
give formulas for a two-sided (1-α)100% prediction interval based on the sample size,
the observed sample mean and sample standard deviation, and specified confidence level.
Specifically, the two-sided prediction interval is given by:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m (see the help file for
predIntNormK). Thus, the half-width of the prediction interval is
given by:

    HW = Ks
The function predIntNormN
uses the uniroot
search algorithm to
determine the sample size for specified values of the half-width, number of
observations used to create a single future average, number of future observations or
averages, the sample standard deviation, and the confidence level. Note that
unlike a confidence interval, the half-width of a prediction interval does not
approach 0 as the sample size increases.
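The following sketch illustrates this last point: as the sample size grows, the
half-width levels off near the value based on an infinite sample size, so very small
half-widths cannot be achieved simply by increasing n. The sample sizes and
confidence level are arbitrary illustration values.

# Illustration only: half-widths level off as n increases.
n <- c(5, 10, 50, 100, 1000)
round(predIntNormHalfWidth(n = n, k = 1, sigma.hat = 1, conf.level = 0.95), 3)
# The limiting value for an infinitely large baseline sample:
round(qnorm(0.975), 3)
rm(n)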
numeric vector of sample sizes.
See the help file for predIntNorm
.
Steven P. Millard ([email protected])
See the help file for predIntNorm
.
predIntNorm
, predIntNormK
,
predIntNormHalfWidth
, plotPredIntNormDesign
.
# Look at how the required sample size for a prediction interval increases # with increasing number of future observations: 1:5 #[1] 1 2 3 4 5 predIntNormN(half.width = 3, k = 1:5) #[1] 6 9 11 14 18 #---------- # Look at how the required sample size for a prediction interval decreases # with increasing half-width: 2:5 #[1] 2 3 4 5 predIntNormN(half.width = 2:5) #[1] 86 6 4 3 predIntNormN(2:5, round = FALSE) #[1] 85.567387 5.122911 3.542393 2.987861 #---------- # Look at how the required sample size for a prediction interval increases # with increasing estimated standard deviation for a fixed half-width: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 predIntNormN(half.width = 4, sigma.hat = seq(0.5, 2, by = 0.5)) #[1] 3 4 7 86 #---------- # Look at how the required sample size for a prediction interval increases # with increasing confidence level for a fixed half-width: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 predIntNormN(half.width = 2, conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 2 2 3 4 9 #========== # The data frame EPA.92c.arsenic3.df contains arsenic concentrations (ppb) # collected quarterly for 3 years at a background well and quarterly for # 2 years at a compliance well. Using the data from the background well, # compute the required sample size in order to achieve a half-width of # 2.25, 2.5, or 3 times the estimated standard deviation for a two-sided # 90% prediction interval for k=4 future observations. # # For a half-width of 2.25 standard deviations, the required sample size is 526, # or about 131 years of quarterly observations! For a half-width of 2.5 # standard deviations, the required sample size is 20, or about 5 years of # quarterly observations. For a half-width of 3 standard deviations, the required # sample size is 9, or about 2 years of quarterly observations. EPA.92c.arsenic3.df # Arsenic Year Well.type #1 12.6 1 Background #2 30.8 1 Background #3 52.0 1 Background #... #18 3.8 5 Compliance #19 2.6 5 Compliance #20 51.9 5 Compliance mu.hat <- with(EPA.92c.arsenic3.df, mean(Arsenic[Well.type=="Background"])) mu.hat #[1] 27.51667 sigma.hat <- with(EPA.92c.arsenic3.df, sd(Arsenic[Well.type=="Background"])) sigma.hat #[1] 17.10119 predIntNormN(half.width=c(2.25, 2.5, 3) * sigma.hat, k = 4, sigma.hat = sigma.hat, conf.level = 0.9) #[1] 526 20 9 #========== # Clean up #--------- rm(mu.hat, sigma.hat)
Estimate the mean and standard deviation of a
normal distribution, and construct a simultaneous prediction
interval for the next r sampling “occasions”, based on one of three
possible rules: k-of-m, California, or Modified California.
predIntNormSimultaneous(x, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5)
predIntNormSimultaneous(x, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5)
x |
a numeric vector of observations, or an object resulting from a call to an estimating
function that assumes a normal (Gaussian) distribution (e.g., enorm). |
n.mean |
positive integer specifying the sample size associated with the future averages.
The default value is n.mean=1 (i.e., individual observations). |
k |
for the k-of-m rule (rule="k.of.m"), a positive integer specifying the minimum
number of observations (or averages) out of m observations (or averages), all
obtained on one future sampling “occasion”, that the prediction interval should
contain with confidence level conf.level. The default value is k=1. This argument
is ignored when rule is not equal to "k.of.m". |
m |
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is m=2. |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is r=1. |
rule |
character string specifying which rule to use. The possible values are
"k.of.m" (k-of-m rule; the default), "CA" (California rule), and
"Modified.CA" (Modified California rule). |
delta.over.sigma |
numeric scalar indicating the ratio Δ/σ, where Δ denotes the difference between
the mean of the population that was sampled to construct the prediction interval
and the mean of the population that will be sampled to produce the future
observations, and σ denotes the standard deviation of both populations.
The default value is delta.over.sigma=0. |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are pi.type="upper" (the default) and pi.type="lower". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95. |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute K. The default value is K.tol=.Machine$double.eps^0.5. |
What is a Simultaneous Prediction Interval?
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations from that population
with some specified probability (1-α)100%, where 0 < α < 1 and k is some
pre-specified positive integer. The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function predIntNorm computes a standard prediction
interval based on a sample from a normal distribution.

The function predIntNormSimultaneous computes a simultaneous prediction
interval that will contain a certain number of future observations with probability
(1-α)100% for each of r future sampling “occasions”,
where r is some pre-specified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNormSimultaneous computes a simultaneous prediction
interval based on one of three possible rules:

For the k-of-m rule (rule="k.of.m"), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100% on each of the r future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, the results of
predIntNormSimultaneous are equivalent to the results of predIntNorm.

For the California rule (rule="CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.

For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.

Simultaneous prediction intervals can be extended to using averages (means) in place
of single observations (USEPA, 2009, Chapter 19). That is, you can create a
simultaneous prediction interval
that will contain a specified number of averages (based on which rule you choose) on
each of r future sampling occasions, where each average is based on
n.mean individual observations. For the function predIntNormSimultaneous,
this number is specified by the argument n.mean.
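The stated property of the 1-of-3 rule can be checked by simulation: over repeated
baseline samples, the proportion of future sampling occasions on which at least 1 of
the 3 future observations falls below the upper prediction limit should be close to
the confidence level. The true distribution, sample sizes, and number of simulations
below are arbitrary illustration values, and the simulation may take a little while
to run.

# Illustration only: simulated coverage of the 1-of-3 rule with r = 1.
set.seed(123)
n.sim <- 500
ok <- replicate(n.sim, {
  dat <- rnorm(8, mean = 10, sd = 2)       # baseline sample
  upl <- predIntNormSimultaneous(dat, k = 1, m = 3, r = 1, rule = "k.of.m",
    pi.type = "upper", conf.level = 0.95)$interval$limits["UPL"]
  future <- rnorm(3, mean = 10, sd = 2)    # one future sampling occasion
  sum(future <= upl) >= 1                  # at least 1 of 3 within the limit?
})
mean(ok)   # should be close to 0.95
rm(n.sim, ok)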
The Form of a Prediction Interval for 1 Future Observation
Let x = c(x_1, x_2, ..., x_n) denote a vector of n observations from a normal
distribution with parameters mean=μ and sd=σ. Also, let n.mean denote the
sample size associated with the future averages. When n.mean=1, each average is
really just a single observation, so in the rest of this help file the term
“averages” will replace the phrase “observations or averages”.

For a normal distribution, the form of a two-sided (1-α)100%
prediction interval is:

    [x̄ - Ks, x̄ + Ks]

where x̄ denotes the sample mean:

    x̄ = (1/n) Σ x_i

s denotes the sample standard deviation:

    s² = [1/(n-1)] Σ (x_i - x̄)²

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future sampling occasions r, and the
sample size associated with the future averages (n.mean). Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k) in the k-of-m rule. The symbol K is used here
to be consistent with the notation used for tolerance intervals
(see tolIntNorm).

Similarly, the form of a one-sided lower prediction interval is:

    [x̄ - Ks, ∞)

and the form of a one-sided upper prediction interval is:

    (-∞, x̄ + Ks]

The derivation of the constant K is explained in the help file for
predIntNormK.
The Form of a Simultaneous Prediction Interval
For simultaneous prediction intervals, only lower
(pi.type="lower") and upper (pi.type="upper") prediction
intervals are available. Two-sided simultaneous prediction intervals were
available in Versions 2.4.0 - 2.8.1 of EnvStats, but these prediction
intervals were based on an incorrect algorithm for K.
The one-sided forms given above hold for simultaneous prediction intervals, but the
derivation of the constant K is more difficult, and is explained in the help file for
predIntNormSimultaneousK.
Prediction Intervals are Random Intervals
A prediction interval is a random interval; that is, the lower and/or
upper bounds are random variables computed based on sample statistics in the
baseline sample. Prior to taking one specific baseline sample, the probability
that the prediction interval will perform according to the rule chosen is
(1-α)100%. Once a specific baseline sample is taken and the prediction
interval based on that sample is computed, the probability that that prediction
interval will perform according to the rule chosen is not necessarily
(1-α)100%, but it should be close. See the help file for
predIntNorm for more information.
If x is a numeric vector, predIntNormSimultaneous returns a list of class
"estimate" containing the estimated parameters, the prediction interval, and other
information. See the help file for estimate.object for details.

If x is the result of calling an estimation function, predIntNormSimultaneous
returns a list whose class is the same as x. The list contains the same components
as x, as well as a component called interval containing the prediction interval
information. If x already has a component called interval, this component is
replaced with the prediction interval information.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at
hazardous and solid waste facilities is the requirement of testing several wells and
several constituents at each well on each sampling occasion. This is an obvious
multiple comparisons problem, and the naive approach of using a standard t-test at
a conventional α-level (e.g., 0.05 or 0.01) for each test leads to a
very high probability of at least one significant result on each sampling occasion,
when in fact no contamination has occurred. This problem was pointed out years ago
by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of
controlling the facility-wide false positive rate (FWFPR) while maintaining adequate
power to detect contamination in the groundwater. Because of the ubiquitous presence
of spatial variability, it is usually best to use simultaneous prediction intervals
at each well (Davis, 1998a): prediction intervals are constructed from background
(pre-landfill) data at each well, and future observations at a well are compared to
the prediction interval for that particular well. In this case, the individual
α-level at each well is equal to the FWFPR divided by the
product of the number of wells and constituents.
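For example, with 10 constituents and 50 wells (the values used in Example 19-1
below) and a site-wide false positive rate of 10%, the per-test α-level and the
corresponding per-test confidence level can be computed as follows; the simple
division described above and the (1 - SWFPR)^(1/(number of tests)) formulation used
in the example give nearly identical results.

# Illustration only: translating a site-wide false positive rate into a
# per-test alpha level and confidence level.
SWFPR <- 0.1
nc <- 10     # number of constituents
nw <- 50     # number of wells
n.tests <- nc * nw
# Per-test alpha obtained by dividing the site-wide rate by the number of tests:
SWFPR / n.tests
# Per-test alpha and confidence level based on (1 - SWFPR)^(1/n.tests):
1 - (1 - SWFPR)^(1 / n.tests)
(1 - SWFPR)^(1 / n.tests)
rm(SWFPR, nc, nw, n.tests)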
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the
1-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule
For the k-of-m rule, Davis and McNichols (1987) give tables with
“optimal” choices of k (in terms of best power for a given overall
confidence level) for selected values of m, r, and n. They found
that the optimal ratios of k to m (i.e., k/m) are generally small,
in the range of 15-50%.
The California Rule
The California rule was mandated in that state for groundwater monitoring at waste
disposal facilities when resampling verification is part of the statistical program
(Barclay's Code of California Regulations, 1991). The California code mandates a
“California” rule with m ≥ 3. The motivation for this rule may have
been a desire to have a majority of the observations in bounds (Davis, 1998a). For
example, for a k-of-m rule with k=1 and m=3, a monitoring
location will pass if the first observation is out of bounds, the second resample
is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations
are out of bounds. For the California rule with m=3, either the first
observation must be in bounds, or the next 2 observations must be in bounds in order
for the monitoring location to pass.

Davis (1998a) states that if the FWFPR is kept constant, then the California rule
offers little increased power compared to the k-of-m rule, and can
actually decrease the power of detecting contamination.
The Modified California Rule
The Modified California Rule was proposed as a compromise between a 1-of-m
rule and the California rule. For a given FWFPR, the Modified California rule
achieves better power than the California rule, and still requires at least as many
observations in bounds as out of bounds, unlike a 1-of-m rule.
Different Notations Between Different References
For the k-of-m rule described in this help file, both
Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable
p instead of k to represent the minimum number
of future observations the interval should contain on each of the r sampling
occasions.

Gibbons et al. (2009, Chapter 1) presents extensive lists of the value of
K for both k-of-m rules and California rules. Gibbons et al.'s
notation reverses the meaning of k and r compared to the notation used
in this help file. That is, in Gibbons et al.'s notation, k represents the
number of future sampling occasions or monitoring wells, and r represents the
minimum number of observations the interval should contain on each sampling occasion.
USEPA (2009, Chapter 19) also uses notation that differs from the notation in this
help file for some of these quantities.
Steven P. Millard ([email protected])
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m
Observations from a Normal Population on Each of r Future Occasions.
Technometrics 29, 359–370.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least p
Out of m Future Observations From a Normal Population.
Technometrics 19, 167–177.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at
Least m Out of k Future Observations.
Technometrics 15, 897–914.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNormSimultaneousK
,
predIntNormSimultaneousTestPower
,
predIntNorm
, predIntLnormSimultaneous
, tolIntNorm
,
Normal, estimate.object
, enorm
# Generate 8 observations from a normal distribution with parameters # mean=10 and sd=2, then use predIntNormSimultaneous to estimate the # mean and standard deviation of the true distribution and construct an # upper 95% prediction interval to contain at least 1 out of the next # 3 observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(479) dat <- rnorm(8, mean = 10, sd = 2) predIntNormSimultaneous(dat, k = 1, m = 3) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 10.269773 # sd = 2.210246 # #Estimation Method: mvue # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 95% # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = -Inf # UPL = 11.4021 #---------- # Repeat the above example, but do it in two steps. First create a list called # est.list containing information about the estimated parameters, then create the # prediction interval. est.list <- enorm(dat) est.list #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 10.269773 # sd = 2.210246 # #Estimation Method: mvue # #Data: dat # #Sample Size: 8 predIntNormSimultaneous(est.list, k = 1, m = 3) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 10.269773 # sd = 2.210246 # #Estimation Method: mvue # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 95% # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = -Inf # UPL = 11.4021 #---------- # Compare the 95% 1-of-3 upper prediction interval to the California and # Modified California prediction intervals. Note that the upper prediction # bound for the Modified California rule is between the bound for the # 1-of-3 rule bound and the bound for the California rule. predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #11.4021 predIntNormSimultaneous(dat, m = 3, rule = "CA")$interval$limits["UPL"] # UPL #13.03717 predIntNormSimultaneous(dat, rule = "Modified.CA")$interval$limits["UPL"] # UPL #12.12201 #---------- # Show how the upper bound on an upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. Here, we'll use the # 1-of-3 rule. predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #11.4021 predIntNormSimultaneous(dat, k = 1, m = 3, r = 10)$interval$limits["UPL"] # UPL #13.28234 #---------- # Compare the upper simultaneous prediction limit for the 1-of-3 rule # based on individual observations versus based on means of order 4. predIntNormSimultaneous(dat, k = 1, m = 3)$interval$limits["UPL"] # UPL #11.4021 predIntNormSimultaneous(dat, n.mean = 4, k = 1, m = 3)$interval$limits["UPL"] # UPL #11.26157 #========== # Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an # upper simultaneous prediction limit for the 1-of-3 rule for # r = 2 future sampling occasions. The data for this example are # stored in EPA.09.Ex.19.1.sulfate.df. 
# We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 25 to use to construct the prediction limit. # There are 50 compliance wells and we will monitor 10 different # constituents at each well at each of the r=2 future sampling # occasions. To determine the confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the individual Type I Error level at each well to # 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the confidence level is given by: nc <- 10 nw <- 50 SWFPR <- 0.1 conf.level <- (1 - SWFPR)^(1 / (nc * nw)) conf.level #[1] 0.9997893 #---------- # Look at the data: names(EPA.09.Ex.19.1.sulfate.df) #[1] "Well" "Month" "Day" #[4] "Year" "Date" "Sulfate.mg.per.l" #[7] "log.Sulfate.mg.per.l" EPA.09.Ex.19.1.sulfate.df[, c("Well", "Date", "Sulfate.mg.per.l", "log.Sulfate.mg.per.l")] # Well Date Sulfate.mg.per.l log.Sulfate.mg.per.l #1 GW-01 1999-07-08 63.0 4.143135 #2 GW-01 1999-09-12 51.0 3.931826 #3 GW-01 1999-10-16 60.0 4.094345 #4 GW-01 1999-11-02 86.0 4.454347 #5 GW-04 1999-07-09 104.0 4.644391 #6 GW-04 1999-09-14 102.0 4.624973 #7 GW-04 1999-10-12 84.0 4.430817 #8 GW-04 1999-11-15 72.0 4.276666 #9 GW-08 1997-10-12 31.0 3.433987 #10 GW-08 1997-11-16 84.0 4.430817 #11 GW-08 1998-01-28 65.0 4.174387 #12 GW-08 1999-04-20 41.0 3.713572 #13 GW-08 2002-06-04 51.8 3.947390 #14 GW-08 2002-09-16 57.5 4.051785 #15 GW-08 2002-12-02 66.8 4.201703 #16 GW-08 2003-03-24 87.1 4.467057 #17 GW-09 1997-10-16 59.0 4.077537 #18 GW-09 1998-01-28 85.0 4.442651 #19 GW-09 1998-04-12 75.0 4.317488 #20 GW-09 1998-07-12 99.0 4.595120 #21 GW-09 2000-01-30 75.8 4.328098 #22 GW-09 2000-04-24 82.5 4.412798 #23 GW-09 2000-10-24 85.5 4.448516 #24 GW-09 2002-12-01 188.0 5.236442 #25 GW-09 2003-03-24 150.0 5.010635 # Construct the upper simultaneous prediction limit for the # 1-of-3 plan based on the log-transformed sulfate data log.Sulfate <- EPA.09.Ex.19.1.sulfate.df$log.Sulfate.mg.per.l pred.int.list.log <- predIntNormSimultaneous(x = log.Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) pred.int.list.log #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 4.3156194 # sd = 0.3756697 # #Estimation Method: mvue # #Data: log.Sulfate # #Sample Size: 25 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99.97893% # #Minimum Number of #Future Observations #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Observations #(per Sampling Occasion): 3 # #Number of Future #Sampling Occasions: 2 # #Prediction Interval: LPL = -Inf # UPL = 5.072355 # Now exponentiate the prediction interval to get the limit on # the original scale exp(pred.int.list.log$interval$limits["UPL"]) # UPL #159.5497 #========== ## Not run: # Try to compute a two-sided simultaneous prediction interval: predIntNormSimultaneous(x = log.Sulfate, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "two-sided", conf.level = conf.level) #Error in predIntNormSimultaneous(x = log.Sulfate, k = 1, m = 3, r = 2, : # Two-sided simultaneous prediction intervals are not currently available. 
# NOTE: Two-sided simultaneous prediction intervals computed using # Versions 2.4.0 - 2.8.1 of EnvStats are *NOT* valid. ## End(Not run) #========== # Cleanup #-------- rm(dat, est.list, nc, nw, SWFPR, conf.level, log.Sulfate, pred.int.list.log)
Compute the Value of K for a Simultaneous Prediction Interval for a Normal Distribution
Compute the value of K (the multiplier of the estimated standard deviation) used
to construct a simultaneous prediction interval based on data from a
normal distribution.
The function
predIntNormSimultaneousK
is called by predIntNormSimultaneous
.
predIntNormSimultaneousK(n, df = n - 1, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, K.tol = .Machine$double.eps^0.5, integrate.args.list = NULL)
n |
a positive integer greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
the degrees of freedom associated with the prediction interval. The default is df=n-1. |
n.mean |
positive integer specifying the sample size associated with the future averages.
The default value is n.mean=1 (i.e., individual observations). |
k |
for the k-of-m rule (rule="k.of.m"), a positive integer specifying the minimum number of observations (or averages), out of m observations (or averages) on one future sampling “occasion”, that the prediction interval should contain. The default value is k=1. |
m |
positive integer specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is m=2. |
r |
positive integer specifying the number of future sampling “occasions”.
The default value is r=1. |
rule |
character string specifying which rule to use. The possible values are
"k.of.m" (the default), "CA", and "Modified.CA". |
delta.over.sigma |
numeric scalar indicating the ratio Δ/σ, where Δ denotes the difference between the mean of the population that will be sampled to produce the future observations and the mean of the population that was sampled to construct the prediction interval, and σ denotes the common population standard deviation. The default value is delta.over.sigma=0. |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are pi.type="upper" (the default) and pi.type="lower". |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95. |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute K. The default value is K.tol=.Machine$double.eps^0.5. |
integrate.args.list |
a list of arguments to supply to the integrate function. The default value is integrate.args.list=NULL. |
What is a Simultaneous Prediction Interval?
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations from that population
with some specified probability (1-α)100%, where 0 < α < 1
and k is some pre-specified positive integer.
The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function
predIntNorm
computes a standard prediction
interval based on a sample from a normal distribution.
The function predIntNormSimultaneous
computes a simultaneous prediction
interval that will contain a certain number of future observations with probability
(1-α)100% for each of r future sampling “occasions”,
where r is some pre-specified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNormSimultaneous
computes a simultaneous prediction
interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100% on each of the r future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, the results of
predIntNormSimultaneous
are equivalent to the results of predIntNorm.
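As a quick numerical sketch of this note (added for illustration; the data mirror the examples below, and the method = "exact" choice for predIntNorm is an assumption made so that both intervals are computed the same way):
set.seed(479)
dat <- rnorm(8, mean = 10, sd = 2)
# 3-of-3 simultaneous rule on a single occasion ...
predIntNormSimultaneous(dat, k = 3, m = 3, r = 1, rule = "k.of.m", pi.type = "upper")$interval$limits["UPL"]
# ... should agree with the standard upper prediction interval for the next 3 observations
predIntNorm(dat, k = 3, method = "exact", pi.type = "upper")$interval$limits["UPL"]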
For the California rule (rule="CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
Simultaneous prediction intervals can be extended to using averages (means) in place
of single observations (USEPA, 2009, Chapter 19). That is, you can create a
simultaneous prediction interval
that will contain a specified number of averages (based on which rule you choose) on
each of r future sampling occasions, where each average is based on w
individual observations. For the functions
predIntNormSimultaneous
and predIntNormSimultaneousK,
the argument n.mean
corresponds to w.
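A brief illustrative sketch of the averages case (added here; it reuses the numbers reported in the examples below rather than introducing new results): with the 8 observations generated in the examples, moving from individual observations to means of 4 observations tightens the 1-of-3 upper limit from 11.4021 to 11.26157, an implied multiplier of roughly (11.26157 - 10.269773)/2.210246, or about 0.45.
# Multiplier for the 1-of-3 rule when each future value is a mean of 4 observations
predIntNormSimultaneousK(n = 8, n.mean = 4, k = 1, m = 3)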
The Form of a Prediction Interval for 1 Future Observation
Let x_1, x_2, …, x_n denote a vector of n
observations from a normal distribution with parameters
mean=μ and sd=σ. Also, let w denote the
sample size associated with the future averages (i.e.,
n.mean=w).
When w=1, each average is really just a single observation, so in the rest of
this help file the term “averages” will sometimes replace the phrase
“observations or averages”.
For a normal distribution, the form of a two-sided
simultaneous prediction interval is:
[xbar - K*s, xbar + K*s]        (1)
where xbar denotes the sample mean:
xbar = (1/n) * sum(x_i)        (2)
s denotes the sample standard deviation:
s^2 = (1/(n-1)) * sum((x_i - xbar)^2)        (3)
and K denotes a constant that depends on the sample size n, the
confidence level, the number of future sampling occasions r, and the
sample size associated with the future averages, w. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k) in the k-of-m rule. The symbol K is used here
to be consistent with the notation used for tolerance intervals
(see tolIntNorm).
Similarly, the form of a one-sided lower prediction interval is:
[xbar - K*s, Inf)        (4)
and the form of a one-sided upper prediction interval is:
(-Inf, xbar + K*s]        (5)
The derivation of the constant K for 1 future observation is
explained in the help file for predIntNormK.
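As a minimal sketch of this connection (an added illustration, not from the original examples; the expectation that the two multipliers coincide follows from the note above about the degenerate case k = m and r = 1):
# Multiplier for a standard upper prediction interval for 1 future observation
predIntNormK(n = 8, k = 1, pi.type = "upper")
# Multiplier for the degenerate simultaneous rule with k = m = 1 and r = 1,
# which should be essentially the same value
predIntNormSimultaneousK(n = 8, k = 1, m = 1, r = 1)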
The Form of a Simultaneous Prediction Interval
For simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper (pi.type="upper"
) prediction
intervals are available. Two-sided simultaneous prediction intervals were
available in Versions 2.4.0 - 2.8.1 of EnvStats, but these prediction
intervals were based on an incorrect algorithm for K.
Equations (4) and (5) above hold for simultaneous prediction intervals, but the
derivation of the constant K is more difficult, and is explained below.
The Derivation of K for Future Observations
First we will show the derivation based on future observations (i.e., w=1,
n.mean=1), and then extend the formulas to future averages.
The Derivation of K for the k-of-m Rule (rule="k.of.m")
For the k-of-m rule (rule="k.of.m") with w=1 (i.e.,
n.mean=1), at least k of the next m future
observations will fall in the prediction interval
with probability (1-α)100% on each of the r future sampling
occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, this kind of simultaneous prediction
interval becomes the same as a standard prediction interval for the next k
observations (see predIntNorm).
For the case when r=1 future sampling occasion, both Hall and Prairie (1973)
and Fertig and Mann (1977) discuss the derivation of K. Davis and McNichols
(1987) extend the derivation to the case where r is a positive integer. They
show that for a one-sided upper prediction interval (pi.type="upper"), the
probability (1-α) that at least k of the next m future observations
will be contained in the interval given in Equation (5) above, for each of r
future sampling occasions, is given by Equation (6), an integral expression
written in terms of the following quantities:
T(x; ν, δ) denotes the cdf of the
non-central Student's t-distribution with parameters df=ν and ncp=δ
evaluated at x;
Φ(x) denotes the cdf of the standard normal distribution evaluated at x;
F(x; a, b) denotes the cdf of the
beta distribution with parameters shape1=a and shape2=b; and
B(a, b) denotes the value of the
beta function with parameters a and b.
The quantity Δ (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity σ (sigma) denotes the population standard deviation of both
of these populations. Usually you assume Δ=0 unless you are interested
in computing the power of the rule to detect a change in means between the
populations (see predIntNormSimultaneousTestPower).
For given values of the confidence level ((1-α)100%), sample size (n),
minimum number of future observations to be contained in the interval per
sampling occasion (k), number of future observations per sampling occasion
(m), and number of future sampling occasions (r), Equation (6) can
be solved for K. The function
predIntNormSimultaneousK
uses the
R function nlminb
to solve Equation (6) for K.
When pi.type="lower", the same value of K is used as when
pi.type="upper", but Equation (4) is used to construct the prediction
interval.
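To make the role of K concrete, here is a small check (added for illustration; it reuses the values already reported in the examples below) showing that the upper limit returned by predIntNormSimultaneous is simply xbar + K*s:
set.seed(479)
dat <- rnorm(8, mean = 10, sd = 2)
K <- predIntNormSimultaneousK(n = 8, k = 1, m = 3)  # 0.5123091 in the examples below
mean(dat) + K * sd(dat)
# This should reproduce the UPL of about 11.4021 reported by
# predIntNormSimultaneous(dat, k = 1, m = 3) in the examples below.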
The Derivation of K for the California Rule (rule="CA")
For the California rule (rule="CA"), with probability (1-α)100%,
for each of the r future sampling occasions, either the first observation will
fall in the prediction interval, or else all of the next m-1 observations will
fall in the prediction interval. That is, if the first observation falls in the
prediction interval then sampling can stop. Otherwise, m-1 more observations
must be taken.
The formula for K is the same as for the k-of-m rule, except that
Equation (6) is replaced by an analogous expression, Equation (7), given by
Davis (1998b).
The Derivation of K for the Modified California Rule (rule="Modified.CA")
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
The formula for K is the same as for the k-of-m rule, except that
Equation (6) is replaced by an analogous expression, Equation (8), given by
Davis (1998b).
The Derivation of K for Future Means
For each of the above rules, if we are interested in using averages instead of
single observations, with w ≥ 1 (i.e., n.mean=w), the first
term in the integral in Equations (6)-(8) that involves the cdf of the
non-central Student's t-distribution is modified to account for the fact that
each future value is the mean of w observations.
A numeric scalar equal to K, the multiplier of the estimated standard
deviation that is used to construct the simultaneous prediction interval.
Motivation
Prediction and tolerance intervals have long been applied to quality control and
life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of
environmental statistics, prediction intervals are useful for analyzing data from
groundwater detection monitoring programs at hazardous and solid waste facilities.
One of the main statistical problems that plague groundwater monitoring programs at
hazardous and solid waste facilities is the requirement of testing several wells and
several constituents at each well on each sampling occasion. This is an obvious
multiple comparisons problem, and the naive approach of using a standard t-test at
a conventional α-level (e.g., 0.05 or 0.01) for each test leads to a
very high probability of at least one significant result on each sampling occasion,
when in fact no contamination has occurred. This problem was pointed out years ago
by Millard (1987) and others.
Davis and McNichols (1987) proposed simultaneous prediction intervals as a way of
controlling the facility-wide false positive rate (FWFPR) while maintaining adequate
power to detect contamination in the groundwater. Because of the ubiquitous presence
of spatial variability, it is usually best to use simultaneous prediction intervals
at each well (Davis, 1998a). That is, prediction intervals are constructed based on
background (pre-landfill) data at each well, and future observations at a
well are compared to the prediction interval for that particular well. In this case,
the individual α-level at each well is equal to the FWFPR divided by the
product of the number of wells and constituents.
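For intuition, here is a tiny numerical sketch (added for illustration; the well and constituent counts are those of Example 19-1 below) comparing the divided-up α with the exact exponent form used in the examples:
nw <- 50; nc <- 10; SWFPR <- 0.1
SWFPR / (nw * nc)                # simple division: 0.0002
1 - (1 - SWFPR)^(1 / (nw * nc))  # exact per-test alpha: about 0.00021
# Either way the per-test confidence level is essentially 0.9998, matching the
# conf.level of 0.9997893 computed in Example 19-1 below.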
Often, observations at downgradient wells are not available prior to the construction and operation of the landfill. In this case, upgradient well data can be combined to create a background prediction interval, and observations at each downgradient well can be compared to this prediction interval. If spatial variability is present and a major source of variation, however, this method is not really valid (Davis, 1994; Davis, 1998a).
Chapter 19 of USEPA (2009) contains an extensive discussion of using the
k-of-m rule and the Modified California rule.
Chapters 1 and 3 of Gibbons et al. (2009) discuss simultaneous prediction intervals
for the normal and lognormal distributions, respectively.
The k-of-m Rule
For the k-of-m rule, Davis and McNichols (1987) give tables with
“optimal” choices of k (in terms of best power for a given overall
confidence level) for selected values of m, r, and n. They found
that the optimal ratios of k to m (i.e., k/m) are generally small,
in the range of 15-50%.
The California Rule
The California rule was mandated in that state for groundwater monitoring at waste
disposal facilities when resampling verification is part of the statistical program
(Barclay's Code of California Regulations, 1991). The California code mandates a
“California” rule with m ≥ 3. The motivation for this rule may have
been a desire to have a majority of the observations in bounds (Davis, 1998a). For
example, for a k-of-m rule with k=1 and m=3, a monitoring
location will pass if the first observation is out of bounds, the second resample
is out of bounds, but the last resample is in bounds, so that 2 out of 3 observations
are out of bounds. For the California rule with m=3, either the first
observation must be in bounds, or the next 2 observations must be in bounds in order
for the monitoring location to pass.
Davis (1998a) states that if the FWFPR is kept constant, then the California rule
offers little increased power compared to the k-of-m rule, and can
actually decrease the power of detecting contamination.
The Modified California Rule
The Modified California Rule was proposed as a compromise between a 1-of-m
rule and the California rule. For a given FWFPR, the Modified California rule
achieves better power than the California rule, and still requires at least as many
observations in bounds as out of bounds, unlike a 1-of-m rule.
Different Notations Between Different References
For the k-of-m rule described in this help file, both
Davis and McNichols (1987) and USEPA (2009, Chapter 19) use the variable p
instead of k to represent the minimum number
of future observations the interval should contain on each of the r sampling
occasions.
Gibbons et al. (2009, Chapter 1) presents extensive lists of the value of K
for both k-of-m rules and California rules. Gibbons et al.'s
notation reverses the meaning of k and r compared to the notation used
in this help file. That is, in Gibbons et al.'s notation, k represents the
number of future sampling occasions or monitoring wells, and r represents the
minimum number of observations the interval should contain on each sampling occasion.
USEPA (2009, Chapter 19) likewise uses its own symbol in place of r.
Steven P. Millard ([email protected])
Barclay's California Code of Regulations. (1991). Title 22, Section 66264.97 [concerning hazardous waste facilities] and Title 23, Section 2550.7(e)(8) [concerning solid waste facilities]. Barclay's Law Publishers, San Francisco, CA.
Davis, C.B. (1998a). Ground-Water Statistics & Regulations: Principles, Progress and Problems. Second Edition. Environmetrics & Statistics Limited, Henderson, NV.
Davis, C.B. (1998b). Personal Communication, September 3, 1998.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p
of m Observations from a Normal Population on Each of r Future Occasions.
Technometrics 29, 359–370.
Fertig, K.W., and N.R. Mann. (1977). One-Sided Prediction Intervals for at Least
p Out of m Future Observations From a Normal Population.
Technometrics 19, 167–177.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.
Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.
Hall, I.J., and R.R. Prairie. (1973). One-Sided Prediction Intervals to Contain at
Least m Out of k Future Observations.
Technometrics 15, 897–914.
Millard, S.P. (1987). Environmental Monitoring, Statistics, and the Law: Room for Improvement (with Comment). The American Statistician 41(4), 249–259.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNormSimultaneous
,
predIntNormSimultaneousTestPower
,
predIntNorm
, predIntNormK
,
predIntLnormSimultaneous
, tolIntNorm
,
Normal, estimate.object
, enorm
# Compute the value of K for an upper 95% simultaneous prediction # interval to contain at least 1 out of the next 3 observations # given a background sample size of n=8. predIntNormSimultaneousK(n = 8, k = 1, m = 3) #[1] 0.5123091 #---------- # Compare the value of K for a 95% 1-of-3 upper prediction interval to # the value for the California and Modified California rules. # Note that the value of K for the Modified California rule is between # the value of K for the 1-of-3 rule and the California rule. predIntNormSimultaneousK(n = 8, k = 1, m = 3) #[1] 0.5123091 predIntNormSimultaneousK(n = 8, m = 3, rule = "CA") #[1] 1.252077 predIntNormSimultaneousK(n = 8, rule = "Modified.CA") #[1] 0.8380233 #---------- # Show how the value of K for an upper 95% simultaneous prediction # limit increases as the number of future sampling occasions r increases. # Here, we'll use the 1-of-3 rule. predIntNormSimultaneousK(n = 8, k = 1, m = 3) #[1] 0.5123091 predIntNormSimultaneousK(n = 8, k = 1, m = 3, r = 10) #[1] 1.363002 #========== # Example 19-1 of USEPA (2009, p. 19-17) shows how to compute an # upper simultaneous prediction limit for the 1-of-3 rule for # r = 2 future sampling occasions. The data for this example are # stored in EPA.09.Ex.19.1.sulfate.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 25 to use to construct the prediction limit. # There are 50 compliance wells and we will monitor 10 different # constituents at each well at each of the r=2 future sampling # occasions. To determine the confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the individual Type I Error level at each well to # 1 - (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / (Number of Constituents * Number of Wells)) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the confidence level is given by: nc <- 10 nw <- 50 SWFPR <- 0.1 conf.level <- (1 - SWFPR)^(1 / (nc * nw)) conf.level #[1] 0.9997893 #---------- # Compute the value of K for the upper simultaneous prediction # limit for the 1-of-3 plan. predIntNormSimultaneousK(n = 25, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "upper", conf.level = conf.level) #[1] 2.014365 #========== ## Not run: # Try to compute K for a two-sided simultaneous prediction interval: predIntNormSimultaneousK(n = 25, k = 1, m = 3, r = 2, rule = "k.of.m", pi.type = "two-sided", conf.level = conf.level) #Error in predIntNormSimultaneousK(n = 25, k = 1, m = 3, r = 2, rule = "k.of.m", : # Two-sided simultaneous prediction intervals are not currently available. # NOTE: Two-sided simultaneous prediction intervals computed using # Versions 2.4.0 - 2.8.1 of EnvStats are *NOT* valid. ## End(Not run) #========== # Cleanup #-------- rm(nc, nw, SWFPR, conf.level)
Compute the probability that at least one set of future observations violates the
given rule based on a simultaneous prediction interval for the next r future
sampling occasions for a normal distribution. The three possible rules are:
k-of-m, California, or Modified California.
predIntNormSimultaneousTestPower(n, df = n - 1, n.mean = 1, k = 1, m = 2, r = 1, rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, r.shifted = r, K.tol = .Machine$double.eps^0.5, integrate.args.list = NULL)
n |
vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
vector of positive integers indicating the degrees of freedom associated with
the sample size. The default value is df=n-1. |
n.mean |
positive integer specifying the sample size associated with the future averages.
The default value is n.mean=1 (i.e., individual observations). |
k |
for the k-of-m rule (rule="k.of.m"), a vector of positive integers specifying the minimum number of observations (or averages), out of m observations (or averages) on one future sampling “occasion”, that the prediction interval should contain. The default value is k=1. |
m |
vector of positive integers specifying the maximum number of future observations (or
averages) on one future sampling “occasion”.
The default value is m=2. |
r |
vector of positive integers specifying the number of future sampling “occasions”.
The default value is r=1. |
rule |
character string specifying which rule to use. The possible values are
"k.of.m" (the default), "CA", and "Modified.CA". |
delta.over.sigma |
numeric vector indicating the ratio Δ/σ, where Δ denotes the difference between the mean of the population that will be sampled to produce the future observations and the mean of the population that was sampled to construct the prediction interval, and σ denotes the common population standard deviation. The default value is delta.over.sigma=0. |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are pi.type="upper" (the default) and pi.type="lower". |
conf.level |
vector of values between 0 and 1 indicating the confidence level of the prediction interval.
The default value is conf.level=0.95. |
r.shifted |
vector of positive integers specifying the number of future sampling occasions for
which the scaled mean is shifted by delta.over.sigma. The default value is r.shifted=r. |
K.tol |
numeric scalar indicating the tolerance to use in the nonlinear search algorithm to
compute K. The default value is K.tol=.Machine$double.eps^0.5. |
integrate.args.list |
a list of arguments to supply to the integrate function. The default value is integrate.args.list=NULL. |
What is a Simultaneous Prediction Interval?
A prediction interval for some population is an interval on the real line constructed
so that it will contain k future observations from that population
with some specified probability (1-α)100%, where 0 < α < 1
and k is some pre-specified positive integer.
The quantity (1-α)100% is called
the confidence coefficient or confidence level associated with the prediction
interval. The function
predIntNorm
computes a standard prediction
interval based on a sample from a normal distribution.
The function predIntNormSimultaneous
computes a simultaneous prediction
interval that will contain a certain number of future observations with probability
(1-α)100% for each of r future sampling “occasions”,
where r is some pre-specified positive integer. The quantity r may
refer to r distinct future sampling occasions in time, or it may for example
refer to sampling at r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the r
distinct locations.
The function predIntNormSimultaneous
computes a simultaneous prediction
interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of
the next m future observations will fall in the prediction
interval with probability (1-α)100% on each of the r future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, the results of
predIntNormSimultaneous
are equivalent to the results of
predIntNorm.
For the California rule (rule="CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m-1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
Simultaneous prediction intervals can be extended to using averages (means) in place
of single observations (USEPA, 2009, Chapter 19). That is, you can create a
simultaneous prediction interval
that will contain a specified number of averages (based on which rule you choose) on
each of r future sampling occasions, where each average is based on w
individual observations. For the function
predIntNormSimultaneous,
the argument n.mean
corresponds to w.
The Form of a Prediction Interval
Let x_1, x_2, …, x_n denote a vector of n
observations from a normal distribution with parameters
mean=μ and sd=σ. Also, let w denote the
sample size associated with the future averages (i.e.,
n.mean=w).
When w=1, each average is really just a single observation, so in the rest of
this help file the term “averages” will replace the phrase
“observations or averages”.
For a normal distribution, the form of a two-sided
prediction interval is:
[xbar - K*s, xbar + K*s]        (1)
where xbar denotes the sample mean:
xbar = (1/n) * sum(x_i)        (2)
s denotes the sample standard deviation:
s^2 = (1/(n-1)) * sum((x_i - xbar)^2)        (3)
and K denotes a constant that depends on the sample size n, the
confidence level, the number of future sampling occasions r, and the
sample size associated with the future averages, w. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k) in the k-of-m rule. The symbol K is used here
to be consistent with the notation used for tolerance intervals
(see tolIntNorm).
Similarly, the form of a one-sided lower prediction interval is:
[xbar - K*s, Inf)        (4)
and the form of a one-sided upper prediction interval is:
(-Inf, xbar + K*s]        (5)
Note: For simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper
(pi.type="upper"
) prediction
intervals are available.
The derivation of the constant K is explained in the help file for
predIntNormSimultaneousK.
Computing Power
The "power" of the prediction interval is defined as the probability that
at least one set of future observations violates the given rule based on a
simultaneous prediction interval for the next future sampling occasions,
where the population mean for the future observations is allowed to differ from
the population mean for the observations used to construct the prediction interval.
The quantity (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity
(sigma) denotes the population standard deviation of both
of these populations. The argument
delta.over.sigma
corresponds to the
quantity .
Power Based on the k-of-m Rule (rule="k.of.m")
For the k-of-m rule (rule="k.of.m") with w=1 (i.e.,
n.mean=1), at least k of the next m future
observations will fall in the prediction interval
with probability (1-α)100% on each of the r future sampling
occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When k=m and r=1, this kind of simultaneous prediction
interval becomes the same as a standard prediction interval for the next k
observations (see predIntNorm).
Davis and McNichols (1987) show that for a one-sided upper prediction interval
(pi.type="upper"), the probability (1-α) that at least k of the
next m future observations will be contained in the interval given in
Equation (5) above, for each of r future sampling occasions, is given by
Equation (6), an integral expression written in terms of the following quantities:
T(x; ν, δ) denotes the cdf of the
non-central Student's t-distribution with parameters df=ν and ncp=δ
evaluated at x;
Φ(x) denotes the cdf of the standard normal distribution evaluated at x;
F(x; a, b) denotes the cdf of the
beta distribution with parameters shape1=a and shape2=b; and
B(a, b) denotes the value of the
beta function with parameters a and b.
The quantity Δ (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity σ (sigma) denotes the population standard deviation of both
of these populations. Usually you assume Δ=0 unless you are interested
in computing the power of the rule to detect a change in means between the
populations, as we are here.
If we are interested in using averages instead of single observations, with
w ≥ 1 (i.e., n.mean=w), the first
term in the integral in Equation (6) that involves the cdf of the
non-central Student's t-distribution is modified to account for the fact that
each future value is the mean of w observations.
For a given confidence level (1-α)100%, the power of the rule to detect
a change in means is:
Power = 1 - p
where p is the probability defined by Equation (6) above, evaluated using the
value of K that corresponds to delta.over.sigma=0. Thus, when the argument
delta.over.sigma=0, the value of p is (1-α) and the power is
simply α (i.e., 1 minus the confidence level). As delta.over.sigma
increases above 0, the power increases.
When pi.type="lower"
, the same value of K
is used as when
pi.type="upper"
, but Equation (4) is used to construct the prediction
interval. Thus, the power increases as delta.over.sigma
decreases below 0.
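As a minimal numerical check of this last point (added for illustration, and consistent with the first example below): when delta.over.sigma = 0 the reported power reduces to the Type I error rate, 1 - conf.level.
# With the default conf.level = 0.95, the power at delta.over.sigma = 0 is 0.05
predIntNormSimultaneousTestPower(n = 4, m = 3, delta.over.sigma = 0)
#[1] 0.05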
Power Based on the California Rule (rule="CA")
For the California rule (rule="CA"), with probability (1-α)100%,
for each of the r future sampling occasions, either the first observation will
fall in the prediction interval, or else all of the next m-1 observations will
fall in the prediction interval. That is, if the first observation falls in the
prediction interval then sampling can stop. Otherwise, m-1 more observations
must be taken.
The derivation of the power is the same as for the k-of-m rule, except
that Equation (6) is replaced by an analogous expression given by Davis (1998b).
Power Based on the Modified California Rule (rule="Modified.CA")
For the Modified California rule (rule="Modified.CA"), with probability
(1-α)100%, for each of the r future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
The derivation of the power is the same as for the k-of-m rule, except
that Equation (6) is replaced by an analogous expression given by Davis (1998b).
vector of values between 0 and 1 equal to the probability that the rule will be violated.
See the help file for predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNormSimultaneousTestPower
and plotPredIntNormSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
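One rough way to trace such a power curve directly (an added sketch with settings chosen here for illustration; plotPredIntNormSimultaneousTestPowerCurve automates this) is to evaluate the power over a grid of scaled mean shifts and plot the result:
# Power of the 1-of-3 rule (n = 8, r = 1, upper 95% limit) as a function of the
# scaled shift in means
dos <- seq(0, 3, by = 0.5)
pow <- predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, delta.over.sigma = dos)
plot(dos, pow, type = "b", xlab = "delta.over.sigma", ylab = "Power")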
Steven P. Millard ([email protected])
See the help file for predIntNormSimultaneous
.
predIntNormSimultaneous
, predIntNormSimultaneousK
, plotPredIntNormSimultaneousTestPowerCurve
,
predIntNorm
, predIntNormK
, predIntNormTestPower
, Prediction Intervals,
Normal.
# For the k-of-m rule with n=4, k=1, m=3, and r=1, show how the power increases # as delta.over.sigma increases. Assume a 95% upper prediction interval. predIntNormSimultaneousTestPower(n = 4, m = 3, delta.over.sigma = 0:2) #[1] 0.0500000 0.2954156 0.7008558 #---------- # Look at how the power increases with sample size for an upper one-sided # prediction interval using the k-of-m rule with k=1, m=3, r=20, # delta.over.sigma=2, and a confidence level of 95%. predIntNormSimultaneousTestPower(n = c(4, 8), m = 3, r = 20, delta.over.sigma = 2) #[1] 0.6075972 0.9240924 #---------- # Compare the power for the 1-of-3 rule with the power for the California and # Modified California rules, based on a 95% upper prediction interval and # delta.over.sigma=2. Assume a sample size of n=8. Note that in this case the # power for the Modified California rule is greater than the power for the # 1-of-3 rule and California rule. predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, delta.over.sigma = 2) #[1] 0.788171 predIntNormSimultaneousTestPower(n = 8, m = 3, rule = "CA", delta.over.sigma = 2) #[1] 0.7160434 predIntNormSimultaneousTestPower(n = 8, rule = "Modified.CA", delta.over.sigma = 2) #[1] 0.8143687 #---------- # Show how the power for an upper 95% simultaneous prediction limit increases # as the number of future sampling occasions r increases. Here, we'll use the # 1-of-3 rule with n=8 and delta.over.sigma=1. predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, r=c(1, 2, 5, 10), delta.over.sigma = 1) #[1] 0.3492512 0.4032111 0.4503603 0.4633773 #========== # USEPA (2009) contains an example on page 19-23 that involves monitoring # nw=100 compliance wells at a large facility with minimal natural spatial # variation every 6 months for nc=20 separate chemicals. # There are n=25 background measurements for each chemical to use to create # simultaneous prediction intervals. We would like to determine which kind of # resampling plan based on normal distribution simultaneous prediction intervals to # use (1-of-m, 1-of-m based on means, or Modified California) in order to have # adequate power of detecting an increase in chemical concentration at any of the # 100 wells while at the same time maintaining a site-wide false positive rate # (SWFPR) of 10% per year over all 4,000 comparisons # (100 wells x 20 chemicals x semi-annual sampling). # The function predIntNormSimultaneousTestPower includes the argument "r" # that is the number of future sampling occasions (r=2 in this case because # we are performing semi-annual sampling), so to compute the individual test # Type I error level alpha.test (and thus the individual test confidence level), # we only need to worry about the number of wells (100) and the number of # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)). The individual # confidence level is simply 1-alpha.test. Plugging in 0.1 for alpha, # 100 for nw, and 20 for nc yields an individual test confidence level of # 1-alpha.test = 0.9999473. nc <- 20 nw <- 100 conf.level <- (1 - 0.1)^(1 / (nc * nw)) conf.level #[1] 0.9999473 # Now we can compute the power of any particular sampling strategy using # predIntNormSimultaneousTestPower. 
For example, here is the power of # detecting an increase of three standard deviations in concentration using # the prediction interval based on the "1-of-2" resampling rule: predIntNormSimultaneousTestPower(n = 25, k = 1, m = 2, r = 2, rule = "k.of.m", delta.over.sigma = 3, pi.type = "upper", conf.level = conf.level) #[1] 0.3900202 # The following commands will reproduce the table shown in Step 2 on page # 19-23 of USEPA (2009). Because these commands can take more than a few # seconds to execute, we have commented them out here. To run this example, # just remove the pound signs (#) that are in front of R commands. #rule.vec <- c(rep("k.of.m", 3), "Modified.CA", rep("k.of.m", 3)) #m.vec <- c(2, 3, 4, 4, 1, 2, 1) #n.mean.vec <- c(rep(1, 4), 2, 2, 3) #n.scenarios <- length(rule.vec) #K.vec <- numeric(n.scenarios) #Power.vec <- numeric(n.scenarios) #K.vec <- predIntNormSimultaneousK(n = 25, k = 1, m = m.vec, n.mean = n.mean.vec, # r = 2, rule = rule.vec, pi.type = "upper", conf.level = conf.level) #Power.vec <- predIntNormSimultaneousTestPower(n = 25, k = 1, m = m.vec, # n.mean = n.mean.vec, r = 2, rule = rule.vec, delta.over.sigma = 3, # pi.type = "upper", conf.level = conf.level) #Power.df <- data.frame(Rule = rule.vec, k = rep(1, n.scenarios), m = m.vec, # N.Mean = n.mean.vec, K = round(K.vec, 2), Power = round(Power.vec, 2), # Total.Samples = m.vec * n.mean.vec) #Power.df # Rule k m N.Mean K Power Total.Samples #1 k.of.m 1 2 1 3.16 0.39 2 #2 k.of.m 1 3 1 2.33 0.65 3 #3 k.of.m 1 4 1 1.83 0.81 4 #4 Modified.CA 1 4 1 2.57 0.71 4 #5 k.of.m 1 1 2 3.62 0.41 2 #6 k.of.m 1 2 2 2.33 0.85 4 #7 k.of.m 1 1 3 2.99 0.71 3 # The above table shows the K-multipliers for each prediction interval, along with # the power of detecting a change in concentration of three standard deviations at # any of the 100 wells during the course of a year, for each of the sampling # strategies considered. The last three rows of the table correspond to sampling # strategies that involve using the mean of two or three observations. #========== # Clean up #--------- rm(nc, nw, conf.level, rule.vec, m.vec, n.mean.vec, n.scenarios, K.vec, Power.vec, Power.df)
Compute the probability that at least one out of k future observations
(or means) falls outside a prediction interval for k future observations
(or means) for a normal distribution.
predIntNormTestPower(n, df = n - 1, n.mean = 1, k = 1, delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95)
predIntNormTestPower(n, df = n - 1, n.mean = 1, k = 1, delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95)
n |
vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based. |
df |
vector of positive integers indicating the degrees of freedom associated with the sample size. The default value is df=n-1. |
n.mean |
positive integer specifying the sample size associated with the future averages. The default value is n.mean=1 (i.e., individual observations). |
k |
vector of positive integers specifying the number of future observations that the prediction interval should contain with confidence level conf.level. The default value is k=1. |
delta.over.sigma |
vector of numbers indicating the ratio Delta/sigma, where Delta denotes the difference between the means of the sampled and future populations and sigma denotes the common population standard deviation (see DETAILS). The default value is delta.over.sigma=0. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="upper" (the default) and pi.type="lower". |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the prediction interval. The default value is conf.level=0.95. |
What is a Prediction Interval?
A prediction interval for some population is an interval on the real line
constructed so that it will contain k future observations or averages
from that population with some specified probability (1-alpha)100%,
where 0 < alpha < 1 and k is some pre-specified positive integer.
The quantity (1-alpha) is called the confidence coefficient or
confidence level associated with the prediction interval. The function
predIntNorm
computes a standard prediction interval based on a
sample from a normal distribution. The function predIntNormTestPower
computes the probability that at least one out of the k future observations or
averages will not be contained in the prediction interval,
where the population mean for the future observations is allowed to differ from
the population mean for the observations used to construct the prediction interval.
The Form of a Prediction Interval
Let x = (x_1, x_2, ..., x_n) denote a vector of n
observations from a normal distribution with parameters
mean=mu and sd=sigma. Also, let m denote the
sample size associated with the k future averages (i.e.,
n.mean=m).
When m=1, each average is really just a single observation, so in the rest of
this help file the term "averages" will replace the phrase
"observations or averages".
For a normal distribution, the form of a two-sided (1-alpha)100% prediction
interval is:

[xbar - K*s, xbar + K*s]    (1)

where xbar denotes the sample mean:

xbar = (1/n) * sum(x_i, i = 1, ..., n)    (2)

s denotes the sample standard deviation:

s^2 = [1/(n-1)] * sum((x_i - xbar)^2, i = 1, ..., n)    (3)

and K denotes a constant that depends on the sample size n, the
confidence level, the number of future averages k, and the
sample size associated with the future averages, m. Do not confuse the
constant K (uppercase K) with the number of future averages k
(lowercase k). The symbol K is used here to be consistent with the
notation used for tolerance intervals (see
tolIntNorm
).
Similarly, the form of a one-sided lower prediction interval is:

[xbar - K*s, Inf]    (4)

and the form of a one-sided upper prediction interval is:

[-Inf, xbar + K*s]    (5)

but the value of K differs for one-sided versus two-sided prediction intervals.
The derivation of the constant K is explained in the help file for
predIntNormK
.
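To make the role of the constant K concrete, here is a minimal sketch (not part of the original help file) that computes K with predIntNormK and reconstructs the upper prediction limit by hand; it assumes the EnvStats functions predIntNormK and predIntNorm with their usual default arguments.

# Sketch: rebuild the upper prediction limit from K, the sample mean,
# and the sample standard deviation (assumed defaults of predIntNormK/predIntNorm)
set.seed(47)
x <- rnorm(8, mean = 10, sd = 2)
K <- predIntNormK(n = 8, k = 1, pi.type = "upper", conf.level = 0.95)
mean(x) + K * sd(x)                        # upper prediction limit "by hand"
predIntNorm(x, k = 1, pi.type = "upper")   # should report the same UPL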
Computing Power
The "power" of the prediction interval is defined as the probability that at
least one out of the future observations or averages
will not be contained in the prediction interval, where the population mean
for the future observations is allowed to differ from the population mean for the
observations used to construct the prediction interval. The probability
that all
future observations will be contained in a one-sided upper
prediction interval (
pi.type="upper"
) is given in Equation (6) of the help
file for
predIntNormSimultaneousK
, where and
:
where denotes the cdf of the
non-central Student's t-distribution with parameters
df=
and
ncp=
evaluated at
;
denotes the cdf of the standard normal distribution
evaluated at
; and
denotes the value of the
beta function with parameters
a=
and
b=
.
The quantity (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity
(sigma) denotes the population standard deviation of both
of these populations. Usually you assume
unless you are interested
in computing the power of the rule to detect a change in means between the
populations, as we are here.
If we are interested in using averages instead of single observations, with
(i.e.,
n.mean
), the first
term in the integral in Equation (6) that involves the cdf of the
non-central Student's t-distribution becomes:
For a given confidence level , the power of the rule to detect
a change in means is simply given by:
where is defined in Equation (6) above using the value of
that
corresponds to
. Thus, when the argument
delta.over.sigma=0
, the value of is
and the power is
simply
. As
delta.over.sigma
increases above 0, the power
increases.
When pi.type="lower"
, the same value of K
is used as when
pi.type="upper"
, but Equation (4) is used to construct the prediction
interval. Thus, the power increases as delta.over.sigma
decreases below 0.
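As a quick numerical check (ours, not from the original help file): at delta.over.sigma=0 the power equals the Type I error rate 1 - conf.level, and for a lower prediction limit the power grows as delta.over.sigma moves below 0.

# Power equals 1 - conf.level when there is no shift in means
predIntNormTestPower(n = 10, delta.over.sigma = 0, conf.level = 0.95)
#[1] 0.05
# For a lower prediction limit, power increases as the shift becomes more negative
predIntNormTestPower(n = 10, delta.over.sigma = c(0, -1, -2), pi.type = "lower")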
vector of values between 0 and 1 equal to the probability that at least one of
the k future observations or averages will fall outside the prediction interval.
See the help files for predIntNorm
and
predIntNormSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNormTestPower
and plotPredIntNormTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations. In the case of a simple shift between the two means, the test based
on a prediction interval is not as powerful as the two-sample t-test. However, the
test based on a prediction interval is more efficient at detecting a shift in the
tail.
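A rough, hedged comparison along these lines (our construction, not from the help file): take n = 8 background observations and k = 8 future observations, and compare the prediction-interval power with the power of a one-sided two-sample t-test with 8 observations per group; for a simple shift of two standard deviations the t-test power should come out higher.

# Prediction-interval test: n = 8 background, k = 8 future observations
predIntNormTestPower(n = 8, k = 8, delta.over.sigma = 2, conf.level = 0.95)
# Two-sample t-test with 8 observations per group and the same shift
power.t.test(n = 8, delta = 2, sd = 1, sig.level = 0.05,
    type = "two.sample", alternative = "one.sided")$power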
Steven P. Millard ([email protected])
See the help files for predIntNorm
and
predIntNormSimultaneous
.
predIntNorm
, predIntNormK
,
plotPredIntNormTestPowerCurve
, predIntNormSimultaneous
,
predIntNormSimultaneousK
,
predIntNormSimultaneousTestPower
, Prediction Intervals,
Normal.
# Show how the power increases as delta.over.sigma increases. # Assume a 95% upper prediction interval. predIntNormTestPower(n = 4, delta.over.sigma = 0:2) #[1] 0.0500000 0.1743014 0.3990892 #---------- # Look at how the power increases with sample size for a one-sided upper # prediction interval with k=3, delta.over.sigma=2, and a confidence level # of 95%. predIntNormTestPower(n = c(4, 8), k = 3, delta.over.sigma = 2) #[1] 0.3578250 0.5752113 #---------- # Show how the power for an upper 95% prediction limit increases as the # number of future observations k increases. Here, we'll use n=20 and # delta.over.sigma=1. predIntNormTestPower(n = 20, k = 1:3, delta.over.sigma = 1) #[1] 0.2408527 0.2751074 0.2936486
Construct a nonparametric prediction interval to contain at least k out of the
next m future observations with probability (1-alpha)100% for a
continuous distribution.
predIntNpar(x, k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "two-sided")
predIntNpar(x, k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "two-sided")
x |
a numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
k |
positive integer specifying the minimum number of future observations out of m that should be contained in the prediction interval. The default value is k=m. |
m |
positive integer specifying the number of future observations. The default value is m=1. |
lpl.rank |
positive integer indicating the rank of the order statistic to use for the lower bound of the prediction interval. If pi.type="two-sided" or pi.type="lower", the default value is lpl.rank=1 (i.e., use the smallest value in x as the lower bound). If pi.type="upper", this argument is set equal to 0 and the lower bound of the prediction interval is lb. |
n.plus.one.minus.upl.rank |
positive integer related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default for pi.type="two-sided" and pi.type="upper") means use the first largest value in x (i.e., the maximum) as the upper bound, a value of 2 means use the second largest value, and so on. If pi.type="lower", this argument is set equal to 0 and the upper bound of the prediction interval is ub. |
lb, ub |
scalars indicating lower and upper bounds on the distribution. By default, lb=-Inf and ub=Inf. If you are constructing a prediction interval for a distribution that you know has a lower bound other than -Inf (e.g., 0), set lb to this value. Similarly for ub. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="two-sided" (the default), pi.type="lower", and pi.type="upper". |
What is a Nonparametric Prediction Interval?
A nonparametric prediction interval for some population is an interval on the
real line constructed so that it will contain at least k of m future
observations from that population with some specified probability
(1-alpha)100%, where 0 < alpha < 1 and k and m are
pre-specified positive integers with k <= m.
The quantity (1-alpha) is called
the confidence coefficient or confidence level associated with the prediction
interval.
The Form of a Nonparametric Prediction Interval
Let x = (x_1, x_2, ..., x_n) denote a vector of n
independent observations from some continuous distribution, and let
x_(i) denote the i'th order statistic in x.
A two-sided nonparametric prediction interval is constructed as:

[x_(u), x_(v)]    (1)

where u and v are positive integers between 1 and n, and u < v. That is,
u denotes the rank of the lower prediction limit, and v
denotes the rank of the upper prediction limit. To make it easier to write
some equations later on, we can also write the prediction interval (1) in a slightly
different way as:

[x_(u), x_(n+1-w)]    (2)

where

w = n + 1 - v    (3)

so that w is a positive integer between 1 and n, and u < n + 1 - w.
In terms of the arguments to the function
predIntNpar
,
the argument lpl.rank
corresponds to u, and the argument
n.plus.one.minus.upl.rank
corresponds to w.
If we allow u = 0 and w = 0
and define lower and upper bounds as:

x_(0) = lb    (4)

x_(n+1) = ub    (5)

then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as:

[x_(u), ub]    (6)

and a one-sided upper nonparametric prediction interval is constructed as:

[lb, x_(n+1-w)]    (7)

Usually, u = 0 or 1, and w = 0 or 1.
Constructing Nonparametric Prediction Intervals for Future Observations
Danziger and Davis (1964) show that the probability that at least k out of
the next m observations will fall in the interval defined in Equation (2)
is given by:

(1 - alpha) = [ sum(j = k, ..., m) choose(u+w+m-j-1, m-j) * choose(n+j-u-w, j) ] / choose(n+m, m)    (8)

(Note that computing a nonparametric prediction interval for the case
k = m = 1 is equivalent to computing a nonparametric beta-expectation
tolerance interval; see
tolIntNpar
).
The Special Case of Using the Minimum and the Maximum
Setting u = w = 1 implies using the smallest and largest observed values as
the prediction limits. In this case, it can be shown that the probability that at
least k out of the next m observations will fall in the interval

[x_(1), x_(n)]    (9)

is given by:

(1 - alpha) = [ sum(j = k, ..., m) (m-j+1) * choose(n+j-2, j) ] / choose(n+m, m)    (10)

Setting k = m in Equation (10), the probability that all of the next m
observations will fall in the interval defined in Equation (9) is given by:

(1 - alpha) = [n * (n-1)] / [(n+m) * (n+m-1)]    (11)

For one-sided prediction limits, the probability that all m future
observations will fall below x_(n) (upper prediction limit;
pi.type="upper") and the probability that all m future observations
will fall above x_(1) (lower prediction limit;
pi.type="lower") are
both given by:

(1 - alpha) = n / (n+m)    (12)
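As a sanity check (ours, not part of the original help file), Equations (11) and (12) can be compared with predIntNparConfLevel, documented later in this package, using sample sizes that appear in the examples below.

# Equation (11): two-sided interval from the minimum and maximum, k = m
n <- 20; m <- 1
n * (n - 1) / ((n + m) * (n + m - 1))
#[1] 0.9047619
predIntNparConfLevel(n = n, m = m, pi.type = "two.sided")
#[1] 0.9047619
# Equation (12): one-sided upper limit at the maximum, k = m
n <- 18; m <- 4
n / (n + m)
#[1] 0.8181818
predIntNparConfLevel(n = n, m = m, pi.type = "upper")
#[1] 0.8181818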
Constructing Nonparametric Prediction Intervals for Future Medians
To construct a nonparametric prediction interval for a future median based on
b future observations, where b is odd, note that this is equivalent to
constructing a nonparametric prediction interval that must hold
at least (b+1)/2 of the next b future observations.
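For example (our illustration, using the numbers from Example 18-4 in the examples below), a one-sided upper prediction interval for the median of 3 future observations based on n = 24 background values has the same confidence level as a 2-of-3 interval for individual observations:

predIntNparConfLevel(n = 24, k = 2, m = 3, pi.type = "upper")
#[1] 0.991453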
a list of class "estimate"
containing the prediction interval and other
information. See the help file for estimate.object
for details.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Normal Population on Each of r Future Occasions. Technometrics 29, 359–370.
Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.
Davis, C.B., and R.J. McNichols. (1994b). Ground Water Monitoring Statistics Update: Part II: Nonparametric Prediction Limits. Ground Water Monitoring and Remediation 14(4), 159–175.
Davis, C.B., and R.J. McNichols. (1999). Simultaneous Nonparametric Prediction Limits (with Discussion). Technometrics 41(2), 89–112.
Gibbons, R.D. (1987a). Statistical Prediction Intervals for the Evaluation of Ground-Water Quality. Ground Water 25, 455–465.
Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.
Gibbons, R.D., and J. Baker. (1991). The Properties of Various Statistical Prediction Intervals for Ground-Water Detection Monitoring. Journal of Environmental Science and Health A26(4), 535–553.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.
Hall, I.J., R.R. Prairie, and C.K. Motlagh. (1975). Non-Parametric Prediction Intervals. Journal of Quality Technology 7(3), 109–114.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
estimate.object
, predIntNparN
,
predIntNparConfLevel
, plotPredIntNparDesign
.
# Generate 20 observations from a lognormal mixture distribution with # parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1. Use # predIntNpar to construct a two-sided prediction interval using the # minimum and maximum observed values. Note that the associated confidence # level is 90%. A larger sample size is required to obtain a larger # confidence level (see the help file for predIntNparN). # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormMixAlt(n = 20, mean1 = 1, cv1 = 0.5, mean2 = 5, cv2 = 1, p.mix = 0.1) predIntNpar(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Exact # #Prediction Interval Type: two-sided # #Confidence Level: 90.47619% # #Prediction Limit Rank(s): 1 20 # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0.3647875 # UPL = 1.8173115 #---------- # Repeat the above example, but specify m=5 future observations should be # contained in the prediction interval. Note that the confidence level is # now only 63%. predIntNpar(dat, m = 5) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Exact # #Prediction Interval Type: two-sided # #Confidence Level: 63.33333% # #Prediction Limit Rank(s): 1 20 # #Number of Future Observations: 5 # #Prediction Interval: LPL = 0.3647875 # UPL = 1.8173115 #---------- # Repeat the above example, but specify that a minimum of k=3 observations # out of a total of m=5 future observations should be contained in the # prediction interval. Note that the confidence level is now 98%. predIntNpar(dat, k = 3, m = 5) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: Exact # #Prediction Interval Type: two-sided # #Confidence Level: 98.37945% # #Prediction Limit Rank(s): 1 20 # #Minimum Number of #Future Observations #Interval Should Contain: 3 # #Total Number of #Future Observations: 5 # #Prediction Interval: LPL = 0.3647875 # UPL = 1.8173115 #========== # Example 18-3 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # 4 future observations of trichloroethylene (TCE) at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.3.TCE.df. # There are 6 monthly observations of TCE (ppb) at 3 background wells, # and 4 monthly observations of TCE at a compliance well. # Look at the data #----------------- EPA.09.Ex.18.3.TCE.df # Month Well Well.type TCE.ppb.orig TCE.ppb Censored #1 1 BW-1 Background <5 5.0 TRUE #2 2 BW-1 Background <5 5.0 TRUE #3 3 BW-1 Background 8 8.0 FALSE #... #22 4 CW-4 Compliance <5 5.0 TRUE #23 5 CW-4 Compliance 8 8.0 FALSE #24 6 CW-4 Compliance 14 14.0 FALSE longToWide(EPA.09.Ex.18.3.TCE.df, "TCE.ppb.orig", "Month", "Well", paste.row.name = TRUE) # BW-1 BW-2 BW-3 CW-4 #Month.1 <5 7 <5 #Month.2 <5 6.5 <5 #Month.3 8 <5 10.5 7.5 #Month.4 <5 6 <5 <5 #Month.5 9 12 <5 8 #Month.6 10 <5 9 14 # Construct the prediction limit based on the background well data # using the maximum value as the upper prediction limit. 
# Note that since all censored observations are censored at one # censoring level and the censoring level is less than all of the # uncensored observations, we can just supply the censoring level # to predIntNpar. #----------------------------------------------------------------- with(EPA.09.Ex.18.3.TCE.df, predIntNpar(TCE.ppb[Well.type == "Background"], m = 4, pi.type = "upper", lb = 0)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: TCE.ppb[Well.type == "Background"] # #Sample Size: 18 # #Prediction Interval Method: Exact # #Prediction Interval Type: upper # #Confidence Level: 81.81818% # #Prediction Limit Rank(s): 18 # #Number of Future Observations: 4 # #Prediction Interval: LPL = 0 # UPL = 12 # Since the value of 14 ppb for Month 6 at the compliance well exceeds # the upper prediction limit of 12, we might conclude that there is # statistically significant evidence of an increase over background # at CW-4. However, the confidence level associated with this # prediction limit is about 82%, which implies a Type I error level of # 18%. This means there is nearly a one in five chance of a false positive. # Only additional background data and/or use of a retesting strategy # (see predIntNparSimultaneous) would lower the false positive rate. #========== # Example 18-4 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # median of order 3 of xylene at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.4.xylene.df. # There are 8 monthly observations of xylene (ppb) at 3 background wells, # and 3 monthly observations of xylene at a compliance well. # Look at the data #----------------- EPA.09.Ex.18.4.xylene.df # Month Well Well.type Xylene.ppb.orig Xylene.ppb Censored #1 1 Well.1 Background <5 5.0 TRUE #2 2 Well.1 Background <5 5.0 TRUE #3 3 Well.1 Background 7.5 7.5 FALSE #... #30 6 Well.4 Compliance <5 5.0 TRUE #31 7 Well.4 Compliance 7.8 7.8 FALSE #32 8 Well.4 Compliance 10.4 10.4 FALSE longToWide(EPA.09.Ex.18.4.xylene.df, "Xylene.ppb.orig", "Month", "Well", paste.row.name = TRUE) # Well.1 Well.2 Well.3 Well.4 #Month.1 <5 9.2 <5 #Month.2 <5 <5 5.4 #Month.3 7.5 <5 6.7 #Month.4 <5 6.1 <5 #Month.5 <5 8 <5 #Month.6 <5 5.9 <5 <5 #Month.7 6.4 <5 <5 7.8 #Month.8 6 <5 <5 10.4 # Construct the prediction limit based on the background well data # using the maximum value as the upper prediction limit. # Note that since all censored observations are censored at one # censoring level and the censoring level is less than all of the # uncensored observations, we can just supply the censoring level # to predIntNpar. # # To compute a prediction interval for a median of order 3 (i.e., # a median based on 3 observations), this is equivalent to # constructing a nonparametric prediction interval that must hold # at least 2 of the next 3 future observations.
#----------------------------------------------------------------- with(EPA.09.Ex.18.4.xylene.df, predIntNpar(Xylene.ppb[Well.type == "Background"], k = 2, m = 3, pi.type = "upper", lb = 0)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Xylene.ppb[Well.type == "Background"] # #Sample Size: 24 # #Prediction Interval Method: Exact # #Prediction Interval Type: upper # #Confidence Level: 99.1453% # #Prediction Limit Rank(s): 24 # #Minimum Number of #Future Observations #Interval Should Contain: 2 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = 0.0 # UPL = 9.2 # The Month 8 observation at the Compliance well is 10.4 ppb of Xylene, # which is greater than the upper prediction limit of 9.2 ppb, so # conclude there is evidence of contamination at the # 100% - 99% = 1% Type I Error Level #========== # Cleanup #-------- rm(dat)
Compute the confidence level associated with a nonparametric prediction interval
that should contain at least k out of the next m future observations
for a continuous distribution.
predIntNparConfLevel(n, k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "two.sided")
predIntNparConfLevel(n, k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "two.sided")
n |
vector of positive integers specifying the sample sizes. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are not allowed. |
k |
vector of positive integers specifying the minimum number of future observations out of m that should be contained in the prediction interval. The default value is k=m. |
m |
vector of positive integers specifying the number of future observations. The default value is m=1. |
lpl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound of the prediction interval. If pi.type="two.sided" or pi.type="lower", the default value is lpl.rank=1 (i.e., use the smallest value in the sample). If pi.type="upper", this argument is set equal to 0. |
n.plus.one.minus.upl.rank |
vector of positive integers related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default for pi.type="two.sided" and pi.type="upper") means use the first largest value in the sample (i.e., the maximum), a value of 2 means use the second largest value, and so on. If pi.type="lower", this argument is set equal to 0. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="two.sided" (the default), pi.type="lower", and pi.type="upper". |
If the arguments n
, k
, m
, lpl.rank
, and
n.plus.one.minus.upl.rank
are not all the same length, they are replicated
to be the same length as the length of the longest argument.
The help file for predIntNpar
explains how nonparametric prediction
intervals are constructed and how the confidence level
associated with the prediction interval is computed based on specified values
for the sample size and the ranks of the order statistics used for
the bounds of the prediction interval.
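For instance, the following small illustration (ours, not from the help file) shows how moving from the largest to the second-largest value as the upper prediction limit (n.plus.one.minus.upl.rank = 2) lowers the confidence level; the first element equals the 0.8181818 that appears in the example below.

predIntNparConfLevel(n = 18, m = 4, pi.type = "upper",
    n.plus.one.minus.upl.rank = 1:2)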
vector of values between 0 and 1 indicating the confidence level associated with the specified nonparametric prediction interval.
See the help file for predIntNpar
.
Steven P. Millard ([email protected])
See the help file for predIntNpar
.
predIntNpar
, predIntNparN
,
plotPredIntNparDesign
.
# Look at how the confidence level of a nonparametric prediction interval # increases with increasing sample size: seq(5, 25, by = 5) #[1] 5 10 15 20 25 round(predIntNparConfLevel(n = seq(5, 25, by = 5)), 2) #[1] 0.67 0.82 0.87 0.90 0.92 #--------- # Look at how the confidence level of a nonparametric prediction interval # decreases as the number of future observations increases: round(predIntNparConfLevel(n = 10, m = 1:5), 2) #[1] 0.82 0.68 0.58 0.49 0.43 #---------- # Look at how the confidence level of a nonparametric prediction interval # decreases with minimum number of observations that must be contained within # the interval (k): round(predIntNparConfLevel(n = 10, k = 1:5, m = 5), 2) #[1] 1.00 0.98 0.92 0.76 0.43 #---------- # Look at how the confidence level of a nonparametric prediction interval # decreases with the rank of the lower prediction limit: round(predIntNparConfLevel(n = 10, lpl.rank = 1:5), 2) #[1] 0.82 0.73 0.64 0.55 0.45 #========== # Example 18-3 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # 4 future observations of trichloroethylene (TCE) at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.3.TCE.df. # There are 6 monthly observations of TCE (ppb) at 3 background wells, # and 4 monthly observations of TCE at a compliance well. # Look at the data #----------------- EPA.09.Ex.18.3.TCE.df # Month Well Well.type TCE.ppb.orig TCE.ppb Censored #1 1 BW-1 Background <5 5.0 TRUE #2 2 BW-1 Background <5 5.0 TRUE #3 3 BW-1 Background 8 8.0 FALSE #... #22 4 CW-4 Compliance <5 5.0 TRUE #23 5 CW-4 Compliance 8 8.0 FALSE #24 6 CW-4 Compliance 14 14.0 FALSE longToWide(EPA.09.Ex.18.3.TCE.df, "TCE.ppb.orig", "Month", "Well", paste.row.name = TRUE) # BW-1 BW-2 BW-3 CW-4 #Month.1 <5 7 <5 #Month.2 <5 6.5 <5 #Month.3 8 <5 10.5 7.5 #Month.4 <5 6 <5 <5 #Month.5 9 12 <5 8 #Month.6 10 <5 9 14 # If we construct the prediction limit based on the background well # data using the maximum value as the upper prediction limit, # the associated confidence level is only 82%. #----------------------------------------------------------------- predIntNparConfLevel(n = 18, m = 4, pi.type = "upper") #[1] 0.8181818 # We would have to collect an additional 18 observations to achieve a # confidence level of at least 90%: predIntNparN(m = 4, pi.type = "upper", conf.level = 0.9) #[1] 36 predIntNparConfLevel(n = 36, m = 4, pi.type = "upper") #[1] 0.9
Compute the sample size necessary for a nonparametric prediction interval to
contain at least k out of the next m future observations with
probability (1-alpha)100% for a continuous distribution.
predIntNparN(k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "two.sided", conf.level = 0.95, n.max = 5000, maxiter = 1000)
predIntNparN(k = m, m = 1, lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "two.sided", conf.level = 0.95, n.max = 5000, maxiter = 1000)
k |
vector of positive integers specifying the minimum number of future observations out of m that should be contained in the prediction interval. The default value is k=m. |
m |
vector of positive integers specifying the number of future observations. The default value is m=1. |
lpl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound of the prediction interval. If pi.type="two.sided" or pi.type="lower", the default value is lpl.rank=1 (i.e., use the smallest value in the sample). If pi.type="upper", this argument is set equal to 0. |
n.plus.one.minus.upl.rank |
vector of positive integers related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default for pi.type="two.sided" and pi.type="upper") means use the first largest value in the sample (i.e., the maximum), a value of 2 means use the second largest value, and so on. If pi.type="lower", this argument is set equal to 0. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="two.sided" (the default), pi.type="lower", and pi.type="upper". |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level associated with the prediction interval. The default value is conf.level=0.95. |
n.max |
positive integer greater than 1 indicating the maximum possible sample size. The default value is n.max=5000. |
maxiter |
positive integer indicating the maximum number of iterations to use in the uniroot search algorithm. The default value is maxiter=1000. |
If the arguments k
, m
, lpl.rank
, and
n.plus.one.minus.upl.rank
are not all the same length, they are replicated
to be the same length as the length of the longest argument.
The function predIntNparN
initially computes the required sample size n
by solving Equation (11) or (12) in the help file for
predIntNpar
for n, depending on the value of the argument
pi.type
. If k < m
,
lpl.rank > 1
(two-sided and lower prediction intervals only), or n.plus.one.minus.upl.rank > 1
(two-sided and upper prediction intervals only),
then this initial value of n is used as the upper bound in a binary search
based on Equation (8) in the help file for
predIntNpar
, which is implemented via the R function uniroot
with the tolerance argument set to 1
.
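For the simplest one-sided case (k = m, prediction limit at the maximum or minimum), Equation (12) can be solved for n directly, giving n >= m * conf.level / (1 - conf.level). This minimal check (ours, not from the help file) reproduces the sample size used in the example below.

m <- 4; conf.level <- 0.9
ceiling(m * conf.level / (1 - conf.level))
#[1] 36
predIntNparN(m = m, pi.type = "upper", conf.level = conf.level)
#[1] 36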
vector of positive integers indicating the required sample size(s) for the specified nonparametric prediction interval(s).
See the help file for predIntNpar
.
Steven P. Millard ([email protected])
See the help file for predIntNpar
.
predIntNpar
, predIntNparConfLevel
,
plotPredIntNparDesign
.
# Look at how the required sample size for a nonparametric prediction interval # increases with increasing confidence level: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 predIntNparN(conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 3 4 6 9 19 #---------- # Look at how the required sample size for a nonparametric prediction interval # increases with number of future observations (m): 1:5 #[1] 1 2 3 4 5 predIntNparN(m = 1:5) #[1] 39 78 116 155 193 #---------- # Look at how the required sample size for a nonparametric prediction interval # increases with minimum number of observations that must be contained within # the interval (k): predIntNparN(k = 1:5, m = 5) #[1] 4 7 13 30 193 #---------- # Look at how the required sample size for a nonparametric prediction interval # increases with the rank of the lower prediction limit: predIntNparN(lpl.rank = 1:5) #[1] 39 59 79 100 119 #========== # Example 18-3 of USEPA (2009, p.18-19) shows how to construct # a one-sided upper nonparametric prediction interval for the next # 4 future observations of trichloroethylene (TCE) at a downgradient well. # The data for this example are stored in EPA.09.Ex.18.3.TCE.df. # There are 6 monthly observations of TCE (ppb) at 3 background wells, # and 4 monthly observations of TCE at a compliance well. # Look at the data #----------------- EPA.09.Ex.18.3.TCE.df # Month Well Well.type TCE.ppb.orig TCE.ppb Censored #1 1 BW-1 Background <5 5.0 TRUE #2 2 BW-1 Background <5 5.0 TRUE #3 3 BW-1 Background 8 8.0 FALSE #... #22 4 CW-4 Compliance <5 5.0 TRUE #23 5 CW-4 Compliance 8 8.0 FALSE #24 6 CW-4 Compliance 14 14.0 FALSE longToWide(EPA.09.Ex.18.3.TCE.df, "TCE.ppb.orig", "Month", "Well", paste.row.name = TRUE) # BW-1 BW-2 BW-3 CW-4 #Month.1 <5 7 <5 #Month.2 <5 6.5 <5 #Month.3 8 <5 10.5 7.5 #Month.4 <5 6 <5 <5 #Month.5 9 12 <5 8 #Month.6 10 <5 9 14 # If we construct the prediction limit based on the background well # data using the maximum value as the upper prediction limit, # the associated confidence level is only 82%. #----------------------------------------------------------------- predIntNparConfLevel(n = 18, m = 4, pi.type = "upper") #[1] 0.8181818 # We would have to collect an additional 18 observations to achieve a # confidence level of at least 90%: predIntNparN(m = 4, pi.type = "upper", conf.level = 0.9) #[1] 36 predIntNparConfLevel(n = 36, m = 4, pi.type = "upper") #[1] 0.9
Construct a nonparametric simultaneous prediction interval for the next
r sampling "occasions" based on one of three
possible rules: k-of-m, California, or Modified California. The simultaneous
prediction interval assumes the observations come from a continuous distribution.
predIntNparSimultaneous(x, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "upper", integrate.args.list = NULL)
predIntNparSimultaneous(x, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), lb = -Inf, ub = Inf, pi.type = "upper", integrate.args.list = NULL)
x |
a numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
n.median |
positive odd integer specifying the sample size associated with the future medians. The default value is n.median=1 (i.e., individual observations). |
k |
for the k-of-m rule (rule="k.of.m"), positive integer specifying the minimum number of observations (or medians) out of m observations (or medians), all obtained on one future sampling "occasion", that the prediction interval should contain. The default value is k=1. This argument is ignored when rule is not equal to "k.of.m". |
m |
positive integer specifying the maximum number of future observations (or medians) on one future sampling "occasion". The default value is m=2. |
r |
positive integer specifying the number of future sampling "occasions". The default value is r=1. |
rule |
character string specifying which rule to use. The possible values are "k.of.m" (k-of-m rule; the default), "CA" (California rule), and "Modified.CA" (modified California rule). |
lpl.rank |
positive integer indicating the rank of the order statistic to use for the lower bound of the prediction interval. When pi.type="lower", the default value is lpl.rank=1 (i.e., use the smallest value in x as the lower bound). When pi.type="upper" (the default), this argument is set equal to 0 and the lower bound of the prediction interval is lb. |
n.plus.one.minus.upl.rank |
positive integer related to the rank of the order statistic to use for the upper bound of the prediction interval. A value of n.plus.one.minus.upl.rank=1 (the default when pi.type="upper") means use the first largest value in x (i.e., the maximum) as the upper bound, a value of 2 means use the second largest value, and so on. When pi.type="lower", this argument is set equal to 0 and the upper bound of the prediction interval is ub. |
lb, ub |
scalars indicating lower and upper bounds on the distribution. By default, lb=-Inf and ub=Inf. If you are constructing a prediction interval for a distribution that you know has a lower bound other than -Inf (e.g., 0), set lb to this value. Similarly for ub. |
pi.type |
character string indicating what kind of prediction interval to compute. The possible values are pi.type="upper" (the default) and pi.type="lower". |
integrate.args.list |
a list of arguments to supply to the integrate function. The default value is integrate.args.list=NULL. |
What is a Nonparametric Simultaneous Prediction Interval?
A nonparametric prediction interval for some population is an interval on the real line
constructed so that it will contain at least k of m
future observations from
that population with some specified probability
(1-alpha)100%, where 0 < alpha < 1
and k and m
are some pre-specified positive integers
with k <= m. The quantity (1-alpha)
is called
the confidence coefficient or confidence level associated with the prediction
interval. The function
predIntNpar
computes a standard
nonparametric prediction interval.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval that will contain a certain number of future observations
with probability (1-alpha)100% for each of r
future sampling
"occasions",
where r
is some pre-specified positive integer. The quantity r
may
refer to
r distinct future sampling occasions in time, or it may for example
refer to sampling at
r distinct locations on one future sampling occasion,
assuming that the population standard deviation is the same at all of the
r distinct locations.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval based on one of three possible rules:
For the k-of-m
rule (
rule="k.of.m"
), at least k of
the next m
future observations will fall in the prediction
interval with probability
(1-alpha)100% on each of the r
future
sampling occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m
observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: For this rule, when r = 1
, the results of
predIntNparSimultaneous
are equivalent to the results of predIntNpar
.
For the California rule (rule="CA"
), with probability
(1-alpha)100%, for each of the r
future sampling occasions, either
the first observation will fall in the prediction interval, or else all of the next
m - 1 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise,
m - 1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"
), with probability
(1-alpha)100%, for each of the r
future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
Nonparametric simultaneous prediction intervals can be extended to using medians
in place of single observations (USEPA, 2009, Chapter 19). That is, you can
create a nonparametric simultaneous prediction interval that will contain a
specified number of medians (based on which rule you choose) on each of
r future sampling occasions, where each median is based on
b individual
observations. For the function
predIntNparSimultaneous
, the argument
n.median
corresponds to b.
The Form of a Nonparametric Prediction Interval
Let x = (x_1, x_2, ..., x_n) denote a vector of n
independent observations from some continuous distribution, and let
x_(i) denote the i'th order statistic in x.
A two-sided nonparametric prediction interval is constructed as:

[x_(u), x_(v)]    (1)

where u and v are positive integers between 1 and n, and u < v. That is,
u denotes the rank of the lower prediction limit, and v
denotes the rank of the upper prediction limit. To make it easier to write
some equations later on, we can also write the prediction interval (1) in a slightly
different way as:

[x_(u), x_(n+1-w)]    (2)

where

w = n + 1 - v    (3)

so that w is a positive integer between 1 and n, and u < n + 1 - w.
In terms of the arguments to the function
predIntNparSimultaneous
,
the argument lpl.rank
corresponds to u, and the argument
n.plus.one.minus.upl.rank
corresponds to w.
If we allow u = 0 and w = 0
and define lower and upper bounds as:

x_(0) = lb    (4)

x_(n+1) = ub    (5)

then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as:

[x_(u), ub]    (6)

and a one-sided upper nonparametric prediction interval is constructed as:

[lb, x_(n+1-w)]    (7)

Usually, u = 0 or 1, and w = 0 or 1.
Note: For nonparametric simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper (pi.type="upper"
) prediction
intervals are available.
Constructing Nonparametric Simultaneous Prediction Intervals for Future Observations
First we will show how to construct a nonparametric simultaneous prediction interval based on
future observations (i.e., b = 1,
n.median=1
), and then extend the formulas to
future medians.
Simultaneous Prediction Intervals for the k-of-m Rule (rule="k.of.m")
For the k-of-m
rule (
rule="k.of.m"
) with b = 1
(i.e.,
n.median=1
), at least k of the next m
future
observations will fall in the prediction interval
with probability
(1-alpha)100% on each of the r
future sampling
occasions. If observations are being taken sequentially, for a particular
sampling occasion, up to m
observations may be taken, but once k
of the observations fall within the prediction interval, sampling can stop.
Note: When
r = 1, this kind of simultaneous prediction
interval becomes the same as a standard nonparametric prediction interval
(see
predIntNpar
).
Chou and Owen (1986) developed the theory for nonparametric simultaneous prediction limits
for various rules, including the 1-of-m rule. Their theory, however, does not cover
the California or Modified California rules, and involves an r-fold summation with a
large number of terms. Davis and McNichols (1994b; 1999) extended the results of
Chou and Owen (1986) to include the California and Modified California rule, and developed
algorithms that involve summing far fewer terms.
Davis and McNichols (1999) give formulas for the probabilities associated with the
one-sided upper simultaneous prediction interval shown in Equation (7). For the
k-of-m
rule, the probability that at least k
of the next m
future observations will be contained in the interval given in Equation (7) for each
of r
future sampling occasions is given by:

1 - alpha = Integral from 0 to 1 of { [ sum(i = 0, ..., m-k) choose(k+i-1, i) * v^k * (1-v)^i ]^r * f(v) } dv    (8)

where B denotes a random variable with a beta distribution
with parameters
n+1-w and w, and f()
denotes the pdf of this
distribution. Note that w
denotes the rank of the order statistic used as the
upper prediction limit (i.e.,
n.plus.one.minus.upl.rank=w), and
that w
is usually equal to 1.
Also note that the summation term in Equation (8) corresponds to the cumulative
distribution function of a Negative Binomial distribution
with parameters size=k
and
prob=v
evaluated at
q=m-k
.
When pi.type="lower"
, B denotes a random variable with a
beta distribution with parameters
n+1-u and u. Note that u
denotes the rank of the order statistic used as the
lower prediction limit (i.e.,
lpl.rank=u), and
that u
is usually equal to 1.
Simultaneous Prediction Intervals for the California Rule (rule="CA")
For the California rule (rule="CA"
), with probability (1-alpha)100%,
for each of the r
future sampling occasions, either the first observation will
fall in the prediction interval, or else all of the next m - 1
observations will
fall in the prediction interval. That is, if the first observation falls in the
prediction interval then sampling can stop. Otherwise,
m - 1 more observations
must be taken.
In this case, the probability is given by:

1 - alpha = Integral from 0 to 1 of { [ v + (1-v) * v^(m-1) ]^r * f(v) } dv    (9)

where f() is the same beta pdf as in Equation (8).
Simultaneous Prediction Intervals for the Modified California Rule (rule="Modified.CA")
For the Modified California rule (rule="Modified.CA"
), with probability
(1-alpha)100%, for each of the r
future sampling occasions, either the
first observation will fall in the prediction interval, or else at least 2 out of
the next 3 observations will fall in the prediction interval. That is, if the first
observation falls in the prediction interval then sampling can stop. Otherwise, up
to 3 more observations must be taken.
In this case, the probability is given by:

1 - alpha = Integral from 0 to 1 of { [ v + (1-v) * (3*v^2 - 2*v^3) ]^r * f(v) } dv    (10)

where, as in Equation (8), f() denotes the pdf of the beta random variable B.
Davis and McNichols (1999) provide algorithms for computing the probabilities based on expanding
polynomials and the formula for the expected value of a beta random variable. In the discussion
section of Davis and McNichols (1999), however, Vangel points out that numerical integration is
adequate, and this is how these probabilities are computed in the function
predIntNparSimultaneous
.
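The following sketch (ours, not part of the original help file) carries out that numerical integration for the 1-of-3 rule with r = 1, an upper limit at the maximum (w = 1), and n = 20 background observations, so the beta density is that of a Beta(n, 1) random variable; the result should match the 0.9994353 confidence level reported in the first example below.

n <- 20; k <- 1; m <- 3; r <- 1
integrand <- function(v) {
    i <- 0:(m - k)
    # per-occasion probability that at least k of m future values fall below the limit
    p.occasion <- sapply(v, function(vv)
        sum(choose(k + i - 1, i) * vv^k * (1 - vv)^i))
    p.occasion^r * dbeta(v, n, 1)
}
integrate(integrand, lower = 0, upper = 1)$value
#[1] 0.9994353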
Constructing Nonparametric Simultaneous Prediction Intervals for Future Medians
USEPA (2009, Chapter 19; Cameron, 2011) extends nonparametric simultaneous
prediction intervals to testing future medians for the case of the 1-of-1 and
1-of-2 plans for medians of order 3. In general, each of the rules
(k-of-m, California, and Modified California) can be easily
extended to the case of using medians as long as the medians are based on an
odd (as opposed to even) sample size.
For each of the above rules, if we are interested in using medians instead of
single observations (i.e., b > 1;
n.median
= b), and we
force b
to be odd, then a median will be less than a prediction limit
once (b+1)/2
observations are less than the prediction limit. Thus,
Equations (8) - (10) are modified by replacing the per-observation probability v
in the term raised to the power r with:

sum(j = (b+1)/2, ..., b) choose(b, j) * v^j * (1-v)^(b-j)

which is the probability that a median of b future observations falls below the
prediction limit.
a list of class "estimate"
containing the simultaneous prediction interval
and other information. See the help file for estimate.object
for
details.
Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973; Krishnamoorthy and Mathew, 2009). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).
Steven P. Millard ([email protected])
Cameron, Kirk. (2011). Personal communication, February 16, 2011. MacStat Consulting, Ltd., Colorado Springs, Colorado.
Chew, V. (1968). Simultaneous Prediction Intervals. Technometrics 10(2), 323–331.
Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B., and R.J. McNichols. (1987). One-sided Intervals for at Least p of m Observations from a Normal Population on Each of r Future Occasions. Technometrics 29, 359–370.
Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.
Davis, C.B., and R.J. McNichols. (1994b). Ground Water Monitoring Statistics Update: Part II: Nonparametric Prediction Limits. Ground Water Monitoring and Remediation 14(4), 159–175.
Davis, C.B., and R.J. McNichols. (1999). Simultaneous Nonparametric Prediction Limits (with Discussion). Technometrics 41(2), 89–112.
Gibbons, R.D. (1987a). Statistical Prediction Intervals for the Evaluation of Ground-Water Quality. Ground Water 25, 455–465.
Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.
Gibbons, R.D., and J. Baker. (1991). The Properties of Various Statistical Prediction Intervals for Ground-Water Detection Monitoring. Journal of Environmental Science and Health A26(4), 535–553.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.
Hall, I.J., R.R. Prairie, and C.K. Motlagh. (1975). Non-Parametric Prediction Intervals. Journal of Quality Technology 7(3), 109–114.
Millard, S.P., and Neerchal, N.K. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
predIntNparSimultaneousConfLevel
,
predIntNparSimultaneousN
, plotPredIntNparSimultaneousDesign
,
predIntNparSimultaneousTestPower
,
predIntNpar
, tolIntNpar
,
estimate.object
.
# Generate 20 observations from a lognormal mixture distribution with # parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1. Use # predIntNparSimultaneous to construct an upper one-sided prediction interval # using the maximum observed value using the 1-of-3 rule. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormMixAlt(n = 20, mean1 = 1, cv1 = 0.5, mean2 = 5, cv2 = 1, p.mix = 0.1) predIntNparSimultaneous(dat, k = 1, m = 3, lb = 0) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99.94353% # #Prediction Limit Rank(s): 20 # #Minimum Number of #Future Observations #Interval Should Contain: 1 # #Total Number of #Future Observations: 3 # #Prediction Interval: LPL = 0.000000 # UPL = 1.817311 #---------- # Compare the confidence levels for the 1-of-3 rule, California Rule, and # Modified California Rule. predIntNparSimultaneous(dat, k = 1, m = 3, lb = 0)$interval$conf.level #[1] 0.9994353 predIntNparSimultaneous(dat, m = 3, rule = "CA", lb = 0)$interval$conf.level #[1] 0.9919066 predIntNparSimultaneous(dat, rule = "Modified.CA", lb = 0)$interval$conf.level #[1] 0.9984943 #========= # Repeat the above example, but create the baseline data using just # n=8 observations and set r to 4 future sampling occasions set.seed(598) dat <- rlnormMixAlt(n = 8, mean1 = 1, cv1 = 0.5, mean2 = 5, cv2 = 1, p.mix = 0.1) predIntNparSimultaneous(dat, k = 1, m = 3, r = 4, lb = 0) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 8 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 97.7599% # #Prediction Limit Rank(s): 8 # #Minimum Number of #Future Observations #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Observations #(per Sampling Occasion): 3 # #Number of Future #Sampling Occasions: 4 # #Prediction Interval: LPL = 0.000000 # UPL = 5.683453 #---------- # Compare the confidence levels for the 1-of-3 rule, California Rule, and # Modified California Rule. predIntNparSimultaneous(dat, k = 1, m = 3, r = 4, lb = 0)$interval$conf.level #[1] 0.977599 predIntNparSimultaneous(dat, m = 3, r = 4, rule = "CA", lb = 0)$interval$conf.level #[1] 0.8737798 predIntNparSimultaneous(dat, r = 4, rule = "Modified.CA", lb = 0)$interval$conf.level #[1] 0.9510178 #========== # Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will compute the confidence level associated with two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest background value. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. 
# There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # n = 20 based on 4 Background Wells # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year n <- 20 nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 alpha <- 1 - conf.level alpha #[1] 0.02085164 #---------- # Look at the data: head(EPA.09.Ex.19.5.mercury.df) # Event Well Well.type Mercury.ppb.orig Mercury.ppb Censored #1 1 BG-1 Background 0.21 0.21 FALSE #2 2 BG-1 Background <.2 0.20 TRUE #3 3 BG-1 Background <.2 0.20 TRUE #4 4 BG-1 Background <.2 0.20 TRUE #5 5 BG-1 Background <.2 0.20 TRUE #6 6 BG-1 Background NA FALSE longToWide(EPA.09.Ex.19.5.mercury.df, "Mercury.ppb.orig", "Event", "Well", paste.row.name = TRUE) # BG-1 BG-2 BG-3 BG-4 CW-1 CW-2 #Event.1 0.21 <.2 <.2 <.2 0.22 0.36 #Event.2 <.2 <.2 0.23 0.25 0.2 0.41 #Event.3 <.2 <.2 <.2 0.28 <.2 0.28 #Event.4 <.2 0.21 0.23 <.2 0.25 0.45 #Event.5 <.2 <.2 0.24 <.2 0.24 0.43 #Event.6 <.2 0.54 # Construct the upper simultaneous prediction limit using the 1-of-2 # retesting plan for a median of order 3 based on the background maximum Hg.Back <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb[Well.type == "Background"]) pred.int.1.of.2.med.3 <- predIntNparSimultaneous(Hg.Back, n.median = 3, k = 1, m = 2, r = r, lb = 0) pred.int.1.of.2.med.3 #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Hg.Back # #Sample Size: 20 # #Number NA/NaN/Inf's: 4 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 99.40354% # #Prediction Limit Rank(s): 20 # #Minimum Number of #Future Medians #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Medians #(per Sampling Occasion): 2 # #Number of Future #Sampling Occasions: 10 # #Sample Size for Medians: 3 # #Prediction Interval: LPL = 0.00 # UPL = 0.28 # Note that the achieved confidence level of 99.4% is greater than the # required confidence level of 97.9%. # Now determine whether either compliance well indicates evidence of # Mercury contamination. # Compliance Well 1 #------------------ Hg.CW.1 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-1"]) Hg.CW.1 #[1] "0.22" "0.2" "<.2" "0.25" "0.24" "<.2" # The median of the first 3 observations is 0.2, which is less than # the UPL of 0.28, so there is no evidence of contamination. # Compliance Well 2 #------------------ Hg.CW.2 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-2"]) Hg.CW.2 #[1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54" # The median of the first 3 observations is 0.36, so 3 more observations have to # be looked at. 
The median of the second 3 observations is 0.45, which is # larger than the UPL of 0.28, so there is evidence of contamination. #---------- # Now create the upper simultaneous prediction limit using the 1-of-4 plan # on individual observations using the 3rd highest background value. pred.int.1.of.4.3rd <- predIntNparSimultaneous(Hg.Back, k = 1, m = 4, r = r, lb = 0, n.plus.one.minus.upl.rank = 3) pred.int.1.of.4.3rd #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Hg.Back # #Sample Size: 20 # #Number NA/NaN/Inf's: 4 # #Prediction Interval Method: exact # #Prediction Interval Type: upper # #Confidence Level: 98.64909% # #Prediction Limit Rank(s): 18 # #Minimum Number of #Future Observations #Interval Should Contain #(per Sampling Occasion): 1 # #Total Number of #Future Observations #(per Sampling Occasion): 4 # #Number of Future #Sampling Occasions: 10 # #Prediction Interval: LPL = 0.00 # UPL = 0.24 # Note that the achieved confidence level of 98.6% is greater than the # required confidence level of 97.9%. # Now determine whether either compliance well indicates evidence of # Mercury contamination. # Compliance Well 1 #------------------ Hg.CW.1 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-1"]) Hg.CW.1 #[1] "0.22" "0.2" "<.2" "0.25" "0.24" "<.2" # The first observation is less than the UPL of 0.24, which is less than # the UPL of 0.28, so there is no evidence of contamination. # Compliance Well 2 #------------------ Hg.CW.2 <- with(EPA.09.Ex.19.5.mercury.df, Mercury.ppb.orig[Well == "CW-2"]) Hg.CW.2 #[1] "0.36" "0.41" "0.28" "0.45" "0.43" "0.54" # All of the first 4 observations are greater than the UPL of 0.24, so there # is evidence of contamination. #========== # Cleanup #-------- rm(dat, n, nw, nc, ne, r, conf.level, alpha, Hg.Back, pred.int.1.of.2.med.3, pred.int.1.of.4.3rd, Hg.CW.1, Hg.CW.2)
Compute the confidence level associated with a nonparametric simultaneous prediction interval based on one of three possible rules: k-of-m, California, or Modified California. Observations are assumed to come from a continuous distribution.
predIntNparSimultaneousConfLevel(n, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "upper", integrate.args.list = NULL)
n |
vector of positive integers specifying the sample sizes.
Missing ( |
n.median |
vector of positive odd integers specifying the sample size associated with the
future medians. The default value is |
k |
for the |
m |
vector of positive integers specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
vector of positive integers indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper
bound of the prediction interval. A value of |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
integrate.args.list |
list of arguments to supply to the |
If the arguments n
, k
, m
, r
, lpl.rank
, and
n.plus.one.minus.upl.rank
are not all the same length, they are replicated
to be the same length as the length of the longest argument.
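For example, supplying a vector of sample sizes returns one confidence level per sample size, with the shorter arguments recycled (compare with the examples below, which pass n = seq(5, 25, by = 5)):

# k, m, and r are recycled to match the length of n
predIntNparSimultaneousConfLevel(n = c(10, 20), k = 1, m = 3, r = 20)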
The function predIntNparSimultaneousConfLevel
computes the confidence level
based on Equation (8), (9), or (10) in the help file for
predIntNparSimultaneous
, depending on the value of the argument
rule
.
Note that when rule="k.of.m"
and r=1
, this is equivalent to a
standard nonparametric prediction interval and you can use the function
predIntNparConfLevel
instead.
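As a quick check of this equivalence (a hedged illustration; see the help file for predIntNparConfLevel for its argument defaults), the following two calls should return the same confidence level, namely the value 0.9994353 shown in the predIntNparSimultaneous examples:

predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 3, r = 1)
predIntNparConfLevel(n = 20, k = 1, m = 3, pi.type = "upper")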
vector of values between 0 and 1 indicating the confidence level associated with the specified simultaneous nonparametric prediction interval.
See the help file for predIntNparSimultaneous
.
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
predIntNparSimultaneous
,
predIntNparSimultaneousN
,
plotPredIntNparSimultaneousDesign
,
predIntNparSimultaneousTestPower
,
predIntNpar
, tolIntNpar
.
# For the 1-of-3 rule with r=20 future sampling occasions, look at how the # confidence level of a simultaneous nonparametric prediction interval # increases with increasing sample size: seq(5, 25, by = 5) #[1] 5 10 15 20 25 conf <- predIntNparSimultaneousConfLevel(n = seq(5, 25, by = 5), k = 1, m = 3, r = 20) round(conf, 2) #[1] 0.82 0.95 0.98 0.99 0.99 #---------- # For the 1-of-m rule with r=20 future sampling occasions, look at how the # confidence level of a simultaneous nonparametric prediction interval # increases as the number of future observations increases: 1:5 #[1] 1 2 3 4 5 conf <- predIntNparSimultaneousConfLevel(n = 10, k = 1, m = 1:5, r = 20) round(conf, 2) #[1] 0.33 0.81 0.95 0.98 0.99 #---------- # For the 1-of-3 rule, look at how the confidence level of a simultaneous # nonparametric prediction interval decreases with number of future sampling # occasions (r): seq(5, 20, by = 5) #[1] 5 10 15 20 conf <- predIntNparSimultaneousConfLevel(n = 10, k = 1, m = 3, r = seq(5, 20, by = 5)) round(conf, 2) #[1] 0.98 0.97 0.96 0.95 #---------- # For the 1-of-3 rule with r=20 future sampling occasions, look at how the # confidence level of a simultaneous nonparametric prediction interval # decreases as the rank of the upper prediction limit decreases: conf <- predIntNparSimultaneousConfLevel(n = 10, k = 1, m = 3, r = 20, n.plus.one.minus.upl.rank = 1:5) round(conf, 2) #[1] 0.95 0.82 0.63 0.43 0.25 #---------- # Clean up #--------- rm(conf) #========== # Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will compute the confidence level associated with two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest background value. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. 
Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # n = 20 based on 4 Background Wells # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year n <- 20 nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 # So the required confidence level is 0.98, or 98%. # Now determine the confidence level associated with each plan. # Note that both plans achieve the required confidence level. # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousConfLevel(n = 20, n.median = 3, k = 1, m = 2, r = r) #[1] 0.9940354 # 2) the 1-of-4 plan on individual observations using the 3rd highest # background value. predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3) #[1] 0.9864909 #========== # Cleanup #-------- rm(n, nw, nc, ne, r, conf.level)
Compute the sample size necessary for a nonparametric simultaneous prediction interval to achieve a specified confidence level based on one of three possible rules: k-of-m, California, or Modified California. Observations are assumed to come from a continuous distribution.
predIntNparSimultaneousN(n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), pi.type = "upper", conf.level = 0.95, n.max = 5000, integrate.args.list = NULL, maxiter = 1000)
n.median |
vector of positive odd integers specifying the sample size associated with the
future medians. The default value is |
k |
for the |
m |
vector of positive integers specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
vector of positive integers indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level
associated with the prediction interval. The default value is |
n.max |
numeric scalar indicating the maximum sample size to consider. This argument
is used in the search algorithm to determine the required sample size. The
default value is |
integrate.args.list |
list of arguments to supply to the |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
If the arguments k
, m
, r
, lpl.rank
, and
n.plus.one.minus.upl.rank
are not all the same length, they are replicated
to be the same length as the length of the longest argument.
The function predIntNparSimultaneousN
computes the required sample size
by solving Equation (8), (9), or (10) in the help file for
predIntNparSimultaneous
for n (the sample size), depending on the value of the
argument
rule
.
Note that when rule="k.of.m"
and r=1
, this is equivalent to a
standard nonparametric prediction interval and you can use the function
predIntNparN
instead.
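The following sketch illustrates the idea of the search (the helper name find.n.by.search is hypothetical; it is not the algorithm used internally): increase the sample size until the achieved confidence level, as returned by predIntNparSimultaneousConfLevel, reaches the target:

find.n.by.search <- function(conf.level, k = 1, m = 2, r = 1, rule = "k.of.m",
                             n.max = 5000) {
  for (n in 2:n.max) {
    achieved <- predIntNparSimultaneousConfLevel(n = n, k = k, m = m, r = r,
      rule = rule)
    if (achieved >= conf.level) return(n)
  }
  NA_integer_
}

find.n.by.search(conf.level = 0.95, k = 1, m = 3, r = 20)
#[1] 11

which agrees with predIntNparSimultaneousN(k = 1, m = 3, r = 20) in the examples below.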
vector of positive integers indicating the required sample size(s) for the specified nonparametric simultaneous prediction interval(s).
See the help file for predIntNparSimultaneous
.
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
predIntNparSimultaneous
,
predIntNparSimultaneousConfLevel
, plotPredIntNparSimultaneousDesign
,
predIntNparSimultaneousTestPower
,
predIntNpar
, tolIntNpar
.
# For the 1-of-2 rule, look at how the required sample size for a one-sided # upper simultaneous nonparametric prediction interval for r=20 future # sampling occasions increases with increasing confidence level: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 predIntNparSimultaneousN(r = 20, conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 4 5 7 10 17 #---------- # For the 1-of-m rule, look at how the required sample size for a one-sided # upper simultaneous nonparametric prediction interval decreases with increasing # number of future observations (m), given r=20 future sampling occasions: predIntNparSimultaneousN(k = 1, m = 1:5, r = 20) #[1] 380 26 11 7 5 #---------- # For the 1-of-3 rule, look at how the required sample size for a one-sided # upper simultaneous nonparametric prediction interval increases with number # of future sampling occasions (r): predIntNparSimultaneousN(k = 1, m = 3, r = c(5, 10, 15, 20)) #[1] 7 8 10 11 #---------- # For the 1-of-3 rule, look at how the required sample size for a one-sided # upper simultaneous nonparametric prediction interval increases as the rank # of the upper prediction limit decreases, given r=20 future sampling occasions: predIntNparSimultaneousN(k = 1, m = 3, r = 20, n.plus.one.minus.upl.rank = 1:5) #[1] 11 19 26 34 41 #---------- # Compare the required sample size for r=20 future sampling occasions based # on the 1-of-3 rule, the CA rule with m=3, and the Modified CA rule. predIntNparSimultaneousN(k = 1, m = 3, r = 20, rule = "k.of.m") #[1] 11 predIntNparSimultaneousN(m = 3, r = 20, rule = "CA") #[1] 36 predIntNparSimultaneousN(r = 20, rule = "Modified.CA") #[1] 15 #========== # Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will modify the example to compute the required number of background # observations for two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest background value. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. 
Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 # So the required confidence level is 0.98, or 98%. # Now determine the required number of background observations for each plan. # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousN(n.median = 3, k = 1, m = 2, r = r, conf.level = conf.level) #[1] 14 # 2) the 1-of-4 plan on individual observations using the 3rd highest # background value. predIntNparSimultaneousN(k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3, conf.level = conf.level) #[1] 18 #========== # Cleanup #-------- rm(nw, nc, ne, r, conf.level)
Compute the probability that at least one set of future observations violates the given rule based on a nonparametric simultaneous prediction interval for the next r future sampling occasions. The three possible rules are: k-of-m, California, or Modified California. The probability is based on assuming the true distribution of the observations is normal.
predIntNparSimultaneousTestPower(n, n.median = 1, k = 1, m = 2, r = 1, rule = "k.of.m", lpl.rank = ifelse(pi.type == "upper", 0, 1), n.plus.one.minus.upl.rank = ifelse(pi.type == "lower", 0, 1), delta.over.sigma = 0, pi.type = "upper", r.shifted = r, method = "approx", NMC = 100, ci = FALSE, ci.conf.level = 0.95, integrate.args.list = NULL, evNormOrdStats.method = "royston")
n |
vector of positive integers specifying the sample sizes.
Missing ( |
n.median |
vector of positive odd integers specifying the sample size associated with the
future medians. The default value is |
k |
for the |
m |
vector of positive integers specifying the maximum number of future observations (or
medians) on one future sampling “occasion”.
The default value is |
r |
vector of positive integers specifying the number of future sampling
“occasions”. The default value is |
rule |
character string specifying which rule to use. The possible values are
|
lpl.rank |
vector of non-negative integers indicating the rank of the order statistic to use for
the lower bound of the prediction interval. When |
n.plus.one.minus.upl.rank |
vector of non-negative integers related to the rank of the order statistic to use for
the upper bound of the prediction interval. A value of |
delta.over.sigma |
numeric vector indicating the ratio |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
r.shifted |
vector of positive integers specifying the number of future sampling occasions for
which the scaled mean is shifted by |
method |
character string indicating what method to use to compute the power. The possible
values are |
NMC |
positive integer indicating the number of Monte Carlo trials to run when |
ci |
logical scalar indicating whether to compute a confidence interval for the power
when |
ci.conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the
confidence interval for the power. The argument is ignored if |
integrate.args.list |
list of arguments to supply to the |
evNormOrdStats.method |
character string indicating which method to use in the call to
|
What is a Nonparametric Simultaneous Prediction Interval?
A nonparametric prediction interval for some population is an interval on the real line constructed so that it will contain at least k of m future observations from that population with some specified probability (1 - alpha)100%, where 0 < alpha < 1 and k and m are some pre-specified positive integers and k <= m. The quantity (1 - alpha)100% is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNpar computes a standard nonparametric prediction interval.
The function predIntNparSimultaneous computes a nonparametric simultaneous prediction interval that will contain a certain number of future observations with probability (1 - alpha)100% for each of r future sampling “occasions”, where r is some pre-specified positive integer. The quantity r may refer to r distinct future sampling occasions in time, or it may for example refer to sampling at r distinct locations on one future sampling occasion, assuming that the population standard deviation is the same at all of the r distinct locations.
The function predIntNparSimultaneous
computes a nonparametric simultaneous
prediction interval based on one of three possible rules:
For the k-of-m rule (rule="k.of.m"), at least k of the next m future observations will fall in the prediction interval with probability (1 - alpha)100% on each of the r future sampling occasions. If observations are being taken sequentially, for a particular sampling occasion, up to m observations may be taken, but once k of the observations fall within the prediction interval, sampling can stop. Note: For this rule, when r = 1, the results of predIntNparSimultaneous are equivalent to the results of predIntNpar.
For the California rule (rule="CA"), with probability (1 - alpha)100%, for each of the r future sampling occasions, either the first observation will fall in the prediction interval, or else all of the next m-1 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, m-1 more observations must be taken.
For the Modified California rule (rule="Modified.CA"), with probability (1 - alpha)100%, for each of the r future sampling occasions, either the first observation will fall in the prediction interval, or else at least 2 out of the next 3 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, up to 3 more observations must be taken.
Nonparametric simultaneous prediction intervals can be extended to using medians in place of single observations (USEPA, 2009, Chapter 19). That is, you can create a nonparametric simultaneous prediction interval that will contain a specified number of medians (based on which rule you choose) on each of r future sampling occasions, where each median is based on b individual observations. For the function predIntNparSimultaneous, the argument n.median corresponds to b.
The Form of a Nonparametric Prediction Interval
Let x = (x_1, x_2, ..., x_n) denote a vector of n independent observations from some continuous distribution, and let x_(i) denote the i'th order statistic in x. A two-sided nonparametric prediction interval is constructed as:

[x_(u), x_(v)]          (1)

where u and v are positive integers between 1 and n, and u < v. That is, u denotes the rank of the lower prediction limit, and v denotes the rank of the upper prediction limit. To make it easier to write some equations later on, we can also write the prediction interval (1) in a slightly different way as:

[x_(u), x_(n+1-w)]          (2)

where

w = n + 1 - v          (3)

so that w is a positive integer between 1 and n, and u < n + 1 - w. In terms of the arguments to the function predIntNparSimultaneous, the argument lpl.rank corresponds to u, and the argument n.plus.one.minus.upl.rank corresponds to w.
If we allow u = 0 and w = 0 and define lower and upper bounds as:

x_(0) = lb          (4)

x_(n+1) = ub          (5)

then Equation (2) above can also represent a one-sided lower or one-sided upper prediction interval as well. That is, a one-sided lower nonparametric prediction interval is constructed as:

[x_(u), ub]          (6)

and a one-sided upper nonparametric prediction interval is constructed as:

[lb, x_(n+1-w)]          (7)

Usually, lb = 0 or lb = -Inf, and ub = Inf.
Note: For nonparametric simultaneous prediction intervals, only lower
(pi.type="lower"
) and upper (pi.type="upper"
) prediction
intervals are available.
Computing Power
The "power" of the prediction interval is defined as the probability that
at least one set of future observations violates the given rule based on a
simultaneous prediction interval for the next future sampling occasions,
where the population for the future observations is allowed to differ from
the population for the observations used to construct the prediction interval.
For the function predIntNparSimultaneousTestPower
, power is computed assuming
both the background and future the observations come from normal distributions with
the same standard deviation, but the means of the distributions are allowed to differ.
The quantity (upper case delta) denotes the difference between the
mean of the population that was sampled to construct the prediction interval, and
the mean of the population that will be sampled to produce the future observations.
The quantity
(sigma) denotes the population standard deviation of both
of these populations. The argument
delta.over.sigma
corresponds to the
quantity .
Approximate Power (method="approx"
)
Based on Gansecki (2009), the power of a nonparametric simultaneous prediction interval when the underlying observations come from a normal distribution can be approximated by the power of a normal simultaneous prediction interval (see predIntNormSimultaneousTestPower) where the multiplier K is replaced with the expected value of the normal order statistic that corresponds to the rank of the order statistic used for the upper or lower bound of the prediction interval. Gansecki (2009) uses an approximation of the form

K = Phi^-1(p_i)

where Phi denotes the cumulative distribution function of the standard normal distribution, i denotes the rank of the order statistic used as the prediction limit, and p_i denotes a plotting position based on i and the sample size. By default, the value of the argument
evNormOrdStats.method="royston"
, so the function
predIntNparSimultaneousTestPower
uses the exact value of the
expected value of the normal order statistic in the call to
evNormOrdStatsScalar
. You can change the
method of computing the expected value of the normal order statistic by
changing the value of the argument evNormOrdStats.method
.
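For example, with an upper prediction limit set at the maximum of n = 20 background observations, the multiplier used in the approximation is the expected value of the largest of 20 standard normal order statistics (the argument names r for the rank and n for the sample size are assumed here; see the help file for evNormOrdStatsScalar):

# Expected value of the largest order statistic from a sample of 20
# standard normal observations; this plays the role of the multiplier K
# in the power approximation
evNormOrdStatsScalar(r = 20, n = 20)
# approximately 1.87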
Power Based on Monte Carlo Simulation (method="simulate"
)
When method="simulate"
, the power of the nonparametric simultaneous
prediction interval is estimated based on a Monte Carlo simulation. The argument
NMC
determines the number of Monte Carlo trials. If ci=TRUE
, a
confidence interval for the power is created based on the NMC
Monte Carlo
estimates of power.
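A minimal sketch of this idea (independent of the package internals; the helper name sim.power is hypothetical), for the 1-of-4 plan on individual observations with an upper limit at the 3rd highest of n = 20 background values, r = 10 future sampling occasions, and one occasion shifted by delta.over.sigma:

sim.power <- function(n = 20, k = 1, m = 4, r = 10, upl.rank.from.top = 3,
                      delta.over.sigma = 3, r.shifted = 1, NMC = 10000) {
  viol <- replicate(NMC, {
    # Background sample and upper prediction limit (3rd highest background value)
    upl <- sort(rnorm(n), decreasing = TRUE)[upl.rank.from.top]
    shifts <- c(rep(delta.over.sigma, r.shifted), rep(0, r - r.shifted))
    # A sampling occasion violates the rule if fewer than k of its m
    # observations fall at or below the prediction limit
    fails <- vapply(shifts,
      function(d) sum(rnorm(m, mean = d) <= upl) < k, logical(1))
    any(fails)
  })
  mean(viol)
}

set.seed(123)
sim.power()
# Should be roughly comparable to the approximate power of about 0.87 reported
# in the examples below for delta.over.sigma = 3 (Monte Carlo error of a few
# percentage points is expected)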
vector of values between 0 and 1 equal to the probability that the rule will be violated.
See the help file for predIntNparSimultaneous
.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether two distributions differ from each other. The functions
predIntNparSimultaneousTestPower
and plotPredIntNparSimultaneousTestPowerCurve
can be
used to investigate these relationships for the case of normally-distributed
observations.
Steven P. Millard ([email protected])
See the help file for predIntNparSimultaneous
.
Gansecki, M. (2009). Using the Optimal Rank Values Calculator. US Environmental Protection Agency, Region 8, March 10, 2009.
plotPredIntNparSimultaneousTestPowerCurve
,
predIntNparSimultaneous
, predIntNparSimultaneousN
,
predIntNparSimultaneousConfLevel
, plotPredIntNparSimultaneousDesign
,
predIntNpar
, tolIntNpar
.
# Example 19-5 of USEPA (2009, p. 19-33) shows how to compute nonparametric upper # simultaneous prediction limits for various rules based on trace mercury data (ppb) # collected in the past year from a site with four background wells and 10 compliance # wells (data for two of the compliance wells are shown in the guidance document). # The facility must monitor the 10 compliance wells for five constituents # (including mercury) annually. # Here we will compute the confidence levels and powers associated with # two different sampling plans: # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum and # 2) the 1-of-4 plan on individual observations using the 3rd highest # background value. # Power will be computed assuming a normal distribution and setting # delta.over.sigma equal to 2, 3, and 4. # The data for this example are stored in EPA.09.Ex.19.5.mercury.df. # We will pool data from 4 background wells that were sampled on # a number of different occasions, giving us a sample size of # n = 20 to use to construct the prediction limit. # There are 10 compliance wells and we will monitor 5 different # constituents at each well annually. For this example, USEPA (2009) # recommends setting r to the product of the number of compliance wells and # the number of evaluations per year. # To determine the minimum confidence level we require for # the simultaneous prediction interval, USEPA (2009) recommends # setting the maximum allowed individual Type I Error level per constituent to: # 1 - (1 - SWFPR)^(1 / Number of Constituents) # which translates to setting the confidence limit to # (1 - SWFPR)^(1 / Number of Constituents) # where SWFPR = site-wide false positive rate. For this example, we # will set SWFPR = 0.1. Thus, the required individual Type I Error level # and confidence level per constituent are given as follows: # n = 20 based on 4 Background Wells # nw = 10 Compliance Wells # nc = 5 Constituents # ne = 1 Evaluation per year n <- 20 nw <- 10 nc <- 5 ne <- 1 # Set number of future sampling occasions r to # Number Compliance Wells x Number Evaluations per Year r <- nw * ne conf.level <- (1 - 0.1)^(1 / nc) conf.level #[1] 0.9791484 # So the required confidence level is 0.98, or 98%. # Now determine the confidence level associated with each plan. # Note that both plans achieve the required confidence level. # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousConfLevel(n = 20, n.median = 3, k = 1, m = 2, r = r) #[1] 0.9940354 # 2) the 1-of-4 plan based on individual observations using the 3rd highest # background value. predIntNparSimultaneousConfLevel(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3) #[1] 0.9864909 #------------------------------------------------------------------------------ # Compute approximate power of each plan to detect contamination at just 1 well # assuming true underying distribution of Hg is Normal at all wells and # using delta.over.sigma equal to 2, 3, and 4. #------------------------------------------------------------------------------ # Computer aproximate power for # 1) the 1-of-2 retesting plan for a median of order 3 using the # background maximum predIntNparSimultaneousTestPower(n = 20, n.median = 3, k = 1, m = 2, r = r, delta.over.sigma = 2:4, r.shifted = 1) #[1] 0.3953712 0.9129671 0.9983054 # Compute approximate power for # 2) the 1-of-4 plan based on individual observations using the 3rd highest # background value. 
predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3, delta.over.sigma = 2:4, r.shifted = 1) #[1] 0.4367972 0.8694664 0.9888779 #---------- ## Not run: # Compare estimated power using approximation method with estimated power # using Monte Carlo simulation for the 1-of-4 plan based on individual # observations using the 3rd highest background value. predIntNparSimultaneousTestPower(n = 20, k = 1, m = 4, r = r, n.plus.one.minus.upl.rank = 3, delta.over.sigma = 2:4, r.shifted = 1, method = "simulate", ci = TRUE, NMC = 1000) #[1] 0.437 0.863 0.989 #attr(,"conf.int") # [,1] [,2] [,3] #LCL 0.4111999 0.8451148 0.9835747 #UCL 0.4628001 0.8808852 0.9944253 ## End(Not run) #========== # Cleanup #-------- rm(n, nw, nc, ne, r, conf.level)
Estimate the mean of a Poisson distribution
, and
construct a prediction interval for the next observations or
next set of
sums.
predIntPois(x, k = 1, n.sum = 1, method = "conditional", pi.type = "two-sided", conf.level = 0.95, round.limits = TRUE)
x |
numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Poisson distribution
(i.e., |
k |
positive integer specifying the number of future observations or sums the
prediction interval should contain with confidence level |
n.sum |
positive integer specifying the sample size associated with the |
method |
character string specifying the method to use. The possible values are: See the DETAILS section for more information on these methods. The |
pi.type |
character string indicating what kind of prediction interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level of the prediction interval.
The default value is |
round.limits |
logical scalar indicating whether to round the computed prediction limits to the
nearest integer. The default value is |
A prediction interval for some population is an interval on the real line constructed so
that it will contain k future observations or averages from that population with
some specified probability (1-α)100%, where 0 < α < 1 and k is some pre-specified
positive integer. The quantity (1-α)100% is called the confidence coefficient or
confidence level associated with the prediction interval.
In the case of a Poisson distribution, we have modified the usual meaning of a
prediction interval and instead construct an interval that will contain k future
observations or k future sums with a certain confidence level.
A prediction interval is a random interval; that is, the lower and/or upper bounds
are random variables computed based on sample statistics in the baseline sample.
Prior to taking one specific baseline sample, the probability that the prediction
interval will contain the next k averages is (1-α)100%. Once a specific baseline
sample is taken and the prediction interval based on that sample is computed, the
probability that that prediction interval will contain the next k averages is not
necessarily (1-α)100%, but it should be close.
If an experiment is repeated N times, and for each experiment:
1. A sample is taken and a (1-α)100% prediction interval for k=1 future observation is computed, and
2. One future observation is generated and compared to the prediction interval,
then the number of prediction intervals that actually contain the future observation
generated in step 2 above is a binomial random variable with parameters size=N and
prob=(1-α) (see Binomial).
If, on the other hand, only one baseline sample is taken and only one prediction
interval for k=1 future observation is computed, then the number of future observations
out of a total of N future observations that will be contained in that one prediction
interval is a binomial random variable with parameters size=N and prob=(1-α*), where
(1-α*) depends on the true population parameters and the computed bounds of the
prediction interval.
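The binomial behavior described above is easy to illustrate with a small simulation. The sketch below is not part of the original help file; the number of experiments, the value of lambda, and the baseline sample size are arbitrary, and the EnvStats function predIntPois described in this help topic is assumed to be available.

# Simulate N experiments: in each one, draw a baseline sample of 20 Poisson
# observations, construct a one-sided upper 95% prediction interval for the
# next single observation, and check whether one new observation falls within
# the interval.  The number of "hits" is approximately binomial with
# size = N and prob = 0.95 (a bit higher in practice because of the
# discreteness of the Poisson distribution).
set.seed(47)
N <- 500
hits <- sapply(1:N, function(i) {
  x <- rpois(20, lambda = 3)
  upl <- predIntPois(x, k = 1, pi.type = "upper")$interval$limits["UPL"]
  rpois(1, lambda = 3) <= upl
})
mean(hits)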
Because of the discrete nature of the Poisson distribution, even if the true mean of
the distribution, λ, were known exactly, the actual confidence level associated with a
prediction limit will usually not be exactly equal to (1-α)100%. For example, for the
Poisson distribution with parameter lambda=2, the interval [0, 4] contains 94.7% of
this distribution and the interval [0, 5] contains 98.3% of this distribution. Thus,
no interval can contain exactly 95% of this distribution, so it is impossible to
construct an exact 95% prediction interval for the next k=1 observation for a Poisson
distribution with parameter lambda=2.
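The coverage figures quoted above can be verified directly in R (this check is not part of the original help file):

ppois(4, lambda = 2)   # approximately 0.947, i.e., [0, 4] covers 94.7%
ppois(5, lambda = 2)   # approximately 0.983, i.e., [0, 5] covers 98.3%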
The Form of a Poisson Prediction Interval
Let x = c(x_1, x_2, ..., x_n) denote a vector of n observations from a Poisson
distribution with parameter lambda=λ. Also, let X denote the sum of these n random
variables, i.e.,
X = x_1 + x_2 + ... + x_n
Finally, let m denote the sample size associated with the k future sums (i.e.,
n.sum=m). When m=1, each sum is really just a single observation, so in the rest of
this help file the term “sums” replaces the phrase “observations or sums”.
Let y = c(y_1, y_2, ..., y_m) denote a vector of m future observations from a Poisson
distribution with parameter lambda=λ*, and set Y equal to the sum of these m random
variables, i.e.,
Y = y_1 + y_2 + ... + y_m
Then Y has a Poisson distribution with parameter lambda=mλ* (Johnson et al., 1992,
p.160). We are interested in constructing a prediction limit for the next value of Y,
or else the next k sums of m Poisson random variables, based on the observed value of
X and assuming λ* = λ.
For a Poisson distribution, the form of a two-sided prediction interval is:
[m(X/n) - K, m(X/n) + K]
where K is a constant that depends on the sample size n, the confidence level
(1-α)100%, the number of future sums k, and the sample size associated with the future
sums m. Do not confuse the constant K (uppercase K) with the number of future sums k
(lowercase k). The symbol K is used here to be consistent with the notation used for
prediction intervals for the normal distribution (see predIntNorm).
Similarly, the form of a one-sided lower prediction interval is:
[m(X/n) - K, ∞]
and the form of a one-sided upper prediction interval is:
[0, m(X/n) + K]
The derivation of the constant K is explained below.
Conditional Distribution (method="conditional")
Nelson (1970) derives a prediction interval for the case k=1 based on the conditional
distribution of Y given X+Y. He notes that the conditional distribution of Y given the
quantity X+Y=w is binomial with parameters size=w and prob=[mλ*/(mλ* + nλ)]
(Johnson et al., 1992, p.161). When k=1, the prediction limits are computed as those
most extreme values of Y that still yield a non-significant test of the hypothesis
H0: λ* = λ, which for the conditional distribution of Y is equivalent to the hypothesis
H0: prob=[m /(m + n)].
Using the relationship between the binomial and F-distribution (see the explanation of
exact confidence intervals in the help file for ebinom), Nelson (1982, p. 203) states
that exact two-sided (1-α)100% prediction limits [LPL, UPL] are the closest integer
solutions to a pair of equations, referred to below as Equations (8) and (9), that
involve quantiles of the F-distribution, where F(df1, df2, p) denotes the p'th quantile
of the F-distribution with df1 and df2 degrees of freedom.
If pi.type="lower", α/2 is replaced with α in Equation (8) for LPL, and UPL is set to ∞.
If pi.type="upper", α/2 is replaced with α in Equation (9) for UPL, and LPL is set to 0.
NOTE: This method is not extended to the case k > 1.
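The conditional binomial relationship that underlies this method can be checked empirically. The following simulation sketch is not part of the original help file; the values of n, m, lambda, and w are arbitrary, and under the null hypothesis the probability of “success” reduces to m/(m + n).

# Compare the simulated distribution of Y given X + Y = w with the binomial
# distribution with size = w and prob = m/(m + n).
set.seed(23)
n <- 20; m <- 1; lambda <- 2
X <- rpois(100000, n * lambda)
Y <- rpois(100000, m * lambda)
w <- 45
Y.cond <- Y[X + Y == w]
rbind(simulated   = sapply(0:3, function(y) mean(Y.cond == y)),
      theoretical = dbinom(0:3, size = w, prob = m / (m + n)))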
Conditional Distribution Approximation Based on Normal Distribution
(method="conditional.approx.normal")
Cox and Hinkley (1974, p.245) derive an approximate prediction interval for the case
k=1. Like Nelson (1970), they note that the conditional distribution of Y given the
quantity X+Y=w is binomial with parameters size=w and prob=[mλ*/(mλ* + nλ)], and that
the hypothesis H0: λ* = λ is equivalent to the hypothesis H0: prob=[m /(m + n)].
Cox and Hinkley (1974, p.245) suggest using the normal approximation to the binomial
distribution (in this case, without the continuity correction; see Zar, 2010,
pp.534-536 for information on the continuity correction associated with the normal
approximation to the binomial distribution). Under the null hypothesis H0: λ* = λ, the
standardized form of Y based on this binomial approximation (the quantity in
Equation (10)) is approximately distributed as a standard normal random variable.
The Case When k = 1
When k=1 and pi.type="two-sided", the prediction limits are computed by solving the
equation that sets the quantity in Equation (10) equal to z(1-α/2), where z(p) denotes
the p'th quantile of the standard normal distribution. In this case, Gibbons (1987b)
notes that the quantity K in Equation (3) above has a closed-form solution in terms of
z(1-α/2), X, n, and m.
When pi.type="lower" or pi.type="upper", K is computed exactly as above, except
z(1-α/2) is replaced with z(1-α).
The Case When k > 1
When k > 1, Gibbons (1987b) suggests using the Bonferroni inequality. That is, the
value of K is computed exactly as for the case k = 1 described above, except that the
Bonferroni value of α is used in place of the usual value of α:
When pi.type="two-sided", α/2 is replaced with α/(2k).
When pi.type="lower" or pi.type="upper", α is replaced with α/k.
Conditional Distribution Approximation Based on Student's t-Distribution
(method="conditional.approx.t")
When method="conditional.approx.t", the exact same procedure is used as when
method="conditional.approx.normal", except that the quantity in Equation (10) is
assumed to follow a Student's t-distribution with n-1 degrees of freedom. Thus, all
occurrences of z(p) are replaced with t(n-1, p), where t(n-1, p) denotes the p'th
quantile of Student's t-distribution with n-1 degrees of freedom.
Normal Approximation (method="normal.approx")
The normal approximation for Poisson prediction limits was given by Nelson (1970;
1982, p.203) and is based on the fact that the mean and variance of a Poisson
distribution are the same (Johnson et al., 1992, p.157), and for “large” values of n
and m, both X and Y are approximately normally distributed.
The Case When k = 1
The quantity Y - m(X/n) is approximately normally distributed with expectation 0 (when
λ* = λ) and variance mλ(1 + m/n), so the quantity
[Y - m(X/n)] / sqrt[m(X/n)(1 + m/n)]
is approximately distributed as a standard normal random variable. The function
predIntPois, however, assumes this quantity is distributed as approximately a Student's
t-distribution with n-1 degrees of freedom.
Thus, following the idea of prediction intervals for a normal distribution (see
predIntNorm), when pi.type="two-sided", the constant K for a (1-α)100% prediction
interval for the next k=1 sum of m observations is computed as:
K = t(n-1, 1-α/2) sqrt[m(X/n)(1 + m/n)]
where t(n-1, p) denotes the p'th quantile of a Student's t-distribution with n-1
degrees of freedom.
Similarly, when pi.type="lower" or pi.type="upper", the constant K is computed as:
K = t(n-1, 1-α) sqrt[m(X/n)(1 + m/n)]
The Case When k > 1
When k > 1, the value of K is computed exactly as for the case k = 1 described above,
except that the Bonferroni value of α is used in place of the usual value of α:
When pi.type="two-sided", α/2 is replaced with α/(2k).
When pi.type="lower" or pi.type="upper", α is replaced with α/k.
Hahn and Nelson (1973, p.182) discuss another method of computing K when k > 1, but
this method is not implemented here.
If x
is a numeric vector, predIntPois
returns a list of class
"estimate"
containing the estimated parameter, the prediction interval,
and other information. See the help file for estimate.object
for details.
If x
is the result of calling an estimation function,
predIntPois
returns a list whose class is the same as x
.
The list contains the same components as x
, as well as a component called
interval
containing the prediction interval information.
If x
already has a component called interval
, this component is
replaced with the prediction interval information.
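A short sketch (not in the original help file) showing how to extract the prediction limits from the returned "estimate" object; the data are simulated.

set.seed(5)
pi.obj <- predIntPois(rpois(25, lambda = 4), pi.type = "upper")
class(pi.obj)             # "estimate"
pi.obj$interval$limits    # named vector with elements LPL and UPL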
Prediction and tolerance intervals have long been applied to quality control and life testing problems. Nelson (1970) notes that his development of confidence and prediction limits for the Poisson distribution is based on well-known results dating back to the 1950s. Hahn and Nelson (1973) review prediction intervals for several distributions, including Poisson prediction intervals. The monograph by Hahn and Meeker (1991) includes a discussion of Poisson prediction intervals.
Gibbons (1987b) uses the Poisson distribution to model the number of detected
compounds per scan of the 32 volatile organic priority pollutants (VOC), and also
to model the distribution of chemical concentration (in ppb), and presents formulas
for prediction and tolerance intervals. The formulas for prediction intervals are
based on Cox and Hinkley (1974, p.245). Gibbons (1987b) only deals with
the case where n.sum=1
.
Gibbons et al. (2009, pp. 72–76) discuss methods for Poisson prediction limits.
Steven P. Millard ([email protected])
Cox, D.R., and D.V. Hinkley. (1974). Theoretical Statistics. Chapman and Hall, New York, pp.242–245.
Gibbons, R.D. (1987b). Statistical Models for the Analysis of Volatile Organic Compounds in Waste Disposal Sites. Ground Water 25, 572–580.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken, pp. 72–76.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178–188.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 4.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York, pp.8, 76–81.
Nelson, W.R. (1970). Confidence Intervals for the Ratio of Two Poisson Means and Poisson Predictor Intervals. IEEE Transactions on Reliability R-19, 42–49.
Nelson, W.R. (1982). Applied Life Data Analysis. John Wiley and Sons, New York, pp.200–204.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, pp. 585–586.
Poisson
, epois
,
estimate.object
, Prediction Intervals,
tolIntPois
, Estimating Distribution Parameters.
# Generate 20 observations from a Poisson distribution with parameter # lambda=2. The interval [0, 4] contains 94.7% of this distribution and # the interval [0,5] contains 98.3% of this distribution. Thus, because # of the discrete nature of the Poisson distribution, no interval contains # exactly 95% of this distribution. Use predIntPois to estimate the mean # parameter of the true distribution, and construct a one-sided upper # 95% prediction interval for the next single observation from this distribution. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rpois(20, lambda = 2) predIntPois(dat, pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 1.8 # #Estimation Method: mle/mme/mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: conditional # #Prediction Interval Type: upper # #Confidence Level: 95% # #Number of Future Observations: 1 # #Prediction Interval: LPL = 0 # UPL = 5 #---------- # Compare results above with the other approximation methods: predIntPois(dat, method = "conditional.approx.normal", pi.type = "upper")$interval$limits #LPL UPL # 0 4 predIntPois(dat, method = "conditional.approx.t", pi.type = "upper")$interval$limits #LPL UPL # 0 4 predIntPois(dat, method = "normal.approx", pi.type = "upper")$interval$limits #LPL UPL # 0 4 #Warning message: #In predIntPois(dat, method = "normal.approx", pi.type = "upper") : # Estimated value of 'lambda' and/or number of future observations # is/are probably too small for the normal approximation to work well. #========== # Using the same data as in the previous example, compute a one-sided # upper 95% prediction limit for k=10 future observations. # Using conditional approximation method based on the normal distribution. predIntPois(dat, k = 10, method = "conditional.approx.normal", pi.type = "upper") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 1.8 # #Estimation Method: mle/mme/mvue # #Data: dat # #Sample Size: 20 # #Prediction Interval Method: conditional.approx.normal # #Prediction Interval Type: upper # #Confidence Level: 95% # #Number of Future Observations: 10 # #Prediction Interval: LPL = 0 # UPL = 6 # Using method based on approximating conditional distribution with # Student's t-distribution predIntPois(dat, k = 10, method = "conditional.approx.t", pi.type = "upper")$interval$limits #LPL UPL # 0 6 #========== # Repeat the above example, but set k=5 and n.sum=3. Thus, we want a # 95% upper prediction limit for the next 5 sets of sums of 3 observations. predIntPois(dat, k = 5, n.sum = 3, method = "conditional.approx.t", pi.type = "upper")$interval$limits #LPL UPL # 0 12 #========== # Reproduce Example 3.6 in Gibbons et al. (2009, p. 75) # A 32-constituent VOC scan was performed for n=16 upgradient # samples and there were 5 detections out of these 16. We # want to construct a one-sided upper 95% prediction limit # for 20 monitoring wells (so k=20 future observations) based # on these data. # First we need to create a data set that will yield a mean # of 5/16 based on a sample size of 16. Any number of data # sets will do. Here are two possible ones: dat <- c(rep(1, 5), rep(0, 11)) dat <- c(2, rep(1, 3), rep(0, 12)) # Now call predIntPois. Don't round the limits so we can # compare to the example in Gibbons et al. (2009). 
predIntPois(dat, k = 20, method = "conditional.approx.t", pi.type = "upper", round.limits = FALSE) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 0.3125 # #Estimation Method: mle/mme/mvue # #Data: dat # #Sample Size: 16 # #Prediction Interval Method: conditional.approx.t # #Prediction Interval Type: upper # #Confidence Level: 95% # #Number of Future Observations: 20 # #Prediction Interval: LPL = 0.000000 # UPL = 2.573258 #========== # Cleanup #-------- rm(dat)
Formats and prints the results of calling the function boxcox
.
This method is automatically called by print
when given an
object of class "boxcox"
. The names of other functions involved in
Box-Cox transformations are listed under Data Transformations.
## S3 method for class 'boxcox' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "boxcox"
method for the generic function print
.
Prints the objective name, the name of the data object used, the sample size,
the values of the powers, and the values of the objective. In the case of
optimization, also prints the range of powers over which the optimization
took place.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
boxcox
, boxcox.object
, plot.boxcox
,
Data Transformations, print
.
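A minimal usage sketch, not taken from the original help file; the data are simulated, and it is assumed that boxcox accepts a numeric vector together with the optimize argument as documented elsewhere in this manual.

# Find the optimal Box-Cox transformation for simulated lognormal data;
# typing the object's name dispatches print.boxcox.
set.seed(12)
boxcox.list <- boxcox(rlnorm(30, meanlog = 1), optimize = TRUE)
boxcox.list   # equivalent to print(boxcox.list)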
Formats and prints the results of calling the function boxcoxCensored
.
This method is automatically called by print
when given an
object of class "boxcoxCensored"
.
## S3 method for class 'boxcoxCensored' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "boxcoxCensored"
method for the generic function
print
.
Prints the objective name, the name of the data object used, the sample size,
the percentage of censored observations, the values of the powers, and the
values of the objective. In the case of
optimization, also prints the range of powers over which the optimization
took place.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
boxcoxCensored
, boxcoxCensored.object
,
plot.boxcoxCensored
,
Data Transformations, print
.
Formats and prints the results of calling the function boxcox
when the argument x
supplied to boxcox
is an object of
class "lm"
. This method is automatically called by print
when given an object of class "boxcoxLm"
. The names of other functions
involved in Box-Cox transformations are listed under Data Transformations.
## S3 method for class 'boxcoxLm' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "boxcoxLm"
method for the generic function print
.
Prints the objective name, the details of the "lm"
object used,
the sample size,
the values of the powers, and the values of the objective. In the case of
optimization, also prints the range of powers over which the optimization
took place.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
boxcox
, boxcoxLm.object
, plot.boxcoxLm
,
Data Transformations, print
.
Formats and prints the results of calling the function distChoose
, which
uses a series of goodness-of-fit tests to choose among candidate distributions.
This method is automatically called by print
when given an
object of class "distChoose"
.
## S3 method for class 'distChoose' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "distChoose"
method for the generic function print
.
Prints the candidate distributions, method used to choose among the candidate distributions,
chosen distribution, Type I error associated with each goodness-of-fit test,
estimated population parameter(s) associated with the chosen distribution,
estimation method, goodness-of-fit test results for each candidate distribution,
and the data name.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
distChoose
, distChoose.object
,
Goodness-of-Fit Tests, print
.
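A minimal usage sketch, not from the original help file; the data are simulated and the default candidate distributions and settings of distChoose are assumed.

# Choose among the default candidate distributions for simulated gamma data;
# typing the object's name dispatches print.distChoose.
set.seed(77)
choose.list <- distChoose(rgamma(40, shape = 2, scale = 3))
choose.list   # equivalent to print(choose.list)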
Formats and prints the results of calling the function distChooseCensored
, which
uses a series of goodness-of-fit tests to choose among candidate distributions based on
censored data.
This method is automatically called by print
when given an
object of class "distChooseCensored"
.
## S3 method for class 'distChooseCensored' print(x, show.cen.levels = TRUE, pct.censored.digits = .Options$digits, ...)
x |
an object of class |
show.cen.levels |
logical scalar indicating whether to print the censoring levels. The default is
|
pct.censored.digits |
numeric scalar indicating the number of significant digits to print for the percent of censored observations. |
... |
arguments that can be supplied to the |
This is the "distChooseCensored"
method for the generic function
print
.
Prints the candidate distributions,
method used to choose among the candidate distributions,
chosen distribution, Type I error associated with each goodness-of-fit test,
estimated population parameter(s) associated with the chosen distribution,
estimation method, goodness-of-fit test results for each candidate distribution,
and the data name and censoring variable.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
distChooseCensored
, distChooseCensored.object
,
Censored Data, print
.
Formats and prints the results of EnvStats functions that estimate
the parameters or quantiles of a probability distribution and optionally
construct confidence, prediction, or tolerance intervals based on a sample
of data assumed to come from that distribution.
This method is automatically called by print
when given an
object of class "estimate"
.
See the help files Estimating Distribution Parameters and Estimating Distribution Quantiles for lists of functions that estimate distribution parameters and quantiles. See the help files Prediction Intervals and Tolerance Intervals for lists of functions that create prediction and tolerance intervals.
## S3 method for class 'estimate' print(x, conf.cov.sig.digits = .Options$digits, limits.sig.digits = .Options$digits, ...)
x |
an object of class |
conf.cov.sig.digits |
numeric scalar indicating the number of significant digits to print for the confidence level or coverage of a confidence, prediction, or tolerance interval. |
limits.sig.digits |
numeric scalar indicating the number of significant digits to print for the upper and lower limits of a confidence, prediction, or tolerance interval. |
... |
arguments that can be supplied to the |
This is the "estimate"
method for the generic function
print
.
Prints estimated parameters and, if present in the object, information regarding
confidence, prediction, or tolerance intervals.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
estimate.object
,
Estimating Distribution Parameters,
Estimating Distribution Quantiles,
Prediction Intervals, Tolerance Intervals,
print
.
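A brief sketch, not part of the original help file, showing how the digit arguments control the printed output; the data are simulated and enorm is the EnvStats estimator for the normal distribution.

# Estimate the mean and standard deviation of a normal distribution, request
# a 95% confidence interval for the mean, and print the result using fewer
# significant digits for the confidence limits.
set.seed(33)
est <- enorm(rnorm(20, mean = 10, sd = 2), ci = TRUE)
print(est, conf.cov.sig.digits = 2, limits.sig.digits = 3)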
Formats and prints the results of EnvStats functions that estimate
the parameters or quantiles of a probability distribution and optionally
construct confidence, prediction, or tolerance intervals based on a sample
of Type I censored data assumed to come from that distribution.
This method is automatically called by print
when given an
object of class "estimateCensored"
.
See the subsections Estimating Distribution Parameters and Estimating Distribution Quantiles in the help file Censored Data for lists of functions that estimate distribution parameters and quantiles based on Type I censored data.
See the subsection Prediction and Tolerance Intervals in the help file Censored Data for lists of functions that create prediction and tolerance intervals.
## S3 method for class 'estimateCensored' print(x, show.cen.levels = TRUE, pct.censored.digits = .Options$digits, conf.cov.sig.digits = .Options$digits, limits.sig.digits = .Options$digits, ...)
x |
an object of class |
show.cen.levels |
logical scalar indicating whether to print the censoring levels. The default is
|
pct.censored.digits |
numeric scalar indicating the number of significant digits to print for the percent of censored observations. |
conf.cov.sig.digits |
numeric scalar indicating the number of significant digits to print for the confidence level or coverage of a confidence, prediction, or tolerance interval. |
limits.sig.digits |
numeric scalar indicating the number of significant digits to print for the upper and lower limits of a confidence, prediction, or tolerance interval. |
... |
arguments that can be supplied to the |
This is the "estimateCensored"
method for the generic function
print
.
Prints estimated parameters and, if present in the object, information regarding
confidence, prediction, or tolerance intervals.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
estimateCensored.object
,
Censored Data, print
.
Formats and prints the results of performing a goodness-of-fit test. This method is
automatically called by print
when given an object of class "gof"
.
The names of the functions that perform goodness-of-fit tests and that produce objects of class
"gof"
are listed under Goodness-of-Fit Tests.
## S3 method for class 'gof' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "gof"
method for the generic function print
.
Prints name of the test, hypothesized distribution, estimated population parameter(s),
estimation method, data name, sample size, value of the test statistic, parameters
associated with the null distribution of the test statistic, p-value associated with the
test statistic, and the alternative hypothesis.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Goodness-of-Fit Tests, gof.object
,
print
.
Formats and prints the results of performing a goodness-of-fit test. This method is
automatically called by print
when given an object of class
"gofCensored"
. Currently, the only function that produces an object of
this class is gofTestCensored
.
## S3 method for class 'gofCensored' print(x, show.cen.levels = TRUE, pct.censored.digits = .Options$digits, ...)
x |
an object of class |
show.cen.levels |
logical scalar indicating whether to print the censoring levels. The default is
|
pct.censored.digits |
numeric scalar indicating the number of significant digits to print for the percent of censored observations. |
... |
arguments that can be supplied to the |
This is the "gofCensored"
method for the generic function
print
.
Prints name of the test, hypothesized distribution, estimated population parameter(s),
estimation method, data name, sample size, censoring information, value of the test
statistic, parameters associated with the null distribution of the test statistic,
p-value associated with the test statistic, and the alternative hypothesis.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Censored Data, gofCensored.object
,
print
.
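A brief sketch, not from the original help file, using simulated data that are artificially left-censored at a single censoring level; gofTestCensored (named above) is called with its default test and distribution.

# Run the default goodness-of-fit test on singly left-censored data and print
# the result without showing the censoring level.
set.seed(8)
x <- rnorm(30, mean = 10, sd = 2)
censored <- x < 8
x[censored] <- 8
gof.cen <- gofTestCensored(x, censored)
print(gof.cen, show.cen.levels = FALSE)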
Formats and prints the results of performing a group goodness-of-fit test.
This method is automatically called by print
when given an
object of class "gofGroup"
. Currently,
the only EnvStats function that performs a group goodness-of-fit test
that produces an object of class "gofGroup"
is gofGroupTest
.
## S3 method for class 'gofGroup' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "gofGroup"
method for the generic function
print
.
See the help file for gofGroup.object for information on the contents of this kind of object.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Goodness-of-Fit Tests,
gofGroup.object
,
print
.
Formats and prints the results of performing a goodness-of-fit test for outliers.
This method is automatically called by print
when given an object of
class "gofOutlier"
. The names of the functions that perform goodness-of-fit tests
for outliers and that produce objects of class "gofOutlier"
are listed under
Tests for Outliers.
## S3 method for class 'gofOutlier' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "gofOutlier"
method for the generic function print
.
Prints name of the test, hypothesized distribution, data name, sample size,
value of the test statistic, parameters associated with the null distribution of the
test statistic, Type I error, critical values, and the alternative hypothesis.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Tests for Outliers, gofOutlier.object
,
print
.
Formats and prints the results of performing a two-sample goodness-of-fit test.
This method is automatically called by print
when given an
object of class "gofTwoSample"
. Currently,
the only EnvStats function that performs a two-sample goodness-of-fit test
that produces an object of class "gofTwoSample"
is gofTest
.
## S3 method for class 'gofTwoSample' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "gofTwoSample"
method for the generic function
print
.
See the help file for gofTwoSample.object for information on the contents of this kind of object.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Goodness-of-Fit Tests, gofTwoSample.object
,
print
.
Formats and prints the results of performing a hypothesis test based on censored data.
This method is automatically called by print
when given an
object of class "htestCensored"
. The names of the EnvStats functions
that perform hypothesis tests based on censored data and that produce objects of
class "htestCensored"
are listed in the section Hypothesis Tests
in the help file
EnvStats Functions for Censored Data.
Currently, the only function listed is
twoSampleLinearRankTestCensored
.
## S3 method for class 'htestCensored' print(x, show.cen.levels = TRUE, ...)
x |
an object of class |
show.cen.levels |
logical scalar indicating whether to print the censoring levels. The default value
is |
... |
arguments that can be supplied to the |
This is the "htestCensored"
method for the generic function
print
.
Prints null and alternative hypotheses, name of the test, censoring side,
estimated population parameter(s) involved in the null hypothesis,
estimation method (if present),
data name, censoring variable, sample size (if present),
percent of observations that are censored,
number of missing observations removed prior to performing the test (if present),
value of the test statistic,
parameters associated with the null distribution of the test statistic,
p-value associated with the test statistic, and confidence interval for the
population parameter (if present).
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Censored Data, htestCensored.object
,
print
.
This is a modification of the R function print.htest
that formats and
prints the results of performing a hypothesis test. This method is
automatically called by the EnvStats generic function print
when
given an object of class "htestEnvStats"
. The names of the EnvStats functions
that perform hypothesis tests and that produce objects of class
"htestEnvStats"
are listed under Hypothesis Tests.
## S3 method for class 'htestEnvStats' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "htestEnvStats"
method for the EnvStats generic function
print
.
Prints null and alternative hypotheses, name of the test, estimated population
parameter(s) involved in the null hypothesis, estimation method (if present),
data name, sample size (if present), number of missing observations removed
prior to performing the test (if present), value of the test statistic,
parameters associated with the null distribution of the test statistic,
p-value associated with the test statistic, and confidence interval for the
population parameter (if present).
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
Hypothesis Tests, htest.object
,
print
.
Formats and prints the results of performing a permutation test. This method is
automatically called by print
when given an object of class
"permutationTest"
. Currently, the EnvStats functions that perform
permutation tests and produce objects of class "permutationTest"
are:
oneSamplePermutationTest
,
twoSamplePermutationTestLocation
, and twoSamplePermutationTestProportion
.
## S3 method for class 'permutationTest' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "permutationTest"
method for the generic function
print
. Prints null and alternative hypotheses,
name of the test, estimated population
parameter(s) involved in the null hypothesis, estimation method (if present),
data name, sample size (if present), number of missing observations removed
prior to performing the test (if present), value of the test statistic,
parameters associated with the null distribution of the test statistic,
and p-value associated with the test statistic.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
permutationTest.object
, oneSamplePermutationTest
,
twoSamplePermutationTestLocation
,
twoSamplePermutationTestProportion
, Hypothesis Tests,
print
.
Formats and prints the results of calling summaryStats
or
summaryFull
. This method is automatically called by
print
when given an object of class "summaryStats"
.
## S3 method for class 'summaryStats' print(x, ...)
x |
an object of class |
... |
arguments that can be supplied to the |
This is the "summaryStats"
method for the generic function
print
. Prints summary statistics.
Invisibly returns the input x
.
Steven P. Millard ([email protected])
Chambers, J. M. and Hastie, T. J. (1992). Statistical Models in S. Wadsworth & Brooks/Cole.
summaryStats
, summaryFull
,
summaryStats.object
, print
.
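A one-line usage sketch with simulated data (not part of the original help file).

# summaryStats returns an object of class "summaryStats"; typing the result
# at the prompt dispatches print.summaryStats.
set.seed(3)
summaryStats(rnorm(25, mean = 5))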
Compute the minimal detectable difference associated with a one- or two-sample proportion test, given the sample size, power, and significance level.
propTestMdd(n.or.n1, n2 = n.or.n1, p0.or.p2 = 0.5, alpha = 0.05, power = 0.95, sample.type = "one.sample", alternative = "two.sided", two.sided.direction = "greater", approx = TRUE, correct = sample.type == "two.sample", warn = TRUE, return.exact.list = TRUE, tol = 1e-07, maxiter = 1000)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is |
p0.or.p2 |
numeric vector of proportions. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power associated with
the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis.
The possible values are |
two.sided.direction |
character string indicating the direction (positive or negative) for the
minimal detectable difference when |
approx |
logical scalar indicating whether to compute the power based on the normal
approximation to the binomial distribution. The default value is |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning. The default value is |
return.exact.list |
logical scalar relevant to the case when |
tol |
numeric scalar passed to the |
maxiter |
integer passed to the |
If the arguments n.or.n1
, n2
, p0.or.p2
, alpha
, and
power
are not all the same length, they are replicated to be the same
length as the length of the longest argument.
One-Sample Case (sample.type="one.sample")
The help file for propTestPower gives references that explain how the power of the
one-sample proportion test is computed based on the values of p0 (the hypothesized
value of p, the probability of “success”), p (the true value of p), the sample size n,
and the Type I error level α. The function propTestMdd computes the value of the
minimal detectable difference for specified values of sample size, power, and Type I
error level by calling the uniroot function to perform a search.
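The root search described above can be mimicked directly. The following sketch is not from the original help file; it assumes the EnvStats function propTestPower with the argument names used elsewhere in this manual, and the sample size, baseline proportion, and target power are arbitrary.

# Find the one-sample minimal detectable difference by solving
# power(delta) = target.power with uniroot, then compare the answer with
# the value returned by propTestMdd.
target.power <- 0.8
n <- 50
p0 <- 0.5
f <- function(delta) {
  propTestPower(n.or.n1 = n, p.or.p1 = p0 + delta, p0.or.p2 = p0,
    alternative = "greater") - target.power
}
uniroot(f, interval = c(0.01, 0.45))$root
# Compare with the built-in function:
propTestMdd(n.or.n1 = n, p0.or.p2 = p0, power = target.power,
  alternative = "greater")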
Two-Sample Case (sample.type="two.sample")
The help file for propTestPower gives references that explain how the power of the
two-sample proportion test is computed based on the values of p1 (the value of the
probability of “success” for group 1), p2 (the value of the probability of “success”
for group 2), the sample sizes for groups 1 and 2 (n1 and n2), and the Type I error
level α. The function propTestMdd computes the value of the minimal detectable
difference for specified values of sample size, power, and Type I error level by
calling the uniroot function to perform a search.
Approximate Test (approx=TRUE
).
numeric vector of minimal detectable differences.
Exact Test (approx=FALSE
).
If return.exact.list=FALSE
, propTestMdd
returns a numeric vector of
minimal detectable differences.
If return.exact.list=TRUE
, propTestMdd
returns a list with the
following components:
delta |
numeric vector of minimal detectable differences. |
power |
numeric vector of powers. |
alpha |
numeric vector containing the true significance levels.
Because of the discrete nature of the binomial distribution, the true significance
levels usually do not equal the significance level supplied by the user in the
argument |
q.critical.lower |
numeric vector of lower critical values for rejecting the null
hypothesis. If the observed number of "successes" is less than or equal to these values,
the null hypothesis is rejected. (Not present if |
q.critical.upper |
numeric vector of upper critical values for rejecting the null
hypothesis. If the observed number of "successes" is greater than these values,
the null hypothesis is rejected. (Not present if |
See the help file for propTestPower
.
Steven P. Millard ([email protected])
See the help file for propTestPower
.
propTestPower
, propTestN
,
plotPropTestDesign
, prop.test
, binom.test
.
# Look at how the minimal detectable difference of the one-sample # proportion test increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 mdd <- propTestMdd(n.or.n1 = 50, power = seq(0.5, 0.9, by=0.1)) round(mdd, 2) #[1] 0.14 0.16 0.17 0.19 0.22 #---------- # Repeat the last example, but compute the minimal detectable difference # based on the exact test instead of the approximation. Note that with a # sample size of 50, the largest significance level less than or equal to # 0.05 for the two-sided alternative is 0.03. mdd.list <- propTestMdd(n.or.n1 = 50, power = seq(0.5, 0.9, by = 0.1), approx = FALSE) lapply(mdd.list, round, 2) #$delta #[1] 0.15 0.17 0.18 0.20 0.23 # #$power #[1] 0.5 0.6 0.7 0.8 0.9 # #$alpha #[1] 0.03 0.03 0.03 0.03 0.03 # #$q.critical.lower #[1] 17 17 17 17 17 # #$q.critical.upper #[1] 32 32 32 32 32 #========== # Look at how the minimal detectable difference for the two-sample # proportion test decreases with increasing sample sizes. Note that for # the specified significance level, power, and true proportion in group 2, # no minimal detectable difference is attainable for a sample size of 10 in # each group. seq(10, 50, by=10) #[1] 10 20 30 40 50 propTestMdd(n.or.n1 = seq(10, 50, by = 10), p0.or.p2 = 0.5, sample.type = "two", alternative="greater") #[1] NA 0.4726348 0.4023564 0.3557916 0.3221412 #Warning messages: #1: In propTestMdd(n.or.n1 = seq(10, 50, by = 10), p0.or.p2 = 0.5, # sample.type = "two", : # Elements with a missing value (NA) indicate no attainable minimal detectable # difference for the given values of 'n1', 'n2', 'p2', 'alpha', and 'power' #2: In propTestMdd(n.or.n1 = seq(10, 50, by = 10), p0.or.p2 = 0.5, # sample.type = "two", : # The sample sizes 'n1' and 'n2' are too small, relative to the computed value # of 'p1' and the given value of 'p2', for the normal approximation to work # well for the following element indices: # 2 3 #---------- # Look at how the minimal detectable difference for the two-sample proportion # test decreases with increasing values of Type I error: mdd <- propTestMdd(n.or.n1 = 100, n2 = 120, p0.or.p2 = 0.4, sample.type = "two", alpha = c(0.01, 0.05, 0.1, 0.2)) round(mdd, 2) #[1] 0.29 0.25 0.23 0.20 #---------- # Clean up #--------- rm(mdd, mdd.list) #========== # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), determine the # minimal detectable difference to detect a difference in the proportion of # detects of cadmium between the background and compliance wells. Set the # compliance well to "group 1" and the background well to "group 2". Assume # the true probability of a "detect" at the background well is 1/3, use a # 5% significance level, use 80%, 90%, and 95% power, use the given sample # sizes of 64 observations at the compliance well and 24 observations at the # background well, and use the upper one-sided alternative (probability of a # "detect" at the compliance well is greater than the probability of a "detect" # at the background well). # (The data are stored in EPA.89b.cadmium.df.) # # Note that the minimal detectable difference increases from 0.32 to 0.37 to 0.40 as # the required power increases from 80% to 90% to 95%. Thus, in order to detect a # difference in probability of detection between the compliance and background # wells, the probability of detection at the compliance well must be 0.65, 0.70, # or 0.74 (depending on the required power). 
EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background # .......................................... #86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance p.hat.back <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Background"])) p.hat.back #[1] 0.3333333 p.hat.comp <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Compliance"])) p.hat.comp #[1] 0.375 n.back <- with(EPA.89b.cadmium.df, sum(Well.type == "Background")) n.back #[1] 24 n.comp <- with(EPA.89b.cadmium.df, sum(Well.type == "Compliance")) n.comp #[1] 64 mdd <- propTestMdd(n.or.n1 = n.comp, n2 = n.back, p0.or.p2 = p.hat.back, power = c(.80, .90, .95), sample.type = "two", alternative = "greater") round(mdd, 2) #[1] 0.32 0.37 0.40 round(mdd + p.hat.back, 2) #[1] 0.65 0.70 0.73 #---------- # Clean up #--------- rm(p.hat.back, p.hat.comp, n.back, n.comp, mdd)
Compute the sample size necessary to achieve a specified power for a one- or two-sample proportion test, given the true proportion(s) and significance level.
propTestN(p.or.p1, p0.or.p2, alpha = 0.05, power = 0.95, sample.type = "one.sample", alternative = "two.sided", ratio = 1, approx = TRUE, correct = sample.type == "two.sample", round.up = TRUE, warn = TRUE, return.exact.list = TRUE, n.min = 2, n.max = 10000, tol.alpha = 0.1 * alpha, tol = 1e-7, maxiter = 1000)
p.or.p1 |
numeric vector of proportions. When |
p0.or.p2 |
numeric vector of proportions. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power associated with
the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute sample size based on a one-sample or
two-sample hypothesis test. |
alternative |
character string indicating the kind of alternative hypothesis.
The possible values are |
ratio |
numeric vector indicating the ratio of sample size in group 2 to sample size
in group 1 ( |
approx |
logical scalar indicating whether to compute the sample size based on the normal
approximation to the binomial distribution. The default value is |
correct |
logical scalar indicating whether to use the continuity correction when |
round.up |
logical scalar indicating whether to round up the values of the computed sample size(s)
to the next smallest integer. The default value is |
warn |
logical scalar indicating whether to issue a warning. The default value is |
return.exact.list |
logical scalar relevant to the case when |
n.min |
integer relevant to the case when |
n.max |
integer relevant to the case when |
tol.alpha |
numeric vector relevant to the case when |
tol |
numeric scalar relevant to the case when |
maxiter |
integer relevant to the case when |
If the arguments p.or.p1
, p0.or.p2
, alpha
, power
, ratio
,
and tol.alpha
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
The computed sample size is based on the difference p.or.p1 - p0.or.p2
.
One-Sample Case (sample.type="one.sample"
).
approx=TRUE
. When sample.type="one.sample"
and approx=TRUE
,
sample size is computed based on the test that uses the normal approximation to the
binomial distribution; see the help file for prop.test
.
The formula for this test and the associated power is presented in
standard statistics texts, including Zar (2010, pp. 534-537, 539-541).
These equations can be inverted to solve for the sample size, given a specified power,
significance level, hypothesized proportion, and true proportion.
approx=FALSE
. When sample.type="one.sample"
and approx=FALSE
,
sample size is computed based on the exact binomial test; see the help file for binom.test
.
The formula for this test and its associated power is presented in standard statistics texts,
including Zar (2010, pp. 532-534, 539) and
Millard and Neerchal (2001, pp. 385-386, 504-506). The formula for the power involves
five quantities: the hypothesized proportion (p0), the true proportion (p), the significance level (alpha), the power, and the sample size (n).
In this case the function
propTestN
uses a search algorithm to determine the
required sample size to attain a specified power, given the values of the
hypothesized and true proportions and the significance level.
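As a quick illustration (this snippet is mine and not part of the original examples), the sample size returned by this search can be fed back into propTestPower() with approx = FALSE to confirm that the attained power is at least the requested power:

n.exact <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = 0.8,
    approx = FALSE, return.exact.list = FALSE)
n.exact
propTestPower(n.or.n1 = n.exact, p.or.p1 = 0.7, p0.or.p2 = 0.5,
    approx = FALSE, return.exact.list = FALSE)   # should be >= 0.8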
Two-Sample Case (sample.type="two.sample"
).
When sample.type="two.sample"
, sample size is computed based on the test that uses the
normal approximation to the binomial distribution;
see the help file for prop.test
.
The formula for this test and its associated power is presented in standard statistics texts,
including Zar (2010, pp. 549-550, 552-553) and
Millard and Neerchal (2001, pp. 443-445, 508-510).
These equations can be inverted to solve for the sample size, given a specified power,
significance level, true proportions, and ratio of sample size in group 2 to sample size in
group 1.
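For a rough cross-check (my own example, not from the original documentation), the two-sample result can be compared with the base R function stats::power.prop.test(), which inverts the same normal-approximation power formula but does not apply a continuity correction; correct = FALSE is therefore used in the propTestN() call below.

# Sample size per group from propTestN, without the continuity correction
propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = 0.9,
    sample.type = "two", correct = FALSE)

# Comparable (unrounded) sample size per group from base R
power.prop.test(p1 = 0.7, p2 = 0.5, power = 0.9, sig.level = 0.05)$n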
Approximate Test (approx=TRUE
).
When sample.type="one.sample"
, or sample.type="two.sample"
and ratio=1
(i.e., equal sample sizes for each group), propTestN
returns a numeric vector of sample sizes. When sample.type="two.sample"
and at least one element of ratio
is
greater than 1, propTestN
returns a list with two components called
n1
and n2
, specifying the sample sizes for each group.
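For example (a sketch of mine, not from the original help file), requesting twice as many observations in group 2 as in group 1 returns the per-group sample sizes as a list:

propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = 0.9,
    sample.type = "two", ratio = 2)
# The returned list has components n1 and n2, with n2 roughly twice n1.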
Exact Test (approx=FALSE
).
If return.exact.list=FALSE
, propTestN
returns a numeric vector of sample sizes.
If return.exact.list=TRUE
, propTestN
returns a list with the following components:
n |
numeric vector of sample sizes. |
power |
numeric vector of powers. |
alpha |
numeric vector containing the true significance levels.
Because of the discrete nature of the binomial distribution, the true significance
levels usually do not equal the significance level supplied by the user in the
argument |
q.critical.lower |
numeric vector of lower critical values for rejecting the null
hypothesis. If the observed number of "successes" is less than or equal to these values,
the null hypothesis is rejected. (Not present if |
q.critical.upper |
numeric vector of upper critical values for rejecting the null
hypothesis. If the observed number of "successes" is greater than these values,
the null hypothesis is rejected. (Not present if |
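The relationship between these components can be checked directly from the binomial distribution (a sketch of mine): under the hypothesized proportion, the attained significance level is the probability of a count at or below q.critical.lower plus the probability of a count above q.critical.upper.

out <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = 0.8, approx = FALSE)
with(out, pbinom(q.critical.lower, size = n, prob = 0.5) +
    pbinom(q.critical.upper, size = n, prob = 0.5, lower.tail = FALSE))
out$alpha   # should agree with the value computed above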
The binomial distribution is used to model processes with binary (Yes-No, Success-Failure,
Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent
of any other trial, and that the probability of “success”, p, is the same on each trial. A binomial discrete random variable X is the number of “successes” in n independent trials. A special case of the binomial distribution occurs when n = 1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model the proportion of times a chemical concentration exceeds a set standard in a given period of time (e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a background well (e.g., USEPA, 1989b, Chapter 8, p.3-7).
In the course of designing a sampling program, an environmental scientist may wish to determine the
relationship between sample size, power, significance level, and the difference between the
hypothesized and true proportions if one of the objectives of the sampling program is to
determine whether a proportion differs from a specified level or whether two proportions differ from each other.
The functions propTestPower
, propTestN
, propTestMdd
, and
plotPropTestDesign
can be used to investigate these relationships for the case of
binomial proportions.
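As a small sketch of such an investigation (mine, not from the original examples), the required sample size can be plotted as a function of the desired power:

pow <- seq(0.5, 0.95, by = 0.05)
n <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = pow)
plot(pow, n, type = "b", xlab = "Power", ylab = "Required Sample Size")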
In a study of the two-sample proportion test, Haseman (1978) found that power formulas that do not incorporate the continuity correction tend to underestimate the power. Casagrande, Pike, and Smith (1978) found that formulas that do incorporate the continuity correction provide an excellent approximation.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapter 15.
Casagrande, J.T., M.C. Pike, and P.G. Smith. (1978). An Improved Approximation Formula for Calculating Sample Sizes for Comparing Two Binomial Distributions. Biometrics 34, 483-486.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Haseman, J.K. (1978). Exact Sample Sizes for Use with the Fisher-Irwin Test for 2x2 Tables. Biometrics 34, 106-109.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-Plus. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
propTestPower
, propTestMdd
, plotPropTestDesign
,
prop.test
, binom.test
.
# Look at how the required sample size of the one-sample # proportion test with a two-sided alternative and Type I error # set to 5% increases with increasing power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = seq(0.5, 0.9, by = 0.1)) #[1] 25 31 38 47 62 #---------- # Repeat the last example, but compute the sample size based on # the exact test instead of the approximation. Note that because # we require the actual Type I error (alpha) to be within # 10% of the supplied value of alpha (which is 0.05 by default), # due to the discrete nature of the exact binomial test # we end up with more power then we specified. n.list <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = seq(0.5, 0.9, by = 0.1), approx = FALSE) lapply(n.list, round, 3) #$n #[1] 37 37 44 51 65 # #$power #[1] 0.698 0.698 0.778 0.836 0.910 # #$alpha #[1] 0.047 0.047 0.049 0.049 0.046 # #$q.critical.lower #[1] 12 12 15 18 24 # #$q.critical.upper #[1] 24 24 28 32 40 #---------- # Using the example above, see how the sample size changes # if we allow the Type I error to deviate by more than 10 percent # of the value of alpha (i.e., by more than 0.005). n.list <- propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, power = seq(0.5, 0.9, by = 0.1), approx = FALSE, tol.alpha = 0.01) lapply(n.list, round, 3) #$n #[1] 25 35 42 49 65 # #$power #[1] 0.512 0.652 0.743 0.810 0.910 # #$alpha #[1] 0.043 0.041 0.044 0.044 0.046 # #$q.critical.lower #[1] 7 11 14 17 24 # #$q.critical.upper #[1] 17 23 27 31 40 #---------- # Clean up #--------- rm(n.list) #========== # Look at how the required sample size for the two-sample # proportion test decreases with increasing difference between # the two population proportions: seq(0.4, 0.1, by = -0.1) #[1] 0.4 0.3 0.2 0.1 propTestN(p.or.p1 = seq(0.4, 0.1, by = -0.1), p0.or.p2 = 0.5, sample.type = "two") #[1] 661 163 70 36 #Warning message: #In propTestN(p.or.p1 = seq(0.4, 0.1, by = -0.1), p0.or.p2 = 0.5, : # The computed sample sizes 'n1' and 'n2' are too small, # relative to the given values of 'p1' and 'p2', for the normal # approximation to work well for the following element indices: # 4 #---------- # Look at how the required sample size for the two-sample # proportion test decreases with increasing values of Type I error: propTestN(p.or.p1 = 0.7, p0.or.p2 = 0.5, sample.type = "two", alpha = c(0.001, 0.01, 0.05, 0.1)) #[1] 299 221 163 137 #========== # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), # determine the required sample size to detect a difference in the # proportion of detects of cadmium between the background and # compliance wells. Set the complicance well to "group 1" and # the backgound well to "group 2". Assume the true probability # of a "detect" at the background well is 1/3, set the probability # of a "detect" at the compliance well to 0.4 and 0.5, use a 5% # significance level and 95% power, and use the upper # one-sided alternative (probability of a "detect" at the compliance # well is greater than the probability of a "detect" at the background # well). (The original data are stored in EPA.89b.cadmium.df.) # # Note that the required sample size decreases from about # 1160 at each well to about 200 at each well as the difference in # proportions changes from (0.4 - 1/3) to (0.5 - 1/3), but both of # these sample sizes are enormous compared to the number of samples # usually collected in the field. 
EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background # .......................................... #86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance p.hat.back <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Background"])) p.hat.back #[1] 0.3333333 p.hat.comp <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Compliance"])) p.hat.comp #[1] 0.375 n.back <- with(EPA.89b.cadmium.df, sum(Well.type == "Background")) n.back #[1] 24 n.comp <- with(EPA.89b.cadmium.df, sum(Well.type == "Compliance")) n.comp #[1] 64 propTestN(p.or.p1 = c(0.4, 0.50), p0.or.p2 = p.hat.back, alt="greater", sample.type="two") #[1] 1159 199 #---------- # Clean up #--------- rm(p.hat.back, p.hat.comp, n.back, n.comp)
Compute the power of a one- or two-sample proportion test, given the sample size(s), true proportion(s), and significance level.
propTestPower(n.or.n1, p.or.p1 = 0.5, n2 = n.or.n1, p0.or.p2 = 0.5, alpha = 0.05, sample.type = "one.sample", alternative = "two.sided", approx = TRUE, correct = sample.type == "two.sample", warn = TRUE, return.exact.list = TRUE)
n.or.n1 |
numeric vector of sample sizes. When |
p.or.p1 |
numeric vector of proportions. When |
n2 |
numeric vector of sample sizes for group 2. The default value is |
p0.or.p2 |
numeric vector of proportions. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis.
The possible values are |
approx |
logical scalar indicating whether to compute the power based on the normal
approximation to the binomial distribution. The default value is |
correct |
logical scalar indicating whether to use the continuity correction when |
warn |
logical scalar indicating whether to issue a warning. The default value is |
return.exact.list |
logical scalar relevant to the case when |
If the arguments n.or.n1
, p.or.p1
, n2
, p0.or.p2
, and
alpha
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
The power is based on the difference p.or.p1 - p0.or.p2
.
One-Sample Case (sample.type="one.sample"
).
approx=TRUE
When sample.type="one.sample"
and approx=TRUE
,
power is computed based on the test that uses the normal approximation to the
binomial distribution; see the help file for prop.test
.
The formula for this test and its associated power is presented in most standard statistics
texts, including Zar (2010, pp. 534-537, 539-541).
approx=FALSE
When sample.type="one.sample"
and approx=FALSE
,
power is computed based on the exact binomial test; see the help file for binom.test
.
The formula for this test and its associated power is presented in most standard statistics
texts, including Zar (2010, pp. 532-534, 539) and
Millard and Neerchal (2001, pp. 385-386, 504-506).
Two-Sample Case (sample.type="two.sample"
).
When sample.type="two.sample"
, power is computed based on the test that uses the
normal approximation to the binomial distribution;
see the help file for prop.test
.
The formula for this test and its associated power is presented in standard statistics texts,
including Zar (2010, pp. 549-550, 552-553) and
Millard and Neerchal (2001, pp. 443-445, 508-510).
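As a rough Monte Carlo check (my own sketch, not part of the original documentation), the power reported by propTestPower() for a two-sample design can be compared with the empirical rejection rate of prop.test() on simulated data:

set.seed(123)
n1 <- 50; n2 <- 50; p1 <- 0.7; p2 <- 0.4
reject <- replicate(2000, {
    x1 <- rbinom(1, size = n1, prob = p1)
    x2 <- rbinom(1, size = n2, prob = p2)
    prop.test(c(x1, x2), c(n1, n2))$p.value < 0.05
})
mean(reject)   # empirical power
propTestPower(n.or.n1 = n1, p.or.p1 = p1, n2 = n2, p0.or.p2 = p2,
    sample.type = "two")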
By default, propTestPower
returns a numeric vector of powers.
For the one-sample proportion test (sample.type="one.sample"
),
when approx=FALSE
and return.exact.list=TRUE
, propTestPower
returns a list with the following components:
power |
numeric vector of powers. |
alpha |
numeric vector containing the true significance levels.
Because of the discrete nature of the binomial distribution, the true significance
levels usually do not equal the significance level supplied by the user in the
argument |
q.critical.lower |
numeric vector of lower critical values for rejecting the null
hypothesis. If the observed number of "successes" is less than or equal to these values,
the null hypothesis is rejected. (Not present if |
q.critical.upper |
numeric vector of upper critical values for rejecting the null
hypothesis. If the observed number of "successes" is greater than these values,
the null hypothesis is rejected. (Not present if |
The binomial distribution is used to model processes with binary (Yes-No, Success-Failure,
Heads-Tails, etc.) outcomes. It is assumed that the outcome of any one trial is independent
of any other trial, and that the probability of “success”, p, is the same on each trial. A binomial discrete random variable X is the number of “successes” in n independent trials. A special case of the binomial distribution occurs when n = 1, in which case X is also called a Bernoulli random variable.
In the context of environmental statistics, the binomial distribution is sometimes used to model the proportion of times a chemical concentration exceeds a set standard in a given period of time (e.g., Gilbert, 1987, p.143), or to compare the proportion of detects in a compliance well vs. a background well (e.g., USEPA, 1989b, Chapter 8, p.3-7).
In the course of designing a sampling program, an environmental scientist may wish to determine the
relationship between sample size, power, significance level, and the difference between the
hypothesized and true proportions if one of the objectives of the sampling program is to
determine whether a proportion differs from a specified level or whether two proportions differ from each other.
The functions propTestPower
, propTestN
, propTestMdd
, and
plotPropTestDesign
can be used to investigate these relationships for the case of
binomial proportions.
In a study of the two-sample proportion test, Haseman (1978) found that power formulas that do not incorporate the continuity correction tend to underestimate the power. Casagrande, Pike, and Smith (1978) found that formulas that do incorporate the continuity correction provide an excellent approximation.
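The effect of the continuity correction is easy to see directly (a brief sketch of mine): compute the approximate power with and without it for the same design.

propTestPower(n.or.n1 = 30, p.or.p1 = 0.7, p0.or.p2 = 0.4,
    sample.type = "two", correct = TRUE)
propTestPower(n.or.n1 = 30, p.or.p1 = 0.7, p0.or.p2 = 0.4,
    sample.type = "two", correct = FALSE)
# The corrected formula gives the smaller (more conservative) power estimate.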
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, FL, Chapter 15.
Casagrande, J.T., M.C. Pike, and P.G. Smith. (1978). An Improved Approximation Formula for Calculating Sample Sizes for Comparing Two Binomial Distributions. Biometrics 34, 483-486.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions. Second Edition. John Wiley and Sons, New York, Chapters 1-2.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York, NY.
Haseman, J.K. (1978). Exact Sample Sizes for Use with the Fisher-Irwin Test for 2x2 Tables. Biometrics 34, 106-109.
Millard, S.P., and N. Neerchal. (2001). Environmental Statistics with S-Plus. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
propTestN
, propTestMdd
, plotPropTestDesign
,
prop.test
, binom.test
.
# Look at how the power of the one-sample proportion test # increases with increasing sample size: seq(20, 50, by=10) #[1] 20 30 40 50 power <- propTestPower(n.or.n1 = seq(20, 50, by=10), p.or.p1 = 0.7, p0 = 0.5) round(power, 2) #[1] 0.43 0.60 0.73 0.83 #---------- # Repeat the last example, but compute the power based on # the exact test instead of the approximation. # Note that the significance level varies with sample size and # never attains the requested level of 0.05. prop.test.list <- propTestPower(n.or.n1 = seq(20, 50, by=10), p.or.p1 = 0.7, p0 = 0.5, approx=FALSE) lapply(prop.test.list, round, 2) #$power: #[1] 0.42 0.59 0.70 0.78 # #$alpha: #[1] 0.04 0.04 0.04 0.03 # #$q.critical.lower: #[1] 5 9 13 17 # #$q.critical.upper: #[1] 14 20 26 32 #========== # Look at how the power of the two-sample proportion test # increases with increasing difference between the two # population proportions: seq(0.5, 0.1, by=-0.1) #[1] 0.5 0.4 0.3 0.2 0.1 power <- propTestPower(30, sample.type = "two", p.or.p1 = seq(0.5, 0.1, by=-0.1)) #Warning message: #In propTestPower(30, sample.type = "two", p.or.p1 = seq(0.5, 0.1, : #The sample sizes 'n1' and 'n2' are too small, relative to the given # values of 'p1' and 'p2', for the normal approximation to work well # for the following element indices: # 5 round(power, 2) #[1] 0.05 0.08 0.26 0.59 0.90 #---------- # Look at how the power of the two-sample proportion test # increases with increasing values of Type I error: power <- propTestPower(30, sample.type = "two", p.or.p1 = 0.7, alpha = c(0.001, 0.01, 0.05, 0.1)) round(power, 2) #[1] 0.02 0.10 0.26 0.37 #========== # Clean up #--------- rm(power, prop.test.list) #========== # Modifying the example on pages 8-5 to 8-7 of USEPA (1989b), # determine how adding another 20 observations to the background # well to increase the sample size from 24 to 44 will affect the # power of detecting a difference in the proportion of detects of # cadmium between the background and compliance wells. Set the # compliance well to "group 1" and set the background well to # "group 2". Assume the true probability of a "detect" at the # background well is 1/3, set the probability of a "detect" at the # compliance well to 0.4, use a 5% significance level, and use the # upper one-sided alternative (probability of a "detect" at the # compliance well is greater than the probability of a "detect" at # the background well). # (The original data are stored in EPA.89b.cadmium.df.) # # Note that the power does increase (from 9% to 12%), but is relatively # very small. EPA.89b.cadmium.df # Cadmium.orig Cadmium Censored Well.type #1 0.1 0.100 FALSE Background #2 0.12 0.120 FALSE Background #3 BDL 0.000 TRUE Background # .......................................... #86 BDL 0.000 TRUE Compliance #87 BDL 0.000 TRUE Compliance #88 BDL 0.000 TRUE Compliance p.hat.back <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Background"])) p.hat.back #[1] 0.3333333 p.hat.comp <- with(EPA.89b.cadmium.df, mean(!Censored[Well.type=="Compliance"])) p.hat.comp #[1] 0.375 n.back <- with(EPA.89b.cadmium.df, sum(Well.type == "Background")) n.back #[1] 24 n.comp <- with(EPA.89b.cadmium.df, sum(Well.type == "Compliance")) n.comp #[1] 64 propTestPower(n.or.n1 = n.comp, p.or.p1 = 0.4, n2 = c(n.back, 44), p0.or.p2 = p.hat.back, alt="greater", sample.type="two") #[1] 0.08953013 0.12421135 #---------- # Clean up #--------- rm(p.hat.back, p.hat.comp, n.back, n.comp)
A real data set of size n = 55 with 18.8% nondetects (10 nondetect observations). The name of the Excel file that comes with ProUCL 5.2.0 and contains these data is TRS-Real-data-with-NDs.xls.
ProUCL.5.2.TRS.df data(ProUCL.5.2.TRS.df)
A data frame with 55 observations on the following 3 variables.
Value
numeric vector indicating the concentration.
Detect
numeric vector of 0s (nondetects) and 1s (detects) indicating censoring status.
Censored
logical vector indicating censoring status.
USEPA. (2022a). ProUCL Version 5.2.0 Technical Guide: Statistical Software for Environmental Applications for Data Sets with and without Nondetect Observations. Prepared by: Neptune and Company, Inc., 1435 Garrison Street, Suite 201, Lakewood, CO 80215. p. 143. https://www.epa.gov/land-research/proucl-software.
USEPA. (2022b). ProUCL Version 5.2.0 User Guide: Statistical Software for Environmental Applications for Data Sets with and without Nondetect Observations. Prepared by: Neptune and Company, Inc., 1435 Garrison Street, Suite 201, Lakewood, CO 80215. p. 6-115. https://www.epa.gov/land-research/proucl-software.
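A quick tabulation of the censoring indicator (my own sketch, not from the original documentation) summarizes the detects and nondetects described above:

data(ProUCL.5.2.TRS.df)
with(ProUCL.5.2.TRS.df, table(Censored))   # counts of detects (FALSE) and nondetects (TRUE)
with(ProUCL.5.2.TRS.df, mean(Censored))    # proportion of nondetects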
Critical Values for the Anderson-Darling Goodness-of-Fit Test for a Gamma Distribution, as presented in Tables A-1, A-3, and A-5 on pages 283, 285, and 287, respectively, of USEPA (2015).
data("ProUCL.Crit.Vals.for.AD.Test.for.Gamma.array")
data("ProUCL.Crit.Vals.for.AD.Test.for.Gamma.array")
An array of dimensions 32 by 11 by 3, with the first dimension indicating the sample size (between 5 and 1000), the second dimension indicating the value of the maximum likelihood estimate of the shape parameter (between 0.025 and 50), and the third dimension indicating the assumed significance level (0.01, 0.05, and 0.10).
See USEPA (2015, pp.281-282) and the help file for gofTest
for more information. The data in this array are used when
the function gofTest
is called with test="proucl.ad.gamma"
.
The letter k is used to indicate the value of the estimated shape parameter.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 283, 285, and 287.
USEPA. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
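A sketch of the kind of call in which these critical values are consulted (my own example; it assumes the gamma distribution is also specified through the distribution argument of gofTest):

set.seed(74)
y <- rgamma(30, shape = 2, scale = 5)
gofTest(y, distribution = "gamma", test = "proucl.ad.gamma")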
Critical Values for the Kolmogorov-Smirnov Goodness-of-Fit Test for a Gamma Distribution, as presented in Tables A-2, A-4, and A-6 on pages 284, 286, and 288, respectively, of USEPA (2015).
data("ProUCL.Crit.Vals.for.KS.Test.for.Gamma.array")
data("ProUCL.Crit.Vals.for.KS.Test.for.Gamma.array")
An array of dimensions 32 by 11 by 3, with the first dimension indicating the sample size (between 5 and 1000), the second dimension indicating the value of the maximum likelihood estimate of the shape parameter (between 0.025 and 50), and the third dimension indicating the assumed significance level (0.01, 0.05, and 0.10).
See USEPA (2015, pp.281-282) for more information. The data in this array are used when
the function gofTest
is called with test="proucl.ks.gamma"
.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 284, 286, and 288.
USEPA. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Estimate the (1, j, k)'th probability-weighted moment from a random sample, where either j = 0, k = 0, or both.
pwMoment(x, j = 0, k = 0, method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), na.rm = FALSE)
x |
numeric vector of observations. |
j , k
|
non-negative integers specifying the order of the moment. |
method |
character string specifying what method to use to compute the
probability-weighted moment. The possible values are |
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for the
plotting positions when |
na.rm |
logical scalar indicating whether to remove missing values from |
The definition of a probability-weighted moment, introduced by Greenwood et al. (1979), is as follows. Let X denote a random variable with cdf F, and let x(p) denote the p'th quantile of the distribution. Then the (i, j, k)'th probability-weighted moment is given by:

M(i, j, k) = E[ X^i {F(X)}^j {1 - F(X)}^k ]

where i, j, and k are real numbers. Note that if i is a nonnegative integer, then M(i, 0, 0) is the conventional i'th moment about the origin.

Greenwood et al. (1979) state that in the special case where i, j, and k are nonnegative integers, M(i, j, k) can be written in terms of the beta function B(j + 1, k + 1) and the i'th moment about the origin of the (j + 1)'th order statistic for a sample of size j + k + 1. In particular,

M(1, 0, 0) = E[X]
M(1, 0, k) = E[X(1, k+1)] / (k + 1)
M(1, j, 0) = E[X(j+1, j+1)] / (j + 1)

where E[X(1, k+1)] denotes the expected value of the first order statistic (i.e., the minimum) in a sample of size k + 1, and E[X(j+1, j+1)] denotes the expected value of the (j + 1)'th order statistic (i.e., the maximum) in a sample of size j + 1.
Unbiased Estimators (method="unbiased")

Landwehr et al. (1979) show that, given a random sample of n values from some arbitrary distribution, an unbiased, distribution-free, and parameter-free estimator of M(1, 0, k) is a weighted average of the ordered sample values, with weights given by ratios of binomial coefficients that depend on n and k. Hosking et al. (1985) note that this estimator is closely related to U-statistics (Hoeffding, 1948; Lehmann, 1975, pp. 362-371), and that an unbiased, distribution-free, and parameter-free estimator of M(1, j, 0) has the same form, with weights depending on n and j.

Plotting-Position Estimators (method="plotting.position")

Hosking et al. (1985) propose alternative estimators of M(1, j, 0) and M(1, 0, k) based on plotting positions:

estimate of M(1, j, 0) = (1/n) * sum over i of [ p(i)^j * x(i) ]
estimate of M(1, 0, k) = (1/n) * sum over i of [ (1 - p(i))^k * x(i) ]

where x(i) denotes the i'th order statistic in the random sample of size n, and p(i) denotes the plotting position of the i'th order statistic, that is, a distribution-free estimate of the cdf of X evaluated at the i'th order statistic. Typically, plotting positions have the form:

p(i) = (i - a) / (n + b)

where a and b are constants (given here by the argument plot.pos.cons). For this form of plotting position, the plotting-position estimators are asymptotically equivalent to the U-statistic estimators.
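A small numerical sketch (mine, not from the original examples) computes the plotting-position estimate of M(1, 1, 0) by hand using the default constants a = 0.35 and b = 0, and compares it with pwMoment():

set.seed(46)
x <- rlnorm(15, meanlog = 1, sdlog = 0.5)
n <- length(x)
p <- ((1:n) - 0.35) / n           # plotting positions with a = 0.35, b = 0
mean(p * sort(x))                 # hand computation of the estimate of M(1, 1, 0)
pwMoment(x, j = 1, method = "plotting.position")   # should give the same value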
A numeric scalar: the value of the (1, j, k)'th probability-weighted moment as defined by Greenwood et al. (1979).
Greenwood et al. (1979) introduced the concept of probability-weighted moments
as a tool to derive estimates of distribution parameters for distributions that
can be (perhaps only be) expressed in inverse form. The term “inverse form”
simply means that instead of characterizing the distribution by the formula for
its cumulative distribution function (cdf), the distribution is characterized by
the formula for the p'th quantile x(p) (0 ≤ p ≤ 1).
For distributions that can only be expressed in inverse form, moment estimates of their parameters are not available, and maximum likelihood estimates are not easy to compute. Greenwood et al. (1979) show that in these cases, it is often possible to derive expressions for the distribution parameters in terms of probability-weighted moments. Thus, for these cases the distribution parameters can be estimated based on the sample probability-weighted moments, which are fairly easy to compute. Furthermore, for distributions whose parameters can be expressed as functions of conventional moments, the method of probability-weighted moments provides an alternative to method of moments and maximum likelihood estimators.
Landwehr et al. (1979) use the method of probability-weighted moments to estimate the parameters of the Type I Extreme Value (Gumbel) distribution.
Hosking et al. (1985) use the method of probability-weighted moments to estimate the parameters of the generalized extreme value distribution.
Hosking (1990) and Hosking and Wallis (1995) show the relationship between probability-weighted moments and L-moments.
Hosking and Wallis (1995) recommend using the unbiased estimators of probability-weighted moments for almost all applications.
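For instance (my own sketch, using the standard identities lambda_1 = M(1, 0, 0) and lambda_2 = 2 M(1, 1, 0) - M(1, 0, 0)), the first two sample L-moments can be recovered from the unbiased sample probability-weighted moments and compared with the EnvStats function lMoment():

set.seed(47)
x <- rgamma(25, shape = 2, scale = 3)
b0 <- pwMoment(x, j = 0)
b1 <- pwMoment(x, j = 1)
c(lambda1 = b0, lambda2 = 2 * b1 - b0)
c(lMoment(x, r = 1), lMoment(x, r = 2))   # should match the values above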
Steven P. Millard ([email protected])
Greenwood, J.A., J.M. Landwehr, N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments: Definition and Relation to Parameters of Several Distributions Expressible in Inverse Form. Water Resources Research 15(5), 1049–1054.
Hoeffding, W. (1948). A Class of Statistics with Asymptotically Normal Distribution. Annals of Mathematical Statistics 19, 293–325.
Hosking, J.R.M. (1990). L-Moments: Analysis and Estimation of Distributions Using Linear Combinations of Order Statistics. Journal of the Royal Statistical Society, Series B 52(1), 105–124.
Hosking, J.R.M., and J.R. Wallis (1995). A Comparison of Unbiased and Plotting-Position Estimators of L Moments. Water Resources Research 31(8), 2019–2025.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Landwehr, J.M., N.C. Matalas, and J.R. Wallis. (1979). Probability Weighted Moments Compared With Some Traditional Techniques in Estimating Gumbel Parameters and Quantiles. Water Resources Research 15(5), 1055–1064.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, pp.362-371.
# Generate 20 observations from a generalized extreme value distribution # with parameters location=10, scale=2, and shape=.25, then compute the # 0'th, 1'st and 2'nd probability-weighted moments. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgevd(20, location = 10, scale = 2, shape = 0.25) pwMoment(dat) #[1] 10.59556 pwMoment(dat, 1) #[1] 5.798481 pwMoment(dat, 2) #[1] 4.060574 pwMoment(dat, k = 1) #[1] 4.797081 pwMoment(dat, k = 2) #[1] 3.059173 pwMoment(dat, 1, method = "plotting.position") # [1] 5.852913 pwMoment(dat, 1, method = "plotting.position", plot.pos = c(.325, 1)) #[1] 5.586817 #---------- # Clean Up #--------- rm(dat)
Produces a quantile-quantile (Q-Q) plot, also called a probability plot.
The qqPlot
function is a modified version of the R functions
qqnorm
and qqplot
.
The EnvStats function qqPlot
allows the user to specify a number of
different distributions in addition to the normal distribution, and to optionally
estimate the distribution parameters of the fitted distribution.
qqPlot(x, y = NULL, distribution = "norm", param.list = list(mean = 0, sd = 1), estimate.params = plot.type == "Tukey Mean-Difference Q-Q", est.arg.list = NULL, plot.type = "Q-Q", plot.pos.con = NULL, plot.it = TRUE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = FALSE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, ..., main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations. When |
y |
optional numeric vector of observations (not necessarily the same length as |
distribution |
when |
param.list |
when |
estimate.params |
when |
est.arg.list |
when |
plot.type |
a character string denoting the kind of plot. Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value of |
plot.it |
a logical scalar indicating whether to create a plot on the current graphics device.
The default value is |
equal.axes |
a logical scalar indicating whether to use the same range on the |
add.line |
a logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the Q-Q plot. Possible values are
|
duplicate.points.method |
a character string denoting how to plot points with duplicate |
points.col |
a numeric scalar or character string determining the color of the points in the plot.
The default value is |
line.col |
a numeric scalar or character string determining the color of the line in the plot.
The default value is |
line.lwd |
a numeric scalar determining the width of the line in the plot. The default value is
|
line.lty |
a numeric scalar determining the line type of the line in the plot. The default value is
|
digits |
a scalar indicating how many significant digits to print for the distribution parameters.
The default value is |
main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
If y
is not supplied, the vector x
is assumed to be a sample from the probability
distribution specified by the argument distribution
(and param.list
if
estimate.params=FALSE
). When plot.type="Q-Q"
, the quantiles of x
are
plotted on the y-axis against the quantiles of the assumed distribution on the x-axis.
If y
is supplied and plot.type="Q-Q"
, the empirical quantiles of y
are
plotted against the empirical quantiles of x
.
When plot.type="Tukey Mean-Difference Q-Q"
, the difference of the quantiles is plotted on
the y-axis against the mean of the quantiles on the x-axis.
Special Distributions
When y
is not supplied and the argument distribution
specifies one of the
following distributions, the function qqPlot
behaves in the manner described below.
"lnorm"
Lognormal Distribution. The log-transformed quantiles are plotted against quantiles from a Normal (Gaussian) distribution.
"lnormAlt"
Lognormal Distribution (alternative parameterization). The untransformed quantiles are plotted against quantiles from a Lognormal distribution.
"lnorm3"
Three-Parameter Lognormal Distribution. The quantiles of
log(x-threshold)
are plotted against quantiles from a Normal (Gaussian) distribution.
The value of threshold
is either specified in the argument param.list
, or,
if estimate.params=TRUE
, then it is estimated.
"zmnorm"
Zero-Modified Normal Distribution. The quantiles of the
non-zero values (i.e., x[x!=0]
) are plotted against quantiles from a Normal
(Gaussian) distribution.
"zmlnorm"
Zero-Modified Lognormal Distribution. The quantiles of the
log-transformed positive values (i.e., log(x[x>0])
) are plotted against quantiles
from a Normal (Gaussian) distribution.
"zmlnormAlt"
Lognormal Distribution (alternative parameterization).
The quantiles of the untransformed positive values (i.e., x[x>0]
) are
plotted against quantiles from a Lognormal distribution.
Explanation of Q-Q Plots
A probability plot or quantile-quantile (Q-Q) plot
is a graphical display invented by Wilk and Gnanadesikan (1968) to compare a
data set to a particular probability distribution or to compare it to another
data set. The idea is that if two population distributions are exactly the same,
then they have the same quantiles (percentiles), so a plot of the quantiles for
the first distribution vs. the quantiles for the second distribution will fall
on the 0-1 line (i.e., the straight line with intercept 0 and slope 1).
If the two distributions have the same shape and spread but different locations, then the plot of the quantiles will fall on the line y = x + c (parallel to the 0-1 line), where c denotes the difference in locations. If the distributions have different locations and also differ by a multiplicative constant b, then the plot of the quantiles will fall on the line y = b*x + c (D'Agostino, 1986a, p. 25; Helsel and Hirsch, 1986, p. 42).
Various kinds of differences between distributions will yield various kinds of
deviations from a straight line.
Comparing Observations to a Hypothesized Distribution
Let x_1, x_2, ..., x_n denote the observations in a random sample of size n from some unknown distribution with cumulative distribution function F, and let x_(1), x_(2), ..., x_(n) denote the ordered observations. Depending on the particular formula used for the empirical cdf (see ecdfPlot), the i'th order statistic is an estimate of the i/(n+1)'th, (i - 0.5)/n'th, etc., quantile. For the moment, assume the i'th order statistic is an estimate of the i/(n+1)'th quantile, that is:

F[x_(i)] ≈ p_i = i/(n + 1)    (1)

so

x_(i) ≈ F^-1(p_i)    (2)

If we knew the form of the true cdf F, then the plot of x_(i) vs. F^-1(p_i) would form approximately a straight line based on Equation (2) above. A probability plot is a plot of x_(i) vs. F0^-1(p_i), where F0 denotes the cdf associated with the hypothesized distribution. The probability plot should fall roughly on the line y = x if F = F0. If F and F0 merely differ by a shift in location and scale, that is, if the cdf of the data is F0((x - m)/s) for some constants m and s > 0, then the plot should fall roughly on the line y = m + s*x.

The quantity p_i = i/(n + 1) in Equation (1) above is called the plotting position for the probability plot. This particular formula for the plotting position is appealing because it can be shown that for any continuous distribution

E{F[x_(i)]} = i/(n + 1)    (3)

(Nelson, 1982, pp. 299-300; Stedinger et al., 1993). That is, the i'th plotting position defined as in Equation (1) is the expected value of the true cdf evaluated at the i'th order statistic. Many authors and practitioners, however, prefer to use a plotting position that satisfies:

F^-1(p_i) = E[x_(i)]    (4)

or one that satisfies

F^-1(p_i) = Median[x_(i)] = F^-1(Median[u_(i)])    (5)

where Median[x_(i)] denotes the median of the distribution of the i'th order statistic, and u_(i) denotes the i'th order statistic in a random sample of n uniform (0,1) random variates.

The plotting positions in Equation (4) are often approximated since the expected value of the i'th order statistic is often difficult and time-consuming to compute. Note that these plotting positions will differ for different distributions.

The plotting positions in Equation (5) were recommended by Filliben (1975) because they require computing or approximating only the medians of uniform (0,1) order statistics, no matter what the form of the assumed cdf F. Also, the median may be preferred as a measure of central tendency because the distributions of most order statistics are skewed.

Most plotting positions can be written as:

p_i = (i - a) / (n + 1 - 2a)    (6)

where 0 ≤ a ≤ 1 (D'Agostino, 1986a, p.25; Stedinger et al., 1993). The quantity a is sometimes called the “plotting position constant”, and is determined by the argument plot.pos.con in the function qqPlot.
The table below, adapted from Stedinger et al. (1993), displays commonly used plotting positions based on equation (6) for several distributions.

Name | a | Distribution Often Used With | References
---|---|---|---
Weibull | 0 | Weibull, Uniform | Weibull (1939), Stedinger et al. (1993)
Median | 0.3175 | Several | Filliben (1975), Vogel (1986)
Blom | 0.375 | Normal and Others | Blom (1958), Looney and Gulledge (1985)
Cunnane | 0.4 | Several | Cunnane (1978), Chowdhury et al. (1991)
Gringorten | 0.44 | Gumbel | Gringorten (1963), Vogel (1986)
Hazen | 0.5 | Several | Hazen (1914), Chambers et al. (1983), Cleveland (1993)
For moderate and large sample sizes, there is very little difference in
visual appearance of the Q-Q plot for different choices of plotting positions.
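As a quick illustration (mine, not from the original help file), the plotting positions in Equation (6) can be computed directly, and the base R helper ppoints() uses the same form:

n <- 10
a <- 0.375                        # Blom's constant
((1:n) - a) / (n + 1 - 2 * a)     # plotting positions from Equation (6)
ppoints(n, a = 0.375)             # same values from base R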
Comparing Two Data Sets
Let x_1, x_2, ..., x_n denote the observations in a random sample of size n from some unknown distribution with cumulative distribution function F, and let x_(1), x_(2), ..., x_(n) denote the ordered observations. Similarly, let y_1, y_2, ..., y_m denote the observations in a random sample of size m from some unknown distribution with cumulative distribution function G, and let y_(1), y_(2), ..., y_(m) denote the ordered observations. Suppose we are interested in investigating whether the shape of the distribution with cdf F is the same as the shape of the distribution with cdf G (e.g., F and G may both be normal distributions but differ in mean and standard deviation).

When n = m, we can visually explore this question by plotting y_(i) vs. x_(i), for i = 1, 2, ..., n. The values in the x sample are spread out in a certain way depending on the true distribution: they may be more or less symmetric about some value (the population mean or median) or they may be skewed to the right or left; they may be concentrated close to the mean or median (platykurtic) or there may be several observations “far away” from the mean or median on either side (leptokurtic). Similarly, the values in the y sample are spread out in a certain way. If the values in the two samples are spread out in the same way, then the plot of y_(i) vs. x_(i) will be approximately a straight line. If the cdf F is exactly the same as the cdf G, then the plot of y_(i) vs. x_(i) will fall roughly on the straight line y = x. If F and G differ by a shift in location and scale, then the plot will fall roughly on a straight line whose slope and intercept reflect the scale and location shift.

When m < n, a slight adjustment has to be made to produce the plot. Let p_1, p_2, ..., p_n denote the plotting positions corresponding to the n empirical quantiles for the x's, and let q_1, q_2, ..., q_m denote the plotting positions corresponding to the m empirical quantiles for the y's. Then we plot y_(j) vs. x*_(j) for j = 1, 2, ..., m, where each value x*_(j) is determined by linearly interpolating between the ordered x values, based on the values of the plotting positions for the x's and the y's. A similar adjustment is made when n < m.

Note that the R function qqplot uses a different interpolation method: it interpolates based on 1:n and m by calling the approx function.
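A short sketch of a two-sample Q-Q plot (my own, using the EPA.94b.tccb.df data set that also appears in the Examples below): the empirical quantiles of the Cleanup area TcCB concentrations are plotted against those of the Reference area.

with(EPA.94b.tccb.df,
    qqPlot(x = TcCB[Area == "Reference"], y = TcCB[Area == "Cleanup"],
        add.line = TRUE, qq.line.type = "0-1",
        xlab = "Reference Area TcCB (ppb)",
        ylab = "Cleanup Area TcCB (ppb)"))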
qqPlot
returns a list with components x
and y
, giving the
coordinates of the points that have been or would have been plotted. There are four cases to
consider:
1. The argument y
is not supplied and plot.type="Q-Q"
.
x |
the quantiles from the theoretical distribution. |
y |
the observed quantiles (order statistics) based on the data in the argument |
2. The argument y
is not supplied and plot.type="Tukey Mean-Difference Q-Q"
.
x |
the averages of the observed and theoretical quantiles. |
y |
the differences between the observed quantiles (order statistics) and the theoretical quantiles. |
3. The argument y
is supplied and plot.type="Q-Q"
.
x |
the observed quantiles based on the data in the argument |
y |
the observed quantiles based on the data in the argument |
4. The argument y
is supplied and plot.type="Tukey Mean-Difference Q-Q"
.
x |
the averages of the quantiles based on the argument |
y |
the differences between the quantiles based on the argument |
A quantile-quantile (Q-Q) plot, also called a probability plot, is a plot of the observed
order statistics from a random sample (the empirical quantiles) against their (estimated)
mean or median values based on an assumed distribution, or against the empirical quantiles
of another set of data (Wilk and Gnanadesikan, 1968). Q-Q plots are used to assess whether
data come from a particular distribution, or whether two datasets have the same parent
distribution. If the distributions have the same shape (but not necessarily the same
location or scale parameters), then the plot will fall roughly on a straight line. If the
distributions are exactly the same, then the plot will fall roughly on the straight line y = x.
A Tukey mean-difference Q-Q plot, also called an m-d plot, is a modification of a
Q-Q plot. Rather than plotting observed quantiles vs. theoretical quantiles, or observed y-quantiles vs. observed x-quantiles, a Tukey mean-difference Q-Q plot plots the difference between the quantiles on the y-axis vs. the average of the quantiles on the x-axis (Cleveland, 1993, pp.22-23). If the two sets of quantiles come from the same parent distribution, then the points in this plot should fall roughly along the horizontal line y = 0. If one set of quantiles comes from the same distribution with a shift in median, then the points in this plot should fall along a horizontal line above or below the line y = 0.
A Tukey mean-difference Q-Q plot enhances our perception of how the points in the Q-Q plot deviate
from a straight line, because it is easier to judge deviations from a horizontal line than from a
line with a non-zero slope.
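For example (a sketch of mine, not from the original examples), the mean-difference version can be requested directly through the plot.type argument:

set.seed(21)
z <- rnorm(30, mean = 5, sd = 2)
qqPlot(z, plot.type = "Tukey Mean-Difference Q-Q", add.line = TRUE)
# Points scattered around a horizontal line at 0 are consistent with the
# assumed normal distribution.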
In a Q-Q plot, the extreme points have more variability than points toward the center. A U-shaped
Q-Q plot indicates that the underlying distribution for the observations on the y-axis is skewed to the right relative to the underlying distribution for the observations on the x-axis. An upside-down-U-shaped Q-Q plot indicates the y-axis distribution is skewed left relative to the x-axis distribution. An S-shaped Q-Q plot indicates the y-axis distribution has shorter tails than the x-axis distribution. Conversely, a plot that is bent down on the left and bent up on the right indicates that the y-axis distribution has longer tails than the x-axis distribution.
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
ppoints
, ecdfPlot
, Distribution.df
,
qqPlotGestalt
, qqPlotCensored
, qqnorm
.
# The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are stored # in the data frame EPA.94b.tccb.df. # # Create a Q-Q plot for the reference area data first assuming a # normal distribution, then a lognormal distribution, then a # gamma distribution. # Assume a normal distribution #----------------------------- dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"])) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], plot.type = "Tukey", add.line = TRUE)) # The Q-Q plot based on assuming a normal distribution shows a U-shape, # indicating the Reference area TcCB data are skewed to the right # compared to a normal distribution. # Assume a lognormal distribution #-------------------------------- dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "lnorm", digits = 2, points.col = "blue", add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "lnorm", digits = 2, plot.type = "Tukey", points.col = "blue", add.line = TRUE)) # Alternative parameterization dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "lnormAlt", estimate.params = TRUE, digits = 2, points.col = "blue", add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "lnormAlt", digits = 2, plot.type = "Tukey", points.col = "blue", add.line = TRUE)) # The lognormal distribution appears to be an adequate fit. # Now look at a Q-Q plot assuming a gamma distribution. #---------------------------------------------------------- dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "gamma", estimate.params = TRUE, digits = 2, points.col = "blue", add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "gamma", digits = 2, plot.type = "Tukey", points.col = "blue", add.line = TRUE)) # Alternative Parameterization dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "gammaAlt", estimate.params = TRUE, digits = 2, points.col = "blue", add.line = TRUE)) dev.new() with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], dist = "gammaAlt", digits = 2, plot.type = "Tukey", points.col = "blue", add.line = TRUE)) #------------------------------------------------------------------------------------- # Generate 20 observations from a gamma distribution with parameters # shape=2 and scale=2, then create a normal (Gaussian) Q-Q plot for these data. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(357) dat <- rgamma(20, shape=2, scale=2) dev.new() qqPlot(dat, add.line = TRUE) # Now assume a gamma distribution and estimate the parameters #------------------------------------------------------------ dev.new() qqPlot(dat, dist = "gamma", estimate.params = TRUE, add.line = TRUE) # Clean up #--------- rm(dat) graphics.off()
Produces a quantile-quantile (Q-Q) plot, also called a probability plot, for Type I censored data.
qqPlotCensored(x, censored, censoring.side = "left", prob.method = "michael-schucany", plot.pos.con = NULL, distribution = "norm", param.list = list(mean = 0, sd = 1), estimate.params = plot.type == "Tukey Mean-Difference Q-Q", est.arg.list = NULL, plot.type = "Q-Q", plot.it = TRUE, equal.axes = qq.line.type == "0-1" || estimate.params, add.line = FALSE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, include.cen = FALSE, cen.pch = ifelse(censoring.side == "left", 6, 2), cen.cex = par("cex"), cen.col = 4, ..., main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x |
numeric vector of observations that is assumed to represent a sample from the hypothesized
distribution specified by |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
prob.method |
character string indicating what method to use to compute the plotting positions
(empirical probabilities). Possible values include "michael-schucany" (the default), "kaplan-meier", "modified kaplan-meier", and "hirsch-stedinger"; see the help file for ppointsCensored for details. |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is |
distribution |
a character string denoting the distribution abbreviation. The default value is
|
param.list |
a list with values for the parameters of the distribution. The default value is
|
estimate.params |
a logical scalar indicating whether to compute quantiles based on estimating the distribution
parameters ( You can set |
est.arg.list |
a list whose components are optional arguments associated with the function used to estimate
the parameters of the assumed distribution (see the section Estimating Distribution Parameters
in the help file EnvStats Functions for Censored Data).
For example, the function |
plot.type |
a character string denoting the kind of plot. Possible values are |
plot.it |
a logical scalar indicating whether to create a plot on the current graphics device.
The default value is |
equal.axes |
a logical scalar indicating whether to use the same range on the |
add.line |
a logical scalar indicating whether to add a line to the plot. If |
qq.line.type |
character string determining what kind of line to add to the Q-Q plot. Possible values are
|
duplicate.points.method |
a character string denoting how to plot points with duplicate |
points.col |
a numeric scalar or character string determining the color of the points in the plot.
The default value is |
line.col |
a numeric scalar or character string determining the color of the line in the plot.
The default value is |
line.lwd |
a numeric scalar determining the width of the line in the plot. The default value is
|
line.lty |
a numeric scalar determining the line type of the line in the plot. The default value is
|
digits |
a scalar indicating how many significant digits to print for the distribution parameters.
The default value is |
include.cen |
logical scalar indicating whether to include censored values in the plot. The default value is
|
cen.pch |
numeric scalar or character string indicating the plotting character to use to plot censored values.
The default value is |
cen.cex |
numeric scalar that determines the size of the plotting character used to plot censored values.
The default value is the current value of the cex graphics parameter. See the entry for |
cen.col |
numeric scalar or character string that determines the color of the plotting character used to
plot censored values. The default value is |
main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The function qqPlotCensored
does exactly the same thing as qqPlot
(when the argument y
is not supplied to qqPlot
), except
qqPlotCensored
calls the function ppointsCensored
to compute the
plotting positions (estimated cumulative probabilities).
The vector x
is assumed to be a sample from the probability distribution specified
by the argument distribution
(and param.list
if estimate.params=FALSE
).
When plot.type="Q-Q"
, the quantiles of x
are plotted on the y-axis against the quantiles of the assumed distribution on the x-axis.
When plot.type="Tukey Mean-Difference Q-Q"
, the difference of the quantiles is plotted on
the y-axis against the mean of the quantiles on the x-axis.
When prob.method="kaplan-meier"
and censoring.side="left"
and the assumed
distribution has a maximum support of infinity (Inf
; e.g., the normal or lognormal
distribution), the point involving the largest
value of x
is not plotted because it corresponds to an estimated cumulative probability
of 1, which corresponds to an infinite plotting position.
When prob.method="modified kaplan-meier"
and censoring.side="left"
, the
estimated cumulative probability associated with the maximum value is modified from 1
to (n - 0.375)/(n + 0.25), where n denotes the sample size (i.e., the Blom
plotting position for the largest order statistic), so that the point associated with the maximum value can be displayed.
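For example, with the n = 25 manganese observations used in the examples below, the modified plotting position for the maximum would be (a quick check, assuming the Blom position stated above):
n <- 25
(n - 0.375) / (n + 0.25)  # modified plotting position for the largest value
#[1] 0.9752475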
qqPlotCensored
returns a list with the following components:
x |
numeric vector of |
y |
numeric vector of |
Order.Statistics |
numeric vector of the “ordered” observations.
When |
Cumulative.Probabilities |
numeric vector of the plotting positions associated with the order statistics. |
Censored |
logical vector indicating which of the ordered observations are censored. |
Censoring.Side |
character string indicating whether the data are left- or right-censored.
This is the same value as the argument |
Prob.Method |
character string indicating what method was used to compute the plotting positions.
This is the same value as the argument |
Optional Component (only present when prob.method="michael-schucany"
or prob.method="hirsch-stedinger"
):
Plot.Pos.Con |
numeric scalar containing the value of the plotting position constant that was used.
This is the same as the argument |
A quantile-quantile (Q-Q) plot, also called a probability plot, is a plot of the observed
order statistics from a random sample (the empirical quantiles) against their (estimated)
mean or median values based on an assumed distribution, or against the empirical quantiles
of another set of data (Wilk and Gnanadesikan, 1968). Q-Q plots are used to assess whether
data come from a particular distribution, or whether two datasets have the same parent
distribution. If the distributions have the same shape (but not necessarily the same
location or scale parameters), then the plot will fall roughly on a straight line. If the
distributions are exactly the same, then the plot will fall roughly on the straight line y = x.
A Tukey mean-difference Q-Q plot, also called an m-d plot, is a modification of a
Q-Q plot. Rather than plotting observed quantiles vs. theoretical quantiles or observed
x-quantiles vs. observed y-quantiles, a Tukey mean-difference Q-Q plot plots
the difference between the quantiles on the y-axis vs. the average of the quantiles on
the x-axis (Cleveland, 1993, pp.22-23). If the two sets of quantiles come from the same
parent distribution, then the points in this plot should fall roughly along the horizontal line
y = 0. If one set of quantiles comes from the same distribution with a shift in median, then
the points in this plot should fall along a horizontal line above or below the line
y = 0.
A Tukey mean-difference Q-Q plot enhances our perception of how the points in the Q-Q plot deviate
from a straight line, because it is easier to judge deviations from a horizontal line than from a
line with a non-zero slope.
In a Q-Q plot, the extreme points have more variability than points toward the center. A U-shaped
Q-Q plot indicates that the underlying distribution for the observations on the y-axis is
skewed to the right relative to the underlying distribution for the observations on the x-axis.
An upside-down-U-shaped Q-Q plot indicates the y-axis distribution is skewed left relative to
the x-axis distribution. An S-shaped Q-Q plot indicates the y-axis distribution has
shorter tails than the x-axis distribution. Conversely, a plot that is bent down on the
left and bent up on the right indicates that the y-axis distribution has longer tails than
the x-axis distribution.
Censored observations complicate the procedures used to graphically explore data. Techniques from
survival analysis and life testing have been developed to generalize the procedures for
constructing plotting positions, empirical cdf plots, and Q-Q plots to data sets with censored
observations (see ppointsCensored
).
Steven P. Millard ([email protected])
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley and Sons, New York.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
ppointsCensored
, EnvStats Functions for Censored Data,
qqPlot
, ecdfPlotCensored
, qqPlotGestalt
.
# Generate 20 observations from a normal distribution with mean=20 and sd=5, # censor all observations less than 18, then generate a Q-Q plot assuming # a normal distribution for the complete data set and the censored data set. # Note that the Q-Q plot for the censored data set starts at the first ordered # uncensored observation, and that for values of x > 18 the two Q-Q plots are # exactly the same. This is because there is only one censoring level and # no uncensored observations fall below the censored observations. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(333) x <- rnorm(20, mean=20, sd=5) censored <- x < 18 sum(censored) #[1] 7 new.x <- x new.x[censored] <- 18 dev.new() qqPlot(x, ylim = range(pretty(x)), main = "Q-Q Plot for\nComplete Data Set") dev.new() qqPlotCensored(new.x, censored, ylim = range(pretty(x)), main="Q-Q Plot for\nCensored Data Set") # Clean up #--------- rm(x, censored, new.x) #------------------------------------------------------------------------------------ # Example 15-1 of USEPA (2009, page 15-10) gives an example of # computing plotting positions based on censored manganese # concentrations (ppb) in groundwater collected at 5 monitoring # wells. The data for this example are stored in # EPA.09.Ex.15.1.manganese.df. Here we will create a Q-Q # plot based on the Kaplan-Meier method. First we'll assume # a normal distribution, then a lognormal distribution, then a # gamma distribution. EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored #1 1 Well.1 <5 5.0 TRUE #2 2 Well.1 12.1 12.1 FALSE #3 3 Well.1 16.9 16.9 FALSE #4 4 Well.1 21.6 21.6 FALSE #5 5 Well.1 <2 2.0 TRUE #... #21 1 Well.5 17.9 17.9 FALSE #22 2 Well.5 22.7 22.7 FALSE #23 3 Well.5 3.3 3.3 FALSE #24 4 Well.5 8.4 8.4 FALSE #25 5 Well.5 <2 2.0 TRUE # Assume normal distribution #--------------------------- dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Normal Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", sep = "\n"))) # Include max value in the plot #------------------------------ dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, prob.method = "modified kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Normal Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", "(Max Included)", sep = "\n"))) # Assume lognormal distribution #------------------------------ dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, dist = "lnorm", prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Lognormal Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", sep = "\n"))) # Include max value in the plot #------------------------------ dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, dist = "lnorm", prob.method = "modified kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Lognormal Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", "(Max Included)", sep = "\n"))) # The lognormal distribution appears to be a better fit. # Now create a Q-Q plot assuming a gamma distribution. Here we'll # need to set estimate.params=TRUE. 
dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, dist = "gamma", estimate.params = TRUE, prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Gamma Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", sep = "\n"))) # Include max value in the plot #------------------------------ dev.new() with(EPA.09.Ex.15.1.manganese.df, qqPlotCensored(Manganese.ppb, Censored, dist = "gamma", estimate.params = TRUE, prob.method = "modified kaplan-meier", points.col = "blue", add.line = TRUE, main = paste("Gamma Q-Q Plot of Manganese Data", "Based on Kaplan-Meier Plotting Positions", "(Max Included)", sep = "\n"))) #========== # Clean up #--------- graphics.off()
Produce a series of quantile-quantile (Q-Q) plots (also called probability plots) or Tukey mean-difference Q-Q plots for a user-specified distribution.
qqPlotGestalt(distribution = "norm", param.list = list(mean = 0, sd = 1), estimate.params = FALSE, est.arg.list = NULL, sample.size = 10, num.pages = 2, num.plots.per.page = 4, nrow = ceiling(num.plots.per.page/2), plot.type = "Q-Q", plot.pos.con = switch(dist.abb, norm = , lnorm = , lnormAlt = , lnorm3 = 0.375, evd = 0.44, 0.4), equal.axes = (qq.line.type == "0-1" || estimate.params), margin.title = NULL, add.line = FALSE, qq.line.type = "least squares", duplicate.points.method = "standard", points.col = 1, line.col = 1, line.lwd = par("cex"), line.lty = 1, digits = .Options$digits, same.window = TRUE, ask = same.window & num.pages > 1, mfrow = c(nrow, num.plots.per.page/nrow), mar = c(4, 4, 1, 1) + 0.1, oma = c(0, 0, 7, 0), mgp = c(2, 0.5, 0), ..., main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
distribution |
a character string denoting the distribution abbreviation. The default value is
|
param.list |
a list with values for the parameters of the distribution. The default value is
|
estimate.params |
a logical scalar indicating whether to compute quantiles based on estimating the
distribution parameters ( |
est.arg.list |
a list whose components are optional arguments associated with the function used
to estimate the parameters of the assumed distribution (see the help file
Estimating Distribution Parameters).
For example, all functions used to estimate distribution parameters have an optional argument
called |
sample.size |
numeric scalar indicating the number of observations to generate for each Q-Q plot.
The default value is |
num.pages |
numeric scalar indicating the number of pages of plots to generate.
The default value is |
num.plots.per.page |
numeric scalar indicating the number of plots per page.
The default value is |
nrow |
numeric scalar indicating the number of rows of plots on each page.
The default value is the smallest integer greater than or equal to
|
plot.type |
a character string denoting the kind of plot. Possible values are |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value of |
equal.axes |
logical scalar indicating whether to use the same range on the |
margin.title |
character string indicating the title printed in the top margin on each page of plots. The default value indicates the kind of Q-Q plot, the probability distribution, the sample size, and the estimation method used (if any). |
add.line |
logical scalar indicating whether to add a line to the plot.
If |
qq.line.type |
character string determining what kind of line to add to the Q-Q plot.
Possible values are |
duplicate.points.method |
character string denoting how to plot points with duplicate |
points.col |
numeric scalar or character string determining the color of the points in the plot.
The default value is |
line.col |
numeric scalar or character string determining the color of the line in the plot.
The default value is |
line.lwd |
numeric scalar determining the width of the line in the plot. The default value is
|
line.lty |
a numeric scalar determining the line type of the line in the plot. The default value is
|
digits |
a scalar indicating how many significant digits to print for the distribution
parameters. The default value is |
same.window |
logical scalar indicating whether to produce all plots in the same graphics
window ( |
ask |
logical scalar supplied to the function |
mfrow , mar , oma , mgp , main , xlab , ylab , xlim , ylim , ...
|
additional graphical parameters (see |
The function qqPlotGestalt
allows the user to display several Q-Q plots or
Tukey mean-difference Q-Q plots for a specified probability distribution.
The distribution is specified with the arguments distribution
and
param.list
. By default, normal (Gaussian)
Q-Q plots are produced.
If estimate.params=FALSE
(the default), the theoretical quantiles on the
x-axis are computed using the known distribution parameters specified in
param.list
. If estimate.params=TRUE
, the distribution parameters
are estimated based on the sample, and these estimated parameters are then used
to compute the theoretical quantiles. For distributions that can be specified
by a location and scale parameter (e.g., Normal, Logistic, extreme value, etc.),
the value of estimate.params
will not affect the general shape of the
plot, only the values recorded on the x-axis. For distributions that cannot
be specified by a location and scale parameter (e.g., exponential, gamma, etc.), it
is recommended that
estimate.params
be set to TRUE
since in practice
the values of the distribution parameters are not known but must be estimated from
the sample.
The purpose of qqPlotGestalt
is to allow the user to build up a visual
memory of “typical” Q-Q plots. A Q-Q plot is a graphical tool that allows
you to assess how well a particular set of observations fit a particular
probability distribution. The value of this tool depends on the user having an
internal reference set of Q-Q plots with which to compare the current Q-Q plot.
See the help file for qqPlot
for more information.
The NULL
value is returned.
Steven P. Millard ([email protected])
See the REFERENCES section for qqPlot
.
# Look at eight typical normal (Gaussian) Q-Q plots for random samples # of size 10 from a N(0,1) distribution # Are you surprised by the variability in the plots? # # (Note: you must use set.seed if you want to reproduce the exact # same plots more than once.) set.seed(298) qqPlotGestalt(same.window = FALSE) # Add lines to these same Q-Q plots #---------------------------------- set.seed(298) qqPlotGestalt(same.window = FALSE, add.line = TRUE) # Add lines to different Q-Q plots #--------------------------------- qqPlotGestalt(same.window = FALSE, add.line = TRUE) ## Not run: # Look at 4 sets of plots all in the same graphics window #-------------------------------------------------------- qqPlotGestalt(add.line = TRUE, num.pages = 4) ## End(Not run) #========== # Look at Q-Q plots for a gamma distribution #------------------------------------------- qqPlotGestalt(dist = "gammaAlt", param.list = list(mean = 10, cv = 1), estimate.params = TRUE, num.pages = 3, same.window = FALSE, add.line = TRUE) # Look at Tukey Mean Difference Q-Q plots # for a gamma distribution #---------------------------------------- qqPlotGestalt(dist = "gammaAlt", param.list = list(mean = 10, cv = 1), estimate.params = TRUE, num.pages = 3, plot.type = "Tukey", same.window = FALSE, add.line = TRUE) #========== # Clean up #--------- graphics.off()
Two-sample rank test to detect a positive shift in a proportion of one population (here called the “treated” population) compared to another (here called the “reference” population). This test is usually called the quantile test (Johnson et al., 1987).
quantileTest(x, y, alternative = "greater", target.quantile = 0.5, target.r = NULL, exact.p = TRUE)
x |
numeric vector of observations from the “treatment” group.
Missing ( |
y |
numeric vector of observations from the “reference” group.
Missing ( |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
target.quantile |
numeric scalar between 0 and 1 indicating the desired quantile to use as the
lower cut off point for the test. Because of the discrete nature of empirical
quantiles, the upper bound for the possible empirical quantiles will often differ
from the value of |
target.r |
integer indicating the rank of the observation to use as the lower cut off point
for the test. The value of |
exact.p |
logical scalar indicating whether to compute the p-value based on the exact
distribution of the test statistic ( |
Let X denote a random variable representing measurements from a
“treatment” group with cumulative distribution function (cdf) Fx, and let
x1, x2, ..., xm denote m observations from this treatment group. Let
Y denote a random variable from a “reference” group with cdf Fy, and let
y1, y2, ..., yn denote n observations from this reference group.
Consider the null hypothesis
Fx(t) = Fy(t) for all t    (3)
versus the alternative hypothesis
Fx(t) = (1 - e) Fy(t) + e Fz(t)    (4)
where Z denotes some random variable with cdf Fz, 0 < e <= 1,
Fz(t) <= Fy(t) for all values of t,
and Fz(t) != Fy(t) for at least one value of t.
In English, the alternative hypothesis (4) says that a portion of the
distribution for the treatment group (the distribution of X) is shifted to the
right of the distribution for the reference group (the distribution of Y).
The alternative hypothesis (4) with e = 1 is the alternative hypothesis
associated with testing a location shift, for which the
Wilcoxon rank sum test can be used.
Johnson et al. (1987) investigated locally most powerful rank tests for the test of
the null hypothesis (3) against the alternative hypothesis (4). They considered the
case when X and Y
were normal random variables and the case when the
densities of X and Y
assumed only two positive values. For the latter
case, the locally most powerful rank test reduces to the following procedure, which
Johnson et al. (1987) call the quantile test.
Combine the n observations from the reference group and the m
observations from the treatment group and rank them from smallest to largest.
Tied observations receive the average rank of all observations tied at that value.
Choose a quantile q and determine the corresponding value of r, the number of
observations in the combined sample that lie above the q quantile.
Note that because of the discrete nature of ranks, a range of quantiles will
yield the same value for r as the quantile q does.
Alternatively, choose a value of r directly. The bounds on an associated quantile
are then given in Equation (7). Note: the component called parameters in
the list returned by quantileTest contains an element named quantile.ub.
The value of this element is the left-hand side of Equation (7), which equals
(m + n - r + 1)/(m + n + 1).
Set k equal to the number of observations from the treatment group
(the number of x observations) that are among the r largest observations
of the combined sample.
Under the null hypothesis (3), the probability that at least k out of
the r largest observations come from the treatment group follows from the
hypergeometric distribution: the number of treatment observations among the
r largest values of the combined sample is hypergeometric with parameters
m, n, and r, so the exact p-value is the upper tail of that distribution.
This probability may also be approximated using a normal approximation to the
hypergeometric distribution, in which the standard normal cumulative distribution
function is evaluated at a standardized version of k
(USEPA, 1994, pp.7.16-7.17).
(See quantileTestPValue.)
Reject the null hypothesis (3) in favor of the alternative hypothesis (4) at
significance level alpha if the p-value is less than or equal to alpha.
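The exact computation can be checked with a short sketch (an illustration only, not the package's internal code). Under the null hypothesis the number of treatment observations among the r largest values is hypergeometric, so the exact p-value is an upper hypergeometric tail; with the values from the example below (m = 77, n = 47, r = 9, k = 9) this reproduces the reported p-value.
# Probability that at least k of the r largest observations come from the
# treatment group under H0 (upper tail of a hypergeometric distribution).
m <- 77; n <- 47; r <- 9; k <- 9
phyper(k - 1, m, n, r, lower.tail = FALSE)
#[1] 0.01136926
# Compare with the package function:
quantileTestPValue(m = m, n = n, r = r, k = k)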
Johnson et al. (1987) note that their quantile test is asymptotically equivalent
to one proposed by Carrano and Moore (1982) in the context of a two-sided test.
Also, when the chosen quantile is the median (i.e., target.quantile=0.5), the quantile test reduces to Mood's median test for two
groups (see Zar, 2010, p.172; Conover, 1980, pp.171-178).
The optimal choice of q or r
in Step 2 above (i.e., the choice that
yields the largest power) depends on the true underlying distributions of
Y and Z and the mixing proportion e.
Johnson et al. (1987) performed a simulation study and showed that the quantile
test performs better than the Wilcoxon rank sum test and the normal scores test
under the alternative of a mixed normal distribution with a shift of at least
2 standard deviations in the Z distribution. USEPA (1994, pp.7.17-7.21)
shows that when the mixing proportion e is small and the shift is
large, the quantile test is more powerful than the Wilcoxon rank sum test, and
when e is large and the shift is small the Wilcoxon rank sum test
is more powerful than the quantile test.
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
The EPA guidance document Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media (USEPA, 1994, pp.4.7-4.9) recommends three different statistical tests for determining whether a remediated Superfund site has attained compliance: the Wilcoxon rank sum test, the quantile test, and the “hot measurement” comparison test. The Wilcoxon rank sum test and quantile test are nonparametric tests that compare chemical concentrations in the cleanup area with those in the reference area. The hot-measurement comparison test compares concentrations in the cleanup area with a pre-specified upper limit value Hm (the value of Hm must be negotiated between the EPA and the Superfund-site owner or operator). The Wilcoxon rank sum test is appropriate for detecting uniform failure of remedial action throughout the cleanup area. The quantile test is appropriate for detecting failure in only a few areas within the cleanup area. The hot-measurement comparison test is appropriate for detecting hot spots that need to be remediated regardless of the outcomes of the other two tests.
USEPA (1994, pp.4.7-4.9) recommends applying all three tests to all cleanup units within a cleanup area. This leads to the usual multiple comparisons problem: the probability of at least one of the tests indicating non-compliance, when in fact the cleanup area is in compliance, is greater than the pre-set Type I error level for any of the individual tests. USEPA (1994, p.3.3) recommends against using multiple comparison procedures to control the overall Type I error and suggests instead a re-sampling scheme where additional samples are taken in cases where non-compliance is indicated.
Steven P. Millard ([email protected])
Carrano, A., and D. Moore. (1982). The Rationale and Methodology for Quantifying Sister Chromatid Exchange in Humans. In Heddle, J.A., ed., Mutagenicity: New Horizons in Genetic Toxicology. Academic Press, New York, pp.268-304.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, Chapter 4.
Johnson, R.A., S. Verrill, and D.H. Moore. (1987). Two-Sample Rank Tests for Detecting Changes That Occur in a Small Proportion of the Treated Population. Biometrics 43, 641-655.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.435-439.
USEPA. (1994). Statistical Methods for Evaluating the Attainment of Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid Media. EPA/230-R-94-004. Office of Policy, Planning, and Evaluation, U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
quantileTestPValue
, wilcox.test
,
htest.object
, Hypothesis Tests.
# Following Example 7.5 on pages 7.23-7.24 of USEPA (1994b), perform the # quantile test for the TcCB data (the data are stored in EPA.94b.tccb.df). # There are n=47 observations from the reference area and m=77 observations # from the cleanup unit. The target rank is set to 9, resulting in a value # of quantile.ub=0.928. Note that the p-value is 0.0114, not 0.0117. EPA.94b.tccb.df # TcCB.orig TcCB Censored Area #1 0.22 0.22 FALSE Reference #2 0.23 0.23 FALSE Reference #... #46 1.20 1.20 FALSE Reference #47 1.33 1.33 FALSE Reference #48 <0.09 0.09 TRUE Cleanup #49 0.09 0.09 FALSE Cleanup #... #123 51.97 51.97 FALSE Cleanup #124 168.64 168.64 FALSE Cleanup # Determine the values to use for r and k for # a desired significance level of 0.01 #-------------------------------------------- p.vals <- quantileTestPValue(m = 77, n = 47, r = c(rep(8, 3), rep(9, 3), rep(10, 3)), k = c(6, 7, 8, 7, 8, 9, 8, 9, 10)) round(p.vals, 3) #[1] 0.355 0.122 0.019 0.264 0.081 0.011 0.193 0.053 0.007 # Choose r=9, k=9 to get a significance level of 0.011 #----------------------------------------------------- with(EPA.94b.tccb.df, quantileTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], target.r = 9)) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: e = 0 # #Alternative Hypothesis: Tail of Fx Shifted to Right of # Tail of Fy. # 0 < e <= 1, where # Fx(t) = (1-e)*Fy(t) + e*Fz(t), # Fz(t) <= Fy(t) for all t, # and Fy != Fz # #Test Name: Quantile Test # #Data: x = TcCB[Area == "Cleanup"] # y = TcCB[Area == "Reference"] # #Sample Sizes: nx = 77 # ny = 47 # #Test Statistics: k (# x obs of r largest) = 9 # r = 9 # #Test Statistic Parameters: m = 77.000 # n = 47.000 # quantile.ub = 0.928 # #P-value: 0.01136926 #========== # Clean up #--------- rm(p.vals)
Compute the p-value associated with a specified combination of m, n, r, and k for the quantile test (useful for determining r and k for a given significance level).
quantileTestPValue(m, n, r, k, exact.p = TRUE)
m |
numeric vector of integers indicating the number of observations from the
“treatment” group.
Missing ( |
n |
numeric vector of integers indicating the number of observations from the
“reference” group.
Missing ( |
r |
numeric vector of integers indicating the ranks of the observations to use as the
lower cut off for the quantile test. All values of |
k |
numeric vector of integers indicating the number of observations from the
“treatment” group contained in the |
exact.p |
logical scalar indicating whether to compute the p-value based on the exact
distribution of the test statistic ( |
If the arguments m
, n
, r
, and k
are not all the same
length, they are replicated to be the same length as the length of the longest
argument.
For details on how the p-value is computed, see the help file for
quantileTest
.
The function quantileTestPValue
is useful for determining what values to
use for r
and k
, given the values of m
, n
, and a
specified significance level. The function
quantileTestPValue
can be used to reproduce Tables A.6-A.9 in
USEPA (1994, pp.A.22-A.25).
numeric vector of p-values.
See the help file for quantileTest
.
Steven P. Millard ([email protected])
See the help file for quantileTest
.
quantileTest
, wilcox.test
,
htest.object
, Hypothesis Tests.
# Reproduce the first column of Table A.9 in USEPA (1994, p.A.25): #----------------------------------------------------------------- p.vals <- quantileTestPValue(m = 5, n = seq(15, 45, by = 5), r = c(9, 3, 4, 4, 5, 5, 6), k = c(4, 2, 2, 2, 2, 2, 2)) round(p.vals, 3) #[1] 0.098 0.091 0.119 0.089 0.109 0.087 0.103 #========== # Clean up #--------- rm(p.vals)
Carbon monoxide (CO) emissions (ppm) from an oil refinery near San Francisco. The refinery submitted 31 daily measurements from its stack for the period April 16, 1993 through May 16, 1993 to the Bay Area Air Quality Management District (BAAQMD). The BAAQMD made nine of its own independent measurements for the period September 11, 1990 through March 30, 1993.
data(Refinery.CO.df)
A data frame with 40 observations on the following 3 variables.
CO.ppm
a numeric vector of CO emissions (ppm)
Source
a factor indicating the source of the measurement (BAAQMD or refinery)
Date
a Date object indicating the date the measurement was taken
Data and Story Library, http://lib.stat.cmu.edu/DASL/Datafiles/Refinery.html.
Zou, G.Y., C.Y. Huo, and J. Taleban. (2009). Simple Confidence Intervals for Lognormal Means and their Differences with Environmental Applications. Environmetrics, 20, 172–180.
Perform Rosner's generalized extreme Studentized deviate test for up to k
potential outliers in a dataset, assuming the data without any outliers come
from a normal (Gaussian) distribution.
rosnerTest(x, k = 3, alpha = 0.05, warn = TRUE)
x |
numeric vector of observations.
Missing ( |
k |
positive integer indicating the number of suspected outliers. The argument |
alpha |
numeric scalar between 0 and 1 indicating the Type I error associated with the
test of hypothesis. The default value is |
warn |
logical scalar indicating whether to issue a warning ( |
Let x_1, x_2, ..., x_n denote the n observations. We assume that n - k
of these observations come from the same normal (Gaussian) distribution, and
that the k most “extreme” observations may or may not represent observations
from a different distribution. For i = 0, 1, ..., k-1, let
x_1^(i), x_2^(i), ..., x_(n-i)^(i) denote the n - i
observations left after omitting the i most extreme observations, and let
xbar^(i) and s^(i) denote the
mean and standard deviation, respectively, of the n - i observations in the data
that remain after removing the i most extreme observations.
Thus, xbar^(0) and s^(0) denote the
mean and standard deviation for the full sample, and in general xbar^(i) and s^(i)
denote the mean and standard deviation of the reduced sample of size n - i.
For a specified value of i, the most extreme observation x^(i) is the one
that is the greatest distance from the mean for that data set, i.e., the one that
maximizes | x_j^(i) - xbar^(i) | over j = 1, ..., n-i.
Thus, an extreme observation may be the smallest or the largest one in that data set.
Rosner's test is based on the k statistics R_1, R_2, ..., R_k,
which represent the extreme Studentized deviates computed
from successively reduced samples of size n, n-1, ..., n-k+1:
R_(i+1) = | x^(i) - xbar^(i) | / s^(i), for i = 0, 1, ..., k-1
Critical values for R_(i+1) are denoted lambda_(i+1) and are computed from quantiles of
Student's t-distribution as in the standard generalized ESD procedure:
lambda_(i+1) = [ (n-i-1) * t_(p, n-i-2) ] / sqrt[ (n-i-2 + t_(p, n-i-2)^2) * (n-i) ]
where t_(p, nu) denotes the
p'th quantile of Student's t-distribution with nu
degrees of freedom, and in this case
p = 1 - alpha / [ 2 * (n-i) ]
where alpha denotes the Type I error level.
The algorithm for determining the number of outliers is as follows:
Compare R_k with lambda_k. If R_k > lambda_k then
conclude the k most extreme values are outliers.
If R_k <= lambda_k then compare R_(k-1) with lambda_(k-1).
If R_(k-1) > lambda_(k-1) then conclude the k-1 most extreme values
are outliers.
Continue in this fashion until a certain number of outliers have been identified or Rosner's test finds no outliers at all.
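The successive deviates and critical values can be sketched in a few lines of R (a rough illustration of the procedure described above, using the standard generalized ESD critical values; this is not the rosnerTest implementation, and the name gesd_sketch is made up):
# Compute R_1, ..., R_k from successively reduced samples together with the
# corresponding critical values lambda_1, ..., lambda_k.
gesd_sketch <- function(x, k = 3, alpha = 0.05) {
  n <- length(x)
  out <- data.frame(i = 0:(k - 1), R = NA_real_, lambda = NA_real_)
  y <- x
  for (i in 0:(k - 1)) {
    dev <- abs(y - mean(y))
    j <- which.max(dev)                    # most extreme remaining observation
    out$R[i + 1] <- dev[j] / sd(y)         # extreme Studentized deviate
    p  <- 1 - alpha / (2 * (n - i))
    tq <- qt(p, df = n - i - 2)
    out$lambda[i + 1] <- (n - i - 1) * tq / sqrt((n - i - 2 + tq^2) * (n - i))
    y <- y[-j]                             # drop it and repeat on the reduced sample
  }
  out
}
set.seed(250)
dat <- c(rnorm(30, mean = 3), 10, 12)      # two planted extreme values
gesd_sketch(dat, k = 4)
# The number of outliers is then determined by working backwards from the
# largest i with R > lambda, as described in the algorithm above.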
Based on a study using N=1,000 simulations, Rosner's (1983) Table 1 shows the estimated
true Type I error of declaring at least one outlier when none exists for various
sample sizes n ranging from 10 to 100, and the declared maximum number of outliers k
ranging from 1 to 10. Based on that table, Rosner (1983) declared that for an
assumed Type I error level of 0.05, as long as the sample size is at least 25, the estimated alpha
levels are quite close to 0.05, and that similar results were obtained
assuming a Type I error level of 0.01. However, the table below is an expanded version
of Rosner's (1983) Table 1 and shows results based on N=10,000 simulations.
You can see that for an assumed Type I error level of 0.05, the test maintains the Type I error
fairly well even for small sample sizes as long as k = 1, and for sample sizes
of at least 15 as long as k is at most 2.
Also, for an assumed Type I error level of 0.01, the test maintains the Type I error fairly
well for small sample sizes as long as k = 1.
Based on these results, when warn=TRUE
, a warning is issued for the following cases
indicating that the assumed Type I error may not be correct:
alpha
is greater than 0.01
, the sample size is less than 15, and
k
is greater than 1
.
alpha
is greater than 0.01
,
the sample size is at least 15 and less than 25, and
k
is greater than 2
.
alpha
is less than or equal to 0.01
, the sample size is less than 15, and
k
is greater than 1
.
k
is greater than 10
, or greater than the floor of half of the sample size
(i.e., greater than the greatest integer less than or equal to half of the sample size).
A warning is given for this case because simulations have not been done for this case.
Table 1a. Observed Type I Error Levels based on 10,000 Simulations, n = 3 to 5.
n | k | Observed alpha (Assumed alpha = 0.05) | 95% LCL | 95% UCL | Observed alpha (Assumed alpha = 0.01) | 95% LCL | 95% UCL |
3 | 1 | 0.047 | 0.043 | 0.051 | 0.009 | 0.007 | 0.01 |
4 | 1 | 0.049 | 0.045 | 0.053 | 0.010 | 0.008 | 0.012 |
2 | 0.107 | 0.101 | 0.113 | 0.021 | 0.018 | 0.024 | |
5 | 1 | 0.048 | 0.044 | 0.053 | 0.008 | 0.006 | 0.009 |
2 | 0.095 | 0.090 | 0.101 | 0.020 | 0.018 | 0.023 |
Table 1b. Observed Type I Error Levels based on 10,000 Simulations, n = 6 to 10.
n | k | Observed alpha (Assumed alpha = 0.05) | 95% LCL | 95% UCL | Observed alpha (Assumed alpha = 0.01) | 95% LCL | 95% UCL |
6 | 1 | 0.048 | 0.044 | 0.053 | 0.010 | 0.009 | 0.012 |
2 | 0.085 | 0.080 | 0.091 | 0.017 | 0.015 | 0.020 | |
3 | 0.141 | 0.134 | 0.148 | 0.028 | 0.025 | 0.031 | |
7 | 1 | 0.048 | 0.044 | 0.053 | 0.013 | 0.011 | 0.015 |
2 | 0.080 | 0.075 | 0.086 | 0.017 | 0.015 | 0.020 | |
3 | 0.112 | 0.106 | 0.118 | 0.022 | 0.019 | 0.025 | |
8 | 1 | 0.048 | 0.044 | 0.053 | 0.011 | 0.009 | 0.013 |
2 | 0.080 | 0.074 | 0.085 | 0.017 | 0.014 | 0.019 | |
3 | 0.102 | 0.096 | 0.108 | 0.020 | 0.017 | 0.023 | |
4 | 0.143 | 0.136 | 0.150 | 0.028 | 0.025 | 0.031 | |
9 | 1 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 |
2 | 0.069 | 0.064 | 0.074 | 0.014 | 0.012 | 0.016 | |
3 | 0.097 | 0.091 | 0.103 | 0.018 | 0.015 | 0.021 | |
4 | 0.120 | 0.114 | 0.126 | 0.024 | 0.021 | 0.027 | |
10 | 1 | 0.051 | 0.047 | 0.056 | 0.010 | 0.008 | 0.012 |
2 | 0.068 | 0.063 | 0.073 | 0.012 | 0.010 | 0.014 | |
3 | 0.085 | 0.080 | 0.091 | 0.015 | 0.013 | 0.017 | |
4 | 0.106 | 0.100 | 0.112 | 0.021 | 0.018 | 0.024 | |
5 | 0.135 | 0.128 | 0.142 | 0.025 | 0.022 | 0.028 |
Table 1c. Observed Type I Error Levels based on 10,000 Simulations, n = 11 to 15.
n | k | Observed alpha (Assumed alpha = 0.05) | 95% LCL | 95% UCL | Observed alpha (Assumed alpha = 0.01) | 95% LCL | 95% UCL |
11 | 1 | 0.052 | 0.048 | 0.056 | 0.012 | 0.010 | 0.014 |
2 | 0.070 | 0.065 | 0.075 | 0.014 | 0.012 | 0.017 | |
3 | 0.082 | 0.077 | 0.088 | 0.014 | 0.011 | 0.016 | |
4 | 0.101 | 0.095 | 0.107 | 0.019 | 0.016 | 0.021 | |
5 | 0.116 | 0.110 | 0.123 | 0.022 | 0.019 | 0.024 | |
12 | 1 | 0.052 | 0.047 | 0.056 | 0.011 | 0.009 | 0.013 |
2 | 0.067 | 0.062 | 0.072 | 0.011 | 0.009 | 0.013 | |
3 | 0.074 | 0.069 | 0.080 | 0.016 | 0.013 | 0.018 | |
4 | 0.088 | 0.082 | 0.093 | 0.016 | 0.014 | 0.019 | |
5 | 0.099 | 0.093 | 0.105 | 0.016 | 0.013 | 0.018 | |
6 | 0.117 | 0.111 | 0.123 | 0.021 | 0.018 | 0.023 | |
13 | 1 | 0.048 | 0.044 | 0.052 | 0.010 | 0.008 | 0.012 |
2 | 0.064 | 0.059 | 0.069 | 0.014 | 0.012 | 0.016 | |
3 | 0.070 | 0.065 | 0.075 | 0.013 | 0.011 | 0.015 | |
4 | 0.079 | 0.074 | 0.084 | 0.014 | 0.012 | 0.017 | |
5 | 0.088 | 0.083 | 0.094 | 0.015 | 0.013 | 0.018 | |
6 | 0.109 | 0.103 | 0.115 | 0.020 | 0.017 | 0.022 | |
14 | 1 | 0.046 | 0.042 | 0.051 | 0.009 | 0.007 | 0.011 |
2 | 0.062 | 0.057 | 0.066 | 0.012 | 0.010 | 0.014 | |
3 | 0.069 | 0.064 | 0.074 | 0.012 | 0.010 | 0.014 | |
4 | 0.077 | 0.072 | 0.082 | 0.015 | 0.013 | 0.018 | |
5 | 0.084 | 0.079 | 0.090 | 0.016 | 0.013 | 0.018 | |
6 | 0.091 | 0.085 | 0.097 | 0.017 | 0.014 | 0.019 | |
7 | 0.107 | 0.101 | 0.113 | 0.018 | 0.016 | 0.021 | |
15 | 1 | 0.054 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 |
2 | 0.057 | 0.053 | 0.062 | 0.010 | 0.008 | 0.012 | |
3 | 0.065 | 0.060 | 0.069 | 0.013 | 0.011 | 0.016 | |
4 | 0.073 | 0.068 | 0.078 | 0.014 | 0.011 | 0.016 | |
5 | 0.074 | 0.069 | 0.079 | 0.012 | 0.010 | 0.014 | |
6 | 0.086 | 0.081 | 0.092 | 0.015 | 0.013 | 0.017 | |
7 | 0.099 | 0.094 | 0.105 | 0.018 | 0.015 | 0.020 |
Table 1d. Observed Type I Error Levels based on 10,000 Simulations, n = 16 to 20.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
16 | 1 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 |
2 | 0.055 | 0.051 | 0.059 | 0.011 | 0.009 | 0.013 | |
3 | 0.068 | 0.063 | 0.073 | 0.011 | 0.009 | 0.013 | |
4 | 0.074 | 0.069 | 0.079 | 0.015 | 0.013 | 0.017 | |
5 | 0.077 | 0.072 | 0.082 | 0.015 | 0.013 | 0.018 | |
6 | 0.075 | 0.070 | 0.080 | 0.013 | 0.011 | 0.016 | |
7 | 0.087 | 0.082 | 0.093 | 0.017 | 0.014 | 0.020 | |
8 | 0.096 | 0.090 | 0.101 | 0.016 | 0.014 | 0.019 | |
17 | 1 | 0.047 | 0.043 | 0.051 | 0.008 | 0.007 | 0.010 |
2 | 0.059 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
3 | 0.062 | 0.057 | 0.067 | 0.012 | 0.010 | 0.014 | |
4 | 0.070 | 0.065 | 0.075 | 0.012 | 0.009 | 0.014 | |
5 | 0.069 | 0.064 | 0.074 | 0.012 | 0.010 | 0.015 | |
6 | 0.071 | 0.066 | 0.076 | 0.015 | 0.012 | 0.017 | |
7 | 0.081 | 0.076 | 0.087 | 0.014 | 0.012 | 0.016 | |
8 | 0.083 | 0.078 | 0.088 | 0.015 | 0.013 | 0.017 | |
18 | 1 | 0.051 | 0.047 | 0.055 | 0.010 | 0.008 | 0.012 |
2 | 0.056 | 0.052 | 0.061 | 0.012 | 0.010 | 0.014 | |
3 | 0.065 | 0.060 | 0.070 | 0.012 | 0.010 | 0.015 | |
4 | 0.065 | 0.060 | 0.070 | 0.013 | 0.011 | 0.015 | |
5 | 0.069 | 0.064 | 0.074 | 0.012 | 0.010 | 0.014 | |
6 | 0.068 | 0.063 | 0.073 | 0.014 | 0.011 | 0.016 | |
7 | 0.072 | 0.067 | 0.077 | 0.014 | 0.011 | 0.016 | |
8 | 0.076 | 0.071 | 0.081 | 0.012 | 0.010 | 0.014 | |
9 | 0.081 | 0.076 | 0.086 | 0.012 | 0.010 | 0.014 | |
19 | 1 | 0.051 | 0.046 | 0.055 | 0.008 | 0.006 | 0.010 |
2 | 0.059 | 0.055 | 0.064 | 0.012 | 0.010 | 0.014 | |
3 | 0.059 | 0.054 | 0.064 | 0.011 | 0.009 | 0.013 | |
4 | 0.061 | 0.057 | 0.066 | 0.012 | 0.010 | 0.014 | |
5 | 0.067 | 0.062 | 0.072 | 0.013 | 0.010 | 0.015 | |
6 | 0.066 | 0.061 | 0.071 | 0.011 | 0.009 | 0.013 | |
7 | 0.069 | 0.064 | 0.074 | 0.013 | 0.011 | 0.015 | |
8 | 0.074 | 0.069 | 0.079 | 0.012 | 0.010 | 0.014 | |
9 | 0.082 | 0.077 | 0.087 | 0.015 | 0.013 | 0.018 | |
20 | 1 | 0.053 | 0.048 | 0.057 | 0.011 | 0.009 | 0.013 |
2 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
3 | 0.060 | 0.056 | 0.065 | 0.009 | 0.007 | 0.011 | |
4 | 0.063 | 0.058 | 0.068 | 0.012 | 0.010 | 0.014 | |
5 | 0.063 | 0.059 | 0.068 | 0.014 | 0.011 | 0.016 | |
6 | 0.063 | 0.058 | 0.067 | 0.011 | 0.009 | 0.013 | |
7 | 0.065 | 0.061 | 0.070 | 0.011 | 0.009 | 0.013 | |
8 | 0.070 | 0.065 | 0.076 | 0.012 | 0.010 | 0.014 | |
9 | 0.076 | 0.070 | 0.081 | 0.013 | 0.011 | 0.015 | |
10 | 0.081 | 0.076 | 0.087 | 0.012 | 0.010 | 0.014 |
Table 1e. Observed Type I Error Levels based on 10,000 Simulations, n = 21 to 25.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
21 | 1 | 0.054 | 0.049 | 0.058 | 0.013 | 0.011 | 0.015 |
2 | 0.054 | 0.049 | 0.058 | 0.012 | 0.010 | 0.014 | |
3 | 0.058 | 0.054 | 0.063 | 0.012 | 0.010 | 0.014 | |
4 | 0.058 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
5 | 0.064 | 0.059 | 0.069 | 0.013 | 0.011 | 0.016 | |
6 | 0.066 | 0.061 | 0.071 | 0.012 | 0.010 | 0.015 | |
7 | 0.063 | 0.058 | 0.068 | 0.013 | 0.011 | 0.015 | |
8 | 0.066 | 0.061 | 0.071 | 0.010 | 0.008 | 0.012 | |
9 | 0.073 | 0.068 | 0.078 | 0.013 | 0.011 | 0.015 | |
10 | 0.071 | 0.066 | 0.076 | 0.012 | 0.010 | 0.014 | |
22 | 1 | 0.047 | 0.042 | 0.051 | 0.010 | 0.008 | 0.012 |
2 | 0.058 | 0.053 | 0.062 | 0.012 | 0.010 | 0.015 | |
3 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
4 | 0.059 | 0.055 | 0.064 | 0.012 | 0.010 | 0.014 | |
5 | 0.061 | 0.057 | 0.066 | 0.009 | 0.008 | 0.011 | |
6 | 0.063 | 0.058 | 0.068 | 0.013 | 0.010 | 0.015 | |
7 | 0.065 | 0.060 | 0.070 | 0.013 | 0.010 | 0.015 | |
8 | 0.065 | 0.060 | 0.070 | 0.014 | 0.012 | 0.016 | |
9 | 0.065 | 0.060 | 0.070 | 0.012 | 0.010 | 0.014 | |
10 | 0.067 | 0.062 | 0.072 | 0.012 | 0.009 | 0.014 | |
23 | 1 | 0.051 | 0.047 | 0.056 | 0.008 | 0.007 | 0.010 |
2 | 0.056 | 0.052 | 0.061 | 0.010 | 0.009 | 0.012 | |
3 | 0.056 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
4 | 0.062 | 0.057 | 0.066 | 0.011 | 0.009 | 0.013 | |
5 | 0.061 | 0.056 | 0.065 | 0.010 | 0.009 | 0.012 | |
6 | 0.060 | 0.055 | 0.064 | 0.012 | 0.010 | 0.014 | |
7 | 0.062 | 0.057 | 0.066 | 0.011 | 0.009 | 0.013 | |
8 | 0.063 | 0.058 | 0.068 | 0.012 | 0.010 | 0.014 | |
9 | 0.066 | 0.061 | 0.071 | 0.012 | 0.010 | 0.014 | |
10 | 0.068 | 0.063 | 0.073 | 0.014 | 0.012 | 0.017 | |
24 | 1 | 0.051 | 0.046 | 0.055 | 0.010 | 0.008 | 0.012 |
2 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
3 | 0.058 | 0.053 | 0.062 | 0.010 | 0.008 | 0.012 | |
4 | 0.060 | 0.056 | 0.065 | 0.013 | 0.011 | 0.015 | |
5 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
6 | 0.065 | 0.060 | 0.069 | 0.011 | 0.009 | 0.013 | |
7 | 0.062 | 0.057 | 0.066 | 0.012 | 0.010 | 0.014 | |
8 | 0.060 | 0.055 | 0.065 | 0.012 | 0.010 | 0.014 | |
9 | 0.066 | 0.061 | 0.071 | 0.012 | 0.010 | 0.014 | |
10 | 0.064 | 0.059 | 0.068 | 0.012 | 0.010 | 0.015 | |
25 | 1 | 0.054 | 0.050 | 0.059 | 0.012 | 0.009 | 0.014 |
2 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
3 | 0.057 | 0.052 | 0.062 | 0.011 | 0.009 | 0.013 | |
4 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
5 | 0.060 | 0.055 | 0.065 | 0.012 | 0.010 | 0.014 | |
6 | 0.060 | 0.055 | 0.064 | 0.011 | 0.009 | 0.013 | |
7 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
8 | 0.062 | 0.058 | 0.067 | 0.011 | 0.009 | 0.013 | |
9 | 0.058 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
10 | 0.061 | 0.057 | 0.066 | 0.010 | 0.008 | 0.012 |
Table 1f. Observed Type I Error Levels based on 10,000 Simulations, n = 26 to 30.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
26 | 1 | 0.051 | 0.047 | 0.055 | 0.012 | 0.010 | 0.014 |
2 | 0.057 | 0.053 | 0.062 | 0.013 | 0.011 | 0.015 | |
3 | 0.055 | 0.050 | 0.059 | 0.012 | 0.010 | 0.014 | |
4 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
5 | 0.058 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
6 | 0.061 | 0.056 | 0.066 | 0.012 | 0.010 | 0.014 | |
7 | 0.059 | 0.054 | 0.064 | 0.011 | 0.009 | 0.013 | |
8 | 0.060 | 0.056 | 0.065 | 0.010 | 0.008 | 0.012 | |
9 | 0.060 | 0.056 | 0.065 | 0.011 | 0.009 | 0.013 | |
10 | 0.061 | 0.056 | 0.065 | 0.011 | 0.009 | 0.013 | |
27 | 1 | 0.050 | 0.046 | 0.054 | 0.009 | 0.007 | 0.011 |
2 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
3 | 0.062 | 0.057 | 0.066 | 0.012 | 0.010 | 0.014 | |
4 | 0.063 | 0.058 | 0.068 | 0.011 | 0.009 | 0.013 | |
5 | 0.051 | 0.047 | 0.055 | 0.010 | 0.008 | 0.012 | |
6 | 0.058 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
7 | 0.060 | 0.056 | 0.065 | 0.010 | 0.008 | 0.012 | |
8 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
9 | 0.061 | 0.056 | 0.066 | 0.012 | 0.010 | 0.014 | |
10 | 0.055 | 0.051 | 0.060 | 0.008 | 0.006 | 0.010 | |
28 | 1 | 0.049 | 0.045 | 0.053 | 0.010 | 0.008 | 0.011 |
2 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
3 | 0.056 | 0.052 | 0.061 | 0.012 | 0.009 | 0.014 | |
4 | 0.057 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
5 | 0.057 | 0.053 | 0.062 | 0.010 | 0.008 | 0.012 | |
6 | 0.056 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
7 | 0.057 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
8 | 0.058 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
9 | 0.054 | 0.050 | 0.058 | 0.011 | 0.009 | 0.013 | |
10 | 0.062 | 0.057 | 0.067 | 0.011 | 0.009 | 0.013 | |
29 | 1 | 0.049 | 0.045 | 0.053 | 0.011 | 0.009 | 0.013 |
2 | 0.053 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
3 | 0.056 | 0.051 | 0.060 | 0.010 | 0.009 | 0.012 | |
4 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
5 | 0.056 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
6 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
7 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
8 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
9 | 0.056 | 0.051 | 0.061 | 0.011 | 0.009 | 0.013 | |
10 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
30 | 1 | 0.050 | 0.046 | 0.054 | 0.009 | 0.007 | 0.011 |
2 | 0.054 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
3 | 0.056 | 0.052 | 0.061 | 0.012 | 0.010 | 0.015 | |
4 | 0.054 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
5 | 0.058 | 0.053 | 0.063 | 0.012 | 0.010 | 0.014 | |
6 | 0.062 | 0.058 | 0.067 | 0.012 | 0.010 | 0.014 | |
7 | 0.056 | 0.052 | 0.061 | 0.012 | 0.010 | 0.014 | |
8 | 0.059 | 0.054 | 0.064 | 0.011 | 0.009 | 0.013 | |
9 | 0.056 | 0.052 | 0.061 | 0.010 | 0.009 | 0.012 | |
10 | 0.058 | 0.053 | 0.062 | 0.012 | 0.010 | 0.015 |
Table 1g. Observed Type I Error Levels based on 10,000 Simulations, n = 31 to 35.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
31 | 1 | 0.051 | 0.047 | 0.056 | 0.009 | 0.007 | 0.011 |
2 | 0.054 | 0.050 | 0.059 | 0.010 | 0.009 | 0.012 | |
3 | 0.053 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
4 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
5 | 0.053 | 0.049 | 0.057 | 0.011 | 0.009 | 0.013 | |
6 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
7 | 0.055 | 0.050 | 0.059 | 0.012 | 0.010 | 0.014 | |
8 | 0.056 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
9 | 0.057 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
10 | 0.058 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
32 | 1 | 0.054 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 |
2 | 0.054 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
3 | 0.052 | 0.047 | 0.056 | 0.009 | 0.007 | 0.011 | |
4 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
5 | 0.056 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
6 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
7 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
8 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
9 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
10 | 0.054 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
33 | 1 | 0.051 | 0.046 | 0.055 | 0.011 | 0.009 | 0.013 |
2 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
3 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
4 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
5 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
6 | 0.058 | 0.053 | 0.062 | 0.011 | 0.009 | 0.013 | |
7 | 0.057 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
8 | 0.058 | 0.054 | 0.063 | 0.011 | 0.009 | 0.013 | |
9 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
10 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
34 | 1 | 0.052 | 0.048 | 0.056 | 0.009 | 0.007 | 0.011 |
2 | 0.053 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
3 | 0.055 | 0.050 | 0.059 | 0.012 | 0.010 | 0.014 | |
4 | 0.056 | 0.052 | 0.061 | 0.010 | 0.008 | 0.012 | |
5 | 0.053 | 0.048 | 0.057 | 0.009 | 0.007 | 0.011 | |
6 | 0.055 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
7 | 0.052 | 0.048 | 0.057 | 0.012 | 0.010 | 0.014 | |
8 | 0.055 | 0.050 | 0.059 | 0.009 | 0.008 | 0.011 | |
9 | 0.055 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
10 | 0.054 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
35 | 1 | 0.051 | 0.046 | 0.055 | 0.010 | 0.009 | 0.012 |
2 | 0.054 | 0.049 | 0.058 | 0.010 | 0.009 | 0.012 | |
3 | 0.055 | 0.050 | 0.059 | 0.010 | 0.009 | 0.012 | |
4 | 0.053 | 0.048 | 0.057 | 0.011 | 0.009 | 0.013 | |
5 | 0.056 | 0.051 | 0.061 | 0.011 | 0.009 | 0.013 | |
6 | 0.055 | 0.051 | 0.059 | 0.012 | 0.010 | 0.014 | |
7 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
8 | 0.054 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
9 | 0.061 | 0.056 | 0.066 | 0.012 | 0.010 | 0.014 | |
10 | 0.053 | 0.048 | 0.057 | 0.011 | 0.009 | 0.013 |
Table 1h. Observed Type I Error Levels based on 10,000 Simulations, n = 36 to 40.
n | k | Assumed alpha = 0.05 | | | Assumed alpha = 0.01 | | |
 | | Observed | 95% LCL | 95% UCL | Observed | 95% LCL | 95% UCL |
36 | 1 | 0.047 | 0.043 | 0.051 | 0.010 | 0.008 | 0.012 |
2 | 0.058 | 0.053 | 0.062 | 0.012 | 0.010 | 0.015 | |
3 | 0.052 | 0.047 | 0.056 | 0.009 | 0.007 | 0.011 | |
4 | 0.052 | 0.048 | 0.056 | 0.012 | 0.010 | 0.014 | |
5 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
6 | 0.055 | 0.051 | 0.059 | 0.012 | 0.010 | 0.014 | |
7 | 0.053 | 0.048 | 0.057 | 0.011 | 0.009 | 0.013 | |
8 | 0.056 | 0.051 | 0.060 | 0.012 | 0.010 | 0.014 | |
9 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
10 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
37 | 1 | 0.050 | 0.046 | 0.055 | 0.010 | 0.008 | 0.012 |
2 | 0.054 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
3 | 0.054 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
4 | 0.054 | 0.050 | 0.058 | 0.010 | 0.008 | 0.012 | |
5 | 0.054 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
6 | 0.054 | 0.050 | 0.058 | 0.011 | 0.009 | 0.013 | |
7 | 0.055 | 0.051 | 0.060 | 0.010 | 0.008 | 0.012 | |
8 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
9 | 0.053 | 0.049 | 0.058 | 0.011 | 0.009 | 0.013 | |
10 | 0.049 | 0.045 | 0.054 | 0.009 | 0.007 | 0.011 | |
38 | 1 | 0.049 | 0.045 | 0.053 | 0.009 | 0.007 | 0.011 |
2 | 0.052 | 0.047 | 0.056 | 0.008 | 0.007 | 0.010 | |
3 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
4 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
5 | 0.056 | 0.052 | 0.061 | 0.012 | 0.010 | 0.014 | |
6 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
7 | 0.049 | 0.045 | 0.053 | 0.009 | 0.007 | 0.011 | |
8 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
9 | 0.054 | 0.050 | 0.059 | 0.010 | 0.009 | 0.012 | |
10 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
39 | 1 | 0.047 | 0.043 | 0.051 | 0.010 | 0.008 | 0.012 |
2 | 0.055 | 0.051 | 0.059 | 0.010 | 0.008 | 0.012 | |
3 | 0.053 | 0.049 | 0.057 | 0.010 | 0.008 | 0.012 | |
4 | 0.053 | 0.049 | 0.058 | 0.010 | 0.009 | 0.012 | |
5 | 0.052 | 0.048 | 0.057 | 0.010 | 0.008 | 0.012 | |
6 | 0.053 | 0.049 | 0.058 | 0.010 | 0.008 | 0.012 | |
7 | 0.057 | 0.052 | 0.061 | 0.011 | 0.009 | 0.013 | |
8 | 0.057 | 0.053 | 0.062 | 0.012 | 0.010 | 0.014 | |
9 | 0.050 | 0.046 | 0.055 | 0.010 | 0.008 | 0.012 | |
10 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
40 | 1 | 0.049 | 0.045 | 0.054 | 0.010 | 0.008 | 0.012 |
2 | 0.052 | 0.048 | 0.057 | 0.010 | 0.009 | 0.012 | |
3 | 0.055 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
4 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
5 | 0.054 | 0.050 | 0.059 | 0.010 | 0.008 | 0.012 | |
6 | 0.049 | 0.045 | 0.053 | 0.010 | 0.008 | 0.012 | |
7 | 0.056 | 0.051 | 0.060 | 0.011 | 0.009 | 0.013 | |
8 | 0.054 | 0.050 | 0.059 | 0.011 | 0.009 | 0.013 | |
9 | 0.047 | 0.043 | 0.052 | 0.010 | 0.008 | 0.011 | |
10 | 0.058 | 0.054 | 0.063 | 0.010 | 0.008 | 0.012 |
A list of class "gofOutlier"
containing the results of the hypothesis test.
See the help file for gofOutlier.object
for details.
Rosner's test is a commonly used test for “outliers” when you are willing to assume that the data without outliers follows a normal (Gaussian) distribution. It is designed to avoid masking, which occurs when an outlier goes undetected because it is close in value to another outlier.
Rosner's test is a kind of discordancy test (Barnett and Lewis, 1995). The test statistic of a discordancy test is usually a ratio: the numerator is the difference between the suspected outlier and some summary statistic of the data set (e.g., mean, next largest observation, etc.), while the denominator is always a measure of spread within the data (e.g., standard deviation, range, etc.). Both USEPA (2009) and USEPA (2013a,b) discuss two commonly used discordancy tests: Dixon's test and Rosner's test. Both of these tests assume that all of the data that are not outliers come from a normal (Gaussian) distribution.
There are many forms of Dixon's test (Barnett and Lewis, 1995). The one presented in USEPA (2009) and USEPA (2013a,b) assumes just one outlier (Dixon, 1953). This test is vulnerable to "masking", in which the presence of several outliers hides the fact that even one outlier is present. There are also other forms of Dixon's test that allow for more than one outlier based on a sequence of sub-tests, but these tests are also vulnerable to masking.
Rosner's test allows you to test for several possible outliers and avoids the problem of masking. Rosner's test requires you to set the number of suspected outliers, k, in advance. As in the case of Dixon's test, there are several forms of Rosner's test, so you need to be aware of which one you are using. The form of Rosner's test presented in USEPA (2009) is based on the extreme Studentized deviate (ESD) (Rosner, 1975), whereas the form of Rosner's test performed by the EnvStats function
rosnerTest
and presented in USEPA (2013a,b) is based on the generalized ESD (Rosner, 1983; Gilbert, 1987). USEPA (2013a, p. 190) cites both Rosner (1975) and Rosner (1983), but presents only the test given in Rosner (1983). Rosner's test based on the ESD has the appropriate Type I error level if there are no outliers in the dataset, but if there are actually, say, k' outliers, where k' < k, then the ESD version of Rosner's test tends to declare more than k' outliers with a probability that is greater than the stated Type I error level (referred to as "swamping"). Rosner's test based on the generalized ESD fixes this problem. USEPA (2013a, pp. 17, 191) incorrectly states that the generalized ESD version of Rosner's test is vulnerable to masking. Surprisingly, the well-known book on statistical outliers by Barnett and Lewis (1995) does not discuss Rosner's generalized ESD test.
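As a rough illustration of the practical difference the choice of k can make, the following sketch (simulated data, purely illustrative; it assumes the EnvStats package is loaded) runs the test with one and with two suspected outliers on a sample containing two similar high values:

# Simulated data: 20 "clean" observations plus two similar high values.
set.seed(47)
dat <- c(rnorm(20, mean = 3, sd = 1), 7.8, 8.1)

# With k = 1 only the single most extreme value is examined, so a second
# extreme value is never tested (the masking issue discussed above).
rosnerTest(dat, k = 1)

# With k = 2 both suspect values are examined in turn.
rosnerTest(dat, k = 2)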
As noted, using Rosner's test requires specifying the number of suspected outliers, k, in advance. USEPA (2013a, pp.190-191) states:
“A graphical display (Q-Q plot) can be used to identify suspected outliers
needed to perform the Rosner test”, and USEPA (2009, p. 12-11) notes:
“A potential drawback of Rosner's test is that the user must first identify
the maximum number of potential outliers (k) prior to running the test. Therefore,
this requirement makes the test ill-advised as an automatic outlier screening tool,
and somewhat reliant on the user to identify candidate outliers.”
When observations contain non-detect values (NDs), USEPA (2013a, p. 191) states:
“one may replace the NDs by their respective detection limits (DLs), DL/2, or may
just ignore them ....” This is bad advice, as this method of dealing with non-detects
will produce Type I error rates that are not correct.
OUTLIERS ARE NOT NECESSARILY INCORRECT VALUES
Whether an observation is an “outlier” depends on the underlying assumed
statistical model. McBean and Rovers (1992) state:
“It may be possible to ignore the outlier if a physical rationale is available but,
failing that, the value must be included .... Note that the use of statistics does not
interpret the facts, it simply makes the facts easier to see. Therefore, it is incumbent
on the analyst to identify whether or not the high value ... is truly representative of
the chemical being monitored or, instead, is an outlier for reasons such as a result of
sampling or laboratory error.”
USEPA (2006, p.51) states:
“If scientific reasoning does not explain the outlier, it should not be
discarded from the data set.”
Finally, an editorial by the Editor-in-Chief of the journal Science deals with this topic (McNutt, 2014).
You can use the functions qqPlot
and gofTest
to explore
other possible statistical models for the data, or you can use nonparametric statistics
if you do not want to assume a particular distribution.
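For example, a minimal sketch along these lines (illustrative data only; it assumes the EnvStats package is loaded, and the particular calls are offered as an illustration rather than a prescription) is:

# Positively skewed data that a normal model would tend to flag as
# containing "outliers".
set.seed(123)
dat <- rlnorm(30, meanlog = 1, sdlog = 0.8)

# Graphical check of a lognormal model for the full data set.
qqPlot(dat, dist = "lnorm", add.line = TRUE)

# Formal goodness-of-fit test of the lognormal model.
gofTest(dat, distribution = "lnorm", test = "sw")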
Steven P. Millard ([email protected])
Barnett, V., and T. Lewis. (1995). Outliers in Statistical Data. Third Edition. John Wiley & Sons, Chichester, UK, pp. 235–236.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY, pp.188–191.
McBean, E.A, and F.A. Rovers. (1992). Estimation of the Probability of Exceedance of Contaminant Concentrations. Ground Water Monitoring Review Winter, pp. 115–119.
McNutt, M. (2014). Raising the Bar. Science 345(6192), p. 9.
Rosner, B. (1975). On the Detection of Many Outliers. Technometrics 17, 221–227.
Rosner, B. (1983). Percentage Points for a Generalized ESD Many-Outlier Procedure. Technometrics 25, 165–172.
USEPA. (2006). Data Quality Assessment: A Reviewer's Guide. EPA QA/G-9R. EPA/240/B-06/002, February 2006. Office of Environmental Information, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C., pp. 12-10 to 12-14.
USEPA. (2013a). ProUCL Version 5.0.00 Technical Guide. EPA/600/R-07/041, September 2013. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 190–195.
USEPA. (2013b). ProUCL Version 5.0.00 User Guide. EPA/600/R-07/041, September 2013. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C., pp. 190–195.
gofTest, gofOutlier.object, print.gofOutlier, Normal, qqPlot.
# Combine 30 observations from a normal distribution with mean 3 and # standard deviation 2, with 3 observations from a normal distribution # with mean 10 and standard deviation 1, then run Rosner's Test on these # data, specifying k=4 potential outliers based on looking at the # normal Q-Q plot. # (Note: the call to set.seed simply allows you to reproduce # this example.) set.seed(250) dat <- c(rnorm(30, mean = 3, sd = 2), rnorm(3, mean = 10, sd = 1)) dev.new() qqPlot(dat) rosnerTest(dat, k = 4) #Results of Outlier Test #------------------------- # #Test Method: Rosner's Test for Outliers # #Hypothesized Distribution: Normal # #Data: dat # #Sample Size: 33 # #Test Statistics: R.1 = 2.848514 # R.2 = 3.086875 # R.3 = 3.033044 # R.4 = 2.380235 # #Test Statistic Parameter: k = 4 # #Alternative Hypothesis: Up to 4 observations are not # from the same Distribution. # #Type I Error: 5% # #Number of Outliers Detected: 3 # # i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier #1 0 3.549744 2.531011 10.7593656 33 2.848514 2.951949 TRUE #2 1 3.324444 2.209872 10.1460427 31 3.086875 2.938048 TRUE #3 2 3.104392 1.856109 8.7340527 32 3.033044 2.923571 TRUE #4 3 2.916737 1.560335 -0.7972275 25 2.380235 2.908473 FALSE #---------- # Clean up rm(dat) graphics.off() #-------------------------------------------------------------------- # Example 12-4 of USEPA (2009, page 12-12) gives an example of # using Rosner's test to test for outliers in napthalene measurements (ppb) # taken at 5 background wells over 5 quarters. The data for this example # are stored in EPA.09.Ex.12.4.naphthalene.df. EPA.09.Ex.12.4.naphthalene.df # Quarter Well Naphthalene.ppb #1 1 BW.1 3.34 #2 2 BW.1 5.39 #3 3 BW.1 5.74 # ... #23 3 BW.5 5.53 #24 4 BW.5 4.42 #25 5 BW.5 35.45 longToWide(EPA.09.Ex.12.4.naphthalene.df, "Naphthalene.ppb", "Quarter", "Well", paste.row.name = TRUE) # BW.1 BW.2 BW.3 BW.4 BW.5 #Quarter.1 3.34 5.59 1.91 6.12 8.64 #Quarter.2 5.39 5.96 1.74 6.05 5.34 #Quarter.3 5.74 1.47 23.23 5.18 5.53 #Quarter.4 6.88 2.57 1.82 4.43 4.42 #Quarter.5 5.85 5.39 2.02 1.00 35.45 # Look at Q-Q plots for both the raw and log-transformed data #------------------------------------------------------------ dev.new() with(EPA.09.Ex.12.4.naphthalene.df, qqPlot(Naphthalene.ppb, add.line = TRUE, main = "Figure 12-6. Naphthalene Probability Plot")) dev.new() with(EPA.09.Ex.12.4.naphthalene.df, qqPlot(Naphthalene.ppb, dist = "lnorm", add.line = TRUE, main = "Figure 12-7. Log Naphthalene Probability Plot")) # Test for 2 potential outliers on the original scale: #----------------------------------------------------- with(EPA.09.Ex.12.4.naphthalene.df, rosnerTest(Naphthalene.ppb, k = 2)) #Results of Outlier Test #------------------------- # #Test Method: Rosner's Test for Outliers # #Hypothesized Distribution: Normal # #Data: Naphthalene.ppb # #Sample Size: 25 # #Test Statistics: R.1 = 3.930957 # R.2 = 4.160223 # #Test Statistic Parameter: k = 2 # #Alternative Hypothesis: Up to 2 observations are not # from the same Distribution. # #Type I Error: 5% # #Number of Outliers Detected: 2 # # i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier #1 0 6.44240 7.379271 35.45 25 3.930957 2.821681 TRUE #2 1 5.23375 4.325790 23.23 13 4.160223 2.801551 TRUE #---------- # Clean up graphics.off()
serialCorrelationTest
is a generic function used to test for the
presence of lag-one serial correlation using either the rank
von Neumann ratio test, the normal approximation based on the Yule-Walker
estimate of lag-one correlation, or the normal approximation based on the
MLE of lag-one correlation. The function invokes particular
methods
which depend on the class
of the first
argument.
Currently, there is a default method and a method for objects of class "lm"
.
serialCorrelationTest(x, ...) ## Default S3 method: serialCorrelationTest(x, test = "rank.von.Neumann", alternative = "two.sided", conf.level = 0.95, ...) ## S3 method for class 'lm' serialCorrelationTest(x, test = "rank.von.Neumann", alternative = "two.sided", conf.level = 0.95, ...)
x |
numeric vector of observations, a numeric univariate time series of class "ts", or an object of class "lm" (see Details). |
test |
character string indicating which test to use. The possible values are "rank.von.Neumann" (the default), "AR1.yw", and "AR1.mle". See the Details section below. |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are "two.sided" (the default), "greater", and "less". |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence interval for the population lag-one autocorrelation. The default value is conf.level=0.95. |
... |
optional arguments for possible future methods. Currently not used. |
Let x_1, x_2, \ldots, x_n denote n observations from a stationary time series sampled at equispaced points in time with normal (Gaussian) errors. The function
serialCorrelationTest
tests the null hypothesis:

H_0: \rho = 0 \;\;\;\; (1)

where \rho denotes the true lag-1 autocorrelation (also called the lag-1 serial correlation coefficient). Actually, the null hypothesis is that the lag-k autocorrelation is 0 for all values of k greater than 0 (i.e., the time series is purely random).
In the case when the argument x
is a linear model, the function
serialCorrelationTest
tests the null hypothesis (1) for the
residuals.
The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

H_a: \rho > 0 \;\;\;\; (2)

the lower one-sided alternative (alternative="less"):

H_a: \rho < 0 \;\;\;\; (3)

and the two-sided alternative:

H_a: \rho \ne 0 \;\;\;\; (4)
Testing the Null Hypothesis of No Lag-1 Autocorrelation
There are several possible methods for testing the null hypothesis (1) versus any
of the three alternatives (2)-(4). The function serialCorrelationTest
allows
you to use one of three possible tests:
The rank von Neumann ratio test.
The test based on the normal approximation for the distribution of the Yule-Walker estimate of lag-one correlation.
The test based on the normal approximation for the distribution of the maximum likelihood estimate (MLE) of lag-one correlation.
Each of these tests is described below.
Test Based on Yule-Walker Estimate (test="AR1.yw"
)
The Yule-Walker estimate of the lag-1 autocorrelation is given by:

\hat{\rho} = \frac{\hat{\gamma}_1}{\hat{\gamma}_0} \;\;\;\; (5)

where

\hat{\gamma}_k = \frac{1}{n} \sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x}) \;\;\;\; (6)

is the estimate of the lag-k autocovariance. (This estimator does not allow for missing values.)

Under the null hypothesis (1), the estimator of lag-1 correlation in Equation (5) is approximately distributed as a normal (Gaussian) random variable with mean 0 and variance given by:

Var(\hat{\rho}) \approx \frac{1}{n} \;\;\;\; (7)

(Box and Jenkins, 1976, pp.34-35). Thus, the null hypothesis (1) can be tested with the statistic

z = \sqrt{n} \, \hat{\rho} \;\;\;\; (8)

which is distributed approximately as a standard normal random variable under the null hypothesis that the lag-1 autocorrelation is 0.
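To make the computation concrete, here is a minimal sketch (simulated data; variable names are purely illustrative) that carries out the Yule-Walker calculation and the z-test described above directly:

# Simulate a purely random series.
set.seed(1)
x <- rnorm(50)
n <- length(x)
x.bar <- mean(x)

# Yule-Walker estimates of the lag-0 and lag-1 autocovariances.
gamma0 <- sum((x - x.bar)^2) / n
gamma1 <- sum((x[-n] - x.bar) * (x[-1] - x.bar)) / n

# Yule-Walker estimate of the lag-1 autocorrelation and the z-statistic.
rho.hat <- gamma1 / gamma0
z <- sqrt(n) * rho.hat           # Var(rho.hat) is approximately 1/n under H0
p.value <- 2 * pnorm(-abs(z))    # two-sided p-value
c(rho.hat = rho.hat, z = z, p.value = p.value)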
Test Based on the MLE (test="AR1.mle"
)
The function serialCorrelationTest uses the R function arima to
compute the MLE of the lag-one autocorrelation and the estimated variance of this
estimator. As for the test based on the Yule-Walker estimate, the z-statistic is
computed as the estimated lag-one autocorrelation divided by the square root of the
estimated variance.
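A minimal sketch of the same idea using arima directly (simulated data; serialCorrelationTest handles the details, including missing values, internally):

# Simulate an AR(1) series with moderate positive autocorrelation.
set.seed(7)
y <- as.numeric(arima.sim(model = list(ar = 0.4), n = 60))

# MLE of the lag-1 autocorrelation and its estimated standard error.
fit <- arima(y, order = c(1, 0, 0))
rho.mle <- coef(fit)["ar1"]
se.mle <- sqrt(vcov(fit)["ar1", "ar1"])

# z-statistic and two-sided p-value.
z <- unname(rho.mle / se.mle)
c(rho = unname(rho.mle), z = z, p.value = 2 * pnorm(-abs(z)))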
Test Based on Rank von Neumann Ratio (test="rank.von.Neumann"
)
The null distribution of the serial correlation coefficient may be badly affected
by departures from normality in the underlying process (Cox, 1966; Bartels, 1977).
It is therefore a good idea to consider using a nonparametric test for randomness if
the normality of the underlying process is in doubt (Bartels, 1982).
Wald and Wolfowitz (1943) introduced the rank serial correlation coefficient, which for lag-1 autocorrelation is simply the Yule-Walker estimate (Equation (5) above) with the actual observations replaced with their ranks.
von Neumann et al. (1941) introduced a test for randomness in the context of testing for trend in the mean of a process. Their statistic is given by:

V = \frac{\sum_{i=2}^{n} (x_i - x_{i-1})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \;\;\;\; (9)

which is the ratio of the sum of squared successive differences to the usual sum of squared deviations from the mean. This statistic is bounded between 0 and 4, and for a purely random process is symmetric about 2. Small values of this statistic indicate possible positive autocorrelation, and large values of this statistic indicate possible negative autocorrelation. Durbin and Watson (1950, 1951, 1971) proposed using this statistic in the context of checking the independence of residuals from a linear regression model and provided tables for the distribution of this statistic. This statistic is therefore often called the "Durbin-Watson statistic" (Draper and Smith, 1998, p.181).
The rank version of the von Neumann ratio statistic is given by:

V_{rank} = \frac{\sum_{i=2}^{n} (R_i - R_{i-1})^2}{\sum_{i=1}^{n} (R_i - \bar{R})^2} \;\;\;\; (10)

where R_i denotes the rank of the i'th observation (Bartels, 1982). (This test statistic does not allow for missing values.) In the absence of ties, the denominator of this test statistic is equal to n(n^2 - 1)/12. The exact bounds of the test statistic depend on whether n is even or odd, with only a negligible adjustment needed for odd n (Bartels, 1982); asymptotically the range is from 0 to 4, just as for the test statistic in Equation (9) above.
Bartels (1982) shows that asymptotically, the rank von Neumann ratio statistic is a linear transformation of the rank serial correlation coefficient, so any asymptotic results apply to both statistics.
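The rank von Neumann ratio statistic itself is straightforward to compute; a minimal sketch (simulated data with no ties):

# Compute the RVN statistic by hand for a purely random series.
set.seed(2)
x <- rnorm(30)
r <- rank(x)
RVN <- sum(diff(r)^2) / sum((r - mean(r))^2)
RVN                                   # should be near 2 for a random series

# In the absence of ties the denominator equals n * (n^2 - 1) / 12.
n <- length(x)
all.equal(sum((r - mean(r))^2), n * (n^2 - 1) / 12)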
For any fixed sample size n, the exact distribution of the statistic in Equation (10) above can be computed by simply computing the value of the statistic for all possible permutations of the serial order of the ranks. Based on this exact distribution, Bartels (1982) presents a table of critical values for the numerator of the RVN statistic for sample sizes between 4 and 10.

Determining the exact distribution of the statistic becomes impractical as the sample size increases. For values of n between 10 and 100, Bartels (1982) approximated the distribution of the statistic by a beta distribution over the range 0 to 4, with shape parameters shape1 and shape2 that depend only on the sample size n (see Bartels, 1982, for the exact expressions).
Bartels (1982) checked this approximation by simulating the distribution of the RVN statistic for two sample sizes and comparing the empirical quantiles at five probability levels with the approximated quantiles based on the beta distribution. He found that the quantiles agreed to 2 decimal places for eight of the 10 values, and differed only slightly for the other two values.
Note: The definition of the beta distribution assumes the random variable ranges from 0 to 1. This definition can be generalized as follows. Suppose the random variable Y has a beta distribution over the range a \le y \le b, with shape parameters \nu and \omega. Then the random variable Z defined as:

Z = \frac{Y - a}{b - a}

has the "standard beta distribution" as described in the help file for Beta (Johnson et al., 1995, p.210).
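As a quick numerical illustration of this rescaling (the interval endpoints and shape parameters here are arbitrary):

# A beta variate on [0, 4] rescaled back to the standard interval [0, 1].
set.seed(10)
a <- 0; b <- 4
y <- a + (b - a) * rbeta(5, shape1 = 3, shape2 = 3)   # beta on [a, b]
z <- (y - a) / (b - a)                                # standard beta on [0, 1]
cbind(y, z)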
Bartels (1982) shows that asymptotically, the RVN statistic has a normal distribution with mean 2 and variance 4/n, but notes that a slightly better approximation is given by using a variance of 20/(5n + 7).
To test the null hypothesis (1) when test="rank.von.Neumann"
, the function serialCorrelationTest
does the following:
When the sample size is between 3 and 10, the exact distribution of the RVN statistic is used to compute the p-value.

When the sample size is between 11 and 100, the beta approximation to the distribution of the RVN statistic is used to compute the p-value.

When the sample size is larger than 100, the normal approximation to the distribution of the RVN statistic is used to compute the p-value. (This uses the variance 20/(5n + 7).)
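A minimal sketch (simulated data; it assumes the EnvStats package is loaded) showing that the reported test name records which of these approximations was used:

# The method component of the returned "htest" object names the test and
# the approximation used for the p-value.
set.seed(6)
serialCorrelationTest(rnorm(8))$method     # n <= 10: exact distribution
serialCorrelationTest(rnorm(50))$method    # 11 <= n <= 100: beta approximation
serialCorrelationTest(rnorm(150))$method   # n > 100: normal approximation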
When ties are present in the observations and midranks are used for the tied
observations, the distribution of the statistic based on the
assumption of no ties is not applicable. If the number of ties is small, however,
they may not grossly affect the assumed p-value.
When ties are present, the function serialCorrelationTest
issues a warning.
When the sample size is between 3 and 10, the p-value is computed based on rounding up the computed value of the RVN statistic to the nearest possible value that could be observed in the case of no ties.
Computing a Confidence Interval for the Lag-1 Autocorrelation
The function serialCorrelationTest computes an approximate (1 - \alpha)100\% confidence interval for the lag-1 autocorrelation as follows:

[\hat{\rho} - z_{1-\alpha/2} \hat{\sigma}_{\hat{\rho}}, \;\; \hat{\rho} + z_{1-\alpha/2} \hat{\sigma}_{\hat{\rho}}]

where \hat{\sigma}_{\hat{\rho}} denotes the estimated standard deviation of the estimated lag-1 autocorrelation and z_p denotes the p'th quantile of the standard normal distribution.
When test="AR1.yw" or test="rank.von.Neumann", the Yule-Walker estimate of lag-1 autocorrelation is used, and the variance of the estimated lag-1 autocorrelation is approximately:

Var(\hat{\rho}) \approx \frac{1 - \rho^2}{n}

(Box and Jenkins, 1976, p.34), so

\hat{\sigma}_{\hat{\rho}} = \sqrt{\frac{1 - \hat{\rho}^2}{n}}
When test="AR1.mle"
, the MLE of the lag-1 autocorrelation is used, and its
standard deviation is estimated with the square root of the estimated variance
returned by arima
.
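A minimal sketch of this confidence interval computed by hand for the Yule-Walker case (simulated data; variable names are illustrative):

# Simulate an AR(1) series.
set.seed(5)
y <- as.numeric(arima.sim(model = list(ar = 0.6), n = 80))
n <- length(y)
y.bar <- mean(y)

# Yule-Walker estimate of the lag-1 autocorrelation (the 1/n factors cancel).
rho.hat <- sum((y[-n] - y.bar) * (y[-1] - y.bar)) / sum((y - y.bar)^2)

# Approximate 95% confidence interval based on the normal approximation.
se <- sqrt((1 - rho.hat^2) / n)
rho.hat + c(-1, 1) * qnorm(0.975) * se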
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
Data collected over time on the same phenomenon are called a time series. A time series is usually modeled as a single realization of a stochastic process; that is, if we could go back in time and repeat the experiment, we would get different results that would vary according to some probabilistic law. The simplest kind of time series is a stationary time series, in which the mean value is constant over time, the variability of the observations is constant over time, etc. That is, the probability distribution associated with each future observation is the same.
A common concern in applying standard statistical tests to time series data is the assumption of independence. Most conventional statistical hypothesis tests assume the observations are independent, but data collected sequentially in time may not satisfy this assumption. For example, high observations may tend to follow high observations (positive serial correlation), or low observations may tend to follow high observations (negative serial correlation). One way to investigate the assumption of independence is to estimate the lag-one serial correlation and test whether it is significantly different from 0.
The null distribution of the serial correlation coefficient may be badly affected by departures from normality in the underlying process (Cox, 1966; Bartels, 1977). It is therefore a good idea to consider using a nonparametric test for randomness if the normality of the underlying process is in doubt (Bartels, 1982). Knoke (1977) showed that under normality, the test based on the rank serial correlation coefficient (and hence the test based on the rank von Neumann ratio statistic) has asymptotic relative efficiency of 0.91 with respect to using the test based on the ordinary serial correlation coefficient against the alternative of first-order autocorrelation.
Bartels (1982) performed an extensive simulation study of the power of the rank von Neumann ratio test relative to the standard von Neumann ratio test (based on the statistic in Equation (9) above) and the runs test (Lehmann, 1975, 313-315). He generated a first-order autoregressive process for sample sizes of 10, 25, and 50, using 6 different parent distributions: normal, Cauchy, contaminated normal, Johnson, Stable, and exponential. Values of lag-1 autocorrelation ranged from -0.8 to 0.8. Bartels (1982) found three important results:
The rank von Neumann ratio test is far more powerful than the runs test.
For the normal process, the power of the rank von Neumann ratio test was never less than 89% of the power of the standard von Neumann ratio test.
For non-normal processes, the rank von Neumann ratio test was often much more powerful than the standard von Neumann ratio test.
Steven P. Millard ([email protected])
Bartels, R. (1982). The Rank Version of von Neumann's Ratio Test for Randomness. Journal of the American Statistical Association 77(377), 40–46.
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and G.M. Jenkins. (1976). Time Series Analysis: Forecasting and Control. Prentice Hall, Englewood Cliffs, NJ, Chapter 2.
Cox, D.R. (1966). The Null Distribution of the First Serial Correlation Coefficient. Biometrika 53, 623–626.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.69-70;181-192.
Durbin, J., and G.S. Watson. (1950). Testing for Serial Correlation in Least Squares Regression I. Biometrika 37, 409–428.
Durbin, J., and G.S. Watson. (1951). Testing for Serial Correlation in Least Squares Regression II. Biometrika 38, 159–178.
Durbin, J., and G.S. Watson. (1971). Testing for Serial Correlation in Least Squares Regression III. Biometrika 58, 1–19.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.250–253.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapter 25.
Knoke, J.D. (1975). Testing for Randomness Against Autocorrelation Alternatives: The Parametric Case. Biometrika 62, 571–575.
Knoke, J.D. (1977). Testing for Randomness Against Autocorrelation Alternatives: Alternative Tests. Biometrika 64, 523–529.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, 457pp.
von Neumann, J., R.H. Kent, H.R. Bellinson, and B.I. Hart. (1941). The Mean Square Successive Difference. Annals of Mathematical Statistics 12(2), 153–162.
Wald, A., and J. Wolfowitz. (1943). An Exact Test for Randomness in the Non-Parametric Case Based on Serial Correlation. Annals of Mathematical Statistics 14, 378–388.
htest.object, acf, ar, arima, arima.sim, ts.plot, plot.ts, lag.plot, Hypothesis Tests.
# Generate a purely random normal process, then use serialCorrelationTest # to test for the presence of correlation. # (Note: the call to set.seed allows you to reproduce this example.) set.seed(345) x <- rnorm(100) # Look at the data #----------------- dev.new() ts.plot(x) dev.new() acf(x) # Test for serial correlation #---------------------------- serialCorrelationTest(x) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: Rank von Neumann Test for # Lag-1 Autocorrelation # (Beta Approximation) # #Estimated Parameter(s): rho = 0.02773737 # #Estimation Method: Yule-Walker # #Data: x # #Sample Size: 100 # #Test Statistic: RVN = 1.929733 # #P-value: 0.7253405 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.1681836 # UCL = 0.2236584 # Clean up #--------- rm(x) graphics.off() #========== # Now use the R function arima.sim to generate an AR(1) process with a # lag-1 autocorrelation of 0.8, then test for autocorrelation. set.seed(432) y <- arima.sim(model = list(ar = 0.8), n = 100) # Look at the data #----------------- dev.new() ts.plot(y) dev.new() acf(y) # Test for serial correlation #---------------------------- serialCorrelationTest(y) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: Rank von Neumann Test for # Lag-1 Autocorrelation # (Beta Approximation) # #Estimated Parameter(s): rho = 0.835214 # #Estimation Method: Yule-Walker # #Data: y # #Sample Size: 100 # #Test Statistic: RVN = 0.3743174 # #P-value: 0 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.7274307 # UCL = 0.9429973 #---------- # Clean up #--------- rm(y) graphics.off() #========== # The data frame Air.df contains information on ozone (ppb^1/3), # radiation (langleys), temperature (degrees F), and wind speed (mph) # for 153 consecutive days between May 1 and September 30, 1973. # First test for serial correlation in (the cube root of) ozone. # Note that we must use the test based on the MLE because the time series # contains missing values. Serial correlation appears to be present. # Next fit a linear model that includes the predictor variables temperature, # radiation, and wind speed, and test for the presence of serial correlation # in the residuals. There is no evidence of serial correlation. # Look at the data #----------------- Air.df # ozone radiation temperature wind #05/01/1973 3.448217 190 67 7.4 #05/02/1973 3.301927 118 72 8.0 #05/03/1973 2.289428 149 74 12.6 #05/04/1973 2.620741 313 62 11.5 #05/05/1973 NA NA 56 14.3 #... 
#09/27/1973 NA 145 77 13.2 #09/28/1973 2.410142 191 75 14.3 #09/29/1973 2.620741 131 76 8.0 #09/30/1973 2.714418 223 68 11.5 #---------- # Test for serial correlation #---------------------------- with(Air.df, serialCorrelationTest(ozone, test = "AR1.mle")) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: z-Test for # Lag-1 Autocorrelation # (Wald Test Based on MLE) # #Estimated Parameter(s): rho = 0.5641616 # #Estimation Method: Maximum Likelihood # #Data: ozone # #Sample Size: 153 # #Number NA/NaN/Inf's: 37 # #Test Statistic: z = 7.586952 # #P-value: 3.28626e-14 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = 0.4184197 # UCL = 0.7099034 #---------- # Next fit a linear model that includes the predictor variables temperature, # radiation, and wind speed, and test for the presence of serial correlation # in the residuals. Note setting the argument na.action = na.exclude in the # call to lm to correctly deal with missing values. #---------------------------------------------------------------------------- lm.ozone <- lm(ozone ~ radiation + temperature + wind + I(temperature^2) + I(wind^2), data = Air.df, na.action = na.exclude) # Now test for serial correlation in the residuals. #-------------------------------------------------- serialCorrelationTest(lm.ozone, test = "AR1.mle") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: rho = 0 # #Alternative Hypothesis: True rho is not equal to 0 # #Test Name: z-Test for # Lag-1 Autocorrelation # (Wald Test Based on MLE) # #Estimated Parameter(s): rho = 0.1298024 # #Estimation Method: Maximum Likelihood # #Data: Residuals # #Data Source: lm.ozone # #Sample Size: 153 # #Number NA/NaN/Inf's: 42 # #Test Statistic: z = 1.285963 # #P-value: 0.1984559 # #Confidence Interval for: rho # #Confidence Interval Method: Normal Approximation # #Confidence Interval Type: two-sided # #Confidence Level: 95% # #Confidence Interval: LCL = -0.06803223 # UCL = 0.32763704 # Clean up #--------- rm(lm.ozone)
Estimate the median, test the null hypothesis that the median is equal to a user-specified value based on the sign test, and create a confidence interval for the median.
signTest(x, y = NULL, alternative = "two.sided", mu = 0, paired = FALSE, conf.level = 0.95)
x |
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
y |
optional numeric vector of observations that are paired with the observations in x (see the paired argument). The default value is y=NULL. |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are "two.sided" (the default), "greater", and "less". |
mu |
numeric scalar indicating the hypothesized value of the median. The default value is mu=0. |
paired |
logical scalar indicating whether to perform a paired or one-sample sign test. The possible values are paired=FALSE (the default; a one-sample sign test) and paired=TRUE (a paired sign test). |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence interval for the population median. The default value is conf.level=0.95. |
One-Sample Case (paired=FALSE
)
Let x_1, x_2, \ldots, x_n denote a vector of n independent observations from one or more distributions that all have the same median \mu.
Consider the test of the null hypothesis:

H_0: \mu = \mu_0 \;\;\;\; (1)
The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater")

H_a: \mu > \mu_0 \;\;\;\; (2)

the lower one-sided alternative (alternative="less")

H_a: \mu < \mu_0 \;\;\;\; (3)

and the two-sided alternative (alternative="two.sided")

H_a: \mu \ne \mu_0 \;\;\;\; (4)
To perform the test of the null hypothesis (1) versus any of the three alternatives (2)-(4), the sign test uses the test statistic T, which is simply the number of observations that are greater than \mu_0 (Conover, 1980, p. 122; van Belle et al., 2004, p. 256; Hollander and Wolfe, 1999, p. 60; Lehmann, 1975, p. 120; Sheskin, 2011; Zar, 2010, p. 537). Under the null hypothesis, the distribution of T is a binomial random variable with parameters size=n and prob=0.5. Usually, however, cases for which the observations are equal to \mu_0 are discarded, so the distribution of T is taken to be binomial with parameters size=r and prob=0.5, where r denotes the number of observations not equal to \mu_0. The sign test only requires that the observations are independent and that they all come from one or more distributions (not necessarily the same ones) that all have the same population median.
For a two-sided alternative hypothesis (Equation (4)), the p-value is computed as:

p = Pr(X \le m) + Pr(X \ge r - m)

where X denotes a binomial random variable with parameters size=r and prob=0.5, and m is defined by:

m = \min(T, r - T)

For a one-sided lower alternative hypothesis (Equation (3)), the p-value is computed as:

p = Pr(X \le T)

and for a one-sided upper alternative hypothesis (Equation (2)), the p-value is computed as:

p = Pr(X \ge T)
It is obvious that the sign test is simply a special case of the
binomial test with p=0.5
.
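That connection can be seen directly; a minimal sketch (illustrative data; mu0 denotes the hypothesized median):

# Illustrative data and hypothesized median.
set.seed(3)
x <- rlnorm(15, meanlog = 2, sdlog = 1)
mu0 <- 5

# Discard observations equal to mu0, count those above it, and apply the
# binomial test with success probability 0.5.
x.kept <- x[x != mu0]
T.stat <- sum(x.kept > mu0)
r <- length(x.kept)
binom.test(T.stat, r, p = 0.5, alternative = "two.sided")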
Computing Confidence Intervals
Based on the relationship between hypothesis tests and confidence intervals,
we can construct a confidence interval for the population median based on the
sign test (e.g., Hollander and Wolfe, 1999, p. 72; Lehmann, 1975, p. 182).
It turns out that this is equivalent to using the formulas for a nonparametric confidence interval for the 0.5 quantile (see eqnpar).
Paired-Sample Case (paired=TRUE
)
When the argument paired=TRUE, the arguments x and y are assumed to have the same length, and the differences d_i = x_i - y_i (i = 1, 2, \ldots, n) are assumed to be independent observations from distributions with the same median \mu. The sign test can then be applied to the differences.
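A minimal sketch of the paired case (simulated before/after data, purely illustrative; it assumes the EnvStats package is loaded):

# Simulated paired measurements (e.g., before and after remediation).
set.seed(4)
before <- rlnorm(12, meanlog = 1, sdlog = 0.5)
after  <- before * rlnorm(12, meanlog = -0.2, sdlog = 0.3)

# The paired sign test is the one-sample sign test applied to the differences.
signTest(before, after, paired = TRUE)$p.value
signTest(before - after)$p.value          # same p-value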
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
A frequent question in environmental statistics is “Is the concentration of chemical X greater than Y units?”. For example, in groundwater assessment (compliance) monitoring at hazardous and solid waste sites, the concentration of a chemical in the groundwater at a downgradient well must be compared to a groundwater protection standard (GWPS). If the concentration is “above” the GWPS, then the site enters corrective action monitoring. As another example, soil screening at a Superfund site involves comparing the concentration of a chemical in the soil with a pre-determined soil screening level (SSL). If the concentration is “above” the SSL, then further investigation and possible remedial action is required. Determining what it means for the chemical concentration to be “above” a GWPS or an SSL is a policy decision: the average of the distribution of the chemical concentration must be above the GWPS or SSL, or the median must be above the GWPS or SSL, or the 95th percentile must be above the GWPS or SSL, or something else. Often, the first interpretation is used.
Hypothesis tests you can use to perform tests of location include: Student's t-test, Fisher's randomization test, the Wilcoxon signed rank test, Chen's modified t-test, the sign test, and a test based on a bootstrap confidence interval. For a discussion comparing the performance of these tests, see Millard and Neerchal (2001, pp.408-409).
Steven P. Millard ([email protected])
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, p.122
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods. Second Edition. John Wiley and Sons, New York, p.60.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, Oakland, CA, p.120.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.404–406.
Sheskin, D.J. (2011). Handbook of Parametric and Nonparametric Statistical Procedures. Fifth Edition. CRC Press, Boca Raton, FL.
van Belle, G., L.D. Fisher, P.J. Heagerty, and T. Lumley. (2004). Biostatistics: A Methodology for the Health Sciences. Second Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
wilcox.test
, Hypothesis Tests, eqnpar
,
htest.object
.
# Generate 10 observations from a lognormal distribution with parameters # meanlog=2 and sdlog=1. The median of this distribution is e^2 (about 7.4). # Test the null hypothesis that the true median is equal to 5 against the # alternative that the true median is not equal to 5. # (Note: the call to set.seed allows you to reproduce this example). set.seed(23) dat <- rlnorm(10, meanlog = 2, sdlog = 1) signTest(dat, mu = 5) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: median = 5 # #Alternative Hypothesis: True median is not equal to 5 # #Test Name: Sign test # #Estimated Parameter(s): median = 19.21717 # #Data: dat # #Test Statistic: # Obs > median = 9 # #P-value: 0.02148438 # #Confidence Interval for: median # #Confidence Interval Method: exact # #Confidence Interval Type: two-sided # #Confidence Level: 93.45703% # #Confidence Limit Rank(s): 3 9 # #Confidence Interval: LCL = 7.732538 # UCL = 35.722459 # Clean up #--------- rm(dat) #========== # The guidance document "Supplemental Guidance to RAGS: Calculating the # Concentration Term" (USEPA, 1992d) contains an example of 15 observations # of chromium concentrations (mg/kg) which are assumed to come from a # lognormal distribution. These data are stored in the vector # EPA.92d.chromium.vec. Here, we will use the sign test to test the null # hypothesis that the median chromium concentration is less than or equal to # 100 mg/kg vs. the alternative that it is greater than 100 mg/kg. The # estimated median is 110 mg/kg. There are 8 out of 15 observations greater # than 100 mg/kg, the p-value is equal to 0.5, and the lower 94% confidence # limit is 41 mg/kg. signTest(EPA.92d.chromium.vec, mu = 100, alternative = "greater") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: median = 100 # #Alternative Hypothesis: True median is greater than 100 # #Test Name: Sign test # #Estimated Parameter(s): median = 110 # #Data: EPA.92d.chromium.vec # #Test Statistic: # Obs > median = 8 # #P-value: 0.5 # #Confidence Interval for: median # #Confidence Interval Method: exact # #Confidence Interval Type: lower # #Confidence Level: 94.07654% # #Confidence Limit Rank(s): 5 # #Confidence Interval: LCL = 41 # UCL = Inf
Simulate a multivariate matrix of random numbers from specified theoretical probability distributions and/or empirical probability distributions based on a specified rank correlation matrix, using either Latin Hypercube sampling or simple random sampling.
simulateMvMatrix(n, distributions = c(Var.1 = "norm", Var.2 = "norm"), param.list = list(Var.1 = list(mean = 0, sd = 1), Var.2 = list(mean = 0, sd = 1)), cor.mat = diag(length(distributions)), sample.method = "SRS", seed = NULL, left.tail.cutoff = ifelse(is.finite(supp.min), 0, .Machine$double.eps), right.tail.cutoff = ifelse(is.finite(supp.max), 0, .Machine$double.eps), tol.1 = .Machine$double.eps, tol.symmetry = .Machine$double.eps, tol.recip.cond.num = .Machine$double.eps, max.iter = 10)
n |
a positive integer indicating the number of random vectors (i.e., the number of rows of the matrix) to generate. |
distributions |
a character vector of length Alternatively, the character string |
param.list |
a list containing Alternatively, if you specify an empirical distribution for the |
cor.mat |
a |
sample.method |
a character vector of length 1 or |
seed |
integer to supply to the R function |
left.tail.cutoff |
a numeric vector of length |
right.tail.cutoff |
a numeric vector of length |
tol.1 |
a positive numeric scalar indicating the allowable absolute deviation
from 1 for the diagonal elements of |
tol.symmetry |
a positive numeric scalar indicating the allowable absolute deviation from
0 for the difference between symmetric elements of |
tol.recip.cond.num |
a positive numeric scalar indicating the allowable minimum value of the
reciprocal of the condition number for |
max.iter |
a positive integer indicating the maximum number of iterations to use to
produce the |
Motivation

In risk assessment and Monte Carlo simulation, the outcome variable of interest, say Y, is usually some function of one or more other random variables:

Y = h(X_1, X_2, …, X_k)    (1)

For example, Y may be the incremental lifetime cancer risk due to ingestion of soil contaminated with benzene (Thompson et al., 1992; Hamed and Bedient, 1997). In this case the random vector X = (X_1, X_2, …, X_k) may represent observations from several kinds of distributions that characterize exposure and dose-response, such as benzene concentration in the soil, soil ingestion rate, average body weight, the cancer potency factor for benzene, etc. These distributions may or may not be assumed to be independent of one another (Smith et al., 1992; Bukowski et al., 1995). Often, input variables in a Monte Carlo simulation are in fact known to be correlated, such as body weight and dermal area.

Characterizing the joint distribution of a random vector X, where different elements of X come from different distributions, is usually mathematically complex or impossible unless the elements (random variables) of X are independent.
Iman and Conover (1982) present an algorithm for creating a set of
multivariate observations with a rank correlation matrix that is approximately
equal to a specified rank correlation matrix. This method allows for different
probability distributions for each element of the multivariate vector. The
details of this algorithm are as follows.
Algorithm

1. Specify n, the desired number of random vectors (i.e., the number of rows of the n × k output matrix). This is specified by the argument n for the function simulateMvMatrix.

2. Create C, the desired k × k rank correlation matrix. This is specified by the argument cor.mat.

3. Compute

C = P P'    (2)

where P is a k × k lower triangular matrix and P' denotes the transpose of P. The function simulateMvMatrix uses the Cholesky decomposition to compute P (see the R help file for chol).

4. Create R, an n × k matrix, whose columns represent k independent permutations of van der Waerden scores. That is, each column of R is a random permutation of the scores

Φ⁻¹(i / (n + 1)),  i = 1, 2, …, n    (3)

where Φ denotes the cumulative distribution function of the standard normal distribution.

5. Compute T, the k × k Pearson sample correlation matrix of R. Make sure T is positive definite; if it is not, then repeat step 4.

6. Compute

T = Q Q'    (4)

where Q is a k × k lower triangular matrix. The function simulateMvMatrix uses the Cholesky decomposition to compute Q (see the R help file for chol).

7. Compute the k × k lower triangular matrix S, where

S = P Q⁻¹    (5)

8. Compute the matrix R*, where

R* = R S'    (6)

9. Generate an n × k matrix of random numbers X, where each column of X comes from the distribution specified by the arguments distributions and param.list. Generate each column of random numbers independently of the other columns. If the j'th element of sample.method equals "SRS", use simple random sampling to generate the random numbers for the j'th column of X. If the j'th element of sample.method equals "LHS", use Latin Hypercube sampling to generate the random numbers for the j'th column of X. At this stage in the algorithm, the function simulateMvMatrix calls the function simulateVector to create each column of X.

10. Order the observations within each column of X so that the order of the ranks within each column of X matches the order of the ranks within each column of R*. This way, X and R* have exactly the same sample rank correlation matrix.
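The steps above can be sketched directly in R for the simple case of k = 2 columns and a target rank correlation of 0.8. This is only an illustration under assumed example distributions (a normal and a lognormal column); simulateMvMatrix handles the general case:

set.seed(1)
n <- 1000; k <- 2
C <- matrix(c(1, 0.8, 0.8, 1), 2, 2)           # Step 2: target correlation matrix
P <- t(chol(C))                                # Step 3: C = P P'
scores <- qnorm((1:n) / (n + 1))               # van der Waerden scores
R <- sapply(1:k, function(j) sample(scores))   # Step 4: independent permutations
T.mat <- cor(R)                                # Step 5: sample correlation of R
Q <- t(chol(T.mat))                            # Step 6: T = Q Q'
S <- P %*% solve(Q)                            # Step 7: S = P Q^(-1)
R.star <- R %*% t(S)                           # Step 8: R* = R S'
X <- cbind(rnorm(n, 10, 2), rlnorm(n))         # Step 9: independent columns
X.ord <- sapply(1:k, function(j)               # Step 10: reorder each column of X
  sort(X[, j])[rank(R.star[, j])])             #   to match the ranks within R*
round(cor(X.ord, method = "spearman"), 2)      # off-diagonal approximately 0.8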
Explanation

Iman and Conover (1982) present two algorithms for computing an n × k output matrix with a specified rank correlation. The algorithm presented above is the second, more complicated one. In order to explain the reasoning behind this algorithm, we need to explain the simple algorithm first.

Simple Algorithm

Let R_i denote the i'th row vector of the matrix R, the n × k matrix of scores. This row vector has a population correlation matrix of I_k, where I_k denotes the k × k identity matrix. Thus, the 1 × k vector R_i P' has a population correlation matrix equal to C. Therefore, if we define R* by

R* = R P'    (7)

each row of R* has the same multivariate distribution with population correlation matrix C. The rank correlation matrix of R* should therefore be close to C. Ordering the columns of X as described in Step 10 above will yield a matrix of observations with the specified distributions and the exact same rank correlation matrix as the rank correlation matrix of R*.

Iman and Conover (1982) use van der Waerden scores instead of raw ranks to create R because van der Waerden scores yield more "natural-looking" pairwise scatterplots.

If the Pearson sample correlation matrix of R, denoted T in Step 5 above, were exactly equal to the true population correlation matrix I_k, then the sample correlation matrix of R* would be exactly equal to C, and the rank correlation matrix of R* would be approximately equal to C. The Pearson sample correlation matrix of R, however, is an estimate of the true population correlation matrix I_k, and is therefore “bouncing around” I_k. Likewise, the Pearson sample correlation matrix of R* is an estimate of the true population correlation matrix C, and is therefore bouncing around C. Using this simple algorithm, the Pearson sample correlation matrix of R*, as R* is defined in Equation (7) above, may not be “close” enough to the desired rank correlation matrix C, and thus the rank correlation of X will not be close enough to C. Iman and Conover (1982) therefore present a more complicated algorithm.

More Complicated Algorithm

To get around the problem mentioned above, Iman and Conover (1982) find a k × k lower triangular matrix S such that the matrix R* as defined in Equation (6) above has a correlation matrix exactly equal to C. The formula for S is given in Steps 6 and 7 of the algorithm above.

Iman and Conover (1982, p.330) note that even if the desired rank correlation matrix C is in fact the identity matrix I_k, this method of generating the R* matrix will produce a matrix with an associated rank correlation that more closely resembles I_k than you would get by simply generating random numbers within each column of R.
A numeric matrix of dimension n × k of random numbers, where the j'th column of numbers comes from the distribution specified by the j'th elements of the arguments distributions and param.list, and where the rank correlation of this matrix is approximately equal to the argument cor.mat. The value of n is determined by the argument n, and the value of k is determined by the length of the argument distributions.
Monte Carlo simulation and risk assessment often involve looking at the distribution or characteristics of the distribution of some outcome variable that depends upon several input variables (see Equation (1) above). Usually these input variables can be considered random variables. An important part of both sensitivity analysis and uncertainty analysis involves looking at how the distribution of the outcome variable changes with changing assumptions on the input variables. One important assumption is the correlation between the input random variables.
Often, the input random variables are assumed to be independent when in fact they are known to be correlated (Smith et al., 1992; Bukowski et al., 1995). It is therefore important to assess the effect of the assumption of independence on the distribution of the outcome variable. One way to assess the effect of this assumption is to run the Monte Carlo simulation assuming independence and then also run it assuming certain forms of correlations among the input variables.
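As a hedged illustration of this point, the sketch below compares the upper tail of a hypothetical outcome Y = X1 * X2 when the two inputs are simulated as independent versus with a rank correlation of 0.7 (the lognormal inputs and the product model for Y are arbitrary choices for the example):

dists  <- c(X1 = "lnormAlt", X2 = "lnormAlt")
params <- list(X1 = list(mean = 10, cv = 1), X2 = list(mean = 10, cv = 1))

indep <- simulateMvMatrix(5000, distributions = dists, param.list = params,
  cor.mat = diag(2), seed = 42)
dep   <- simulateMvMatrix(5000, distributions = dists, param.list = params,
  cor.mat = matrix(c(1, 0.7, 0.7, 1), 2, 2), seed = 42)

quantile(indep[, 1] * indep[, 2], 0.95)   # 95th percentile of Y, independent inputs
quantile(dep[, 1] * dep[, 2], 0.95)       # typically larger with positively correlated inputs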
Iman and Davenport (1982) present a series of scatterplots showing “typical” scatterplots with various distributions on the x- and y-axes and various assumed rank correlations. These plots are meant to aid in developing reasonable estimates of rank correlation between input variables. These plots can easily be produced using the simulateMvMatrix and plot functions.
Steven P. Millard ([email protected])
Bukowski, J., L. Korn, and D. Wartenberg. (1995). Correlated Inputs in Quantitative Risk Assessment: The Effects of Distributional Shape. Risk Analysis 15(2), 215–219.
Hamed, M., and P.B. Bedient. (1997). On the Effect of Probability Distributions of Input Variables in Public Health Risk Assessment. Risk Analysis 17(1), 97–105.
Iman, R.L., and W.J. Conover. (1980). Small Sample Sensitivity Analysis Techniques for Computer Models, With an Application to Risk Assessment (with Comments). Communications in Statistics–Volume A, Theory and Methods, 9(17), 1749–1874.
Iman, R.L., and W.J. Conover. (1982). A Distribution-Free Approach to Inducing Rank Correlation Among Input Variables. Communications in Statistics–Volume B, Simulation and Computation, 11(3), 311–334.
Iman, R.L., and J.M. Davenport. (1982). Rank Correlation Plots For Use With Correlated Input Variables. Communications in Statistics–Volume B, Simulation and Computation, 11(3), 335–360.
Iman, R.L., and J.C. Helton. (1988). An Investigation of Uncertainty and Sensitivity Analysis Techniques for Computer Models. Risk Analysis 8(1), 71–90.
Iman, R.L. and J.C. Helton. (1991). The Repeatability of Uncertainty and Sensitivity Analyses for Complex Probabilistic Risk Assessments. Risk Analysis 11(4), 591–606.
McKay, M.D., R.J. Beckman., and W.J. Conover. (1979). A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code. Technometrics 21(2), 239–245.
Millard, S.P. (2013). EnvStats: an R Package for Environmental Statistics. Springer, New York. https://link.springer.com/book/10.1007/978-1-4614-8456-1.
Smith, A.E., P.B. Ryan, and J.S. Evans. (1992). The Effect of Neglecting Correlations When Propagating Uncertainty and Estimating the Population Distribution of Risk. Risk Analysis 12(4), 467–474.
Thompson, K.M., D.E. Burmaster, and E.A.C. Crouch. (1992). Monte Carlo Techniques for Quantitative Uncertainty Analysis in Public Health Risk Assessments. Risk Analysis 12(1), 53–63.
Vose, D. (2008). Risk Analysis: A Quantitative Guide. Third Edition. John Wiley & Sons, West Sussex, UK, 752 pp.
Probability Distributions and Random Numbers, Empirical,
simulateVector
, cor
, set.seed
.
# Generate 5 observations from a standard bivariate normal distribution # with a rank correlation matrix (approximately) equal to the 2 x 2 # identity matrix, using simple random sampling for each # marginal distribution. simulateMvMatrix(5, seed = 47) # Var.1 Var.2 #[1,] 0.01513086 0.03960243 #[2,] -1.08573747 0.09147291 #[3,] -0.98548216 0.49382018 #[4,] -0.25204590 -0.92245624 #[5,] -1.46575030 -1.82822917 #========== # Look at the observed rank correlation matrix for 100 observations # from a standard bivariate normal distribution with a rank correlation matrix # (approximately) equal to the 2 x 2 identity matrix. Compare this observed # rank correlation matrix with the observed rank correlation matrix based on # generating two independent sets of standard normal random numbers. # Note that the cross-correlation is closer to 0 for the matrix created with # simulateMvMatrix. cor(simulateMvMatrix(100, seed = 47), method = "spearman") # Var.1 Var.2 #Var.1 1.000000000 -0.005976598 #Var.2 -0.005976598 1.000000000 cor(matrix(simulateVector(200, seed = 47), 100 , 2), method = "spearman") # [,1] [,2] #[1,] 1.00000000 -0.05374137 #[2,] -0.05374137 1.00000000 #========== # Generate 1000 observations from a bivariate distribution, where the first # distribution is a normal distribution with parameters mean=10 and sd=2, # the second distribution is a lognormal distribution with parameters # mean=10 and cv=1, and the desired rank correlation between the two # distributions is 0.8. Look at the observed rank correlation matrix, and # plot the results. mat <- simulateMvMatrix(1000, distributions = c(N.10.2 = "norm", LN.10.1 = "lnormAlt"), param.list = list(N.10.2 = list(mean=10, sd=2), LN.10.1 = list(mean=10, cv=1)), cor.mat = matrix(c(1, .8, .8, 1), 2, 2), seed = 47) round(cor(mat, method = "spearman"), 2) # N.10.2 LN.10.1 #N.10.2 1.00 0.78 #LN.10.1 0.78 1.00 dev.new() plot(mat, xlab = "Observations from N(10, 2)", ylab = "Observations from LN(mean=10, cv=1)", main = "Lognormal vs. Normal Deviates with Rank Correlation 0.8") #---------- # Repeat the last example, but use Latin Hypercube sampling for both # distributions. Note the wider range on the y-axis. mat.LHS <- simulateMvMatrix(1000, distributions = c(N.10.2 = "norm", LN.10.1 = "lnormAlt"), param.list = list(N.10.2 = list(mean=10, sd=2), LN.10.1 = list(mean=10, cv=1)), cor.mat = matrix(c(1, .8, .8, 1), 2, 2), sample.method = "LHS", seed = 298) round(cor(mat.LHS, method = "spearman"), 2) # N.10.2 LN.10.1 #N.10.2 1.00 0.79 #LN.10.1 0.79 1.00 dev.new() plot(mat.LHS, xlab = "Observations from N(10, 2)", ylab = "Observations from LN(mean=10, cv=1)", main = paste("Lognormal vs. Normal Deviates with Rank Correlation 0.8", "(Latin Hypercube Sampling)", sep = "\n")) #========== # Generate 1000 observations from a multivariate distribution, where the # first distribution is a normal distribution with parameters # mean=10 and sd=2, the second distribution is a lognormal distribution # with parameters mean=10 and cv=1, the third distribution is a beta # distribution with parameters shape1=2 and shape2=3, and the fourth # distribution is an empirical distribution of 100 observations that # we'll generate from a Pareto distribution with parameters # location=10 and shape=2. 
Set the desired rank correlation matrix to: cor.mat <- matrix(c(1, .8, 0, .5, .8, 1, 0, .7, 0, 0, 1, .2, .5, .7, .2, 1), 4, 4) cor.mat # [,1] [,2] [,3] [,4] #[1,] 1.0 0.8 0.0 0.5 #[2,] 0.8 1.0 0.0 0.7 #[3,] 0.0 0.0 1.0 0.2 #[4,] 0.5 0.7 0.2 1.0 # Use Latin Hypercube sampling for each variable, look at the observed # rank correlation matrix, and plot the results. pareto.rns <- simulateVector(100, "pareto", list(location = 10, shape = 2), sample.method = "LHS", seed = 56) mat <- simulateMvMatrix(1000, distributions = c(Normal = "norm", Lognormal = "lnormAlt", Beta = "beta", Empirical = "emp"), param.list = list(Normal = list(mean=10, sd=2), Lognormal = list(mean=10, cv=1), Beta = list(shape1 = 2, shape2 = 3), Empirical = list(obs = pareto.rns)), cor.mat = cor.mat, seed = 47, sample.method = "LHS") round(cor(mat, method = "spearman"), 2) # Normal Lognormal Beta Empirical #Normal 1.00 0.78 -0.01 0.47 #Lognormal 0.78 1.00 -0.01 0.67 #Beta -0.01 -0.01 1.00 0.19 #Empirical 0.47 0.67 0.19 1.00 dev.new() pairs(mat) #========== # Clean up #--------- rm(mat, mat.LHS, pareto.rns) graphics.off()
Simulate a vector of random numbers from a specified theoretical probability distribution or empirical probability distribution, using either Latin Hypercube sampling or simple random sampling.
simulateVector(n, distribution = "norm", param.list = list(mean = 0, sd = 1), sample.method = "SRS", seed = NULL, sorted = FALSE, left.tail.cutoff = ifelse(is.finite(supp.min), 0, .Machine$double.eps), right.tail.cutoff = ifelse(is.finite(supp.max), 0, .Machine$double.eps))
n |
a positive integer indicating the number of random numbers to generate. |
distribution |
a character string denoting the distribution abbreviation. The default value is
Alternatively, the character string |
param.list |
a list with values for the parameters of the distribution.
The default value is Alternatively, if you specify an empirical distribution by setting |
sample.method |
a character string indicating whether to use simple random sampling |
seed |
integer to supply to the R function |
sorted |
logical scalar indicating whether to return the random numbers in sorted
(ascending) order. The default value is |
left.tail.cutoff |
a scalar between 0 and 1 indicating what proportion of the left-tail of
the probability distribution to omit for Latin Hypercube sampling.
For densities with a finite support minimum (e.g., Lognormal or
Empirical) the default value is |
right.tail.cutoff |
a scalar between 0 and 1 indicating what proportion of the right-tail of
the probability distribution to omit for Latin Hypercube sampling.
For densities with a finite support maximum (e.g., Beta or
Empirical) the default value is |
Simple Random Sampling (sample.method="SRS"
)
When sample.method="SRS"
, the function simulateVector
simply
calls the function rabb, where abb denotes the
abbreviation of the specified distribution (e.g., rlnorm
,
remp
, etc.).
Latin Hypercube Sampling (sample.method="LHS"
)
When sample.method="LHS"
, the function simulateVector
generates
n
random numbers using Latin Hypercube sampling. The distribution is
divided into n
intervals of equal probability and simple random
sampling is performed once within each interval; i.e., Latin Hypercube sampling
is simply stratified sampling without replacement, where the strata are defined
by the 0'th, 100(1/n)'th, 100(2/n)'th, ..., and 100'th percentiles of the
distribution.
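For a single variable, the stratification idea can be sketched in a few lines of base R. This sketch assumes a standard normal distribution and only illustrates the mechanism; simulateVector handles arbitrary distributions, including the tail-cutoff arguments:

set.seed(1)
n <- 10
u <- (sample(1:n) - 1 + runif(n)) / n   # one uniform draw from each of n equal-probability strata
lhs <- qnorm(u)                         # map the stratified uniforms through the quantile function
sort(lhs)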
Latin Hypercube sampling, sometimes abbreviated LHS,
is a method of sampling from a probability distribution that ensures all
portions of the probability distribution are represented in the sample.
It was introduced in the published literature by McKay et al. (1979) to overcome
the following problem in Monte Carlo simulation based on simple random sampling
(SRS). Suppose we want to generate n random numbers from a specified distribution. If we use simple random sampling, there is a low probability of getting very many observations in an area of low probability of the distribution. For example, if we generate n observations from the distribution, the probability that none of these observations falls above the 98th percentile of the distribution (i.e., into the upper 2% of the distribution) is 0.98^n. So, for example, there is about a 13% chance that out of 100 random numbers, none will fall at or above the 98th percentile. If we are interested in reproducing the shape of the distribution, we will need a very large number of observations to ensure that we can adequately characterize the tails of the distribution (Vose, 2008, pp. 59–62).
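A quick check of the probability quoted above:

0.98^100
#[1] 0.1326196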
See Millard (2013) for a visual explanation of Latin Hypercube sampling.
a numeric vector of random numbers from the specified distribution.
Latin Hypercube sampling, sometimes abbreviated LHS, is a method of sampling from a probability distribution that ensures all portions of the probability distribution are represented in the sample. It was introduced in the published literature by McKay et al. (1979). Latin Hypercube sampling is often used in probabilistic risk assessment, specifically for sensitivity and uncertainty analysis (e.g., Iman and Conover, 1980; Iman and Helton, 1988; Iman and Helton, 1991; Vose, 1996).
Steven P. Millard ([email protected])
Iman, R.L., and W.J. Conover. (1980). Small Sample Sensitivity Analysis Techniques for Computer Models, With an Application to Risk Assessment (with Comments). Communications in Statistics–Volume A, Theory and Methods, 9(17), 1749–1874.
Iman, R.L., and J.C. Helton. (1988). An Investigation of Uncertainty and Sensitivity Analysis Techniques for Computer Models. Risk Analysis 8(1), 71–90.
Iman, R.L. and J.C. Helton. (1991). The Repeatability of Uncertainty and Sensitivity Analyses for Complex Probabilistic Risk Assessments. Risk Analysis 11(4), 591–606.
McKay, M.D., R.J. Beckman., and W.J. Conover. (1979). A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output From a Computer Code. Technometrics 21(2), 239–245.
Millard, S.P. (2013). EnvStats: an R Package for Environmental Statistics. Springer, New York. https://link.springer.com/book/10.1007/978-1-4614-8456-1.
Vose, D. (2008). Risk Analysis: A Quantitative Guide. Third Edition. John Wiley & Sons, West Sussex, UK, 752 pp.
Probability Distributions and Random Numbers, Empirical,
simulateMvMatrix
, set.seed
.
# Generate 10 observations from a lognormal distribution with # parameters mean=10 and cv=1 using simple random sampling: simulateVector(10, distribution = "lnormAlt", param.list = list(mean = 10, cv = 1), seed = 47, sort = TRUE) # [1] 2.086931 2.863589 3.112866 5.592502 5.732602 7.160707 # [7] 7.741327 8.251306 12.782493 37.214748 #---------- # Repeat the above example by calling rlnormAlt directly: set.seed(47) sort(rlnormAlt(10, mean = 10, cv = 1)) # [1] 2.086931 2.863589 3.112866 5.592502 5.732602 7.160707 # [7] 7.741327 8.251306 12.782493 37.214748 #---------- # Now generate 10 observations from the same lognormal distribution # but use Latin Hypercube sampling. Note that the largest value # is larger than for simple random sampling: simulateVector(10, distribution = "lnormAlt", param.list = list(mean = 10, cv = 1), seed = 47, sample.method = "LHS", sort = TRUE) # [1] 2.406149 2.848428 4.311175 5.510171 6.467852 8.174608 # [7] 9.506874 12.298185 17.022151 53.552699 #========== # Generate 50 observations from a Pareto distribution with parameters # location=10 and shape=2, then use this resulting vector of # observations as the basis for generating 3 observations from an # empirical distribution using Latin Hypercube sampling: set.seed(321) pareto.rns <- rpareto(50, location = 10, shape = 2) simulateVector(3, distribution = "emp", param.list = list(obs = pareto.rns), sample.method = "LHS") #[1] 11.50685 13.50962 17.47335 #========== # Clean up #--------- rm(pareto.rns)
Ammonia nitrogen (NH3-N) concentration (mg/L) in the Skagit River measured monthly from January 1978 through December 2010 at the Marblemount, Washington monitoring station.
Skagit.NH3_N.df
A data frame with 396 observations on the following 6 variables.
Date
Date of collection.
NH3_N.Orig.mg.per.L
a character vector of the ammonia nitrogen concentrations where values for non-detects are preceded by the less-than sign (<).
NH3_N.mg.per.L
a numeric vector of ammonia nitrogen concentrations; non-detects have been coded to their detection limit.
DQ1
factor of data qualifier values.
U
= The analyte was not detected at or above the reported result.
J
= The analyte was positively identified. The associated numerical result is an estimate.
UJ
= The analyte was not detected at or above the reported estimated result.
DQ2
factor of data qualifier values.
An asterisk (*
) indicates a possible quality problem for the result.
Censored
a logical vector indicating which observations are censored.
Station 04A100 - Skagit R @ Marblemount. Located at the bridge on the Cascade River Road where Highway 20 (North Cascades Highway) turns 90 degrees in Marblemount.
Washington State Department of Ecology.
https://ecology.wa.gov/Research-Data/Monitoring-assessment/River-stream-monitoring/Water-quality-monitoring/Using-river-stream-water-quality-data
Compute the sample coefficient of skewness.
skewness(x, na.rm = FALSE, method = "fisher", l.moment.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0))
x |
numeric vector of observations. |
na.rm |
logical scalar indicating whether to remove missing values from |
method |
character string specifying what method to use to compute the sample coefficient
of skewness. The possible values are
|
l.moment.method |
character string specifying what method to use to compute the
|
plot.pos.cons |
numeric vector of length 2 specifying the constants used in the formula for
the plotting positions when |
Let x_1, x_2, …, x_n denote a random sample of n observations from some distribution with mean μ and standard deviation σ.
Product Moment Coefficient of Skewness (method="moment" or method="fisher")

The coefficient of skewness of a distribution is the third standardized moment about the mean:

skewness = μ_3 / σ^3

where

μ_r = E[(X − μ)^r]

denotes the r'th moment about the mean (the r'th central moment). That is, the coefficient of skewness is the third central moment divided by the cube of the standard deviation. The coefficient of skewness is 0 for a symmetric distribution. Distributions with positive skew have heavy right-hand tails, and distributions with negative skew have heavy left-hand tails.

When method="moment", the coefficient of skewness is estimated using the method of moments estimator for the third central moment and the method of moments estimator for the variance:

[ (1/n) Σ (x_i − x̄)^3 ] / [ (1/n) Σ (x_i − x̄)^2 ]^(3/2)

where the sums are taken over i = 1, …, n and x̄ denotes the sample mean. This form of estimation should be used when resampling (bootstrap or jackknife).

When method="fisher", the coefficient of skewness is estimated using the unbiased estimator for the third central moment (Serfling, 1980, p.73; Chen, 1995, p.769) and the unbiased estimator for the variance:

[ n / ((n − 1)(n − 2)) Σ (x_i − x̄)^3 ] / s^3

where

s^2 = [ 1 / (n − 1) ] Σ (x_i − x̄)^2

(Note that Serfling, 1980, p.73 contains a typographical error in the numerator for the unbiased estimator of the third central moment.)
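The method="fisher" estimate can be computed by hand as a check on the formulas above; a small sketch with arbitrary illustrative data (method="fisher" is the default, so the call to skewness() should agree):

x <- c(2.1, 3.5, 3.9, 4.4, 5.0, 6.2, 8.8, 15.3)
n <- length(x)
m3.unbiased <- n * sum((x - mean(x))^3) / ((n - 1) * (n - 2))  # unbiased third central moment
m3.unbiased / sd(x)^3    # Fisher coefficient of skewness
skewness(x)              # should agree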
L-Moment Coefficient of Skewness (method="l.moments")

Hosking (1990) defines the L-moment analog of the coefficient of skewness as:

τ_3 = λ_3 / λ_2

that is, the third L-moment divided by the second L-moment. He shows that this quantity lies in the interval (-1, 1).

When l.moment.method="unbiased", the L-skewness is estimated by:

t_3 = l_3 / l_2

that is, the unbiased estimator of the third L-moment divided by the unbiased estimator of the second L-moment.

When l.moment.method="plotting.position", the L-skewness is estimated by the plotting-position estimator of the third L-moment divided by the plotting-position estimator of the second L-moment.

See the help file for lMoment for more information on estimating L-moments.
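Assuming lMoment(x, r) returns the r'th sample L-moment (see the lMoment help file), the sample L-skewness is simply the ratio of the third to the second sample L-moment; a hedged sketch:

x <- c(2.1, 3.5, 3.9, 4.4, 5.0, 6.2, 8.8, 15.3)
lMoment(x, r = 3) / lMoment(x, r = 2)   # sample L-skewness
skewness(x, method = "l.moments")       # should agree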
A numeric scalar – the sample coefficient of skewness.
Traditionally, the coefficient of skewness has been estimated using product
moment estimators. Sometimes an estimate of skewness is used in a
goodness-of-fit test for normality (e.g., set test="skew"
in the call to gofTest
).
Hosking (1990) introduced the idea of L-moments and L-skewness.

Vogel and Fennessey (1993) argue that L-moment ratios should replace product moment ratios because of their superior performance (they are nearly unbiased and better for discriminating between distributions). They compare product moment diagrams with L-moment diagrams.

Hosking and Wallis (1995) recommend using unbiased estimators of L-moments (vs. plotting-position estimators) for almost all applications.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Chen, L. (1995). Testing the Mean of Skewed Distributions. Journal of the American Statistical Association 90(430), 767–772.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley and Sons, New York, p.73.
Taylor, J.K. (1990). Statistical Techniques for Data Analysis. Lewis Publishers, Boca Raton, FL.
Vogel, R.M., and N.M. Fennessey. (1993). L Moment Diagrams Should Replace Product Moment Diagrams. Water Resources Research 29(6), 1745–1752.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
var
, sd
, cv
,
kurtosis
, summaryFull
,
Summary Statistics.
# Generate 20 observations from a lognormal distribution with parameters # mean=10 and cv=1, and estimate the coefficient of skewness. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 1) skewness(dat) #[1] 0.9876632 skewness(dat, method = "moment") #[1] 0.9119889 skewness(dat, meth = "l.moment") #[1] 0.2656674 #---------- # Clean up rm(dat)
For a strip plot or scatterplot produced using the package ggplot2 (e.g., with geom_point), for each value on the x-axis, add text indicating the mean and standard deviation of the y-values for that particular x-value.
stat_mean_sd_text(mapping = NULL, data = NULL, geom = ifelse(text.box, "label", "text"), position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, y.pos = NULL, y.expand.factor = 0.2, digits = 1, digit.type = "round", nsmall = ifelse(digit.type == "round", digits, 0), text.box = FALSE, alpha = 1, angle = 0, color = "black", family = "", fontface = "plain", hjust = 0.5, label.padding = ggplot2::unit(0.25, "lines"), label.r = ggplot2::unit(0.15, "lines"), label.size = 0.25, lineheight = 1.2, size = 4, vjust = 0.5, ...)
mapping , data , position , na.rm , show.legend , inherit.aes
|
See the help file for |
geom |
Character string indicating which |
y.pos |
Numeric scalar indicating the |
y.expand.factor |
For the case when |
digits |
Integer indicating the number of digits to use for displaying the
mean and standard deviation. When |
digit.type |
Character string indicating whether the |
nsmall |
Integer passed to the function |
text.box |
Logical scalar indicating whether to surround the text with a text box (i.e.,
whether to use |
alpha , angle , color , family , fontface , hjust , vjust , lineheight , size
|
See the help file for |
label.padding , label.r , label.size
|
See the help file for |
... |
Other arguments passed on to |
See the help file for geom_text
for details about how
geom_text
and geom_label
work.
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html for information on how to create a new stat.
The function stat_mean_sd_text
is called by the function geom_stripchart
.
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
geom_stripchart
, stat_median_iqr_text
,
stat_n_text
, stat_test_text
,
geom_text
, geom_label
,
mean
, sd
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #==================== # Example 1: # Using the built-in data frame mtcars, # plot miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) + theme(legend.position = "none") p + geom_point() + labs(x = "Number of Cylinders", y = "Miles per Gallon") # Now add text indicating the mean and standard deviation # for each level of cylinder. #-------------------------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 2: # Repeat Example 1, but: # 1) facet by transmission type, # 2) make the size of the text smaller. #-------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text(size = 3) + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 3: # Repeat Example 1, but specify the y-position for the text. #----------------------------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text(y.pos = 36) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 4: # Repeat Example 1, but show the # mean and standard deviation in a text box. #------------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text(text.box = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 5: # Repeat Example 1, but use the color brown for the text. #-------------------------------------------------------- dev.new() p + geom_point() + stat_mean_sd_text(color = "brown") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 6: # Repeat Example 1, but: # 1) use the same colors for the text that are used for each group, # 2) use the bold monospaced font. #------------------------------------------------------------------ mat <- ggplot_build(p)$data[[1]] group <- mat[, "group"] colors <- mat[match(1:max(group), group), "colour"] dev.new() p + geom_point() + stat_mean_sd_text(color = colors, size = 5, family = "mono", fontface = "bold") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Clean up #--------- graphics.off() rm(p, mat, group, colors)
For a strip plot or scatterplot produced using the package ggplot2 (e.g., with geom_point), for each value on the x-axis, add text indicating the median and interquartile range (IQR) of the y-values for that particular x-value.
stat_median_iqr_text(mapping = NULL, data = NULL, geom = ifelse(text.box, "label", "text"), position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, y.pos = NULL, y.expand.factor = 0.2, digits = 1, digit.type = "round", nsmall = ifelse(digit.type == "round", digits, 0), text.box = FALSE, alpha = 1, angle = 0, color = "black", family = "", fontface = "plain", hjust = 0.5, label.padding = ggplot2::unit(0.25, "lines"), label.r = ggplot2::unit(0.15, "lines"), label.size = 0.25, lineheight = 1.2, size = 4, vjust = 0.5, ...)
mapping , data , position , na.rm , show.legend , inherit.aes
|
See the help file for |
geom |
Character string indicating which |
y.pos |
Numeric scalar indicating the |
y.expand.factor |
For the case when |
digits |
Integer indicating the number of digits to use for displaying the
median and interquartile range. When |
digit.type |
Character string indicating whether the |
nsmall |
Integer passed to the function |
text.box |
Logical scalar indicating whether to surround the text with a text box (i.e.,
whether to use |
alpha , angle , color , family , fontface , hjust , vjust , lineheight , size
|
See the help file for |
label.padding , label.r , label.size
|
See the help file for |
... |
Other arguments passed on to |
See the help file for geom_text
for details about how
geom_text
and geom_label
work.
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html for information on how to create a new stat.
The function stat_median_iqr_text
is called by the function geom_stripchart
.
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
geom_stripchart
, stat_mean_sd_text
,
stat_n_text
, stat_test_text
,
geom_text
, geom_label
,
median
, iqr
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #==================== # Example 1: # Using the built-in data frame mtcars, # plot miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) + theme(legend.position = "none") p + geom_point() + labs(x = "Number of Cylinders", y = "Miles per Gallon") # Now add text indicating the median and interquartile range # for each level of cylinder. #----------------------------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 2: # Repeat Example 1, but: # 1) facet by transmission type, # 2) make the size of the text smaller. #-------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text(size = 2.75) + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 3: # Repeat Example 1, but specify the y-position for the text. #----------------------------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text(y.pos = 36) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 4: # Repeat Example 1, but show the # median and interquartile range in a text box. #---------------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text(text.box = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 5: # Repeat Example 1, but use the color brown for the text. #-------------------------------------------------------- dev.new() p + geom_point() + stat_median_iqr_text(color = "brown") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Example 6: # Repeat Example 1, but: # 1) use the same colors for the text that are used for each group, # 2) use the bold monospaced font. #------------------------------------------------------------------ mat <- ggplot_build(p)$data[[1]] group <- mat[, "group"] colors <- mat[match(1:max(group), group), "colour"] dev.new() p + geom_point() + stat_median_iqr_text(color = colors, family = "mono", fontface = "bold") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #==================== # Clean up #--------- graphics.off() rm(p, mat, group, colors)
For a strip plot or scatterplot produced using the package ggplot2 (e.g., with geom_point), for each value on the x-axis, add text indicating the number of y-values for that particular x-value.
stat_n_text(mapping = NULL, data = NULL, geom = ifelse(text.box, "label", "text"), position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, y.pos = NULL, y.expand.factor = 0.1, text.box = FALSE, alpha = 1, angle = 0, color = "black", family = "", fontface = "plain", hjust = 0.5, label.padding = ggplot2::unit(0.25, "lines"), label.r = ggplot2::unit(0.15, "lines"), label.size = 0.25, lineheight = 1.2, size = 4, vjust = 0.5, ...)
mapping , data , position , na.rm , show.legend , inherit.aes
|
See the help file for |
geom |
Character string indicating which |
y.pos |
Numeric scalar indicating the |
y.expand.factor |
For the case when |
text.box |
Logical scalar indicating whether to surround the text with a text box (i.e.,
whether to use |
alpha , angle , color , family , fontface , hjust , vjust , lineheight , size
|
See the help file for |
label.padding , label.r , label.size
|
See the help file for |
... |
Other arguments passed on to |
See the help file for geom_text
for details about how
geom_text
and geom_label
work.
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html for information on how to create a new stat.
The function stat_n_text
is called by the function geom_stripchart
.
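Because stat_n_text is invoked automatically by geom_stripchart, a single call to geom_stripchart already displays the per-group sample sizes. A minimal sketch (assuming ggplot2 and EnvStats are attached and the default arguments of geom_stripchart, as described in its own help file):

library(ggplot2)
library(EnvStats)

# geom_stripchart() calls stat_n_text() internally, so the sample sizes
# appear without adding stat_n_text() to the plot explicitly.
ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) +
  geom_stripchart() +
  labs(x = "Number of Cylinders", y = "Miles per Gallon")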
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
geom_stripchart
, stat_mean_sd_text
,
stat_median_iqr_text
, stat_test_text
,
geom_text
, geom_label
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #==================== # Example 1: # Using the built-in data frame mtcars, # plot miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) + theme(legend.position = "none") p + geom_point() + labs(x = "Number of Cylinders", y = "Miles per Gallon") # Now add the sample size for each level of cylinder. #---------------------------------------------------- dev.new() p + geom_point() + stat_n_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 2: # Repeat Example 1, but: # 1) facet by transmission type, # 2) make the size of the text smaller. #-------------------------------------- dev.new() p + geom_point() + stat_n_text(size = 3) + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 3: # Repeat Example 1, but specify the y-position for the text. #----------------------------------------------------------- dev.new() p + geom_point() + stat_n_text(y.pos = 5) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 4: # Repeat Example 1, but show the sample size in a text box. #---------------------------------------------------------- dev.new() p + geom_point() + stat_n_text(text.box = TRUE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 5: # Repeat Example 1, but use the color brown for the text. #-------------------------------------------------------- dev.new() p + geom_point() + stat_n_text(color = "brown") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 6: # Repeat Example 1, but: # 1) use the same colors for the text that are used for each group, # 2) use the bold monospaced font. #------------------------------------------------------------------ mat <- ggplot_build(p)$data[[1]] group <- mat[, "group"] colors <- mat[match(1:max(group), group), "colour"] dev.new() p + geom_point() + stat_n_text(color = colors, size = 5, family = "mono", fontface = "bold") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Clean up #--------- graphics.off() rm(p, mat, group, colors)
For a strip plot or scatterplot produced using the package ggplot2
(e.g., with geom_point
),
add text indicating the results of a hypothesis test comparing locations
between groups, where the groups are defined based on the unique x-values.
stat_test_text(mapping = NULL, data = NULL, geom = ifelse(text.box, "label", "text"), position = "identity", na.rm = FALSE, show.legend = NA, inherit.aes = TRUE, y.pos = NULL, y.expand.factor = 0.35, test = "parametric", paired = FALSE, test.arg.list = list(), two.lines = TRUE, p.value.digits = 3, p.value.digit.type = "round", location.digits = 1, location.digit.type = "round", nsmall = ifelse(location.digit.type == "round", location.digits, 0), text.box = FALSE, alpha = 1, angle = 0, color = "black", family = "", fontface = "plain", hjust = 0.5, label.padding = ggplot2::unit(0.25, "lines"), label.r = ggplot2::unit(0.15, "lines"), label.size = 0.25, lineheight = 1.2, size = 4, vjust = 0.5, ...)
mapping , data , position , na.rm , show.legend , inherit.aes
|
See the help file for |
geom |
Character string indicating which |
y.pos |
Numeric scalar indicating the |
y.expand.factor |
For the case when |
test |
A character string indicating whether to use a standard parametric test
( |
paired |
For the case of two groups, a logical scalar indicating whether the data
should be considered to be paired. The default value is NOTE: if the argument |
test.arg.list |
An optional list of arguments to pass to the function used to test for
group differences in location. The default value is an empty list:
NOTE: If |
two.lines |
For the case of one or two groups, a logical scalar indicating whether the
associated confidence interval should be displayed on a second line
instead of on the same line as the p-value. The default is |
p.value.digits |
An integer indicating the number of digits to use for displaying the
p-value. When |
p.value.digit.type |
A character string indicating whether the |
location.digits |
For the case of one or two groups, an integer indicating the number of digits
to use for displaying the associated confidence interval.
When |
location.digit.type |
For the case of one or two groups, a character string indicating
whether the |
nsmall |
For the case of one or two groups, an integer passed to the function
|
text.box |
Logical scalar indicating whether to surround the text with a text box (i.e.,
whether to use |
alpha , angle , color , family , fontface , hjust , vjust , lineheight , size
|
See the help file for |
label.padding , label.r , label.size
|
See the help file for |
... |
Other arguments passed on to |
The table below shows which hypothesis test is performed based on the number of groups
and on the values of the arguments test and paired.

Number of Groups | test | paired | Name | Function Called
---|---|---|---|---
1 | "parametric" | | One-Sample t-test | t.test
 | "nonparametric" | | Wilcoxon Signed Rank Test | wilcox.test
2 | "parametric" | FALSE | Two-Sample t-test | t.test
 | "parametric" | TRUE | Paired t-test | t.test
 | "nonparametric" | FALSE | Wilcoxon Rank Sum Test | wilcox.test
 | "nonparametric" | TRUE | Wilcoxon Signed Rank Test on Paired Differences | wilcox.test
3 or more | "parametric" | | Analysis of Variance | aov and summary.aov
 | "nonparametric" | | Kruskal-Wallis Test | kruskal.test
See the help file for geom_text
for details about how
geom_text
and geom_label
work.
See the vignette Extending ggplot2 at https://cran.r-project.org/package=ggplot2/vignettes/extending-ggplot2.html for information on how to create a new stat.
The function stat_test_text
is called by the function geom_stripchart
.
Steven P. Millard ([email protected])
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (Use R!). Second Edition. Springer.
geom_stripchart
, stat_mean_sd_text
,
stat_median_iqr_text
, stat_n_text
,
geom_text
, geom_label
,
t.test
, wilcox.test
,
aov
, summary.aov
,
kruskal.test
.
# First, load and attach the ggplot2 package. #-------------------------------------------- library(ggplot2) #========== # Example 1: # Using the built-in data frame mtcars, # plot miles per gallon vs. number of cylinders # using different colors for each level of the number of cylinders. #------------------------------------------------------------------ p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, color = factor(cyl))) + theme(legend.position = "none") p + geom_point(show.legend = FALSE) + labs(x = "Number of Cylinders", y = "Miles per Gallon") # Now add text indicating the sample size and # mean and standard deviation for each level of cylinder, and # test for the difference in means between groups. #------------------------------------------------------------ dev.new() p + geom_point() + stat_n_text() + stat_mean_sd_text() + stat_test_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 2: # Repeat Example 1, but show text indicating the median and IQR, # and use the nonparametric test. #--------------------------------------------------------------- dev.new() p + geom_point() + stat_n_text() + stat_median_iqr_text() + stat_test_text(test = "nonparametric") + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 3: # Repeat Example 1, but use only the groups with # 4 and 8 cylinders. #----------------------------------------------- p <- ggplot(subset(mtcars, cyl %in% c(4, 8)), aes(x = factor(cyl), y = mpg, color = cyl)) + theme(legend.position = "none") dev.new() p + geom_point() + stat_n_text() + stat_mean_sd_text() + stat_test_text() + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Example 4: # Repeat Example 3, but # 1) facet by transmission type, # 2) make the text smaller, # 3) put the text for the test results in a text box # and make them blue. #--------------------------------------------------- dev.new() p + geom_point() + stat_n_text(size = 3) + stat_mean_sd_text(size = 3) + stat_test_text(size = 3, text.box = TRUE, color = "blue") + facet_wrap(~ am, labeller = label_both) + labs(x = "Number of Cylinders", y = "Miles per Gallon") #========== # Clean up #--------- graphics.off() rm(p)
stripChart
is a modification of the R function stripchart
.
It is a generic function used to produce one dimensional scatter
plots (or dot plots) of the given data, along with text indicating sample size and
estimates of location (mean or median) and scale (standard deviation
or interquartile range), as well as confidence intervals for the population
location parameter.
One dimensional scatterplots are a good alternative to boxplots
when sample sizes are small or moderate. The function invokes particular
methods
which depend on the class
of the first argument.
stripChart(x, ...) ## S3 method for class 'formula' stripChart(x, data = NULL, dlab = NULL, subset, na.action = NULL, ...) ## Default S3 method: stripChart(x, method = ifelse(paired && paired.lines, "overplot", "stack"), seed = 47, jitter = 0.1 * cex, offset = 1/2, vertical = TRUE, group.names, group.names.cex = cex, drop.unused.levels = TRUE, add = FALSE, at = NULL, xlim = NULL, ylim = NULL, ylab = NULL, xlab = NULL, dlab = "", glab = "", log = "", pch = 1, col = par("fg"), cex = par("cex"), points.cex = cex, axes = TRUE, frame.plot = axes, show.ci = TRUE, location.pch = 16, location.cex = cex, conf.level = 0.95, min.n.for.ci = 2, ci.offset = 3/ifelse(n > 2, (n-1)^(1/3), 1), ci.bar.lwd = cex, ci.bar.ends = TRUE, ci.bar.ends.size = 0.5 * cex, ci.bar.gap = FALSE, n.text = "bottom", n.text.line = ifelse(n.text == "bottom", 2, 0), n.text.cex = cex, location.scale.text = "top", location.scale.digits = 1, nsmall = location.scale.digits, location.scale.text.line = ifelse(location.scale.text == "top", 0, 3.5), location.scale.text.cex = cex * 0.8 * ifelse(n > 6, max(0.4, 1 - (n-6) * 0.06), 1), p.value = FALSE, p.value.digits = 3, p.value.line = 2, p.value.cex = cex, group.difference.ci = p.value, group.difference.conf.level = 0.95, group.difference.digits = location.scale.digits, ci.and.test = "parametric", ci.arg.list = NULL, test.arg.list = NULL, alternative = "two.sided", plot.diff = FALSE, diff.col = col[1], diff.method = "stack", diff.pch = pch[1], paired = FALSE, paired.lines = paired, paired.lty = 1:6, paired.lwd = 1, paired.pch = 1:14, paired.col = NULL, diff.name = NULL, diff.name.cex = group.names.cex, sep.line = TRUE, sep.lty = 2, sep.lwd = cex, sep.col = "gray", diff.lim = NULL, diff.at = NULL, diff.axis.label = NULL, plot.diff.mar = c(5, 4, 4, 4) + 0.1, ...)
x |
the data from which the plots are to be produced. In the default method the data can be
specified as a list or data frame where each component is numeric, a numeric matrix,
or a numeric vector. In the formula method, a symbolic specification of the form
|
data |
for the formula method, a data.frame (or list) from which the variables in |
subset |
for the formula method, an optional vector specifying a subset of observations to be used for plotting. |
na.action |
for the formula method, a function which indicates what should happen when the data
contain |
... |
additional parameters passed to the default method, or by it to |
method |
the method to be used to separate coincident points. When |
seed |
when |
jitter |
when |
offset |
when stacking is used, points are stacked this many line-heights (symbol widths) apart. |
vertical |
when |
group.names |
group labels which will be printed alongside (or underneath) each plot. |
group.names.cex |
numeric scalar indicating the amount by which the group labels should be scaled
relative to the default (see the help file for |
drop.unused.levels |
when |
add |
logical, if true add the chart to the current plot. |
at |
numeric vector giving the locations where the charts should be drawn,
particularly when |
xlim , ylim
|
plot limits: see |
ylab , xlab
|
labels: see |
dlab , glab
|
alternate way to specify axis labels. The |
log |
on which axes to use a log scale: see |
pch , col , cex
|
Graphical parameters: see |
points.cex |
Sets the |
axes , frame.plot
|
Axis control: see |
show.ci |
logical scalar indicating whether to plot the confidence interval. The default is
|
location.pch |
integer indicating which plotting character to use to indicate the estimate of location
(mean or median) for each group (see the help file for |
location.cex |
numeric scalar giving the amount by which the plotting characters indicating the
estimate of location for each group should be scaled relative to the default
(see the help file for |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the
confidence interval for the group location (population mean or median).
The default value is |
min.n.for.ci |
integer indicating the minimum sample size required in order to plot a confidence interval
for the group location. The default value is |
ci.offset |
numeric scalar or vector of length equal to the number of groups ( |
ci.bar.lwd |
numeric scalar indicating the line width for the confidence interval bars.
The default is the current value of the graphics parameter |
ci.bar.ends |
logical scalar indicating whether to add flat ends to the confidence interval bars.
The default value is |
ci.bar.ends.size |
numeric scalar in units of |
ci.bar.gap |
logical scalar indicating whether to add a gap between the estimate of group location and the
confidence interval bar. The default value is |
n.text |
character string indicating whether and where to indicate the sample size for each group.
Possible values are |
n.text.line |
integer indicating on which plot margin line to show the sample sizes for each group. The
default value is |
n.text.cex |
numeric scalar giving the amount by which the text indicating the sample size for
each group should be scaled relative to the default (see the help file for |
location.scale.text |
character string indicating whether and where to indicate the estimates of location
(mean or median) and scale (standard deviation or interquartile range) for each group.
Possible values are |
location.scale.digits |
integer indicating the number of digits to round the estimates of location and scale. The
default value is |
nsmall |
integer passed to the function |
location.scale.text.line |
integer indicating on which plot margin line to show the estimates of location and scale
for each group. The default value is |
location.scale.text.cex |
numeric scalar giving the amount by which the text indicating the estimates of
location and scale for each group should be scaled relative to the default
(see the help file for |
p.value |
logical scalar indicating whether to show the p-value associated with testing whether all groups
have the same population location. The default value is |
p.value.digits |
integer indicating the number of digits to round to when displaying the p-value associated with
the test of equal group locations. The default value is |
p.value.line |
integer indicating on which plot margin line to show the p-value associated with the test of
equal group locations. The default value is |
p.value.cex |
numeric scalar giving the amount by which the text indicating the p-value associated
with the test of equal group locations should be scaled relative to the default
(see the help file for |
group.difference.ci |
for the case when there are just 2 groups, a logical scalar indicating whether to display
the confidence interval for the difference between group locations. The default is
the value of the |
group.difference.conf.level |
for the case when there are just 2 groups, a numeric scalar between 0 and 1
indicating the confidence level associated with the confidence interval for the
difference between group locations. The default is |
group.difference.digits |
for the case when there are just 2 groups, an integer indicating the number of digits to
round to when displaying the confidence interval for the difference between group locations.
The default value is |
ci.and.test |
character string indicating whether confidence intervals and tests should be based on parametric
or nonparametric ( |
ci.arg.list |
an optional list of arguments to pass to the function used to compute confidence intervals.
The default value is |
test.arg.list |
an optional list of arguments to pass to the function used to test for group differences in location.
The default value is |
alternative |
character string describing the alternative hypothesis for the test of group differences in the
case when there are two groups. Possible values are |
plot.diff |
applicable only to the case when there are two groups: When When |
diff.col |
applicable only to the case when there are two groups and |
diff.method |
applicable only to the case when there are two groups, |
diff.pch |
applicable only to the case when there are two groups, |
paired |
applicable only to the case when there are two groups: |
paired.lines |
applicable only to the case when there are two groups and |
paired.lty |
applicable only to the case when there are two groups, |
paired.lwd |
applicable only to the case when there are two groups, |
paired.pch |
applicable only to the case when there are two groups, |
paired.col |
applicable only to the case when there are two groups, |
diff.name |
applicable only to the case when there are two groups and |
diff.name.cex |
applicable only to the case when there are two groups and |
sep.line |
applicable only to the case when there are two groups and |
sep.lty |
applicable only to the case when there are two groups, |
sep.lwd |
applicable only to the case when there are two groups, |
sep.col |
applicable only to the case when there are two groups, |
diff.lim |
applicable only to the case when there are two groups and |
diff.at |
applicable only to the case when there are two groups and |
diff.axis.label |
applicable only to the case when there are two groups and |
plot.diff.mar |
applicable only to the case when there are two groups, |
stripChart
invisibly returns a list with the following components:
group.centers |
numeric vector of values on the group axis (the |
group.stats |
a matrix with the number of rows equal to the number of groups and six columns indicating the sample size of the group (N), the estimate of the group location parameter (Mean or Median), the estimate of the group scale (SD or IQR), the lower confidence limit for the group location parameter (LCL), the upper confidence limit for the group location parameter (UCL), and the confidence level associated with the confidence interval (Conf.Level) |
In addition, if the argument p.value=TRUE
and/or 1) there are two groups and 2) plot.diff=TRUE
,
the list also includes these components:
group.difference.p.value |
numeric scalar indicating the p-value associated with the test of equal group locations. |
group.difference.conf.int |
numeric vector of two elements indicating the confidence interval for the difference between the group locations. Only present when there are two groups. |
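Since stripChart returns its value invisibly, assign the result to inspect these components. A short sketch, assuming EnvStats is attached and using the EPA.94b.tccb.df data set from the Examples below:

ret <- stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df,
  p.value = TRUE, ylab = "log10 [ TcCB (ppb) ]")

ret$group.centers             # positions of the groups on the group axis
ret$group.stats               # N, location, scale, LCL, UCL, Conf.Level per group
ret$group.difference.p.value  # included because p.value = TRUE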
Steven P. Millard ([email protected])
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods. Second Edition. John Wiley and Sons, New York.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
stripchart
, t.test
, wilcox.test
,
aov
, kruskal.test
, t.test
.
#------------------------ # Two Independent Samples #------------------------ # The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. # # First create one-dimensional scatterplots to compare the # TcCB concentrations between the areas and use a nonparametric # test to test for a difference between areas. dev.new() stripChart(TcCB ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, ci.and.test = "nonparametric", ylab = "TcCB (ppb)") #---------- # Now log-transform the TcCB data and use a parametric test # to compare the areas. dev.new() stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, ylab = "log10 [ TcCB (ppb) ]") #---------- # Repeat the above procedure, but also plot the confidence interval # for the difference between the means. dev.new() stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, plot.diff = TRUE, diff.col = "black", ylab = "log10 [ TcCB (ppb) ]") #---------- # Repeat the above procedure, but allow the variances to differ. dev.new() stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, plot.diff = TRUE, diff.col = "black", ylab = "log10 [ TcCB (ppb) ]", test.arg.list = list(var.equal = FALSE)) #---------- # Repeat the above procedure, but jitter the points instead of # stacking them. dev.new() stripChart(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, col = c("red", "blue"), p.value = TRUE, plot.diff = TRUE, diff.col = "black", ylab = "log10 [ TcCB (ppb) ]", test.arg.list = list(var.equal = FALSE), method = "jitter", ci.offset = 4) #---------- # Clean up #--------- graphics.off() #==================== #-------------------- # Paired Observations #-------------------- # The data frame ACE.13.TCE.df contians paired observations of # trichloroethylene (TCE; mg/L) at 10 groundwater monitoring wells # before and after remediation. # # Create one-dimensional scatterplots to compare TCE concentrations # before and after remediation and use a paired t-test to # test for a difference between periods. ACE.13.TCE.df # TCE.mg.per.L Well Period #1 20.900 1 Before #2 9.170 2 Before #3 5.960 3 Before #... ...... .. ...... #18 0.520 8 After #19 3.060 9 After #20 1.900 10 After dev.new() stripChart(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, col = c("brown", "green"), p.value = TRUE, paired = TRUE, ylab = "TCE (mg/L)") #---------- # Repeat the above procedure, but also plot the confidence interval # for the mean of the paired differences. dev.new() stripChart(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, col = c("brown", "green"), p.value = TRUE, paired = TRUE, ylab = "TCE (mg/L)", plot.diff = TRUE, diff.col = "blue") #========== # Repeat the last two examples, but use a one-sided alternative since # remediation should decrease TCE concentration. dev.new() stripChart(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, col = c("brown", "green"), p.value = TRUE, paired = TRUE, ylab = "TCE (mg/L)", alternative = "less", group.difference.digits = 2) #---------- # Repeat the above procedure, but also plot the confidence interval # for the mean of the paired differences. 
# # NOTE: Although stripChart can *report* one-sided confidence intervals # for the difference between two groups (see above example), # when *plotting* the confidence interval for the difference, # only two-sided CIs are allowed. # Here, we will set the confidence level of the confidence # interval for the mean of the paired differences to 90%, # so that the upper bound of the CI corresponds to the upper # bound of a 95% one-sided CI. dev.new() stripChart(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, col = c("brown", "green"), p.value = TRUE, paired = TRUE, ylab = "TCE (mg/L)", group.difference.digits = 2, plot.diff = TRUE, diff.col = "blue", group.difference.conf.level = 0.9) #---------- # Clean up #--------- graphics.off() #========== # The data frame Helsel.Hirsch.02.Mayfly.df contains paired counts # of mayfly nymphs above and below industrial outfalls in 12 streams. # # Create one-dimensional scatterplots to compare the # counts between locations and use a nonparametric test # to compare counts above and below the outfalls. Helsel.Hirsch.02.Mayfly.df # Mayfly.Count Stream Location #1 12 1 Above #2 15 2 Above #3 11 3 Above #... ... .. ..... #22 60 10 Below #23 53 11 Below #24 124 12 Below dev.new() stripChart(Mayfly.Count ~ Location, data = Helsel.Hirsch.02.Mayfly.df, col = c("green", "brown"), p.value = TRUE, paired = TRUE, ci.and.test = "nonparametric", ylab = "Number of Mayfly Nymphs") #---------- # Repeat the above procedure, but also plot the confidence interval # for the pseudomedian of the paired differences. dev.new() stripChart(Mayfly.Count ~ Location, data = Helsel.Hirsch.02.Mayfly.df, col = c("green", "brown"), p.value = TRUE, paired = TRUE, ci.and.test = "nonparametric", ylab = "Number of Mayfly Nymphs", plot.diff = TRUE, diff.col = "blue") #---------- # Clean up #--------- graphics.off()
summaryFull
is a generic function used to produce a full complement of summary statistics.
The function invokes particular methods
which depend on the class
of
the first argument. The summary statistics include: sample size, number of missing values,
mean, median, trimmed mean, geometric mean, skew, kurtosis, min, max, range, 1st quartile, 3rd quartile,
standard deviation, geometric standard deviation, interquartile range, median absolute deviation, and
coefficient of variation.
summaryFull(object, ...) ## S3 method for class 'formula' summaryFull(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: summaryFull(object, group = NULL, combine.groups = FALSE, drop.unused.levels = TRUE, rm.group.na = TRUE, stats = NULL, trim = 0.1, sd.method = "sqrt.unbiased", geo.sd.method = "sqrt.unbiased", skew.list = list(), kurtosis.list = list(), cv.list = list(), digits = max(3, getOption("digits") - 3), digit.type = "signif", stats.in.rows = TRUE, drop0trailing = TRUE, data.name = deparse(substitute(object)), ...) ## S3 method for class 'data.frame' summaryFull(object, ...) ## S3 method for class 'matrix' summaryFull(object, ...) ## S3 method for class 'list' summaryFull(object, ...)
object |
an object for which summary statistics are desired. In the default method,
the argument |
data |
when |
subset |
when |
na.action |
when |
group |
when |
combine.groups |
logical scalar indicating whether to show summary statistics for all groups combined.
The default value is |
drop.unused.levels |
when |
rm.group.na |
logical scalar indicating whether to remove missing values from the |
stats |
character vector indicating which statistics to compute. Possible elements of the character
vector include: |
trim |
fraction (between 0 and 0.5 inclusive) of values to be trimmed from each end of the ordered data
to compute the trimmed mean. The default value is |
sd.method |
character string specifying what method to use to compute the sample standard deviation.
The possible values are |
geo.sd.method |
character string specifying what method to use to compute the sample standard deviation of the
log-transformed observations prior to exponentiating this quantity. The possible values are
|
skew.list |
list of arguments to supply to the |
kurtosis.list |
list of arguments to supply to the |
cv.list |
list of arguments to supply to the |
digits |
integer indicating the number of digits to use for the summary statistics.
When |
digit.type |
character string indicating whether the |
stats.in.rows |
logical scalar indicating whether to show the summary statistics in the rows or columns of the
output. The default is |
drop0trailing |
logical scalar indicating whether to drop trailing 0's when printing the summary statistics.
The value of this argument is added as an attribute to the returned list and is used by the
|
data.name |
character string indicating the name of the data used for the summary statistics. |
... |
additional arguments affecting the summary statistics produced. |
The function summaryFull
returns summary statistics that are useful to describe various
characteristics of one or more variables. It is an extended version of the built-in R function
summary
specifically for non-factor numeric data. The table below shows what
statistics are computed and what functions are called by summaryFull
to compute these statistics.
The object returned by summaryFull
is useful for printing or reporting purposes. You may also
use the functions that summaryFull
calls (see table below) to compute summary statistics to
be used by other functions.
See the help files for the functions listed in the table below for more information on these summary statistics.
Summary Statistic | Function Used |
Mean | mean |
Median | median |
Trimmed Mean | mean with trim argument |
Geometric Mean | geoMean |
Skew | skewness |
Kurtosis | kurtosis |
Min | min |
Max | max |
Range | range and diff |
1st Quartile | quantile |
3rd Quartile | quantile |
Standard Deviation | sd |
Geometric Standard Deviation | geoSD |
Interquartile Range | iqr |
Median Absolute Deviation | mad |
Coefficient of Variation | cv |
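For illustration, several of the entries reported by summaryFull can be computed directly with the functions named in the table. A brief sketch (assuming EnvStats is attached; rlnormAlt, geoMean, skewness, iqr, and cv are EnvStats functions, the rest are base R):

set.seed(250)
dat <- rlnormAlt(20, mean = 10, cv = 1)

mean(dat, trim = 0.1)   # 10% Trimmed Mean
geoMean(dat)            # Geometric Mean
skewness(dat)           # Skew
iqr(dat)                # Interquartile Range
cv(dat)                 # Coefficient of Variation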
an object of class "summaryStats" (see summaryStats.object).
Objects of class "summaryStats"
are numeric matrices that contain the
summary statistics produced by a call to summaryStats
or summaryFull
.
These objects have a special printing method that by default removes
trailing zeros for sample size entries and prints blanks for statistics that are
normally displayed as NA
(see print.summaryStats
).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Leidel, N.A., K.A. Busch, and J.R. Lynch. (1977). Occupational Exposure Sampling Strategy Manual. U.S. Department of Health, Education, and Welfare, Public Health Service, Center for Disease Control, National Institute for Occupational Safety and Health, Cincinnati, Ohio 45226, January, 1977, pp.102-103.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Ott, W.R. (1995). Environmental Statistics and Data Analysis. Lewis Publishers, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis, Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# Generate 20 observations from a lognormal distribution with # parameters mean=10 and cv=1, and compute the summary statistics. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rlnormAlt(20, mean=10, cv=1) summary(dat) # Min. 1st Qu. Median Mean 3rd Qu. Max. #2.608 4.995 6.235 7.490 9.295 15.440 summaryFull(dat) # dat #N 20 #Mean 7.49 #Median 6.235 #10% Trimmed Mean 7.125 #Geometric Mean 6.674 #Skew 0.9877 #Kurtosis -0.03539 #Min 2.608 #Max 15.44 #Range 12.83 #1st Quartile 4.995 #3rd Quartile 9.295 #Standard Deviation 3.803 #Geometric Standard Deviation 1.634 #Interquartile Range 4.3 #Median Absolute Deviation 2.607 #Coefficient of Variation 0.5078 #---------- # Compare summary statistics for normal and lognormal data: log.dat <- log(dat) summaryFull(list(dat = dat, log.dat = log.dat)) # dat log.dat #N 20 20 #Mean 7.49 1.898 #Median 6.235 1.83 #10% Trimmed Mean 7.125 1.902 #Geometric Mean 6.674 1.835 #Skew 0.9877 0.1319 #Kurtosis -0.03539 -0.4288 #Min 2.608 0.9587 #Max 15.44 2.737 #Range 12.83 1.778 #1st Quartile 4.995 1.607 #3rd Quartile 9.295 2.227 #Standard Deviation 3.803 0.4913 #Geometric Standard Deviation 1.634 1.315 #Interquartile Range 4.3 0.62 #Median Absolute Deviation 2.607 0.4915 #Coefficient of Variation 0.5078 0.2588 # Clean up rm(dat, log.dat) #-------------------------------------------------------------------- # Compute summary statistics for 10 observations from a normal # distribution with parameters mean=0 and sd=1. Note that the # geometric mean and geometric standard deviation are not computed # since some of the observations are non-positive. set.seed(287) dat <- rnorm(10) summaryFull(dat) # dat #N 10 #Mean 0.07406 #Median 0.1095 #10% Trimmed Mean 0.1051 #Skew -0.1646 #Kurtosis -0.7135 #Min -1.549 #Max 1.449 #Range 2.998 #1st Quartile -0.5834 #3rd Quartile 0.6966 #Standard Deviation 0.9412 #Interquartile Range 1.28 #Median Absolute Deviation 1.05 # Clean up rm(dat) #-------------------------------------------------------------------- # Compute summary statistics for the TcCB data given in USEPA (1994b) # (the data are stored in EPA.94b.tccb.df). Arbitrarily set the one # censored observation to the censoring level. Group by the variable # Area. summaryFull(TcCB ~ Area, data = EPA.94b.tccb.df) # Cleanup Reference #N 77 47 #Mean 3.915 0.5985 #Median 0.43 0.54 #10% Trimmed Mean 0.6846 0.5728 #Geometric Mean 0.5784 0.5382 #Skew 7.717 0.9019 #Kurtosis 62.67 0.132 #Min 0.09 0.22 #Max 168.6 1.33 #Range 168.5 1.11 #1st Quartile 0.23 0.39 #3rd Quartile 1.1 0.75 #Standard Deviation 20.02 0.2836 #Geometric Standard Deviation 3.898 1.597 #Interquartile Range 0.87 0.36 #Median Absolute Deviation 0.3558 0.2669 #Coefficient of Variation 5.112 0.4739
summaryStats
is a generic function used to produce summary statistics, confidence intervals,
and results of hypothesis tests. The function invokes particular methods
which
depend on the class
of the first argument.
The summary statistics include: sample size, number of missing values, mean, standard deviation, median, min, and max. Optional additional summary statistics include 1st quartile, 3rd quartile, and standard error.
summaryStats(object, ...) ## S3 method for class 'formula' summaryStats(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: summaryStats(object, group = NULL, drop.unused.levels = TRUE, se = FALSE, quartiles = FALSE, digits = max(3, getOption("digits") - 3), digit.type = "round", drop0trailing = TRUE, show.na = TRUE, show.0.na = FALSE, p.value = FALSE, p.value.digits = 2, p.value.digit.type = "signif", test = "parametric", paired = FALSE, test.arg.list = NULL, combine.groups = p.value, rm.group.na = TRUE, group.p.value.type = NULL, alternative = "two.sided", ci = NULL, ci.between = NULL, conf.level = 0.95, stats.in.rows = FALSE, data.name = deparse(substitute(object)), ...) ## S3 method for class 'factor' summaryStats(object, group = NULL, drop.unused.levels = TRUE, digits = max(3, getOption("digits") - 3), digit.type = "round", drop0trailing = TRUE, show.na = TRUE, show.0.na = FALSE, p.value = FALSE, p.value.digits = 2, p.value.digit.type = "signif", test = "chisq", test.arg.list = NULL, combine.levels = TRUE, combine.groups = FALSE, rm.group.na = TRUE, ci = p.value & test != "chisq", conf.level = 0.95, stats.in.rows = FALSE, ...) ## S3 method for class 'character' summaryStats(object, ...) ## S3 method for class 'logical' summaryStats(object, ...) ## S3 method for class 'data.frame' summaryStats(object, ...) ## S3 method for class 'matrix' summaryStats(object, ...) ## S3 method for class 'list' summaryStats(object, ...)
object |
an object for which summary statistics are desired. In the default method,
the argument |
data |
when |
subset |
when |
na.action |
when |
group |
when |
drop.unused.levels |
when |
se |
for numeric data, logical scalar indicating whether to include
the standard error of the mean in the summary statistics.
The default value is |
quartiles |
for numeric data, logical scalar indicating whether to include
the estimated 25th and 75th percentiles in the summary statistics.
The default value is |
digits |
integer indicating the number of digits to use for the summary statistics.
When |
digit.type |
character string indicating whether the |
drop0trailing |
logical scalar indicating whether to drop trailing 0's when printing the summary statistics.
The value of this argument is added as an attribute to the returned list and is used by the
|
show.na |
logical scalar indicating whether to return the number of missing values.
The default value is |
show.0.na |
logical scalar indicating whether to display the number of missing values in the case when
there are no missing values. The default value is |
p.value |
logical scalar indicating whether to return the p-value associated with a test of hypothesis.
The default value is |
p.value.digits |
integer indicating the number of digits to use for the p-value. When |
p.value.digit.type |
character string indicating whether the |
test |
Numeric data: character string indicating whether to compute p-values and confidence
intervals based on parametric ( Factors: character string indicating which test to perform when |
paired |
applicable only to the case when there are two groups: |
test.arg.list |
a list with additional arguments to pass to the test used to compute p-values and confidence
intervals. For numeric data, when |
combine.groups |
logical scalar indicating whether to show summary statistics for all groups combined.
Numeric data: the default value is |
rm.group.na |
logical scalar indicating whether to remove missing values from the |
group.p.value.type |
for numeric data, character string indicating which p-value(s) to compute when
there is more than one group. When |
alternative |
for numeric data, character string indicating which alternative to assume
for p-values and confidence intervals. Possible values are |
ci |
Numeric data: logical scalar indicating whether to compute a confidence interval
for the mean or each group mean. The default value is Factors: logical scalar indicating whether to compute a confidence interval. A confidence
interval is computed only if the number of levels in |
ci.between |
for numeric data, logical scalar indicating whether to compute a confidence interval
for the difference between group means when there are two groups.
The default value is |
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence intervals.
The default value is |
stats.in.rows |
logical scalar indicating whether to show the summary statistics in the rows or columns of the
output. The default is |
data.name |
character string indicating the name of the data used for the summary statistics. |
combine.levels |
for factors, a logical scalar indicating whether to compute summary statistics based on combining all levels of a factor. |
... |
additional arguments affecting the summary statistics produced. |
an object of class "summaryStats"
(see summaryStats.object
).
Objects of class "summaryStats"
are numeric matrices that contain the
summary statistics produced by a call to summaryStats
or summaryFull
.
These objects have a special printing method that by default removes
trailing zeros for sample size entries and prints blanks for statistics that are
normally displayed as NA
(see print.summaryStats
).
Summary statistics for numeric data include sample size, mean, standard deviation, median,
min, and max. Options include the standard error of the mean (when se=TRUE
),
the estimated quartiles (when quartiles=TRUE
), p-values (when p.value=TRUE
),
and/or confidence intervals (when ci=TRUE
and/or ci.between=TRUE
).
Summary statistics for factors include the sample size for each level of the factor and the
percent of the total for that level. Options include a p-value (when p.value=TRUE
).
Note that unlike the R function summary
and the EnvStats function
summaryFull
, by default the digits
argument for the EnvStats function
summaryStats
refers to how many decimal places to round to, not how many
significant digits to use (see the explanation of the argument digit.type
above).
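As a minimal illustration of this point (not part of the original help file; the input vector is arbitrary), compare the default digit.type = "round" with digit.type = "signif":

x <- c(0.01234, 0.2345, 3.456, 45.67)

# Round the summary statistics to 2 decimal places (the default behavior)
summaryStats(x, digits = 2)

# Use 2 significant digits instead
summaryStats(x, digits = 2, digit.type = "signif")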
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 24.
summary
, summaryFull
, t.test
, anova.lm
,
wilcox.test
, kruskal.test
,
chisq.test
, fisher.test
, binom.test
.
# The guidance document USEPA (1994b, pp. 6.22--6.25) # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) # concentrations (in parts per billion) from soil samples # at a Reference area and a Cleanup area. These data are strored # in the data frame EPA.94b.tccb.df. #---------- # First, create summary statistics by area based on the log-transformed data. summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df) # N Mean SD Median Min Max #Cleanup 77 -0.2377 0.5908 -0.3665 -1.0458 2.2270 #Reference 47 -0.2691 0.2032 -0.2676 -0.6576 0.1239 #---------- # Now create summary statistics by area based on the log-transformed data # and use the t-test to compare the areas. summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, p.value = TRUE) summaryStats(log10(TcCB) ~ Area, data = EPA.94b.tccb.df, p.value = TRUE, stats.in.rows = TRUE) # Cleanup Reference Combined #N 77 47 124 #Mean -0.2377 -0.2691 -0.2496 #SD 0.5908 0.2032 0.481 #Median -0.3665 -0.2676 -0.3143 #Min -1.0458 -0.6576 -1.0458 #Max 2.227 0.1239 2.227 #Diff -0.0313 #p.value.between 0.73 #95%.LCL.between -0.2082 #95%.UCL.between 0.1456 #==================================================================== # Page 9-3 of USEPA (2009) lists trichloroethene # concentrations (TCE; mg/L) collected from groundwater at two wells. # Here, the seven non-detects have been set to their detection limit. #---------- # First, compute summary statistics for all TCE observations. summaryStats(TCE.mg.per.L ~ 1, data = EPA.09.Table.9.1.TCE.df, digits = 3, data.name = "TCE") # N Mean SD Median Min Max NA's N.Total #TCE 27 0.09 0.064 0.1 0.004 0.25 3 30 summaryStats(TCE.mg.per.L ~ 1, data = EPA.09.Table.9.1.TCE.df, se = TRUE, quartiles = TRUE, digits = 3, data.name = "TCE") # N Mean SD SE Median Min Max 1st Qu. 3rd Qu. NA's N.Total #TCE 27 0.09 0.064 0.012 0.1 0.004 0.25 0.031 0.12 3 30 #---------- # Now compute summary statistics by well. summaryStats(TCE.mg.per.L ~ Well, data = EPA.09.Table.9.1.TCE.df, digits = 3) # N Mean SD Median Min Max NA's N.Total #Well.1 14 0.063 0.079 0.031 0.004 0.25 1 15 #Well.2 13 0.118 0.020 0.110 0.099 0.17 2 15 summaryStats(TCE.mg.per.L ~ Well, data = EPA.09.Table.9.1.TCE.df, digits = 3, stats.in.rows = TRUE) # Well.1 Well.2 #N 14 13 #Mean 0.063 0.118 #SD 0.079 0.02 #Median 0.031 0.11 #Min 0.004 0.099 #Max 0.25 0.17 #NA's 1 2 #N.Total 15 15 # If you want to keep trailing 0's, use the drop0trailing argument: summaryStats(TCE.mg.per.L ~ Well, data = EPA.09.Table.9.1.TCE.df, digits = 3, stats.in.rows = TRUE, drop0trailing = FALSE) # Well.1 Well.2 #N 14.000 13.000 #Mean 0.063 0.118 #SD 0.079 0.020 #Median 0.031 0.110 #Min 0.004 0.099 #Max 0.250 0.170 #NA's 1.000 2.000 #N.Total 15.000 15.000 #==================================================================== # Page 13-3 of USEPA (2009) lists iron concentrations (ppm) in # groundwater collected from 6 wells. #---------- # First, compute summary statistics for each well. summaryStats(Iron.ppm ~ Well, data = EPA.09.Ex.13.1.iron.df, combine.groups = FALSE, digits = 2, stats.in.rows = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 #N 4 4 4 4 4 4 #Mean 47.01 55.73 90.86 70.43 145.24 156.32 #SD 12.4 20.34 59.35 25.95 92.16 51.2 #Median 50.05 57.05 76.73 76.95 137.66 171.93 #Min 29.96 32.14 39.25 34.12 60.95 83.1 #Max 57.97 76.71 170.72 93.69 244.69 198.34 #---------- # Note the large differences in standard deviations between wells. # Compute summary statistics for log(Iron), by Well. 
summaryStats(log(Iron.ppm) ~ Well, data = EPA.09.Ex.13.1.iron.df, combine.groups = FALSE, digits = 2, stats.in.rows = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 #N 4 4 4 4 4 4 #Mean 3.82 3.97 4.35 4.19 4.8 5 #SD 0.3 0.4 0.66 0.45 0.7 0.4 #Median 3.91 4.02 4.29 4.34 4.8 5.14 #Min 3.4 3.47 3.67 3.53 4.11 4.42 #Max 4.06 4.34 5.14 4.54 5.5 5.29 #---------- # Include confidence intervals for the mean log(Fe) concentration # at each well, and also the p-value from the one-way # analysis of variance to test for a difference in well means. summaryStats(log(Iron.ppm) ~ Well, data = EPA.09.Ex.13.1.iron.df, digits = 1, ci = TRUE, p.value = TRUE, stats.in.rows = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 Combined #N 4 4 4 4 4 4 24 #Mean 3.8 4 4.3 4.2 4.8 5 4.4 #SD 0.3 0.4 0.7 0.5 0.7 0.4 0.6 #Median 3.9 4 4.3 4.3 4.8 5.1 4.3 #Min 3.4 3.5 3.7 3.5 4.1 4.4 3.4 #Max 4.1 4.3 5.1 4.5 5.5 5.3 5.5 #95%.LCL 3.3 3.3 3.3 3.5 3.7 4.4 4.1 #95%.UCL 4.3 4.6 5.4 4.9 5.9 5.6 4.6 #p.value.between 0.025 #==================================================================== # Using the built-in dataset HairEyeColor, summarize the frequencies # of hair color and test whether there is a difference in proportions. # NOTE: The data that was originally factor data has already been # collapsed into frequency counts by catetory in the object # HairEyeColor. In the examples in this section, we recreate # the factor objects in order to show how summaryStats works # for factor objects. Hair <- apply(HairEyeColor, 1, sum) Hair #Black Brown Red Blond # 108 286 71 127 Hair.color <- names(Hair) Hair.fac <- factor(rep(Hair.color, times = Hair), levels = Hair.color) #---------- # Compute summary statistics and perform the chi-square test # for equal proportions of hair color summaryStats(Hair.fac, digits = 1, p.value = TRUE) # N Pct ChiSq_p #Black 108 18.2 #Brown 286 48.3 #Red 71 12.0 #Blond 127 21.5 #Combined 592 100.0 2.5e-39 #---------- # Now test the hypothesis that 10% of the population from which # this sample was drawn has Red hair, and compute a 95% confidence # interval for the percent of subjects with red hair. Red.Hair.fac <- factor(Hair.fac == "Red", levels = c(TRUE, FALSE), labels = c("Red", "Not Red")) summaryStats(Red.Hair.fac, digits = 1, p.value = TRUE, ci = TRUE, test = "binom", test.arg.list = list(p = 0.1)) # N Pct Exact_p 95%.LCL 95%.UCL #Red 71 12 9.5 14.9 #Not Red 521 88 #Combined 592 100 0.11 #---------- # Now test whether the percent of people with Green eyes is the # same for people with and without Red hair. 
HairEye <- apply(HairEyeColor, 1:2, sum) Hair.color <- rownames(HairEye) Eye.color <- colnames(HairEye) n11 <- HairEye[Hair.color == "Red", Eye.color == "Green"] n12 <- sum(HairEye[Hair.color == "Red", Eye.color != "Green"]) n21 <- sum(HairEye[Hair.color != "Red", Eye.color == "Green"]) n22 <- sum(HairEye[Hair.color != "Red", Eye.color != "Green"]) Hair.fac <- factor(rep(c("Red", "Not Red"), c(n11+n12, n21+n22)), levels = c("Red", "Not Red")) Eye.fac <- factor(c(rep("Green", n11), rep("Not Green", n12), rep("Green", n21), rep("Not Green", n22)), levels = c("Green", "Not Green")) #---------- # Here are the results using the chi-square test and computing # confidence limits for the difference between the two percentages summaryStats(Eye.fac, group = Hair.fac, digits = 1, p.value = TRUE, ci = TRUE, test = "prop", stats.in.rows = TRUE, test.arg.list = list(correct = FALSE)) # Green Not Green Combined #Red(N) 14 57 71 #Red(Pct) 19.7 80.3 100 #Not Red(N) 50 471 521 #Not Red(Pct) 9.6 90.4 100 #ChiSq_p 0.01 #95%.LCL.between 0.5 #95%.UCL.between 19.7 #---------- # Here are the results using Fisher's exact test and computing # confidence limits for the odds ratio summaryStats(Eye.fac, group = Hair.fac, digits = 1, p.value = TRUE, ci = TRUE, test = "fisher", stats.in.rows = TRUE) # Green Not Green Combined #Red(N) 14 57 71 #Red(Pct) 19.7 80.3 100 #Not Red(N) 50 471 521 #Not Red(Pct) 9.6 90.4 100 #Fisher_p 0.015 #95%.LCL.OR 1.1 #95%.UCL.OR 4.6 rm(Hair, Hair.color, Hair.fac, Red.Hair.fac, HairEye, Eye.color, n11, n12, n21, n22, Eye.fac) #==================================================================== # The data set EPA.89b.cadmium.df contains information on # cadmium concentrations in groundwater collected from a # background and compliance well. Compare detection frequencies # between the well types and test for a difference using # Fisher's exact test. summaryStats(factor(Censored) ~ Well.type, data = EPA.89b.cadmium.df, digits = 1, p.value = TRUE, test = "fisher") summaryStats(factor(Censored) ~ Well.type, data = EPA.89b.cadmium.df, digits = 1, p.value = TRUE, test = "fisher", stats.in.rows = TRUE) # FALSE TRUE Combined #Background(N) 8 16 24 #Background(Pct) 33.3 66.7 100 #Compliance(N) 24 40 64 #Compliance(Pct) 37.5 62.5 100 #Fisher_p 0.81 #95%.LCL.OR 0.3 #95%.UCL.OR 2.5 #==================================================================== #-------------------- # Paired Observations #-------------------- # The data frame ACE.13.TCE.df contians paired observations of # trichloroethylene (TCE; mg/L) at 10 groundwater monitoring wells # before and after remediation. # # Compare TCE concentrations before and after remediation and # use a paired t-test to test for a difference between periods. summaryStats(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, p.value = TRUE, paired = TRUE) summaryStats(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, p.value = TRUE, paired = TRUE, stats.in.rows = TRUE) # Before After Combined #N 10 10 20 #Mean 21.624 3.6329 12.6284 #SD 13.5113 3.5544 13.3281 #Median 20.3 2.48 8.475 #Min 5.96 0.272 0.272 #Max 41.5 10.7 41.5 #Diff -17.9911 #paired.p.value.between 0.0027 #95%.LCL.between -27.9097 #95%.UCL.between -8.0725 #========== # Repeat the last example, but use a one-sided alternative since # remediation should decrease TCE concentration. 
summaryStats(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, p.value = TRUE, paired = TRUE, alternative = "less") summaryStats(TCE.mg.per.L ~ Period, data = ACE.13.TCE.df, p.value = TRUE, paired = TRUE, alternative = "less", stats.in.rows = TRUE) # Before After Combined #N 10 10 20 #Mean 21.624 3.6329 12.6284 #SD 13.5113 3.5544 13.3281 #Median 20.3 2.48 8.475 #Min 5.96 0.272 0.272 #Max 41.5 10.7 41.5 #Diff -17.9911 #paired.p.value.between.less 0.0013 #95%.LCL.between -Inf #95%.UCL.between -9.9537
Objects of S3 class "summaryStats"
are returned by the functions
summaryStats
and summaryFull
.
Objects of S3 class "summaryStats"
are matrices that contain
information about the summary statistics.
Required Attributes
The following attributes must be included in a legitimate matrix of
class "summaryStats"
.
stats.in.rows |
logical scalar indicating whether the statistics
are stored by row |
drop0trailing |
logical scalar indicating whether to drop trailing 0's when printing the summary statistics. |
Generic functions that have methods for objects of class
"summaryStats"
include: print
.
Steven P. Millard ([email protected])
# Create an object of class "summaryStats", then print it out. #------------------------------------------------------------- summaryStats.obj <- summaryStats(TCE.mg.per.L ~ Well, data = EPA.09.Table.9.1.TCE.df, digits = 3) is.matrix(summaryStats.obj) #[1] TRUE class(summaryStats.obj) #[1] "summaryStats" attributes(summaryStats.obj) #$dim #[1] 2 8 # #$dimnames #$dimnames[[1]] #[1] "Well.1" "Well.2" # #$dimnames[[2]] #[1] "N" "Mean" "SD" "Median" "Min" "Max" #[7] "NA's" "N.Total" # # #$class #[1] "summaryStats" # #$stats.in.rows #[1] FALSE # #$drop0trailing #[1] TRUE summaryStats.obj # N Mean SD Median Min Max NA's N.Total #Well.1 14 0.063 0.079 0.031 0.004 0.25 1 15 #Well.2 13 0.118 0.020 0.110 0.099 0.17 2 15 #---------- # Clean up #--------- rm(summaryStats.obj)
Construct a β-content or β-expectation tolerance
interval for a gamma distribution.
tolIntGamma(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mle", normal.approx.transform = "kulkarni.powar") tolIntGammaAlt(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mle", normal.approx.transform = "kulkarni.powar")
tolIntGamma(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mle", normal.approx.transform = "kulkarni.powar") tolIntGammaAlt(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mle", normal.approx.transform = "kulkarni.powar")
x |
numeric vector of non-negative observations. Missing ( |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
for the case of a two-sided tolerance interval, a character string specifying the
method for constructing the two-sided normal distribution tolerance interval using
the transformed data. This argument is ignored if |
est.method |
character string specifying the method of estimation for the shape and scale
distribution parameters. The possible values are
|
normal.approx.transform |
character string indicating which power transformation to use.
Possible values are |
The function tolIntGamma
returns a tolerance interval as well as
estimates of the shape and scale parameters.
The function tolIntGammaAlt
returns a tolerance interval as well as
estimates of the mean and coefficient of variation.
The tolerance interval is computed by 1) using a power transformation on the original
data to induce approximate normality, 2) using tolIntNorm
to compute
the tolerance interval, and then 3) back-transforming the interval to create a tolerance
interval on the original scale. (Krishnamoorthy et al., 2008).
The value normal.approx.transform="cube.root"
uses
the cube root transformation suggested by Wilson and Hilferty (1931) and used by
Krishnamoorthy et al. (2008) and Singh et al. (2010b), and the value
normal.approx.transform="fourth.root"
uses the fourth root transformation suggested
by Hawkins and Wixley (1986) and used by Singh et al. (2010b).
The default value normal.approx.transform="kulkarni.powar"
uses the "Optimum Power Normal Approximation Method" of Kulkarni and Powar (2010).
The "optimum" power p is determined by:

p = -0.0705 - 0.178 * shape + 0.475 * sqrt(shape)   if shape <= 1.5

p = 0.246   if shape > 1.5

where shape denotes the estimate of the shape parameter. Although Kulkarni and Powar (2010) use the maximum likelihood estimate of shape to determine the power p, for the functions tolIntGamma and tolIntGammaAlt the power is based on whatever estimate of shape is used (e.g., est.method="mle", est.method="bcmle", etc.).
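The following sketch (illustrative only, not part of the original help file) spells out the three-step construction described above, using the cube-root transformation of Wilson and Hilferty (1931); the resulting upper limit should essentially match a direct call to tolIntGamma with normal.approx.transform="cube.root".

set.seed(47)
x <- rgamma(20, shape = 3, scale = 2)

# Step 1: power transformation to induce approximate normality
y <- x^(1/3)

# Step 2: normal-theory tolerance interval on the transformed scale
ti.y <- tolIntNorm(y, coverage = 0.95, ti.type = "upper", conf.level = 0.95)

# Step 3: back-transform the limit to the original scale
ti.y$interval$limits["UTL"]^3

# Direct call for comparison
tolIntGamma(x, coverage = 0.95, ti.type = "upper", conf.level = 0.95,
  normal.approx.transform = "cube.root")$interval$limits["UTL"]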
A list of class "estimate"
containing the estimated parameters,
the tolerance interval, and other information. See estimate.object
for details.
In addition to the usual components contained in an object of class
"estimate"
, the returned value also includes an additional
component within the "interval"
component:
normal.transform.power |
the value of the power used to transform the original data to approximate normality. |
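For example (a minimal sketch, assuming the component layout described above), the power actually used can be extracted from the "interval" component of the returned object:

dat <- rgamma(20, shape = 3, scale = 2)
ti <- tolIntGamma(dat)
ti$interval$normal.transform.power   # power used for the normal approximation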
It is possible for the lower tolerance limit based on the transformed data to be less than 0. In this case, the lower tolerance limit on the original scale is set to 0 and a warning is issued stating that the normal approximation is not accurate in this case.
The gamma distribution takes values on the positive real line. Special cases of the gamma are the exponential distribution and the chi-square distributions. Applications of the gamma include life testing, statistical ecology, queuing theory, inventory control, and precipitation processes. A gamma distribution starts to resemble a normal distribution as the shape parameter a tends to infinity.
Some EPA guidance documents (e.g., Singh et al., 2002; Singh et al., 2010a,b) strongly recommend against using a lognormal model for environmental data and recommend trying a gamma distribution instead.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Evans, M., N. Hastings, and B. Peacock. (1993). Statistical Distributions. Second Edition. John Wiley and Sons, New York, Chapter 18.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Hawkins, D. M., and R.A.J. Wixley. (1986). A Note on the Transformation of Chi-Squared Variables to Normality. The American Statistician, 40, 296–298.
Johnson, N.L., S. Kotz, and N. Balakrishnan. (1994). Continuous Univariate Distributions, Volume 1. Second Edition. John Wiley and Sons, New York, Chapter 17.
Krishnamoorthy K., T. Mathew, and S. Mukherjee. (2008). Normal-Based Methods for a Gamma Distribution: Prediction and Tolerance Intervals and Stress-Strength Reliability. Technometrics, 50(1), 69–78.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Kulkarni, H.V., and S.K. Powar. (2010). A New Method for Interval Estimation of the Mean of the Gamma Distribution. Lifetime Data Analysis, 16, 431–447.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Singh, A., A.K. Singh, and R.J. Iaci. (2002). Estimation of the Exposure Point Concentration Term Using a Gamma Distribution. EPA/600/R-02/084. October 2002. Technology Support Center for Monitoring and Site Characterization, Office of Research and Development, Office of Solid Waste and Emergency Response, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Wilson, E.B., and M.M. Hilferty. (1931). The Distribution of Chi-Squares. Proceedings of the National Academy of Sciences, 17, 684–688.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
GammaDist
, estimate.object
,
egamma
, tolIntNorm
,
predIntGamma
.
# Generate 20 observations from a gamma distribution with parameters # shape=3 and scale=2, then create a tolerance interval. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rgamma(20, shape = 3, scale = 2) tolIntGamma(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.203862 # scale = 2.174928 # #Estimation Method: mle # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Tolerance Interval Type: two-sided # #Confidence Level: 95% # #Number of Future Observations: 1 # #Tolerance Interval: LTL = 0.2340438 # UTL = 21.2996464 #-------------------------------------------------------------------- # Using the same data as in the previous example, create an upper # one-sided tolerance interval and use the bias-corrected estimate of # shape. tolIntGamma(dat, ti.type = "upper", est.method = "bcmle") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 1.906616 # scale = 2.514005 # #Estimation Method: bcmle # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on bcmle of 'shape' # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.00000 # UTL = 17.72107 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal # distribution. Here we will use the same chrysene data but assume a # gamma distribution. attach(EPA.09.Ex.17.3.chrysene.df) Chrysene <- Chrysene.ppb[Well.type == "Background"] #---------- # First perform a goodness-of-fit test for a gamma distribution gofTest(Chrysene, dist = "gamma") #Results of Goodness-of-Fit Test #------------------------------- # #Test Method: Shapiro-Wilk GOF Based on # Chen & Balakrisnan (1995) # #Hypothesized Distribution: Gamma # #Estimated Parameter(s): shape = 2.806929 # scale = 5.286026 # #Estimation Method: mle # #Data: Chrysene # #Sample Size: 8 # #Test Statistic: W = 0.9156306 # #Test Statistic Parameter: n = 8 # #P-value: 0.3954223 # #Alternative Hypothesis: True cdf does not equal the # Gamma Distribution. #---------- # Now compute the upper tolerance limit tolIntGamma(Chrysene, ti.type = "upper", coverage = 0.95, conf.level = 0.95) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 2.806929 # scale = 5.286026 # #Estimation Method: mle # #Data: Chrysene # #Sample Size: 8 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Kulkarni & Powar (2010) # transformation to Normality # based on mle of 'shape' # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.00000 # UTL = 69.32425 #---------- # Compare this upper tolerance limit of 69 ppb to the upper tolerance limit # assuming a lognormal distribution. 
tolIntLnorm(Chrysene, ti.type = "upper", coverage = 0.95, conf.level = 0.95)$interval$limits["UTL"] # UTL #90.9247 #---------- # Clean up rm(Chrysene) detach("EPA.09.Ex.17.3.chrysene.df") #-------------------------------------------------------------------- # Reproduce some of the example on page 73 of # Krishnamoorthy et al. (2008), which uses alkalinity concentrations # reported in Gibbons (1994) and Gibbons et al. (2009) to construct # two-sided and one-sided upper tolerance limits for various values # of coverage using a 95% confidence level. tolIntGamma(Gibbons.et.al.09.Alkilinity.vec, ti.type = "upper", coverage = 0.9, normal.approx.transform = "cube.root") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 9.375013 # scale = 6.202461 # #Estimation Method: mle # #Data: Gibbons.et.al.09.Alkilinity.vec # #Sample Size: 27 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Wilson & Hilferty (1931) cube-root # transformation to Normality # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.00000 # UTL = 97.70502 tolIntGamma(Gibbons.et.al.09.Alkilinity.vec, coverage = 0.99, normal.approx.transform = "cube.root") #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Gamma # #Estimated Parameter(s): shape = 9.375013 # scale = 6.202461 # #Estimation Method: mle # #Data: Gibbons.et.al.09.Alkilinity.vec # #Sample Size: 27 # #Tolerance Interval Coverage: 99% # #Coverage Type: content # #Tolerance Interval Method: Exact using # Wilson & Hilferty (1931) cube-root # transformation to Normality # #Tolerance Interval Type: two-sided # #Confidence Level: 95% # #Tolerance Interval: LTL = 13.14318 # UTL = 148.43876
Estimate the mean and standard deviation on the log-scale for a
lognormal distribution, or estimate the mean
and coefficient of variation for a
lognormal distribution (alternative parameterization),
and construct a β-content or β-expectation tolerance
interval.
tolIntLnorm(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact") tolIntLnormAlt(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mvue")
tolIntLnorm(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact") tolIntLnormAlt(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", est.method = "mvue")
x |
For For If |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the
tolerance interval. The default value is |
method |
for the case of a two-sided tolerance interval, a character string specifying the
method for constructing the tolerance interval. This argument is ignored if
|
est.method |
for |
The function tolIntLnorm
returns a tolerance interval as well as
estimates of the meanlog and sdlog parameters.
The function tolIntLnormAlt
returns a tolerance interval as well as
estimates of the mean and coefficient of variation.
A tolerance interval for a lognormal distribution is constructed by taking the
natural logarithm of the observations and constructing a tolerance interval
based on the normal (Gaussian) distribution by calling tolIntNorm
.
These tolerance limits are then exponentiated to produce a tolerance interval on
the original scale of the data.
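A minimal sketch of this construction (illustrative only, not part of the original help file): the limits from tolIntNorm applied to the log-transformed data, once exponentiated, should match those returned by tolIntLnorm.

set.seed(23)
x <- rlnorm(20, meanlog = 0, sdlog = 1)

# Tolerance interval on the log scale, then exponentiate the limits
exp(tolIntNorm(log(x), coverage = 0.9)$interval$limits)

# The same limits from the direct call
tolIntLnorm(x, coverage = 0.9)$interval$limits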
If x
is a numeric vector, a list of class
"estimate"
containing the estimated parameters, a component called
interval
containing the tolerance interval information, and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, a list whose
class is the same as x
. The list contains the same
components as x
. If x
already has a component called
interval
, this component is replaced with the tolerance interval
information.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
tolIntNorm
, Lognormal
, LognormalAlt
,
estimate.object
, elnorm
, elnormAlt
,
eqlnorm
, predIntLnorm
,
Tolerance Intervals, Estimating Distribution Parameters,
Estimating Distribution Quantiles.
# Generate 20 observations from a lognormal distribution with parameters # meanlog=0 and sdlog=1. Use tolIntLnorm to estimate # the mean and standard deviation of the log of the true distribution, and # construct a two-sided 90% beta-content tolerance interval with associated # confidence level 95%. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rlnorm(20) tolIntLnorm(dat, coverage = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = -0.06941976 # sdlog = 0.59011300 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: two-sided # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.237457 # UTL = 3.665369 # The exact two-sided interval that contains 90% of this distribution # is given by: [0.193, 5.18]. qlnorm(p = c(0.05, 0.95)) #[1] 0.1930408 5.1802516 # Clean up rm(dat) #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 with(EPA.09.Ex.17.3.chrysene.df, tolIntLnorm(Chrysene.ppb[Well.type == "Background"], ti.type = "upper", coverage = 0.95, conf.level = 0.95)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): meanlog = 2.5085773 # sdlog = 0.6279479 # #Estimation Method: mvue # #Data: Chrysene.ppb[Well.type == "Background"] # #Sample Size: 8 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.0000 # UTL = 90.9247 #---------- # Repeat the above example, but estimate the mean and # coefficient of variation on the original scale #----------------------------------------------- with(EPA.09.Ex.17.3.chrysene.df, tolIntLnormAlt(Chrysene.ppb[Well.type == "Background"], ti.type = "upper", coverage = 0.95, conf.level = 0.95)) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Lognormal # #Estimated Parameter(s): mean = 14.5547353 # cv = 0.6390825 # #Estimation Method: mvue # #Data: Chrysene.ppb[Well.type == "Background"] # #Sample Size: 8 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.0000 # UTL = 90.9247
Construct a β-content or β-expectation tolerance
interval for a lognormal distribution based on Type I or Type II
censored data.
tolIntLnormCensored(x, censored, censoring.side = "left", coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "mle", ti.method = "exact.for.complete", seed = NULL, nmc = 1000)
tolIntLnormCensored(x, censored, censoring.side = "left", coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "mle", ti.method = "exact.for.complete", seed = NULL, nmc = 1000)
x |
numeric vector of positive observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
character string indicating the method to use for parameter estimation on the log-scale. |
ti.method |
character string specifying the method for constructing the tolerance
interval. Possible values are: |
seed |
for the case when |
nmc |
for the case when |
A tolerance interval for a lognormal distribution is constructed by taking the natural logarithm of the observations and constructing a tolerance interval based on the normal (Gaussian) distribution by calling tolIntNormCensored. These tolerance limits are then exponentiated to produce a tolerance interval on the original scale of the data.
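The following minimal sketch (added here for illustration; it is not part of the original help file) checks this log/exponentiate relationship on simulated left-censored lognormal data. It assumes the EnvStats functions tolIntLnormCensored and tolIntNormCensored behave as described above; the censoring level of 2 is arbitrary.

library(EnvStats)

set.seed(123)
x <- rlnorm(25, meanlog = 1, sdlog = 0.8)   # positive observations
censored <- x < 2                           # Type I left-censoring at 2
x[censored] <- 2

# Upper tolerance limit computed directly on the original (lognormal) scale
ti.lnorm <- tolIntLnormCensored(x, censored, coverage = 0.9,
  ti.type = "upper")$interval$limits

# Same limit obtained by working on the log scale and exponentiating
ti.norm <- tolIntNormCensored(log(x), censored, coverage = 0.9,
  ti.type = "upper")$interval$limits

ti.lnorm
exp(ti.norm)   # should agree with ti.lnorm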
A list of class "estimateCensored" containing the estimated parameters, the tolerance interval, and other information. See estimateCensored.object for details.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
tolIntNormCensored, gpqTolIntNormSinglyCensored, eqnormCensored, enormCensored, estimateCensored.object.
# Generate 20 observations from a lognormal distribution with parameters # mean=10 and cv=1, censor the observations less than 5, # then create a one-sided upper tolerance interval with 90% # coverage and 95% confidence based on these Type I left, singly # censored data. # (Note: the call to set.seed allows you to reproduce this example.) set.seed(250) dat <- rlnormAlt(20, mean = 10, cv = 1) sort(dat) # [1] 2.608298 3.185459 4.196216 4.383764 4.569752 5.136130 # [7] 5.209538 5.916284 6.199076 6.214755 6.255779 6.778361 #[13] 7.074972 7.100494 8.930845 10.388766 11.402769 14.247062 #[19] 14.559506 15.437340 censored <- dat < 5 dat[censored] <- 5 tolIntLnormCensored(dat, censored, coverage = 0.9, ti.type="upper") #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 5 # #Estimated Parameter(s): meanlog = 1.8993686 # sdlog = 0.4804343 # #Estimation Method: MLE # #Data: dat # #Censoring Variable: censored # #Sample Size: 20 # #Percent Censored: 25% # #Assumed Sample Size: 20 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact for # Complete Data # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = 0.00000 # UTL = 16.85556 ## Not run: # Note: The true 90'th percentile is 20.55231 #--------------------------------------------- qlnormAlt(0.9, mean = 10, cv = 1) #[1] 20.55231 # Compare the result using the method "gpq" tolIntLnormCensored(dat, censored, coverage = 0.9, ti.type="upper", ti.method = "gpq", seed = 432)$interval$limits # LTL UTL # 0.00000 17.85474 # Clean Up #--------- rm(dat, censored) #-------------------------------------------------------------- # Example 15-1 of USEPA (2009, p. 15-10) shows how to estimate # the mean and standard deviation using log-transformed multiply # left-censored manganese concentration data. Here we'll construct a # 95 EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored # 1 1 Well.1 <5 5.0 TRUE # 2 2 Well.1 12.1 12.1 FALSE # 3 3 Well.1 16.9 16.9 FALSE # ... # 23 3 Well.5 3.3 3.3 FALSE # 24 4 Well.5 8.4 8.4 FALSE # 25 5 Well.5 <2 2.0 TRUE with(EPA.09.Ex.15.1.manganese.df, tolIntLnormCensored(Manganese.ppb, Censored, coverage = 0.9, ti.type = "upper")) #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Lognormal # #Censoring Side: left # #Censoring Level(s): 2 5 # #Estimated Parameter(s): meanlog = 2.215905 # sdlog = 1.356291 # #Estimation Method: MLE # #Data: Manganese.ppb # #Censoring Variable: censored # #Sample Size: 25 # #Percent Censored: 24 # #Assumed Sample Size: 25 # #Tolerance Interval Coverage: 90 # #Coverage Type: content # #Tolerance Interval Method: Exact for # Complete Data # #Tolerance Interval Type: upper # #Confidence Level: 95 # #Tolerance Interval: LTL = 0.0000 # UTL = 110.9305 ## End(Not run)
Construct a β-content or β-expectation tolerance interval for a normal distribution.
tolIntNorm(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact")
x |
numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a normal (Gaussian) distribution
(i.e., |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
for the case of a two-sided tolerance interval, a character string specifying the method for
constructing the tolerance interval. This argument is ignored if |
If x contains any missing (NA), undefined (NaN) or infinite (Inf, -Inf) values, they will be removed prior to performing the estimation.
A tolerance interval for some population is an interval on the real line constructed so as to contain 100β% of the population (i.e., 100β% of all future observations), where 0 < β < 1. The quantity 100β% is called the coverage.

There are two kinds of tolerance intervals (Guttman, 1970):

A β-content tolerance interval with confidence level 100(1−α)% is constructed so that it contains at least 100β% of the population (i.e., the coverage is at least 100β%) with probability 100(1−α)%, where 0 < α < 1. The quantity 100(1−α)% is called the confidence level or confidence coefficient associated with the tolerance interval.

A β-expectation tolerance interval is constructed so that the average coverage of the interval is 100β%.

Note: A β-expectation tolerance interval with coverage 100β% is equivalent to a prediction interval for one future observation with associated confidence level 100β%. Note that there is no explicit confidence level associated with a β-expectation tolerance interval. If a β-expectation tolerance interval is treated as a β-content tolerance interval, the confidence level associated with this tolerance interval is usually around 50% (e.g., Guttman, 1970, Table 4.2, p.76).
For a normal distribution, the form of a two-sided tolerance interval is:

[x̄ − Ks, x̄ + Ks]

where x̄ denotes the sample mean, s denotes the sample standard deviation, and K denotes a constant that depends on the sample size n, the coverage, and, for a β-content tolerance interval (but not a β-expectation tolerance interval), the confidence level.

Similarly, the form of a one-sided lower tolerance interval is:

[x̄ − Ks, ∞)

and the form of a one-sided upper tolerance interval is:

(−∞, x̄ + Ks]

but the value of K differs for one-sided versus two-sided tolerance intervals. The derivation of the constant K is explained in the help file for tolIntNormK.
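As a quick illustration (added here; not part of the original help file), the two-sided interval returned by tolIntNorm can be reproduced by hand from the sample mean, the sample standard deviation, and the factor K returned by tolIntNormK:

library(EnvStats)

set.seed(42)
x <- rnorm(20, mean = 10, sd = 2)

xbar <- mean(x)
s    <- sd(x)
K    <- tolIntNormK(n = length(x), coverage = 0.95, conf.level = 0.95)

# Hand-computed limits: xbar +/- K * s
c(LTL = xbar - K * s, UTL = xbar + K * s)

# Should match the limits reported by tolIntNorm()
tolIntNorm(x, coverage = 0.95, conf.level = 0.95)$interval$limits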
If x is a numeric vector, tolIntNorm returns a list of class "estimate" containing the estimated parameters, a component called interval containing the tolerance interval information, and other information. See estimate.object for details.

If x is the result of calling an estimation function, tolIntNorm returns a list whose class is the same as x. The list contains the same components as x. If x already has a component called interval, this component is replaced with the tolerance interval information.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
tolIntNormK, tolIntLnorm, Normal, estimate.object, enorm, eqnorm, predIntNorm, Tolerance Intervals, Estimating Distribution Parameters, Estimating Distribution Quantiles.
# Generate 20 observations from a normal distribution with parameters # mean=10 and sd=2, then create a tolerance interval. # (Note: the call to set.seed simply allows you to reproduce this # example.) set.seed(250) dat <- rnorm(20, mean = 10, sd = 2) tolIntNorm(dat) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 9.861160 # sd = 1.180226 # #Estimation Method: mvue # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: two-sided # #Confidence Level: 95% # #Tolerance Interval: LTL = 6.603328 # UTL = 13.118993 #---------- # Clean up rm(dat) #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. # Here we will first take the log of the data and # then construct the tolerance interval; note however that it is # easier to call the function tolIntLnorm instead using the # original data. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 tol.int.list <- with(EPA.09.Ex.17.3.chrysene.df, tolIntNorm(log(Chrysene.ppb[Well.type == "Background"]), ti.type = "upper", coverage = 0.95, conf.level = 0.95)) tol.int.list #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Normal # #Estimated Parameter(s): mean = 2.5085773 # sd = 0.6279479 # #Estimation Method: mvue # #Data: log(Chrysene.ppb[Well.type == "Background"]) # #Sample Size: 8 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = -Inf # UTL = 4.510032 # Compute the upper tolerance interaval on the original scale # by exponentiating the upper tolerance limit: exp(tol.int.list$interval$limits["UTL"]) # UTL #90.9247 #---------- # Clean up rm(tol.int.list)
Construct a β-content or β-expectation tolerance interval for a normal distribution based on Type I or Type II censored data.
tolIntNormCensored(x, censored, censoring.side = "left", coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "mle", ti.method = "exact.for.complete", seed = NULL, nmc = 1000)
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
character string indicating the method to use for parameter estimation. |
ti.method |
character string specifying the method for constructing the tolerance
interval. Possible values are: |
seed |
for the case when |
nmc |
for the case when |
See the help file for tolIntNorm for an explanation of tolerance intervals. When ti.method="gpq", the tolerance interval is constructed using the method of Generalized Pivotal Quantities as explained in Krishnamoorthy and Mathew (2009, p. 327). When ti.method="exact.for.complete" or ti.method="wald.wolfowitz.for.complete", the tolerance interval is constructed by first computing the maximum likelihood estimates of the mean and standard deviation by calling enormCensored, then passing these values to the function tolIntNorm to produce the tolerance interval as if the estimates were based on complete rather than censored data. These last two methods are purely ad-hoc and their properties need to be studied.
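The sketch below (added for illustration; it is not part of the original help file) mimics the "exact.for.complete" construction just described: estimate the mean and standard deviation from the censored data with enormCensored, then apply the complete-data factor from tolIntNormK as if the data were uncensored. The simulated data match the first example further down.

library(EnvStats)

set.seed(250)
x <- rnorm(20, mean = 10, sd = 3)
censored <- x < 9            # Type I left-censoring at 9
x[censored] <- 9

# Censored-data MLEs of the mean and standard deviation
est <- enormCensored(x, censored, method = "mle")$parameters
K   <- tolIntNormK(n = length(x), coverage = 0.9, ti.type = "upper",
  conf.level = 0.95)

# Hand-rolled upper tolerance limit based on the censored-data MLEs
unname(est["mean"] + K * est["sd"])

# Should agree with the UTL reported by tolIntNormCensored() under the
# default ti.method = "exact.for.complete"
tolIntNormCensored(x, censored, coverage = 0.9,
  ti.type = "upper")$interval$limits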
A list of class "estimateCensored" containing the estimated parameters, the tolerance interval, and other information. See estimateCensored.object for details.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
gpqTolIntNormSinglyCensored, eqnormCensored, enormCensored, estimateCensored.object.
# Generate 20 observations from a normal distribution with parameters # mean=10 and sd=3, censor the observations less than 9, # then create a one-sided upper tolerance interval with 90% # coverage and 95% confidence based on these Type I left, singly # censored data. # (Note: the call to set.seed allows you to reproduce this example. set.seed(250) dat <- sort(rnorm(20, mean = 10, sd = 3)) dat # [1] 6.406313 7.126621 8.119660 8.277216 8.426941 8.847961 # [7] 8.899098 9.357509 9.525756 9.534858 9.558567 9.847663 #[13] 10.001989 10.014964 10.841384 11.386264 11.721850 12.524300 #[19] 12.602469 12.813429 censored <- dat < 9 dat[censored] <- 9 tolIntNormCensored(dat, censored, coverage = 0.9, ti.type="upper") #Results of Distribution Parameter Estimation #Based on Type I Censored Data #-------------------------------------------- # #Assumed Distribution: Normal # #Censoring Side: left # #Censoring Level(s): 9 # #Estimated Parameter(s): mean = 9.700962 # sd = 1.845067 # #Estimation Method: MLE # #Data: dat # #Censoring Variable: censored # #Sample Size: 20 # #Percent Censored: 35% # #Assumed Sample Size: 20 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact for # Complete Data # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Interval: LTL = -Inf # UTL = 13.25454 ## Not run: # Note: The true 90'th percentile is 13.84465 #--------------------------------------------- qnorm(0.9, mean = 10, sd = 3) # [1] 13.84465 # Compare the result using the method "gpq" tolIntNormCensored(dat, censored, coverage = 0.9, ti.type="upper", ti.method = "gpq", seed = 432)$interval$limits # LTL UTL # -Inf 13.56826 # Clean Up #--------- rm(dat, censored) #========== # Example 15-1 of USEPA (2009, p. 15-10) shows how to estimate # the mean and standard deviation using log-transformed multiply # left-censored manganese concentration data. Here we'll construct a # 95 EPA.09.Ex.15.1.manganese.df # Sample Well Manganese.Orig.ppb Manganese.ppb Censored # 1 1 Well.1 <5 5.0 TRUE # 2 2 Well.1 12.1 12.1 FALSE # 3 3 Well.1 16.9 16.9 FALSE # ... # 23 3 Well.5 3.3 3.3 FALSE # 24 4 Well.5 8.4 8.4 FALSE # 25 5 Well.5 <2 2.0 TRUE with(EPA.09.Ex.15.1.manganese.df, tolIntNormCensored(log(Manganese.ppb), Censored, coverage = 0.9, ti.type = "upper")) # Results of Distribution Parameter Estimation # Based on Type I Censored Data # -------------------------------------------- # # Assumed Distribution: Normal # # Censoring Side: left # # Censoring Level(s): 0.6931472 1.6094379 # # Estimated Parameter(s): mean = 2.215905 # sd = 1.356291 # # Estimation Method: MLE # # Data: log(Manganese.ppb) # # Censoring Variable: censored # # Sample Size: 25 # # Percent Censored: 24 # # Assumed Sample Size: 25 # # Tolerance Interval Coverage: 90 # # Coverage Type: content # # Tolerance Interval Method: Exact for # Complete Data # # Tolerance Interval Type: upper # # Confidence Level: 95 # # Tolerance Interval: LTL = -Inf # UTL = 4.708904 ## End(Not run)
Compute the half-width of a tolerance interval for a normal distribution.
tolIntNormHalfWidth(n, sigma.hat = 1, coverage = 0.95, cov.type = "content", conf.level = 0.95, method = "wald.wolfowitz")
n |
numeric vector of positive integers greater than 1 indicating the sample size upon
which the prediction interval is based.
Missing ( |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s).
The default value is |
coverage |
numeric vector of values between 0 and 1 indicating the desired coverage of the
tolerance interval. The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval. The
possible values are |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
method |
character string specifying the method for constructing the tolerance interval.
The possible values are |
If the arguments n, sigma.hat, coverage, and conf.level are not all the same length, they are replicated to be the same length as the length of the longest argument.
The help files for tolIntNorm and tolIntNormK give formulas for a two-sided tolerance interval based on the sample size, the observed sample mean and sample standard deviation, and specified confidence level and coverage. Specifically, the two-sided tolerance interval is given by:

[x̄ − Ks, x̄ + Ks]

where x̄ denotes the sample mean:

x̄ = (1/n) Σᵢ xᵢ

s denotes the sample standard deviation:

s² = [1/(n−1)] Σᵢ (xᵢ − x̄)²

and K denotes a constant that depends on the sample size n, the confidence level, and the coverage (see the help file for tolIntNormK). Thus, the half-width of the tolerance interval is given by:

HW = Ks
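A minimal check of this relationship (added here; not part of the original help file), assuming the default "wald.wolfowitz" method used by tolIntNormHalfWidth:

library(EnvStats)

n         <- 10
sigma.hat <- 2

# Half-width as reported by tolIntNormHalfWidth()
tolIntNormHalfWidth(n = n, sigma.hat = sigma.hat)

# The same quantity computed directly as K * sigma.hat, using the same
# approximation for K
tolIntNormK(n = n, method = "wald.wolfowitz") * sigma.hat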
numeric vector of half-widths.
See the help file for tolIntNorm.
In the course of designing a sampling program, an environmental scientist may wish to determine the relationship between sample size, confidence level, and half-width if one of the objectives of the sampling program is to produce tolerance intervals. The functions tolIntNormHalfWidth, tolIntNormN, and plotTolIntNormDesign can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
See the help file for tolIntNorm.
tolIntNorm, tolIntNormK, tolIntNormN, plotTolIntNormDesign, Normal.
# Look at how the half-width of a tolerance interval increases with # increasing coverage: seq(0.5, 0.9, by=0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(tolIntNormHalfWidth(n = 10, coverage = seq(0.5, 0.9, by = 0.1)), 2) #[1] 1.17 1.45 1.79 2.21 2.84 #---------- # Look at how the half-width of a tolerance interval decreases with # increasing sample size: 2:5 #[1] 2 3 4 5 round(tolIntNormHalfWidth(n = 2:5), 2) #[1] 37.67 9.92 6.37 5.08 #---------- # Look at how the half-width of a tolerance interval increases with # increasing estimated standard deviation for a fixed sample size: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 round(tolIntNormHalfWidth(n = 10, sigma.hat = seq(0.5, 2, by = 0.5)), 2) #[1] 1.69 3.38 5.07 6.76 #---------- # Look at how the half-width of a tolerance interval increases with # increasing confidence level for a fixed sample size: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(tolIntNormHalfWidth(n = 5, conf = seq(0.5, 0.9, by = 0.1)), 2) #[1] 2.34 2.58 2.89 3.33 4.15 #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. # Here we will first take the log of the data and then estimate the # standard deviation based on the two background wells. We will use this # estimate of standard deviation to compute the half-widths of # future tolerance intervals on the log-scale for various sample sizes. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 summary.stats <- summaryStats(log(Chrysene.ppb) ~ Well.type, data = EPA.09.Ex.17.3.chrysene.df) summary.stats # N Mean SD Median Min Max #Background 8 2.5086 0.6279 2.4359 1.7405 3.6687 #Compliance 12 3.4173 0.4361 3.4111 2.7081 4.2195 sigma.hat <- summary.stats["Background", "SD"] sigma.hat #[1] 0.6279 tolIntNormHalfWidth(n = c(4, 8, 16), sigma.hat = sigma.hat) #[1] 3.999681 2.343160 1.822759 #========== # Clean up #--------- rm(summary.stats, sigma.hat)
Compute the Value of K for a Tolerance Interval for a Normal Distribution
Compute the value of K (the multiplier of the estimated standard deviation) used to construct a tolerance interval based on data from a normal distribution.
tolIntNormK(n, df = n - 1, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95, method = "exact", rel.tol = 1e-07, abs.tol = rel.tol)
n |
a positive integer greater than 2 indicating the sample size upon which the tolerance interval is based. |
df |
the degrees of freedom associated with the tolerance interval. The default is
|
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
method |
for the case of a two-sided tolerance interval, a character string specifying the method for
constructing the tolerance interval. This argument is ignored if |
rel.tol |
in the case when |
abs.tol |
in the case when |
A tolerance interval for some population is an interval on the real line constructed so as to contain 100β% of the population (i.e., 100β% of all future observations), where 0 < β < 1. The quantity 100β% is called the coverage.

There are two kinds of tolerance intervals (Guttman, 1970):

A β-content tolerance interval with confidence level 100(1−α)% is constructed so that it contains at least 100β% of the population (i.e., the coverage is at least 100β%) with probability 100(1−α)%, where 0 < α < 1. The quantity 100(1−α)% is called the confidence level or confidence coefficient associated with the tolerance interval.

A β-expectation tolerance interval is constructed so that the average coverage of the interval is 100β%.

Note: A β-expectation tolerance interval with coverage 100β% is equivalent to a prediction interval for one future observation with associated confidence level 100β%. Note that there is no explicit confidence level associated with a β-expectation tolerance interval. If a β-expectation tolerance interval is treated as a β-content tolerance interval, the confidence level associated with this tolerance interval is usually around 50% (e.g., Guttman, 1970, Table 4.2, p.76).
For a normal distribution, the form of a two-sided tolerance interval is:

[x̄ − Ks, x̄ + Ks]

where x̄ denotes the sample mean, s denotes the sample standard deviation, and K denotes a constant that depends on the sample size n, the coverage, and, for a β-content tolerance interval (but not a β-expectation tolerance interval), the confidence level.

Similarly, the form of a one-sided lower tolerance interval is:

[x̄ − Ks, ∞)

and the form of a one-sided upper tolerance interval is:

(−∞, x̄ + Ks]

but the value of K differs for one-sided versus two-sided tolerance intervals.
The Derivation of K for a β-Content Tolerance Interval

One-Sided Case

When ti.type="upper" or ti.type="lower", the constant K for a 100β% β-content tolerance interval with associated confidence level 100(1−α)% is given by:

K = t(n−1, 1−α, z_β √n) / √n

where t(ν, p, δ) denotes the p'th quantile of a non-central t-distribution with ν degrees of freedom and noncentrality parameter δ (see the help file for TDist), and z_p denotes the p'th quantile of a standard normal distribution.
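The one-sided formula is easy to check directly with R's non-central t quantile function. The sketch below (added here; not part of the original help file) uses the same inputs as the tolIntNormK() example further down:

library(EnvStats)

n        <- 20
coverage <- 0.99   # beta
conf     <- 0.90   # 1 - alpha

# K = t(n-1, 1-alpha, z_beta * sqrt(n)) / sqrt(n)
K.by.hand <- qt(conf, df = n - 1, ncp = qnorm(coverage) * sqrt(n)) / sqrt(n)
K.by.hand
# tolIntNormK() reports 3.051543 for these inputs in the example below; the
# hand computation should reproduce it (up to qt()'s non-central t accuracy)

tolIntNormK(n = n, ti.type = "upper", coverage = coverage, conf.level = conf)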
Two-Sided Case

When ti.type="two-sided" and method="exact", the exact formula for the constant K for a 100β% β-content tolerance interval with associated confidence level 100(1−α)% requires numerical integration and has been derived by several different authors, including Odeh (1978), Eberhardt et al. (1989), Jilek (1988), Fujino (1989), and Janiga and Miklos (2001). Specifically, for given values of the sample size n, degrees of freedom ν, confidence level (1−α), and coverage β, the constant K is the solution to the equation:

√[n/(2π)] ∫_{−∞}^{∞} Q(ν, νR²/K²) exp(−nx²/2) dx = 1 − α

where Q(ν, c) denotes the upper-tail area from c to ∞ of the chi-squared distribution with ν degrees of freedom, and R (a function of x) is the solution to the equation:

Φ(x + R) − Φ(x − R) = β

where Φ() denotes the standard normal cumulative distribution function.
When ti.type="two-sided"
and method="wald.wolfowitz"
, the approximate formula
due to Wald and Wolfowitz (1946) for the constant for a
-content tolerance interval with associated confidence level
is given by:
where is the solution to the equation:
denotes the standard normal cumulative distribuiton function, and
is
given by:
where denotes the
'th quantile of the chi-squared
distribution with
degrees of freedom.
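The Wald-Wolfowitz approximation is also easy to compute by hand. The sketch below (added here; not part of the original help file) solves for r numerically with uniroot and compares the result to tolIntNormK():

library(EnvStats)

n     <- 20
nu    <- n - 1
beta  <- 0.95   # coverage
alpha <- 0.05   # 1 - confidence level

# r solves Phi(1/sqrt(n) + r) - Phi(1/sqrt(n) - r) = beta
r <- uniroot(function(r) pnorm(1/sqrt(n) + r) - pnorm(1/sqrt(n) - r) - beta,
  interval = c(0, 10))$root

# u = sqrt(nu / chi-squared quantile at alpha with nu degrees of freedom)
u <- sqrt(nu / qchisq(alpha, df = nu))

r * u
# Should be close to 2.751789, the value shown for
# tolIntNormK(n = 20, method = "wald") in the example below

tolIntNormK(n = n, method = "wald.wolfowitz")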
The Derivation of K for a β-Expectation Tolerance Interval

As stated above, a β-expectation tolerance interval with coverage 100β% is equivalent to a prediction interval for one future observation with associated confidence level 100β%. This is because the probability that any single future observation will fall into this interval is 100β%, so the distribution of the number of N future observations that will fall into this interval is binomial with parameters size = N and prob = β (see the help file for Binomial). Hence the expected proportion of future observations that fall into this interval is 100β% and is independent of the value of N. See the help file for predIntNormK for information on how to derive K for these intervals.
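A minimal sketch of this equivalence (added here; not part of the original help file), assuming tolIntNormK accepts cov.type="expectation" and that predIntNormK with k = 1 gives the exact one-future-observation factor:

library(EnvStats)

n    <- 20
beta <- 0.90

# Factor for a two-sided beta-expectation tolerance interval
tolIntNormK(n = n, cov.type = "expectation", coverage = beta)

# Factor for a two-sided prediction interval for one future observation;
# the two values should agree
predIntNormK(n = n, k = 1, conf.level = beta)

# Both should also match the closed-form expression
qt(1 - (1 - beta)/2, df = n - 1) * sqrt(1 + 1/n)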
The value of K, a numeric scalar used to construct tolerance intervals for a normal (Gaussian) distribution.
Tabled values of K are given in Gibbons et al. (2009), Gilbert (1987), Guttman (1970), Krishnamoorthy and Mathew (2009), Owen (1962), Odeh and Owen (1980), and USEPA (2009).
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York.
Eberhardt, K.R., R.W. Mee, and C.P. Reeve. (1989). Computing Factors for Exact Two-Sided Tolerance Limits for a Normal Distribution. Communications in Statistics, Part B-Simulation and Computation 18, 397-413.
Ellison, B.E. (1964). On Two-Sided Tolerance Intervals for a Normal Distribution. Annals of Mathematical Statistics 35, 762-772.
Fujino, T. (1989). Exact Two-Sided Tolerance Limits for a Normal Distribution. Japanese Journal of Applied Statistics 18, 29-36.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.
Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Jilek, M. (1988). Statisticke Tolerancni Meze. SNTL, Praha.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Janiga, I., and R. Miklos. (2001). Statistical Tolerance Intervals for a Normal Distribution. Measurement Science Review 11, 29-32.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Odeh, R.E. (1978). Tables of Two-Sided Tolerance Factors for a Normal Distribution. Communications in Statistics, Part B-Simulation and Computation 7, 183-201.
Odeh, R.E., and D.B. Owen. (1980). Tables for Normal Tolerance Limits, Sampling Plans, and Screening. Marcel Dekker, New York.
Owen, D.B. (1962). Handbook of Statistical Tables. Addison-Wesley, Reading, MA.
Singh, A., R. Maichle, and N. Armbya. (2010a). ProUCL Version 4.1.00 User Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
Singh, A., N. Armbya, and A. Singh. (2010b). ProUCL Version 4.1.00 Technical Guide (Draft). EPA/600/R-07/041, May 2010. Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Wald, A., and J. Wolfowitz. (1946). Tolerance Limits for a Normal Distribution. Annals of Mathematical Statistics 17, 208-215.
tolIntNorm, predIntNorm, Normal, estimate.object, enorm, eqnorm, Tolerance Intervals, Prediction Intervals, Estimating Distribution Parameters, Estimating Distribution Quantiles.
# Compute the value of K for a two-sided 95% beta-content # tolerance interval with associated confidence level 95% # given a sample size of n=20. #---------- # Exact method tolIntNormK(n = 20) #[1] 2.760346 #---------- # Approximate method due to Wald and Wolfowitz (1946) tolIntNormK(n = 20, method = "wald") # [1] 2.751789 #-------------------------------------------------------------------- # Compute the value of K for a one-sided upper tolerance limit # with 99% coverage and associated confidence level 90% # given a samle size of n=20. tolIntNormK(n = 20, ti.type = "upper", coverage = 0.99, conf.level = 0.9) #[1] 3.051543 #-------------------------------------------------------------------- # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal # distribution. The sample size is n = 8 observations from # the two compliance wells. Here we will compute the # multiplier for the log-transformed data. tolIntNormK(n = 8, ti.type = "upper") #[1] 3.187294
Compute the sample size necessary to achieve a specified half-width of a tolerance interval for a normal distribution, given the estimated standard deviation, coverage, and confidence level.
tolIntNormN(half.width, sigma.hat = 1, coverage = 0.95, cov.type = "content", conf.level = 0.95, method = "wald.wolfowitz", round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
half.width |
numeric vector of (positive) half-widths.
Missing ( |
sigma.hat |
numeric vector specifying the value(s) of the estimated standard deviation(s).
The default value is |
coverage |
numeric vector of values between 0 and 1 indicating the desired coverage of the
tolerance interval. The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval. The
possible values are |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the
prediction interval. The default value is |
method |
character string specifying the method for constructing the tolerance interval.
The possible values are |
round.up |
logical scalar indicating whether to round up the values of the computed sample
size(s) to the next smallest integer. The default value is |
n.max |
positive integer greater than 1 specifying the maximum possible sample size.
The default value is |
tol |
numeric scalar indicating the tolerance to use in the |
maxiter |
positive integer indicating the maximum number of iterations to use in the
|
If the arguments half.width, sigma.hat, coverage, and conf.level are not all the same length, they are replicated to be the same length as the length of the longest argument.
The help files for tolIntNorm and tolIntNormK give formulas for a two-sided tolerance interval based on the sample size, the observed sample mean and sample standard deviation, and specified confidence level and coverage. Specifically, the two-sided tolerance interval is given by:

[x̄ − Ks, x̄ + Ks]

where x̄ denotes the sample mean:

x̄ = (1/n) Σᵢ xᵢ

s denotes the sample standard deviation:

s² = [1/(n−1)] Σᵢ (xᵢ − x̄)²

and K denotes a constant that depends on the sample size n, the confidence level, and the coverage (see the help file for tolIntNormK). Thus, the half-width of the tolerance interval is given by:

HW = Ks

The function tolIntNormN uses the uniroot search algorithm to determine the sample size for specified values of the half-width, sample standard deviation, coverage, and confidence level. Note that unlike a confidence interval, the half-width of a tolerance interval does not approach 0 as the sample size increases.
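The sketch below (added here; not part of the original help file) illustrates that last remark: as n grows, the half-width K * sigma.hat does not shrink to 0 but levels off at roughly the normal quantile z[(1 + coverage)/2] times sigma.hat.

library(EnvStats)

sigma.hat <- 1

# Half-widths for increasing sample sizes (95% coverage, 95% confidence)
round(tolIntNormHalfWidth(n = c(10, 100, 1000, 10000),
  sigma.hat = sigma.hat), 3)

# Approximate limiting value for 95% coverage
qnorm(0.975) * sigma.hat
#[1] 1.959964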
numeric vector of sample sizes.
See the help file for tolIntNorm.
In the course of designing a sampling program, an environmental scientist may wish to determine the relationship between sample size, confidence level, and half-width if one of the objectives of the sampling program is to produce tolerance intervals. The functions tolIntNormHalfWidth, tolIntNormN, and plotTolIntNormDesign can be used to investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
See the help file for tolIntNorm
.
tolIntNorm
, tolIntNormK
,
tolIntNormHalfWidth
, plotTolIntNormDesign
,
Normal
.
# Look at how the required sample size for a tolerance interval increases # with increasing coverage: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 tolIntNormN(half.width = 3, coverage = seq(0.5, 0.9, by = 0.1)) #[1] 4 4 5 6 9 #---------- # Look at how the required sample size for a tolerance interval decreases # with increasing half-width: 3:6 #[1] 3 4 5 6 tolIntNormN(half.width = 3:6) #[1] 15 8 6 5 tolIntNormN(3:6, round = FALSE) #[1] 14.199735 7.022572 5.092374 4.214371 #---------- # Look at how the required sample size for a tolerance interval increases # with increasing estimated standard deviation for a fixed half-width: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 tolIntNormN(half.width = 4, sigma.hat = seq(0.5, 2, by = 0.5)) #[1] 4 8 24 3437 #---------- # Look at how the required sample size for a tolerance interval increases # with increasing confidence level for a fixed half-width: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 tolIntNormN(half.width = 3, conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 3 4 5 7 11 #========== # Example 17-3 of USEPA (2009, p. 17-17) shows how to construct a # beta-content upper tolerance limit with 95% coverage and 95% # confidence using chrysene data and assuming a lognormal distribution. # The data for this example are stored in EPA.09.Ex.17.3.chrysene.df, # which contains chrysene concentration data (ppb) found in water # samples obtained from two background wells (Wells 1 and 2) and # three compliance wells (Wells 3, 4, and 5). The tolerance limit # is based on the data from the background wells. # Here we will first take the log of the data and then estimate the # standard deviation based on the two background wells. We will use this # estimate of standard deviation to compute required sample sizes for # various half-widths on the log-scale. head(EPA.09.Ex.17.3.chrysene.df) # Month Well Well.type Chrysene.ppb #1 1 Well.1 Background 19.7 #2 2 Well.1 Background 39.2 #3 3 Well.1 Background 7.8 #4 4 Well.1 Background 12.8 #5 1 Well.2 Background 10.2 #6 2 Well.2 Background 7.2 longToWide(EPA.09.Ex.17.3.chrysene.df, "Chrysene.ppb", "Month", "Well") # Well.1 Well.2 Well.3 Well.4 Well.5 #1 19.7 10.2 68.0 26.8 47.0 #2 39.2 7.2 48.9 17.7 30.5 #3 7.8 16.1 30.1 31.9 15.0 #4 12.8 5.7 38.1 22.2 23.4 summary.stats <- summaryStats(log(Chrysene.ppb) ~ Well.type, data = EPA.09.Ex.17.3.chrysene.df) summary.stats # N Mean SD Median Min Max #Background 8 2.5086 0.6279 2.4359 1.7405 3.6687 #Compliance 12 3.4173 0.4361 3.4111 2.7081 4.2195 sigma.hat <- summary.stats["Background", "SD"] sigma.hat #[1] 0.6279 tolIntNormN(half.width = c(4, 2, 1), sigma.hat = sigma.hat) #[1] 4 12 NA #Warning message: #In tolIntNormN(half.width = c(4, 2, 1), sigma.hat = sigma.hat) : # Value of 'half.width' is too smallfor element3. # Try increasing the value of 'n.max'. # NOTE: We cannot achieve a half-width of 1 for the given value of # sigma.hat for a tolerance interval with 95% coverage and # 95% confidence. The default value of n.max is 5000, but in fact, # even with a million observations the half width is greater than 1. tolIntNormHalfWidth(n = 1e6, sigma.hat = sigma.hat) #[1] 1.232095 #========== # Clean up #--------- rm(summary.stats, sigma.hat)
Construct a β-content or β-expectation tolerance interval
nonparametrically without making any assumptions about the form of the
distribution except that it is continuous.
tolIntNpar(x, coverage, conf.level, cov.type = "content", ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), lb = -Inf, ub = Inf, ti.type = "two-sided")
x |
numeric vector of observations. Missing ( |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ltl.rank |
positive integer indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
positive integer related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
lb , ub
|
scalars indicating lower and upper bounds on the distribution. By default, |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
A tolerance interval for some population is an interval on the real line constructed so as to
contain 100β% of the population (i.e., 100β% of all
future observations), where 0 < β < 1. The quantity 100β% is called
the coverage.

There are two kinds of tolerance intervals (Guttman, 1970):

A β-content tolerance interval with confidence level 100(1-α)% is
constructed so that it contains at least 100β% of the population (i.e., the
coverage is at least 100β%) with probability 100(1-α)%, where
0 < α < 1. The quantity 100(1-α)% is called the confidence level or
confidence coefficient associated with the tolerance interval.

A β-expectation tolerance interval is constructed so that the average coverage of
the interval is 100β%.

Note: A β-expectation tolerance interval with coverage 100β% is
equivalent to a prediction interval for one future observation with associated confidence level
100β%. Note that there is no explicit confidence level associated with a
β-expectation tolerance interval. If a β-expectation tolerance interval is
treated as a β-content tolerance interval, the confidence level associated with this
tolerance interval is usually around 50% (e.g., Guttman, 1970, Table 4.2, p.76).
The Form of a Nonparametric Tolerance Interval
Let x_1, x_2, ..., x_n denote a random sample of n independent observations
from some continuous distribution and let x_(i) denote the i'th order
statistic in x_1, x_2, ..., x_n. A two-sided nonparametric tolerance interval is
constructed as:

[x_(u), x_(v)]    (1)

where u and v are positive integers between 1 and n, and u < v. That is, u
denotes the rank of the lower tolerance limit, and v
denotes the rank of the upper tolerance limit. To make it easier to write
some equations later on, we can also write the tolerance interval (1) in a slightly
different way as:

[x_(u), x_(n+1-w)]    (2)

where

w = n + 1 - v

so that w is a positive integer between 1 and n - u, and u < n + 1 - w.

In terms of the arguments to the function tolIntNpar, the argument
ltl.rank corresponds to u, and the argument
n.plus.one.minus.utl.rank corresponds to w.

If we allow u = 0 and w = 0 and define lower and upper bounds as:

x_(0) = lb,    x_(n+1) = ub

then equation (2) above can also represent a one-sided lower or one-sided upper tolerance interval as well. That is, a one-sided lower nonparametric tolerance interval is constructed as:

[x_(u), ub]

and a one-sided upper nonparametric tolerance interval is constructed as:

[lb, x_(n+1-w)]

Usually, u = 1 or u = 2 and w = 1 or w = 2.
Let C be a random variable denoting the coverage of the above nonparametric
tolerance intervals. Wilks (1941) showed that the distribution of C follows a
beta distribution with parameters shape1 = v - u and shape2 = w + u
when the unknown distribution is continuous.
Computations for a β-Content Tolerance Interval

For a β-content tolerance interval, if the coverage C = β is specified,
then the associated confidence level 100(1-α)% is computed as:

1 - α = 1 - F(β; v-u, w+u)

where F(y; δ, γ) denotes the cumulative distribution function of a
beta random variable with parameters shape1=δ and shape2=γ
evaluated at y.

Similarly, if the confidence level associated with the tolerance interval is specified as
100(1-α)%, then the coverage C = β is computed as:

β = B(α; v-u, w+u)

where B(p; δ, γ) denotes the p'th quantile of a
beta distribution with parameters shape1=δ and shape2=γ.
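As a hand check of these two formulas (purely illustrative; the functions tolIntNparConfLevel and tolIntNparCoverage documented elsewhere in this manual do this for you), consider an upper tolerance limit set at the sample maximum, so u = 0, w = 1, and hence v = n. With n = 24 the beta-distribution results can be evaluated directly with pbeta() and qbeta():

# Upper limit at the maximum: shape1 = v - u = n, shape2 = w + u = 1
n <- 24
# Confidence level achieved for a specified coverage of 95%:
1 - pbeta(0.95, shape1 = n, shape2 = 1)
# about 0.708, matching tolIntNparConfLevel(n = 24, coverage = 0.95, ti.type = "upper")
# Coverage achieved for a specified confidence level of 95%:
qbeta(0.05, shape1 = n, shape2 = 1)
# about 0.8827, matching tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper")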
Computations for a β-Expectation Tolerance Interval

For a β-expectation tolerance interval, the expected coverage is simply
the mean of a beta random variable with parameters
shape1 = v - u and shape2 = w + u, which is given by:

E(C) = (v - u) / (n + 1)

As stated above, a β-expectation tolerance interval with coverage
100β% is equivalent to a prediction interval for one future observation
with associated confidence level 100β%. This is because the probability
that any single future observation will fall into this interval is 100β%,
so the distribution of the number of N future observations that will fall into
this interval is binomial with parameters size=N and prob=β.
Hence the expected proportion of future observations
that fall into this interval is 100β% and is independent of the value of N.
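Continuing the same illustrative hand check, the expected coverage for an upper limit at the maximum of n = 24 observations (u = 0, w = 1, v = n) is:

n <- 24
(n - 0) / (n + 1)
#[1] 0.96
# This matches the 96% coverage reported for the beta-expectation interval in the
# copper example of the Examples section below.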
See the help file for
predIntNpar
for more information on constructing
a nonparametric prediction interval.
A list of class "estimate"
containing the estimated parameters,
the tolerance interval, and other information. See estimate.object
for details.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Steven P. Millard ([email protected])
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Danziger, L., and S. Davis. (1964). Tables of Distribution-Free Tolerance Limits. Annals of Mathematical Statistics 35(5), 1361–1365.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817–865.
Davis, C.B., and R.J. McNichols. (1994a). Ground Water Monitoring Statistics Update: Part I: Progress Since 1988. Ground Water Monitoring and Remediation 14(4), 148–158.
Gibbons, R.D. (1991b). Statistical Tolerance Limits for Ground-Water Monitoring. Ground Water 29, 563–570.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT, Chapter 2.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York, 392pp.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, pp.88-90.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Wilks, S.S. (1941). Determination of Sample Sizes for Setting Tolerance Limits. Annals of Mathematical Statistics 12, 91–96.
eqnpar
, estimate.object
,
tolIntNparN
, Tolerance Intervals,
Estimating Distribution Parameters, Estimating Distribution Quantiles.
# Generate 20 observations from a lognormal mixture distribution # with parameters mean1=1, cv1=0.5, mean2=5, cv2=1, and p.mix=0.1. # The exact two-sided interval that contains 90% of this distribution is given by: # [0.682312, 13.32052]. Use tolIntNpar to construct a two-sided 90% # \eqn{\beta}-content tolerance interval. Note that the associated confidence level # is only 61%. A larger sample size is required to obtain a larger confidence # level (see the help file for tolIntNparN). # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(23) dat <- rlnormMixAlt(20, 1, 0.5, 5, 1, 0.1) tolIntNpar(dat, coverage = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 90% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: two-sided # #Confidence Level: 60.8253% # #Tolerance Limit Rank(s): 1 20 # #Tolerance Interval: LTL = 0.5035035 # UTL = 9.9504662 #---------- # Clean up rm(dat) #---------- # Reproduce Example 17-4 on page 17-21 of USEPA (2009). This example uses # copper concentrations (ppb) from 3 background wells to set an upper # limit for 2 compliance wells. The maximum value from the 3 wells is set # to the 95% confidence upper tolerance limit, and we need to determine the # coverage of this tolerance interval. The data are stored in EPA.92c.copper2.df. # Note that even though these data are Type I left singly censored, it is still # possible to compute an upper tolerance interval using any of the uncensored # observations as the upper limit. EPA.92c.copper2.df # Copper.orig Copper Censored Month Well Well.type #1 <5 5.0 TRUE 1 1 Background #2 <5 5.0 TRUE 2 1 Background #3 7.5 7.5 FALSE 3 1 Background #... #9 9.2 9.2 FALSE 1 2 Background #10 <5 5.0 TRUE 2 2 Background #11 <5 5.0 TRUE 3 2 Background #... #17 <5 5.0 TRUE 1 3 Background #18 5.4 5.4 FALSE 2 3 Background #19 6.7 6.7 FALSE 3 3 Background #... #29 6.2 6.2 FALSE 5 4 Compliance #30 <5 5.0 TRUE 6 4 Compliance #31 7.8 7.8 FALSE 7 4 Compliance #... #38 <5 5.0 TRUE 6 5 Compliance #39 5.6 5.6 FALSE 7 5 Compliance #40 <5 5.0 TRUE 8 5 Compliance with(EPA.92c.copper2.df, tolIntNpar(Copper[Well.type=="Background"], conf.level = 0.95, lb = 0, ti.type = "upper")) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Copper[Well.type == "Background"] # #Sample Size: 24 # #Tolerance Interval Coverage: 88.26538% # #Coverage Type: content # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Confidence Level: 95% # #Tolerance Limit Rank(s): 24 # #Tolerance Interval: LTL = 0.0 # UTL = 9.2 #---------- # Repeat the last example, except compute an upper # \eqn{\beta}-expectation tolerance interval: with(EPA.92c.copper2.df, tolIntNpar(Copper[Well.type=="Background"], cov.type = "expectation", lb = 0, ti.type = "upper")) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: None # #Data: Copper[Well.type == "Background"] # #Sample Size: 24 # #Tolerance Interval Coverage: 96% # #Coverage Type: expectation # #Tolerance Interval Method: Exact # #Tolerance Interval Type: upper # #Tolerance Limit Rank(s): 24 # #Tolerance Interval: LTL = 0.0 # UTL = 9.2
Compute the confidence level associated with a nonparametric β-content tolerance
interval for a continuous distribution given the sample size, coverage, and ranks of the
order statistics used for the interval.
tolIntNparConfLevel(n, coverage = 0.95, ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), ti.type = "two.sided")
n |
vector of positive integers specifying the sample sizes.
Missing ( |
coverage |
numeric vector of values between 0 and 1 indicating the desired coverage of the
|
ltl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
If the arguments n
, coverage
, ltl.rank
, and
n.plus.one.minus.utl.rank
are not all the same length, they are replicated to be the
same length as the length of the longest argument.
The help file for tolIntNpar
explains how nonparametric β-content
tolerance intervals are constructed and how the confidence level
associated with the tolerance interval is computed based on specified values
for the sample size, the coverage, and the ranks of the order statistics used for
the bounds of the tolerance interval.
vector of values between 0 and 1 indicating the confidence level associated with the specified nonparametric tolerance interval.
See the help file for tolIntNpar
.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, coverage, and confidence level if one of the objectives of
the sampling program is to produce tolerance intervals. The functions
tolIntNparN
, tolIntNparCoverage
, tolIntNparConfLevel
, and
plotTolIntNparDesign
can be used to investigate these relationships for
constructing nonparametric tolerance intervals.
Steven P. Millard ([email protected])
See the help file for tolIntNpar
.
tolIntNpar
, tolIntNparN
, tolIntNparCoverage
,
plotTolIntNparDesign
.
# Look at how the confidence level of a nonparametric tolerance interval increases with # increasing sample size: seq(10, 60, by=10) #[1] 10 20 30 40 50 60 round(tolIntNparConfLevel(n = seq(10, 60, by = 10)), 2) #[1] 0.09 0.26 0.45 0.60 0.72 0.81 #---------- # Look at how the confidence level of a nonparametric tolerance interval decreases with # increasing coverage: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(tolIntNparConfLevel(n = 10, coverage = seq(0.5, 0.9, by = 0.1)), 2) #[1] 0.99 0.95 0.85 0.62 0.26 #---------- # Look at how the confidence level of a nonparametric tolerance interval decreases with the # rank of the lower tolerance limit: round(tolIntNparConfLevel(n = 60, ltl.rank = 1:5), 2) #[1] 0.81 0.58 0.35 0.18 0.08 #========== # Example 17-4 on page 17-21 of USEPA (2009) uses copper concentrations (ppb) from 3 # background wells to set an upper limit for 2 compliance wells. There are 6 observations # per well, and the maximum value from the 3 wells is set to the 95% confidence upper # tolerance limit, and we need to determine the coverage of this tolerance interval. tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper") #[1] 0.8826538 # Here we will modify the example and determine the confidence level of the tolerance # interval when we set the coverage to 95%. tolIntNparConfLevel(n = 24, coverage = 0.95, ti.type = "upper") # [1] 0.708011
Compute the coverage associated with a nonparametric tolerance interval for a continuous
distribution given the sample size, confidence level, coverage type
(β-content versus β-expectation), and ranks of the order statistics
used for the interval.
tolIntNparCoverage(n, conf.level = 0.95, cov.type = "content", ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), ti.type = "two.sided")
n |
vector of positive integers specifying the sample sizes.
Missing ( |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the tolerance interval. |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ltl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
If the arguments n
, conf.level
, ltl.rank
, and
n.plus.one.minus.utl.rank
are not all the same length, they are replicated to be the
same length as the length of the longest argument.
The help file for tolIntNpar
explains how nonparametric β-content
tolerance intervals are constructed and how the coverage
associated with the tolerance interval is computed based on specified values
for the sample size, the confidence level, and the ranks of the order statistics used for
the bounds of the tolerance interval.
vector of values between 0 and 1 indicating the coverage associated with the specified nonparametric tolerance interval.
See the help file for tolIntNpar
.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, coverage, and confidence level if one of the objectives of
the sampling program is to produce tolerance intervals. The functions
tolIntNparN
, tolIntNparConfLevel
, tolIntNparCoverage
, and
plotTolIntNparDesign
can be used to investigate these relationships for
constructing nonparametric tolerance intervals.
Steven P. Millard ([email protected])
See the help file for tolIntNpar
.
tolIntNpar
, tolIntNparN
, tolIntNparConfLevel
,
plotTolIntNparDesign
.
# Look at how the coverage of a nonparametric tolerance interval increases with # increasing sample size: seq(10, 60, by=10) #[1] 10 20 30 40 50 60 round(tolIntNparCoverage(n = seq(10, 60, by = 10)), 2) #[1] 0.61 0.78 0.85 0.89 0.91 0.92 #--------- # Look at how the coverage of a nonparametric tolerance interval decreases with # increasing confidence level: seq(0.5, 0.9, by=0.1) #[1] 0.5 0.6 0.7 0.8 0.9 round(tolIntNparCoverage(n = 10, conf.level = seq(0.5, 0.9, by = 0.1)), 2) #[1] 0.84 0.81 0.77 0.73 0.66 #---------- # Look at how the coverage of a nonparametric tolerance interval decreases with # the rank of the lower tolerance limit: round(tolIntNparCoverage(n = 60, ltl.rank = 1:5), 2) #[1] 0.92 0.90 0.88 0.85 0.83 #========== # Example 17-4 on page 17-21 of USEPA (2009) uses copper concentrations (ppb) from 3 # background wells to set an upper limit for 2 compliance wells. The maximum value from # the 3 wells is set to the 95% confidence upper tolerance limit, and we need to # determine the coverage of this tolerance interval. tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper") #[1] 0.8826538
Compute the sample size necessary for a nonparametric tolerance interval (for a continuous
distribution) with a specified coverage and, in the case of a β-content tolerance
interval, a specified confidence level, given the ranks of the order statistics used for the
interval.
tolIntNparN(coverage = 0.95, conf.level = 0.95, cov.type = "content", ltl.rank = ifelse(ti.type == "upper", 0, 1), n.plus.one.minus.utl.rank = ifelse(ti.type == "lower", 0, 1), ti.type = "two.sided")
coverage |
numeric vector of values between 0 and 1 indicating the desired coverage of the tolerance interval. |
conf.level |
numeric vector of values between 0 and 1 indicating the confidence level of the tolerance interval. |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ltl.rank |
vector of positive integers indicating the rank of the order statistic to use for the lower bound
of the tolerance interval. If |
n.plus.one.minus.utl.rank |
vector of positive integers related to the rank of the order statistic to use for
the upper bound of the tolerance interval. A value of
|
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
If the arguments coverage
, conf.level
, ltl.rank
, and
n.plus.one.minus.utl.rank
are not all the same length, they are replicated to be the
same length as the length of the longest argument.
The help file for tolIntNpar
explains how nonparametric tolerance intervals
are constructed.
Computing Required Sample Size for a β-Content Tolerance Interval (cov.type="content")

For a β-content tolerance interval, if the coverage C = β is specified, then the
associated confidence level 100(1-α)% is computed as:

1 - α = 1 - F(β; v-u, w+u)

where F(y; δ, γ) denotes the cumulative distribution function of a
beta random variable with parameters shape1=δ and shape2=γ
evaluated at y. The value of 1-α is determined by
the argument conf.level. The value of β is determined by the argument
coverage. The value of u is determined by the argument ltl.rank. The value
of w is determined by the argument n.plus.one.minus.utl.rank. Once these values
have been determined, the above equation can be solved implicitly for n, since

v = n + 1 - w
Computing Required Sample Size for a β-Expectation Tolerance Interval (cov.type="expectation")

For a β-expectation tolerance interval, the expected coverage is simply the mean of a
beta random variable with parameters shape1 = v - u and shape2 = w + u,
which is given by:

E(C) = (v - u) / (n + 1)

or, using Equation (2) above, we can re-write the formula for the expected coverage as:

E(C) = (n + 1 - w - u) / (n + 1)

Thus, for user-specified values of u (ltl.rank), w (n.plus.one.minus.utl.rank),
and expected coverage β, the required sample size is computed as:

n = ceiling{ [(u + w) / (1 - β)] - 1 }

where ceiling(x) denotes the smallest integer greater than or equal to x.
(See the R help file for ceiling.)
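Both cases can be reproduced by hand for an upper limit set at the sample maximum (u = 0, w = 1); this is only an illustrative sketch of the computation, with the β-content search written as a simple grid scan rather than the implicit solution used by tolIntNparN:

beta <- 0.95; conf <- 0.95
u <- 0; w <- 1
# beta-content case: smallest n whose achieved confidence level is at least conf
n.cand <- 2:200
conf.achieved <- 1 - pbeta(beta, shape1 = n.cand + 1 - w - u, shape2 = w + u)
min(n.cand[conf.achieved >= conf])
#[1] 59   # same as tolIntNparN(coverage = 0.95, conf.level = 0.95, ti.type = "upper")
# beta-expectation case: closed-form ceiling formula from above
ceiling((u + w) / (1 - beta) - 1)
#[1] 19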
A vector of positive integers indicating the required sample size(s) for the specified nonparametric tolerance interval(s).
See the help file for tolIntNpar
.
In the course of designing a sampling program, an environmental scientist may wish to determine
the relationship between sample size, coverage, and confidence level if one of the objectives of
the sampling program is to produce tolerance intervals. The functions
tolIntNparN
, tolIntNparCoverage
, tolIntNparConfLevel
, and
plotTolIntNparDesign
can be used to investigate these relationships for
constructing nonparametric tolerance intervals.
Steven P. Millard ([email protected])
See the help file for tolIntNpar
.
tolIntNpar
, tolIntNparConfLevel
, tolIntNparCoverage
,
plotTolIntNparDesign
.
# Look at how the required sample size for a nonparametric tolerance interval increases # with increasing confidence level: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 tolIntNparN(conf.level = seq(0.5, 0.9, by = 0.1)) #[1] 34 40 49 59 77 #---------- # Look at how the required sample size for a nonparametric tolerance interval increases # with increasing coverage: tolIntNparN(coverage = seq(0.5, 0.9, by = 0.1)) #[1] 8 10 14 22 46 #---------- # Look at how the required sample size for a nonparametric tolerance interval increases # with the rank of the lower tolerance limit: tolIntNparN(ltl.rank = 1:5) #[1] 93 124 153 181 208 #========== # Example 17-4 on page 17-21 of USEPA (2009) uses copper concentrations (ppb) from 3 # background wells to set an upper limit for 2 compliance wells. The maximum value from # the 3 wells is set to the 95% confidence upper tolerance limit, and we need to # determine the coverage of this tolerance interval. tolIntNparCoverage(n = 24, conf.level = 0.95, ti.type = "upper") #[1] 0.8826538 # Here we will modify the example and determine the sample size required to produce # a tolerance interval with 95% confidence level AND 95% coverage. tolIntNparN(coverage = 0.95, conf.level = 0.95, ti.type = "upper") #[1] 59
Construct a β-content or β-expectation tolerance
interval for a Poisson distribution.
tolIntPois(x, coverage = 0.95, cov.type = "content", ti.type = "two-sided", conf.level = 0.95)
x |
numeric vector of observations, or an object resulting from a call to an
estimating function that assumes a Poisson distribution
(i.e., |
coverage |
a scalar between 0 and 1 indicating the desired coverage of the tolerance interval.
The default value is |
cov.type |
character string specifying the coverage type for the tolerance interval.
The possible values are |
ti.type |
character string indicating what kind of tolerance interval to compute.
The possible values are |
conf.level |
a scalar between 0 and 1 indicating the confidence level associated with the tolerance
interval. The default value is |
If x
contains any missing (NA
), undefined (NaN
) or
infinite (Inf
, -Inf
) values, they will be removed prior to
performing the estimation.
A tolerance interval for some population is an interval on the real line constructed so as to
contain 100β% of the population (i.e., 100β% of all
future observations), where 0 < β < 1. The quantity 100β% is called
the coverage.

There are two kinds of tolerance intervals (Guttman, 1970):

A β-content tolerance interval with confidence level 100(1-α)% is
constructed so that it contains at least 100β% of the population (i.e., the
coverage is at least 100β%) with probability 100(1-α)%, where
0 < α < 1. The quantity 100(1-α)% is called the confidence level or
confidence coefficient associated with the tolerance interval.

A β-expectation tolerance interval is constructed so that the average coverage of
the interval is 100β%.

Note: A β-expectation tolerance interval with coverage 100β% is
equivalent to a prediction interval for one future observation with associated confidence level
100β%. Note that there is no explicit confidence level associated with a
β-expectation tolerance interval. If a β-expectation tolerance interval is
treated as a β-content tolerance interval, the confidence level associated with this
tolerance interval is usually around 50% (e.g., Guttman, 1970, Table 4.2, p.76).
Because of the discrete nature of the Poisson distribution,
even true tolerance intervals (i.e., tolerance intervals based on the true value of
λ) will usually not contain exactly 100β% of the population.
For example, for the Poisson distribution with parameter lambda=2, the
interval [0, 4] contains 94.7% of this distribution and the interval [0, 5]
contains 98.3% of this distribution. Thus, no interval can contain exactly 95%
of this distribution.
β-Content Tolerance Intervals for a Poisson Distribution

Zacks (1970) showed that for monotone likelihood ratio (MLR) families of discrete
distributions, a uniformly most accurate upper 100β%-content
tolerance interval with associated confidence level 100(1-α)% is
constructed by finding the upper 100(1-α)% confidence limit for the
parameter associated with the distribution, and then computing the β'th
quantile of the distribution assuming the true value of the parameter is equal to
the upper confidence limit. This idea can be extended to one-sided lower and
two-sided tolerance limits.

It can be shown that all distributions that are one parameter exponential families have the MLR property, and the Poisson distribution is a one-parameter exponential family, so the method of Zacks (1970) can be applied to a Poisson distribution.

Let X denote a Poisson random variable with parameter lambda=λ. Let
x_p(λ) denote the p'th quantile of this distribution. That is,

Pr[X <= x_p(λ)] >= p

Note that due to the discrete nature of the Poisson distribution, there will be
several values of p associated with one value of x. For example, for
λ = 2, the value 1 is the p'th quantile for any value of p
between 0.140 and 0.406.

Let x = (x_1, x_2, ..., x_n) denote a vector of n observations from a
Poisson distribution with parameter lambda=λ.
When ti.type="upper", the first step is to compute the one-sided upper
100(1-α)% confidence limit for λ based on the observations x
(see the help file for epois). Denote this upper
confidence limit by UCL. The one-sided upper 100β% tolerance limit
is then given by:

[0, x_β(UCL)]

Similarly, when ti.type="lower", the first step is to compute the one-sided
lower 100(1-α)% confidence limit for λ based on the n
observations x. Denote this lower confidence limit by LCL.
The one-sided lower 100β% tolerance limit is then given by:

[x_(1-β)(LCL), ∞)

Finally, when ti.type="two-sided", the first step is to compute the two-sided
100(1-α)% confidence limits for λ based on the n
observations x. Denote these confidence limits by LCL and UCL.
The two-sided 100β% tolerance limit is then given by:

[x_((1-β)/2)(LCL), x_((1+β)/2)(UCL)]

Note that the function tolIntPois uses the exact confidence limits for
λ when computing β-content tolerance limits (see epois).
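The two-sided example in the Examples section below can be checked by hand using these formulas. The sketch below is illustrative only; it assumes that the exact confidence limits for λ are the standard chi-square based (Garwood) limits, an assumption made here rather than stated in this help file.

# Hand check of the two-sided tolerance interval from the example below
# (n = 20 observations, sum(x) = 36, two-sided 90% confidence limits for
# lambda, 95% coverage).
set.seed(250)
dat <- rpois(20, 2)
n <- length(dat)
S <- sum(dat)
# Chi-square based (Garwood) exact two-sided 90% confidence limits for lambda:
LCL <- qchisq(0.05, 2 * S) / (2 * n)
UCL <- qchisq(0.95, 2 * (S + 1)) / (2 * n)
# Two-sided 95% beta-content tolerance limits:
c(LTL = qpois((1 - 0.95) / 2, LCL), UTL = qpois((1 + 0.95) / 2, UCL))
#LTL UTL
#  0   6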
β-Expectation Tolerance Intervals for a Poisson Distribution

As stated above, a β-expectation tolerance interval with coverage
100β% is equivalent to a prediction interval for one future observation
with associated confidence level 100β%. This is because the probability
that any single future observation will fall into this interval is 100β%,
so the distribution of the number of N future observations that will fall into
this interval is binomial with parameters size=N and
prob=β. Hence the expected proportion of
future observations that fall into this interval is 100β% and is
independent of the value of N. See the help file for
predIntPois
for information on how these intervals are constructed.
If x
is a numeric vector, tolIntPois
returns a list of class
"estimate"
containing the estimated parameters, a component called
interval
containing the tolerance interval information, and other
information. See estimate.object
for details.
If x
is the result of calling an estimation function, tolIntPois
returns a list whose class is the same as x
. The list contains the same
components as x
. If x
already has a component called
interval
, this component is replaced with the tolerance interval
information.
Tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Meeker, 1991; Krishnamoorthy and Mathew, 2009). References that discuss tolerance intervals in the context of environmental monitoring include: Berthouex and Brown (2002, Chapter 21), Gibbons et al. (2009), Millard and Neerchal (2001, Chapter 6), Singh et al. (2010b), and USEPA (2009).
Gibbons (1987b) used the Poisson distribution to model the number of detected
compounds per scan of the 32 volatile organic priority pollutants (VOC), and
also to model the distribution of chemical concentration (in ppb). He explained
the derivation of a one-sided upper β-content tolerance limit for a
Poisson distribution based on the work of Zacks (1970), using the Pearson-Hartley
approximation to the confidence limits for the mean parameter λ
(see the help file for epois). Note that there are several
typographical errors in the derivation and examples on page 575 of Gibbons (1987b)
because there is confusion between where the value of β (the coverage)
should be and where the value of 1-α (the confidence level) should be.
Gibbons et al. (2009, pp.103-104) gives correct formulas.
Steven P. Millard ([email protected])
Gibbons, R.D. (1987b). Statistical Models for the Analysis of Volatile Organic Compounds in Waste Disposal Sites. Ground Water 25, 572–580.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Guttman, I. (1970). Statistical Tolerance Regions: Classical and Bayesian. Hafner Publishing Co., Darien, CT.
Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.
Johnson, N. L., S. Kotz, and A. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, Chapter 4.
Krishnamoorthy K., and T. Mathew. (2009). Statistical Tolerance Regions: Theory, Applications, and Computation. John Wiley and Sons, Hoboken.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton.
Zacks, S. (1970). Uniformly Most Accurate Upper Tolerance Limits for Monotone Likelihood Ratio Families of Discrete Distributions. Journal of the American Statistical Association 65, 307–316.
Poisson
, epois
, eqpois
,
estimate.object
, Tolerance Intervals,
Estimating Distribution Parameters, Estimating Distribution Quantiles.
# Generate 20 observations from a Poisson distribution with parameter # lambda=2. The interval [0, 4] contains 94.7% of this distribution and # the interval [0,5] contains 98.3% of this distribution. Thus, because # of the discrete nature of the Poisson distribution, no interval contains # exactly 95% of this distribution. Use tolIntPois to estimate the mean # parameter of the true distribution, and construct a one-sided upper 95% # beta-content tolerance interval with associated confidence level 90%. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rpois(20, 2) tolIntPois(dat, conf.level = 0.9) #Results of Distribution Parameter Estimation #-------------------------------------------- # #Assumed Distribution: Poisson # #Estimated Parameter(s): lambda = 1.8 # #Estimation Method: mle/mme/mvue # #Data: dat # #Sample Size: 20 # #Tolerance Interval Coverage: 95% # #Coverage Type: content # #Tolerance Interval Method: Zacks # #Tolerance Interval Type: two-sided # #Confidence Level: 90% # #Tolerance Interval: LTL = 0 # UTL = 6 #------ # Clean up rm(dat)
Monthly estimated total phosphorus mass (mg) within a water column at two different stations for the 5-year time period October 1984 to September 1989 from a study on phosphorus concentration conducted in the Chesapeake Bay.
Total.P.df
A data frame with 60 observations on the following 4 variables.
CB3.1
a numeric vector of phosphorus concentrations at station CB3.1
CB3.3e
a numeric vector of phosphorus concentrations at station CB3.3e
Month
a factor indicating the month the observation was taken
Year
a numeric vector indicating the year an observation was taken
Neerchal, N. K., and S. L. Brunenmeister. (1993). Estimation of Trend in Chesapeake Bay Water Quality Data. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 6: Multivariate Environmental Statistics. North-Holland, Amsterdam, Chapter 19, 407-422.
Density, distribution function, quantile function, and random generation
for the triangular distribution with parameters min
, max
,
and mode
.
dtri(x, min = 0, max = 1, mode = 1/2) ptri(q, min = 0, max = 1, mode = 1/2) qtri(p, min = 0, max = 1, mode = 1/2) rtri(n, min = 0, max = 1, mode = 1/2)
x |
vector of quantiles. Missing values ( |
q |
vector of quantiles. Missing values ( |
p |
vector of probabilities between 0 and 1. Missing values ( |
n |
sample size. If |
min |
vector of minimum values of the distribution of the random variable.
The default value is |
max |
vector of maximum values of the random variable.
The default value is |
mode |
vector of modes of the random variable.
The default value is |
Let X be a triangular random variable with parameters min=a,
max=b, and mode=c.

Probability Density and Cumulative Distribution Function

The density function of X is given by:

f(x; a, b, c) = 2(x - a) / [(b - a)(c - a)]    for a <= x <= c

f(x; a, b, c) = 2(b - x) / [(b - a)(b - c)]    for c <= x <= b

where a < c < b.

The cumulative distribution function of X is given by:

F(x; a, b, c) = (x - a)^2 / [(b - a)(c - a)]        for a <= x <= c

F(x; a, b, c) = 1 - (b - x)^2 / [(b - a)(b - c)]    for c <= x <= b

where a < c < b.

Quantiles

The p'th quantile of X is given by:

x_p = a + sqrt[(b - a)(c - a) p]           for 0 <= p <= (c - a)/(b - a)

x_p = b - sqrt[(b - a)(b - c)(1 - p)]      for (c - a)/(b - a) <= p <= 1

where 0 <= p <= 1.

Random Numbers

Random numbers are generated using the inverse transformation method:

x = F^(-1)(u)

where u is a random deviate from a uniform [0, 1] distribution.

Mean and Variance

The mean and variance of X are given by:

E(X) = (a + b + c) / 3

Var(X) = (a^2 + b^2 + c^2 - ab - ac - bc) / 18
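As a quick numerical check of the quantile formula (illustrative only), the 25'th percentile in the qtri() example further down can be computed by hand:

# 25'th percentile of a triangular distribution with min=1, max=4, mode=3.
# p = 0.25 is less than (mode - min)/(max - min) = 2/3, so use the lower branch:
a <- 1; b <- 4; m <- 3   # m denotes the mode
p <- 0.25
a + sqrt((b - a) * (m - a) * p)
#[1] 2.224745   # same value returned by qtri(0.25, 1, 4, 3) in the Examples section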
dtri
gives the density, ptri
gives the distribution function,
qtri
gives the quantile function, and rtri
generates random
deviates.
The triangular distribution is so named because of the shape of its probability
density function. The average of two independent identically distributed
uniform random variables with parameters min=a and max=b
has a triangular distribution with parameters min=a, max=b,
and mode=(a + b)/2.
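A small simulation illustrating this property (illustrative only; the seed below is arbitrary):

# The average of two iid Uniform(0, 1) random variables has a
# Triangular(min = 0, max = 1, mode = 1/2) distribution.
set.seed(47)
u.bar <- (runif(10000) + runif(10000)) / 2
# Compare the empirical CDF with ptri() at a few points:
cbind(empirical   = ecdf(u.bar)(c(0.25, 0.5, 0.75)),
      theoretical = ptri(c(0.25, 0.5, 0.75), min = 0, max = 1, mode = 1/2))
# The two columns should agree to within sampling error
# (theoretical values are 0.125, 0.5, and 0.875).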
The triangular distribution is sometimes used as an input distribution in probability risk assessment.
Steven P. Millard ([email protected])
Forbes, C., M. Evans, N. Hastings, and B. Peacock. (2011). Statistical Distributions. Fourth Edition. John Wiley and Sons, Hoboken, NJ.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York.
Uniform, Probability Distributions and Random Numbers.
# Density of a triangular distribution with parameters # min=10, max=15, and mode=12, evaluated at 12, 13 and 14: dtri(12:14, 10, 15, 12) #[1] 0.4000000 0.2666667 0.1333333 #---------- # The cdf of a triangular distribution with parameters # min=2, max=7, and mode=5, evaluated at 3, 4, and 5: ptri(3:5, 2, 7, 5) #[1] 0.06666667 0.26666667 0.60000000 #---------- # The 25'th percentile of a triangular distribution with parameters # min=1, max=4, and mode=3: qtri(0.25, 1, 4, 3) #[1] 2.224745 #---------- # A random sample of 4 numbers from a triangular distribution with # parameters min=3 , max=20, and mode=12. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(10) rtri(4, 3, 20, 12) #[1] 11.811593 9.850955 11.081885 13.539496
Compute the Type I Error level necessary to achieve a specified power for a one- or two-sample t-test, given the sample size(s) and scaled difference.
tTestAlpha(n.or.n1, n2 = n.or.n1, delta.over.sigma = 0, power = 0.95, sample.type = ifelse(!missing(n2) && !is.null(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE, tol = 1e-07, maxiter = 1000)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
|
delta.over.sigma |
numeric vector specifying the ratio of the true difference ( |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
tol |
numeric scalar indicating the tolerance argument to pass to the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
Formulas for the power of the t-test for specified values of
the sample size, scaled difference, and Type I error level are given in
the help file for tTestPower
. The function tTestAlpha
uses the uniroot
search algorithm to determine the
required Type I error level for specified values of the sample size, power,
and scaled difference.
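For the two-sample case, the result can be cross-checked against the base R function power.t.test, which solves for whichever single argument is passed as NULL. This is offered only as an independent sanity check, not as the internal implementation of tTestAlpha:

# Type I error level giving 90% power with n = 20 per group and a
# scaled difference (delta/sigma) of 1:
power.t.test(n = 20, delta = 1, sd = 1, sig.level = NULL, power = 0.9,
  type = "two.sample", alternative = "two.sided")$sig.level
# Approximately 0.07, in agreement with the tTestAlpha() example below.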
numeric vector of Type I error levels.
See tTestPower
.
Steven P. Millard ([email protected])
See tTestPower
.
tTestPower
, tTestScaledMdd
,
tTestN
,
plotTTestDesign
, Normal,
t.test
, Hypothesis Tests.
# Look at how the required Type I error level for the one-sample t-test # decreases with increasing sample size. Set the power to 80% and # the scaled difference to 0.5. seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 alpha <- tTestAlpha(n.or.n1 = seq(5, 30, by = 5), power = 0.8, delta.over.sigma = 0.5) round(alpha, 2) #[1] 0.65 0.45 0.29 0.18 0.11 0.07 #---------- # Repeat the last example, but use the approximation. # Note how the approximation underestimates the power # for the smaller sample sizes. #---------------------------------------------------- alpha <- tTestAlpha(n.or.n1 = seq(5, 30, by = 5), power = 0.8, delta.over.sigma = 0.5, approx = TRUE) round(alpha, 2) #[1] 0.63 0.46 0.30 0.18 0.11 0.07 #---------- # Look at how the required Type I error level for the two-sample # t-test decreases with increasing scaled difference. Use # a power of 90% and a sample size of 10 in each group. seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 alpha <- tTestAlpha(10, sample.type = "two.sample", power = 0.9, delta.over.sigma = seq(0.5, 2, by = 0.5)) round(alpha, 2) #[1] 0.82 0.35 0.06 0.01 #---------- # Look at how the required Type I error level for the two-sample # t-test increases with increasing values of required power. Use # a sample size of 20 for each group and a scaled difference of # 1. alpha <- tTestAlpha(20, sample.type = "two.sample", delta.over.sigma = 1, power = c(0.8, 0.9, 0.95)) round(alpha, 2) #[1] 0.03 0.07 0.14 #---------- # Clean up #--------- rm(alpha)
Compute the sample size necessary to achieve a specified power for a one- or two-sample t-test, given the ratio of means, coefficient of variation, and significance level, assuming lognormal data.
tTestLnormAltN(ratio.of.means, cv = 1, alpha = 0.05, power = 0.95, sample.type = ifelse(!is.null(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE, n2 = NULL, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
ratio.of.means |
numeric vector specifying the ratio of the first mean to the second mean.
When |
cv |
numeric vector of positive value(s) specifying the coefficient of
variation. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
n2 |
numeric vector of sample sizes for group 2. The default value is
|
round.up |
logical scalar indicating whether to round up the values of the computed
sample size(s) to the next largest integer. The default value is
|
n.max |
positive integer greater than 1 indicating the maximum sample size when |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
If the arguments ratio.of.means
, cv
, alpha
, power
, and
n2
are not all the same length, they are replicated to be the same length as
the length of the longest argument.
Formulas for the power of the t-test for lognormal data for specified values of
the sample size, ratio of means, and Type I error level are given in
the help file for tTestLnormAltPower
. The function
tTestLnormAltN
uses the uniroot
search algorithm to determine
the required sample size(s) for specified values of the power,
ratio of means, and Type I error level.
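As a quick sanity check (a sketch with arbitrary design values, not one of the package's documented examples), the returned sample size should be the smallest n for which tTestLnormAltPower() meets the requested power:

n <- tTestLnormAltN(ratio.of.means = 1.5, cv = 1, power = 0.9)
n                                                             # 47, per the Examples below
tTestLnormAltPower(n.or.n1 = n,     ratio.of.means = 1.5, cv = 1)  # should be >= 0.9
tTestLnormAltPower(n.or.n1 = n - 1, ratio.of.means = 1.5, cv = 1)  # should be  < 0.9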
When sample.type="one.sample"
, or sample.type="two.sample"
and n2
is not supplied (so equal sample sizes for each group are
assumed), tTestLnormAltN
returns a numeric vector of sample sizes. When
sample.type="two.sample"
and n2
is supplied,
tTestLnormAltN
returns a list with two components called n1
and
n2
, specifying the sample sizes for each group.
See tTestLnormAltPower
.
Steven P. Millard ([email protected])
See tTestLnormAltPower
.
tTestLnormAltPower
, tTestLnormAltRatioOfMeans
,
plotTTestLnormAltDesign
, LognormalAlt,
t.test
, Hypothesis Tests.
# Look at how the required sample size for the one-sample test increases with # increasing required power: seq(0.5, 0.9, by = 0.1) # [1] 0.5 0.6 0.7 0.8 0.9 tTestLnormAltN(ratio.of.means = 1.5, power = seq(0.5, 0.9, by = 0.1)) # [1] 19 23 28 36 47 #---------- # Repeat the last example, but compute the sample size based on the approximate # power instead of the exact power: tTestLnormAltN(ratio.of.means = 1.5, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) # [1] 19 23 29 36 47 #========== # Look at how the required sample size for the two-sample t-test decreases with # increasing ratio of means: seq(1.5, 2, by = 0.1) #[1] 1.5 1.6 1.7 1.8 1.9 2.0 tTestLnormAltN(ratio.of.means = seq(1.5, 2, by = 0.1), sample.type = "two") #[1] 111 83 65 54 45 39 #---------- # Look at how the required sample size for the two-sample t-test decreases with # increasing values of Type I error: tTestLnormAltN(ratio.of.means = 1.5, alpha = c(0.001, 0.01, 0.05, 0.1), sample.type = "two") #[1] 209 152 111 92 #---------- # For the two-sample t-test, compare the total sample size required to detect a # ratio of means of 2 for equal sample sizes versus the case when the sample size # for the second group is constrained to be 30. Assume a coefficient of variation # of 1, a 5% significance level, and 95% power. Note that for the case of equal # sample sizes, a total of 78 samples (39+39) are required, whereas when n2 is # constrained to be 30, a total of 84 samples (54 + 30) are required. tTestLnormAltN(ratio.of.means = 2, sample.type = "two") #[1] 39 tTestLnormAltN(ratio.of.means = 2, n2 = 30) #$n1: #[1] 54 # #$n2: #[1] 30 #========== # The guidance document Soil Screening Guidance: Technical Background Document # (USEPA, 1996c, Part 4) discusses sampling design and sample size calculations # for studies to determine whether the soil at a potentially contaminated site # needs to be investigated for possible remedial action. Let 'theta' denote the # average concentration of the chemical of concern. The guidance document # establishes the following goals for the decision rule (USEPA, 1996c, p.87): # # Pr[Decide Don't Investigate | theta > 2 * SSL] = 0.05 # # Pr[Decide to Investigate | theta <= (SSL/2)] = 0.2 # # where SSL denotes the pre-established soil screening level. # # These goals translate into a Type I error of 0.2 for the null hypothesis # # H0: [theta / (SSL/2)] <= 1 # # and a power of 95% for the specific alternative hypothesis # # Ha: [theta / (SSL/2)] = 4 # # Assuming a lognormal distribution and the above values for Type I error and # power, determine the required samples sizes associated with various values of # the coefficient of variation for the one-sample test. Based on these calculations, # you need to take at least 6 soil samples to satisfy the requirements for the # Type I and Type II errors when the coefficient of variation is 2. cv <- c(0.5, 1, 2) N <- tTestLnormAltN(ratio.of.means = 4, cv = cv, alpha = 0.2, alternative = "greater") names(N) <- paste("CV=", cv, sep = "") N #CV=0.5 CV=1 CV=2 # 2 3 6 #---------- # Repeat the last example, but use the approximate power calculation instead of the # exact. Using the approximate power calculation, you need 7 soil samples when the # coefficient of variation is 2 (because the approximation underestimates the # true power). 
N <- tTestLnormAltN(ratio.of.means = 4, cv = cv, alpha = 0.2, alternative = "greater", approx = TRUE) names(N) <- paste("CV=", cv, sep = "") N #CV=0.5 CV=1 CV=2 # 3 5 7 #---------- # Repeat the last example, but use a Type I error of 0.05. N <- tTestLnormAltN(ratio.of.means = 4, cv = cv, alternative = "greater", approx = TRUE) names(N) <- paste("CV=", cv, sep = "") N #CV=0.5 CV=1 CV=2 # 4 6 12 #========== # Reproduce the second column of Table 2 in van Belle and Martin (1993, p.167). tTestLnormAltN(ratio.of.means = 1.10, cv = seq(0.1, 0.8, by = 0.1), power = 0.8, sample.type = "two.sample", approx = TRUE) #[1] 19 69 150 258 387 533 691 856 #========== # Clean up #--------- rm(cv, N)
Compute the power of a one- or two-sample t-test, given the sample size, ratio of means, coefficient of variation, and significance level, assuming lognormal data.
tTestLnormAltPower(n.or.n1, n2 = n.or.n1, ratio.of.means = 1, cv = 1, alpha = 0.05, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
|
ratio.of.means |
numeric vector specifying the ratio of the first mean to the second mean.
When |
cv |
numeric vector of positive value(s) specifying the coefficient of
variation. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
If the arguments n.or.n1
, n2
, ratio.of.means
, cv
, and
alpha
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
One-Sample Case (sample.type="one.sample"
)
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ observations from a lognormal distribution with mean $\theta$ and coefficient of variation $\tau$, and consider the null hypothesis:

$$H_0: \theta = \theta_0 \;\;\;\;\;\; (1)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \theta > \theta_0 \;\;\;\;\;\; (2)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \theta < \theta_0 \;\;\;\;\;\; (3)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \theta \ne \theta_0 \;\;\;\;\;\; (4)$$
To test the null hypothesis (1) versus any of the three alternatives (2)-(4), one might be tempted to use Student's t-test based on the log-transformed observations. Unlike the two-sample case with equal coefficients of variation (see below), in the one-sample case Student's t-test applied to the log-transformed observations will not test the correct hypothesis, as now explained.
Let

$$y_i = \log(x_i), \;\; i = 1, 2, \ldots, n \;\;\;\;\;\; (5)$$

Then $y_1, y_2, \ldots, y_n$ denote $n$ observations from a normal distribution with mean $\mu$ and standard deviation $\sigma$, where

$$\mu = \log\left(\frac{\theta}{\sqrt{\tau^2 + 1}}\right) \;\;\;\;\;\; (6)$$

$$\sigma = \sqrt{\log(\tau^2 + 1)} \;\;\;\;\;\; (7)$$

$$\theta = \exp\left(\mu + \frac{\sigma^2}{2}\right) \;\;\;\;\;\; (8)$$

$$\tau = \sqrt{\exp(\sigma^2) - 1} \;\;\;\;\;\; (9)$$

(see the help file for LognormalAlt). Hence, by Equations (6) and (8) above, the Student's t-test on the log-transformed data would involve a test of hypothesis on both the parameters $\theta$ and $\tau$, not just on $\theta$.
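The relationships in Equations (6)-(9) are easy to verify numerically; a minimal base-R sketch (the values of theta and tau are arbitrary):

theta <- 10; tau <- 2                   # lognormal mean and coefficient of variation
mu    <- log(theta / sqrt(tau^2 + 1))   # Equation (6)
sigma <- sqrt(log(tau^2 + 1))           # Equation (7)
exp(mu + sigma^2 / 2)                   # Equation (8): recovers theta = 10
sqrt(exp(sigma^2) - 1)                  # Equation (9): recovers tau = 2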
To test the null hypothesis (1) above versus any of the alternatives (2)-(4), you
can use the function elnormAlt
to compute a confidence interval for $\theta$,
and use the relationship between confidence intervals and hypothesis
tests. To test the null hypothesis (1) above versus the upper one-sided alternative
(2), you can also use
Chen's modified t-test for skewed distributions.
Although you can't use Student's t-test based on the log-transformed observations to test a hypothesis about $\theta$, you can use the t-distribution to estimate the power of a test about $\theta$ that is based on confidence intervals or Chen's modified t-test, if you are willing to assume the population coefficient of variation $\tau$ stays constant for all possible values of $\theta$ you are interested in, and you are willing to postulate possible values for $\theta$.
First, let's re-write the hypotheses (1)-(4) as follows. The null hypothesis (1) is equivalent to:

$$H_0: \frac{\theta}{\theta_0} = 1 \;\;\;\;\;\; (10)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \frac{\theta}{\theta_0} > 1 \;\;\;\;\;\; (11)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \frac{\theta}{\theta_0} < 1 \;\;\;\;\;\; (12)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \frac{\theta}{\theta_0} \ne 1 \;\;\;\;\;\; (13)$$
For a constant coefficient of variation $\tau$, the standard deviation $\sigma$ of the log-transformed observations is also constant (see Equation (7) above). Hence, by Equation (8), the ratio of the true mean to the hypothesized mean can be written as:

$$R = \frac{\theta}{\theta_0} = \frac{\exp(\mu + \sigma^2/2)}{\exp(\mu_0 + \sigma^2/2)} = e^{\mu - \mu_0} \;\;\;\;\;\; (14)$$

which only involves the difference

$$\mu - \mu_0 = \log(R) \;\;\;\;\;\; (15)$$

Thus, for given values of $R$ and $\tau$, the power of the test of the null hypothesis (10) against any of the alternatives (11)-(13) can be computed based on the power of a one-sample t-test with

$$\frac{\delta}{\sigma} = \frac{\mu - \mu_0}{\sigma} = \frac{\log(R)}{\sqrt{\log(\tau^2 + 1)}} \;\;\;\;\;\; (16)$$

(see the help file for tTestPower). Note that for the function tTestLnormAltPower, $R$ corresponds to the argument ratio.of.means, and $\tau$ corresponds to the argument cv.
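A minimal sketch of this equivalence (design values chosen arbitrarily, not one of the package's documented examples): the lognormal-case power should match tTestPower() evaluated at the scaled difference given by Equation (16).

R <- 2; cv <- 1; n <- 10
tTestLnormAltPower(n.or.n1 = n, ratio.of.means = R, cv = cv, alternative = "greater")
tTestPower(n.or.n1 = n, alternative = "greater",
  delta.over.sigma = log(R) / sqrt(log(cv^2 + 1)))
# The two calls should return essentially the same power.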
Two-Sample Case (sample.type="two.sample"
)
Let $x_{11}, x_{12}, \ldots, x_{1n_1}$ denote a vector of $n_1$ observations from a lognormal distribution with mean $\theta_1$ and coefficient of variation $\tau$, and let $x_{21}, x_{22}, \ldots, x_{2n_2}$ denote a vector of $n_2$ observations from a lognormal distribution with mean $\theta_2$ and coefficient of variation $\tau$, and consider the null hypothesis:

$$H_0: \theta_1 = \theta_2 \;\;\;\;\;\; (17)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \theta_1 > \theta_2 \;\;\;\;\;\; (18)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \theta_1 < \theta_2 \;\;\;\;\;\; (19)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \theta_1 \ne \theta_2 \;\;\;\;\;\; (20)$$
Because we are assuming the coefficient of variation is the same for
both populations, the test of the null hypothesis (17) versus any of the three
alternatives (18)-(20) can be based on the Student t-statistic using the
log-transformed observations.
To show this, first, let's re-write the hypotheses (17)-(20) as follows. The null hypothesis (17) is equivalent to:

$$H_0: \frac{\theta_1}{\theta_2} = 1 \;\;\;\;\;\; (21)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \frac{\theta_1}{\theta_2} > 1 \;\;\;\;\;\; (22)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \frac{\theta_1}{\theta_2} < 1 \;\;\;\;\;\; (23)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \frac{\theta_1}{\theta_2} \ne 1 \;\;\;\;\;\; (24)$$
If the coefficient of variation $\tau$ is the same for both populations, then the standard deviation $\sigma$ of the log-transformed observations is also the same for both populations (see Equation (7) above). Hence, by Equation (8), the ratio of the means can be written as:

$$R = \frac{\theta_1}{\theta_2} = \frac{\exp(\mu_1 + \sigma^2/2)}{\exp(\mu_2 + \sigma^2/2)} = e^{\mu_1 - \mu_2} \;\;\;\;\;\; (25)$$

which only involves the difference

$$\mu_1 - \mu_2 = \log(R) \;\;\;\;\;\; (26)$$

Thus, for given values of $R$ and $\tau$, the power of the test of the null hypothesis (21) against any of the alternatives (22)-(24) can be computed based on the power of a two-sample t-test with

$$\frac{\delta}{\sigma} = \frac{\mu_1 - \mu_2}{\sigma} = \frac{\log(R)}{\sqrt{\log(\tau^2 + 1)}} \;\;\;\;\;\; (27)$$

(see the help file for tTestPower). Note that for the function tTestLnormAltPower, $R$ corresponds to the argument ratio.of.means, and $\tau$ corresponds to the argument cv.
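The analogous two-sample check (again a sketch with arbitrary values), using the scaled difference from Equation (27):

R <- 1.5; cv <- 1
tTestLnormAltPower(n.or.n1 = 15, n2 = 15, ratio.of.means = R, cv = cv,
  sample.type = "two.sample")
tTestPower(n.or.n1 = 15, n2 = 15, sample.type = "two.sample",
  delta.over.sigma = log(R) / sqrt(log(cv^2 + 1)))
# Both calls should give the same power.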
a numeric vector of powers.
The normal distribution and
lognormal distribution are probably the two most
frequently used distributions to model environmental data. Often, you need to
determine whether a population mean is significantly different from a specified
standard (e.g., an MCL or ACL, USEPA, 1989b, Section 6), or whether two different
means are significantly different from each other (e.g., USEPA 2009, Chapter 16).
When you have lognormally-distributed data, you have to be careful about making
statements regarding inference for the mean. For the two-sample case with
assumed equal coefficients of variation, you can perform the
Student's t-test on the log-transformed observations.
For the one-sample case, you can perform a hypothesis test by constructing a
confidence interval for the mean using elnormAlt
, or use
Chen's t-test modified for skewed data.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether a mean differs from a specified level or two means differ from each other.
The functions tTestLnormAltPower
, tTestLnormAltN
,
tTestLnormAltRatioOfMeans
, and plotTTestLnormAltDesign
can be used to investigate these relationships for the case of
lognormally-distributed observations.
Steven P. Millard ([email protected])
van Belle, G., and D.C. Martin. (1993). Sample Size as a Function of Coefficient of Variation and Ratio of Means. The American Statistician 47(3), 165–167.
Also see the list of references in the help file for tTestPower
.
tTestLnormAltN
, tTestLnormAltRatioOfMeans
,
plotTTestLnormAltDesign
, LognormalAlt,
t.test
, Hypothesis Tests.
# Look at how the power of the one-sample test increases with increasing # sample size: seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 power <- tTestLnormAltPower(n.or.n1 = seq(5, 30, by = 5), ratio.of.means = 1.5, cv = 1) round(power, 2) #[1] 0.14 0.28 0.42 0.54 0.65 0.73 #---------- # Repeat the last example, but use the approximation to the power instead of the # exact power. Note how the approximation underestimates the true power for # the smaller sample sizes: power <- tTestLnormAltPower(n.or.n1 = seq(5, 30, by = 5), ratio.of.means = 1.5, cv = 1, approx = TRUE) round(power, 2) #[1] 0.09 0.25 0.40 0.53 0.64 0.73 #========== # Look at how the power of the two-sample t-test increases with increasing # ratio of means: power <- tTestLnormAltPower(n.or.n1 = 20, sample.type = "two", ratio.of.means = c(1.1, 1.5, 2), cv = 1) round(power, 2) #[1] 0.06 0.32 0.73 #---------- # Look at how the power of the two-sample t-test increases with increasing # values of Type I error: power <- tTestLnormAltPower(30, sample.type = "two", ratio.of.means = 1.5, cv = 1, alpha = c(0.001, 0.01, 0.05, 0.1)) round(power, 2) #[1] 0.07 0.23 0.46 0.59 #========== # The guidance document Soil Screening Guidance: Technical Background Document # (USEPA, 1996c, Part 4) discusses sampling design and sample size calculations # for studies to determine whether the soil at a potentially contaminated site # needs to be investigated for possible remedial action. Let 'theta' denote the # average concentration of the chemical of concern. The guidance document # establishes the following goals for the decision rule (USEPA, 1996c, p.87): # # Pr[Decide Don't Investigate | theta > 2 * SSL] = 0.05 # # Pr[Decide to Investigate | theta <= (SSL/2)] = 0.2 # # where SSL denotes the pre-established soil screening level. # # These goals translate into a Type I error of 0.2 for the null hypothesis # # H0: [theta / (SSL/2)] <= 1 # # and a power of 95% for the specific alternative hypothesis # # Ha: [theta / (SSL/2)] = 4 # # Assuming a lognormal distribution with a coefficient of variation of 2, # determine the power associated with various sample sizes for this one-sample test. # Based on these calculations, you need to take at least 6 soil samples to # satisfy the requirements for the Type I and Type II errors. power <- tTestLnormAltPower(n.or.n1 = 2:8, ratio.of.means = 4, cv = 2, alpha = 0.2, alternative = "greater") names(power) <- paste("N=", 2:8, sep = "") round(power, 2) # N=2 N=3 N=4 N=5 N=6 N=7 N=8 #0.65 0.80 0.88 0.93 0.96 0.97 0.98 #---------- # Repeat the last example, but use the approximate power calculation instead of # the exact one. Using the approximate power calculation, you need at least # 7 soil samples instead of 6 (because the approximation underestimates the power). power <- tTestLnormAltPower(n.or.n1 = 2:8, ratio.of.means = 4, cv = 2, alpha = 0.2, alternative = "greater", approx = TRUE) names(power) <- paste("N=", 2:8, sep = "") round(power, 2) # N=2 N=3 N=4 N=5 N=6 N=7 N=8 #0.55 0.75 0.84 0.90 0.93 0.95 0.97 #========== # Clean up #--------- rm(power)
Compute the minimal or maximal detectable ratio of means associated with a one- or two-sample t-test, given the sample size, coefficient of variation, significance level, and power, assuming lognormal data.
tTestLnormAltRatioOfMeans(n.or.n1, n2 = n.or.n1, cv = 1, alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, tol = 1e-07, maxiter = 1000)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
|
cv |
numeric vector of positive value(s) specifying the coefficient of
variation. When |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
two.sided.direction |
character string indicating the direction (greater than 1 or less than 1) for the
detectable ratio of means when |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
If the arguments n.or.n1
, n2
, cv
, alpha
, and
power
are not all the same length, they are replicated to be the same length
as the length of the longest argument.
Formulas for the power of the t-test for lognormal data for specified values of
the sample size, ratio of means, and Type I error level are given in
the help file for tTestLnormAltPower
. The function
tTestLnormAltRatioOfMeans
uses the uniroot
search algorithm
to determine the required ratio of means for specified values of the power,
sample size, and Type I error level.
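A round-trip sketch (arbitrary design values, not one of the package's documented examples): the ratio of means returned for a given design should reproduce the requested power when passed back to tTestLnormAltPower().

R <- tTestLnormAltRatioOfMeans(n.or.n1 = 20, cv = 1, power = 0.9)
R                                                             # about 1.89, per the Examples below
tTestLnormAltPower(n.or.n1 = 20, ratio.of.means = R, cv = 1)  # approximately 0.9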
a numeric vector of computed minimal or maximal detectable ratios of means. When alternative="less"
, or alternative="two.sided"
and
two.sided.direction="less"
, the computed ratios are less than 1
(but greater than 0). Otherwise, the ratios are greater than 1.
See tTestLnormAltPower
.
Steven P. Millard ([email protected])
See tTestLnormAltPower
.
tTestLnormAltPower
, tTestLnormAltN
,
plotTTestLnormAltDesign
, LognormalAlt,
t.test
, Hypothesis Tests.
# Look at how the minimal detectable ratio of means for the one-sample t-test # increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = 20, power = seq(0.5, 0.9, by = 0.1)) round(ratio.of.means, 2) #[1] 1.47 1.54 1.63 1.73 1.89 #---------- # Repeat the last example, but compute the minimal detectable ratio of means # based on the approximate power instead of the exact: ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = 20, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) round(ratio.of.means, 2) #[1] 1.48 1.55 1.63 1.73 1.89 #========== # Look at how the minimal detectable ratio of means for the two-sample t-test # decreases with increasing sample size: seq(10, 50, by = 10) #[1] 10 20 30 40 50 ratio.of.means <- tTestLnormAltRatioOfMeans(seq(10, 50, by = 10), sample.type="two") round(ratio.of.means, 2) #[1] 4.14 2.65 2.20 1.97 1.83 #---------- # Look at how the minimal detectable ratio of means for the two-sample t-test # decreases with increasing values of Type I error: ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = 20, alpha = c(0.001, 0.01, 0.05, 0.1), sample.type = "two") round(ratio.of.means, 2) #[1] 4.06 3.20 2.65 2.42 #========== # The guidance document Soil Screening Guidance: Technical Background Document # (USEPA, 1996c, Part 4) discusses sampling design and sample size calculations # for studies to determine whether the soil at a potentially contaminated site # needs to be investigated for possible remedial action. Let 'theta' denote the # average concentration of the chemical of concern. The guidance document # establishes the following goals for the decision rule (USEPA, 1996c, p.87): # # Pr[Decide Don't Investigate | theta > 2 * SSL] = 0.05 # # Pr[Decide to Investigate | theta <= (SSL/2)] = 0.2 # # where SSL denotes the pre-established soil screening level. # # These goals translate into a Type I error of 0.2 for the null hypothesis # # H0: [theta / (SSL/2)] <= 1 # # and a power of 95% for the specific alternative hypothesis # # Ha: [theta / (SSL/2)] = 4 # # Assuming a lognormal distribution, the above values for Type I and power, and a # coefficient of variation of 2, determine the minimal detectable increase above # the soil screening level associated with various sample sizes for the one-sample # test. Based on these calculations, you need to take at least 6 soil samples to # satisfy the requirements for the Type I and Type II errors when the coefficient # of variation is 2. N <- 2:8 ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = N, cv = 2, alpha = 0.2, alternative = "greater") names(ratio.of.means) <- paste("N=", N, sep = "") round(ratio.of.means, 1) # N=2 N=3 N=4 N=5 N=6 N=7 N=8 #19.9 7.7 5.4 4.4 3.8 3.4 3.1 #---------- # Repeat the last example, but use the approximate power calculation instead of # the exact. Using the approximate power calculation, you need 7 soil samples # when the coefficient of variation is 2. Note how poorly the approximation # works in this case for small sample sizes! ratio.of.means <- tTestLnormAltRatioOfMeans(n.or.n1 = N, cv = 2, alpha = 0.2, alternative = "greater", approx = TRUE) names(ratio.of.means) <- paste("N=", N, sep = "") round(ratio.of.means, 1) # N=2 N=3 N=4 N=5 N=6 N=7 N=8 #990.8 18.5 8.3 5.7 4.6 3.9 3.5 #========== # Clean up #--------- rm(ratio.of.means, N)
Compute the sample size necessary to achieve a specified power for a one- or two-sample t-test, given the scaled difference and significance level.
tTestN(delta.over.sigma, alpha = 0.05, power = 0.95, sample.type = ifelse(!is.null(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE, n2 = NULL, round.up = TRUE, n.max = 5000, tol = 1e-07, maxiter = 1000)
delta.over.sigma |
numeric vector specifying the ratio of the true difference |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are:
|
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
n2 |
numeric vector of sample sizes for group 2. The default value is
|
round.up |
logical scalar indicating whether to round up the values of the computed
sample size(s) to the next largest integer. The default value is
|
n.max |
positive integer greater than 1 indicating the maximum sample size when |
tol |
numeric scalar indicating the tolerance to use in the
|
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the |
Formulas for the power of the t-test for specified values of
the sample size, scaled difference, and Type I error level are given in
the help file for tTestPower
. The function tTestN
uses the uniroot
search algorithm to determine the
required sample size(s) for specified values of the power,
scaled difference, and Type I error level.
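To see the search criterion in action (a sketch, not the internal code; the design values are arbitrary), the returned sample size should be the smallest n whose power, from tTestPower(), reaches the target:

n <- tTestN(delta.over.sigma = 0.5, power = 0.9)
n                                                    # 44, per the Examples below
tTestPower(n.or.n1 = n,     delta.over.sigma = 0.5)  # should be >= 0.90
tTestPower(n.or.n1 = n - 1, delta.over.sigma = 0.5)  # should be  < 0.90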
When sample.type="one.sample"
, tTestN
returns a numeric vector of sample sizes.
When sample.type="two.sample"
and n2
is not supplied,
equal sample sizes for each group are assumed and tTestN
returns a numeric vector of
sample sizes indicating the required sample size for each group.
When sample.type="two.sample"
and n2
is supplied,
tTestN
returns a list with two components called n1
and
n2
, specifying the sample sizes for each group.
See tTestPower
.
Steven P. Millard ([email protected])
See tTestPower
.
tTestPower
, tTestScaledMdd
, tTestAlpha
,
plotTTestDesign
, Normal,
t.test
, Hypothesis Tests.
# Look at how the required sample size for the one-sample t-test # increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 tTestN(delta.over.sigma = 0.5, power = seq(0.5, 0.9, by = 0.1)) #[1] 18 22 27 34 44 #---------- # Repeat the last example, but compute the sample size based on the # approximation to the power instead of the exact method: tTestN(delta.over.sigma = 0.5, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) #[1] 18 22 27 34 45 #========== # Look at how the required sample size for the two-sample t-test # decreases with increasing scaled difference: seq(0.5, 2,by = 0.5) #[1] 0.5 1.0 1.5 2.0 tTestN(delta.over.sigma = seq(0.5, 2, by = 0.5), sample.type = "two") #[1] 105 27 13 8 #---------- # Look at how the required sample size for the two-sample t-test decreases # with increasing values of Type I error: tTestN(delta.over.sigma = 0.5, alpha = c(0.001, 0.01, 0.05, 0.1), sample.type="two") #[1] 198 145 105 88 #---------- # For the two-sample t-test, compare the total sample size required to # detect a scaled difference of 1 for equal sample sizes versus the case # when the sample size for the second group is constrained to be 20. # Assume a 5% significance level and 95% power. Note that for the case # of equal sample sizes, a total of 54 samples (27+27) are required, # whereas when n2 is constrained to be 20, a total of 62 samples # (42 + 20) are required. tTestN(1, sample.type="two") #[1] 27 tTestN(1, n2 = 20) #$n1 #[1] 42 # #$n2 #[1] 20 #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), determine the # required sample size to detect a mean aldicarb level greater than the MCL # of 7 ppb at the third compliance well with a power of 95%, assuming the # true mean is 10 or 14. Use the estimated standard deviation from the # first four months of data to estimate the true population standard # deviation, use a Type I error level of alpha=0.01, and assume an # upper one-sided alternative (third compliance well mean larger than 7). # (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) # Note that the required sample size changes from 11 to 5 as the true mean # increases from 10 to 14. EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #5 1 Well.2 23.7 #6 2 Well.2 21.9 #7 3 Well.2 26.9 #8 4 Well.2 26.1 #9 1 Well.3 5.6 #10 2 Well.3 3.3 #11 3 Well.3 2.3 #12 4 Well.3 6.9 sigma <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well == "Well.3"])) sigma #[1] 2.101388 tTestN(delta.over.sigma = (c(10, 14) - 7)/sigma, alpha = 0.01, sample.type="one", alternative="greater") #[1] 11 5 # Clean up #--------- rm(sigma)
Compute the power of a one- or two-sample t-test, given the sample size, scaled difference, and significance level.
tTestPower(n.or.n1, n2 = n.or.n1, delta.over.sigma = 0, alpha = 0.05, sample.type = ifelse(!missing(n2), "two.sample", "one.sample"), alternative = "two.sided", approx = FALSE)
n.or.n1 |
numeric vector of sample sizes. When |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
|
delta.over.sigma |
numeric vector specifying the ratio of the true difference The default value is |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is |
sample.type |
character string indicating whether to compute power based on a one-sample or
two-sample hypothesis test. When |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are:
|
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is |
If the arguments n.or.n1
, n2
, delta.over.sigma
, and
alpha
are not all the same length, they are replicated to be the same
length as the length of the longest argument.
One-Sample Case (sample.type="one.sample"
)
Let $\underline{x} = x_1, x_2, \ldots, x_n$ denote a vector of $n$ observations from a normal distribution with mean $\mu$ and standard deviation $\sigma$, and consider the null hypothesis:

$$H_0: \mu = \mu_0 \;\;\;\;\;\; (1)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \mu > \mu_0 \;\;\;\;\;\; (2)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \mu < \mu_0 \;\;\;\;\;\; (3)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \mu \ne \mu_0 \;\;\;\;\;\; (4)$$
The test of the null hypothesis (1) versus any of the three alternatives (2)-(4) is based on the Student t-statistic:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \;\;\;\;\;\; (5)$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\;\;\; (6)$$

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\;\;\; (7)$$

Under the null hypothesis (1), the t-statistic in (5) follows a Student's t-distribution with $n - 1$ degrees of freedom (Zar, 2010, Chapter 7; Johnson et al., 1995, pp.362-363).
The formula for the power of the test depends on which alternative is being tested.
The two subsections below describe exact and approximate formulas for the power of
the one-sample t-test. Note that none of the equations for the power of the t-test
requires knowledge of the values $\delta$ (Equation (12) below) or $\sigma$ (the population standard deviation), only the ratio $\delta/\sigma$. The argument delta.over.sigma is this ratio, and it is referred to as the “scaled difference”.
Exact Power Calculations (approx=FALSE
)
This subsection describes the exact formulas for the power of the one-sample
t-test.
Upper one-sided alternative (alternative="greater"
)
The standard Student's t-test rejects the null hypothesis (1) in favor of the upper alternative hypothesis (2) at level-$\alpha$ if

$$t \ge t_{\nu}(1 - \alpha) \;\;\;\;\;\; (8)$$

where

$$\nu = n - 1 \;\;\;\;\;\; (9)$$

and $t_{\nu}(p)$ denotes the $p$'th quantile of Student's t-distribution with $\nu$ degrees of freedom (Zar, 2010; Berthouex and Brown, 2002). The power of this test, denoted by $1 - \beta$, where $\beta$ denotes the probability of a Type II error, is given by:

$$1 - \beta = Pr[t_{\nu, \Delta} \ge t_{\nu}(1 - \alpha)] = 1 - G[t_{\nu}(1 - \alpha); \nu, \Delta] \;\;\;\;\;\; (10)$$

where

$$\Delta = \sqrt{n} \left(\frac{\delta}{\sigma}\right) \;\;\;\;\;\; (11)$$

$$\delta = \mu - \mu_0 \;\;\;\;\;\; (12)$$

and $t_{\nu, \Delta}$ denotes a non-central Student's t-random variable with $\nu$ degrees of freedom and non-centrality parameter $\Delta$, and $G(x; \nu, \Delta)$ denotes the cumulative distribution function of this random variable evaluated at $x$ (Johnson et al., 1995, pp.508-510).
Lower one-sided alternative (alternative="less"
)
The standard Student's t-test rejects the null hypothesis (1) in favor of the lower alternative hypothesis (3) at level-$\alpha$ if

$$t \le t_{\nu}(\alpha) \;\;\;\;\;\; (13)$$

and the power of this test is given by:

$$1 - \beta = Pr[t_{\nu, \Delta} \le t_{\nu}(\alpha)] = G[t_{\nu}(\alpha); \nu, \Delta] \;\;\;\;\;\; (14)$$
Two-sided alternative (alternative="two.sided"
)
The standard Student's t-test rejects the null hypothesis (1) in favor of the two-sided alternative hypothesis (4) at level-$\alpha$ if

$$|t| \ge t_{\nu}(1 - \alpha/2) \;\;\;\;\;\; (15)$$

and the power of this test is given by:

$$1 - \beta = G[t_{\nu}(\alpha/2); \nu, \Delta] + 1 - G[t_{\nu}(1 - \alpha/2); \nu, \Delta] \;\;\;\;\;\; (16)$$
The power of the t-test given in Equation (16) can also be expressed in terms of the cumulative distribution function of the non-central F-distribution as follows. Let $F_{\nu_1, \nu_2, \Delta}$ denote a non-central F random variable with $\nu_1$ and $\nu_2$ degrees of freedom and non-centrality parameter $\Delta$, and let $H(x; \nu_1, \nu_2, \Delta)$ denote the cumulative distribution function of this random variable evaluated at $x$. Also, let $F_{\nu_1, \nu_2}(p)$ denote the $p$'th quantile of the central F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom. It can be shown that

$$(t_{\nu, \Delta})^2 \cong F_{1, \nu, \Delta^2} \;\;\;\;\;\; (17)$$

where $\cong$ denotes “equal in distribution”. Thus, it follows that

$$[t_{\nu}(1 - \alpha/2)]^2 = F_{1, \nu}(1 - \alpha) \;\;\;\;\;\; (18)$$

so the formula for the power of the t-test given in Equation (16) can also be written as:

$$1 - \beta = Pr[(t_{\nu, \Delta})^2 \ge F_{1, \nu}(1 - \alpha)] = 1 - H[F_{1, \nu}(1 - \alpha); 1, \nu, \Delta^2] \;\;\;\;\;\; (19)$$
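These exact formulas can be evaluated directly with base R's non-central t and F distribution functions; a minimal sketch for one arbitrary design (n = 5, delta/sigma = 0.5, alpha = 0.05, two-sided):

n <- 5; dos <- 0.5; alpha <- 0.05
nu <- n - 1; Delta <- sqrt(n) * dos
# Equation (16): exact two-sided power via the non-central t-distribution
1 - pt(qt(1 - alpha/2, nu), nu, ncp = Delta) + pt(qt(alpha/2, nu), nu, ncp = Delta)
# Equation (19): the same power via the non-central F-distribution
1 - pf(qf(1 - alpha, 1, nu), 1, nu, ncp = Delta^2)
# Both should agree with tTestPower(n.or.n1 = 5, delta.over.sigma = 0.5),
# about 0.14 in the Examples below.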
Approximate Power Calculations (approx=TRUE
)
Zar (2010, pp.115-118) presents an approximation to the power for the t-test given in Equations (10), (14), and (16) above. His approximation to the power can be derived by using the approximation

$$t_{\nu, \Delta} \approx t_{\nu} + \Delta \;\;\;\;\;\; (20)$$

where $\approx$ denotes “approximately equal to”. Zar's approximation can be summarized in terms of the cumulative distribution function of the non-central t-distribution as follows:

$$G(x; \nu, \Delta) \approx G(x - \Delta; \nu, 0) = G(x - \Delta; \nu) \;\;\;\;\;\; (21)$$

where $G(x; \nu)$ denotes the cumulative distribution function of the central Student's t-distribution with $\nu$ degrees of freedom evaluated at $x$.
The following three subsections explicitly derive the approximation to the power of
the t-test for each of the three alternative hypotheses.
Upper one-sided alternative (alternative="greater"
)
The power for the upper one-sided alternative (2) given in Equation (10) can be approximated as:

$$1 - \beta \approx Pr[t_{\nu} \ge t_{\nu}(1 - \alpha) - \Delta] = 1 - G[t_{\nu}(1 - \alpha) - \Delta; \nu] \;\;\;\;\;\; (22)$$

where $t_{\nu}$ denotes a central Student's t-random variable with $\nu$ degrees of freedom.
Lower one-sided alternative (alternative="less"
)
The power for the lower one-sided alternative (3) given in Equation (14) can be approximated as:

$$1 - \beta \approx Pr[t_{\nu} \le t_{\nu}(\alpha) - \Delta] = G[t_{\nu}(\alpha) - \Delta; \nu] \;\;\;\;\;\; (23)$$
Two-sided alternative (alternative="two.sided"
)
The power for the two-sided alternative (4) given in Equation (16) can be approximated as:

$$1 - \beta \approx G[t_{\nu}(\alpha/2) - \Delta; \nu] + 1 - G[t_{\nu}(1 - \alpha/2) - \Delta; \nu] \;\;\;\;\;\; (24)$$
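A base-R sketch of the two-sided approximation in Equation (24), for the same arbitrary design used above (n = 5, delta/sigma = 0.5, alpha = 0.05):

n <- 5; dos <- 0.5; alpha <- 0.05
nu <- n - 1; Delta <- sqrt(n) * dos
# Central t CDF shifted by Delta in place of the non-central t CDF
1 - pt(qt(1 - alpha/2, nu) - Delta, nu) + pt(qt(alpha/2, nu) - Delta, nu)
# Compare with tTestPower(n.or.n1 = 5, delta.over.sigma = 0.5, approx = TRUE),
# about 0.10 in the Examples below.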
Two-Sample Case (sample.type="two.sample"
)
Let $x_{11}, x_{12}, \ldots, x_{1n_1}$ denote a vector of $n_1$ observations from a normal distribution with mean $\mu_1$ and standard deviation $\sigma$, and let $x_{21}, x_{22}, \ldots, x_{2n_2}$ denote a vector of $n_2$ observations from a normal distribution with mean $\mu_2$ and standard deviation $\sigma$, and consider the null hypothesis:

$$H_0: \mu_1 = \mu_2 \;\;\;\;\;\; (25)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater"):

$$H_a: \mu_1 > \mu_2 \;\;\;\;\;\; (26)$$

the lower one-sided alternative (alternative="less"):

$$H_a: \mu_1 < \mu_2 \;\;\;\;\;\; (27)$$

and the two-sided alternative (alternative="two.sided"):

$$H_a: \mu_1 \ne \mu_2 \;\;\;\;\;\; (28)$$
The test of the null hypothesis (25) versus any of the three alternatives (26)-(28) is based on the Student t-statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \;\;\;\;\;\; (29)$$

where

$$\bar{x}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_{1i} \;\;\;\;\;\; (30)$$

$$\bar{x}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} x_{2i} \;\;\;\;\;\; (31)$$

$$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \;\;\;\;\;\; (32)$$

Under the null hypothesis (25), the t-statistic in (29) follows a Student's t-distribution with $n_1 + n_2 - 2$ degrees of freedom (Zar, 2010, Chapter 8; Johnson et al., 1995, pp.508-510; Helsel and Hirsch, 1992, pp.124-128).
The formulas for the power of the two-sample t-test are precisely the same as those for the one-sample case, with the following modifications:

$$\Delta = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \left(\frac{\delta}{\sigma}\right) \;\;\;\;\;\; (33)$$

$$\delta = \mu_1 - \mu_2 \;\;\;\;\;\; (34)$$

$$\nu = n_1 + n_2 - 2 \;\;\;\;\;\; (35)$$

Note that none of the equations for the power of the t-test requires knowledge of the values $\delta$ or $\sigma$ (the population standard deviation for both populations), only the ratio $\delta/\sigma$. The argument delta.over.sigma is this ratio, and it is referred to as the “scaled difference”.
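A base-R sketch for the two-sample case, using the modifications in Equations (33)-(35) (arbitrary design: n1 = n2 = 10, delta/sigma = 1, alpha = 0.05, two-sided):

n1 <- 10; n2 <- 10; dos <- 1; alpha <- 0.05
nu <- n1 + n2 - 2
Delta <- sqrt(n1 * n2 / (n1 + n2)) * dos
1 - pt(qt(1 - alpha/2, nu), nu, ncp = Delta) + pt(qt(alpha/2, nu), nu, ncp = Delta)
# Should agree with tTestPower(10, sample.type = "two.sample", delta.over.sigma = 1),
# about 0.56 in the Examples below.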
a numeric vector of powers.
The normal distribution and lognormal distribution are probably the two most frequently used distributions to model environmental data. Often, you need to determine whether a population mean is significantly different from a specified standard (e.g., an MCL or ACL, USEPA, 1989b, Section 6), or whether two different means are significantly different from each other (e.g., USEPA 2009, Chapter 16). In this case, assuming normally distributed data, you can perform the Student's t-test.
In the course of designing a sampling program, an environmental scientist may wish
to determine the relationship between sample size, significance level, power, and
scaled difference if one of the objectives of the sampling program is to determine
whether a mean differs from a specified level or two means differ from each other.
The functions tTestPower
, tTestN
,
tTestScaledMdd
, and plotTTestDesign
can be used to
investigate these relationships for the case of normally-distributed observations.
Steven P. Millard ([email protected])
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Second Edition. Lewis Publishers, Boca Raton, FL.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY, Chapter 7.
Johnson, N. L., S. Kotz, and N. Balakrishnan. (1995). Continuous Univariate Distributions, Volume 2. Second Edition. John Wiley and Sons, New York, Chapters 28, 31
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
USEPA. (1989b). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities, Interim Final Guidance. EPA/530-SW-89-026. Office of Solid Waste, U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
tTestN
, tTestScaledMdd
, tTestAlpha
,
plotTTestDesign
, Normal,
t.test
, Hypothesis Tests.
# Look at how the power of the one-sample t-test increases with # increasing sample size: seq(5, 30, by = 5) #[1] 5 10 15 20 25 30 power <- tTestPower(n.or.n1 = seq(5, 30, by = 5), delta.over.sigma = 0.5) round(power, 2) #[1] 0.14 0.29 0.44 0.56 0.67 0.75 #---------- # Repeat the last example, but use the approximation. # Note how the approximation underestimates the power # for the smaller sample sizes. #---------------------------------------------------- power <- tTestPower(n.or.n1 = seq(5, 30, by = 5), delta.over.sigma = 0.5, approx = TRUE) round(power, 2) #[1] 0.10 0.26 0.42 0.56 0.67 0.75 #---------- # Look at how the power of the two-sample t-test increases with increasing # scaled difference: seq(0.5, 2, by = 0.5) #[1] 0.5 1.0 1.5 2.0 power <- tTestPower(10, sample.type = "two.sample", delta.over.sigma = seq(0.5, 2, by = 0.5)) round(power, 2) #[1] 0.19 0.56 0.89 0.99 #---------- # Look at how the power of the two-sample t-test increases with increasing values # of Type I error: power <- tTestPower(20, sample.type = "two.sample", delta.over.sigma = 0.5, alpha = c(0.001, 0.01, 0.05, 0.1)) round(power, 2) #[1] 0.03 0.14 0.34 0.46 #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), determine how # adding another four months of observations to increase the sample size from # 4 to 8 for any one particular compliance well will affect the power of a # one-sample t-test that compares the mean for the well with the MCL of # 7 ppb. Use alpha = 0.01, assume an upper one-sided alternative # (i.e., compliance well mean larger than 7 ppb), and assume a scaled # difference of 2. (The data are stored in EPA.09.Ex.21.1.aldicarb.df.) # Note that the power changes from 49% to 98% by increasing the sample size # from 4 to 8. tTestPower(n.or.n1 = c(4, 8), delta.over.sigma = 2, alpha = 0.01, sample.type = "one.sample", alternative = "greater") #[1] 0.4865800 0.9835401 #========== # Clean up #--------- rm(power)
Compute the scaled minimal detectable difference necessary to achieve a specified power for a one- or two-sample t-test, given the sample size(s) and Type I error level.
tTestScaledMdd(n.or.n1, n2 = n.or.n1, alpha = 0.05, power = 0.95, sample.type = ifelse(!missing(n2) && !is.null(n2), "two.sample", "one.sample"), alternative = "two.sided", two.sided.direction = "greater", approx = FALSE, tol = 1e-07, maxiter = 1000)
n.or.n1 |
numeric vector of sample sizes. When sample.type="one.sample", n.or.n1 denotes the number of observations in the single sample; when sample.type="two.sample", it denotes the number of observations in group 1. |
n2 |
numeric vector of sample sizes for group 2. The default value is the value of
n.or.n1. This argument is ignored when sample.type="one.sample". |
alpha |
numeric vector of numbers between 0 and 1 indicating the Type I error level
associated with the hypothesis test. The default value is alpha=0.05. |
power |
numeric vector of numbers between 0 and 1 indicating the power
associated with the hypothesis test. The default value is power=0.95. |
sample.type |
character string indicating whether to compute the scaled minimal detectable difference based on a one-sample or
two-sample hypothesis test. The possible values are "one.sample" and "two.sample". The default value is "two.sample" when the argument n2 is supplied, and "one.sample" otherwise. |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are:
"two.sided" (the default), "greater", and "less". |
two.sided.direction |
character string indicating the direction (positive or negative) for the
scaled minimal detectable difference when alternative="two.sided". The possible values are "greater" (the default) and "less". This argument is ignored unless alternative="two.sided". |
approx |
logical scalar indicating whether to compute the power based on an approximation to
the non-central t-distribution. The default value is approx=FALSE. |
tol |
numeric scalar indicating the tolerance argument to pass to the
uniroot function. The default value is tol=1e-7. |
maxiter |
positive integer indicating the maximum number of iterations
argument to pass to the uniroot function. The default value is maxiter=1000. |
Formulas for the power of the t-test for specified values of
the sample size, scaled difference, and Type I error level are given in
the help file for tTestPower
. The function tTestScaledMdd
uses the uniroot
search algorithm to determine the
required scaled minimal detectable difference for specified values of the
sample size, power, and Type I error level.
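This search is easy to illustrate directly. The following is a minimal sketch (assuming EnvStats is attached, and using arbitrary values n = 20, alpha = 0.05, and power = 0.80) that inverts the power function with uniroot and compares the result with the value returned by tTestScaledMdd:

# Solve tTestPower(n, delta.over.sigma = d) - 0.80 = 0 for d
f <- function(d) {
  tTestPower(n.or.n1 = 20, delta.over.sigma = d, alpha = 0.05,
    sample.type = "one.sample", alternative = "two.sided") - 0.80
}
uniroot(f, interval = c(0.001, 5), tol = 1e-7)$root

# Compare with the value computed directly:
tTestScaledMdd(n.or.n1 = 20, alpha = 0.05, power = 0.80)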
numeric vector of scaled minimal detectable differences.
See tTestPower
.
Steven P. Millard ([email protected])
See tTestPower
.
tTestPower
, tTestAlpha
,
tTestN
,
plotTTestDesign
, Normal,
t.test
, Hypothesis Tests.
# Look at how the scaled minimal detectable difference for the # one-sample t-test increases with increasing required power: seq(0.5, 0.9, by = 0.1) #[1] 0.5 0.6 0.7 0.8 0.9 scaled.mdd <- tTestScaledMdd(n.or.n1 = 20, power = seq(0.5,0.9,by=0.1)) round(scaled.mdd, 2) #[1] 0.46 0.52 0.59 0.66 0.76 #---------- # Repeat the last example, but compute the scaled minimal detectable # differences based on the approximation to the power instead of the # exact formula: scaled.mdd <- tTestScaledMdd(n.or.n1 = 20, power = seq(0.5, 0.9, by = 0.1), approx = TRUE) round(scaled.mdd, 2) #[1] 0.47 0.53 0.59 0.66 0.76 #========== # Look at how the scaled minimal detectable difference for the two-sample # t-test decreases with increasing sample size: seq(10,50,by=10) #[1] 10 20 30 40 50 scaled.mdd <- tTestScaledMdd(seq(10, 50, by = 10), sample.type = "two") round(scaled.mdd, 2) #[1] 1.71 1.17 0.95 0.82 0.73 #---------- # Look at how the scaled minimal detectable difference for the two-sample # t-test decreases with increasing values of Type I error: scaled.mdd <- tTestScaledMdd(20, alpha = c(0.001, 0.01, 0.05, 0.1), sample.type="two") round(scaled.mdd, 2) #[1] 1.68 1.40 1.17 1.06 #========== # Modifying the example on pages 21-4 to 21-5 of USEPA (2009), # determine the minimal mean level of aldicarb at the third compliance # well necessary to detect a mean level of aldicarb greater than the # MCL of 7 ppb, assuming 90%, 95%, and 99% power. Use a 99% significance # level and assume an upper one-sided alternative (third compliance well # mean larger than 7). Use the estimated standard deviation from the # first four months of data to estimate the true population standard # deviation in order to determine the minimal detectable difference based # on the computed scaled minimal detectable difference, then use this # minimal detectable difference to determine the mean level of aldicarb # necessary to detect a difference. (The data are stored in # EPA.09.Ex.21.1.aldicarb.df.) # # Note that the scaled minimal detectable difference changes from 3.4 to # 3.9 to 4.7 as the power changes from 90% to 95% to 99%. Thus, the # minimal detectable difference changes from 7.2 to 8.1 to 9.8, and the # minimal mean level of aldicarb changes from 14.2 to 15.1 to 16.8. EPA.09.Ex.21.1.aldicarb.df # Month Well Aldicarb.ppb #1 1 Well.1 19.9 #2 2 Well.1 29.6 #3 3 Well.1 18.7 #4 4 Well.1 24.2 #5 1 Well.2 23.7 #6 2 Well.2 21.9 #7 3 Well.2 26.9 #8 4 Well.2 26.1 #9 1 Well.3 5.6 #10 2 Well.3 3.3 #11 3 Well.3 2.3 #12 4 Well.3 6.9 sigma <- with(EPA.09.Ex.21.1.aldicarb.df, sd(Aldicarb.ppb[Well == "Well.3"])) sigma #[1] 2.101388 scaled.mdd <- tTestScaledMdd(n.or.n1 = 4, alpha = 0.01, power = c(0.90, 0.95, 0.99), sample.type="one", alternative="greater") scaled.mdd #[1] 3.431501 3.853682 4.668749 mdd <- scaled.mdd * sigma mdd #[1] 7.210917 8.098083 9.810856 minimal.mean <- mdd + 7 minimal.mean #[1] 14.21092 15.09808 16.81086 #========== # Clean up #--------- rm(scaled.mdd, sigma, mdd, minimal.mean)
Two-sample linear rank test to detect a difference (usually a shift) between two
distributions. The Wilcoxon Rank Sum test is a special case of
a linear rank test. The function twoSampleLinearRankTest
is part of
EnvStats mainly because this help file gives the necessary background to
explain two-sample linear rank tests for censored data (see twoSampleLinearRankTestCensored
).
twoSampleLinearRankTest(x, y, location.shift.null = 0, scale.shift.null = 1, alternative = "two.sided", test = "wilcoxon", shift.type = "location")
x |
numeric vector of values for the first sample.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
y |
numeric vector of values for the second sample.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
location.shift.null |
numeric scalar indicating the hypothesized value of the location shift between the two distributions under the null hypothesis. The default value is location.shift.null=0. This argument is ignored when shift.type="scale". |
scale.shift.null |
numeric scalar indicating the hypothesized value of the scale shift between the two distributions under the null hypothesis. The default value is scale.shift.null=1. This argument is ignored when shift.type="location". |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are "two.sided" (the default), "greater", and "less". |
test |
character string indicating which linear rank test to use. The possible values are:
"wilcoxon" (the default), "normal.scores", "moods.median", and "savage.scores". |
shift.type |
character string indicating which kind of shift is being tested. The possible values
are "location" (the default) and "scale". |
The function twoSampleLinearRankTest
allows you to compare two samples using
a locally most powerful rank test (LMPRT) to determine whether the two samples come
from the same distribution. The sections below explain the concepts of location and
scale shifts, linear rank tests, and LMPRT's.
Definitions of Location and Scale Shifts

Let X denote a random variable representing measurements from group 1 with
cumulative distribution function (cdf):

  F_X(t) = Pr(X ≤ t)        (1)

and let x_1, x_2, ..., x_m denote m independent observations from this
distribution. Let Y denote a random variable from group 2 with cdf:

  F_Y(t) = Pr(Y ≤ t)        (2)

and let y_1, y_2, ..., y_n denote n independent observations from this
distribution. Set N = m + n.
General Hypotheses to Test Differences Between Two Populations

A very general hypothesis to test whether two distributions are the same is
given by:

  H0: F_X(t) = F_Y(t) for all t        (3)

versus the two-sided alternative hypothesis:

  Ha: F_X(t) ≠ F_Y(t)        (4)

with strict inequality for at least one value of t.
The two possible one-sided hypotheses would be:

  H0: F_X(t) ≥ F_Y(t) for all t        (5)

versus the alternative hypothesis:

  Ha: F_X(t) < F_Y(t) for at least one value of t        (6)

and

  H0: F_X(t) ≤ F_Y(t) for all t        (7)

versus the alternative hypothesis:

  Ha: F_X(t) > F_Y(t) for at least one value of t        (8)
A similar set of hypotheses to test whether the two distributions are the same is given by (Conover, 1980, p. 216):

  H0: Pr(X < Y) = Pr(X > Y)        (9)

versus the two-sided alternative hypothesis:

  Ha: Pr(X < Y) ≠ Pr(X > Y)        (10)

or

  H0: Pr(X < Y) ≥ Pr(X > Y)        (11)

versus the alternative hypothesis:

  Ha: Pr(X < Y) < Pr(X > Y)        (12)

or

  H0: Pr(X < Y) ≤ Pr(X > Y)        (13)

versus the alternative hypothesis:

  Ha: Pr(X < Y) > Pr(X > Y)        (14)

Note that this second set of hypotheses (9)–(14) is not equivalent to the
set of hypotheses (3)–(8). For example, if X takes on the values 1 and 4
with probability 1/2 for each, and Y only takes on values in the interval
(1, 4) with strict inequality at the endpoints (e.g., Y takes on the values
2 and 3 with probability 1/2 for each), then the null hypothesis (9) is
true but the null hypothesis (3) is not true. However, the null hypothesis (3)
implies the null hypothesis (9), (5) implies (11), and (7) implies (13).
Location Shift

A special case of the alternative hypotheses (4), (6), and (8) above is the
location shift alternative:

  Ha: F_Y(t) = F_X(t + Δ)        (15)

where Δ denotes the shift between the two groups. (Note: some references
refer to (15) above as a shift in the median, but in fact this kind of shift
represents a shift in every single quantile, not just the median.)
If Δ is positive, this means that observations in group 1 tend to be
larger than observations in group 2, and if Δ is negative, observations
in group 1 tend to be smaller than observations in group 2.
The alternative hypothesis (15) is called a location shift: the only difference between the two distributions is a difference in location (e.g., the standard deviation is assumed to be the same for both distributions). A location shift is not applicable to distributions that are bounded below or above by some constant, such as a lognormal distribution. For lognormal distributions, the location shift could refer to a shift in location of the distribution of the log-transformed observations.
For a location shift, the null hypothesis (3) can be generalized as:

  H0: F_Y(t) = F_X(t + Δ0) for all t        (16)

where Δ0 denotes the null shift between the two groups. Almost always,
however, the null shift is taken to be 0 and we will assume this for the rest of this
help file.
Alternatively, the null and alternative hypotheses can be written as

  H0: Δ = 0        (17)

versus the alternative hypothesis

  Ha: Δ > 0        (18)

The other one-sided alternative hypothesis (Ha: Δ < 0) and two-sided
alternative hypothesis (Ha: Δ ≠ 0) could be considered as well.
The general hypotheses (3)-(14) are not location shift hypotheses
(e.g., the standard deviation does not have to be the same for both distributions),
but they do allow for distributions that are bounded below or above by a constant
(e.g., lognormal distributions).
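As a small numerical sketch of the point that a location shift moves every quantile (the shift of 0.75 below is an arbitrary illustrative value):

set.seed(23)
x <- rnorm(1000, mean = 3, sd = 1)   # group 2
y <- x + 0.75                        # group 1, shifted up by Delta = 0.75
# Every sample quantile of the shifted data is larger by exactly 0.75:
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)
quantile(y, probs = p) - quantile(x, probs = p)
#  10%  25%  50%  75%  90%
# 0.75 0.75 0.75 0.75 0.75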
Scale Shift

A special kind of scale shift replaces the alternative hypothesis (15) with the
alternative hypothesis:

  Ha: F_Y(t) = F_X(t / τ)        (19)

where τ denotes the shift in scale between the two groups. Alternatively,
the null and alternative hypotheses for this scale shift can be written as

  H0: τ = 1        (20)

versus the alternative hypothesis

  Ha: τ > 1        (21)

The other one-sided alternative hypothesis (Ha: τ < 1) and two-sided alternative
hypothesis (Ha: τ ≠ 1) could be considered as well.
This kind of scale shift often involves a shift in both location and scale. For
example, suppose the underlying distribution for both groups is
exponential, with parameter rate=λ. Then
the mean and standard deviation of the reference group are both 1/λ, while
the mean and standard deviation of the treatment group are both τ/λ. In
this case, the alternative hypothesis (21) implies the more general alternative
hypothesis (8).
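A quick sketch of this exponential example (the rate of 1 and scale shift of 2 below are arbitrary illustrative values): multiplying an exponential random variable by a constant multiplies both its mean and its standard deviation by that constant.

set.seed(47)
lambda <- 1                         # rate parameter for the reference group
tau    <- 2                         # scale shift
x <- rexp(100000, rate = lambda)    # reference group
y <- tau * x                        # treatment group: exponential with rate lambda/tau
c(mean(x), sd(x))   # both close to 1/lambda = 1
c(mean(y), sd(y))   # both close to tau/lambda = 2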
Linear Rank Tests

The usual nonparametric test to test the null hypothesis of the same distribution
for both groups versus the location-shift alternative (18) is the
Wilcoxon Rank Sum test
(Gilbert, 1987, pp.247-250; Helsel and Hirsch, 1992, pp.118-123;
Hollander and Wolfe, 1999). Note that the Mann-Whitney U test is equivalent to the
Wilcoxon Rank Sum test (Hollander and Wolfe, 1999; Conover, 1980, p.215,
Zar, 2010). Hereafter, this test will be abbreviated as the MWW test. The MWW test
is performed by combining the m X observations with the n Y observations and
ranking them from smallest to largest, and then computing the statistic

  W = R_1 + R_2 + ... + R_m        (22)

where R_1, R_2, ..., R_m denote the ranks of the X observations when
the X and Y observations are combined and ranked. The null
hypothesis (5), (11), or (17) is rejected in favor of the alternative hypothesis
(6), (12) or (18) if the value of W is too large. For small sample sizes,
the exact distribution of W under the null hypothesis is fairly easy to
compute and may be found in tables (e.g., Hollander and Wolfe, 1999;
Conover, 1980, pp.448-452). For larger sample sizes, a normal approximation is
usually used (Hollander and Wolfe, 1999; Conover, 1980, p.217). For the
R function wilcox.test, an exact p-value is computed if the
samples contain less than 50 finite values and there are no ties.
It is important to note that the MWW test is actually testing the more general hypotheses (9)-(14) (Conover, 1980, p.216; Divine et al., 2013), even though it is often presented as only applying to location shifts.
The MWW W-statistic in Equation (22) is an example of a linear rank statistic (Hettmansperger, 1984, p.147; Prentice, 1985), which is any statistic that can be written in the form:

  T = a(R_1) + a(R_2) + ... + a(R_m)        (23)

where a() denotes a score function. Statistics of this form are also called
general scores statistics (Hettmansperger, 1984, p.147). The MWW test
uses the identity score function:

  a(R_i) = R_i        (24)

Any test based on a linear rank statistic is called a linear rank test.
Under the null hypothesis (3), (9), (17), or (20), the distribution of the linear
rank statistic T does not depend on the form of the underlying distribution of
the X and Y observations. Hence, tests based on T are
nonparametric (also called distribution-free). If the null hypothesis is not true,
however, the distribution of T will depend not only on the distributions of the
X and Y observations, but also upon the form of the score function a().
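For instance, the rank-sum statistic of Equation (22) is easy to compute by hand, and for untied data it differs from the statistic reported by wilcox.test only by the constant m(m+1)/2 (wilcox.test reports the Mann-Whitney form). A minimal sketch:

set.seed(12)
x <- rnorm(15, mean = 3)     # group 1
y <- rnorm(10, mean = 3.5)   # group 2
m <- length(x)
r <- rank(c(x, y))           # ranks in the combined sample
W <- sum(r[1:m])             # linear rank statistic with identity scores, Equation (22)
W
# wilcox.test reports the Mann-Whitney U statistic, which equals W - m(m+1)/2:
wilcox.test(x, y)$statistic + m * (m + 1) / 2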
Locally Most Powerful Linear Rank Tests

The decision of what scores to use may be based on considering the power of the test.
A locally most powerful rank test (LMPRT) of the null hypothesis (17) versus the
alternative (18) maximizes the slope of the power (as a function of Δ) in
the neighborhood where Δ = 0. A LMPRT of the null hypothesis (20) versus
the alternative (21) maximizes the slope of the power (as a function of τ)
in the neighborhood where τ = 1. That is, LMPRT's are the best linear rank
tests you can use for detecting small shifts in location or scale.
Table 1 below shows the score functions associated with the LMPRT's for various assumed underlying distributions (Hettmansperger, 1984, Chapter 3; Millard and Deverel, 1988, p.2090). A test based on the identity score function of Equation (24) is equivalent to a test based on the score shown in Table 1 associated with the logistic distribution, thus the MWW test is the LMPRT for detecting a location shift when the underlying observations follow the logistic distribution. When the underlying distribution is normal or lognormal, the LMPRT for a location shift uses the “Normal scores” shown in Table 1. When the underlying distribution is exponential, the LMPRT for detecting a scale shift is based on the “Savage scores” shown in Table 1.
Table 1. Scores of LMPRT's for Various Distributions

Distribution | Score | Shift Type | Test Name
Logistic | 2 R_i / (N+1) - 1 | Location | Wilcoxon Rank Sum
Normal or Lognormal (log-scale) | Φ⁻¹[ R_i / (N+1) ] * | Location | Van der Waerden or Normal scores
Double Exponential | sign[ R_i - (N+1)/2 ] | Location | Mood's Median
Exponential or Extreme Value | Σ_{j=1}^{R_i} 1/(N-j+1)  -  1 | Scale | Savage scores
* Denotes an approximation to the true score. The symbol Φ denotes the
cumulative distribution function of the standard normal distribution, and
sign denotes the sign function.
A large sample normal approximation to the distribution of the linear rank statistic
for arbitrary score functions is given by Hettmansperger (1984, p.148).
Under the null hypothesis (17) or (20), the mean and variance of T
are given by:

  E(T) = m ā,   where ā = (1/N) Σ_{i=1}^{N} a(i)

  Var(T) = [ mn / (N(N-1)) ] Σ_{i=1}^{N} [ a(i) - ā ]²

Hettmansperger (1984, Chapter 3) shows that under the null hypothesis of no difference between the two groups, the statistic

  z = [ T - E(T) ] / √Var(T)

is approximately distributed as a standard normal random variable for “large” sample sizes. This statistic will tend to be large if the observations in group 1 tend to be larger than the observations in group 2.
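To make this approximation concrete, the following sketch computes T with identity scores (the MWW test), its permutation mean and variance, and the resulting z-statistic; with no ties, the two-sided p-value should agree with wilcox.test called with exact=FALSE and correct=FALSE:

set.seed(5)
x <- rnorm(15, mean = 3)     # group 1
y <- rnorm(10, mean = 3.5)   # group 2
m <- length(x); n <- length(y); N <- m + n
a <- seq_len(N)                          # identity scores a(i) = i (MWW test)
r <- rank(c(x, y))
T.stat <- sum(a[r[1:m]])                 # equals the rank sum W here
mean.T <- m * mean(a)
var.T  <- m * n / (N * (N - 1)) * sum((a - mean(a))^2)
z      <- (T.stat - mean.T) / sqrt(var.T)
2 * pnorm(-abs(z))                       # two-sided p-value, normal approximation
# With no ties this agrees with:
wilcox.test(x, y, exact = FALSE, correct = FALSE)$p.value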
a list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
The Wilcoxon Rank Sum test, also known as the Mann-Whitney U test, is the standard nonparametric test used to test for differences between two groups (e.g., Zar, 2010; USEPA, 2009, pp.16-14 to 16-20). Other possible nonparametric tests include linear rank tests based on scores other than the ranks, including the “normal scores” test and the “Savage scores” tests. The normal scores test is actually slightly more powerful than the Wilcoxon Rank Sum test for detecting small shifts in location if the underlying distribution is normal or lognormal. In general, however, there will be little difference between these two tests.
The results of calling the function twoSampleLinearRankTest
with the
argument test="wilcoxon"
will match those of calling the built-in
R function wilcox.test
with the arguments exact=FALSE
and
correct=FALSE
. In general, it is better to use the built-in function
wilcox.test
for performing the Wilcoxon Rank Sum test, since this
function can compute exact (rather than approximate) p-values.
Steven P. Millard ([email protected])
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, Chapter 4.
Divine, G., H.J. Norton, R. Hunt, and J. Dinemann. (2013). A Review of Analysis and Sample Size Calculation Considerations for Wilcoxon Tests. Anesthesia & Analgesia 117, 699–710.
Hettmansperger, T.P. (1984). Statistical Inference Based on Ranks. John Wiley and Sons, New York, 323pp.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods, Second Edition. John Wiley and Sons, New York.
Millard, S.P., and S.J. Deverel. (1988). Nonparametric Statistical Methods for Comparing Two Sites Based on Data With Multiple Nondetect Limits. Water Resources Research, 24(12), 2087–2098.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.432–435.
Prentice, R.L. (1985). Linear Rank Tests. In Kotz, S., and N.L. Johnson, eds. Encyclopedia of Statistical Science. John Wiley and Sons, New York. Volume 5, pp.51–58.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
wilcox.test
, twoSampleLinearRankTestCensored
,
htest.object
.
# Generate 15 observations from a normal distribution with parameters # mean=3 and sd=1. Call these the observations from the reference group. # Generate 10 observations from a normal distribution with parameters # mean=3.5 and sd=1. Call these the observations from the treatment group. # Compare the results of calling wilcox.test to those of calling # twoSampleLinearRankTest with test="normal.scores". # (The call to set.seed allows you to reproduce this example.) set.seed(346) x <- rnorm(15, mean = 3) y <- rnorm(10, mean = 3.5) wilcox.test(x, y) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: location shift = 0 # #Alternative Hypothesis: True location shift is not equal to 0 # #Test Name: Wilcoxon rank sum test # #Data: x and y # #Test Statistic: W = 32 # #P-value: 0.0162759 twoSampleLinearRankTest(x, y, test = "normal.scores") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Normal Scores Test # Based on Normal Approximation # #Data: x = x # y = y # #Sample Sizes: nx = 15 # ny = 10 # #Test Statistic: z = -2.431099 # #P-value: 0.01505308 #---------- # Clean up #--------- rm(x, y) #========== # Following Example 6.6 on pages 6.22-6.26 of USEPA (1994b), perform the # Wilcoxon Rank Sum test for the TcCB data (stored in EPA.94b.tccb.df). # There are m=47 observations from the reference area and n=77 observations # from the cleanup unit. Then compare the results using the other available # linear rank tests. Note that Mood's median test yields a p-value less # than 0.10, while the other tests yield non-significant p-values. # In this case, Mood's median test is picking up the residual contamination # in the cleanup unit. (See the example in the help file for quantileTest.) names(EPA.94b.tccb.df) #[1] "TcCB.orig" "TcCB" "Censored" "Area" summary(EPA.94b.tccb.df$Area) # Cleanup Reference # 77 47 with(EPA.94b.tccb.df, twoSampleLinearRankTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"])) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Wilcoxon Rank Sum Test # Based on Normal Approximation # #Data: x = TcCB[Area == "Cleanup"] # y = TcCB[Area == "Reference"] # #Sample Sizes: nx = 77 # ny = 47 # #Test Statistic: z = -1.171872 # #P-value: 0.2412485 with(EPA.94b.tccb.df, twoSampleLinearRankTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], test="normal.scores"))$p.value #[1] 0.3399484 with(EPA.94b.tccb.df, twoSampleLinearRankTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], test="moods.median"))$p.value #[1] 0.09707393 with(EPA.94b.tccb.df, twoSampleLinearRankTest(TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], test="savage.scores"))$p.value #[1] 0.2884351
Two-sample linear rank test to detect a difference (usually a shift) between two distributions based on censored data.
twoSampleLinearRankTestCensored(x, x.censored, y, y.censored, censoring.side = "left", location.shift.null = 0, scale.shift.null = 1, alternative = "two.sided", test = "logrank", variance = "hypergeometric", surv.est = "prentice", shift.type = "location")
x |
numeric vector of values for the first sample.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
x.censored |
numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of x.censored is "logical", TRUE values correspond to elements of x that are censored and FALSE values correspond to elements of x that are not censored. |
y |
numeric vector of values for the second sample.
Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed. |
y.censored |
numeric or logical vector indicating which values of y are censored. This must be the same length as y. If the mode of y.censored is "logical", TRUE values correspond to elements of y that are censored and FALSE values correspond to elements of y that are not censored. |
censoring.side |
character string indicating on which side the censoring occurs for the data in
x and y. The possible values are "left" (the default) and "right". |
location.shift.null |
numeric scalar indicating the hypothesized value of the location shift between the two distributions under the null hypothesis. The default value is location.shift.null=0. This argument is ignored when shift.type="scale". |
scale.shift.null |
numeric scalar indicating the hypothesized value of the scale shift between the two distributions under the null hypothesis. The default value is scale.shift.null=1. This argument is ignored when shift.type="location". |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are "two.sided" (the default), "greater", and "less". |
test |
character string indicating which linear rank test to use. The possible values are:
"logrank" (the default), "tarone-ware", "gehan", "peto-peto", "normal.scores.1", "normal.scores.2", and "generalized.sign". |
variance |
character string indicating which kind of variance to compute for the test. The
possible values are: "hypergeometric" (the default), "permutation", and "asymptotic". |
surv.est |
character string indicating what method to use to estimate the survival function.
The possible values are "prentice" (the default), "kaplan-meier", "peto-peto", and "altshuler". |
shift.type |
character string indicating which kind of shift is being tested. The possible values
are "location" (the default) and "scale". |
The function twoSampleLinearRankTestCensored
allows you to compare two
samples containing censored observations using a linear rank test to determine
whether the two samples came from the same distribution. The help file for
twoSampleLinearRankTest
explains linear rank tests for complete data
(i.e., no censored observations are present), and here we assume you are
familiar with that material. The sections below explain how linear
rank tests can be extended to the case of censored data.
Notation
Several authors have proposed extensions of the MWW test to the case of censored
data, mainly in the context of survival analysis (e.g., Breslow, 1970; Cox, 1972;
Gehan, 1965; Mantel, 1966; Peto and Peto, 1972; Prentice, 1978). Prentice (1978)
showed how all of these proposed tests are extensions of a linear rank test to the
case of censored observations.
Survival analysis usually deals with right-censored data, whereas environmental data is rarely right-censored but often left-censored (some observations are reported as less than some detection limit). Fortunately, all of the methods developed for right-censored data can be applied to left-censored data as well. (See the sub-section Left-Censored Data below.)
In order to explain Prentice's (1978) generalization of linear rank tests to censored
data, we will use the following notation that closely follows Prentice (1978),
Prentice and Marek (1979), and Latta (1981).
Let X denote a random variable representing measurements from group 1 with
cumulative distribution function (cdf):

  F_X(t) = Pr(X ≤ t)        (1)

and let x_1, x_2, ..., x_m denote m independent observations from this
distribution. Let Y denote a random variable from group 2 with cdf:

  F_Y(t) = Pr(Y ≤ t)        (2)

and let y_1, y_2, ..., y_n denote n independent observations from this
distribution. Set N = m + n, the total number of observations.
Assume the data are right-censored so that some observations are only recorded as
greater than some censoring level, with possibly several different censoring levels.
Let t_1 < t_2 < ... < t_k denote the k ordered, unique, uncensored
observations for the combined samples (in the context of survival data,
t usually stands for “time of death”). For j = 1, 2, ..., k, let
d_{1j} denote the number of observations from sample 1 (the x observations) that are
equal to t_j, and let d_{2j} denote the number of observations from sample 2 (the y
observations) equal to this value. Set

  d_j = d_{1j} + d_{2j}

the total number of observations equal to t_j. If there are no tied
uncensored observations, then d_j = 1 for j = 1, 2, ..., k,
otherwise d_j is greater than 1 for at least one value of j.
For j = 1, 2, ..., k, let e_{1j} denote the number of censored
observations from sample 1 (the x observations) with censoring levels that
fall into the interval [t_j, t_{j+1}), where t_{k+1} = ∞ by
definition, and let e_{2j} denote the number of censored observations from
sample 2 (the y observations) with censoring levels that fall into this
interval. Set

  e_j = e_{1j} + e_{2j}

the total number of censoring levels that fall into this interval.
Finally, set n_{1j} equal to the number of observations from sample 1
(uncensored and censored) known to be greater than or equal to t_j, i.e.,
that lie in the interval [t_j, ∞),
set n_{2j} equal to the number of observations from sample 2
(uncensored and censored) that lie in this interval, and set

  n_j = n_{1j} + n_{2j}

In survival analysis jargon, n_{1j} denotes the number of people from
sample 1 who are “at risk” at time t_j, that is, these people are
known to still be alive at this time. Similarly, n_{2j} denotes the number
of people from sample 2 who are at risk at time t_j, and n_j denotes
the total number of people at risk at time t_j.
Score Statistics for Multiply Censored Data

Prentice's (1978) generalization of the two-sample score (linear rank) statistic is
given by:

  ν = Σ_{j=1}^{k} ( c_j d_{1j} + C_j e_{1j} )        (6)

where c_j and C_j denote the scores associated with the uncensored and
censored observations, respectively. As for complete data, the form of the scores
depends upon the assumed underlying distribution. Table 1 below shows scores for
various assumed distributions as presented in Prentice (1978) and Latta (1981)
(also see Table 5 of Millard and Deverel, 1988, p.2091).
The last column shows what these tests reduce to in the case of complete data
(no censored observations).
Table 1. Scores Associated with Various Censored Data Rank Tests

Distribution | Uncensored Score | Censored Score | Test Name | Uncensored Analogue
Logistic | | | Peto-Peto | Wilcoxon Rank Sum
" | | | Gehan or Breslow | "
" | | | Tarone-Ware | "
Normal, Lognormal | | | Normal Scores 1 | Normal Scores
" | | | Normal Scores 2 | "
Double Exponential | | | Generalized Sign | Mood's Median
Exponential, Extreme Value | | | Logrank | Savage Scores

(See Prentice (1978), Latta (1981), and Table 5 of Millard and Deverel (1988, p.2091) for the explicit score expressions.)
In Table 1 above, Φ denotes the cumulative distribution function of the
standard normal distribution, φ denotes the probability density function
of the standard normal distribution, and sign denotes the sign function.
Also, the quantities F(t_j) and S(t_j) denote the estimates of the cumulative
distribution function (cdf) and survival function, respectively, at time t_j
for the combined sample. The estimated cdf is related to the estimated survival
function by:

  F(t_j) = 1 - S(t_j)

The quantity S*(t_j) denotes the Altshuler (1970) estimate of the
survival function at time t_j for the combined sample (see below).
The argument surv.est determines what method to use to estimate the survival
function. When surv.est="prentice" (the default), the survival function is
estimated using the estimator of Prentice (1978). When surv.est="kaplan-meier",
the survival function is estimated as:

  S(t_j) = Π_{i: t_i ≤ t_j} [ (n_i - d_i) / n_i ]

(Kaplan and Meier, 1958), and when surv.est="peto-peto", the survival
function is estimated using the estimator of Peto and Peto (1972). All three of these estimators
of the survival function should produce very similar results. When
surv.est="altshuler", the survival function is estimated as:

  S*(t_j) = exp[ - Σ_{i: t_i ≤ t_j} (d_i / n_i) ]

(Altshuler, 1970). The scores for the logrank test use this estimator of survival.
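Given the counts d_j and n_j for the combined sample, the Kaplan-Meier and Altshuler estimates are simple cumulative products and sums. A minimal sketch using made-up counts (the death and at-risk numbers below are arbitrary, chosen only so that n is non-increasing and d ≤ n):

d <- c(2, 1, 1, 1)    # deaths at each distinct uncensored value (combined sample)
n <- c(9, 6, 4, 2)    # number at risk at each of those values
S.km  <- cumprod((n - d) / n)   # Kaplan-Meier estimate of survival
S.alt <- exp(-cumsum(d / n))    # Altshuler estimate, used for the logrank scores
rbind(Kaplan.Meier = round(S.km, 3), Altshuler = round(S.alt, 3))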
Lee and Wang (2003, p. 116) present a slightly different version of the Peto-Peto
test. They use the Peto-Peto estimate of the survival function for , but
use the Kaplan-Meier estimate of the survival function for
.
The scores for the “Normal Scores 1” test shown in Table 1 above are based on
the approximation (30) of Prentice (1978). The scores for the
“Normal Scores 2” test are based on equation (7) of Prentice and Marek (1979).
For the “Normal Scores 2” test, the following rules are used to construct the
scores for the censored observations: , and
if
.
The Distribution of the Score Statistic
Under the null hypothesis that the two distributions are the same, the expected
value of the score statistic ν in Equation (6) is 0. The variance of
ν can be computed in at least three different ways. If the censoring
mechanism is the same for both groups, the permutation variance is
appropriate (variance="permutation").
Often, however, it is not clear whether this assumption is valid, and both Prentice (1978) and Prentice and Marek (1979) caution against using the permutation variance (Prentice and Marek, 1979, state it can lead to inflated estimates of variance).
If the censoring mechanisms for the two groups are not necessarily the same, a more
general estimator of the variance is based on a conditional permutation approach. In
this case, the statistic in Equation (6) is re-written as a weighted sum of the
differences between the observed and expected numbers of uncensored sample 1
observations at each uncensored value:

  ν = Σ_{j=1}^{k} w_j [ d_{1j} - (n_{1j} d_j / n_j) ]        (13)

where the weights w_j are determined by the scores given above in Table 1,
and the conditional permutation or hypergeometric estimate
(variance="hypergeometric") of the variance is given by:

  Var(ν) = Σ_{j=1}^{k} w_j² [ n_{1j} n_{2j} d_j (n_j - d_j) ] / [ n_j² (n_j - 1) ]

(Prentice and Marek, 1979; Latta, 1981; Millard and Deverel, 1988). Note that Equation (13) can be thought of as the sum of weighted values of observed minus expected observations.
Prentice (1978) derived an asymptotic estimator of the variance of the score
statistic ν given in Equation (6) above based on the log likelihood of the
rank vector (variance="asymptotic"). This estimator is the same as the
hypergeometric variance estimator for the logrank and Gehan tests (assuming no
tied uncensored observations), but for the Peto-Peto test it takes a different form
(Prentice, 1978; Latta, 1981; Millard and Deverel, 1988). Note that equation (14)
of Millard and Deverel (1988) contains a typographical error.
The Treatment of Ties

If the hypergeometric estimator of variance is being used, no modifications need to
be made for ties; Equations (13)-(15) already account for ties. For the case of the
permutation or asymptotic variance estimators, Equations (6), (12), and (16) all
assume no ties in the uncensored observations. If ties exist in the uncensored
observations, Prentice (1978) suggests computing the scores shown in Table 1
above as if there were no ties, and then assigning average scores to the
tied observations. (This modification also applies to the corresponding quantities
in Equation (16) above.) For this algorithm, the ν
statistic in Equation (6) is not in general the same as the one in Equation (13).
Computing a Test Statistic

Under the null hypothesis that the two distributions are the same, the statistic

  z = ν / √Var(ν)        (20)

is approximately distributed as a standard normal random variable for “large”
sample sizes. This statistic will tend to be large if the observations in
group 1 (the x observations) tend to be larger than the observations in
group 2 (the y observations).
Left-Censored Data
Most of the time, if censored observations occur in environmental data, they are
left-censored (e.g., observations are reported as less than one or more detection
limits). For the two-sample test of differences between groups, the methods that
apply to right-censored data are easily adapted to left-censored data: simply
multiply the observations by -1, compute the z-statistic shown in Equation
(20), then reverse the sign of this statistic before computing the p-value.
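One way to see this recipe in action is with the survdiff function from the survival package, which expects right-censored data: reversing the ordering of the concentrations turns a value reported as "less than a detection limit" into a right-censored value. The sketch below uses made-up numbers; it illustrates the idea only, and will not necessarily match twoSampleLinearRankTestCensored exactly because of differences in tie handling and variance choices (twoSampleLinearRankTestCensored handles the sign of z and the flipping internally).

library(survival)
# Made-up left-censored concentrations; TRUE means the value is a nondetect ("< value")
conc  <- c(1, 2, 2, 5, 8, 3, 4, 6, 9, 12)
cens  <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
group <- rep(c("site1", "site2"), each = 5)

# Reverse the ordering so left-censored values become right-censored ones
# (any order-reversing transformation works for a rank-based test):
flip <- max(conc) - conc
survdiff(Surv(flip, !cens) ~ group, rho = 0)   # logrank chi-square on the flipped data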
a list of class "htestCensored"
containing the results of the hypothesis test.
See the help file for htestCensored.object
for details.
All of the tests computed by twoSampleLinearRankTestCensored
(logrank, Tarone-Ware, Gehan, Peto-Peto, normal scores, and generalized sign)
are based on a
statistic that is essentially the sum over all uncensored time points of the
weighted difference between the observed and expected number of observations at each
time point (see Equation (15) above). The tests differ in how they weight the
differences between the observed and expected number of observations.
Prentice and Marek (1979) point out that the Gehan test uses weights that depend on the censoring rates within each group and can lead to non-significant outcomes in the case of heavy censoring when in fact a very large difference between the two groups exists.
Latta (1981) performed a Monte Carlo simulation to study the power of the Gehan,
logrank, and Peto-Peto tests using all three different estimators of variance
(permutation, hypergeometric, and asymptotic). He used lognormal, Weibull, and
exponential distributions to generate the observations, and studied two different
cases of censoring: uniform censoring for both samples vs. no censoring in the first
sample and uniform censoring in the second sample. Latta (1981) used sample sizes
of 10 and 50 (both the equal and unequal cases were studied). Latta (1981) found
that all three tests maintained the nominal Type I error level (α-level)
in the case of equal sample sizes and equal censoring. Also, the Peto-Peto test
based on the asymptotic variance appeared to maintain the nominal α-level
in all situations, but the other tests were slightly biased in the case of unequal
sample sizes and/or unequal censoring. In particular, tests based on the
hypergeometric variance are slightly biased for unequal sample sizes. Latta (1981)
concludes that if there is no censoring or light censoring, any of the tests may be
used (but the hypergeometric variance should not be used if the sample sizes are
very different). In the case of heavy censoring where sample sizes are far apart
and/or the censoring is very different between samples, the Peto-Peto test based on
the asymptotic variance should be used.
Millard and Deverel (1988) also performed a Monte Carlo simulation similar to
Latta's (1981) study. They only used the lognormal distribution to generate
observations, but also looked at the normal scores test and two ad-hoc modifications
of the MWW test. They found the “Normal Scores 2” test shown in Table 1
above to be the best behaved test in terms of maintaining the nominal
α-level, but the other tests behaved almost as well. As Latta (1981)
found, when sample sizes and censoring are very different between the two groups,
the nominal α-level of most of the tests is slightly biased. In the
cases where the nominal α-level was maintained, the Peto-Peto test based
on the asymptotic variance appeared to be as powerful or more powerful than the
normal scores tests.
Neither of the Monte Carlo studies performed by Latta (1981) and Millard and Deverel (1988) looked at the behavior of the two-sample linear rank tests in the presence of several tied uncensored observations (because both studies generated observations from continuous distributions). Note that the results shown in Table 9 of Millard and Deverel (1988, p.2097) are not all correct because they did not allow for tied uncensored values. The last example in the EXAMPLES section below shows the correct values that should appear in that table.
Heller and Venkatraman (1996) performed a Monte Carlo simulation study to compare the behaviors of the Peto-Peto test (using the Prentice, 1978, estimator of survival; they call this the Prentice-Wilcoxon test) and logrank test under varying censoring conditions with sample sizes of 20 and 50 per group based on using the following methods to compute p-values: the asymptotic standard normal approximation, a permutation test approach (this is NOT the same as the permutation variance), and a bootstrap approach. Observed times were generated from Weibull and lognormal survival time distributions with independent uniform censoring. They found that for the Peto-Peto test, "the asymptotic test procedure was the most accurate; resampling procedures did not improve upon its accuracy." For the logrank test, with sample sizes of 20 per group, the usual test based on the asymptotic standard normal approximation tended to have a very slightly higher Type I error rate than assumed (however, for an assumed Type I error rate of 0.05, the largest Type I error rate observed was less than 0.065), whereas the permutation and bootstrap tests performed better; with sample sizes of 50 per group there was no difference in test performance.
Fleming and Harrington (1981) introduced a family of tests (sometimes called G-rho
tests) that contain the logrank and Peto-Peto tests as special cases. A single
parameter ρ (rho) controls the weights given to the uncensored and
censored observations. Positive values of ρ produce tests more sensitive
to early differences in the survival function, that is, differences in the cdf at
small values. Negative values of ρ produce tests more sensitive to late
differences in the survival function, that is, differences in the cdf at large
values.
The function survdiff
in the R package
survival implements the G-rho family of tests suggested by Fleming and
Harrington (1981). Calling survdiff
with rho=0
(the default) yields
the logrank test. Calling survdiff
with rho=1
yields the Peto-Peto
test based on the Kaplan-Meier estimate of survival. The function survdiff
always uses the hypergeometric estimate of variance and the Kaplan-Meier estimate of
survival, but it uses the “left-continuous” version of the Kaplan-Meier
estimate. The left-continuous K-M estimate of survival is defined as
follows: at each death (unique uncensored observation), the estimated survival is
equal to the estimated survival based on the ordinary K-M estimate at the prior
death time (or 1 for the first death).
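A brief sketch of those two calls on a small, made-up right-censored dataset (the times, status values, and group labels below are purely illustrative):

library(survival)
time   <- c(5, 8, 12, 16, 23, 27, 30, 33, 43, 45)
status <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1)   # 1 = event observed, 0 = right-censored
group  <- rep(c("A", "B"), each = 5)

survdiff(Surv(time, status) ~ group, rho = 0)   # logrank test
survdiff(Surv(time, status) ~ group, rho = 1)   # Peto-Peto style test (Kaplan-Meier weights)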
Steven P. Millard ([email protected])
Altshuler, B. (1970). Theory for the Measurement of Competing Risks in Animal Experiments. Mathematical Biosciences 6, 1–11.
Breslow, N.E. (1970). A Generalized Kruskal-Wallis Test for Comparing K Samples Subject to Unequal Patterns of Censorship. Biometrika 57, 579–594.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York, Chapter 4.
Cox, D.R. (1972). Regression Models and Life Tables (with Discussion). Journal of the Royal Statistical Society of London, Series B 34, 187–220.
Divine, G., H.J. Norton, R. Hunt, and J. Dinemann. (2013). A Review of Analysis and Sample Size Calculation Considerations for Wilcoxon Tests. Anesthesia & Analgesia 117, 699–710.
Fleming, T.R., and D.P. Harrington. (1981). A Class of Hypothesis Tests for One and Two Sample Censored Survival Data. Communications in Statistics – Theory and Methods A10(8), 763–794.
Fleming, T.R., and D.P. Harrington. (1991). Counting Processes & Survival Analysis. John Wiley and Sons, New York, Chapter 7.
Gehan, E.A. (1965). A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-Censored Samples. Biometrika 52, 203–223.
Harrington, D.P., and T.R. Fleming. (1982). A Class of Rank Test Procedures for Censored Survival Data. Biometrika 69(3), 553–566.
Heller, G., and E. S. Venkatraman. (1996). Resampling Procedures to Compare Two Survival Distributions in the Presence of Right-Censored Data. Biometrics 52, 1204–1213.
Hettmansperger, T.P. (1984). Statistical Inference Based on Ranks. John Wiley and Sons, New York, 323pp.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods, Second Edition. John Wiley and Sons, New York.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457–481.
Latta, R.B. (1981). A Monte Carlo Study of Some Two-Sample Rank Tests with Censored Data. Journal of the American Statistical Association 76(375), 713–719.
Mantel, N. (1966). Evaluation of Survival Data and Two New Rank Order Statistics Arising in its Consideration. Cancer Chemotherapy Reports 50, 163–170.
Millard, S.P., and S.J. Deverel. (1988). Nonparametric Statistical Methods for Comparing Two Sites Based on Data With Multiple Nondetect Limits. Water Resources Research, 24(12), 2087–2098.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.432–435.
Peto, R., and J. Peto. (1972). Asymptotically Efficient Rank Invariant Test Procedures (with Discussion). Journal of the Royal Statistical Society of London, Series A 135, 185–206.
Prentice, R.L. (1978). Linear Rank Tests with Right Censored Data. Biometrika 65, 167–179.
Prentice, R.L. (1985). Linear Rank Tests. In Kotz, S., and N.L. Johnson, eds. Encyclopedia of Statistical Science. John Wiley and Sons, New York. Volume 5, pp.51–58.
Prentice, R.L., and P. Marek. (1979). A Qualitative Discrepancy Between Censored Data Rank Tests. Biometrics 35, 861–867.
Tarone, R.E., and J. Ware. (1977). On Distribution-Free Tests for Equality of Survival Distributions. Biometrika 64(1), 156–160.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
twoSampleLinearRankTest
, survdiff
,
wilcox.test
, htestCensored.object
.
# The last part of the EXAMPLES section in the help file for # cdfCompareCensored compares the empirical distribution of copper and zinc # between two sites: Alluvial Fan and Basin-Trough (Millard and Deverel, 1988). # The data for this example are stored in Millard.Deverel.88.df. Perform a # test to determine if there is a significant difference between these two # sites (perform a separate test for the copper and the zinc). Millard.Deverel.88.df # Cu.orig Cu Cu.censored Zn.orig Zn Zn.censored Zone Location #1 < 1 1 TRUE <10 10 TRUE Alluvial.Fan 1 #2 < 1 1 TRUE 9 9 FALSE Alluvial.Fan 2 #3 3 3 FALSE NA NA FALSE Alluvial.Fan 3 #. #. #. #116 5 5 FALSE 50 50 FALSE Basin.Trough 48 #117 14 14 FALSE 90 90 FALSE Basin.Trough 49 #118 4 4 FALSE 20 20 FALSE Basin.Trough 50 #------------------------------ # First look at the copper data #------------------------------ Cu.AF <- with(Millard.Deverel.88.df, Cu[Zone == "Alluvial.Fan"]) Cu.AF.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Alluvial.Fan"]) Cu.BT <- with(Millard.Deverel.88.df, Cu[Zone == "Basin.Trough"]) Cu.BT.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Basin.Trough"]) # Note the large number of tied observations in the copper data #-------------------------------------------------------------- table(Cu.AF[!Cu.AF.cen]) # 1 2 3 4 5 7 8 9 10 11 12 16 20 # 5 21 6 3 3 3 1 1 1 1 1 1 1 table(Cu.BT[!Cu.BT.cen]) # 1 2 3 4 5 6 8 9 12 14 15 17 23 # 7 4 8 5 1 2 1 2 1 1 1 1 1 # Logrank test with hypergeometric variance: #------------------------------------------- twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Logrank Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 1 5 10 20 # y = 1 2 5 10 15 # #Data: x = Cu.AF # y = Cu.BT # #Censoring Variable: x = Cu.AF.cen # y = Cu.BT.cen # #Number NA/NaN/Inf's Removed: x = 3 # y = 1 # #Sample Sizes: nx = 65 # ny = 49 # #Percent Censored: x = 26.2% # y = 28.6% # #Test Statistics: nu = -1.8791355 # var.nu = 13.6533490 # z = -0.5085557 # #P-value: 0.6110637 # Compare the p-values produced by the Normal Scores 2 test # using the hypergeomtric vs. permutation variance estimates. # Note how much larger the estimated variance is based on # the permuation variance estimate: #----------------------------------------------------------- twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen, test = "normal.scores.2")$p.value #[1] 0.2008913 twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen, test = "normal.scores.2", variance = "permutation")$p.value #[1] [1] 0.657001 #-------------------------- # Now look at the zinc data #-------------------------- Zn.AF <- with(Millard.Deverel.88.df, Zn[Zone == "Alluvial.Fan"]) Zn.AF.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Alluvial.Fan"]) Zn.BT <- with(Millard.Deverel.88.df, Zn[Zone == "Basin.Trough"]) Zn.BT.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Basin.Trough"]) # Note the moderate number of tied observations in the zinc data, # and the "outlier" of 620 in the Alluvial Fan data. 
#--------------------------------------------------------------- table(Zn.AF[!Zn.AF.cen]) # 5 7 8 9 10 11 12 17 18 19 20 23 29 30 33 40 50 620 # 1 1 1 1 20 2 1 1 1 1 14 1 1 1 1 1 1 1 table(Zn.BT[!Zn.BT.cen]) # 3 4 5 6 8 10 11 12 13 14 15 17 20 25 30 40 50 60 70 90 # 2 2 2 1 1 5 1 2 1 1 1 2 11 1 4 3 2 2 1 1 # Logrank test with hypergeometric variance: #------------------------------------------- twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Logrank Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 3 10 # y = 3 10 # #Data: x = Zn.AF # y = Zn.BT # #Censoring Variable: x = Zn.AF.cen # y = Zn.BT.cen # #Number NA/NaN/Inf's Removed: x = 1 # y = 0 # #Sample Sizes: nx = 67 # ny = 50 # #Percent Censored: x = 23.9% # y = 8.0% # #Test Statistics: nu = -6.992999 # var.nu = 17.203227 # z = -1.686004 # #P-value: 0.09179512 #---------- # Compare the p-values produced by the Logrank, Gehan, Peto-Peto, # and Tarone-Ware tests using the hypergeometric variance. #----------------------------------------------------------- twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "logrank")$p.value #[1] 0.09179512 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "gehan")$p.value #[1] 0.0185445 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "peto-peto")$p.value #[1] 0.009704529 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "tarone-ware")$p.value #[1] 0.03457803 #---------- # Clean up #--------- rm(Cu.AF, Cu.AF.cen, Cu.BT, Cu.BT.cen, Zn.AF, Zn.AF.cen, Zn.BT, Zn.BT.cen) #========== # Example 16.5 on pages 16-22 to 16.23 of USEPA (2009) shows how to perform # the Tarone-Ware two sample linear rank test based on censored data using # observations on tetrachloroethylene (PCE) (ppb) collected at one background # and one compliance well. The data for this example are stored in # EPA.09.Ex.16.5.PCE.df. 
EPA.09.Ex.16.5.PCE.df # Well.type PCE.Orig.ppb PCE.ppb Censored #1 Background <4 4.0 TRUE #2 Background 1.5 1.5 FALSE #3 Background <2 2.0 TRUE #4 Background 8.7 8.7 FALSE #5 Background 5.1 5.1 FALSE #6 Background <5 5.0 TRUE #7 Compliance 6.4 6.4 FALSE #8 Compliance 10.9 10.9 FALSE #9 Compliance 7 7.0 FALSE #10 Compliance 14.3 14.3 FALSE #11 Compliance 1.9 1.9 FALSE #12 Compliance 10 10.0 FALSE #13 Compliance 6.8 6.8 FALSE #14 Compliance <5 5.0 TRUE with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater")) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) > Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Tarone-Ware Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 5 # y = 2 4 5 # #Data: x = PCE.ppb[Well.type == "Compliance"] # y = PCE.ppb[Well.type == "Background"] # #Censoring Variable: x = Censored[Well.type == "Compliance"] # y = Censored[Well.type == "Background"] # #Sample Sizes: nx = 8 # ny = 6 # #Percent Censored: x = 12.5% # y = 50.0% # #Test Statistics: nu = 8.458912 # var.nu = 20.912407 # z = 1.849748 # #P-value: 0.03217495 # Compare the p-value for the Tarone-Ware test with p-values from # the logrank, Gehan, and Peto-Peto tests #----------------------------------------------------------------- with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater"))$p.value #[1] 0.03217495 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "logrank", alternative = "greater"))$p.value #[1] 0.02752793 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "gehan", alternative = "greater"))$p.value #[1] 0.03656224 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "peto-peto", alternative = "greater"))$p.value #[1] 0.03127296 #========== # The results shown in Table 9 of Millard and Deverel (1988, p.2097) are correct # only for the hypergeometric variance and the modified MWW tests; the other # results were computed as if there were no ties. Re-compute the correct # z-statistics and p-values for the copper and zinc data. 
test <- c(rep(c("gehan", "logrank", "peto-peto"), 2), "peto-peto", "normal.scores.1", "normal.scores.2", "normal.scores.2") variance <- c(rep("permutation", 3), rep("hypergeometric", 3), "asymptotic", rep("permutation", 2), "hypergeometric") stats.mat <- matrix(as.numeric(NA), ncol = 4, nrow = 10) for(i in 1:10) { dum.list <- with(Millard.Deverel.88.df, twoSampleLinearRankTestCensored( x = Cu[Zone == "Basin.Trough"], x.censored = Cu.censored[Zone == "Basin.Trough"], y = Cu[Zone == "Alluvial.Fan"], y.censored = Cu.censored[Zone == "Alluvial.Fan"], test = test[i], variance = variance[i])) stats.mat[i, 1:2] <- c(dum.list$statistic["z"], dum.list$p.value) dum.list <- with(Millard.Deverel.88.df, twoSampleLinearRankTestCensored( x = Zn[Zone == "Basin.Trough"], x.censored = Zn.censored[Zone == "Basin.Trough"], y = Zn[Zone == "Alluvial.Fan"], y.censored = Zn.censored[Zone == "Alluvial.Fan"], test = test[i], variance = variance[i])) stats.mat[i, 3:4] <- c(dum.list$statistic["z"], dum.list$p.value) } dimnames(stats.mat) <- list(paste(test, variance, sep = "."), c("Cu.Z", "Cu.p.value", "Zn.Z", "Zn.p.value")) round(stats.mat, 2) # Cu.Z Cu.p.value Zn.Z Zn.p.value #gehan.permutation 0.87 0.38 2.49 0.01 #logrank.permutation 0.79 0.43 1.75 0.08 #peto-peto.permutation 0.92 0.36 2.42 0.02 #gehan.hypergeometric 0.71 0.48 2.35 0.02 #logrank.hypergeometric 0.51 0.61 1.69 0.09 #peto-peto.hypergeometric 1.03 0.30 2.59 0.01 #peto-peto.asymptotic 0.90 0.37 2.37 0.02 #normal.scores.1.permutation 0.94 0.34 2.37 0.02 #normal.scores.2.permutation 0.98 0.33 2.39 0.02 #normal.scores.2.hypergeometric 1.28 0.20 2.48 0.01 #---------- # Clean up #--------- rm(test, variance, stats.mat, i, dum.list)
# The last part of the EXAMPLES section in the help file for # cdfCompareCensored compares the empirical distribution of copper and zinc # between two sites: Alluvial Fan and Basin-Trough (Millard and Deverel, 1988). # The data for this example are stored in Millard.Deverel.88.df. Perform a # test to determine if there is a significant difference between these two # sites (perform a separate test for the copper and the zinc). Millard.Deverel.88.df # Cu.orig Cu Cu.censored Zn.orig Zn Zn.censored Zone Location #1 < 1 1 TRUE <10 10 TRUE Alluvial.Fan 1 #2 < 1 1 TRUE 9 9 FALSE Alluvial.Fan 2 #3 3 3 FALSE NA NA FALSE Alluvial.Fan 3 #. #. #. #116 5 5 FALSE 50 50 FALSE Basin.Trough 48 #117 14 14 FALSE 90 90 FALSE Basin.Trough 49 #118 4 4 FALSE 20 20 FALSE Basin.Trough 50 #------------------------------ # First look at the copper data #------------------------------ Cu.AF <- with(Millard.Deverel.88.df, Cu[Zone == "Alluvial.Fan"]) Cu.AF.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Alluvial.Fan"]) Cu.BT <- with(Millard.Deverel.88.df, Cu[Zone == "Basin.Trough"]) Cu.BT.cen <- with(Millard.Deverel.88.df, Cu.censored[Zone == "Basin.Trough"]) # Note the large number of tied observations in the copper data #-------------------------------------------------------------- table(Cu.AF[!Cu.AF.cen]) # 1 2 3 4 5 7 8 9 10 11 12 16 20 # 5 21 6 3 3 3 1 1 1 1 1 1 1 table(Cu.BT[!Cu.BT.cen]) # 1 2 3 4 5 6 8 9 12 14 15 17 23 # 7 4 8 5 1 2 1 2 1 1 1 1 1 # Logrank test with hypergeometric variance: #------------------------------------------- twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Logrank Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 1 5 10 20 # y = 1 2 5 10 15 # #Data: x = Cu.AF # y = Cu.BT # #Censoring Variable: x = Cu.AF.cen # y = Cu.BT.cen # #Number NA/NaN/Inf's Removed: x = 3 # y = 1 # #Sample Sizes: nx = 65 # ny = 49 # #Percent Censored: x = 26.2% # y = 28.6% # #Test Statistics: nu = -1.8791355 # var.nu = 13.6533490 # z = -0.5085557 # #P-value: 0.6110637 # Compare the p-values produced by the Normal Scores 2 test # using the hypergeomtric vs. permutation variance estimates. # Note how much larger the estimated variance is based on # the permuation variance estimate: #----------------------------------------------------------- twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen, test = "normal.scores.2")$p.value #[1] 0.2008913 twoSampleLinearRankTestCensored(x = Cu.AF, x.censored = Cu.AF.cen, y = Cu.BT, y.censored = Cu.BT.cen, test = "normal.scores.2", variance = "permutation")$p.value #[1] [1] 0.657001 #-------------------------- # Now look at the zinc data #-------------------------- Zn.AF <- with(Millard.Deverel.88.df, Zn[Zone == "Alluvial.Fan"]) Zn.AF.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Alluvial.Fan"]) Zn.BT <- with(Millard.Deverel.88.df, Zn[Zone == "Basin.Trough"]) Zn.BT.cen <- with(Millard.Deverel.88.df, Zn.censored[Zone == "Basin.Trough"]) # Note the moderate number of tied observations in the zinc data, # and the "outlier" of 620 in the Alluvial Fan data. 
#--------------------------------------------------------------- table(Zn.AF[!Zn.AF.cen]) # 5 7 8 9 10 11 12 17 18 19 20 23 29 30 33 40 50 620 # 1 1 1 1 20 2 1 1 1 1 14 1 1 1 1 1 1 1 table(Zn.BT[!Zn.BT.cen]) # 3 4 5 6 8 10 11 12 13 14 15 17 20 25 30 40 50 60 70 90 # 2 2 2 1 1 5 1 2 1 1 1 2 11 1 4 3 2 2 1 1 # Logrank test with hypergeometric variance: #------------------------------------------- twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) != Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Logrank Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 3 10 # y = 3 10 # #Data: x = Zn.AF # y = Zn.BT # #Censoring Variable: x = Zn.AF.cen # y = Zn.BT.cen # #Number NA/NaN/Inf's Removed: x = 1 # y = 0 # #Sample Sizes: nx = 67 # ny = 50 # #Percent Censored: x = 23.9% # y = 8.0% # #Test Statistics: nu = -6.992999 # var.nu = 17.203227 # z = -1.686004 # #P-value: 0.09179512 #---------- # Compare the p-values produced by the Logrank, Gehan, Peto-Peto, # and Tarone-Ware tests using the hypergeometric variance. #----------------------------------------------------------- twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "logrank")$p.value #[1] 0.09179512 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "gehan")$p.value #[1] 0.0185445 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "peto-peto")$p.value #[1] 0.009704529 twoSampleLinearRankTestCensored(x = Zn.AF, x.censored = Zn.AF.cen, y = Zn.BT, y.censored = Zn.BT.cen, test = "tarone-ware")$p.value #[1] 0.03457803 #---------- # Clean up #--------- rm(Cu.AF, Cu.AF.cen, Cu.BT, Cu.BT.cen, Zn.AF, Zn.AF.cen, Zn.BT, Zn.BT.cen) #========== # Example 16.5 on pages 16-22 to 16.23 of USEPA (2009) shows how to perform # the Tarone-Ware two sample linear rank test based on censored data using # observations on tetrachloroethylene (PCE) (ppb) collected at one background # and one compliance well. The data for this example are stored in # EPA.09.Ex.16.5.PCE.df. 
EPA.09.Ex.16.5.PCE.df # Well.type PCE.Orig.ppb PCE.ppb Censored #1 Background <4 4.0 TRUE #2 Background 1.5 1.5 FALSE #3 Background <2 2.0 TRUE #4 Background 8.7 8.7 FALSE #5 Background 5.1 5.1 FALSE #6 Background <5 5.0 TRUE #7 Compliance 6.4 6.4 FALSE #8 Compliance 10.9 10.9 FALSE #9 Compliance 7 7.0 FALSE #10 Compliance 14.3 14.3 FALSE #11 Compliance 1.9 1.9 FALSE #12 Compliance 10 10.0 FALSE #13 Compliance 6.8 6.8 FALSE #14 Compliance <5 5.0 TRUE with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater")) #Results of Hypothesis Test #Based on Censored Data #-------------------------- # #Null Hypothesis: Fy(t) = Fx(t) # #Alternative Hypothesis: Fy(t) > Fx(t) for at least one t # #Test Name: Two-Sample Linear Rank Test: # Tarone-Ware Test # with Hypergeometric Variance # #Censoring Side: left # #Censoring Level(s): x = 5 # y = 2 4 5 # #Data: x = PCE.ppb[Well.type == "Compliance"] # y = PCE.ppb[Well.type == "Background"] # #Censoring Variable: x = Censored[Well.type == "Compliance"] # y = Censored[Well.type == "Background"] # #Sample Sizes: nx = 8 # ny = 6 # #Percent Censored: x = 12.5% # y = 50.0% # #Test Statistics: nu = 8.458912 # var.nu = 20.912407 # z = 1.849748 # #P-value: 0.03217495 # Compare the p-value for the Tarone-Ware test with p-values from # the logrank, Gehan, and Peto-Peto tests #----------------------------------------------------------------- with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "tarone-ware", alternative = "greater"))$p.value #[1] 0.03217495 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "logrank", alternative = "greater"))$p.value #[1] 0.02752793 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "gehan", alternative = "greater"))$p.value #[1] 0.03656224 with(EPA.09.Ex.16.5.PCE.df, twoSampleLinearRankTestCensored( x = PCE.ppb[Well.type == "Compliance"], x.censored = Censored[Well.type == "Compliance"], y = PCE.ppb[Well.type == "Background"], y.censored = Censored[Well.type == "Background"], test = "peto-peto", alternative = "greater"))$p.value #[1] 0.03127296 #========== # The results shown in Table 9 of Millard and Deverel (1988, p.2097) are correct # only for the hypergeometric variance and the modified MWW tests; the other # results were computed as if there were no ties. Re-compute the correct # z-statistics and p-values for the copper and zinc data. 
test <- c(rep(c("gehan", "logrank", "peto-peto"), 2), "peto-peto", "normal.scores.1", "normal.scores.2", "normal.scores.2") variance <- c(rep("permutation", 3), rep("hypergeometric", 3), "asymptotic", rep("permutation", 2), "hypergeometric") stats.mat <- matrix(as.numeric(NA), ncol = 4, nrow = 10) for(i in 1:10) { dum.list <- with(Millard.Deverel.88.df, twoSampleLinearRankTestCensored( x = Cu[Zone == "Basin.Trough"], x.censored = Cu.censored[Zone == "Basin.Trough"], y = Cu[Zone == "Alluvial.Fan"], y.censored = Cu.censored[Zone == "Alluvial.Fan"], test = test[i], variance = variance[i])) stats.mat[i, 1:2] <- c(dum.list$statistic["z"], dum.list$p.value) dum.list <- with(Millard.Deverel.88.df, twoSampleLinearRankTestCensored( x = Zn[Zone == "Basin.Trough"], x.censored = Zn.censored[Zone == "Basin.Trough"], y = Zn[Zone == "Alluvial.Fan"], y.censored = Zn.censored[Zone == "Alluvial.Fan"], test = test[i], variance = variance[i])) stats.mat[i, 3:4] <- c(dum.list$statistic["z"], dum.list$p.value) } dimnames(stats.mat) <- list(paste(test, variance, sep = "."), c("Cu.Z", "Cu.p.value", "Zn.Z", "Zn.p.value")) round(stats.mat, 2) # Cu.Z Cu.p.value Zn.Z Zn.p.value #gehan.permutation 0.87 0.38 2.49 0.01 #logrank.permutation 0.79 0.43 1.75 0.08 #peto-peto.permutation 0.92 0.36 2.42 0.02 #gehan.hypergeometric 0.71 0.48 2.35 0.02 #logrank.hypergeometric 0.51 0.61 1.69 0.09 #peto-peto.hypergeometric 1.03 0.30 2.59 0.01 #peto-peto.asymptotic 0.90 0.37 2.37 0.02 #normal.scores.1.permutation 0.94 0.34 2.37 0.02 #normal.scores.2.permutation 0.98 0.33 2.39 0.02 #normal.scores.2.hypergeometric 1.28 0.20 2.48 0.01 #---------- # Clean up #--------- rm(test, variance, stats.mat, i, dum.list)
Perform a two-sample or paired-sample randomization (permutation) test for location based on either means or medians.
twoSamplePermutationTestLocation(x, y, fcn = "mean", alternative = "two.sided", mu1.minus.mu2 = 0, paired = FALSE, exact = FALSE, n.permutations = 5000, seed = NULL, tol = sqrt(.Machine$double.eps))
twoSamplePermutationTestLocation(x, y, fcn = "mean", alternative = "two.sided", mu1.minus.mu2 = 0, paired = FALSE, exact = FALSE, n.permutations = 5000, seed = NULL, tol = sqrt(.Machine$double.eps))
x |
numeric vector of observations from population 1.
Missing ( |
y |
numeric vector of observations from population 2.
Missing ( In the case when |
fcn |
character string indicating which location parameter to compare between the two
groups. The possible values are |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
mu1.minus.mu2 |
numeric scalar indicating the hypothesized value of the difference between the
means or medians. The default value is |
paired |
logical scalar indicating whether to perform a paired or two-sample permutation
test. The possible values are |
exact |
logical scalar indicating whether to perform the exact permutation test (i.e.,
enumerate all possible permutations) or simply sample from the permutation
distribution. The default value is |
n.permutations |
integer indicating how many times to sample from the permutation distribution when
|
seed |
positive integer to pass to the R function |
tol |
numeric scalar indicating the tolerance to use for computing the p-value for the
two-sample permutation test. The default value is |
Randomization Tests
In 1935, R.A. Fisher introduced the idea of a randomization test
(Manly, 2007, p. 107; Efron and Tibshirani, 1993, Chapter 15), which is based on
trying to answer the question: “Did the observed pattern happen by chance,
or does the pattern indicate the null hypothesis is not true?” A randomization
test works by simply enumerating all of the possible outcomes under the null
hypothesis, then seeing where the observed outcome fits in. A randomization test
is also called a permutation test, because it involves permuting the
observations during the enumeration procedure (Manly, 2007, p. 3).
In the past, randomization tests have not been used as extensively as they are now
because of the “large” computing resources needed to enumerate all of the
possible outcomes, especially for large sample sizes. The advent of more powerful
personal computers and software has allowed randomization tests to become much
easier to perform. Depending on the sample size, however, it may still be too
time consuming to enumerate all possible outcomes. In this case, the randomization
test can still be performed by sampling from the randomization distribution, and
comparing the observed outcome to this sampled permutation distribution.
Two-Sample Randomization Test for Location (paired=FALSE
)
Let $\underline{x} = (x_1, x_2, \ldots, x_{n_1})$ be a vector of $n_1$ independent and identically distributed (i.i.d.) observations from some distribution with location parameter (e.g., mean or median) $\mu_1$, and let $\underline{y} = (y_1, y_2, \ldots, y_{n_2})$ be a vector of $n_2$ i.i.d. observations from the same distribution with possibly different location parameter $\mu_2$.

Consider the test of the null hypothesis that the difference in the location parameters is equal to some specified value:

$$H_0: \delta = \delta_0 \quad (1)$$

where

$$\delta = \mu_1 - \mu_2 \quad (2)$$

and $\delta_0$ denotes the hypothesized difference in the measures of location (usually $\delta_0 = 0$).

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater")

$$H_a: \delta > \delta_0 \quad (3)$$

the lower one-sided alternative (alternative="less")

$$H_a: \delta < \delta_0 \quad (4)$$

and the two-sided alternative

$$H_a: \delta \ne \delta_0 \quad (5)$$
To perform the test of the null hypothesis (1) versus any of the three alternatives (3)-(5), you can use the two-sample permutation test. The two sample permutation test is based on trying to answer the question, “Did the observed difference in means or medians happen by chance, or does the observed difference indicate that the null hypothesis is not true?” Under the null hypothesis, the underlying distributions for each group are the same, therefore it should make no difference which group an observation gets assigned to. The two-sample permutation test works by simply enumerating all possible permutations of group assignments, and for each permutation computing the difference between the measures of location for each group (Manly, 2007, p. 113; Efron and Tibshirani, 1993, p. 202). The measure of location for a group could be the mean, median, or any other measure you want to use. For example, if the observations from Group 1 are 3 and 5, and the observations from Group 2 are 4, 6, and 7, then there are 10 different ways of splitting these five observations into one group of size 2 and another group of size 3. The table below lists all of the possible group assignments, along with the differences in the group means.
Group 1 | Group 2 | Mean 1 - Mean 2 |
3, 4 | 5, 6, 7 | -2.5 |
3, 5 | 4, 6, 7 | -1.67 |
3, 6 | 4, 5, 7 | -0.83 |
3, 7 | 4, 5, 6 | 0 |
4, 5 | 3, 6, 7 | -0.83 |
4, 6 | 3, 5, 7 | 0 |
4, 7 | 3, 5, 6 | 0.83 |
5, 6 | 3, 4, 7 | 0.83 |
5, 7 | 3, 4, 6 | 1.67 |
6, 7 | 3, 4, 5 | 2.5 |
In this example, the observed group assignments and difference in means are shown in the second row of the table.
For a one-sided upper alternative (Equation (3)), the p-value is computed as the proportion of times that the differences of the means (or medians) in the permutation distribution are greater than or equal to the observed difference in means (or medians). For a one-sided lower alternative hypothesis (Equation (4)), the p-value is computed as the proportion of times that the differences in the means (or medians) in the permutation distribution are less than or equal to the observed difference in the means (or medians). For a two-sided alternative hypothesis (Equation (5)), the p-value is computed as the proportion of times the absolute values of the differences in the means (or medians) in the permutation distribution are greater than or equal to the absolute value of the observed difference in the means (or medians).
For this simple example, the one-sided upper, one-sided lower, and two-sided p-values are 0.9, 0.2 and 0.4, respectively.
Note: Because of the nature of machine arithmetic and how the permutation
distribution is computed, a one-sided upper p-value is computed as the proportion
of times that the differences of the means (or medians) in the permutation
distribution are greater than or equal to
[the observed difference in means (or medians) - a small tolerance value], where the
tolerance value is determined by the argument tol
. Similarly, a one-sided
lower p-value is computed as the proportion of times that the differences in the
means (or medians) in the permutation distribution are less than or equal to
[the observed difference in the means (or medians) + a small tolerance value].
Finally, a two-sided p-value is computed as the proportion of times the absolute
values of the differences in the means (or medians) in the permutation distribution
are greater than or equal to
[the absolute value of the observed difference in the means (or medians) - a small tolerance value].
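To make the enumeration above concrete, the following minimal base-R sketch (not part of EnvStats; the object names are illustrative) reproduces the ten group assignments and the three p-values for the hypothetical observations 3, 5 (Group 1) and 4, 6, 7 (Group 2), applying the same small tolerance described in the note above.

tol <- sqrt(.Machine$double.eps)
obs <- c(3, 5, 4, 6, 7)                 # Group 1 = first two values, Group 2 = rest
idx <- combn(5, 2)                      # all 10 possible assignments of two values to Group 1
perm.diffs <- apply(idx, 2, function(i) mean(obs[i]) - mean(obs[-i]))
round(perm.diffs, 2)                    # the ten mean differences (same values as the table above, in a different order)
obs.diff <- mean(obs[1:2]) - mean(obs[3:5])     # observed difference: -1.67
mean(perm.diffs >= obs.diff - tol)              # one-sided upper p-value: 0.9
mean(perm.diffs <= obs.diff + tol)              # one-sided lower p-value: 0.2
mean(abs(perm.diffs) >= abs(obs.diff) - tol)    # two-sided p-value: 0.4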
In this simple example, we assumed the hypothesized difference in the means under the null hypothesis was $\delta_0 = 0$. If we had hypothesized a different value for $\delta_0$, then we would have had to subtract this value from each of
the observations in Group 1 before permuting the group assignments to compute the
permutation distribution of the differences of the means. As in the case of the
one-sample permutation test, if the sample sizes
for the groups become too large to compute all possible permutations of the group
assignments, the permutation test can still be performed by sampling from the
permutation distribution and comparing the observed difference in locations to the
sampled permutation distribution of the difference in locations.
Unlike the two-sample Student's t-test, we do not have to worry about the normality assumption when we use a permutation test. The permutation test still assumes, however, that under the null hypothesis, the distributions of the observations from each group are exactly the same, and under the alternative hypothesis there is simply a shift in location (that is, the whole distribution of group 1 is shifted by some constant relative to the distribution of group 2). Mathematically, this can be written as follows:
$$F_1(t) = F_2(t - \delta), \quad -\infty < t < \infty$$

where $F_1$ and $F_2$ denote the cumulative distribution functions for group 1 and group 2, respectively. If $\delta > 0$, this implies that the observations in group 1 tend to be larger than the observations in group 2, and if $\delta < 0$, this implies that the observations in group 1 tend to be
smaller than the observations in group 2. Thus, the shape and spread (variance)
of the two distributions should be the same whether the null hypothesis is true or
not. Therefore, the Type I error rate for a permutation test can be affected by
differences in variances between the two groups.
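A small simulation sketch illustrates this last point (the sample sizes, standard deviations, number of simulations, and number of permutations below are arbitrary choices, and the resulting estimate is noisy): when the smaller group has the larger variance, the nominal 5% two-sample permutation test based on means can reject far more often than 5% even though both means are equal.

set.seed(1)
n.sim <- 200
reject <- replicate(n.sim, {
  x <- rnorm(10, mean = 0, sd = 5)   # small group, large variance
  y <- rnorm(40, mean = 0, sd = 1)   # large group, small variance
  twoSamplePermutationTestLocation(x, y, n.permutations = 500)$p.value < 0.05
})
mean(reject)   # empirical Type I error rate; tends to be well above 0.05 here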
Confidence Intervals for the Difference in Means or Medians
Based on the relationship between hypothesis tests and confidence intervals, it is possible to construct a two-sided or one-sided $(1-\alpha)100\%$ confidence interval for the difference in means or medians based on the two-sample permutation test by finding the values of $\delta_0$ that correspond to obtaining a p-value of $\alpha$ (Manly, 2007, pp. 18–20, 114). A confidence interval based on the bootstrap, however, will yield a similar type of confidence interval
(Efron and Tibshirani, 1993, p. 214); see the help file for
boot
in the R package boot.
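As a rough sketch of that inversion (this is not an EnvStats function; the name permCI, the grid of candidate values, and the fixed seed are illustrative assumptions), one can scan a grid of hypothesized differences and keep those whose two-sided p-value exceeds $\alpha$:

permCI <- function(x, y, deltas, conf.level = 0.95, seed = 47) {
  # p-value of the two-sided permutation test at each candidate difference
  pvals <- sapply(deltas, function(d)
    twoSamplePermutationTestLocation(x, y, mu1.minus.mu2 = d,
      seed = seed)$p.value)
  # approximate confidence limits: smallest and largest "accepted" delta0
  range(deltas[pvals > 1 - conf.level])
}

The resolution of the resulting interval is limited by the spacing of the grid and by the Monte Carlo error of the sampled permutation distribution.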
Paired-Sample Randomization Test for Location (paired=TRUE
)
When the argument paired=TRUE
, the arguments x
and y
are
assumed to have the same length, and the differences $d_i = x_i - y_i$, $i = 1, 2, \ldots, n$, are assumed to be independent observations from some symmetric distribution with mean $\delta$. The
one-sample permutation test can then be applied
to the differences.
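For example (a sketch with arbitrary simulated values, not one of the package's own examples), a paired test on simulated before/after measurements looks like this:

set.seed(42)
before <- rlnorm(12, meanlog = 1, sdlog = 0.5)
after  <- before * rlnorm(12, meanlog = -0.1, sdlog = 0.2)   # modest average decrease
paired.test <- twoSamplePermutationTestLocation(after, before,
  paired = TRUE, alternative = "less", seed = 47)
paired.test$p.value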
A list of class "permutationTest"
containing the results of the hypothesis
test. See the help file for permutationTest.object
for details.
A frequent question in environmental statistics is “Is the concentration of chemical X in Area A greater than the concentration of chemical X in Area B?”. For example, in groundwater detection monitoring at hazardous and solid waste sites, the concentration of a chemical in the groundwater at a downgradient well must be compared to “background”. If the concentration is “above” the background then the site enters assessment monitoring. As another example, soil cleanup at a Superfund site may involve comparing the concentration of a chemical in the soil at a “cleaned up” site with the concentration at a “background” site. If the concentration at the “cleaned up” site is “greater” than the background concentration, then further investigation and remedial action may be required. Determining what it means for the chemical concentration to be “greater” than background is a policy decision: you may want to compare averages, medians, 95'th percentiles, etc.
Hypothesis tests you can use to compare “location” between two groups include: Student's t-test, Fisher's randomization test (described in this help file), the Wilcoxon rank sum test, other two-sample linear rank tests, the quantile test, and a test based on a bootstrap confidence interval.
Steven P. Millard ([email protected])
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, Chapter 15.
Manly, B.F.J. (2007). Randomization, Bootstrap and Monte Carlo Methods in Biology. Third Edition. Chapman & Hall, New York, Chapter 6.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.426–431.
permutationTest.object
, plot.permutationTest
,
oneSamplePermutationTest
, twoSamplePermutationTestProportion
,
Hypothesis Tests, boot
.
# Generate 10 observations from a lognormal distribution with parameters # mean=5 and cv=2, and and 20 observations from a lognormal distribution with # parameters mean=10 and cv=2. Test the null hypothesis that the means of the # two distributions are the same against the alternative that the mean for # group 1 is less than the mean for group 2. # (Note: the call to set.seed allows you to reproduce the same data # (dat1 and dat2), and setting the argument seed=732 in the call to # twoSamplePermutationTestLocation() lets you reproduce this example by # getting the same sample from the permutation distribution). set.seed(256) dat1 <- rlnormAlt(10, mean = 5, cv = 2) dat2 <- rlnormAlt(20, mean = 10, cv = 2) test.list <- twoSamplePermutationTestLocation(dat1, dat2, alternative = "less", seed = 732) # Print the results of the test #------------------------------ test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mu.x-mu.y = 0 # #Alternative Hypothesis: True mu.x-mu.y is less than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Means # (Based on Sampling # Permutation Distribution # 5000 Times) # #Estimated Parameter(s): mean of x = 2.253439 # mean of y = 11.825430 # #Data: x = dat1 # y = dat2 # #Sample Sizes: nx = 10 # ny = 20 # #Test Statistic: mean.x - mean.y = -9.571991 # #P-value: 0.001 # Plot the results of the test #----------------------------- dev.new() plot(test.list) #========== # The guidance document "Statistical Methods for Evaluating the Attainment of # Cleanup Standards, Volume 3: Reference-Based Standards for Soils and Solid # Media" (USEPA, 1994b, pp. 6.22-6.25) contains observations of # 1,2,3,4-Tetrachlorobenzene (TcCB) in ppb at a Reference Area and a Cleanup Area. # These data are stored in the data frame EPA.94b.tccb.df. Use the # two-sample permutation test to test for a difference in means between the # two areas vs. the alternative that the mean in the Cleanup Area is greater. # Do the same thing for the medians. # # The permutation test based on comparing means shows a significant differnce, # while the one based on comparing medians does not. # First test for a difference in the means. #------------------------------------------ mean.list <- with(EPA.94b.tccb.df, twoSamplePermutationTestLocation( TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], alternative = "greater", seed = 47)) mean.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mu.x-mu.y = 0 # #Alternative Hypothesis: True mu.x-mu.y is greater than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Means # (Based on Sampling # Permutation Distribution # 5000 Times) # #Estimated Parameter(s): mean of x = 3.9151948 # mean of y = 0.5985106 # #Data: x = TcCB[Area == "Cleanup"] # y = TcCB[Area == "Reference"] # #Sample Sizes: nx = 77 # ny = 47 # #Test Statistic: mean.x - mean.y = 3.316684 # #P-value: 0.0206 dev.new() plot(mean.list) #---------- # Now test for a difference in the medians. 
#------------------------------------------ median.list <- with(EPA.94b.tccb.df, twoSamplePermutationTestLocation( TcCB[Area=="Cleanup"], TcCB[Area=="Reference"], fcn = "median", alternative = "greater", seed = 47)) median.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: mu.x-mu.y = 0 # #Alternative Hypothesis: True mu.x-mu.y is greater than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Medians # (Based on Sampling # Permutation Distribution # 5000 Times) # #Estimated Parameter(s): median of x = 0.43 # median of y = 0.54 # #Data: x = TcCB[Area == "Cleanup"] # y = TcCB[Area == "Reference"] # #Sample Sizes: nx = 77 # ny = 47 # #Test Statistic: median.x - median.y = -0.11 # #P-value: 0.936 dev.new() plot(median.list) #========== # Clean up #--------- rm(test.list, mean.list, median.list) graphics.off()
Perform a two-sample randomization (permutation) test to compare two proportions. This is also called Fisher's exact test.
Note: You can perform Fisher's exact test in R using the function
fisher.test
.
twoSamplePermutationTestProportion(x, y, x.and.y = "Binomial Outcomes", alternative = "two.sided", tol = sqrt(.Machine$double.eps))
twoSamplePermutationTestProportion(x, y, x.and.y = "Binomial Outcomes", alternative = "two.sided", tol = sqrt(.Machine$double.eps))
x , y
|
When When |
x.and.y |
character string indicating the kind of data stored in the vectors |
alternative |
character string indicating the kind of alternative hypothesis. The possible values
are |
tol |
numeric scalar indicating the tolerance to use for computing the p-value for the
two-sample permutation test. The default value is |
Randomization Tests
In 1935, R.A. Fisher introduced the idea of a randomization test
(Manly, 2007, p. 107; Efron and Tibshirani, 1993, Chapter 15), which is based on
trying to answer the question: “Did the observed pattern happen by chance,
or does the pattern indicate the null hypothesis is not true?” A randomization
test works by simply enumerating all of the possible outcomes under the null
hypothesis, then seeing where the observed outcome fits in. A randomization test
is also called a permutation test, because it involves permuting the
observations during the enumeration procedure (Manly, 2007, p. 3).
In the past, randomization tests have not been used as extensively as they are now
because of the “large” computing resources needed to enumerate all of the
possible outcomes, especially for large sample sizes. The advent of more powerful
personal computers and software has allowed randomization tests to become much
easier to perform. Depending on the sample size, however, it may still be too
time consuming to enumerate all possible outcomes. In this case, the randomization
test can still be performed by sampling from the randomization distribution, and
comparing the observed outcome to this sampled permutation distribution.
Two-Sample Randomization Test for Proportions
Let $\underline{x} = (x_1, x_2, \ldots, x_{n_1})$ be a vector of $n_1$ independent and identically distributed (i.i.d.) observations from a binomial distribution with parameter size=1 and probability of success prob=$p_1$, and let $\underline{y} = (y_1, y_2, \ldots, y_{n_2})$ be a vector of $n_2$ i.i.d. observations from a binomial distribution with parameter size=1 and probability of success prob=$p_2$.

Consider the test of the null hypothesis:

$$H_0: p_1 = p_2 \quad (1)$$

The three possible alternative hypotheses are the upper one-sided alternative (alternative="greater")

$$H_a: p_1 > p_2 \quad (2)$$

the lower one-sided alternative (alternative="less")

$$H_a: p_1 < p_2 \quad (3)$$

and the two-sided alternative

$$H_a: p_1 \ne p_2 \quad (4)$$

To perform the test of the null hypothesis (1) versus any of the three alternatives (2)-(4), you can use the two-sample permutation test, which is also called Fisher's exact test. When the observations are from a B(1, $p$) distribution, the sample mean is an estimate of $p$. Fisher's exact test is simply a permutation test for the difference between two means from two different groups (see twoSamplePermutationTestLocation), where the underlying populations are binomial with size parameter size=1, but possibly different values of the prob parameter $p$.
Fisher's exact test is usually described in terms of testing hypotheses concerning
a 2 x 2 contingency table (van Belle et al., 2004, p. 157;
Hollander and Wolfe, 1999, p. 473; Sheskin, 2011; Zar, 2010, p. 561).
The probabilities associated with the permutation distribution can be computed by
using the hypergeometric distribution.
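To illustrate that last point (a sketch using base R's hypergeometric functions, not EnvStats internals; the counts are taken from the ETU example shown further below), the one-sided upper p-value can be computed directly as a hypergeometric tail probability:

x1 <- 16; n1 <- 69    # tumors and number of rats in the exposed (250 ppm) group
x2 <- 2;  n2 <- 72    # tumors and number of rats in the control group
# P(group 1 has at least x1 "successes" given that the margins are fixed)
phyper(x1 - 1, m = x1 + x2, n = (n1 + n2) - (x1 + x2), k = n1,
  lower.tail = FALSE)
# This should agree with the p-value reported by
# twoSamplePermutationTestProportion() in the example below (about 0.0002).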
A list of class "permutationTest"
containing the results of the hypothesis
test. See the help file for permutationTest.object
for details.
Sometimes in environmental data analysis we are interested in determining whether two probabilities or rates or proportions differ from each other. For example, we may ask the question: “Does exposure to pesticide X increase the risk of developing cancer Y?”, where cancer Y may be liver cancer, stomach cancer, or some other kind of cancer. One way environmental scientists attempt to answer this kind of question is by conducting experiments on rodents in which one group (the “treatment” or “exposed” group) is exposed to the pesticide and the other group (the control group) is not. The incidence of cancer Y in the exposed group is compared with the incidence of cancer Y in the control group. (See Rodricks (2007) for a discussion of extrapolating results from experiments involving rodents to consequences in humans and the associated difficulties).
Hypothesis tests you can use to compare proportions or probability of
“success” between two groups include Fisher's exact test and the test
based on the normal approximation (see the R help file for
prop.test
).
Steven P. Millard ([email protected])
Efron, B., and R.J. Tibshirani. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York, Chapter 15.
Hollander, M., and D.A. Wolfe. (1999). Nonparametric Statistical Methods. Second Edition. John Wiley and Sons, New York, p.473.
Manly, B.F.J. (2007). Randomization, Bootstrap and Monte Carlo Methods in Biology. Third Edition. Chapman & Hall, New York, Chapter 6.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL, pp.441–446.
Graham, S.L., K.J. Davis, W.H. Hansen, and C.H. Graham. (1975). Effects of Prolonged Ethylene Thiourea Ingestion on the Thyroid of the Rat. Food and Cosmetics Toxicology, 13(5), 493–499.
Rodricks, J.V. (1992). Calculated Risks: The Toxicity and Human Health Risks of Chemicals in Our Environment. Cambridge University Press, New York.
Rodricks, J.V. (2007). Calculated Risks: The Toxicity and Human Health Risks of Chemicals in Our Environment. Second Edition. Cambridge University Press, New York.
Sheskin, D.J. (2011). Handbook of Parametric and Nonparametric Statistical Procedures. Fifth Edition. CRC Press, Boca Raton, FL.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York, p. 157.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, p. 561.
permutationTest.object
, plot.permutationTest
, twoSamplePermutationTestLocation
,
oneSamplePermutationTest
,
Hypothesis Tests, boot
.
# Generate 10 observations from a binomial distribution with parameters # size=1 and prob=0.3, and 20 observations from a binomial distribution # with parameters size=1 and prob=0.5. Test the null hypothesis that the # probability of "success" for the two distributions is the same against the # alternative that the probability of "success" for group 1 is less than # the probability of "success" for group 2. # (Note: the call to set.seed allows you to reproduce this example). set.seed(23) dat1 <- rbinom(10, size = 1, prob = 0.3) dat2 <- rbinom(20, size = 1, prob = 0.5) test.list <- twoSamplePermutationTestProportion( dat1, dat2, alternative = "less") #---------- # Print the results of the test #------------------------------ test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: p.x - p.y = 0 # #Alternative Hypothesis: True p.x - p.y is less than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Proportions # (Fisher's Exact Test) # #Estimated Parameter(s): p.hat.x = 0.60 # p.hat.y = 0.65 # #Data: x = dat1 # y = dat2 # #Sample Sizes: nx = 10 # ny = 20 # #Test Statistic: p.hat.x - p.hat.y = -0.05 # #P-value: 0.548026 #---------- # Plot the results of the test #------------------------------ dev.new() plot(test.list) #---------- # Compare to the results of fisher.test #-------------------------------------- x11 <- sum(dat1) x21 <- length(dat1) - sum(dat1) x12 <- sum(dat2) x22 <- length(dat2) - sum(dat2) mat <- matrix(c(x11, x12, x21, x22), ncol = 2) fisher.test(mat, alternative = "less") #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: odds ratio = 1 # #Alternative Hypothesis: True odds ratio is less than 1 # #Test Name: Fisher's Exact Test for Count Data # #Estimated Parameter(s): odds ratio = 0.8135355 # #Data: mat # #P-value: 0.548026 # #95% Confidence Interval: LCL = 0.000000 # UCL = 4.076077 #========== # Rodricks (1992, p. 133) presents data from an experiment by # Graham et al. (1975) in which different groups of rats were exposed to # various concentration levels of ethylene thiourea (ETU), a decomposition # product of a certain class of fungicides that can be found in treated foods. # In the group exposed to a dietary level of 250 ppm of ETU, 16 out of 69 rats # (23%) developed thyroid tumors, whereas in the control group # (no exposure to ETU) only 2 out of 72 (3%) rats developed thyroid tumors. # If we use Fisher's exact test to test the null hypothesis that the proportion # of rats exposed to 250 ppm of ETU who will develop thyroid tumors over their # lifetime is no greater than the proportion of rats not exposed to ETU who will # develop tumors, we get a one-sided upper p-value of 0.0002. Therefore, we # conclude that the true underlying rate of tumor incidence in the exposed group # is greater than in the control group. # # The data for this example are stored in Graham.et.al.75.etu.df. 
# Look at the data #----------------- Graham.et.al.75.etu.df # dose tumors n proportion #1 0 2 72 0.02777778 #2 5 2 75 0.02666667 #3 25 1 73 0.01369863 #4 125 2 73 0.02739726 #5 250 16 69 0.23188406 #6 500 62 70 0.88571429 # Perform the test for a difference in tumor rates #------------------------------------------------- Num.Tumors <- with(Graham.et.al.75.etu.df, tumors[c(5, 1)]) Sample.Sizes <- with(Graham.et.al.75.etu.df, n[c(5, 1)]) test.list <- twoSamplePermutationTestProportion( x = Num.Tumors, y = Sample.Sizes, x.and.y="Number Successes and Trials", alternative = "greater") #---------- # Print the results of the test #------------------------------ test.list #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: p.x - p.y = 0 # #Alternative Hypothesis: True p.x - p.y is greater than 0 # #Test Name: Two-Sample Permutation Test # Based on Differences in Proportions # (Fisher's Exact Test) # #Estimated Parameter(s): p.hat.x = 0.23188406 # p.hat.y = 0.02777778 # #Data: x = Num.Tumors # n = Sample.Sizes # #Sample Sizes: nx = 69 # ny = 72 # #Test Statistic: p.hat.x - p.hat.y = 0.2041063 # #P-value: 0.0002186462 #---------- # Plot the results of the test #------------------------------ dev.new() plot(test.list) #========== # Clean up #--------- rm(test.list, x11, x12, x21, x22, mat, Num.Tumors, Sample.Sizes) #graphics.off()
Test the null hypothesis that the variances of two or more normal distributions are the same using Levene's or Bartlett's test.
varGroupTest(object, ...) ## S3 method for class 'formula' varGroupTest(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: varGroupTest(object, group, test = "Levene", correct = TRUE, data.name = NULL, group.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...) ## S3 method for class 'data.frame' varGroupTest(object, ...) ## S3 method for class 'matrix' varGroupTest(object, ...) ## S3 method for class 'list' varGroupTest(object, ...)
varGroupTest(object, ...) ## S3 method for class 'formula' varGroupTest(object, data = NULL, subset, na.action = na.pass, ...) ## Default S3 method: varGroupTest(object, group, test = "Levene", correct = TRUE, data.name = NULL, group.name = NULL, parent.of.data = NULL, subset.expression = NULL, ...) ## S3 method for class 'data.frame' varGroupTest(object, ...) ## S3 method for class 'matrix' varGroupTest(object, ...) ## S3 method for class 'list' varGroupTest(object, ...)
object |
an object containing data for 2 or more groups whose variances are to be compared.
In the default method, the argument |
data |
when |
subset |
when |
na.action |
when |
group |
when |
test |
character string indicating which test to use. The possible values are |
correct |
logical scalar indicating whether to use the correction factor for Bartlett's test.
The default value is |
data.name |
character string indicating the name of the data used for the group variance test.
The default value is |
group.name |
character string indicating the name of the data used to create the groups.
The default value is |
parent.of.data |
character string indicating the source of the data used for the group variance test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the group variance test. |
The function varGroupTest
performs Levene's or Bartlett's test for
homogeneity of variance among two or more groups. The R function var.test
compares two variances.
Bartlett's test is very sensitive to the assumption of normality and will tend to give
significant results even when the null hypothesis is true if the underlying distributions
have long tails (e.g., are leptokurtic). Levene's test is almost as powerful as Bartlett's
test when the underlying distributions are normal, and unlike Bartlett's test it tends to
maintain the assumed $\alpha$-level when the underlying distributions are not normal
(Snedecor and Cochran, 1989, p.252; Milliken and Johnson, 1992, p.22; Conover et al., 1981).
Thus, Levene's test is generally recommended over Bartlett's test.
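The idea behind Levene's test can be sketched as a one-way ANOVA on the absolute deviations of each observation from its group mean (this is the classical, mean-based form of the statistic and is meant only as an illustration, not as a reproduction of varGroupTest's internals; it uses the arsenic data from the example further below, in which Well is a factor):

# absolute deviations from the well-specific means
abs.dev <- with(EPA.09.Ex.11.1.arsenic.df,
  abs(Arsenic.ppb - ave(Arsenic.ppb, Well)))
# one-way ANOVA of the absolute deviations across wells
anova(lm(abs.dev ~ Well, data = EPA.09.Ex.11.1.arsenic.df))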
a list of class "htest"
containing the results of the group variance test.
Objects of class "htest"
have special printing and plotting methods.
See the help file for htest.object
for details.
Chapter 11 of USEPA (2009) discusses using Levene's test to test the assumption of equal variances between monitoring wells or to test that the variance is stable over time when performing intrawell tests.
Steven P. Millard ([email protected])
Conover, W.J., M.E. Johnson, and M.M. Johnson. (1981). A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data. Technometrics 23(4), 351-361.
Davis, C.B. (1994). Environmental Regulatory Statistics. In Patil, G.P., and C.R. Rao, eds., Handbook of Statistics, Vol. 12: Environmental Statistics. North-Holland, Amsterdam, a division of Elsevier, New York, NY, Chapter 26, 817-865.
Milliken, G.A., and D.E. Johnson. (1992). Analysis of Messy Data, Volume I: Designed Experiments. Chapman & Hall, New York.
Snedecor, G.W., and W.G. Cochran. (1989). Statistical Methods, Eighth Edition. Iowa State University Press, Ames Iowa.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# Example 11-2 of USEPA (2009, page 11-7) gives an example of # testing the assumption of equal variances across wells for arsenic # concentrations (ppb) in groundwater collected at 6 monitoring # wells over 4 months. The data for this example are stored in # EPA.09.Ex.11.1.arsenic.df. head(EPA.09.Ex.11.1.arsenic.df) # Arsenic.ppb Month Well #1 22.9 1 1 #2 3.1 2 1 #3 35.7 3 1 #4 4.2 4 1 #5 2.0 1 2 #6 1.2 2 2 longToWide(EPA.09.Ex.11.1.arsenic.df, "Arsenic.ppb", "Month", "Well", paste.row.name = TRUE, paste.col.name = TRUE) # Well.1 Well.2 Well.3 Well.4 Well.5 Well.6 #Month.1 22.9 2.0 2.0 7.8 24.9 0.3 #Month.2 3.1 1.2 109.4 9.3 1.3 4.8 #Month.3 35.7 7.8 4.5 25.9 0.8 2.8 #Month.4 4.2 52.0 2.5 2.0 27.0 1.2 varGroupTest(Arsenic.ppb ~ Well, data = EPA.09.Ex.11.1.arsenic.df) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: Ratio of each pair of variances = 1 # #Alternative Hypothesis: At least one variance differs # #Test Name: Levene's Test for # Homogenity of Variance # #Estimated Parameter(s): Well.1 = 246.8158 # Well.2 = 592.6767 # Well.3 = 2831.4067 # Well.4 = 105.2967 # Well.5 = 207.4467 # Well.6 = 3.9025 # #Data: Arsenic.ppb # #Grouping Variable: Well # #Data Source: EPA.09.Ex.11.1.arsenic.df # #Sample Sizes: Well.1 = 4 # Well.2 = 4 # Well.3 = 4 # Well.4 = 4 # Well.5 = 4 # Well.6 = 4 # #Test Statistic: F = 4.564176 # #Test Statistic Parameters: num df = 5 # denom df = 18 # #P-value: 0.007294084
Estimate the variance, test the null hypothesis using the chi-squared test that the variance is equal to a user-specified value, and create a confidence interval for the variance.
varTest(x, alternative = "two.sided", conf.level = 0.95, sigma.squared = 1, data.name = NULL)
varTest(x, alternative = "two.sided", conf.level = 0.95, sigma.squared = 1, data.name = NULL)
x |
numeric vector of observations. Missing ( |
alternative |
character string indicating the kind of alternative hypothesis. The possible values are
|
conf.level |
numeric scalar between 0 and 1 indicating the confidence level associated with the confidence
interval for the population variance. The default value is |
sigma.squared |
a numeric scalar indicating the hypothesized value of the variance. The default value is
|
data.name |
character string indicating the name of the data used for the test of variance. |
The function varTest
performs the one-sample chi-squared test of the hypothesis
that the population variance is equal to the user specified value given by the argument
sigma.squared
, and it also returns a confidence interval for the population variance.
The R function var.test
performs the F-test for comparing two variances.
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
Just as you can perform tests of hypothesis on measures of location (mean, median, percentile, etc.), you can do the same thing for measures of spread or variability. Usually, we are interested in estimating variability only because we want to quantify the uncertainty of our estimated location or percentile. Sometimes, however, we are interested in estimating variability and quantifying the uncertainty in our estimate of variability (for example, for performing a sensitivity analysis for power or sample size calculations), or testing whether the population variability is equal to a certain value. There are at least two possible methods of performing a one-sample hypothesis test on variability:
Perform a hypothesis test for the population variance based on the chi-squared statistic, assuming the underlying population is normal.
Perform a hypothesis test for any kind of measure of spread assuming any kind of underlying distribution based on a bootstrap confidence interval (using, for example, the package boot).
You can use varTest
for the first method.
Note: For a one-sample test of location, Student's t-test is fairly robust to departures from normality (i.e., the Type I error rate is maintained), as long as the sample size is reasonably "large." The chi-squared test on the population variance, however, is extremely sensitive to departures from normality. For example, if the underlying population is skewed, the actual Type I error rate will be larger than assumed.
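The arithmetic behind the chi-squared test and its confidence interval can be sketched as follows (a by-hand illustration using the same simulated data as the example below; it is not the varTest implementation):

set.seed(23)
dat <- rnorm(20, mean = 2, sd = 1)
n  <- length(dat)
s2 <- var(dat)
sigma0.sq <- 0.5
chi.sq <- (n - 1) * s2 / sigma0.sq     # test statistic, ~ chi-squared with n-1 df under H0
p.two.sided <- 2 * min(pchisq(chi.sq, df = n - 1),
                       pchisq(chi.sq, df = n - 1, lower.tail = FALSE))
ci <- (n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)   # 95% CI for the variance
chi.sq; p.two.sided; ci
# These values should match the statistic, p-value, and confidence limits
# reported by varTest() in the example below.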
Steven P. Millard ([email protected])
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, FL.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
# Generate 20 observations from a normal distribution with parameters # mean=2 and sd=1. Test the null hypothesis that the true variance is # equal to 0.5 against the alternative that the true variance is not # equal to 0.5. # (Note: the call to set.seed allows you to reproduce this example). set.seed(23) dat <- rnorm(20, mean = 2, sd = 1) varTest(dat, sigma.squared = 0.5) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: variance = 0.5 # #Alternative Hypothesis: True variance is not equal to 0.5 # #Test Name: Chi-Squared Test on Variance # #Estimated Parameter(s): variance = 0.753708 # #Data: dat # #Test Statistic: Chi-Squared = 28.64090 # #Test Statistic Parameter: df = 19 # #P-value: 0.1436947 # #95% Confidence Interval: LCL = 0.4359037 # UCL = 1.6078623 # Note that in this case we would not reject the # null hypothesis at the 5% or even the 10% level. # Clean up rm(dat)
Density, distribution function, quantile function, and random generation for the zero-modified lognormal distribution with parameters meanlog, sdlog, and p.zero.
The zero-modified lognormal (delta) distribution is the mixture of a lognormal distribution with a positive probability mass at 0.
dzmlnorm(x, meanlog = 0, sdlog = 1, p.zero = 0.5) pzmlnorm(q, meanlog = 0, sdlog = 1, p.zero = 0.5) qzmlnorm(p, meanlog = 0, sdlog = 1, p.zero = 0.5) rzmlnorm(n, meanlog = 0, sdlog = 1, p.zero = 0.5)
x | vector of quantiles.
q | vector of quantiles.
p | vector of probabilities between 0 and 1.
n | sample size. If length(n) is larger than 1, then length(n) random values are returned.
meanlog | vector of means of the normal (Gaussian) part of the distribution on the log scale. The default is meanlog=0.
sdlog | vector of (positive) standard deviations of the normal (Gaussian) part of the distribution on the log scale. The default is sdlog=1.
p.zero | vector of probabilities between 0 and 1 indicating the probability the random variable equals 0. For rzmlnorm this must be a single, non-missing number. The default is p.zero=0.5.
The zero-modified lognormal (delta) distribution is the mixture of a
lognormal distribution with a positive probability mass at 0. This distribution
was introduced without a name by Aitchison (1955), and the name Δ-distribution was coined by Aitchison and Brown (1957, p.95).
It is a special case of a “zero-modified” distribution
(see Johnson et al., 1992, p. 312).
Let f(x; μ, σ) denote the density of a lognormal random variable X with parameters meanlog=μ and sdlog=σ. The density function of a zero-modified lognormal (delta) random variable Y with parameters meanlog=μ, sdlog=σ, and p.zero=p, denoted h(y; μ, σ, p), is given by:

h(y; μ, σ, p) = p                      for y = 0

h(y; μ, σ, p) = (1 - p) f(y; μ, σ)     for y > 0
Note that μ is not the mean of the zero-modified lognormal distribution on the log scale; it is the mean of the lognormal part of the distribution on the log scale. Similarly, σ is not the standard deviation of the zero-modified lognormal distribution on the log scale; it is the standard deviation of the lognormal part of the distribution on the log scale.
Let γ and δ denote the mean and standard deviation of the overall zero-modified lognormal distribution on the log scale. Aitchison (1955) shows that:

γ = (1 - p) μ

δ² = (1 - p) σ² + p (1 - p) μ²
Note that when p.zero=p=0, the zero-modified lognormal distribution simplifies to the lognormal distribution.
dzmlnorm
gives the density, pzmlnorm
gives the distribution function,
qzmlnorm
gives the quantile function, and rzmlnorm
generates random
deviates.
The zero-modified lognormal (delta) distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit” (the nondetects are assumed equal to 0). See, for example, Gilliom and Helsel (1986), Owen and DeRouen (1980), and Gibbons et al. (2009, Chapter 12). USEPA (2009, Chapter 15) recommends this strategy only in specific situations, and Helsel (2012, Chapter 1) strongly discourages this approach to dealing with non-detects.
A variation of the zero-modified lognormal (delta) distribution is the zero-modified normal distribution, in which a normal distribution is mixed with a positive probability mass at 0.
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
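For example, a rough sketch of that comparison on simulated data (the data, detection limit, and argument choices below are illustrative assumptions; check the qqPlotCensored and qqPlot help files for the full argument lists):

# Simulate lognormal data and censor values below a detection limit of 1.
set.seed(47)
conc <- rlnorm(30, meanlog = 0, sdlog = 1)
censored <- conc < 1
conc[censored] <- 1   # nondetects reported at the detection limit

# Censored probability plot (all observations, censoring accounted for)
qqPlotCensored(conc, censored, distribution = "lnorm", add.line = TRUE)

# Detects-only probability plot (nondetects simply dropped)
qqPlot(conc[!censored], distribution = "lnorm", add.line = TRUE)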
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901-908.
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special reference to its uses in economics). Cambridge University Press, London. pp.94-99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, pp.47-51.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707-719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zero-Modified Lognormal (Alternative Parameterization),
Lognormal, LognormalAlt,
Zero-Modified Normal,
ezmlnorm
, Probability Distributions and Random Numbers.
# Density of the zero-modified lognormal (delta) distribution with # parameters meanlog=0, sdlog=1, and p.zero=0.5, evaluated at # 0, 0.5, 1, 1.5, and 2: dzmlnorm(seq(0, 2, by = 0.5)) #[1] 0.50000000 0.31374804 0.19947114 0.12248683 #[5] 0.07843701 #---------- # The cdf of the zero-modified lognormal (delta) distribution with # parameters meanlog=1, sdlog=2, and p.zero=0.1, evaluated at 4: pzmlnorm(4, 1, 2, .1) #[1] 0.6189203 #---------- # The median of the zero-modified lognormal (delta) distribution with # parameters meanlog=2, sdlog=3, and p.zero=0.1: qzmlnorm(0.5, 2, 3, 0.1) #[1] 4.859177 #---------- # Random sample of 3 observations from the zero-modified lognormal # (delta) distribution with parameters meanlog=1, sdlog=2, and p.zero=0.4. # (Note: The call to set.seed simply allows you to reproduce this example.) set.seed(20) rzmlnorm(3, 1, 2, 0.4) #[1] 0.000000 0.000000 3.146641
Density, distribution function, quantile function, and random generation for the zero-modified lognormal distribution with parameters mean, cv, and p.zero.
The zero-modified lognormal (delta) distribution is the mixture of a lognormal distribution with a positive probability mass at 0.
dzmlnormAlt(x, mean = exp(1/2), cv = sqrt(exp(1) - 1), p.zero = 0.5) pzmlnormAlt(q, mean = exp(1/2), cv = sqrt(exp(1) - 1), p.zero = 0.5) qzmlnormAlt(p, mean = exp(1/2), cv = sqrt(exp(1) - 1), p.zero = 0.5) rzmlnormAlt(n, mean = exp(1/2), cv = sqrt(exp(1) - 1), p.zero = 0.5)
x | vector of quantiles.
q | vector of quantiles.
p | vector of probabilities between 0 and 1.
n | sample size. If length(n) is larger than 1, then length(n) random values are returned.
mean | vector of means of the lognormal part of the distribution. The default is mean=exp(1/2).
cv | vector of (positive) coefficients of variation of the lognormal part of the distribution. The default is cv=sqrt(exp(1)-1).
p.zero | vector of probabilities between 0 and 1 indicating the probability the random variable equals 0. For rzmlnormAlt this must be a single, non-missing number. The default is p.zero=0.5.
The zero-modified lognormal (delta) distribution is the mixture of a
lognormal distribution with a positive probability mass at 0. This distribution
was introduced without a name by Aitchison (1955), and the name Δ-distribution was coined by Aitchison and Brown (1957, p.95).
It is a special case of a “zero-modified” distribution
(see Johnson et al., 1992, p. 312).
Let f(x; θ, τ) denote the density of a lognormal random variable X with parameters mean=θ and cv=τ. The density function of a zero-modified lognormal (delta) random variable Y with parameters mean=θ, cv=τ, and p.zero=p, denoted h(y; θ, τ, p), is given by:

h(y; θ, τ, p) = p                      for y = 0

h(y; θ, τ, p) = (1 - p) f(y; θ, τ)     for y > 0
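As with the first parameterization, the positive part of this density can be checked against dlnormAlt (an illustrative sketch only):

# Sketch: for x > 0 the density is (1 - p.zero) times the lognormal
# density in the mean/cv parameterization.
p <- 0.5
x <- c(5, 10, 20)
(1 - p) * dlnormAlt(x, mean = 10, cv = 1)
dzmlnormAlt(x, mean = 10, cv = 1, p.zero = 0.5)  # should match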
Note that θ is not the mean of the zero-modified lognormal distribution; it is the mean of the lognormal part of the distribution. Similarly, τ is not the coefficient of variation of the zero-modified lognormal distribution; it is the coefficient of variation of the lognormal part of the distribution.
Let γ, δ, and φ denote the mean, standard deviation, and coefficient of variation of the overall zero-modified lognormal distribution. Let σ denote the standard deviation of the lognormal part of the distribution, so that σ = θ τ. Aitchison (1955) shows that:

γ = (1 - p) θ

δ² = (1 - p) σ² + p (1 - p) θ²

so that

φ = δ / γ = sqrt(τ² + p) / sqrt(1 - p)
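A quick simulation-based check of these moment relations (a sketch with arbitrary parameter values):

# Compare the theoretical overall mean and CV with a large simulated sample.
set.seed(42)
theta <- 10; tau <- 1; p <- 0.3
y <- rzmlnormAlt(100000, mean = theta, cv = tau, p.zero = p)

mean.theo <- (1 - p) * theta
cv.theo   <- sqrt(tau^2 + p) / sqrt(1 - p)

c(mean.sim = mean(y), mean.theo = mean.theo)
c(cv.sim = sd(y) / mean(y), cv.theo = cv.theo)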
Note that when p.zero=p=0, the zero-modified lognormal distribution simplifies to the lognormal distribution.
dzmlnormAlt
gives the density, pzmlnormAlt
gives the distribution function,
qzmlnormAlt
gives the quantile function, and rzmlnormAlt
generates random
deviates.
The zero-modified lognormal (delta) distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit” (the nondetects are assumed equal to 0). See, for example, Gilliom and Helsel (1986), Owen and DeRouen (1980), and Gibbons et al. (2009, Chapter 12). USEPA (2009, Chapter 15) recommends this strategy only in specific situations, and Helsel (2012, Chapter 1) strongly discourages this approach to dealing with non-detects.
A variation of the zero-modified lognormal (delta) distribution is the zero-modified normal distribution, in which a normal distribution is mixed with a positive probability mass at 0.
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901-908.
Aitchison, J., and J.A.C. Brown (1957). The Lognormal Distribution (with special reference to its uses in economics). Cambridge University Press, London. pp.94-99.
Crow, E.L., and K. Shimizu. (1988). Lognormal Distributions: Theory and Applications. Marcel Dekker, New York, pp.47-51.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707-719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zero-Modified Lognormal, LognormalAlt,
ezmlnormAlt
, Probability Distributions and Random Numbers.
# Density of the zero-modified lognormal (delta) distribution with # parameters mean=10, cv=1, and p.zero=0.5, evaluated at # 9, 10, and 11: dzmlnormAlt(9:11, mean = 10, cv = 1, p.zero = 0.5) #[1] 0.02552685 0.02197043 0.01891924 #---------- # The cdf of the zero-modified lognormal (delta) distribution with # parameters mean=10, cv=2, and p.zero=0.1, evaluated at 8: pzmlnormAlt(8, 10, 2, .1) #[1] 0.709009 #---------- # The median of the zero-modified lognormal (delta) distribution with # parameters mean=10, cv=2, and p.zero=0.1: qzmlnormAlt(0.5, 10, 2, 0.1) #[1] 3.74576 #---------- # Random sample of 3 observations from the zero-modified lognormal # (delta) distribution with parameters mean=10, cv=2, and p.zero=0.4. # (Note: The call to set.seed simply allows you to reproduce this example.) set.seed(20) rzmlnormAlt(3, 10, 2, 0.4) #[1] 0.000000 0.000000 4.907131
Density, distribution function, quantile function, and random generation for the zero-modified normal distribution with parameters mean, sd, and p.zero.
The zero-modified normal distribution is the mixture of a normal distribution with a positive probability mass at 0.
dzmnorm(x, mean = 0, sd = 1, p.zero = 0.5) pzmnorm(q, mean = 0, sd = 1, p.zero = 0.5) qzmnorm(p, mean = 0, sd = 1, p.zero = 0.5) rzmnorm(n, mean = 0, sd = 1, p.zero = 0.5)
x | vector of quantiles.
q | vector of quantiles.
p | vector of probabilities between 0 and 1.
n | sample size. If length(n) is larger than 1, then length(n) random values are returned.
mean | vector of means of the normal (Gaussian) part of the distribution. The default is mean=0.
sd | vector of (positive) standard deviations of the normal (Gaussian) part of the distribution. The default is sd=1.
p.zero | vector of probabilities between 0 and 1 indicating the probability the random variable equals 0. For rzmnorm this must be a single, non-missing number. The default is p.zero=0.5.
The zero-modified normal distribution is the mixture of a normal distribution with a positive probability mass at 0.
Let f(x; μ, σ) denote the density of a normal (Gaussian) random variable X with parameters mean=μ and sd=σ. The density function of a zero-modified normal random variable Y with parameters mean=μ, sd=σ, and p.zero=p, denoted h(y; μ, σ, p), is given by:

h(y; μ, σ, p) = p                      for y = 0

h(y; μ, σ, p) = (1 - p) f(y; μ, σ)     for y ≠ 0
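An illustrative check of this definition (and of the corresponding distribution function) in terms of the ordinary normal distribution:

# Sketch: density is p.zero at 0 and (1 - p.zero) * dnorm elsewhere;
# the cdf adds the point mass at 0 once the quantile is >= 0.
p <- 0.5; mu <- 2; s <- 1
x <- c(-1, 0, 1, 2)

ifelse(x == 0, p, (1 - p) * dnorm(x, mu, s))
dzmnorm(x, mean = mu, sd = s, p.zero = p)    # should match

(1 - p) * pnorm(1, mu, s) + p
pzmnorm(1, mean = mu, sd = s, p.zero = p)    # should match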
Note that μ is not the mean of the zero-modified normal distribution; it is the mean of the normal part of the distribution. Similarly, σ is not the standard deviation of the zero-modified normal distribution; it is the standard deviation of the normal part of the distribution.
Let γ and δ denote the mean and standard deviation of the overall zero-modified normal distribution. Aitchison (1955) shows that:

γ = E(Y) = (1 - p) μ

δ² = Var(Y) = (1 - p) σ² + p (1 - p) μ²
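These relations are easy to verify by simulation (a sketch with arbitrary parameter values):

# Compare the theoretical overall mean and variance with a simulated sample.
set.seed(123)
mu <- 3; s <- 1; p <- 0.4
y <- rzmnorm(100000, mean = mu, sd = s, p.zero = p)

mean.theo <- (1 - p) * mu
var.theo  <- (1 - p) * s^2 + p * (1 - p) * mu^2

c(mean.sim = mean(y), mean.theo = mean.theo)
c(var.sim = var(y), var.theo = var.theo)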
Note that when p.zero=p=0, the zero-modified normal distribution simplifies to the normal distribution.
dzmnorm
gives the density, pzmnorm
gives the distribution function,
qzmnorm
gives the quantile function, and rzmnorm
generates random
deviates.
The zero-modified normal distribution is sometimes used to model chemical concentrations for which some observations are reported as “Below Detection Limit”. See, for example USEPA (1992c, pp.27-34) and Gibbons et al. (2009, Chapter 12). Note, however, that USEPA (1992c) has been superseded by USEPA (2009) which recommends this strategy only in specific situations (see Chapter 15 of the document). This strategy is strongly discouraged by Helsel (2012, Chapter 1).
In cases where you want to model chemical concentrations for which some observations are reported as “Below Detection Limit” and you want to treat the non-detects as equal to 0, it will usually be more appropriate to model the data with a zero-modified lognormal (delta) distribution since chemical concentrations are bounded below at 0 (e.g., Gilliom and Helsel, 1986; Owen and DeRouen, 1980).
One way to try to assess whether a zero-modified lognormal (delta),
zero-modified normal, censored normal, or censored lognormal is the best
model for the data is to construct both censored and detects-only probability
plots (see qqPlotCensored
).
Steven P. Millard ([email protected])
Aitchison, J. (1955). On the Distribution of a Positive Random Variable Having a Discrete Probability Mass at the Origin. Journal of the American Statistical Association 50, 901-908.
Gilliom, R.J., and D.R. Helsel. (1986). Estimation of Distributional Parameters for Censored Trace Level Water Quality Data: 1. Estimation Techniques. Water Resources Research 22, 135-146.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring. Second Edition. John Wiley and Sons, Hoboken, NJ.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R. Second Edition. John Wiley and Sons, Hoboken, NJ, Chapter 1.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions. Second Edition. John Wiley and Sons, New York, p.312.
Owen, W., and T. DeRouen. (1980). Estimation of the Mean for Lognormal Data Containing Zeros and Left-Censored Values, with Applications to the Measurement of Worker Exposure to Air Contaminants. Biometrics 36, 707-719.
USEPA (1992c). Statistical Analysis of Ground-Water Monitoring Data at RCRA Facilities: Addendum to Interim Final Guidance. Office of Solid Waste, Permits and State Programs Division, US Environmental Protection Agency, Washington, D.C.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.
Zero-Modified Lognormal, Normal,
ezmnorm
, Probability Distributions and Random Numbers.
# Density of the zero-modified normal distribution with parameters # mean=2, sd=1, and p.zero=0.5, evaluated at 0, 0.5, 1, 1.5, and 2: dzmnorm(seq(0, 2, by = 0.5), mean = 2) #[1] 0.5000000 0.0647588 0.1209854 0.1760327 0.1994711 #---------- # The cdf of the zero-modified normal distribution with parameters # mean=3, sd=2, and p.zero=0.1, evaluated at 4: pzmnorm(4, 3, 2, .1) #[1] 0.7223162 #---------- # The median of the zero-modified normal distribution with parameters # mean=3, sd=1, and p.zero=0.1: qzmnorm(0.5, 3, 1, 0.1) #[1] 2.86029 #---------- # Random sample of 3 observations from the zero-modified normal distribution # with parameters mean=3, sd=1, and p.zero=0.4. # (Note: The call to set.seed simply allows you to reproduce this example.) set.seed(20) rzmnorm(3, 3, 1, 0.4) #[1] 0.000000 0.000000 3.073168
Estimate the shape parameter of a generalized extreme value distribution and test the null hypothesis that the true value is equal to 0.
zTestGevdShape(x, pwme.method = "unbiased", plot.pos.cons = c(a = 0.35, b = 0), alternative = "two.sided")
x | numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
pwme.method | character string specifying the method of estimating the probability-weighted moments. Possible values are "unbiased" (method based on the U-statistic; the default) and "plotting.position". See the help file for egevd for more information.
plot.pos.cons | numeric vector of length 2 specifying the constants used in the formula for the plotting positions when pwme.method="plotting.position". The default value is plot.pos.cons=c(a=0.35, b=0).
alternative | character string indicating the kind of alternative hypothesis. The possible values are "two.sided" (the default), "greater", and "less".
Let x = (x1, x2, ..., xn) be a vector of n observations from a generalized extreme value distribution with parameters location=η, scale=θ, and shape=κ. Furthermore, let κ̂ denote the probability-weighted moments estimator (PWME) of the shape parameter κ (see the help file for egevd). Then the statistic

z = κ̂ / sqrt(0.5633 / n)        (1)

is asymptotically distributed as a N(0,1) random variable under the null hypothesis H0: κ = 0 (Hosking et al., 1985). The function zTestGevdShape performs the usual one-sample z-test using the statistic computed in Equation (1).
The PWME of κ may be computed using either U-statistic type probability-weighted moments estimators or plotting-position type estimators (see egevd). Although Hosking et al. (1985) base their statistic on plotting-position type estimators, Hosking and Wallis (1995) recommend using the U-statistic type estimators for almost all applications.
This test is only asymptotically correct. Hosking et al. (1985), however, found that the α-level is adequately maintained for samples as small as n = 25.
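The statistic in Equation (1) can also be computed by hand from the PWME returned by egevd. The sketch below assumes the asymptotic variance constant 0.5633/n shown above and that egevd called with method="pwme" stores the shape estimate in its parameters component; it is meant only to illustrate the construction of the test, not to replace zTestGevdShape.

# Sketch: compute the kappa-test statistic by hand and compare with
# zTestGevdShape (same data as the example below).
set.seed(250)
dat <- rgevd(25, location = 2, scale = 1, shape = 1)

kappa.hat <- egevd(dat, method = "pwme")$parameters["shape"]
n <- length(dat)
z <- kappa.hat / sqrt(0.5633 / n)
z                                # by-hand test statistic
2 * pnorm(-abs(z))               # two-sided p-value

zTestGevdShape(dat)$statistic    # should agree (up to estimator options)
rm(dat, kappa.hat, n, z)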
A list of class "htest"
containing the results of the hypothesis test.
See the help file for htest.object
for details.
Two-parameter extreme value distributions (EVD) have been applied extensively since the 1930s to several fields of study, including the distributions of hydrological and meteorological variables, human lifetimes, and strength of materials. The three-parameter generalized extreme value distribution (GEVD) was introduced by Jenkinson (1955) to model annual maximum and minimum values of meteorological events. Since then, it has been used extensively in the hydrological and meteorological fields.
The three families of EVDs are all special kinds of GEVDs. When the shape parameter κ = 0, the GEVD reduces to the Type I extreme value (Gumbel) distribution. When κ < 0, the GEVD is the same as the Type II extreme value distribution, and when κ > 0 it is the same as the Type III extreme value distribution.
Hosking et al. (1985) introduced the test used by the function zTestGevdShape to test the null hypothesis H0: κ = 0. They found this test has power comparable to that of the modified likelihood-ratio test, which was found by Hosking (1984) to be the best overall test among the thirteen tests he considered.
Fill and Stedinger (1995) denote this test the “kappa test” and compare it
with the L-Cs test suggested by Chowdhury et al. (1991) and the probability
plot correlation coefficient goodness-of-fit test for the Gumbel distribution given
by Vogel (1986) (see the sub-section for test="ppcc"
under the Details section
of the help file for gofTest
).
Steven P. Millard ([email protected])
Chowdhury, J.U., J.R. Stedinger, and L. H. Lu. (1991). Goodness-of-Fit Tests for Regional Generalized Extreme Value Flood Distributions. Water Resources Research 27(7), 1765–1776.
Fill, H.D., and J.R. Stedinger. (1995). L Moment and Probability Plot Correlation Coefficient Goodness-of-Fit Tests for the Gumbel Distribution and Impact of Autocorrelation. Water Resources Research 31(1), 225–229.
Hosking, J.R.M. (1984). Testing Whether the Shape Parameter is Zero in the Generalized Extreme-Value Distribution. Biometrika 71(2), 367–374.
Hosking, J.R.M., and J.R. Wallis (1995). A Comparison of Unbiased and Plotting-Position Estimators of L Moments. Water Resources Research 31(8), 2019–2025.
Hosking, J.R.M., J.R. Wallis, and E.F. Wood. (1985). Estimation of the Generalized Extreme-Value Distribution by the Method of Probability-Weighted Moments. Technometrics 27(3), 251–261.
Jenkinson, A.F. (1955). The Frequency Distribution of the Annual Maximum (or Minimum) of Meteorological Events. Quarterly Journal of the Royal Meteorological Society 81, 158–171.
Vogel, R.M. (1986). The Probability Plot Correlation Coefficient Test for the Normal, Lognormal, and Gumbel Distributional Hypotheses. Water Resources Research 22(4), 587–590. (Correction, Water Resources Research 23(10), 2013, 1987.)
GEVD, egevd
, EVD, eevd
,
Goodness-of-Fit Tests, htest.object
.
# Generate 25 observations from a generalized extreme value distribution with # parameters location=2, scale=1, and shape=1, and test the null hypothesis # that the shape parameter is equal to 0. # (Note: the call to set.seed simply allows you to reproduce this example.) set.seed(250) dat <- rgevd(25, location = 2, scale = 1, shape = 1) zTestGevdShape(dat) #Results of Hypothesis Test #-------------------------- # #Null Hypothesis: shape = 0 # #Alternative Hypothesis: True shape is not equal to 0 # #Test Name: Z-test of shape=0 for GEVD # #Estimated Parameter(s): shape = 0.6623014 # #Estimation Method: Unbiased pwme # #Data: dat # #Sample Size: 25 # #Test Statistic: z = 4.412206 # #P-value: 1.023225e-05 #---------- # Clean up #--------- rm(dat)