Lidong Huang*a,
Qian Zhenga,
Hongyang Wangb and
Daquan Sunc
aCollege of Resources and Environmental Sciences, Inner Mongolia Agricultural University, Key Laboratory of Agricultural Ecological Security and Green Development at Universities of Inner Mongolia Autonomous, Inner Mongolia Key Laboratory of Soil Quality and Nutrient Resource, Hohhot 010018, China. E-mail: ldhuangnz@163.com
bLitian Statistics Studio, Nanjing 210044, China
cInstitute of Microbiology, Czech Academy of Sciences, Vídeňská, 1083, Praha 4, 14220, Czech Republic
First published on 18th June 2025
In the analysis of pollutant data, concentrations below the analytical detection limit are commonly handled by substituting a constant value between zero and the limit of detection (LOD). However, this substitution can introduce significant bias under certain conditions. To address this issue, we have derived weight expressions that eliminate bias for lognormal and gamma data. These weights, applied to LOD/2 substitutions, can be calculated using available ranges of means, standard deviations and censoring proportions. We evaluated the performance of our weighted substitution (ωLOD/2) method using both simulated datasets with censoring proportions ranging from 5% to 50% and actual atmospheric α-HCH and HCB data from Tibet's Namco Lake. The ωLOD/2 method was compared against LOD/2 substitution, maximum likelihood estimation (MLE), and regression on order statistics (ROS). The results demonstrate that with small sample sizes (<160), although MLE and ROS did not show larger bias, ωLOD/2 outperforms both methods in estimating arithmetic and geometric means in most scenarios. It is also worth noting that ROS is currently limited to estimating summary statistics under the assumption of a lognormal distribution and cannot be applied to gamma-distributed data. In addition, ωLOD/2 provides standard deviation estimates comparable to those from MLE, with biases remaining within 5% in the majority of cases. Therefore, the proposed method is particularly suitable for situations involving small sample sizes. The application of our method to six censored atmospheric organochloride pesticide concentrations from Namco Lake further highlights its advantages in practical settings. To facilitate easy adoption by researchers, a free web app was developed that integrates our proposed weighting method with censored data distribution fitting.
Environmental significanceAccurate estimation of atmospheric pollutants is vital for assessing environmental risks and implementing effective protection measures. Traditional methods for handling data below detection limits often introduce bias, leading to potentially misleading conclusions about pollutant levels. Our study presents a weighted substitution method (ωLOD/2) that significantly improves the accuracy of pollutant estimates, particularly for lognormal and gamma distributions. Applied to data from Tibet's Namco Lake, this method ensures more reliable environmental assessments. By enhancing the precision of pollutant monitoring, our approach supports better-informed environmental policies and strategies, ultimately contributing to more effective environmental protection efforts. |
Tibet, known as the “Third Pole” and the “Water Tower of Asia,” plays a crucial role in the region's environmental health. The surrounding countries' industrial activities have led to significant organic pollution, including OCPs. These pollutants are transported through atmospheric currents and eventually settle in Tibet's pristine environment. Monitoring OCP levels in Tibet is essential not only for protecting its unique ecosystem and public health but also for understanding the broader impacts of transboundary pollution. Previous studies showed seasonal variations in OCPs in the Tibet Namco Lake atmosphere, with some compounds falling below limits of detection (LOD).4,5 This complicates accurate monitoring and risk assessment, as low detection limits hinder the ability to quantify exposure levels, evaluate long-term environmental impacts, and develop effective regulatory policies for public health protection. In the same region, while most OCPs are below detection limits in lake water, they are fully detected in fish.5 This suggests that low environmental levels of OCPs do not necessarily equate to low risk. Even trace pollutants can bioaccumulate in organisms through biomagnification, posing significant ecological and health risks. Therefore, accurately estimating the statistics of datasets with non-detectable concentrations is crucial for understanding pollution characteristics and assessing risks. Such data, where some measurements fall below the detection limit, are commonly referred to as left-censored datasets. A diverse array of methods have been developed to cope with this problem caused by censored data, such as substitution ,6,7 the maximum likelihood method (MLE),6,8,9 regression on order statistics (ROS),6,10 and the fitting distribution curve method.11 Besides, Tobit models, originally developed in econometrics, are designed to address left-censored dependent variables by modeling an underlying latent variable through MLE.12,13 The Tobit model is particularly valuable when the goal is regression-based inference.14 However, the Tobit model requires strong assumptions—most notably, that the latent variable follows a normal distribution, which may not hold in environmental datasets where variables often follow gamma distributions.15 Additionally, Tobit models are primarily designed for estimating regression coefficients, rather than directly estimating summary statistics such as means and standard deviations, which are the focus of our study. Survival analysis, particularly the Cox proportional hazards model and Kaplan–Meier estimator, has long been applied to right-censoring and more complex censoring structures.6,16,17 The Cox proportional hazards model is a widely used method in survival analysis for modeling the relationship between the time until an event occurs and one or more predictor variables.18 Similar to Tobit models, the Cox proportional hazards model is primarily designed for regression analysis involving covariates, rather than for estimating summary statistics such as the mean or standard deviation. In addition, existing literature suggests that the Kaplan–Meier estimator may be less accurate than MLE or ROS when estimating distributional summary statistics in environmental datasets.6,19
Among these methods, substitution is widely used, but it remains a subject of considerable debate.20 In some scenarios, the substitution method performs exceptionally well. According to George et al.,21 “Threshold/2 substitution was the least biased” method for determining the concentration means and standard deviations (sd) of dibenzo[a,h]anthracene in stove chimney emissions. Furthermore, many studies have concluded that substitution is sometimes comparable to other methods in estimating means.7,16 When discussing which method is recommended for handling censored data, Hites commented “Clearly, as long as more than half of the data are above the LOD, using the median or geometric mean with the <LOD or missing values replaced by LOD/2 causes little bias”. Besides, to our knowledge, the U.S. CDC uses a method where concentrations below the LOD are assigned a value equal to the for calculating geometric means in studies on human exposure to environmental chemicals.22
Simultaneously, we must also acknowledge that the substitution method is not indefensible. For example, it tends to be less accurate than MLE and ROS in estimating the population mean.6 Moreover, the common choices for substitution are LOD/2 or , while the results strongly depend on the substituted values. The geometric diagram of the principle of substitution (Fig. S1) indicates that several factors might influence the accuracy of substitution: (1) sample size, which determines the smoothness and step length of the Empirical Cumulative Distribution Function (ECDF) curve; (2) the percent of observations below the LOD (<LOD%) and (3) distribution parameters, which affect the steepness and shape of the ECDF curve. LOD/2 substitution assumes that all censored values are equal to the midpoint, which ignores the true variability of values below the detection limit. Inspired by this question, we propose using a weighted substitution value that ensures consistent accuracy. Our proposed method is based on the idea that the estimation of summary statistics for left-censored values can be informed by the observed portion of the data, due to the inherent structure of the dataset. —much like reconstructing missing sections of a broken picture based on the remaining visible parts.
Generally, it is known that concentrations of pollutants are skewedly distributed.6,23,24 The vast majority of environmental scientists assume that pollutant distributions follow a lognormal pattern.6,7,24–26 Under such an assumption, it is suggested that the geometric mean (gm) should represent central tendency, as the median equals gm for lognormal data (Table S1‡).7,16,24,27 However, we found that more than half of OCPs in the atmosphere of Namco Lake followed a gamma distribution (Fig. S2‡). Notably, the median of gamma data does not align with the geometric mean (Table S1‡). This introduces a pertinent question: in cases where the gamma distribution better fits the data, should one use the arithmetic mean (am) or gm to represent data? Additionally, many studies overlook the crucial aspect of justifying whether censored data actually follow a lognormal or gamma distribution. The lack of this justified process can potentially lead to misleading conclusions. In this context, EnvStats is an R package that provides a wide range of tools for analyzing censored data, estimating distribution parameters, conducting goodness-of-fit tests, and generating visual diagnostics.23 It includes functions such as elnormCensored() and egammaCensored(), which implement MLE for left-censored data under lognormal and gamma distributions. However, MLE can exhibit greater bias than substitution methods when applied to small sample sizes, as it relies on asymptotic properties that may not hold in such cases.19,28 To facilitate the process and simplify the coding work, A user-friendly fitting tool would enable more informed decision-making and enhance the reliability of environmental studies.
In summary, the objectives of this study are to (i) explore the factors impacting the accuracy of LOD/2 substitution; (ii) develop a more accurate substitution method by weighting LOD/2. We approximate the weight through a function of the following form:
Weight ∼f(<LOD%, E(X|X > LOD), SD(X|X > LOD)); (iii) apply this improved method to deal with censored OCPs in the Tibet Namco Lake atmosphere.
![]() | (1) |
![]() | (2) |
Based on μL = μG and σL2 = σG2, once we set the parameters for the lognormal distribution, the parameters for gamma are also fixed. The true sample mean and sd for the complete data are known. After artificially censoring, we assessed the bias in the mean and sd estimation at each censoring level. We used the relative bias (RB) to compare the accuracy of different statistical methods:
RB(%) = (estimated − true)/true × 100 | (3) |
![]() | (4) |
![]() | (5) |
Let the estimated amω equal the true am (assuming 0% bias), formula (4) = formula (5), then we can easily derive the ω:
![]() | (6) |
Likewise, we could also get the weight for calculating gm, ω′:
![]() | (7) |
Obviously, weight computation needs the conditional mean of <LOD X, which is unrealistic for actual measured censored data, because no one knows the value. To address this issue, our approach involves constructing a range of weight variations using simulated data and leveraging the available information from censored data to predict the weights.
![]() | (8) |
The detailed mathematical derivation process of the conditional expectation for the lognormal and gamma distribution can be found in the ESI,‡ and we also summarized the related formulae in Table S1.‡
Utilizing the mathematical expressions for E(X|X < c) and E[gm|X < c] (Table S1‡), we can easily compute the ω (ω′) based on formula (8). For censored data, the known information includes the proportion below the detection limit (<LOD%) and the mean and standard deviation of the data above the LOD. So, we tried to explore the relationships between ω (ω′) and E(X|X ≥ c) and sd(X|X ≥ c) at different censored levels. And the expressions of E(X|X ≥ c) and sd(X|X ≥ c) for lognormal and gamma are derived in the ESI,‡ and in our previous paper.8 After extensive trials, the following nested models were found to be the best to fit
![]() | (9) |
And . The corresponding R code is listed in the ESI.‡
Kaplan–Meier and non-parametric quantile methods were excluded because of poor performance.6,19 Referring to George et al.,6 simulated data values were generated for 1000 data sets with n = 50 for each censoring level from two lognormal and two gamma distributions, representing two skewed levels and RBs are averaged.
Since both α-HCH and HCB have been completely detected, their statistics, such as the am, gm and sd, are known. Biases of different estimation methods were compared by artificially censoring the levels at 10%, 30%, and 50%. MLE and ROS estimates are computed based on the distribution test of α-HCH and HCB. ROS is only for censoring data where lognormal distribution is assumed.
;
; where
is the mean of logarithm x, which is detected.
R mle (…) and ros (…) functions were applied to implement the MLE and ROS estimation. lognormal and gamma distributions were considered candidates for best-fitting parametric distributions for GC-MS-determined OCP measurements. However, it is well known that testing the distributions of censored data, compared to completely observed data, is challenging.6,30 To simplify the process of fitting distributions of OCPs, we have developed a web app using the R Shiny package to censor environmental data. This app is available at https://lidonghuang.shinyapps.io/Censored_fitting_weighting_substitution/. In the app, the function gofTestCensored (…) from package ‘EnvStats’ is used to perform goodness-of-fit tests. These tools help visually and statistically assess how well the chosen distribution fits the observed data. The Shapiro–Wilk test, ECDF plot and QQ plot were used to evaluate and indicate the better distribution between the candidates, lognormal and gamma. To enable other scientists to conveniently utilize the weighted substitution method proposed in this study, we have integrated it into the web app as well. Instructions on how to use the app are provided on the home page of the app.
![]() | ||
Fig. 1 Concentrations of OCPs in the Tibet Namco lake atmosphere. The distribution fittings were performed using a web app developed in this paper (Fig. S3‡). |
![]() | ||
Fig. 3 Relationship between weight (ω) and ![]() |
Aiming to predict weights using available information , various conventional nonlinear models are used to fit the relationship between ω and
at fixed < LOD%. The power function and exponential functions as shown in formula (9) are found best to predict ω for lognormal and gamma distribution (Fig. S6 and S7‡). To embrace the <LOD% in the prediction, constants of formula (9) are linked with <LOD% using linear and quadratic equations for lognormal and gamma types respectively (Fig. S6 and S7‡). The full models (incorporating both
and <LOD%) are summarized in Table 1. The selected model shows a better goodness of fit to the weights with an R2 > 0.98.
Distribution | Type of mean | Models to get ω | R2 |
---|---|---|---|
a ω is weight; ![]() |
|||
Lognormal | am | ω = a·tb | 0.996 |
a = 10−3 [1482 − 9.164 c] | |||
b = – 10−3 [166 + 2.569 c] | |||
gm | ω = a·tb | 0.992 | |
a = 10−2[142.327 − 1.098 c] | |||
b = −10−3[202.6 + 3.906 c] | |||
Gamma | am | ω = a·eb·t | 0.999 |
a = 10−4 [1.215c2 −131.5 c + 23390] | |||
b = −10−4[1.652c2 + 5.432 c + 8939] | |||
gm | ω = a·eb·t | 0.983 | |
a = 10−4 [1.403 c2 − 123.6 c + 29240] | |||
b = –10−4 [ 2.253 c2 + 111.1 c + 14210] |
Fig. 5 and 6 show the RB of the weighted substitution in comparison with other sophisticated methods in simulated data.7,19,30 The weighted substitution yields surprisingly accurate results for mean (am and gm) estimation in both lognormal and gamma censored data (Fig. 5 and 6). For mean estimation, the weighted method is more accurate than the MLE and ROS methods. This improvement permits the choice to eliminate the bias. Regarding standard deviation (sd), its performance on lognormal data exhibits slightly more bias compared to the ROS method but is superior to the MLE method (Fig. 5). For gamma data, the bias of this method is less favorable compared to MLE (Fig. 5). Nevertheless, irrespective of the distribution type or censored level, the relative bias in sd estimation remains below 5%. This weighted method is primarily designed as an unbiased estimator for the mean and does not achieve unbiased estimation for the sd.31 However, the method has also significantly improved the accuracy of sd estimation, which can be attributed to the more accurate mean estimation. The sample variance is given by . For censored values, we can write
. Both terms in the variance formula depend on the estimated mean
. When
is more accurately estimated—as in our proposed method—the variance calculation more accurately reflects the true spread of the data, thereby stabilizing variance estimation. Fig. 5 and 6 exhibit the method accuracy responding to the moderately skewed distributions, as defined in a publication.6 Fig. S11 and S12‡ show the accuracy of the weighted method with highly skewed data. The RBs of weighted substitution are within 10%, except for estimating the gm of gamma data. Overall, considering that ROS is only applicable to lognormal distributions and MLE can produce significant bias in certain situations (Fig. 6), the weighted method proposed in this paper can greatly enrich the current toolbox for accurately analyzing censored data.
![]() | ||
Fig. 6 Comparisons of relative bias (RB) in estimates of the am, gm and sd from the simulation study. The simulated data were generated by gamma distribution with α = 3.5, β = 1.1, and n = 50. |
Our results demonstrated that determining the distribution type of censored dataset is essential for accurate statistical estimation. For example, the precision of MLE depends on correctly specifying the likelihood function, which is derived from the probability density function of the data distribution. Moreover, methods such as ROS are currently applicable only to lognormal distributions, necessitating prior knowledge of whether the data conform to this distribution. However, previous studies often assume that censored environmental data follow a lognormal distribution,7,16,33 a supposition that this study's findings suggest that it may not always be valid. Besides, study has indicated that applying the ROS estimator directly to gamma-distributed data results in highly biased estimates and large variances.33 In such cases, the ROS estimator proves to be not robust against deviations from the assumed distribution.
MLE, when correctly implemented using the CDF for censored data, has strong theoretical advantages, such as consistency and asymptotic efficiency. The consistency and asymptotic efficiency of MLE make it most reliable when sample sizes are sufficiently large. However, in real-world environmental applications, particularly when sample sizes are small, MLE estimates can have larger bias.28 The weighted substitution method generally performed as well as, or better than, MLE in our simulated scenarios. Therefore, our method serves as a practical and stable alternative, particularly in cases where MLE performance is limited due to small sample sizes. We also emphasize that our proposed method is not meant to replace MLE universally, but rather to act as a flexible and computationally efficient option in specific contexts (small datasets or data scarcity). The underlying mathematical principle may explain why MLE occasionally produces larger bias in standard deviation estimates, particularly under small sample conditions. As shown in formula (1), since sd depends exponentially on both μ log and σ log, small estimation errors in these parameters can lead to larger bias in the standard deviation due to error propagation. MLE of SD for lognormal is not unbiased.
In many real-world environmental datasets, the LOD may vary across samples due to differences in analytical methods or instruments or matrix effects. In our current study, we assumed a common LOD for simplicity and clarity in presenting the method and simulations. However, handling heterogeneous LODs is important for broader applicability. The proposed ωLOD/2 method can be extended to accommodate varying LODs by applying individualized weights based on each sample's specific detection limit and its corresponding censoring level. The ωLOD/2 framework is inherently flexible and can be extended to accommodate heterogeneous LODs. Specifically, each censored value can be substituted using an individually weighted value.
For censored data, procedures for implementing distribution fitting were mentioned.34 However, users must become proficient in the R programming language to effectively utilize these methods. Apparently, the app developed in this study incorporates methods for fitting censored data distributions, greatly enhancing the ease of use and hoping to improve the accuracy of statistical estimation.
OCP | Distribution | Methods | am | sd | gm | Median |
---|---|---|---|---|---|---|
a Median is not available because <LOD% is greater than 50%.b ROS is applied only for lognormal data. | ||||||
β-HCH | Gamma | ωLOD/2 | 0.45 | 0.41 | 0.27 | NAa |
MLE | 0.42 | 0.47 | 0.21 | |||
γ-HCH | Gamma | ωLOD/2 | 2.11 | 1.79 | 1.32 | 1.6 |
MLE | 2.11 | 1.93 | 1.32 | |||
o,p′-DDE | Lognormal | ωLOD/2 | 1.20 | 2.13 | 0.57 | 0.5 |
MLE | 1.17 | 2.22 | 0.55 | |||
ROSb | 1.18 | 2.13 | 0.52 | |||
p,p′-DDE | Gamma | ωLOD/2 | 1.97 | 2.40 | 0.85 | 0.9 |
MLE | 1.89 | 2.23 | 0.82 | |||
o,p′-DDT | Gamma | ωLOD/2 | 3.61 | 3.92 | 1.82 | 1.6 |
MLE | 3.61 | 3.90 | 1.82 | |||
p,p′-DDT | Gamma | ωLOD/2 | 1.19 | 1.18 | 0.62 | 0.65 |
MLE | 1.18 | 1.33 | 0.55 |
Although this study focuses on atmospheric OCPs, the proposed weighted substitution approach is not limited to a specific pollutant type. As long as the data exhibit left-censoring and the distributional assumption (e.g., lognormal or gamma) is reasonably appropriate, the method can be applied to a wide range of environmental contaminants, including heavy metals, PCBs, PAHs, and others, which may encounter the censored data processing challenge.35,36
Footnotes |
† The original R code for the web app is archived on GitHub (https://github.com/Lidong-Huang/Censored.DF/tree/main). |
‡ Electronic supplementary information (ESI) available: The mathematical processes for computing conditional population statistics. Fig. S1 and S8 illustrate the geometric principle of the LOD/2 substitution method. Fig. S2 and Table S2 present the parameter settings for data simulations. Fig. S3 shows the distribution fitting results for censored OCPs. Fig. S4, S5 and Tables S3–S6 demonstrate the influence of parameters on the accuracy of the LOD/2 substitution. Fig. S6 and S7 depict the nonlinear relationship between the known information of censored data and weights. Fig. S9 highlights the difference between the geometric mean and median for gamma data. Fig. S10–S12 compare the accuracy of different methods. Fig. S13 provides a screenshot of the web app. Table S1 lists the formulae used in this study, and Table S7 summarizes the weight (ω). See DOI: https://doi.org/10.1039/d5ea00005j |
This journal is © The Royal Society of Chemistry 2025 |