Performance of Some Resistant Rules for Outlier Labeling
David C. Hoaglin, Boris Iglewicz and John W. Tukey
Journal of the American Statistical Association
Vol. 81, No. 396 (Dec., 1986), pp. 991-999
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association
DOI: 10.2307/2289073
Stable URL: http://www.jstor.org/stable/2289073
Page Count: 9
Abstract
The techniques of exploratory data analysis include a resistant rule for identifying possible outliers in univariate data. Using the lower and upper fourths, F_{L} and F_{U} (approximate quartiles), it labels as "outside" any observations below F_{L} − 1.5(F_{U} − F_{L}) or above F_{U} + 1.5(F_{U} − F_{L}). For example, in the ordered sample −5, −2, 0, 1, 8, F_{L} = −2 and F_{U} = 1, so any observation below −6.5 or above 5.5 is outside. Thus the rule labels 8 as outside. Some related rules also use cutoffs of the form F_{L} − k(F_{U} − F_{L}) and F_{U} + k(F_{U} − F_{L}). This approach avoids the need to specify the number of possible outliers in advance; as long as they are not too numerous, any outliers do not affect the location of the cutoffs. To describe the performance of these rules, we define the some-outside rate per sample as the probability that a sample will contain one or more outside observations. Its complement is the all-inside rate per sample. We also define the outside rate per observation as the average fraction of outside observations. For Gaussian data the population all-inside rate per sample (0) and the population outside rate per observation (.7%) substantially understate the corresponding small-sample values. Simulation studies using Gaussian samples with n between 5 and 300 yield detailed information on the resistant rules. The main resistant rule (k = 1.5) has an all-inside rate per sample between 67% and 86% for 5 ≤ n ≤ 20, and corresponding estimates of its outside rate per observation range from 8.6% to 1.7%. Both characteristics vary with n in ways that lead to good empirical approximations. Because of the way in which the fourths are defined, the sample sizes separate into four classes, according to whether dividing n by 4 leaves a remainder of 0, 1, 2, or 3.
Within these four classes the all-inside rate per sample shows a roughly linear decrease with n over the range 9 ≤ n ≤ 50, and the outside rate per observation decreases linearly in 1/n for n ≥ 9. A more theoretical approximation for the all-inside rate per sample works with the order statistics X_{(1)} ≤ ⋯ ≤ X_{(n)}. In this notation the fourths are X_{(f)} and X_{(n + 1 − f)} with f = (1/2)[(n + 3)/2], where [·] is the greatest-integer function. A sample has no observations outside whenever {X_{(f)} − X_{(1)}}/{X_{(n + 1 − f)} − X_{(f)}} ≤ k and {X_{(n)} − X_{(n + 1 − f)}}/{X_{(n + 1 − f)} − X_{(f)}} ≤ k. We first approximate the numerators and denominator in these ratios by constant multiples of chi-squared random variables with the same mean and variance. We then approximate the logarithm of each ratio by a Gaussian random variable, and we calculate the correlation between these variables from the fact that the ratios have the same denominator. Finally, a bivariate Gaussian probability calculation yields the approximate all-inside rate per sample. The error of the result relative to the simulation estimate is typically from 1% to 2% for 5 ≤ n ≤ 50. To provide an indication of how the two rates behave in alternative "null" situations, the simulation studies included samples from five heavier-tailed members of the family of h-distributions. For a given sample size, the all-inside rate per sample decreases as the tails become heavier, and the outside rate per observation increases.
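The labeling rule and the two rates defined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. The function names (`fourths`, `outside`, `simulate_rates`), the replication count, and the seed are hypothetical choices made for illustration. The sketch also assumes that when the depth f = (1/2)[(n + 3)/2] is a half-integer, each fourth is the average of the two adjacent order statistics, a convention the abstract does not spell out.

```python
import random


def fourths(sorted_xs):
    """Return (F_L, F_U) at depth f = (1/2) * floor((n + 3) / 2)."""
    n = len(sorted_xs)
    twice_f = (n + 3) // 2      # greatest-integer part of (n + 3) / 2
    lo = twice_f // 2           # floor(f)
    hi = (twice_f + 1) // 2     # ceil(f); equals lo when f is an integer
    # Assumption: for half-integer f, average the two adjacent order statistics.
    f_l = (sorted_xs[lo - 1] + sorted_xs[hi - 1]) / 2
    f_u = (sorted_xs[n - lo] + sorted_xs[n - hi]) / 2
    return f_l, f_u


def outside(xs, k=1.5):
    """Return (outside observations, (lower cutoff, upper cutoff))."""
    f_l, f_u = fourths(sorted(xs))
    spread = f_u - f_l
    low, high = f_l - k * spread, f_u + k * spread
    return [x for x in xs if x < low or x > high], (low, high)


def simulate_rates(n, k=1.5, reps=20000, seed=0):
    """Monte Carlo estimates for standard Gaussian samples of size n.

    Returns (all-inside rate per sample, outside rate per observation).
    """
    rng = random.Random(seed)
    all_inside = 0
    total_outside = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        flagged, _cutoffs = outside(sample, k)
        if not flagged:
            all_inside += 1
        total_outside += len(flagged)
    return all_inside / reps, total_outside / (reps * n)


# Worked example from the abstract: fourths -2 and 1, cutoffs -6.5 and 5.5.
flagged, cutoffs = outside([-5, -2, 0, 1, 8])
print(flagged, cutoffs)  # [8] (-6.5, 5.5)
```

Calling `simulate_rates(10)` returns rough estimates of the two rates for n = 10. They should be read against the published simulation results rather than as a replacement for them.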
Journal of the American Statistical Association © 1986 American Statistical Association