What is Robustness in Statistics? A Brief Intro to Robust Estimators

Ece Mutlu, Jun 25

Robust statistics are statistics that are resistant to outliers.

In other words, when your sample contains outliers, whether few or many, non-robust estimators give poor estimates of the population parameters.

For example, suppose your experimental data consist of the repeated measurements 10, 10.3, 10.2, 10.1 and 100, where the last value is obviously wrong due to a systematic error. The mean gives a location estimate of 28.12, since it is susceptible to even a single outlier, while the median is not affected by this one outlier and gives 10.2, since it is a robust estimator.
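The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Repeated measurements from the example; the last value is the gross error.
measurements = [10, 10.3, 10.2, 10.1, 100]

location_mean = mean(measurements)      # pulled far away by the single outlier
location_median = median(measurements)  # middle order statistic, unaffected

print(round(location_mean, 2), location_median)
```

The mean lands far from every plausible measurement, while the median stays at one of the four consistent values.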

Statistical inference and estimation methods are based on samples from a population, and they require certain assumptions about the underlying distributions, such as independent and identically distributed (i.i.d.) or normally distributed observations.

There is no guarantee that these assumptions will actually hold; however, their use may be justified by the continuity (stability) principle: a minor error in the model is likely to yield only small errors in the conclusions.

Unfortunately, this is not always true. The sensitivity of traditional methods to minor deviations from the assumed model sometimes necessitates the use of alternative, robust procedures.

Does robust mean outlier resistant only?

There is a subtle difference between robustness and outlier resistance.

The term outlier resistant implies that a statistic is minimally affected by the presence of outliers; for example, the median is one of the most commonly used outlier-resistant estimators.

Robustness, on the other hand, indicates that a procedure is insensitive to deviations from its underlying assumptions.

The concept of robustness may be clarified by emphasizing how it differs from non-parametric methods.

In non-parametric methods, no assumption is made about the underlying joint distribution.

Robust methods, in contrast, keep the assumptions but are designed to withstand deviations from them.

Models are based on assumptions that hold for the majority of observations; however, a fraction of the observations may follow a different pattern than the majority, or no pattern at all.

These observations are called outliers.

Such deviations may be handled with non-parametric methods, but robust statistics make it possible to use a parametrized family of distributions for the majority of the observations and an arbitrary distribution for the outliers.

In the presence of outliers, traditional methods are not efficient in determining process parameters because their bias and variance increase; therefore, outlier-resistant statistics are employed to remove outliers before the parameters are estimated.

How do we measure robustness?

The robustness of an estimator can be measured by its local stability, assessed via the influence function (IF), and by its global reliability, assessed via the breakdown point (BDP).

The asymptotic BDP is the fraction of contamination that is able to make the estimate useless, while the IF measures the effect on the estimate of a single observation whose value varies from -∞ to ∞.

A robust estimator is expected to have a small, bounded IF and a high BDP in order to resist outliers in a dataset.
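Both ideas can be probed empirically. The sketch below is not a formal computation of the IF or the BDP; it uses a finite-sample sensitivity curve (the scaled change in the estimate when one extra observation is added) as a stand-in for the IF, and replaces a growing number of observations with a huge value to see when the estimate breaks down. The helper names `sensitivity` and `contaminate` and the toy sample are assumptions made for illustration.

```python
from statistics import mean, median

def sensitivity(estimator, sample, x):
    """Scaled change in the estimate when one extra observation x is added."""
    n = len(sample)
    return (n + 1) * (estimator(sample + [x]) - estimator(sample))

def contaminate(sample, m, bad_value=1e9):
    """Replace the last m observations with an arbitrarily bad value."""
    return sample[: len(sample) - m] + [bad_value] * m

sample = [9.8, 9.9, 10.0, 10.1, 10.2, 10.3]

# Local stability: the mean's sensitivity grows without bound as x grows,
# while the median's sensitivity stays within a fixed range.
for x in (10.0, 100.0, 1e6):
    print(x, sensitivity(mean, sample, x), sensitivity(median, sample, x))

# Global reliability: count how many replaced points each estimator survives.
for m in range(4):
    dirty = contaminate(sample, m)
    print(m, mean(dirty), median(dirty))
```

In this toy run the mean is ruined by a single replaced observation, whereas the median only fails once half of the six observations have been replaced.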

Robust Dispersion Estimators

The sample standard deviation (S) is known to be the most efficient conventional scale estimator under normality.

If each observation in a sample is denoted by $x_i$ ($i = 1, 2, \ldots, n$), then the standard deviation estimate is

$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},$$

where $\bar{x}$ is the sample average.

The sample standard deviation has a 0% breakdown point, i.e. a single outlier has the potential to change the scale estimate without limit.

Correspondingly, the influence function of the sample standard deviation is unbounded.

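The median absolute deviation (MAD) is the median of the absolute deviations from the sample median, scaled so that it estimates the standard deviation consistently under normality:

$$\mathrm{MAD} = 1.4826 \cdot \operatorname{median}_i \left| x_i - \tilde{x} \right|,$$

where $\tilde{x}$ is the sample median.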

MAD has a 50% breakdown point and a bounded influence function, which make it much less sensitive to outliers.

It is one of the most widely used median-based estimators because of its good robustness properties and its simplicity.

Although MAD has low (37%) Gaussian efficiency, it may be highly efficient when the sampled population is contaminated.
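To see the difference between the two scale estimators, the following sketch compares the sample standard deviation and MAD on the measurements from the introductory example, with and without the outlier. The function name `mad` is hypothetical, and the factor 1.4826 is the usual consistency constant for normally distributed data, assumed here.

```python
from statistics import median, stdev

def mad(sample, scale=1.4826):
    """Scaled median absolute deviation from the sample median."""
    center = median(sample)
    return scale * median(abs(x - center) for x in sample)

clean = [10, 10.3, 10.2, 10.1]
with_outlier = clean + [100]

# The standard deviation explodes once the outlier is included,
# while MAD stays at the same order of magnitude.
print(round(stdev(clean), 4), round(stdev(with_outlier), 4))
print(round(mad(clean), 4), round(mad(with_outlier), 4))
```

On the clean data both estimators report a spread of roughly 0.13 to 0.15; after adding the single value 100, only MAD still does.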

Another scale estimator, the Qn estimator, was suggested by Rousseeuw & Croux (1993).

Qn also has a 50% breakdown point and a bounded influence function; however, discontinuities in the influence functions of MAD and Qn make these estimators less favorable in small samples.

The advantage of Qn over MAD is its high Gaussian efficiency (≈82%).
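A naive version of Qn can be sketched as follows: take all pairwise absolute differences and return a scaled low-order statistic of them. This O(n²) sketch assumes the asymptotic consistency constant 2.2219 and omits the finite-sample correction factors, so small-sample values will differ somewhat from those of a production implementation. The function name `qn_scale` is hypothetical.

```python
from itertools import combinations
from math import comb

def qn_scale(sample, constant=2.2219):
    """Naive Qn: scaled k-th smallest pairwise absolute difference."""
    n = len(sample)
    if n < 2:
        raise ValueError("Qn needs at least two observations")
    h = n // 2 + 1          # roughly half of the observations
    k = comb(h, 2)          # order statistic to pick among the differences
    diffs = sorted(abs(a - b) for a, b in combinations(sample, 2))
    return constant * diffs[k - 1]

data = [10, 10.3, 10.2, 10.1, 100]
qn = qn_scale(data)
print(round(qn, 4))
```

Because only differences among the closest roughly half of the points matter, the value 100 has no effect on the result here.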

Robust Location Estimators

The sample mean (also called the sample average) is the most widely used location estimator.

The sample mean is not resistant to disturbances because of its low breakdown point (0%) and unbounded influence function.

Especially when the sample size is small, the sample mean is affected excessively by outliers.

A well-known robust location estimator is the Harrell-Davis quantile estimator, which is a weighted average of all order statistics (Harrell & Davis, 1982).

One direct application of the Harrell-Davis estimator is estimating the median.

Compared to the sample median, the Harrell-Davis estimator is highly efficient, since it uses all of the order statistics rather than only the middle one or two.

It should, however, be noted that a single outlier, if sufficiently distant from the rest of the observations, may render this statistic useless.

Another location estimator is the Hodges-Lehmann estimator, which is the median of the Walsh averages (Hampel et al., 1986).

Because it is a median-based estimator, its influence function is bounded.

Despite its robustness, its influence function may coincide with that of the sample mean when the sample size is small.

Additionally, its breakdown point is ~29%, which is relatively low.
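The Hodges-Lehmann estimate is simple to compute directly from its definition. The sketch below assumes the convention in which the Walsh averages (x_i + x_j)/2 are taken over all pairs with i ≤ j, so each observation is also paired with itself; some references use i < j only. The function name `hodges_lehmann` is hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import median

def hodges_lehmann(sample):
    """Median of the Walsh averages (x_i + x_j) / 2 over all pairs i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(sample, 2)]
    return median(walsh)

data = [10, 10.3, 10.2, 10.1, 100]
hl = hodges_lehmann(data)
print(round(hl, 4))
```

With one outlier among five observations, only 5 of the 15 Walsh averages are contaminated, so the median of the averages stays with the bulk of the data.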

These are the most commonly used robust statistics.

However, our study showed that M-estimators provide more accurate estimates.

Since these methods are more complex to employ, you should study them in detail before using them.

In our study, the following two estimators were more powerful than the other robust statistics.

The M logistic scale estimator (MSLOG) is an M-estimator of dispersion with an auxiliary location estimate and the logistic psi-function $\psi(x) = (e^{x} - 1)/(e^{x} + 1)$.

A parameter can be adjusted to attain the desired breakdown point.

The influence function of MSLOG is smooth and bounded.

Therefore, the fully iterated M-estimator with the logistic psi-function is used to prevent sudden bumps in the influence function.

One robust location estimator that resists contamination is the Huber location M-estimator, suggested by Huber (1964).

The maximal breakdown point for a location estimator (50%) can be achieved with an M-estimator, which has a bounded influence function.
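One common way to compute a Huber-type location estimate is iteratively reweighted averaging with a fixed auxiliary scale. The sketch below is an illustration under stated assumptions, not the procedure used in our study: it starts from the median, keeps the scale fixed at the scaled MAD, and uses the tuning constant k = 1.345, a conventional choice. The function name `huber_location` and its parameters are hypothetical.

```python
from statistics import median

def huber_location(sample, k=1.345, tol=1e-9, max_iter=100):
    """Huber location M-estimate via iteratively reweighted averaging."""
    mu = median(sample)
    # Auxiliary scale: scaled MAD, held fixed during the iterations.
    scale = 1.4826 * median(abs(x - mu) for x in sample)
    if scale == 0:
        return mu  # more than half of the points coincide with the median
    for _ in range(max_iter):
        # Full weight inside [-k*scale, k*scale]; down-weighted beyond it.
        weights = [
            1.0 if x == mu else min(1.0, k * scale / abs(x - mu))
            for x in sample
        ]
        new_mu = sum(w * x for w, x in zip(weights, sample)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

data = [10, 10.3, 10.2, 10.1, 100]
huber = huber_location(data)
print(round(huber, 3))
```

Observations close to the current estimate are averaged as in the sample mean, while distant ones such as 100 receive a weight close to zero, which is what keeps the influence of any single observation bounded.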

You can find more details in: Mutlu, EÇ, Alakent, B. Revisiting reweighted robust standard deviation estimators for univariate Shewhart S‐charts. Qual Reliab Engng Int. 2019; 35: 995–1009. https://doi.org/10.1002/qre.2441
